SEARCH AND RETRIEVAL OF MULTI-MODAL DATA ASSOCIATED WITH IMAGE-PARTS

Niloufar Pourian, S. Karthikeyan, B.S. Manjunath

Department of Electrical and Computer Engineering, University of California, Santa Barbara, USA
{npourian, karthikeyan, manj}@ece.ucsb.edu

ABSTRACT

We present a novel framework for querying multi-modal data from a heterogeneous database containing images, textual tags, and GPS coordinates. We construct a bi-layer graph structure using localized image-parts and associated GPS locations and textual tags from the database. The first-layer graphs capture similar data points from a single modality using a spectral clustering algorithm. The second layer of our multi-modal network allows one to integrate the relationships between clusters of different modalities. The proposed network model enables us to use flexible multi-modal queries on the database.

Index Terms— Search and Retrieval, Multi-Modal, Community Detection

1. INTRODUCTION

In recent years, the amount of information available in multiple modalities has increased rapidly. This is especially noticeable in digital image collections: every day, millions of images associated with noisy multi-modal labels are generated. In many scenarios in social media mining, medical image analysis, and surveillance, analysts benefit from representing queries using multiple information sources. For example, radiologists interpreting Traumatic Brain Injury (TBI) have access to MRI and CT imaging data as well as patient attributes such as age, location, and ethnicity. It would be of significant help if they could query such a heterogeneous database using a combination of these features, such as the extent of the injury for patients aged 16-20 with a specific ethnicity. As another example, in a surveillance scenario, an analyst might be interested in determining geo-localized instances of a specific car within a specific time interval. To address such “multi-modal” queries, new methods need to be developed that effectively integrate information derived from heterogeneous datasets.

Most multimedia retrieval techniques focus on the retrieval of single-modality data, such as images [1, 2, 3] and text documents [4, 5]. Some recent works have focused on cross-media retrieval, where the query examples and the query results have different modalities [6, 7, 8, 9]. The main difficulty in cross-media retrieval is to define a similarity measure among heterogeneous low-level features.

(This work was supported by ONR grant N00014-12-1-0503.)

To search and retrieve data from multiple modalities simultaneously, other approaches have been considered [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]. For instance, in [16] it is experimentally shown that multi-modal queries achieve higher retrieval accuracy than mono-modal ones. The work in [10] suggests using a combination of ontology browsing and keyword-based querying. The methods presented in [11, 14, 15, 16] use a similar approach and rely on the assumption that every document has an equal number of nearest neighbors in each of the modalities. However, such an assumption can degrade retrieval performance: a document containing “image+text” may have many nearest neighbors in the image modality, but far fewer relevant textual neighbors. Finally, all of these approaches are generally based on a global image understanding rather than a localized one. Therefore, they are unable to provide the most relevant multi-modal data for different image-parts.

This paper focuses on information retrieval associated with image-parts. Understanding the relevance between an image and data points of different modalities such as textual tags or GPS locations strongly depends on how well the image itself is understood. The proposed formulation introduces a way of understanding different image-parts and associates each image-part with its most relevant multi-modal information. Our approach utilizes the co-occurrence of data points within a modality and across multiple modalities using a graph structure. We note that the proposed work does not impose relevance between each document and an equal number of media objects from all modalities, as assumed by [11, 14, 15, 16]. In summary, our contributions are threefold:

• We introduce a retrieval system by constructing mono-modal graph structures and detecting communities of related data points from a single modality. In particular, a graphical image structure enables the system to learn different image-parts.

• By learning coherent communities among graph structures of different modalities, we create a network to integrate the relationships between communities.

• Relevant multi-modal data for a given image-part is retrieved using a hierarchical search.


2. APPROACH

In this paper, different types of metadata are treated as different modalities. For illustration purposes and without loss of generality, we focus on three modalities: 2D images, textual tags, and GPS locations. However, the proposed method can be generalized to an arbitrary number of modalities.

2.1. Graph Structure: Single Modality

We incorporate the similarity between the data points of a single modality $m$ by creating a graph structure $G^{(m)} = (V^{(m)}, A^{(m)})$, with $V^{(m)}$ representing the nodes of graph $G^{(m)}$ from modality $m$ and $A^{(m)}$ being the corresponding adjacency matrix. In what follows, we describe how each of these graph structures is created in detail.

Modality 1 - Image: The purpose of creating a graphical image structure is to group similar objects together. Following the work of [21, 22], we integrate the visual characteristics along with the spatial information of image-parts across all database images in a graph structure. To provide spatial information, we utilize a segmentation algorithm based on color and texture [23]. The graphical image structure $G^{(1)} = (V^{(1)}, A^{(1)})$ contains $\sum_{I=1}^{D} |v_I|$ nodes, with $|v_I|$ denoting the number of segmented regions of image $I$ and $D$ representing the total number of images in the database. Two nodes $i$ and $j$ are connected if they are spatially adjacent or if they are visually similar:

$$A^{(1)}_{ij} = \mathbb{I}(i \in F_j \ \text{or} \ i \in H_j), \quad \forall i, j \in V^{(1)} \qquad (1)$$

where $\mathbb{I}(x)$ is equal to 1 if $x$ holds true and zero otherwise. In addition, $F_j$ indicates the set of all nodes (segmented regions) that are visually similar to node $j$, and $H_j$ is the set of all nodes in the spatial neighborhood of node $j$. To represent each image, dense SIFT features [24] are extracted and then quantized into a visual codebook. To form a regional signature $h_i$ for a node $i$, features are mapped to the segmented regions they belong to, and a histogram over the codebook is created for each region. Two nodes $i$ and $j$ are considered visually similar if the normalized distance between their visual features is less than a threshold $\alpha_I \in \mathbb{R}^+$. The distance between two regional histograms is measured with the Hellinger distance, as it has been shown to be a good metric for comparing histograms in retrieval problems [25].
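As a concrete illustration of Eq. (1), the following Python sketch builds the image-layer adjacency matrix from per-region codebook histograms and a list of spatially adjacent region pairs. The variable names (`histograms`, `spatial_pairs`, `alpha_I`) and the exact normalization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def hellinger(h1, h2):
    """Hellinger distance between two L1-normalized histograms."""
    return np.sqrt(0.5 * np.sum((np.sqrt(h1) - np.sqrt(h2)) ** 2))

def image_graph_adjacency(histograms, spatial_pairs, alpha_I):
    """Adjacency of Eq. (1): regions are linked if visually similar
    (Hellinger distance below alpha_I) or spatially adjacent.

    histograms    : (N, K) array of per-region codebook histograms
    spatial_pairs : iterable of (i, j) index pairs of adjacent regions
    alpha_I       : visual-similarity threshold
    """
    H = histograms / histograms.sum(axis=1, keepdims=True)  # L1-normalize
    N = H.shape[0]
    A = np.zeros((N, N), dtype=int)
    for i in range(N):
        for j in range(i + 1, N):
            if hellinger(H[i], H[j]) < alpha_I:   # i is in F_j
                A[i, j] = A[j, i] = 1
    for i, j in spatial_pairs:                    # i is in H_j
        A[i, j] = A[j, i] = 1
    return A
```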

Modality 2 - Textual tags: We use graph $G^{(2)} = (V^{(2)}, A^{(2)})$ to encode the co-appearance of textual tags. Two nodes (textual tags) are connected if they are highly related. The adjacency matrix $A^{(2)}$ of this graph is defined by:

$$A^{(2)}_{ij} = \mathbb{I}(i \in T_j), \quad \forall i, j \in V^{(2)} \qquad (2)$$

with $T_j$ denoting the set of all nodes that have a co-appearance distance of less than $\alpha_t \in \mathbb{R}^+$ from node $j$. To measure the distance between two textual tags $i$ and $j$, binary vectors $t_i$ and $t_j$ are defined to capture the presence or absence of each tag across the database. The distance between the two binary tag vectors is computed using the cosine distance [5].

Fig. 1: Illustration of the graphical structures $G^{(1)}, G^{(2)}, G^{(3)}$ in the first layer and the multi-modal network representation in the second layer. Figure is best viewed in color.

Modality 3 - GPS coordinates: Graph $G^{(3)} = (V^{(3)}, A^{(3)})$ represents the GPS locations in the dataset. Two nodes are connected if and only if their locations are close. Here, $A^{(3)}$ is defined by:

$$A^{(3)}_{ij} = \mathbb{I}(i \in G_j), \quad \forall i, j \in V^{(3)} \qquad (3)$$

with $G_j$ indicating the set of all nodes (GPS locations) close to node $j$. Two GPS locations $i$ and $j$ are considered close if their Euclidean distance is less than a threshold $\alpha_g \in \mathbb{R}^+$.
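Eqs. (2) and (3) follow the same thresholded-distance pattern as Eq. (1); a minimal sketch of both adjacency matrices is given below. The names (`tag_presence`, `gps_coords`) and thresholds are hypothetical placeholders, assuming tags are described by binary presence vectors and GPS points by 2-D coordinates, as in the text.

```python
import numpy as np
from scipy.spatial.distance import cdist

def tag_graph_adjacency(tag_presence, alpha_t):
    """Eq. (2): connect tags whose binary presence vectors (one row per
    tag, one column per database item) are within cosine distance alpha_t."""
    D = cdist(tag_presence, tag_presence, metric="cosine")
    A = (D < alpha_t).astype(int)
    np.fill_diagonal(A, 0)
    return A

def gps_graph_adjacency(gps_coords, alpha_g):
    """Eq. (3): connect GPS points within Euclidean distance alpha_g."""
    D = cdist(gps_coords, gps_coords, metric="euclidean")
    A = (D < alpha_g).astype(int)
    np.fill_diagonal(A, 0)
    return A
```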

2.2. Community Detection: Single Modality

Our goal is to find groups of similar/related single-modality data in each of the graph representations. This is done by applying a graph partitioning algorithm to each of the graphical structures. Each of these groups resembles a bag of related single-modality data and is referred to as a “community”.

To perform the graph partitioning, we use a normalized-cut method as described in [26]. In this algorithm, the quality of the partition (cut) is measured by the density of the weighted links inside communities as compared to the weighted links between communities. The objective is to maximize the sum of the weighted links within each community while minimizing the sum of the weighted links across community pairs.

Let $M$ be the number of modalities. A graph $G^{(m)}$ with $m \in \{1, \ldots, M\}$ consists of $C^{(m)}$ detected communities. Each detected community $c^{(m)}(\nu)$ with $\nu \in \{1, \ldots, C^{(m)}\}$ is a collection of related nodes in that particular modality. For example, in the graphical image structure, each community contains all the pieces/image-parts of an object, and mapping these back to each segmented image would highlight/detect that particular object [21, 22]. A detected community for the textual-tag graph corresponds to highly related/correlated tags. Finally, a community for the GPS graph contains all the GPS locations that are relatively close to each other.
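The paper partitions each mono-modal graph with the normalized-cut method of [26]. As a stand-in for experimentation, scikit-learn's spectral clustering with a precomputed affinity accepts the adjacency matrices built above; the snippet below is a sketch under that substitution, not the authors' implementation, and `n_communities` is a hypothetical parameter.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def detect_communities(A, n_communities):
    """Partition a mono-modal graph (adjacency matrix A) into communities
    via spectral clustering, a relaxation of the normalized-cut objective.
    Returns one community label per node."""
    model = SpectralClustering(
        n_clusters=n_communities,
        affinity="precomputed",   # A is already an adjacency/affinity matrix
        assign_labels="discretize",
        random_state=0,
    )
    return model.fit_predict(A.astype(float))

# Example: labels = detect_communities(image_graph_adjacency(...), 20)
```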

Community Representation: Let $P^{(m)}(\nu)$ denote the set of all nodes in community $c^{(m)}(\nu)$, with $\nu \in \{1, \ldots, C^{(m)}\}$ and $m \in \{1, \ldots, M\}$. Each detected community in the graphical image structure is represented by the average of the histogram representations of the nodes it contains: $c^{(1)}(\nu) = \sum_{p \in P^{(1)}(\nu)} h_p / |P^{(1)}(\nu)|$. Community $c^{(2)}(\nu)$ in the textual-tag graph structure is represented by a binary vector of length $K_t$ with bin $k$ equal to 1 iff the corresponding tag exists within that cluster: $c^{(2)}_k(\nu) = \mathbb{I}\big(\sum_{p \in P^{(2)}(\nu)} t_{pk} > 0\big)$, $k = 1, \ldots, K_t$. Finally, each community $c^{(3)}(\nu)$ is represented by the average of all the nodes (GPS coordinates) it contains: $c^{(3)}(\nu) = \sum_{p \in P^{(3)}(\nu)} g_p / |P^{(3)}(\nu)|$.
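The three community representations reduce to simple aggregations over the nodes assigned to each community. A small numpy sketch is given below; the variable names are illustrative, and the binary tag representation assumes nodes of the tag graph index the tag vocabulary directly, which is one reading of the text.

```python
import numpy as np

def image_community_repr(histograms, labels, nu):
    """c^(1)(nu): mean regional histogram over community members."""
    return histograms[labels == nu].mean(axis=0)

def tag_community_repr(labels, nu, n_tags):
    """c^(2)(nu): length-K_t binary vector, 1 for tags in community nu."""
    rep = np.zeros(n_tags, dtype=int)
    rep[labels == nu] = 1
    return rep

def gps_community_repr(gps_coords, labels, nu):
    """c^(3)(nu): mean GPS coordinate of community members."""
    return gps_coords[labels == nu].mean(axis=0)
```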

2.3. Network Representation of Multi-Modal Data

A second-layer multi-modal network is constructed to integrate the relations among the multi-modal data. This network is denoted by $W = (V, A)$, with $V$ and $A$ representing the nodes and the adjacency matrix of $W$, respectively. Each node in this network is a detected community of a mono-modal graph structure, and is itself a collection of first-layer nodes (mono-modal data points). This can be summarized by $V = \bigcup_{m=1}^{M} \bigcup_{\nu=1}^{C^{(m)}} c^{(m)}(\nu)$, where $M$ is the total number of modalities.

In this network, node $c^{(m)}(\nu)$ from modality $m$ and node $c^{(m')}(\nu')$ from modality $m'$ are connected if the co-occurrence rate of the first-layer nodes falling into community $c^{(m)}(\nu)$ with the first-layer nodes falling into community $c^{(m')}(\nu')$ is larger than a threshold $\gamma$. Thus, the adjacency matrix $A$ of the second layer can be summarized as:

$$A_{c^{(m)}(\nu),\,c^{(m')}(\nu')} = \mathbb{I}\left( \frac{\sum_{i \in c^{(m)}(\nu),\, j \in c^{(m')}(\nu')} E^{m,m'}_{i,j}}{\sum_{i \in G^{(m)},\, j \in G^{(m')}} E^{m,m'}_{i,j}} \ \geq\ \gamma \right) \qquad (4)$$

We set $E^{m,m'}_{i,j} = 1$ if node $i$ from modality $m$ co-occurs with node $j$ from modality $m'$, and zero otherwise. To make a fair comparison among different modalities, we normalize each co-occurrence rate by the maximum co-occurrence rate that can occur between the two particular modalities.
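A sketch of the second-layer edge test in Eq. (4), assuming the raw co-occurrence counts $E^{m,m'}_{i,j}$ have already been accumulated into a matrix indexed by the first-layer nodes of the two modalities (the names and data layout are illustrative):

```python
import numpy as np

def second_layer_edge(E, members_m, members_mp, gamma):
    """Eq. (4): connect communities c^(m)(nu) and c^(m')(nu') if their
    normalized co-occurrence rate exceeds gamma.

    E          : (|G^(m)|, |G^(m')|) co-occurrence count matrix
    members_m  : node indices of community c^(m)(nu) in modality m
    members_mp : node indices of community c^(m')(nu') in modality m'
    """
    total = E.sum()
    if total == 0:
        return 0
    rate = E[np.ix_(members_m, members_mp)].sum() / total
    return int(rate >= gamma)
```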

2.4. Multi-modal Community Detection

To group highly similar/related nodes across different modalities (the second layer), we apply a graph partitioning algorithm to the multi-modal network. It is worth noting that the first layer of clustering combines similar single-modality data points, while the second layer of clustering integrates highly related data across different modalities. Our approach is favorable as it does not depend on bringing multi-modal data into a common feature space. In addition, the proposed approach integrates the information from each of the modalities independently of its other co-existing modalities. This implies that if a collection of images is accompanied by textual tags and GPS locations, and a similar image is accompanied by a similar textual tag but no GPS information, our model is expected to group such multi-modal information together and consequently associate the image with the corresponding GPS location.

In the remainder of this paper, let $R$ denote the total number of communities in the second-layer network and $r \in \{1, \ldots, R\}$ index the detected multi-modal communities.

3. QUERY RETRIEVAL

When a multi-modal query $q$ is presented to the system, the most relevant multi-modal data are retrieved and ranked based on three different factors: the strength of relevance of each multi-modal cluster to the query, the strength of relevance of each mono-modal cluster within that multi-modal cluster, and finally the degree of similarity between each data point in each mono-modal cluster and the query.

The level of relevance between each multi-modal cluster $r$ and a query $q$ is measured by computing the similarity between the query data and the mono-modal clusters in $r$. This is summarized by:

$$\Lambda_r = \frac{\sum_{m \in M_q} \sum_{u=1}^{U_r^{(m)}} \sum_{x=1}^{Q^{(m)}} e^{-d(c^{(m)}(u),\, q^{(m)}(x))}}{|M_q| \times U_r^{(m)} \times Q^{(m)}}, \qquad (5)$$

with $r \in \{1, \ldots, R\}$, $M_q$ representing the set of all modalities present in the query, $U_r^{(m)}$ denoting the total number of mono-modal clusters of modality $m$ within the multi-modal cluster $r$, and $Q^{(m)}$ representing the total number of data points from modality $m$ within the query. In addition, $q^{(m)}(x)$ denotes the $x$-th query data point from modality $m$ and $d(\cdot,\cdot)$ is the Euclidean distance function. Multi-modal clusters with higher similarity scores are given higher priority.
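A minimal sketch of Eq. (5), assuming the mono-modal community representations and the query data points of each modality are stored as numpy arrays, and reading the denominator as a per-modality normalization followed by division by $|M_q|$; the data layout and names are illustrative, not the authors' code.

```python
import numpy as np

def cluster_relevance(clusters_by_modality, query_by_modality):
    """Eq. (5): relevance Lambda_r of one multi-modal cluster r to query q.

    clusters_by_modality : dict modality -> (U_r^(m), dim_m) array of
                           mono-modal community representations in r
    query_by_modality    : dict modality -> (Q^(m), dim_m) array of
                           query data points (only modalities in M_q)
    """
    score = 0.0
    for m, Q in query_by_modality.items():
        C = clusters_by_modality.get(m)
        if C is None or len(C) == 0:
            continue
        # pairwise Euclidean distances between cluster reps and query points
        d = np.linalg.norm(C[:, None, :] - Q[None, :, :], axis=-1)
        score += np.exp(-d).sum() / (C.shape[0] * Q.shape[0])
    return score / len(query_by_modality)   # divide by |M_q|
```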

The strength of relevance of each mono-modal cluster within the multi-modal cluster $r$ is measured by taking into account the amount of similarity between the query and the mono-modal cluster, the similarity of the query with other mono-modal clusters in $r$, as well as the density of the links between that cluster and other clusters. The strength of relevance of each mono-modal cluster $c^{(m)}(\nu)$ is given by:

$$\Omega_{c^{(m)}(\nu)} = \frac{1}{Q^{(m)} U_r^{(m')} Q^{(m')}} \sum_{x=1}^{Q^{(m)}} e^{-d(q^{(m)}(x),\, c^{(m)}(\nu))} \times \sum_{u=1}^{U_r^{(m')}} \sum_{y=1}^{Q^{(m')}} f(c^{(m)}(\nu), c^{(m')}(u)) \cdot e^{-d(q^{(m')}(y),\, c^{(m')}(u))}, \qquad (6)$$

with $\nu \in \{1, \ldots, U_r^{(m)}\}$, and $m'$ representing all modalities except modality $m$. The function $f(c^{(m)}(\nu), c^{(m')}(u))$ denotes the normalized density of the links between clusters $c^{(m)}(\nu)$ and $c^{(m')}(u)$, and is defined by the total number of links between cluster $c^{(m)}(\nu)$ and cluster $c^{(m')}(u)$ divided by the maximum number of links between cluster $c^{(m)}(\nu)$ and any other cluster from modality $m'$. This is followed by normalizing each $\Omega_{c^{(m)}(\nu)}$ by the maximum possible value of $\Omega_{c^{(m)}(\nu)}$ for all existing modalities. Single-modality data points within a mono-modal cluster $c^{(m)}(\nu)$ with larger values of $\Omega_{c^{(m)}(\nu)}$ are ranked higher.
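A corresponding sketch of Eq. (6), restricted to a single other modality $m'$ for readability (several $m'$ would be handled analogously); the `f_mp` link densities and the data layout follow the illustrative conventions of the Eq. (5) sketch and are assumptions, not the authors' code.

```python
import numpy as np

def mono_cluster_strength(c_m_nu, query_m, clusters_mp, query_mp, f_mp):
    """Eq. (6) for one other modality m'.

    c_m_nu      : (dim_m,) representation of cluster c^(m)(nu)
    query_m     : (Q^(m), dim_m) query points of modality m
    clusters_mp : (U_r^(m'), dim_m') cluster representations of modality m'
    query_mp    : (Q^(m'), dim_m') query points of modality m'
    f_mp        : (U_r^(m'),) normalized link densities f(c^(m)(nu), c^(m')(u))
    """
    own = np.exp(-np.linalg.norm(query_m - c_m_nu, axis=1)).sum()
    d = np.linalg.norm(clusters_mp[:, None, :] - query_mp[None, :, :], axis=-1)
    cross = (f_mp[:, None] * np.exp(-d)).sum()
    denom = query_m.shape[0] * clusters_mp.shape[0] * query_mp.shape[0]
    return own * cross / denom
```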


Finally, the degree of similarity between a data point $i^{(m)}$ in mono-modal cluster $c^{(m)}(\nu)$ and the query $q$ is computed based on the similarity between $i^{(m)}$ and its matching-modality data in $q$, as well as the similarity between the query and the data points from modality $m'$ within the multi-modal cluster $r$ that are connected to $i^{(m)}$. This formulation accounts not only for the similarity of $q$ with point $i^{(m)}$, but also for the relevance to $q$ of the nodes connected to $i^{(m)}$. We give a higher priority to data points from modalities that exist within the query $q$. Data points from modalities not present in $q$ are ranked based on the strength of their connectivity with data points from modalities present within $q$. Such a ranking process is summarized in Figure 3.

4. EXPERIMENTAL RESULTS

Datasets: The experiments are conducted on two challenging datasets: VOC07 and MMFlickr. The VOC07 dataset is publicly available and contains two modalities: 9,963 2D images and 20 textual tags. We created the MMFlickr dataset to simulate real-world scenarios in which a data object from modality $m$ is not necessarily accompanied by data objects from all other modalities. This dataset contains three modalities (2D images, textual tags, and GPS information) and consists of 1,320 data objects across the three modalities. MMFlickr was created by crawling data from the Flickr photo-sharing website using the Flickr API [27].

Performance: In our experiments, a set of multi-modal queries is defined and the accuracy of the retrieval system is measured using the F-measure [28]. The performance is compared with the state-of-the-art multi-modal retrieval system presented in [16].
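For reference, the F-measure used for evaluation is the harmonic mean of precision and recall over the retrieved results; a minimal sketch (not tied to the authors' evaluation code) is:

```python
def f_measure(retrieved, relevant):
    """F1 score for one query: retrieved and relevant are sets of item ids."""
    hits = len(set(retrieved) & set(relevant))
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)

# Example: f_measure(top_k_results, ground_truth_items)
```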

Fig. 2: Comparison of the performance (accuracy vs. retrieval depth) of the proposed algorithm with the baseline approach. Results are reported for the VOC07 (left) and MMFlickr (right) databases.

As shown in Figure 2, our approach achieves a higher retrieval accuracy than the baseline method without imposing the unnecessary constraint that every document has an equal number of nearest neighbors for each of the modalities. It is also interesting to note that the accuracy gap between the proposed work and the baseline approach is slightly larger at higher depths for both the VOC07 and MMFlickr datasets. These results emphasize the applicability of the proposed method.

Fig. 3: Ranking process for a given query, as described in Section 3: from the network, find the most relevant multi-modal cluster; within it, find the most relevant mono-modal cluster; within that cluster, find the data points closest to the query; return the ranked results.

Figure 4 illustrates some sample queries along with their corresponding results from the proposed multi-modal retrieval system. As shown, our method is able to return multi-modal data related to the presented queries.

Fig. 4: Sample queries (image and textual-tag modalities) and their corresponding top-3 multi-modal retrieval results using the proposed system on the VOC07 database. Figure is best viewed in color.

We speculate that the best retrieval results are achieved for data points that co-occur often in the dataset. On the other hand, if a query set is a combination of multi-modal data points that rarely co-occur, the top retrieval results only reflect those data points in the query set that co-occur often. Figure 5 demonstrates such a query along with the top-3 retrieval results of the proposed approach. In this example, although the retrieval results are relevant to some query data points, they do not correspond well to all data points in the query set. In order to see retrieval results from multiple modalities, a larger number of top retrieved results needs to be examined.

Fig. 5: A sample query set whose data points rarely co-occur (two image-modality data points) and its corresponding top-3 multi-modal retrieval results using the proposed system. Figure is best viewed in color.

5. CONCLUSION

This paper presented a new model that provides a rich set of multi-modal data for localized image-parts. We introduced mono-modal graph structures and applied a spectral clustering algorithm to find related data points of a single modality. In addition, the relationships between clusters of different modalities were integrated using a multi-modal network. Experimental results on two challenging datasets demonstrated that the proposed approach compares favorably with current state-of-the-art methods. In the future, we plan to explore the applicability of the proposed approach to data compression.


6. REFERENCES

[1] E. Attalla and P. Siy, “Robust shape similarity retrieval based on contour segmentation polygonal multiresolution and elastic matching,” Pattern Recognition, 2005.

[2] M. Kokare, P. Biswas, and B. Chatterji, “Texture image retrieval using new rotated complex wavelet filters,” Trans. Systems, Man, and Cybernetics, 2005.

[3] M. Worring and T. Gevers, “Interactive retrieval of color images,” International Journal of Image and Graphics, 2001.

[4] D. C. Blair and M. E. Maron, “An evaluation of retrieval effectiveness for a full-text document-retrieval system,” Communications of the ACM, 1985.

[5] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information Processing & Management, 1988.

[6] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. Lanckriet, R. Levy, and N. Vasconcelos, “A new approach to cross-modal multimedia retrieval,” in Proc. International Conference on Multimedia, 2010.

[7] Y. Yang, D. Xu, F. Nie, J. Luo, and Y. Zhuang, “Ranking with local regression and global alignment for cross media retrieval,” in Proc. ACM International Conference on Multimedia, 2009.

[8] H. Zhang and J. Weng, “Measuring multi-modality similarities via subspace learning for cross-media retrieval,” in Advances in Multimedia Information Processing - PCM, 2006.

[9] Y. Jia, M. Salzmann, and T. Darrell, “Learning cross-modality similarity for multinomial data,” in ICCV, 2011.

[10] G. Hubert and J. Mothe, “An adaptable search engine for multimodal information retrieval,” Journal of the American Society for Information Science and Technology, 2009.

[11] M. Lazaridis, A. Axenopoulos, D. Rafailidis, and P. Daras, “Multimedia search and retrieval using multimodal annotation propagation and indexing techniques,” Signal Processing: Image Communication, 2013.

[12] S. Sabetghadam, M. Lupu, and A. Rauber, “Astera - a generic model for semantic multimodal information retrieval,” available at http://ceur-ws.org/.

[13] M. Bokhari and F. Hasan, “Multimodal information retrieval: Challenges and future trends,” available at http://researchgate.net/.

[14] D. Rafailidis, S. Manolopoulou, and P. Daras, “A unified framework for multimodal retrieval,” Pattern Recognition, 2013.

[15] A. Axenopoulos, P. Daras, S. Malassiotis, V. Croce, M. Lazzaro, J. Etzold, P. Grimm, A. Massari, A. Camurri, and T. Steiner, “I-SEARCH: a unified framework for multimodal search and retrieval,” in The Future Internet, 2012.

[16] P. Daras, S. Manolopoulou, and A. Axenopoulos, “Search and retrieval of rich media objects supporting multiple multimodal queries,” Trans. Multimedia, 2012.

[17] E. Chang, K. Goh, G. Sychay, and G. Wu, “CBSA: content-based soft annotation for multimodal image retrieval using Bayes point machines,” IEEE Transactions on Circuits and Systems for Video Technology, 2003.

[18] S. Romberg, R. Lienhart, and E. Horster, “Multimodal image retrieval,” International Journal of Multimedia Information Retrieval, 2012.

[19] R. Zhang, Z. Zhang, M. Li, W.-Y. Ma, and H.-J. Zhang, “A probabilistic semantic model for image annotation and multimodal image retrieval,” in ICCV, 2005.

[20] D.-T. Dang-Nguyen, G. Boato, A. Moschitti, and F. G. B. De Natale, “Supervised models for multimodal image retrieval based on visual, semantic and geographic information,” in International Workshop on Content-Based Multimedia Indexing (CBMI), 2012.

[21] N. Pourian and B. S. Manjunath, “PixNet: A localized feature representation for classification and visual search,” to appear in IEEE Transactions on Multimedia.

[22] N. Pourian and B. S. Manjunath, “Retrieval of images with objects of specific size, location and spatial configuration,” in Proc. IEEE Winter Conference on Applications of Computer Vision (WACV), 2015.

[23] Y. Deng and B. S. Manjunath, “Unsupervised segmentation of color-texture regions in images and video,” IEEE Trans. Pattern Analysis and Machine Intelligence, 2001.

[24] D. Lowe, “Object recognition from local scale-invariant features,” in ICCV, 1999.

[25] M. S. Nikulin, “Hellinger distance,” Springer, 2001.

[26] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Analysis and Machine Intelligence, 2000.

[27] “Flickr Open API,” http://www.flickr.com/services/api/.

[28] D. M. Powers, “Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation,” 2011.