A Multimodal Information Retrieval System: Mechanism and Interface
Jun Yang 1,2    Qing Li 1*    Yueting Zhuang 2
1 Department of Computer Engineering and Information Technology
City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, HKSAR, CHINA
{itjyang, itqli}@cityu.edu.hk
2 Department of Computer Science
Zhejiang University, Hangzhou, 310027, CHINA
ABSTRACT
General-purpose information retrieval is among the prevailing forms of daily human-computer interaction. The
proliferation of multimedia data in various types of modality creates two challenges regarding the convenience
and effectiveness of information retrieval: (1) multimodal information (i.e., a collection of text documents, still
images, video clips, etc.) is generally preferred by users over single-modal information as the retrieval result, and
(2) effective queries for multimodal information need to be specified using a synergy of multiple query methods (e.g.,
query keywords, sample images). In this paper, we describe a multimodal information retrieval system, Octopus,
to attack both of the challenges, which are inadequately addressed by existing information retrieval technologies.
Specifically, Octopus employs an aggressive search mechanism proposed for the retrieval of multimodal
information in response to multimodal user queries by investigating a broad range of knowledge. Moreover, a
cooperative user interface is designed to support multimodal query specification, presentation of multimodal
retrieval result, and most importantly, close collaboration with users and active learning from user behaviors for
improving the quality of retrieval results. Observations are made regarding the qualitative performance of
Octopus to demonstrate its feasibility and effectiveness.
* Qing Li is the contact author of this paper.
1. INTRODUCTION
Information retrieval, defined in a broad sense, refers to a user’s behavior of searching for relevant information of
various types to satisfy his requirement1. According to this definition, user behaviors in a variety of computer-
based applications and environments can be interpreted as generic information retrieval, such as looking for
information of interest using Web-based search engines like Google [8], consulting the help system of a software
product (e.g., Microsoft Word) about its usage, or browsing the directory of a digital image database (e.g., Corel
Image Gallery) for favorite photos. Undoubtedly, general information retrieval is among the prevailing forms of
human-computer interaction, and therefore its quality directly relates to the convenience and productivity of
users’ daily interactions with computers. From the perspective of interaction, the quality of information retrieval
consists of two major aspects: (1) effectiveness and ease of query specification, i.e., how conveniently and
accurately a user can express his information need; and (2) users’ satisfaction with the retrieval results, i.e., to
which extent the retrieved information satisfies the user’s need.
Recent years have witnessed a phenomenal growth of multimedia data in various types of modality, such as
image, video, audio, and graphic, in a number of multimedia repositories ranging from the Web to digital libraries.
The proliferation of such multimodal information poses two significant challenges that correspond to the two
aspects regarding the quality of information retrieval. On the one hand, rather than monotonous, single-modal
information, users would like to see their desired information presented in a more diversified and multimedia-rich
manner, preferably, as a collection of media objects in diverse modalities. As an example, a music fan who
searches for a song, say, Michael Jackson’s “Heal the World”, would like to receive not only the audio file of the
song, but also its lyrics, its music video, and perhaps news about the singer. On the other hand, to express users’
requests for multimodal information precisely and conveniently, any existing method of query specification alone
is unlikely to be sufficient; rather, different means of query specification are required by different users, for
1 The information retrieval defined here is more general than the traditional information retrieval (IR), which refers to the
retrieval of textual information only. In this paper, the term “information retrieval” refers to its generic definition.
different purposes, and in different situations. Again, consider the scenario of searching for favorite songs: a user
who knows the name of the desired song and/or the name of the singer can express his query using keywords,
whereas for those who only remember the rhythm of the song, query-by-humming method is more desirable. In
certain cases, a particular means of query specification is predominantly preferred to others. For example, to
search for images displaying a specific pattern that is too complex and subtle to be described by keywords, query
by sample image becomes almost the only choice of conducting an effective query.
In the past decade, retrieval techniques have been extensively studied for various types of information,
including traditional information retrieval (IR) technique for textual data [18], content-based retrieval (CBR)
technique for multimedia data such as image, video and audio [6] [7], and database retrieval technique for
structured data [5]. Unfortunately, contrary to what multimodal retrieval requires, each existing retrieval
technique can deal with only a single type of information, via queries specified in a particular manner. Therefore,
none of these single-modal retrieval techniques is applicable to the retrieval of multimodal information, which calls
for specialized multimodal retrieval approaches.
In this paper, we describe a multimodal information retrieval system, Octopus, as an attempt to tackle the
aforementioned two challenges raised by multimodal retrieval. Octopus exhibits a significant departure from
traditional retrieval systems in terms of both retrieval approach and user interface. Specifically, it employs an
aggressive search mechanism for retrieving multimodal information (i.e., a mixture of text documents, still
images, video clips, etc.) in response to multimodal user queries (e.g., keyword query, sample image) based on a
synergy of multifaceted knowledge. Furthermore, a cooperative user interface is designed that supports
specification of multimodal queries, presentation of multimodal retrieval result, and most importantly, close
collaboration with users and active learning from user behaviors for the purpose of retrieving better results.
Although the user interface of Octopus is designed for multimodal retrieval, it is not a multimodal
interface defined in the traditional sense, which supports multiple ways of communication between human and
computer, such as keyboard, speech, gesture, lip motion, facial expression, etc. Actually, the term “multimodal”
in this paper refers to multiple types of retrieved information and multiple means of query specification, whereas
for multimodal interfaces it refers to multiple channels of communication. Nevertheless, the interface of Octopus
addresses an aspect of human-computer interaction that is orthogonal to, and equally important as, the one
addressed by conventional multimodal interfaces. While the emphasis of those interfaces is the flexibility and naturalness of user
interaction, the design of our interface is essential to the convenience, effectiveness, and productivity of
multimodal information retrieval as a major form of user interaction.
The remainder of this paper is organized as follows. In Section 2, we present a brief review of the related
works on information retrieval, multimodal interface, and cooperative interface. We elaborate on the aggressive
search mechanism and the cooperative user interface of Octopus respectively in Section 3 and Section 4.
Observations on the qualitative performance of Octopus are presented in Section 5. Concluding remarks and
future works are given in Section 6.
2. RELATED WORKS
This section reviews the previous works related to Octopus from different perspectives, including information
retrieval technology, multimodal interface technology, and cooperative interface technology.
2.1 Information Retrieval
Previous works on generic information retrieval can be classified into two categories: retrieval techniques for
single-modal information and retrieval techniques involving the integration of multi-modality.
• Single-modal information retrieval: Retrieval techniques in this category can only deal with information of
a single modality. Among them, text-based information retrieval (IR) technique [18] is mainly used for searching
large text collections using queries expressed as keywords. IR technique has been extensively studied for decades
and successfully applied in many commercial systems such as Web-based search engines [8]. Content-based
retrieval (CBR) technique was invented by the Computer Vision community to retrieve multimedia objects based on
low-level features that can be automatically extracted from the objects. CBR techniques have been widely used
for image retrieval (e.g., QBIC system [6], VisualSEEK system [19]), video retrieval (e.g., VideoQ system [4]),
and audio retrieval [7]. The low-level features used in retrieval vary from one type of media to another, such as
color and texture features for images, and MFCCs (mel-frequency cepstral coefficients) for audio clips. In addition,
database-style query using declarative query language (such as SQL) [5] is widely used by the Database
community to retrieve structured data based on predefined attributes, such as author of images, title of documents.
Despite the differences in target data type, search condition, and query method, the aforementioned
retrieval techniques share two common attributes: (1) each technique is used for the retrieval of single-modal
information 2, and (2) each technique adopts a specific means for query specification. For example, keyword-
based query is common in IR techniques, and query-by-example (QBE) is widely used in CBR techniques. Such
single-modal retrieval techniques, when applied to the world of multimodal information, suffer from the problems
of ineffective query specification as well as insufficient and monotonous retrieval results.
• Integration of multi-modality: Research work in the past few years has investigated the integration of
multiple data types, mostly between text and image, in the context of information retrieval. For example, the iFind
[11] system proposes a unified framework under which descriptive keywords and low-level image features are
seamlessly combined for image retrieval, and the 2M2Net [23] system extends this framework to the retrieval of
other data types such as video and audio. WebSEEK system [20] extracts keywords from the surrounding text of
images and videos (in web pages) as their indexes to support keyword-based retrieval. The commonality of these
systems is that they use one type of data (usually text) as an index to retrieve another type of data (such as images).
Although they involve more than one type of data, none of them is a genuine multimodal retrieval system that
supports retrieval of various types of data at the same level.
More recently, the concepts of MediaNet 8 and the multimedia thesaurus (MMT) [21] have been proposed,
both of which seek to compose multimedia representation of semantic concepts—concepts described by diverse
media objects such as text descriptions, image illustrations, etc—and establish relationships among the concepts.
2 Note that this observation does not contradict the fact that certain retrieval techniques are applicable to multiple types of
data (e.g., CBR technique can be used for image and video), because none of them can deal with more than one type of data
at one time in a single retrieval system.
Although both of them support retrieval of multimodal data using the semantic concepts as the clue, the
construction of such multimedia concept representations is a completely manual process according to 8 and [21].
2.2 Multimodal Interface
The objective of multimodal interfaces is to achieve flexible and natural human-computer interaction by
supporting multiple channels of communication in addition to the predominantly used keyboard and mouse. The
modes of communication supported in multimodal interfaces exhibit a great variety. For example, the multimodal
interface developed at Carnegie Mellon University [22] employs a synergetic use of visual, acoustic, and textual
cues, such as speech, gesture, lip-movement, handwriting. Reilly et al. [16] propose an adaptive non-contact
gesture-based communication system. Otsuka et al. [14] use both facial expression and hand gesture recognition
techniques in designing man-machine interface. Fusion techniques for integrating the multimodal data generated
from various communication channels are addressed by Oviatt et al. [15] and Nigay et al. [12]. As mentioned in
Section 1, our work on multimodal retrieval is of equal significance with multimodal interfaces in terms of
achieving better human-computer interaction, and most likely, it requires a multimodal interface as an “enabling
technique” to support multimodal query methods (which is not within the scope of this paper).
2.3 Cooperative Interface
The concept of cooperative interface has been addressed, to a varying extent, in the research of database systems
[13], computer-supported cooperative work (CSCW) [2], information filtering and recommendation [17], etc.
Nevertheless, to the best of our knowledge, there is no agreed-upon definition of cooperative interface, while its
de facto understanding is an interface that is capable of assisting and collaborating with computer users to solve
certain problems. In the context of multimodal retrieval, a cooperative interface is particularly desired for two
reasons: (1) users can rarely compose precise queries; rather, a series of user-system interactions (such as
relevance feedback) is greatly needed and preferred for users to articulate their needs; (2) as generally recognized,
primitive features of multimedia objects (e.g., color and texture of images) are inadequate to capture their
semantic meanings, which are of essential importance in retrieval. In comparison, user behaviors usually imply a
wealth of knowledge on the semantics of multimedia objects, which serves as an indispensable complement to
their primitive features to achieve high retrieval performance. The cooperative interface of Octopus is designed to
meet both of the objectives.
3. AGGRESSIVE SEARCH MECHANISM FOR MULTIMODAL INFORMATION
The aggressiveness of the proposed search mechanism lies in its ability to embrace multimodal information,
explore multifaceted knowledge, and learn from user interactions to improve retrieval results. As
illustrated in Fig.1, the architecture of the search mechanism consists of three major components: (1) multifaceted
knowledge base (MKB), which models different levels of knowledge on the relevance among multimodal media
objects3; (2) link analysis based retrieval approach, which retrieves relevant multimodal information for user
queries by analyzing the relevance links among multimodal objects in the MKB; (3) learning-from-interaction
strategy, which solicits useful knowledge from user behaviors (e.g., browsing, relevance feedback) in user-system
interactions and improves the quality of retrieval results progressively based on the derived knowledge. The loop
constituted by the three components (as shown in Fig.1) reveals the “hill-climbing” nature of the mechanism: the
more interactions are performed by users, the better their information needs are satisfied by the retrieval results.
The details of the three components are described in the following subsections.
[Figure: the Multifaceted Knowledge Base, Link Analysis based Retrieval, Learning-from-Interactions, and Cooperative Interface components form a loop, exchanging the multimodal query, relevance links, user behaviors, derived knowledge, and desired information.]
Fig.1: Architecture of the aggressive search mechanism
3 By multimodal media objects (or multimodal objects for short), we refer to media objects of various types of modality, such
as text documents, images, video clips, audio, etc.
3.1 Multifaceted Knowledge Base
Exploring knowledge on multiple aspects is a natural requirement of multimodal retrieval. To search multimodal
objects in an integrated manner, not only are features for each type of media object needed, but knowledge
describing the correlation between media objects of diverse types also becomes indispensable. The
multifaceted knowledge base models three categories of knowledge that are available in most multimedia
repositories: (1) primitive features that are directly computable from multimodal objects, such as keywords of
textual documents, color and texture for images; (2) structural relationships among objects in the form of
hyperlinks, spatial adjacency, composition relationships (e.g., a video clip comprises several image frames), etc;
(3) user-perceived relevance among objects that is articulated by users or implied in their behaviors (e.g.,
relevance feedback).
[Figure: three superimposed layers — the perception layer, structure layer, and feature layer — over nodes representing text, audio, video, and image objects. The same legend is used for the other figures in this paper.]
Fig.2: Layered network model in the multifaceted knowledge base
To represent the three types of knowledge in a uniform manner, a layered network model is developed as
the “skeleton” of the MKB. As depicted in Fig.2, this model is constituted by a set of superimposed layers, with
each layer modeling the relevance relationships among multimodal objects defined from a certain perspective.
Specifically, each layer is represented by a network G = (N, L), where N is a finite set of nodes and L is a finite set
of links. Each node in N denotes a media object Oi ∈ O of any modality (typically text, image, video, or audio) in the
database, and each link in L is a triple of the form <Oi, Oj, r> denoting the relevance relationship between objects
Oi and Oj, with r as the link weight indicating the strength of the relevance.
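To make the model concrete, one layer of this network can be sketched in Python as a weighted graph of media objects; the class and method names below are our own illustrative choices, not identifiers from Octopus.

```python
from collections import defaultdict

class Layer:
    """One layer (feature, structure, or perception) of the MKB:
    a network G = (N, L) of media objects joined by weighted links."""
    def __init__(self):
        self.weight = {}                   # (Oi, Oj) -> r, the link weight
        self.neighbors = defaultdict(set)  # Oi -> set of linked objects

    def add_link(self, oi, oj, r):
        """Store the relevance link <Oi, Oj, r> (relevance is symmetric)."""
        self.weight[(oi, oj)] = r
        self.weight[(oj, oi)] = r
        self.neighbors[oi].add(oj)
        self.neighbors[oj].add(oi)

# E.g., a feature link between two images with similarity 0.8:
feature_layer = Layer()
feature_layer.add_link("image_1", "image_2", 0.8)
print(feature_layer.weight[("image_2", "image_1")])  # -> 0.8
```

Storing each of the three layers as a separate instance of such a graph mirrors the superimposed-layer structure of Fig.2, with all layers sharing the same node set.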
The layered network model shown in Fig.2 comprises three layers, namely feature layer, structure layer,
and perception layer, whose links (correspondingly, feature link, structure link, and perception link) represent the
relevance among the same set of multimodal objects derived from the three types of knowledge modeled by the
MKB. Specifically, a feature link represents the similarity between two media objects in terms of their primitive
features. Since primitive features are normally media-dependent, feature links exist only between objects
of the same modality. A structure link connects two objects that have structural relationship between them in the
form of hyperlink, spatial adjacency, composition relationship, etc. Perception links denote the user-perceived
relevance among media objects. Please note that the relative positions of the three layers reflect the priority of the
three types of knowledge in terms of reliability (confidence) in suggesting the relevance among multimodal
objects: specifically, user perception is more reliable than structural relationships, which are in turn more reliable
than primitive features of multimedia objects.
The construction of the MKB requires both “offline” and “online” processing. In particular, feature links
are built offline by calculating the similarity between any two media objects (usually of the same modality) based
on their primitive features. A feature link is created between two objects with similarity above a predefined
threshold, with the link weight set to the similarity score. Techniques for extracting primitive features and
computing similarity for various media types are readily available in existing research such as text-based IR [18]
and content-based retrieval [6][7]. To construct structure links, we inspect the environment where media objects
are collected and identify the structural relationships among them by offline processing. Consider a web page that
contains an embedded image and points to a video clip by a hyperlink in the page. Using our model, the image,
the video clip, and the textual part of the page are all regarded as media objects, and the image and video clip are
connected with the textual page by structure links. In comparison, perception links by definition are not initially
available before any user interactions are performed. Hence, perception links are gradually obtained during the
online interaction with users by the learning-from-interaction strategy to be described in Section 3.3.
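As an illustration of the offline feature-link construction described above, the following sketch computes pairwise similarities within one modality and links objects whose similarity exceeds a predefined threshold; the cosine measure and the 0.7 threshold are assumptions for the example (the paper only requires some similarity function and a threshold).

```python
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def build_feature_links(features, threshold=0.7):
    """features: {object_id: feature_vector}, all of the same modality.
    Returns the feature links <Oi, Oj, r> with r = similarity > threshold."""
    links = []
    for (oi, fi), (oj, fj) in combinations(features.items(), 2):
        r = cosine(fi, fj)
        if r > threshold:
            links.append((oi, oj, r))
    return links

imgs = {"img_a": [1.0, 0.0], "img_b": [0.9, 0.1], "img_c": [0.0, 1.0]}
# Only the img_a/img_b pair is similar enough to be linked:
print(build_feature_links(imgs))
```

In a real deployment the feature extraction itself (color, texture, keywords, MFCCs) would come from the IR and CBR techniques cited in the text; this sketch covers only the thresholded linking step.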
3.2 Link Analysis based Retrieval Approach
The MKB provides a wealth of relevance links4 among multimodal objects; by exploring these links,
multimodal queries can be processed effectively and efficiently. On the other hand, hyperlink analysis technique has been
extensively studied in the domain of Web-based information retrieval [3][9], and successfully applied in many
commercial applications, such as the Web search engine Google [8]. By a close examination, one can easily see
the analogy between the Web, which is a huge network of web pages and various multimedia documents
connected by hyperlinks, and our MKB, which is a network of multimodal objects connected by relevance links.
Therefore, we “borrow” the idea of hyperlink analysis and develop a suite of link analysis algorithms that work
closely to support multimodal retrieval.
[Figure: seed objects, candidate objects, and result objects connected by perception, structure, and feature links; the flow runs from query specification through candidate discovery and result distilling to result presentation, with relevance feedback looping back.]
Fig.3: Link analysis based retrieval approach
As illustrated in Fig.3, the whole retrieval process can be divided into four steps: (1) a user specifies his
query by designating or submitting some multimodal objects as “seed objects”; (2) a relatively large number of
“candidate objects” are discovered based on the seed objects via the relevance links in the MKB; (3) a set of
4 We refer to the perception links, structure links, and feature links in the MKB collectively as relevance links, since all three
types of links indicate the relevance among multimodal media objects.
relevant objects (to the query) are “distilled” from the candidate objects by analyzing the link structure among
them and presented to the user as the retrieval result; (4) if not satisfied with the current result, the user can ask
the system to recalculate the retrieval result based on the feedback he gives on the current result. The details
of the four steps are described below.
3.2.1 Multimodal query specification
In Octopus, a user query is composed of one or more multimodal seed objects (seeds for short) as the
representation or hint of the user’s information need. In this sense, the role of a seed object is similar to that of a
sample media object in the query-by-example (QBE) paradigm widely used in CBR approaches [6][19]. However,
our query specification method differs from QBE in two nontrivial aspects: (a) a query in Octopus can consist of
seed objects of arbitrary number and in arbitrary modality, while QBE normally allows only a single sample
object of a specific modality; (b) owing to the broad range of knowledge modeled in the MKB, based on which
even the relationship between loosely related objects can be established, seed objects are not required to be
precise representations of the desired information. Therefore, the query specification method in Octopus relieves
users from the burden of finding highly representative sample objects, which often frustrates the users of CBR
systems.
A seed object can be either chosen from existing media objects in the collection or submitted as a new
media object. In the latter case, the new media objects can be “created” by the user, such as inputting query
keywords or humming a melody to the microphone, or introduced from external data collections, such as selecting
a sample image stored in the local computer. The new media objects are immediately registered into the MKB
with their feature links and structure links (if any) with existing objects constructed. Therefore, in both cases a
user query is f inally represented as one or multiple media objects in the MKB. Here, the notion of multimodal
query has two levels of interpretation: a query composed by seed objects of multiple media types, and a query
specified using a synergy of multiple methods (e.g., inputting query keywords, submitting example image).
3.2.2 Candidate discovery
A basic premise of our retrieval approach is that the media objects desired by a user are connected with the seed
objects representing the user’s query through relevance links of various types in the MKB. Therefore, by
analyzing the structure of links around the seeds, we are able to figure out the media objects most relevant to the
user request. However, computations (e.g., traversing) involving links in a network model are quite expensive,
especially when the number of nodes and links in the network is large, as in the case of the MKB. To make
sophisticated link analysis computationally affordable, as a pre-processing step we cut down the “search space” to
a small locality around the seeds by discovering a set of candidate objects (candidates for short). Specifically, the
set of candidates C must meet the following two criteria:
1) C contains a majority of media objects that are potentially relevant to the user query.
2) C is small enough to afford the distillation algorithm subsequently applied on it (see Section 3.2.3).
Both criteria favor the media objects connected with the seeds through short paths (i.e., paths constituted by a
small number of links) in the MKB. On the one hand, short paths imply smaller “cumulative error” in reasoning
the relevance relationship between candidates and seeds; on the other hand, the number of media objects that can
be reached from seeds through short paths is likely to be small. Therefore, we specify the scope of candidates as
the media objects reached from seeds through paths with length (viz., number of links) below a predefined
threshold, including the seeds themselves. Since the relevance links in the MKB exist among multimodal objects,
the candidates discovered by these links are by default multimodal. However, even if the maximum path length is
specified, the number of candidates is still very unpredictable, especially when the seeds are heavily linked with
surrounding objects. To deal with such cases, we place a second threshold on the total number of candidates;
when it is exceeded, the excess portion of the discovered candidate objects is cut off.
The process of candidate discovery is summarized by the algorithm in Fig.4. This algorithm also takes
into account the priority of different types of links. Specifically, among the objects reached through paths of the
same length, those reached by perception links are added into candidates prior to those reached by structure links,
which are in turn prior to those reached by feature links. Thus, if the number of discovered objects exceeds the
maximum number of candidates, the objects reached by higher-priority links are considered as candidates in
preference to those reached by lower-priority links.
Discover (S, M, N)
  S: set of seed objects
  M: maximum length of path
  N: maximum number of candidates
  random(C, n): a routine returning n random objects from set C
  return: set of candidate objects

  Set C equal to S
  For i = 1 to M
    Set Cp as the set of objects reachable from any object in C via one perception link
    Set Cs as the set of objects reachable from any object in C via one structure link
    Set Cf as the set of objects reachable from any object in C via one feature link
    If |C ∪ Cp| < N
      If |C ∪ Cp ∪ Cs| < N
        If |C ∪ Cp ∪ Cs ∪ Cf| < N
          C = C ∪ Cp ∪ Cs ∪ Cf
        Else
          Return C ∪ Cp ∪ Cs ∪ random(Cf, N − |C ∪ Cp ∪ Cs|)
      Else
        Return C ∪ Cp ∪ random(Cs, N − |C ∪ Cp|)
    Else
      Return C ∪ random(Cp, N − |C|)
  Next
  Return C
Fig.4: Algorithm for discovering candidate objects
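The algorithm of Fig.4 can be rendered in Python roughly as follows; the per-layer adjacency maps and the helper names are illustrative assumptions, and the sketch presumes the seed set is smaller than the candidate limit.

```python
import random

def discover(seeds, max_len, max_cand, links):
    """links: dict mapping 'perception'/'structure'/'feature' to an
    adjacency map {object -> set of objects one link away}."""
    def one_hop(objs, layer):
        # all objects reachable from objs via one link of the given layer
        out = set()
        for o in objs:
            out |= links[layer].get(o, set())
        return out

    def fill(base, pool, n):
        # add n random objects from pool (excluding those already in base)
        return base | set(random.sample(sorted(pool - base), n))

    c = set(seeds)
    for _ in range(max_len):
        cp = one_hop(c, "perception")
        cs = one_hop(c, "structure")
        cf = one_hop(c, "feature")
        if len(c | cp) >= max_cand:            # perception links first
            return fill(c, cp, max_cand - len(c))
        if len(c | cp | cs) >= max_cand:       # then structure links
            return fill(c | cp, cs, max_cand - len(c | cp))
        if len(c | cp | cs | cf) >= max_cand:  # then feature links
            return fill(c | cp | cs, cf, max_cand - len(c | cp | cs))
        c = c | cp | cs | cf
    return c

links = {"perception": {"s": {"a"}}, "structure": {"s": {"b"}},
         "feature": {"s": {"c"}, "a": {"d"}}}
print(discover({"s"}, 1, 10, links))  # one hop from the seed s
```

Note how the nested tests implement the link-priority rule of the text: when the cap is hit, objects reached by perception links are admitted before those reached by structure links, which precede those reached by feature links.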
The candidate discovery algorithm is based on a rough heuristic that shorter paths (between candidates
and seeds) suggest stronger relevance. As a pre-processing step, it does not consider the type and weight of
relevance links, which are also important factors indicating the relevance among media objects. However, chances
are that an object reached by several high-priority (or highly weighted) links is more relevant than an object reached
by fewer low-priority (or low-weighted) links, and in such cases, relevant objects could be “missed” by this
algorithm. In fact, there exists a tradeoff between the two criteria of choosing candidates stated at the beginning of
this subsection, which is balanced by the maximum number of candidates allowed. The more candidates are
discovered, the more costly the subsequent distillation process is, but the less likely relevant candidates are
missed, and vice versa.
3.2.3 Result distillation
The distillation process aims to “distill” the most relevant media objects out of the potentially relevant candidates
by a sophisticated examination of the link structure among these candidates. For this purpose, the relevance of
each candidate (to the query) is calculated based on the notion of relevance propagation — the relevance can be
propagated from one object to another via the link(s) between them in the MKB. Specifically, we pump the initial
relevance score into the seed objects (which are within the candidates) and allow the relevance to flow through
links among the candidates, with the “amount” of the relevance flow adjusted according to the weight of links.
The propagation process ends when the relevance scores of all candidate objects converge.
Propagate (C, S, M)
  C: set of candidate objects
  S: set of seed objects
  M = [mij]: adjacency matrix of the sub-network corresponding to C at a specific layer of the MKB,
    where mij is equal to the weight of the link between Oi and Oj (mij = 0 if there is no link between them)
  R = [ri]: a vector with each element ri being the relevance score of object Oi in C
  return: converged relevance scores of candidate objects

  For each object Oi in C
    If Oi is in S, then ri = 1; Else ri = 0
  Next
  Normalize R such that Σ ri² = 1
  While R has not converged
    For each object Oi in C
      ri = Σ j=1,…,|C| (rj · mij)
    Next
    Normalize R such that Σ ri² = 1
  Return R

Fig.5: Algorithm for computing relevance scores of candidates at a single layer
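The propagation of Fig.5 is essentially a power iteration over the layer’s adjacency matrix; the sketch below follows that reading, with the epsilon-based convergence test as an implementation assumption.

```python
def propagate(candidates, seeds, m, eps=1e-9, max_iter=1000):
    """candidates: ordered list of object ids; seeds: subset of candidates;
    m[i][j]: weight of the link between objects i and j (0 if none).
    Returns the converged relevance-score vector R."""
    n = len(candidates)

    def normalize(v):
        norm = sum(x * x for x in v) ** 0.5  # so that sum(ri^2) = 1
        return [x / norm for x in v] if norm else v

    # initialize: 1 for seed objects, 0 otherwise, then normalize
    r = normalize([1.0 if o in seeds else 0.0 for o in candidates])
    for _ in range(max_iter):
        # each score becomes the weighted sum of its neighbors' scores
        new_r = normalize([sum(r[j] * m[i][j] for j in range(n))
                           for i in range(n)])
        if max(abs(a - b) for a, b in zip(new_r, r)) < eps:
            return new_r
        r = new_r
    return r

# Two equally inter-linked objects end with equal relevance scores:
scores = propagate(["s", "a"], {"s"}, [[0.5, 0.5], [0.5, 0.5]])
```

This is the same fixed-point computation underlying HITS-style link analysis [3][9], which is why the convergence results cited in the text carry over.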
As the MKB consists of three superimposed layers, each of which has different semantics and priority, the
relevance scores of candidates are calculated separately at each layer by relevance propagation and then merged to
give their overall relevance scores. The relevance propagation on each layer is performed within the sub-network
corresponding to the candidates at this layer. Suppose each candidate has a relevance score, which is initialized to
one if it is a seed object, or zero otherwise. In each round of propagation, the relevance score of a candidate is set
to the sum of the relevance flowed to it from its neighboring objects via the links, with each “ relevance flow”
being the product of the link weight and the relevance score of the neighbor in the previous round of propagation.
(The neighboring objects of an object are those that have a link with this object in the MKB.) Such propagation
proceeds until the relevance score of each candidate converges to a fixed value, which indicates its relevance to
the query according to the knowledge modeled by this layer. Please note that the convergence of relevance scores
(i.e., termination of the propagation process) has been proven in many previous works on link analysis [3][9].
The algorithm describing the propagation process at a single layer of the MKB is given in Fig.5.
After applying the propagation algorithm to all three layers of the MKB, we combine the candidates' relevance scores computed at each layer into their overall relevance scores. Since the three layers deal with different types of knowledge, designing a principled combination strategy is difficult. We adopt a simple strategy: the relevance scores a candidate receives at the three layers are combined linearly to give its overall relevance score, as described by the algorithm in Fig. 6. Intuitively, the weights of the layers are assigned to reflect their priorities (i.e., wP > wS > wF). The candidates are ranked by the overall relevance scores returned by this algorithm, and those with high scores are presented to the user as the retrieval result.
Distill(C, S)
    C: set of candidate objects
    S: set of seed objects
    R = [ri]: a vector with each element ri being the overall relevance score of object Oi in C
    wP, wS, wF: weights of the perception layer, structure layer, and feature layer of the MKB
    MP, MS, MF: adjacency matrices of the sub-networks corresponding to C at the perception,
        structure, and feature layers of the MKB
    return: converged overall relevance scores of candidates

    RP = Propagate(C, S, MP)
    RS = Propagate(C, S, MS)
    RF = Propagate(C, S, MF)
    R = wP·RP + wS·RS + wF·RF
    Return R

Fig. 6: Algorithm for computing overall relevance scores of candidates
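The linear combination at the heart of Fig. 6 can be sketched as below; the concrete weight values are illustrative assumptions of ours, since the paper only requires wP > wS > wF:

```python
def combine_layers(rP, rS, rF, wP=0.5, wS=0.3, wF=0.2):
    """Overall relevance as a weighted sum of per-layer scores (Fig. 6).

    rP, rS, rF -- relevance score vectors from the perception, structure,
                  and feature layers for the same list of candidates
    The weights reflect layer priority: wP > wS > wF (values illustrative).
    """
    return [wP * p + wS * s + wF * f for p, s, f in zip(rP, rS, rF)]
```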
3.2.4 Relevance feedback
Octopus allows a user who is not satisfied with the retrieval result to perform relevance feedback by labeling
some of the retrieved media objects as positive (feedback) examples or negative (feedback) examples. The
labeling can be done to any media object irrespective of its modality. The labeled feedback examples serve as
further indication and qualification of the user’s information need, towards which the current retrieval result is
refined by Octopus. The refinement is based on the intuitive notion that a media object desired by the user must be related to the positive examples and, meanwhile, far from the negative examples. As illustrated in Fig. 7, the refinement is executed in a parallel manner: first, the positive examples are treated as positive seeds, from which a set of positive candidates is discovered and their relevance scores to the positive seeds are calculated, using the algorithms described in Sections 3.2.2 and 3.2.3, respectively; in the meantime, negative candidates are discovered and their relevance to the negative examples is calculated; finally, the media objects constituting the refined result are obtained by integrating the positive and negative candidates along with their relevance scores, as described by the algorithm in Fig. 8.
[Figure: the positive and negative examples each discover a set of candidates, which are distilled separately and then merged into the refined results.]

Fig. 7: Refinement of retrieval results by relevance feedback
Feedback(P, N)
    P: objects as positive examples
    N: objects as negative examples
    R: a vector giving the overall relevance scores of the refined results
    return: overall relevance scores of the refined results

    CP = Discover(P)
    CN = Discover(N)
    RP = Distill(CP, P)
    RN = Distill(CN, N)
    For each object Oi in CP
        If Oi is in CN
            R(Oi) = RP(Oi) − RN(Oi)
        Else
            R(Oi) = RP(Oi)
    Next
    Return R

Fig. 8: Algorithm for calculating the refined retrieval result by relevance feedback
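The merging step of Fig. 8 subtracts an object's relevance to the negative examples from its relevance to the positive examples. A minimal sketch, assuming the per-candidate scores have already been computed by Discover/Distill and are held in dictionaries keyed by object id (a representation of our choosing):

```python
def merge_feedback(RP, RN):
    """Merge positive and negative candidate scores (Fig. 8).

    RP -- {object id: relevance to positive examples} for positive candidates
    RN -- {object id: relevance to negative examples} for negative candidates
    A desired object should score high in RP and low in RN, so the negative
    relevance is subtracted; objects appearing only in RN are dropped.
    """
    return {obj: score - RN.get(obj, 0.0) for obj, score in RP.items()}
```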
3.3 Learning-from-Interaction Strategy
In Octopus, user interactions around information retrieval not only draw on the knowledge stored in the MKB but also replenish it. Specifically, the learning-from-interaction strategy derives from various user behaviors the relevance relationships among multimodal objects and incorporates them into the MKB by updating the perception links. Two forms of "informative" user behavior are currently supported by this strategy. The first is relevance feedback explicitly conducted by users: when a user submits a query as seed objects and then labels some retrieved media objects as positive and negative examples, the relevance relation between the seeds and the positive examples, as well as the irrelevance relation between the seeds and the negative examples, is apparent. In addition, relevance relationships are implied when the user browses through the media objects retrieved for a query: if the user's attention is held by an object long enough to justify his interest in it, its relevance to the query (seed objects) can be deduced. Note that this heuristic does not hold the other way around, as objects that receive no user attention are not necessarily irrelevant to the query.
Learning(S, P, N)
    S: set of seed objects
    P: set of positive examples (explicit and implicit)
    N: set of negative examples
    M = [mij]: adjacency matrix of the network at the perception layer
    s, t: positive real numbers

    For each object Oi in S
        For each object Oj in P
            If there is no perception link between Oi and Oj
                Create a perception link between Oi and Oj with mij = s
            Else mij = mij + s
        Next
        For each object Ok in N
            If there is a perception link between Oi and Ok
                mik = mik − t
                If mik < 0, remove the link
        Next
    Next

Fig. 9: Algorithm for updating perception links by learning from user interactions
The derived relevance relationships are integrated into the MKB by creating perception links, or by increasing the link weight if the link already exists; irrelevance relationships are handled in the opposite manner. This is consistent with the semantics of perception links, since the derived relationships reflect the user perception that underlies these behaviors. The algorithm for updating perception links is given in Fig. 9. The relevant objects deduced from the user's browsing behavior are regarded as "implicit" positive examples and treated uniformly by this algorithm.
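The update rule of Fig. 9 can be sketched over a sparse representation of the perception layer, with links stored in a dictionary whose missing keys mean "no link" (the increments s and t are illustrative values; the paper leaves them as free parameters):

```python
def update_perception_links(M, seeds, positives, negatives, s=0.1, t=0.1):
    """Update perception-layer link weights from user feedback (Fig. 9).

    M -- {(i, j): weight} sparse adjacency; a missing key means no link
    For each seed: strengthen (or create) links to positive examples,
    weaken links to negative examples, and drop links whose weight
    falls below zero.
    """
    for i in seeds:
        for j in positives:
            M[(i, j)] = M.get((i, j), 0.0) + s     # create or strengthen
        for k in negatives:
            if (i, k) in M:                        # weaken only existing links
                M[(i, k)] -= t
                if M[(i, k)] < 0:
                    del M[(i, k)]                  # negative weight: remove link
    return M
```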
The contribution of this learning strategy lies mainly in long-term retrieval performance. As users interact with Octopus, the strategy solicits useful knowledge from the interactions of the whole population of users to enrich the underlying MKB. Based on such reliable and ever-growing knowledge, the link-analysis-based retrieval approach is expected to produce results that better satisfy user queries, thus improving the effectiveness of information retrieval.
4. COOPERATIVE USER INTERFACE
This section elaborates on the user interface of Octopus, with emphasis on its ability to accept multimodal queries, display multimodal results, and cooperate closely with users to improve retrieval performance. The interface is developed with Active Server Pages (ASP) technology; it communicates with the underlying search mechanism and the database through a COM interface, and creates dynamic web pages to display the multimodal information retrieved by the search mechanism in standard Web browsers. This architecture has the advantage of allowing multiple users to access Octopus through web browsers simultaneously and remotely.
A screenshot of the main interface is shown in Fig. 10. The interface integrates basic user functions including query specification, result presentation, and relevance feedback. The upper pane provides facilities for composing a multimodal query by inputting relevant keywords, submitting sample images or videos, or a combination of the two. The query is submitted by clicking the "Go Search" button, in response to which a set of relevant media objects is retrieved by the search mechanism and displayed in the lower pane. The user can designate feedback examples among the retrieved objects and click the "Feedback" button to have the system refine the retrieval result. The "Browse It" button is provided for navigation: clicking it displays a set of media objects randomly selected from the database. For all three types of operation, users can restrict the type(s) of the returned media objects to image, video, text, or a mixture of them, these being the three types of data supported by the current version of Octopus.
The lower pane of the interface displays the retrieved multimodal objects in a layout suited to the visual characteristics of their respective modalities. As shown in Fig. 10, each image appears as a thumbnail about a quarter of a row wide, each text document is represented by a title and a short abstract spanning a row, and each video clip is denoted by a set of representative frames extracted from the clip, which also occupy a complete row. By showing media objects in such "condensed" forms, the interface can accommodate a large number of objects in one page while the user can still grasp the main semantics of each object. Beneath each displayed object are two icons bearing the symbols "✓" and "×", as well as two hyperlinks labeled "Detail" and "Similar".
Fig. 10: The cooperative user interface of Octopus (the lower pane shows video, text, and image objects)
Among them, the "Similar" hyperlink initiates a separate query using the corresponding object as a seed object. (The functions of the other icons and hyperlinks are described below.) The media objects are ordered essentially by their estimated relevance to the query, with some local adjustment to arrange every four images into a single row. Moreover, as a benefit of using a Web browser, the interface is naturally scrollable when the media objects in one page cannot fit on the screen. If multiple pages are needed given the number of media objects to display and the page size, a simple navigation bar containing the hyperlinks "First", "Last", "Previous", and "Next" allows users to switch among pages.
As its most distinctive feature, the user interface, coupled with the underlying search mechanism, facilitates
smooth cooperation with users to help them find high-quality results. From the perspective of user intention, two
forms of cooperation are currently supported in Octopus:
• Explicit cooperation by feedback: Since an average user can hardly clarify his request in the initial query, relevance feedback provides him with a second chance to refine and qualify the query in order to obtain improved results. The feedback opinion is expressed by marking the retrieved objects that the user considers relevant to the query as positive examples, and marking the irrelevant objects as negative examples. The marking is performed with the "✓" and "×" icons beneath each displayed object, which, when clicked, are shown in a highlight color to denote a positive or negative example, respectively. After marking all the positive and negative examples, the user can ask the system to refine the current results by clicking the "Feedback" button at the top.
Upon accepting the user's feedback request, the system takes two actions in parallel. On the one hand, the relevance feedback algorithm described in Section 3.2.4 is executed to recalculate the retrieval result and present the refined result to the user. On the other hand, the user's feedback behavior is analyzed by the learning-from-interaction strategy (see Section 3.3) to update the perception links in the MKB. Further rounds of feedback can be conducted based on the results refined by previous rounds, until the user finds the desired information or gives up without finding anything relevant.
• Implicit cooperation by browsing: As mentioned in Section 3.3, relevance among multimodal objects can be deduced from a user's focus of attention as he browses through the retrieval results of a given query. To this end, certain interface facilities are needed to track the user's focus of attention from his browsing behavior. In our interface, the "Detail" hyperlink beneath each object serves this purpose: clicking it opens a separate window displaying the original version of the object (instead of its condensed version), i.e., a full-size image, a complete text document, etc. Naturally, the time during which this window stays open is regarded as the time that the user's attention is held by the corresponding object, and the length of this time is a good indicator of the degree of his interest. Intuitively, if the time a user focuses on a specific object exceeds a certain length, it is very likely that the user regards it as a relevant result to the query he has conducted. Based on this assumption, the learning strategy in Section 3.3 is executed to update the perception links between the "assumed" relevant objects and the seed objects that constitute the query.
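The implicit-cooperation heuristic above can be sketched as a small tracker that timestamps the opening and closing of the "Detail" window; the 5-second threshold is an assumption of ours, as the paper does not specify an attention threshold:

```python
class DwellTracker:
    """Implicit feedback from browsing: objects whose 'Detail' window stays
    open past a threshold are treated as implicit positive examples.
    """

    def __init__(self, threshold_seconds=5.0):   # threshold is an assumption
        self.threshold = threshold_seconds
        self._opened_at = {}                     # object id -> open timestamp
        self.implicit_positives = set()

    def on_detail_open(self, obj_id, timestamp):
        self._opened_at[obj_id] = timestamp

    def on_detail_close(self, obj_id, timestamp):
        opened = self._opened_at.pop(obj_id, None)
        if opened is not None and timestamp - opened >= self.threshold:
            self.implicit_positives.add(obj_id)
        # A quick close does NOT make the object a negative example:
        # lack of attention does not imply irrelevance (the heuristic
        # holds only in one direction).
```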
The merit of the user-system cooperation described above is twofold. On the one hand, cooperation in the form of relevance feedback relieves a user of the difficulty of formulating precise initial queries; rather, he can express and refine his request gradually through a succession of feedback operations. Furthermore, a user has the flexibility to express his query and feedback opinions using media objects of whatever modality is most convenient and effective in the particular situation. Both advantages help even nonprofessional users compose queries effectively and conveniently (which addresses the first challenge raised in Section 1). On the other hand, knowledge about the relevance among multimodal objects can be solicited from user behaviors in both explicit and implicit cooperation to enrich the MKB. Based on this growing knowledge, the quality of retrieval results, and thus the degree of user satisfaction, can be significantly improved (which addresses the second challenge in Section 1).
5. QUALITATIVE PERFORMANCE EVALUATION
To demonstrate the feasibility, usability, and effectiveness of Octopus, qualitative performance evaluations have been conducted with the help of human subjects; the observations are interpreted in this section.
We choose multimedia data collected from web pages as our test dataset for two reasons: (1) the Web is an essentially unlimited repository of multimedia data of various types, including images, animations, and videos; (2) enormous numbers of hyperlinks exist among web pages and multimedia documents, which can be modeled as structure links. In particular, we populate the database of Octopus using a crawler with multimedia data extracted from several multimedia-rich websites, including NBA.com, ESPN.com, and Hollywood.com. In the data collection process, each web page is modeled as a text object, and the multimedia documents embedded in it (such as images) are modeled as media objects of the corresponding modalities. Hyperlinks and composition relationships (e.g., an image embedded in a web page) are modeled as structure links. "Non-informative" images that are widespread in web pages, such as banners, icons, and logos, are filtered out by heuristic rules based on the frame size, file size, and file name of the images. Finally, our database contains around 6,000 text objects, 850 images, and only a few video clips. The scarcity of video clips is because the majority of video on web pages is provided as streaming video that can only be viewed online.
The evaluation is conducted with the help of five students from the Department of Computer Engineering and Information Technology at the City University of Hong Kong as subjects. The students, who have no expert knowledge of information retrieval, are taught the usage of Octopus. After becoming acquainted with the current dataset, they are asked to perform random queries using the various query specification methods provided by Octopus. User behaviors, including queries and the follow-up browsing and relevance feedback, as well as the users' impressions of how closely the retrieved results match their needs, are recorded, and the following observations are made based on these statistics:
• Relevance of retrieval results: The quality of the retrieval results, in terms of relevance to the user's request, depends to a large extent on the method used for query specification, particularly on whether the query is composed of existing media objects or new objects. The retrieval result is usually of high quality if the query is formed from seed objects selected from the database, regardless of their modalities. This can be attributed to the abundant structure links around the existing (seed) objects, which faithfully reflect the relevance relationships among media objects. However, if the query is constituted by new media objects (as seeds), the quality of the retrieval result varies greatly with the modality of the seeds. Specifically, queries by keywords usually generate reasonable results, whereas queries by sample images or videos often fail to retrieve any desired result. This observation can be explained as follows. Since the query is formed from new objects, which normally have no structure links with other objects, the feature links constructed from their primitive features become the major (often the only) type of knowledge used in calculating retrieval results from the seed objects. Primitive features of the various data types differ in descriptive power, and, as is generally believed, keywords of text are more powerful than the low-level features of images and videos. Consequently, queries expressed as keywords produce better results than queries composed of sample media objects.
• Relevance feedback: It is observed from the experiment that if the desired result does not appear on the first page, most subjects choose to turn to the next page(s) to look for relevant results instead of conducting relevance feedback to improve the current result. The reason, we surmise, lies in the users' unfamiliarity with the feedback function (which is not available in most commercial retrieval systems), as well as the difficulty of finding suitable objects to use as positive examples. The average number of feedback rounds conducted for a single query is below 0.3, and the average number of feedback examples designated in each round is slightly above 4. Most subjects agree that feedback achieves a noticeable improvement in the quality of retrieval results, especially when the previous result is of low quality.
• Efficiency: Efficiency is probably the most severe problem faced by Octopus. Running on a desktop with a 1.13 GHz Pentium III CPU and 256 MB of memory, Octopus spends over 6 seconds on average to process a single query retrieving 300 relevant objects, which is rather slow given the scale of the data collection. Moreover, the processing time varies significantly from one query to another, depending on how heavily the seed objects are linked with other objects. The processing time for relevance feedback is roughly double that of the initial query, consistent with the fact that feedback involves two parallel retrieval processes based on the positive and negative examples, respectively. The long latency also accounts for the subjects' reluctance to perform relevance feedback. This inefficiency is mainly due to the network structure of the MKB, whose storage and access entail great computational cost. Both query processing and relevance feedback take a long time because the underlying link analysis algorithms involve a great deal of link manipulation. More seriously, since the number of links can grow as fast as the square of the number of objects in the MKB, retrieval efficiency will degrade sharply as the size of the database grows.
In this preliminary experiment, we have not conducted any quantitative performance evaluation in terms of retrieval precision and recall. The difficulty mainly stems from the lack of a standard multimodal dataset that could serve as a benchmark for multimodal information retrieval. An ideal benchmark for Octopus would contain a set of sample queries expressed by a variety of means, as well as ground truth giving the correct result, as a set of multimodal objects, for each sample query. Such a multimedia data collection is not yet available and is very difficult to construct. Moreover, although the current test dataset of Octopus is collected from the Web, we cannot compare the retrieval results of Octopus with those returned by commercial web search engines, since (1) Octopus models only a portion of the Web, which is much smaller than the coverage of a typical search engine, and (2) the types of queries accepted by Octopus, as well as the types of information retrieved, are fundamentally different from those of any search engine.
6. CONCLUSION
In this paper, we have presented a multimodal information retrieval system, Octopus, to address the challenges of
information retrieval posed by the proliferation of multimodal data. Specifically, an aggressive search mechanism
has been described, which supports the retrieval of multimodal information in response to multimodal user
queries by exploring a broad range of knowledge. A cooperative interface has been introduced that is capable of accepting multimodal queries, presenting multimodal retrieval results, and collaborating closely with users to improve retrieval performance. Observations regarding the qualitative performance of Octopus have been discussed.
As discussed in Section 5, the efficiency drawback has become the bottleneck preventing the practical use of Octopus on real-world data collections, such as digital libraries and the Web, each of which contains thousands or even millions of multimedia objects. As the low efficiency of Octopus is mainly due to the computationally expensive network structure of the MKB, efficient data structures and access strategies for the MKB will be investigated in future work to improve the efficiency of Octopus. As another important direction, we plan to extend the current version of Octopus along several dimensions. On the one hand, more types of prevailing media formats, such as audio, GIF animation, and PowerPoint slides, will gradually be supported. On the other hand, popular functionalities for multimodal data collections other than retrieval, such as navigation, classification, and clustering, will be implemented.
7. ACKNOWLEDGEMENT
The authors would like to thank Dr. Liu Wenyin for a fruitful discussion on user-system cooperation and interaction, which helped shape the presentation of the cooperative interface in this paper. The research has been supported, in part, by the Research Grants Council of the HKSAR, China (Project no. CityU 1119/99E).
8. REFERENCES
[1] A. B. Benitez, J. R. Smith, and S. F. Chang, “MediaNet: a multimedia information network for knowledge
representation,” in Proc. the SPIE 2000 Conf. Internet Multimedia Management Systems, vol. 4210, 2000.
[2] R. Bentley, T. Rodden, P. Sawyer, and I. Sommerville, “An architecture for tailoring cooperative multi-user displays,” in
Proc. Conf. Computer-Supported Cooperative Work, pp. 187-194, 1992.
[3] S. Brin and L. Page, “The anatomy of a large-scale hypertextual Web search engine,” in Proc. 7th Int. World Wide Web
Conf., pp. 107-117, 1998.
[4] S. F. Chang, W. Chen, H. J. Meng, H. Sundaram, and D. Zhong, “VideoQ: an automated content based video search
system using visual cues,” in Proc. ACM Multimedia Conf., pp. 313-324, 1997.
[5] R. Elmasri and B. Navathe, Fundamentals of Database Systems, 2nd Edition. The Benjamin/Cummings Publishing
Company, Inc., Redwood City, CA, 1994.
[6] M. Flickner, H. Sawhney, W. Niblack, and J. Ashley, “Query by image and video content: The QBIC system,” IEEE
Computer, pp. 23-32, 1995.
[7] J. Foote, “An overview of audio information retrieval,” ACM Multimedia Systems, 7: 2-10, 1999.
[8] Google Search Engine. http://www.google.com.
[9] J. M. Kleinberg, “Authoritative sources in a hyperlinked environment,” in Proc. ACM-SIAM Symposium on Discrete
Algorithms, pp. 668-677, 1998.
[10] R. Lempel and A. Soffer, “PicASHOW: pictorial authority search by hyperlinks on the Web,” in Proc. 10th Int. World
Wide Web Conf., pp. 438-448, 2001.
[11] Y. Lu, C.H. Hu, X.Q. Zhu, H.J. Zhang, and Q. Yang, “A unified framework for semantics and feature based relevance
feedback in image retrieval systems,” in Proc. ACM Multimedia Conf., pp. 31- 38, 2000.
[12] L. Nigay and J. Coutaz, “A generic platform for addressing the multimodal challenge,” in Proc. Int. Conf. Computer-
Human Interaction, pp. 98-105, 1995.
[13] A. Motro, “FLEX: A tolerant and cooperative user interface to databases,” IEEE Trans. Knowledge and Data
Engineering, 2(2): 231-246, 1990.
[14] T. Otsuka, A. Utsumi and J. Ohya, “Advanced man-machine interfaces based on computer vision technologies –
Recognizing facial expressions and hand gestures,” in IEEE Int. Workshop on Robot and Human Communication, Vol.1,
pp.56-63, 1998.
[15] S. Oviatt, A. DeAngeli, and K. Kuhn, “Integration and synchronization of input modes during multimodal human-
computer interaction,” in Proc. Int. Conf. Computer-Human Interaction, vol. 1, pp. 415-422, 1997.
[16] R.B. Reilly, M.J. O’Malley, “Adaptive gesture based interfaces for augmentative communication” , IEEE Trans.
Rehabilitation Engineering, 7(2): 174-183, 1999.
[17] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, “GroupLens: an open architecture for collaborative
filtering of netnews,” in Proc .of ACM Conf. on Computer-Supported Cooperative Work, pp. 175-186, 1994.
[18] G. Salton and M.J. McGill, “Introduction to modern information retrieval,” McGraw-Hill Book Company, 1983.
[19] J. R. Smith and S. F. Chang, “VisualSEEk: a fully automated content-based image query system,” in Proc. ACM
Multimedia Conf., pp. 87-98, 1996.
[20] J. R. Smith and S. F. Chang, “Visually searching the Web for content,” IEEE Multimedia, 4(3): 12-20, 1997.
[21] R. Tansley, “The Multimedia Thesaurus: An aid for multimedia information retrieval and navigation” , Master Thesis,
Computer Science, University of Southampton, UK, 1998.
[22] A. Waibel, M.T. Vo, P. Duchnowski, and S. Manke, “Multimodal interfaces,” Artificial Intelligence Review, Special
Volume on Integration of Natural Language and Vision Processing, 10(3-4): 299-319, 1995.
[23] J. Yang, Y. T. Zhuang, and Q. Li, “Search for multi-modality data in digital libraries” , in Proc. 2nd IEEE Pacific-Rim
Conf. on Multimedia, pp. 482-489, Oct. 2001.