A Multimodal Information Retrieval System: Mechanism and Interface
Jun Yang 1,2    Qing Li 1*    Yueting Zhuang 2
1 Department of Computer Engineering and Information Technology
City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, HKSAR, CHINA
{itjyang, itqli}@cityu.edu.hk
2 Department of Computer Science
Zhejiang University, Hangzhou, 310027, CHINA
ABSTRACT
General-purpose information retrieval is among the prevailing forms of daily human-computer interaction. The
proliferation of multimedia data in various types of modality creates two challenges regarding the convenience
and effectiveness of information retrieval: (1) multimodal information (i.e., a collection of text documents, still
images, video clips, etc.) is generally preferred by users over single-modal information as the retrieval result, and
(2) effective queries for multimodal information need to be specified using a synergy of multiple query methods (e.g.,
query keywords, sample images). In this paper, we describe a multimodal information retrieval system, Octopus,
to attack both of the challenges, which are inadequately addressed by existing information retrieval technologies.
Specifically, Octopus employs an aggressive search mechanism proposed for the retrieval of multimodal
information in response to multimodal user queries by investigating a broad range of knowledge. Moreover, a
cooperative user interface is designed to support multimodal query specification, presentation of multimodal
retrieval result, and most importantly, close collaboration with users and active learning from user behaviors for
improving the quality of retrieval results. Observations are made regarding the qualitative performance of
Octopus to demonstrate its feasibility and effectiveness.
* Qing Li is the contact author of this paper.
1. INTRODUCTION
Information retrieval, defined in a broad sense, refers to a user’s behavior of searching for relevant information of
various types to satisfy his requirement1. According to this definition, user behaviors in a variety of computer-
based applications and environments can be interpreted as generic information retrieval, such as looking for
information of interest using Web-based search engines like Google [8], consulting the help system of a software
product (e.g., Microsoft Word) about its usage, or browsing the directory of a digital image database (e.g., Corel
Image Gallery) for favorite photos. Undoubtedly, general information retrieval is among the prevailing forms of
human-computer interaction, and therefore its quality directly relates to the convenience and productivity of
users’ daily interactions with computers. From the perspective of interaction, the quality of information retrieval
consists of two major aspects: (1) effectiveness and ease of query specification, i.e., how conveniently and
accurately a user can express his information need; and (2) users’ satisfaction with the retrieval results, i.e., to
which extent the retrieved information satisfies the user’s need.
Recent years have witnessed a phenomenal growth of multimedia data in various types of modality, such as
image, video, audio, and graphic, in a number of multimedia repositories ranging from the Web to digital libraries.
The proliferation of such multimodal information poses two significant challenges that correspond to the two
aspects regarding the quality of information retrieval. On the one hand, rather than monotonous, single-modal
information, users would like to see their desired information presented in a more diversified and multimedia-rich
manner, preferably, as a collection of media objects in diverse modalities. As an example, a music fan who
searches for a song, say, Michael Jackson’s “Heal the World”, would like to receive not only the audio file of the
song, but also its lyrics, its music video, and perhaps news about the singer. On the other hand, to express users’
requests for multimodal information precisely and conveniently, any existing method of query specification alone
is unlikely to be sufficient; rather, different means of query specification are required by different users, for
1 The information retrieval defined here is more general than the traditional information retrieval (IR), which refers to the
retrieval of textual information only. In this paper, the term “information retrieval” refers to its generic definition.
different purposes, and in different situations. Again, consider the scenario of searching for favorite songs: a user
who knows the name of the desired song and/or the name of the singer can express his query using keywords,
whereas for those who only remember the rhythm of the song, query-by-humming method is more desirable. In
certain cases, a particular means of query specification is predominantly preferred to others. For example, to
search for images displaying a specific pattern that is too complex and subtle to be described by keywords, query
by sample image becomes almost the only choice of conducting an effective query.
In the past decade, retrieval techniques have been extensively studied for various types of information,
including traditional information retrieval (IR) technique for textual data [18], content-based retrieval (CBR)
technique for multimedia data such as image, video and audio [6] [7], and database retrieval technique for
structured data [5]. Unfortunately, contrary to what multimodal retrieval requires, each existing retrieval
technique can deal with only a single type of information, via queries specified in a particular manner. Therefore,
none of these single-modal retrieval techniques is applicable to the retrieval of multimodal information, which calls
for specialized multimodal retrieval approaches.
In this paper, we describe a multimodal information retrieval system, Octopus, as an attempt to tackle the
aforementioned two challenges raised by multimodal retrieval. Octopus exhibits a significant departure from
traditional retrieval systems in terms of both retrieval approach and user interface. Specifically, it employs an
aggressive search mechanism for retrieving multimodal information (i.e., a mixture of text documents, still
images, video clips, etc.) in response to multimodal user queries (e.g., keyword query, sample image) based on a
synergy of multifaceted knowledge. Furthermore, a cooperative user interface is designed that supports
specification of multimodal queries, presentation of multimodal retrieval result, and most importantly, close
collaboration with users and active learning from user behaviors for the purpose of retrieving better results.
Although the user interface of Octopus is designed for multimodal retrieval, it is not a multimodal
interface defined in the traditional sense, which supports multiple ways of communication between human and
computer, such as keyboard, speech, gesture, lip motion, facial expression, etc. Actually, the term “multimodal”
in this paper refers to multiple types of retrieved information and multiple means of query specification, whereas
for multimodal interfaces it refers to multiple channels of communication. Nevertheless, the interface of Octopus
addresses an aspect of human-computer interaction that is orthogonal to, and equally important as, the one
addressed by conventional multimodal interfaces. While the emphasis of those interfaces is the flexibility and naturalness of user
interaction, the design of our interface is essential to the convenience, effectiveness, and productivity of
multimodal information retrieval as a major form of user interaction.
The remainder of this paper is organized as follows. In Section 2, we present a brief review of the related
works on information retrieval, multimodal interface, and cooperative interface. We elaborate on the aggressive
search mechanism and the cooperative user interface of Octopus respectively in Section 3 and Section 4.
Observations on the qualitative performance of Octopus are presented in Section 5. Concluding remarks and
future works are given in Section 6.
2. RELATED WORKS
This section reviews the previous works related to Octopus from different perspectives, including information
retrieval technology, multimodal interface technology, and cooperative interface technology.
2.1 Information Retrieval
Previous works on generic information retrieval can be classified into two categories: retrieval techniques for
single-modal information and retrieval techniques involving the integration of multi-modality.
• Single-modal information retrieval: Retrieval techniques in this category can only deal with information of
a single modality. Among them, text-based information retrieval (IR) technique [18] is mainly used for searching
large text collections using queries expressed as keywords. IR technique has been extensively studied for decades
and successfully applied in many commercial systems such as Web-based search engines [8]. Content-based
retrieval (CBR) technique was invented by the Computer Vision community to retrieve multimedia objects based on
low-level features that can be automatically extracted from the objects. CBR techniques have been widely used
for image retrieval (e.g., QBIC system [6], VisualSEEK system [19]), video retrieval (e.g., VideoQ system [4]),
and audio retrieval [7]. The low-level features used in retrieval vary from one type of media to another, such as
color and texture features for images, and MFCCs (mel-frequency cepstral coefficients) for audio clips. In addition,
database-style query using declarative query language (such as SQL) [5] is widely used by the Database
community to retrieve structured data based on predefined attributes, such as author of images, title of documents.
Despite the differences in target data type, search condition, and query method, the aforementioned
retrieval techniques share two common attributes: (1) each technique is used for the retrieval of single-modal
information 2, and (2) each technique adopts a specific means for query specification. For example, keyword-
based query is common in IR techniques, and query-by-example (QBE) is widely used in CBR techniques. Such
single-modal retrieval techniques, when applied to the world of multimodal information, suffer from the problems
of ineffective query specification as well as insufficient and monotonous retrieval results.
• Integration of multi-modality: Research work in the past few years has investigated the integration of
multiple data types, mostly between text and image, in the context of information retrieval. For example, the iFind
[11] system proposes a unified framework under which descriptive keywords and low-level image features are
seamlessly combined for image retrieval, and the 2M2Net [23] system extends this framework to the retrieval of
other data types such as video and audio. WebSEEK system [20] extracts keywords from the surrounding text of
images and videos (in web pages) as their indexes to support keyword-based retrieval. The commonality of these
systems is that they use one type of data (usually text) as an index to retrieve another type of data (such as images).
Although they involve more than one type of data, none of them is a genuine multimodal retrieval system that
supports retrieval of various types of data at the same level.
More recently, the concepts of MediaNet 8 and the multimedia thesaurus (MMT) [21] have been proposed,
both of which seek to compose multimedia representation of semantic concepts—concepts described by diverse
media objects such as text descriptions, image illustrations, etc—and establish relationships among the concepts.
2 Note that this observation does not contradict the fact that certain retrieval techniques are applicable to multiple types of
data (e.g., CBR technique can be used for image and video), because none of them can deal with more than one type of data
at one time in a single retrieval system.
Although both of them support retrieval of multimodal data using the semantic concepts as the clue, the
construction of such multimedia concept representations is a completely manual process according to 8 and [21].
2.2 Multimodal Interface
The objective of multimodal interfaces is to achieve flexible and natural human-computer interaction by
supporting multiple channels of communication in addition to the predominantly used keyboard and mouse. The
modes of communication supported in multimodal interfaces exhibit a great variety. For example, the multimodal
interface developed at Carnegie Mellon University [22] employs a synergetic use of visual, acoustic, and textual
cues, such as speech, gesture, lip-movement, handwriting. Reilly et al. [16] propose an adaptive non-contact
gesture-based communication system. Otsuka et al. [14] use both facial expression and hand gesture recognition
techniques in designing man-machine interface. Fusion techniques for integrating the multimodal data generated
from various communication channels are addressed by Oviatt et al. [15] and Nigay et al. [12]. As mentioned in
Section 1, our work on multimodal retrieval is of equal significance with multimodal interfaces in terms of
achieving better human-computer interaction, and most likely, it requires a multimodal interface as an “enabling
technique” to support multimodal query methods (which is not within the scope of this paper).
2.3 Cooperative Interface
The concept of cooperative interface has been addressed, to a varying extent, in the research of database systems
[13], computer-supported cooperative work (CSCW) [2], information filtering and recommendation [17], etc.
Nevertheless, to the best of our knowledge, there is no agreed-upon definition of cooperative interface, while its
de facto understanding is an interface that is capable of assisting and collaborating with computer users to solve
certain problems. In the context of multimodal retrieval, a cooperative interface is particularly desired for two
reasons: (1) users can rarely compose precise queries; rather, a series of user-system interactions (such as
relevance feedback) is greatly needed and preferred for users to articulate their needs; (2) as generally recognized,
primitive features of multimedia objects (e.g., color and texture of images) are inadequate to capture their
semantic meanings, which are of essential importance in retrieval. In comparison, user behaviors usually imply a
wealth of knowledge on the semantics of multimedia objects, which serves as an indispensable complement to
their primitive features to achieve high retrieval performance. The cooperative interface of Octopus is designed to
meet both of the objectives.
3. AGGRESSIVE SEARCH MECHANISM FOR MULTIMODAL INFORMATION
The aggressiveness of the proposed search mechanism lies in its ability to embrace multimodal information,
explore multifaceted knowledge, and learn from user interactions to improve retrieval results. As
illustrated in Fig.1, the architecture of the search mechanism consists of three major components: (1) multifaceted
knowledge base (MKB), which models different levels of knowledge on the relevance among multimodal media
objects3; (2) link analysis based retrieval approach, which retrieves relevant multimodal information for user
queries by analyzing the relevance links among multimodal objects in the MKB; (3) learning-from-interaction
strategy, which solicits useful knowledge from user behaviors (e.g., browsing, relevance feedback) in user-system
interactions and improves the quality of retrieval results progressively based on the derived knowledge. The loop
constituted by the three components (as shown in Fig.1) reveals the “hill-climbing” nature of the mechanism: the
more interactions are performed by users, the better their information needs are satisfied by the retrieval results.
The details of the three components are described in the following subsections.
[Figure: the Multifaceted Knowledge Base, Link Analysis based Retrieval, Learning-from-Interactions, and Cooperative Interface components form a loop, exchanging the multimodal query, relevance links, user behaviors, derived knowledge, and desired information.]
Fig.1: Architecture of the aggressive search mechanism
3 By multimodal media objects (or multimodal objects for short), we refer to media objects of various types of modality, such
as text documents, images, video clips, audio, etc.
3.1 Multifaceted Knowledge Base
Exploring knowledge on multiple aspects is a natural requirement of multimodal retrieval. To search multimodal
objects in an integrated manner, not only are features for each type of media object needed, but knowledge
describing the correlation between media objects of diverse types also becomes indispensable. The
multifaceted knowledge base models three categories of knowledge that are available in most multimedia
repositories: (1) primitive features that are directly computable from multimodal objects, such as keywords of
textual documents, color and texture for images; (2) structural relationships among objects in the form of
hyperlinks, spatial adjacency, composition relationships (e.g., a video clip comprises several image frames), etc;
(3) user-perceived relevance among objects that is articulated by users or implied in their behaviors (e.g.,
relevance feedback).
[Figure: three superimposed layers — the perception layer, structure layer, and feature layer — over nodes representing text, audio, video, and image objects. The same legend is used for the other figures in this paper.]
Fig.2: Layered network model in the multifaceted knowledge base
To represent the three types of knowledge in a uniform manner, a layered network model is developed as
the “skeleton” of the MKB. As depicted in Fig.2, this model is constituted by a set of superimposed layers, with
each layer modeling the relevance relationships among multimodal objects defined from a certain perspective.
Specifically, each layer is represented by a network G = (N, L), where N is a finite set of nodes and L is a finite set
of links. Each node in N denotes a media object Oi ∈ O of any modality (typically text, image, video, or audio) in the
database, and each link in L is a triple of the form <Oi, Oj, r> denoting the relevance relationship between objects
Oi and Oj, with r as the link weight indicating the strength of the relevance.
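To make the model concrete, one layer of this network can be sketched in Python as a weighted graph of media objects; the class and method names below are our own illustrative choices, not identifiers from Octopus.

```python
from collections import defaultdict

class Layer:
    """One layer (feature, structure, or perception) of the MKB:
    a network G = (N, L) of media objects joined by weighted links."""
    def __init__(self):
        self.weight = {}                   # (Oi, Oj) -> r, the link weight
        self.neighbors = defaultdict(set)  # Oi -> set of linked objects

    def add_link(self, oi, oj, r):
        """Store the relevance link <Oi, Oj, r> (relevance is symmetric)."""
        self.weight[(oi, oj)] = r
        self.weight[(oj, oi)] = r
        self.neighbors[oi].add(oj)
        self.neighbors[oj].add(oi)

# E.g., a feature link between two images with similarity 0.8:
feature_layer = Layer()
feature_layer.add_link("image_1", "image_2", 0.8)
print(feature_layer.weight[("image_2", "image_1")])  # -> 0.8
```

Storing each of the three layers as a separate instance of such a graph mirrors the superimposed-layer structure of Fig.2, with all layers sharing the same node set.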
The layered network model shown in Fig.2 comprises three layers, namely feature layer, structure layer,
and perception layer, whose links (correspondingly, feature link, structure link, and perception link) represent the
relevance among the same set of multimodal objects derived from the three types of knowledge modeled by the
MKB. Specifically, a feature link represents the similarity between two media objects in terms of their primitive
features. Since primitive features are normally media-dependent, feature links exist only between objects
of the same modality. A structure link connects two objects that have structural relationship between them in the
form of hyperlink, spatial adjacency, composition relationship, etc. Perception links denote the user-perceived
relevance among media objects. Please note that the relative positions of the three layers reflect the priority of the
three types of knowledge in terms of reliability (confidence) in suggesting the relevance among multimodal
objects: specifically, user perception is more reliable than structural relationships, which are in turn more reliable
than primitive features of multimedia objects.
The construction of the MKB requires both “offline” and “online” processing. In particular, feature links
are built offline by calculating the similarity between any two media objects (usually of the same modality) based
on their primitive features. A feature link is created between two objects with similarity above a predefined
threshold, with the link weight set to the similarity score. Techniques for extracting primitive features and
computing similarity for various media types are readily available in existing research such as text-based IR [18]
and content-based retrieval [6][7]. To construct structure links, we inspect the environment where media objects
are collected and identify the structural relationships among them by offline processing. Consider a web page that
contains an embedded image and points to a video clip by a hyperlink in the page. Using our model, the image,
the video clip, and the textual part of the page are all regarded as media objects, and the image and video clip are
connected with the textual page by structure links. In comparison, perception links by definition are not initially
available before any user interactions are performed. Hence, perception links are gradually obtained during the
online interaction with users by the learning-from-interaction strategy to be described in Section 3.3.
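As an illustration of the offline feature-link construction described above, the following sketch computes pairwise similarities within one modality and links objects whose similarity exceeds a predefined threshold; the cosine measure and the 0.7 threshold are assumptions for the example (the paper only requires some similarity function and a threshold).

```python
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def build_feature_links(features, threshold=0.7):
    """features: {object_id: feature_vector}, all of the same modality.
    Returns the feature links <Oi, Oj, r> with r = similarity > threshold."""
    links = []
    for (oi, fi), (oj, fj) in combinations(features.items(), 2):
        r = cosine(fi, fj)
        if r > threshold:
            links.append((oi, oj, r))
    return links

imgs = {"img_a": [1.0, 0.0], "img_b": [0.9, 0.1], "img_c": [0.0, 1.0]}
# Only the img_a/img_b pair is similar enough to be linked:
print(build_feature_links(imgs))
```

In a real deployment the feature extraction itself (color, texture, keywords, MFCCs) would come from the IR and CBR techniques cited in the text; this sketch covers only the thresholded linking step.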
3.2 Link Analysis based Retrieval Approach
The MKB provides a wealth of relevance links4 among multimodal objects; by exploring these links,
multimodal queries can be processed effectively and efficiently. On the other hand, hyperlink analysis technique has been
extensively studied in the domain of Web-based information retrieval [3][9], and successfully applied in many
commercial applications, such as the Web search engine Google [8]. By a close examination, one can easily see
the analogy between the Web, which is a huge network of web pages and various multimedia documents
connected by hyperlinks, and our MKB, which is a network of multimodal objects connected by relevance links.
Therefore, we “borrow” the idea of hyperlink analysis and develop a suite of link analysis algorithms that work
closely to support multimodal retrieval.
[Figure: seed objects, candidate objects, and result objects connected by perception, structure, and feature links; the flow runs from query specification through candidate discovery and result distilling to result presentation, with relevance feedback looping back.]
Fig.3: Link analysis based retrieval approach
As illustrated in Fig.3, the whole retrieval process can be divided into four steps: (1) a user specifies his
query by designating or submitting some multimodal objects as “seed objects”; (2) a relatively large number of
“candidate objects” are discovered based on the seed objects via the relevance links in the MKB; (3) a set of
4 We refer to the perception links, structure links, and feature links in the MKB collectively as relevance links, since all three
types of links indicate the relevance among multimodal media objects.
relevant objects (to the query) are “distilled” from the candidate objects by analyzing the link structure among
them and presented to the user as the retrieval result; (4) if not satisfied with the current result, the user can ask
the system to recalculate the retrieval result based on the feedback he gives on the current result. The details
of the four steps are described below.
3.2.1 Multimodal query specification
In Octopus, a user query is composed of one or more multimodal seed objects (seeds for short) as the
representation or hint of the user’s information need. In this sense, the role of a seed object is similar to that of a
sample media object in the query-by-example (QBE) paradigm widely used in CBR approaches [6][19]. However,
our query specification method differs from QBE in two nontrivial aspects: (a) a query in Octopus can consist of
seed objects of arbitrary number and in arbitrary modality, while QBE normally allows only a single sample
object of a specific modality; (b) owing to the broad range of knowledge modeled in the MKB, based on which
even the relationship between loosely related objects can be established, seed objects are not required to be
precise representations of the desired information. Therefore, the query specification method in Octopus relieves
users from the burden of finding highly representative sample objects, which often frustrates the users of CBR
systems.
A seed object can be either chosen from existing media objects in the collection or submitted as a new
media object. In the latter case, the new media objects can be “created” by the user, such as inputting query
keywords or humming a melody to the microphone, or introduced from external data collections, such as selecting
a sample image stored in the local computer. The new media objects are immediately registered into the MKB
with their feature links and structure links (if any) with existing objects constructed. Therefore, in both cases a
user query is f inally represented as one or multiple media objects in the MKB. Here, the notion of multimodal
query has two levels of interpretation: a query composed by seed objects of multiple media types, and a query
specified using a synergy of multiple methods (e.g., inputting query keywords, submitting example image).
3.2.2 Candidate discovery
A basic premise of our retrieval approach is that the media objects desired by a user are connected with the seed
objects representing the user’s query through relevance links of various types in the MKB. Therefore, by
analyzing the structure of links around the seeds, we are able to figure out the media objects most relevant to the
user request. However, computations (e.g., traversing) involving links in a network model are quite expensive,
especially when the number of nodes and links in the network is large, as in the case of the MKB. To make
sophisticated link analysis computationally affordable, as a pre-processing step we cut down the “search space” to
a small locality around the seeds by discovering a set of candidate objects (candidates for short). Specifically, the
set of candidates C must meet the following two criteria:
1) C contains a majority of media objects that are potentially relevant to the user query.
2) C is small enough to afford the distillation algorithm subsequently applied on it (see Section 3.2.3).
Both criteria favor the media objects connected with the seeds through short paths (i.e., paths constituted by a
small number of links) in the MKB. On the one hand, short paths imply smaller “cumulative error” in reasoning
the relevance relationship between candidates and seeds; on the other hand, the number of media objects that can
be reached from seeds through short paths is likely to be small. Therefore, we specify the scope of candidates as
the media objects reached from seeds through paths with length (viz., number of links) below a predefined
threshold, including the seeds themselves. Since the relevance links in the MKB exist among multimodal objects,
the candidates discovered by these links are by default multimodal. However, even if the maximum path length is
specified, the number of candidates is still very unpredictable, especially when the seeds are heavily linked with
surrounding objects. To deal with such cases, we place a second threshold on the total number of candidates;
when it is exceeded, the excess portion of the discovered candidate objects is cut off.
The process of candidate discovery is summarized by the algorithm in Fig.4. This algorithm also takes
into account the priority of different types of links. Specifically, among the objects reached through paths of the
same length, those reached by perception links are added into candidates prior to those reached by structure links,
which are in turn prior to those reached by feature links. Thus, if the number of discovered objects exceeds the
maximum number of candidates, the objects reached by higher-priority links are considered as candidates in
preference to those reached by lower-priority links.
Discover (S, M, N)
  S: set of seed objects
  M: maximum length of path
  N: maximum number of candidates
  random(C, n): a routine returning n random objects from set C
  return: set of candidate objects

  Set C equal to S
  For i = 1 to M
    Set Cp as the set of objects reachable from any object in C via one perception link
    Set Cs as the set of objects reachable from any object in C via one structure link
    Set Cf as the set of objects reachable from any object in C via one feature link
    If |C ∪ Cp| < N
      If |C ∪ Cp ∪ Cs| < N
        If |C ∪ Cp ∪ Cs ∪ Cf| < N
          C = C ∪ Cp ∪ Cs ∪ Cf
        Else
          Return C ∪ Cp ∪ Cs ∪ random(Cf, N − |C ∪ Cp ∪ Cs|)
      Else
        Return C ∪ Cp ∪ random(Cs, N − |C ∪ Cp|)
    Else
      Return C ∪ random(Cp, N − |C|)
  Next
  Return C
Fig.4: Algorithm for discovering candidate objects
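The algorithm of Fig.4 can be rendered in Python roughly as follows; the per-layer adjacency maps and the helper names are illustrative assumptions, and the sketch presumes the seed set is smaller than the candidate limit.

```python
import random

def discover(seeds, max_len, max_cand, links):
    """links: dict mapping 'perception'/'structure'/'feature' to an
    adjacency map {object -> set of objects one link away}."""
    def one_hop(objs, layer):
        # all objects reachable from objs via one link of the given layer
        out = set()
        for o in objs:
            out |= links[layer].get(o, set())
        return out

    def fill(base, pool, n):
        # add n random objects from pool (excluding those already in base)
        return base | set(random.sample(sorted(pool - base), n))

    c = set(seeds)
    for _ in range(max_len):
        cp = one_hop(c, "perception")
        cs = one_hop(c, "structure")
        cf = one_hop(c, "feature")
        if len(c | cp) >= max_cand:            # perception links first
            return fill(c, cp, max_cand - len(c))
        if len(c | cp | cs) >= max_cand:       # then structure links
            return fill(c | cp, cs, max_cand - len(c | cp))
        if len(c | cp | cs | cf) >= max_cand:  # then feature links
            return fill(c | cp | cs, cf, max_cand - len(c | cp | cs))
        c = c | cp | cs | cf
    return c

links = {"perception": {"s": {"a"}}, "structure": {"s": {"b"}},
         "feature": {"s": {"c"}, "a": {"d"}}}
print(discover({"s"}, 1, 10, links))  # one hop from the seed s
```

Note how the nested tests implement the link-priority rule of the text: when the cap is hit, objects reached by perception links are admitted before those reached by structure links, which precede those reached by feature links.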
The candidate discovery algorithm is based on a rough heuristic that shorter paths (between candidates
and seeds) suggest stronger relevance. As a pre-processing step, it does not consider the type and weight of
relevance links, which are also important factors indicating the relevance among media objects. However, chances
are that an object reached by several high-priority (or highly weighted) links is more relevant than an object reached
by fewer low-priority (or low-weighted) links, and in such cases, relevant objects could be “missed” by this
algorithm. In fact, there exists a tradeoff between the two criteria of choosing candidates stated at the beginning of
this subsection, which is balanced by the maximum number of candidates allowed. The more candidates are
discovered, the more costly the subsequent distillation process is, but the less likely relevant candidates are
missed, and vice versa.
3.2.3 Result distillation
The distillation process aims to “distill” the most relevant media objects out of the potentially relevant candidates
by a sophisticated examination of the link structure among these candidates. For this purpose, the relevance of
each candidate (to the query) is calculated based on the notion of relevance propagation — the relevance can be
propagated from one object to another via the link(s) between them in the MKB. Specifically, we pump the initial
relevance score into the seed objects (which are within the candidates) and allow the relevance to flow through
links among the candidates, with the “amount” of the relevance flow adjusted according to the weight of links.
The propagation process ends when the relevance scores of all candidate objects converge.
Propagate (C, S, M)
  C: set of candidate objects
  S: set of seed objects
  M = [mij]: adjacency matrix of the sub-network corresponding to C at a specific layer of the MKB,
    where mij is equal to the weight of the link between Oi and Oj (mij = 0 if there is no link between them)
  R = [ri]: a vector with each element ri being the relevance score of object Oi in C
  return: converged relevance scores of candidate objects

  For each object Oi in C
    If Oi is in S, then ri = 1; Else ri = 0
  Next
  Normalize R such that Σ ri² = 1
  While R has not converged
    For each object Oi in C
      ri = Σ j=1,…,|C| (rj · mij)
    Next
    Normalize R such that Σ ri² = 1
  Return R

Fig.5: Algorithm for computing relevance scores of candidates at a single layer
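The propagation of Fig.5 is essentially a power iteration over the layer’s adjacency matrix; the sketch below follows that reading, with the epsilon-based convergence test as an implementation assumption.

```python
def propagate(candidates, seeds, m, eps=1e-9, max_iter=1000):
    """candidates: ordered list of object ids; seeds: subset of candidates;
    m[i][j]: weight of the link between objects i and j (0 if none).
    Returns the converged relevance-score vector R."""
    n = len(candidates)

    def normalize(v):
        norm = sum(x * x for x in v) ** 0.5  # so that sum(ri^2) = 1
        return [x / norm for x in v] if norm else v

    # initialize: 1 for seed objects, 0 otherwise, then normalize
    r = normalize([1.0 if o in seeds else 0.0 for o in candidates])
    for _ in range(max_iter):
        # each score becomes the weighted sum of its neighbors' scores
        new_r = normalize([sum(r[j] * m[i][j] for j in range(n))
                           for i in range(n)])
        if max(abs(a - b) for a, b in zip(new_r, r)) < eps:
            return new_r
        r = new_r
    return r

# Two equally inter-linked objects end with equal relevance scores:
scores = propagate(["s", "a"], {"s"}, [[0.5, 0.5], [0.5, 0.5]])
```

This is the same fixed-point computation underlying HITS-style link analysis [3][9], which is why the convergence results cited in the text carry over.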
As the MKB consists of three superimposed layers, each of which has different semantics and priority, the
relevance scores of candidates are calculated separately at each layer by relevance propagation and then merged to
give their overall relevance scores. The relevance propagation on each layer is performed within the sub-network
corresponding to the candidates at this layer. Suppose each candidate has a relevance score, which is initialized to
one if it is a seed object, or zero otherwise. In each round of propagation, the relevance score of a candidate is set
to the sum of the relevance flowed to it from its neighboring objects via the links, with each “ relevance flow”
being the product of the link weight and the relevance score of the neighbor in the previous round of propagation.
(The neighboring objects of an object are those that have a link with this object in the MKB.) Such propagation
proceeds until the relevance score of each candidate converges to a fixed value, which indicates its relevance to
the query according to the knowledge modeled by this layer. Please note that the convergence of relevance scores
(i.e., termination of the propagation process) has been proven in many previous works on link analysis [3][9].
The algorithm describing the propagation process at a single layer of the MKB is given in Fig.5.
After applying the propagation algorithm to all three layers of the MKB, we combine the candidates' relevance scores computed at each layer into their overall relevance scores. Since the three layers deal with different types of knowledge, designing a principled combination strategy is difficult. We adopt a simple strategy: the relevance scores a candidate receives at the three layers are combined linearly to give its overall relevance score, as described by the algorithm in Fig. 6. Intuitively, the weights of the layers are assigned to reflect their priorities (i.e., wP > wS > wF). The candidates are ranked by the overall relevance scores returned by this algorithm, and those with high scores are presented to the user as the retrieval result.
Distill(C, S)
    C: set of candidate objects
    S: set of seed objects
    R = [ri]: a vector with each element ri being the overall relevance score of object Oi in C
    wP, wS, wF: weights of the perception layer, structure layer, and feature layer of the MKB
    MP, MS, MF: adjacency matrices of the sub-networks corresponding to C at the perception,
        structure, and feature layers of the MKB
    return: converged overall relevance scores of candidates

    RP = Propagate(C, S, MP)
    RS = Propagate(C, S, MS)
    RF = Propagate(C, S, MF)
    R = wP·RP + wS·RS + wF·RF
    Return R

Fig. 6: Algorithm for computing overall relevance scores of candidates
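The linear combination at the heart of Fig. 6 can be sketched as below; the concrete weight values are illustrative assumptions of ours, since the paper only requires wP > wS > wF:

```python
def combine_layers(rP, rS, rF, wP=0.5, wS=0.3, wF=0.2):
    """Overall relevance as a weighted sum of per-layer scores (Fig. 6).

    rP, rS, rF -- relevance score vectors from the perception, structure,
                  and feature layers for the same list of candidates
    The weights reflect layer priority: wP > wS > wF (values illustrative).
    """
    return [wP * p + wS * s + wF * f for p, s, f in zip(rP, rS, rF)]
```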
3.2.4 Relevance feedback
Octopus allows a user who is not satisfied with the retrieval result to perform relevance feedback by labeling
some of the retrieved media objects as positive (feedback) examples or negative (feedback) examples. The
labeling can be done to any media object irrespective of its modality. The labeled feedback examples serve as
further indication and qualification of the user’s information need, towards which the current retrieval result is
refined by Octopus. The refinement is based on the intuitive notion that a media object desired by the user must be related to the positive examples and, meanwhile, far from the negative examples. As illustrated in Fig. 7, the refinement is executed in a parallel manner: first, the positive examples are treated as positive seeds, from which a set of positive candidates is discovered and their relevance scores to the positive seeds are calculated, using the algorithms described in Sections 3.2.2 and 3.2.3, respectively; in the meantime, negative candidates are discovered and their relevance to the negative examples is calculated; finally, the media objects constituting the refined result are obtained by integrating the positive and negative candidates along with their relevance scores, as described by the algorithm in Fig. 8.
[Figure: the positive and negative examples each discover a set of candidates, which are distilled separately and then merged into the refined results.]

Fig. 7: Refinement of retrieval results by relevance feedback
Feedback(P, N)
    P: objects as positive examples
    N: objects as negative examples
    R: a vector giving the overall relevance scores of the refined results
    return: overall relevance scores of the refined results

    CP = Discover(P)
    CN = Discover(N)
    RP = Distill(CP, P)
    RN = Distill(CN, N)
    For each object Oi in CP
        If Oi is in CN
            R(Oi) = RP(Oi) − RN(Oi)
        Else
            R(Oi) = RP(Oi)
    Next
    Return R

Fig. 8: Algorithm for calculating the refined retrieval result by relevance feedback
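The merging step of Fig. 8 subtracts an object's relevance to the negative examples from its relevance to the positive examples. A minimal sketch, assuming the per-candidate scores have already been computed by Discover/Distill and are held in dictionaries keyed by object id (a representation of our choosing):

```python
def merge_feedback(RP, RN):
    """Merge positive and negative candidate scores (Fig. 8).

    RP -- {object id: relevance to positive examples} for positive candidates
    RN -- {object id: relevance to negative examples} for negative candidates
    A desired object should score high in RP and low in RN, so the negative
    relevance is subtracted; objects appearing only in RN are dropped.
    """
    return {obj: score - RN.get(obj, 0.0) for obj, score in RP.items()}
```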
3.3 Learning-from-Interaction Strategy
In Octopus, user interactions around information retrieval not only draw on the knowledge stored in the MKB but also replenish it. Specifically, the learning-from-interaction strategy derives from various user behaviors the relevance relationships among multimodal objects and incorporates them into the MKB by updating the perception links. Two forms of "informative" user behavior are currently supported by this strategy. The first is relevance feedback explicitly conducted by users: when a user submits a query as seed objects and then labels some retrieved media objects as positive and negative examples, the relevance relation between the seeds and the positive examples, as well as the irrelevance relation between the seeds and the negative examples, is apparent. In addition, relevance relationships are implied when the user browses through the media objects retrieved for a query: if the user's attention is held by an object long enough to justify his interest in it, its relevance to the query (seed objects) can be deduced. Note that this heuristic does not hold the other way around, as objects that receive no user attention are not necessarily irrelevant to the query.
Learning(S, P, N)
    S: set of seed objects
    P: set of positive examples (explicit and implicit)
    N: set of negative examples
    M = [mij]: adjacency matrix of the network at the perception layer
    s, t: positive real numbers

    For each object Oi in S
        For each object Oj in P
            If there is no perception link between Oi and Oj
                Create a perception link between Oi and Oj with mij = s
            Else mij = mij + s
        Next
        For each object Ok in N
            If there is a perception link between Oi and Ok
                mik = mik − t
                If mik < 0, remove the link
        Next
    Next

Fig. 9: Algorithm for updating perception links by learning from user interactions
The derived relevance relationships are integrated into the MKB by creating perception links, or by increasing the link weight if the link already exists; irrelevance relationships are handled in the opposite manner. This is consistent with the semantics of perception links, since the derived relationships reflect the user perception that underlies these behaviors. The algorithm for updating perception links is given in Fig. 9. The relevant objects deduced from the user's browsing behavior are regarded as "implicit" positive examples and treated uniformly by this algorithm.
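The update rule of Fig. 9 can be sketched over a sparse representation of the perception layer, with links stored in a dictionary whose missing keys mean "no link" (the increments s and t are illustrative values; the paper leaves them as free parameters):

```python
def update_perception_links(M, seeds, positives, negatives, s=0.1, t=0.1):
    """Update perception-layer link weights from user feedback (Fig. 9).

    M -- {(i, j): weight} sparse adjacency; a missing key means no link
    For each seed: strengthen (or create) links to positive examples,
    weaken links to negative examples, and drop links whose weight
    falls below zero.
    """
    for i in seeds:
        for j in positives:
            M[(i, j)] = M.get((i, j), 0.0) + s     # create or strengthen
        for k in negatives:
            if (i, k) in M:                        # weaken only existing links
                M[(i, k)] -= t
                if M[(i, k)] < 0:
                    del M[(i, k)]                  # negative weight: remove link
    return M
```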
The contribution of this learning strategy lies mainly in long-term retrieval performance. As users interact with Octopus, the strategy solicits useful knowledge from the interactions of the whole population of users to enrich the underlying MKB. Based on such reliable and ever-growing knowledge, the link-analysis-based retrieval approach is expected to produce results that better satisfy user queries, thus improving the effectiveness of information retrieval.
4. COOPERATIVE USER INTERFACE
This section elaborates on the user interface of Octopus, with emphasis on its ability to accept multimodal queries, display multimodal results, and cooperate closely with users to improve retrieval performance. The interface is developed with Active Server Pages (ASP) technology; it communicates with the underlying search mechanism and the database through a COM interface, and creates dynamic web pages to display the multimodal information retrieved by the search mechanism in standard Web browsers. This architecture has the advantage of allowing multiple users to access Octopus through web browsers simultaneously and remotely.
A screenshot of the main interface is shown in Fig. 10. The interface integrates basic user functions including query specification, result presentation, and relevance feedback. The upper pane provides facilities for composing a multimodal query by inputting relevant keywords, submitting sample images or videos, or a combination of the two. The query is submitted by clicking the "Go Search" button, in response to which a set of relevant media objects is retrieved by the search mechanism and displayed in the lower pane. The user can designate feedback examples among the retrieved objects and click the "Feedback" button to have the system refine the retrieval result. The "Browse It" button is provided for navigation: clicking it displays a set of media objects randomly selected from the database. For all three types of operation, users can restrict the type(s) of the returned media objects to image, video, text, or a mixture of them, these being the three types of data supported by the current version of Octopus.
The lower pane of the interface displays the retrieved multimodal objects in a layout suited to the visual characteristics of their respective modalities. As shown in Fig. 10, each image appears as a thumbnail about a quarter of a row wide, each text document is represented by a title and a short abstract spanning a row, and each video clip is denoted by a set of representative frames extracted from the clip, which also occupy a complete row. By showing media objects in such "condensed" forms, the interface can accommodate a large number of objects in one page while the user can still grasp the main semantics of each object. Beneath each displayed object are two icons bearing the symbols "✓" and "×", as well as two hyperlinks labeled "Detail" and "Similar".
Fig. 10: The cooperative user interface of Octopus (the lower pane shows video, text, and image objects)
Among them, the "Similar" hyperlink initiates a separate query using the corresponding object as a seed object. (The functions of the other icons and hyperlinks are described below.) The media objects are ordered essentially by their estimated relevance to the query, with some local adjustment to arrange every four images into a single row. Moreover, as a benefit of using a Web browser, the interface is naturally scrollable when the media objects in one page cannot fit on the screen. If multiple pages are needed given the number of media objects to display and the page size, a simple navigation bar containing the hyperlinks "First", "Last", "Previous", and "Next" allows users to switch among pages.
As its most distinctive feature, the user interface, coupled with the underlying search mechanism, facilitates
smooth cooperation with users to help them find high-quality results. From the perspective of user intention, two
forms of cooperation are currently supported in Octopus:
• Explicit cooperation by feedback: Since an average user can hardly clarify his request in the initial query, relevance feedback provides him with a second chance to refine and qualify the query in order to obtain improved results. The feedback opinion is expressed by marking the retrieved objects that the user considers relevant to the query as positive examples, and marking the irrelevant objects as negative examples. The marking is performed with the "✓" and "×" icons beneath each displayed object, which, when clicked, are shown in a highlight color to denote a positive or negative example, respectively. After marking all the positive and negative examples, the user can ask the system to refine the current results by clicking the "Feedback" button at the top.
Upon accepting the user's feedback request, the system takes two actions in parallel. On the one hand, the relevance feedback algorithm described in Section 3.2.4 is executed to recalculate the retrieval result and present the refined result to the user. On the other hand, the user's feedback behavior is analyzed by the learning-from-interaction strategy (see Section 3.3) to update the perception links in the MKB. Further rounds of feedback can be conducted based on the results refined by previous rounds, until the user finds the desired information or gives up without finding anything relevant.
• Implicit cooperation by browsing: As mentioned in Section 3.3, relevance among multimodal objects can be deduced from a user's focus of attention as he browses through the retrieval results of a given query. To this end, certain interface facilities are needed to track the user's focus of attention from his browsing behavior. In our interface, the "Detail" hyperlink beneath each object serves this purpose: clicking it opens a separate window displaying the original version of the object (instead of its condensed version), i.e., a full-size image, a complete text document, etc. Naturally, the time during which this window stays open is regarded as the time that the user's attention is held by the corresponding object, and the length of this time is a good indicator of the degree of his interest. Intuitively, if the time a user focuses on a specific object exceeds a certain length, it is very likely that the user regards it as a relevant result to the query he has conducted. Based on this assumption, the learning strategy in Section 3.3 is executed to update the perception links between the "assumed" relevant objects and the seed objects that constitute the query.
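The implicit-cooperation heuristic above can be sketched as a small tracker that timestamps the opening and closing of the "Detail" window; the 5-second threshold is an assumption of ours, as the paper does not specify an attention threshold:

```python
class DwellTracker:
    """Implicit feedback from browsing: objects whose 'Detail' window stays
    open past a threshold are treated as implicit positive examples.
    """

    def __init__(self, threshold_seconds=5.0):   # threshold is an assumption
        self.threshold = threshold_seconds
        self._opened_at = {}                     # object id -> open timestamp
        self.implicit_positives = set()

    def on_detail_open(self, obj_id, timestamp):
        self._opened_at[obj_id] = timestamp

    def on_detail_close(self, obj_id, timestamp):
        opened = self._opened_at.pop(obj_id, None)
        if opened is not None and timestamp - opened >= self.threshold:
            self.implicit_positives.add(obj_id)
        # A quick close does NOT make the object a negative example:
        # lack of attention does not imply irrelevance (the heuristic
        # holds only in one direction).
```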
The merit of the user-system cooperation described above is twofold. On the one hand, cooperation in the form of relevance feedback relieves a user of the difficulty of formulating precise initial queries; rather, he can express and refine his request gradually through a succession of feedback operations. Furthermore, a user has the flexibility to express his query and feedback opinions using media objects of whatever modality is most convenient and effective in the particular situation. Both advantages help even nonprofessional users compose queries effectively and conveniently (which addresses the first challenge raised in Section 1). On the other hand, knowledge about the relevance among multimodal objects can be solicited from user behaviors in both explicit and implicit cooperation to enrich the MKB. Based on this growing knowledge, the quality of retrieval results, and thus the degree of user satisfaction, can be significantly improved (which addresses the second challenge in Section 1).
5. QUALITATIVE PERFORMANCE EVALUATION
To demonstrate the feasibility, usability, and effectiveness of Octopus, qualitative performance evaluations have been conducted with the help of human subjects; the observations are interpreted in this section.
We choose multimedia data collected from web pages as our test dataset for two reasons: (1) the Web is an essentially unlimited repository of multimedia data of various types, including images, animations, and videos; (2) enormous numbers of hyperlinks exist among web pages and multimedia documents, which can be modeled as structure links. In particular, we populate the database of Octopus using a crawler with multimedia data extracted from several multimedia-rich websites, including NBA.com, ESPN.com, and Hollywood.com. In the data collection process, each web page is modeled as a text object, and the multimedia documents embedded in it (such as images) are modeled as media objects of the corresponding modalities. Hyperlinks and composition relationships (e.g., an image embedded in a web page) are modeled as structure links. "Non-informative" images that are widespread in web pages, such as banners, icons, and logos, are filtered out by heuristic rules based on the frame size, file size, and file name of the images. Finally, our database contains around 6,000 text objects, 850 images, and only a few video clips. The scarcity of video clips is because the majority of video on web pages is provided as streaming video that can only be viewed online.
The evaluation is conducted with the help of five students from the Department of Computer Engineering and Information Technology at the City University of Hong Kong as subjects. The students, who have no expert knowledge of information retrieval, are taught the usage of Octopus. After becoming acquainted with the current dataset, they are asked to perform random queries using the various query specification methods provided by Octopus. User behaviors, including queries and the follow-up browsing and relevance feedback, as well as the users' impressions of how closely the retrieved results match their needs, are recorded, and the following observations are made based on these statistics:
• Relevance of retrieval results: The quality of the retrieval results, in terms of relevance to the user's request, depends to a large extent on the method used for query specification, particularly on whether the query is composed of existing media objects or new objects. The retrieval result is usually of high quality if the query is formed from seed objects selected from the database, regardless of their modalities. This can be attributed to the abundant structure links around the existing (seed) objects, which faithfully reflect the relevance relationships among media objects. However, if the query is constituted by new media objects (as seeds), the quality of the retrieval result varies greatly with the modality of the seeds. Specifically, queries by keywords usually generate reasonable results, whereas queries by sample images or videos often fail to retrieve any desired result. This observation can be explained as follows. Since the query is formed from new objects, which normally have no structure links with other objects, the feature links constructed from their primitive features become the major (often the only) type of knowledge used in calculating retrieval results from the seed objects. Primitive features of the various data types differ in descriptive power, and, as is generally believed, keywords of text are more powerful than the low-level features of images and videos. Consequently, queries expressed as keywords produce better results than queries composed of sample media objects.
• Relevance feedback: It is observed from the experiment that if the desired result does not appear on the first page, most subjects choose to turn to the next page(s) to look for relevant results instead of conducting relevance feedback to improve the current result. The reason, we surmise, lies in the users' unfamiliarity with the feedback function (which is not available in most commercial retrieval systems), as well as the difficulty of finding suitable objects to use as positive examples. The average number of feedback rounds conducted for a single query is below 0.3, and the average number of feedback examples designated in each round is slightly above 4. Most subjects agree that feedback achieves a noticeable improvement in the quality of retrieval results, especially when the previous result is of low quality.
• Efficiency: Efficiency is probably the most severe problem faced by Octopus. Running on a desktop with a 1.13 GHz Pentium III CPU and 256 MB of memory, Octopus spends over 6 seconds on average to process a single query retrieving 300 relevant objects, which is rather slow given the scale of the data collection. Moreover, the processing time varies significantly from one query to another, depending on how heavily the seed objects are linked with other objects. The processing time for relevance feedback is roughly double that of the initial query, consistent with the fact that feedback involves two parallel retrieval processes based on the positive and negative examples, respectively. The long latency also accounts for the subjects' reluctance to perform relevance feedback. This inefficiency is mainly due to the network structure of the MKB, whose storage and access entail great computational cost. Both query processing and relevance feedback take a long time because the underlying link analysis algorithms involve a great deal of link manipulation. More seriously, since the number of links can grow as fast as the square of the number of objects in the MKB, retrieval efficiency will degrade sharply as the size of the database grows.
In this preliminary experiment, we have not conducted any quantitative performance evaluation in terms of retrieval precision and recall. The difficulty mainly stems from the lack of a standard multimodal dataset that could serve as a benchmark for multimodal information retrieval. An ideal benchmark for Octopus would contain a set of sample queries expressed by a variety of means, as well as ground truth giving the correct result, as a set of multimodal objects, for each sample query. Such a multimedia data collection is not yet available and is very difficult to construct. Moreover, although the current test dataset of Octopus is collected from the Web, we cannot compare the retrieval results of Octopus with those returned by commercial web search engines, since (1) Octopus models only a portion of the Web, which is much smaller than the coverage of a typical search engine, and (2) the types of queries accepted by Octopus, as well as the types of information retrieved, are fundamentally different from those of any search engine.
6. CONCLUSION
In this paper, we have presented a multimodal information retrieval system, Octopus, to address the challenges of
information retrieval posed by the proliferation of multimodal data. Specifically, an aggressive search mechanism
has been described, which supports the retrieval of multimodal information in response to multimodal user
queries by exploring a broad range of knowledge. A cooperative interface has been introduced that is capable of accepting multimodal queries, presenting multimodal retrieval results, and collaborating closely with users to improve retrieval performance. Observations regarding the qualitative performance of Octopus have been discussed.
As discussed in Section 5, the efficiency drawback has become the bottleneck preventing the practical use of Octopus on real-world data collections, such as digital libraries and the Web, each of which contains thousands or even millions of multimedia objects. As the low efficiency of Octopus is mainly due to the computationally expensive network structure of the MKB, efficient data structures and access strategies for the MKB will be investigated in future work to improve the efficiency of Octopus. As another important direction, we plan to extend the current version of Octopus along several dimensions. On the one hand, more types of prevailing media formats, such as audio, GIF animation, and PowerPoint slides, will gradually be supported. On the other hand, popular functionalities for multimodal data collections other than retrieval, such as navigation, classification, and clustering, will be implemented.
7. ACKNOWLEDGEMENT
The authors would like to thank Dr. Liu Wenyin for a fruitful discussion on user-system cooperation and interaction, which helped shape the presentation of the cooperative interface in this paper. The research has been supported, in part, by the Research Grants Council of the HKSAR, China (Project no. CityU 1119/99E).
8. REFERENCES
[1] A. B. Benitez, J. R. Smith, and S. F. Chang, “MediaNet: a multimedia information network for knowledge
representation,” in Proc. the SPIE 2000 Conf. Internet Multimedia Management Systems, vol. 4210, 2000.
[2] R. Bentley, T. Rodden, P. Sawyer, and I. Sommerville, “An architecture for tailoring cooperative multi-user displays,” in
Proc. Conf. Computer-Supported Cooperative Work, pp. 187-194, 1992.
[3] S. Brin and L. Page, “The anatomy of a large-scale hypertextual Web search engine,” in Proc. 7th Int. World Wide Web
Conf., pp. 107-117, 1998.
[4] S. F. Chang, W. Chen, H. J. Meng, H. Sundaram, and D. Zhong, “VideoQ: an automated content based video search
system using visual cues,” in Proc. ACM Multimedia Conf., pp. 313-324, 1997.
[5] R. Elmasri and B. Navathe, Fundamentals of Database Systems, 2nd Edition. The Benjamin/Cummings Publishing
Company, Inc., Redwood City, CA, 1994.
[6] M. Flickner, H. Sawhney, W. Niblack, and J. Ashley, “Query by image and video content: The QBIC system,” IEEE
Computer, pp. 23-32, 1995.
[7] J. Foote, “An overview of audio information retrieval,” ACM Multimedia Systems, 7: 2-10, 1999.
[8] Google Search Engine. http://www.google.com.
[9] J. M. Kleinberg, “Authoritative sources in a hyperlinked environment,” in Proc. ACM-SIAM Symposium on Discrete
Algorithms, pp. 668-677, 1998.
[10] R. Lempel and A. Soffer, “PicASHOW: pictorial authority search by hyperlinks on the Web,” in Proc. 10th Int. World
Wide Web Conf., pp. 438-448, 2001.
[11] Y. Lu, C.H. Hu, X.Q. Zhu, H.J. Zhang, and Q. Yang, “A unified framework for semantics and feature based relevance
feedback in image retrieval systems,” in Proc. ACM Multimedia Conf., pp. 31- 38, 2000.
[12] L. Nigay and J. Coutaz, “A generic platform for addressing the multimodal challenge,” in Proc. Int. Conf. Computer-
Human Interaction, pp. 98-105, 1995.
[13] A. Motro, “FLEX: A tolerant and cooperative user interface to databases,” IEEE Trans. Knowledge and Data
Engineering, 2(2): 231-246, 1990.
[14] T. Otsuka, A. Utsumi and J. Ohya, “Advanced man-machine interfaces based on computer vision technologies –
Recognizing facial expressions and hand gestures,” in IEEE Int. Workshop on Robot and Human Communication, Vol.1,
pp.56-63, 1998.
[15] S. Oviatt, A. DeAngeli, and K. Kuhn, “Integration and synchronization of input modes during multimodal human-
computer interaction,” in Proc. Int. Conf. Computer-Human Interaction, vol. 1, pp. 415-422, 1997.
[16] R.B. Reilly, M.J. O’Malley, “Adaptive gesture based interfaces for augmentative communication” , IEEE Trans.
Rehabilitation Engineering, 7(2): 174-183, 1999.
[17] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, “GroupLens: an open architecture for collaborative
filtering of netnews,” in Proc .of ACM Conf. on Computer-Supported Cooperative Work, pp. 175-186, 1994.
[18] G. Salton and M.J. McGill, “Introduction to modern information retrieval,” McGraw-Hill Book Company, 1983.
[19] J. R. Smith and S. F. Chang, “VisualSEEk: a fully automated content-based image query system,” in Proc. ACM
Multimedia Conf., pp. 87-98, 1996.
[20] J. R. Smith and S. F. Chang, “Visually searching the Web for content,” IEEE Multimedia, 4(3): 12-20, 1997.
[21] R. Tansley, “The Multimedia Thesaurus: An aid for multimedia information retrieval and navigation” , Master Thesis,
Computer Science, University of Southampton, UK, 1998.
[22] A. Waibel, M.T. Vo, P. Duchnowski, and S. Manke, “Multimodal interfaces,” Artificial Intelligence Review, Special
Volume on Integration of Natural Language and Vision Processing, 10(3-4): 299-319, 1995.
[23] J. Yang, Y. T. Zhuang, and Q. Li, “Search for multi-modality data in digital libraries” , in Proc. 2nd IEEE Pacific-Rim
Conf. on Multimedia, pp. 482-489, Oct. 2001.