Department of Computer Science and Engineering University of Texas at Arlington
Arlington, TX 76019
DIOGENES: A Distributed Search Agent
Ravishankar P. Mysore, Y. Alp Aslandogan {rmysore,alp}@cse.uta.edu
Technical Report CSE-2003-24
This report was also submitted as an M.S. thesis
DIOGENES: A DISTRIBUTED IMAGE SEARCH AGENT
by
RAVISHANKAR PUTTAIAH MYSORE
Presented to the Faculty of the Graduate School of
The University of Texas at Arlington in Partial Fulfillment
of the Requirements
for the Degree of
Master of Science in Computer Science and Engineering
THE UNIVERSITY OF TEXAS AT ARLINGTON
May 2003
ACKNOWLEDGMENTS
First and foremost, I would like to express my sincere thanks to my advisor,
Dr. Alp Aslandogan for his contribution in the development of my research and his
guidance in my professional growth. His strong encouragement and broad vision, made
this work possible.
My parents have been a constant source of inspiration throughout my study and
their prayers have brought me till here. My sister and my brother-in-law, who always
believed in me, provided moral support when ever I needed. My nephew Chinnu’s
bright smile and curious voice over the phone kept me going even during long, sleepless
nights.
I would like to thank all the members of Infolab group who provided a friendly,
humorous and stimulating research environment. Also, I would like to thank my friends
Sriram and Swaroop for their all-round support during my entire master’s education.
This acknowledgement is incomplete without mentioning the support of my friends,
well wishers who helped me throughout my work. Thank you all.
April 1, 2003
ABSTRACT
DIOGENES: A DISTRIBUTED IMAGE SEARCH AGENT
Publication No. ______
Ravishankar Puttaiah Mysore, M.S.
The University of Texas at Arlington, 2003
Supervising Professor: Alp Aslandogan
In this work, we have proposed and implemented a new distributed multi-threaded architecture for reducing the response time of Diogenes, a meta-search agent for finding people images on the web.

Initially, Diogenes was designed as a single-threaded query processing system which ran on a single machine. The new architecture uses multiple machines across the network for efficient crawling and query processing and creates a centralized index to serve user queries. The new architecture is highly scalable and employs the distributed storage concept. We have also implemented a new URL ordering scheme to obtain faster results and a new web interface to serve multiple users simultaneously.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
ABSTRACT
LIST OF ILLUSTRATIONS
LIST OF TABLES
Chapter
1. INTRODUCTION
   1.1 Web based image retrieval
   1.2 Diogenes – Web based image retrieval agent
   1.3 Research Summary
   1.4 Organization of Thesis
2. BACKGROUND: SEARCH ENGINE CONCEPTS
   2.1 Search Tools
      2.1.1 General Purpose Search Engines
      2.1.2 Specialized Domain Search Engines
      2.1.3 Meta Search Engines
      2.1.4 Directories
   2.2 Multimedia Search Engines
   2.3 Search Engine Architectures
      2.3.1 Centralized Search Architecture
      2.3.2 Distributed Search Architecture
3. DIOGENES – INITIAL ARCHITECTURE
   3.1 Overview of Diogenes
      3.1.1 Web Crawler
      3.1.2 Text/HTML Analysis Module
      3.1.3 Image Analysis Module
      3.1.4 Evidence Combination
      3.1.5 Indexer
   3.2 Result Presentation
   3.3 Automatic Feedback
   3.4 Advantages of Diogenes
   3.5 Drawbacks of Diogenes
4. DIOGENES – PROPOSED DISTRIBUTED ARCHITECTURE
   4.1 Design Assumptions, Requirements, and Goals
   4.2 Proposed Architecture – Distributed Multi-Threaded Query Processing System
   4.3 Components of Proposed Architecture
      4.3.1 Web Interface (Web Server) Module
      4.3.2 URL Server
      4.3.3 Borda’s Positional Method of Ranking
      4.3.4 Worker Agent
      4.3.5 Index Server
      4.3.6 Index Database
   4.4 Other Implementation Issues
      4.4.1 Concurrency Control and Thread Safety
      4.4.2 Shared Variable Access
      4.4.3 Parallel & Distributed Computation
      4.4.4 Simultaneous User Queries
5. PERFORMANCE EVALUATION
   5.1 Relevance Analysis
   5.2 Performance Analysis: Initial Architecture
   5.3 Performance Analysis: Multi-Threaded Architecture on a Single Agent
   5.4 Performance Analysis: Distributed Architecture with 4 Agents
   5.5 Conclusion
   5.6 Future Work
REFERENCES
BIOGRAPHICAL INFORMATION
LIST OF ILLUSTRATIONS
Figure
2.1 Internet host numbers from 1989 to 2002
2.2 Working of a typical Web Crawler
2.3 Meta-search engine architecture
2.4 Centralized Search Architecture
2.5 Google’s Query Processing Architecture
2.6 Napster Distributed Architecture
3.1 Working Model of Diogenes
4.1 Distributed Multi-Threaded Query Processing Architecture
4.2 Web Interface Module of Distributed Architecture
4.3 URL Server Architecture
4.4 Worker Agent Architecture
4.5 Structure of an Index Database
5.1 Diogenes results for query: HILLARY CLINTON
5.2 Google results for query: HILLARY CLINTON
5.3 Diogenes results for query: ABRAHAM LINCOLN
5.4 Google results for query: ABRAHAM LINCOLN
LIST OF TABLES
Table
4.1 Contents of an Index file
5.1 Search Engine Precision comparison
CHAPTER I
INTRODUCTION
1.1 Web based image retrieval
It is estimated that, as of 2003, the Web contains more than 4 billion indexable pages and nearly as many images [1]. The World Wide Web has become a major information publishing and retrieval mechanism on the Internet. In principle, an immense amount of information is now accessible at our fingertips.
However, the major challenge is to efficiently locate the desired information in the massive amount of data available. A great deal of research has focused on this area, known as “Web Information Retrieval,” and has led to the development of powerful search engines and web directories that allow users to locate relevant data quickly. Multimedia data, in particular images, constitute important pieces of information on the Web, and image retrieval has been one of the key aspects of web based search. Because of the unique properties of the WWW, image retrieval has always been a challenging task for researchers. A plethora of smaller specialized search engines and directories, meta-search engines, image search engines and browsing assistants exists, many of which are built as meta tools using the data supplied by the major search engines.

Web based image retrieval is the process of extracting and indexing relevant image and multimedia content from the web using efficient image searching and indexing techniques. A key feature of image content on the web is that it is described by text in HTML documents as well as by the content of the image itself. An effective image retrieval system should make use of both text and image data, by integrating text-based and content-based image retrieval techniques.
A number of web image search engines have been built in recent years including
both research prototypes and commercial ones. Among the former category are
WebSeer, WebSEEk, ImageScape, Amore (http://www.ccrl.com/amore), WebHunter,
ImageRover and PicToSeek. Commercial web text search engines such as Google,
Lycos, AltaVista and Yahoo also offer image search facilities.
Both WebSEEk and, to a large extent, WebSeer rely on the words found in image paths and alternate texts. Commercial image search engines such as Google image search (www.google.com), Lycos (http://multimedia.lycos.com/) and AltaVista (http://www.altavista.com/image/default) do not perform any image analysis, but rely solely on image paths and alternate texts. Both WebSEEk and WebSeer organize their images into conceptual categories; a user interested in people images is directed to the people category. WebSEEk apparently uses only textual information for this conceptual categorization. WebSeer goes one step further in image analysis by integrating a face detector. Consequently, its accuracy on people queries is much better than WebSEEk’s.
1.2 Diogenes – A Web based image retrieval agent
Diogenes is an automated web-based image retrieval agent developed by Aslandogan et al. [2]. This web search agent is designed specifically to retrieve facial images from the web using efficient evidence combination algorithms, face detection software and other approaches unique to web-based image retrieval.

Diogenes relies on both textual and image evidence for indexing images on the web. A face detection module examines images on the web for faces. A
face recognition module identifies the face using a database of known person images. A text/HTML analysis module analyzes the HTML content of web pages to establish the relevance of a page to the query. Diogenes uses the Dempster-Shafer evidence combination algorithm to combine all these sources of evidence and classifies the image based on the combined evidence [3].
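The combination rule itself can be illustrated with a small sketch. This is not Diogenes’ actual implementation; the mass values and the two-hypothesis frame ({person, other}) are hypothetical, chosen only to show how evidence from, say, the text analysis and face detection modules could be fused with Dempster’s rule:

```python
def combine_dempster(m1, m2):
    """Combine two mass functions (dicts mapping frozenset hypotheses
    to masses) with Dempster's rule of combination."""
    combined = {}
    conflict = 0.0
    for h1, v1 in m1.items():
        for h2, v2 in m2.items():
            inter = h1 & h2
            if inter:
                combined[inter] = combined.get(inter, 0.0) + v1 * v2
            else:
                conflict += v1 * v2  # mass assigned to incompatible hypotheses
    if conflict >= 1.0:
        raise ValueError("total conflict: sources cannot be combined")
    norm = 1.0 - conflict
    return {h: v / norm for h, v in combined.items()}

# Frame of discernment: does the image show the queried person?
PERSON = frozenset(["person"])
OTHER = frozenset(["other"])
EITHER = PERSON | OTHER  # mass on the whole frame represents ignorance

# Hypothetical masses from text analysis and face detection.
m_text = {PERSON: 0.6, OTHER: 0.1, EITHER: 0.3}
m_face = {PERSON: 0.7, OTHER: 0.2, EITHER: 0.1}

m = combine_dempster(m_text, m_face)
print(round(m[PERSON], 3))  # belief in "person" rises above either source alone
```

Note how the combined mass on the person hypothesis exceeds what either source assigned on its own; this mutual reinforcement is what makes combined textual and visual evidence attractive.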
When compared to similar search engines, Diogenes achieves very high precision for the same query. This can be attributed to the fact that Diogenes makes use of both the reference text and the full text of web pages, as well as information about the image content. Diogenes uses the image path, the alternate text and the full text of the web pages [4]; these assist in establishing relevance more accurately. Even though other search engines (WebSeer, WebSEEk) analyze the content of the image, they do so for a different purpose; Diogenes’ combination of textual and visual analysis results in better precision. Since it makes use of face recognition techniques to identify person images on the web, it eliminates non-person images, greatly improving precision. It does not require a big database for storing all the images, thereby avoiding the huge overhead of large databases.
Even though Diogenes yielded more relevant results than comparable image search engines, the total search time was very high (many hours), beyond an online user’s tolerance for waiting for a response. The initial architecture of Diogenes had drawbacks related to search time and scalability. When we analyzed the performance of Diogenes, we found that nearly 70% of the total search time was spent waiting for web pages to be downloaded from remote web servers. This can be attributed to the slow response of remote web sites, network congestion or the remote site being down. This overhead, also called communication latency, is huge when compared to the total search time, and it grew rapidly with the number of requests sent by the crawler to download web pages. Our main goal was to reduce this overhead. Also, the initial architecture of Diogenes was a single-user system: it could process only one query at any given time and lacked the ability to process multiple user queries simultaneously. This architecture could run only on one system and was not scalable.
1.3 Research Summary
In this work, we address the response time and scalability issues associated with the Diogenes image search agent. The Internet has opened up distributed computation to the world: just about any computer may be invited to participate in a given task. The latest developments in cluster computing and P2P networks have paved the way for achieving supercomputing power using large numbers of regular desktop machines, and advances in Symmetric Multi-Processing (SMP) have enabled efficient use of system resources through parallel computing.

All of the above factors inspired us to propose and implement a “distributed multi-threaded” query processing architecture for Diogenes to overcome its drawbacks related to response time and scalability.
Some of the ideas employed in our work are:
• Exploiting the “data parallel” benefits associated with web-search process to
reduce the communication latency. Using multiple threads to simultaneously
download and process the web pages reduces the total response time.
• Multi-threaded query processing efficiently uses the “system resources” to
their fullest potential.
• Distributed search process facilitates the use of idle CPU cycles of freely
available computers over the network. This also makes use of available
network bandwidth.
• Distributed storage concept reduces the need for huge storage capacity at a
centralized location.
• A new URL ordering scheme makes it possible to obtain relevant results earlier, thus reducing the initial response time.
• A new web interface to allow multiple users to simultaneously use Diogenes
to query the web for person images.
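The first of these ideas, overlapping page downloads with multiple threads, can be sketched as follows. The fetch function below only simulates network latency and the URLs are placeholders; a real crawler would issue HTTP requests. The timing comparison shows why threading hides communication latency: while one thread waits on a slow remote server, the others make progress.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Stand-in for an HTTP download; a real crawler would use
    urllib.request or an HTTP client library here."""
    time.sleep(0.2)  # simulate waiting on the remote web server
    return url, "<html>...page body...</html>"

urls = ["http://example.org/page%d" % i for i in range(8)]

# Sequential crawling: total time is roughly the sum of all latencies.
start = time.time()
for u in urls:
    fetch(u)
sequential = time.time() - start

# Multi-threaded crawling: downloads overlap while threads wait on I/O.
start = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(fetch, urls))
threaded = time.time() - start

print("sequential %.2fs, threaded %.2fs" % (sequential, threaded))
```

With eight threads the simulated crawl finishes in roughly one latency period instead of eight, which is exactly the “data parallel” benefit exploited above.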
1.4 Organization of the Thesis
Chapter 2 of this document gives a brief introduction to search engine concepts: multimedia search engines, distributed search engines, and so on. We also discuss related work in that chapter. We take a look at the initial architecture of Diogenes in Chapter 3. The proposed architecture is discussed in Chapter 4. Finally, in Chapter 5, we discuss the performance of the new architecture compared to the earlier one, and we conclude by noting the scope for future work.
CHAPTER 2
BACKGROUND: SEARCH ENGINE CONCEPTS
It is estimated that, as of 2003, the Web contains more than 4 billion indexable pages and nearly half as many images [1]. Almost 3 million pages, or 59 gigabytes of text, are added daily, and the average life span of a web page is about 44 days [5]. The number of Internet hosts has increased exponentially; Figure 2.1 shows the growth of Internet hosts over the last decade. Thus the World Wide Web has become a major information publishing and retrieval mechanism on the Internet.
In principle, an immense amount of information is now accessible at our fingertips. However, the major challenge is to efficiently locate the desired information in the massive amount of data available. A great deal of research has focused on this area, known as Information Retrieval, and has led to the development of powerful search engines and web directories that allow users to locate relevant data quickly. In addition, there is a plethora of smaller specialized search engines, directories, and personal search and browsing assistants, many of which are built as meta tools using the data supplied by the major search engines.
Information retrieval is the process of identifying and retrieving relevant documents based on a user’s query. The document representation provides a formal description of the information contained in the documents; the query representation provides a formal description of the user’s information need; and the similarity measure defines the rules and procedures for matching the user’s requirement against candidate documents.
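As a minimal, illustrative instance of such a similarity measure (not the one any particular engine uses), documents and queries can be represented as term-frequency vectors and compared with the cosine measure; the sample documents and query below are invented:

```python
import math
from collections import Counter

def cosine_similarity(doc, query):
    """Score a document against a query by comparing their
    term-frequency vectors with the cosine measure."""
    d, q = Counter(doc.lower().split()), Counter(query.lower().split())
    dot = sum(d[t] * q[t] for t in q)          # shared-term weight
    norm = math.sqrt(sum(v * v for v in d.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

docs = ["images of abraham lincoln on the web",
        "distributed search engine architecture"]
query = "abraham lincoln images"

# Rank documents by similarity to the query, most relevant first.
ranked = sorted(docs, key=lambda d: cosine_similarity(d, query), reverse=True)
print(ranked[0])
```

Real systems refine the weighting (e.g. TF-IDF) and the matching, but the three roles named above (document representation, query representation, similarity measure) are all visible even in this toy version.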
Figure 2.1 Internet host numbers from 1989 to 2002 (data from Internet Software Consortium)
In 1990, Alan Emtage, a student at McGill University, created “Archie,” the first search tool, which used anonymous FTP servers to build a repository of Internet files. MIT student Matthew Gray created the World Wide Web Wanderer in June of 1993, the earliest widely acclaimed Web robot. During the spring of 1995, scientists at Digital Equipment Corporation’s research lab in Palo Alto, CA, devised a way to store every word of every page on the entire Internet in a fast, searchable index. This led to the development of the first searchable, full-text database on the World Wide Web. In 1996, Daniel Dreilinger at Colorado State University introduced SavvySearch, one of the first meta search engines [6].
As the web continues to grow, the major general-purpose search engines have faced serious problems. They are unable to index all the documents on the web because of the rapid growth in the amount of data and the number of publicly available documents. Their results may be out of date, and they do not index documents behind authentication requirements or search forms. As more people share their information with others, the need for better search services to locate interesting information becomes increasingly important. Let us analyze some of the different search tools available to us.
2.1 Search Tools
Search tools are sites on the Internet that enable users to locate desired information. Success with searches depends upon the kind of tool being used and the strategies employed for efficient searching. There are different kinds of search tools available; some of them are discussed below.
2.1.1 General Purpose Search Engines
General purpose search engines such as Google, AltaVista, and Yahoo are used for general queries and do not specialize in any particular kind of web data. These search engines usually index all kinds of web pages and are not limited to a particular domain. They have many advantages: they possess simple query interfaces (usually phrase based, with Boolean operators), and results are presented in a structured format, typically including the match’s title, a text summary and possibly an associated score. Their internal indexes are relatively current and are usually updated every 60-90 days, even though not every page is updated at the same frequency.
These general purpose search engines employ powerful crawlers to crawl the web. A web crawler is a program that automatically traverses the web by downloading documents and following links from page to page. Crawlers are mainly used by web search engines to gather data for indexing. Other possible applications include page validation, structural analysis and visualization, update notification, mirroring, and personal web assistants/agents. Web crawlers are also known as spiders, robots, or worms [7]. These crawlers are usually multi-threaded applications capable of downloading tens of pages simultaneously, given the required bandwidth.
A typical web crawling process is shown in Figure 2.2. A URL server maintains a queue of URLs to be processed and forwards them to multiple crawler processes. Each crawler process runs on a different machine, is single-threaded, and uses asynchronous I/O to fetch data from up to 300 web servers in parallel. Each crawler first checks whether crawling robots are allowed on the remote web server. If so, it downloads the page and saves it to disk. It then processes the web page by removing the HTML tags and indexing the words, phrases and images, extracts the URLs present in the page, and updates the URL queue. At this stage, the web crawler can be designed to perform additional processing on these pages.
Figure 2.2: Working of a typical Web Crawler
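The loop described above (pop a URL, check robots permission, download, extract links, grow the queue) can be sketched roughly as follows. The robots check is abstracted into a callback (a real crawler would consult robots.txt, for instance via urllib.robotparser), and the fetch returns a canned page so the example runs offline; everything here is illustrative rather than the code of any particular engine.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href targets so the crawler can grow its URL queue."""
    def __init__(self, base):
        super().__init__()
        self.base, self.links = base, []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base, value))

def crawl_step(queue, fetch, robots_ok, seen):
    """One iteration of the crawl loop: pop a URL, check robots
    permission, download, extract links, update the queue."""
    url = queue.popleft()
    if url in seen or not robots_ok(url):
        return None
    seen.add(url)
    page = fetch(url)          # download (a real crawler also saves to disk)
    parser = LinkExtractor(url)
    parser.feed(page)
    for link in parser.links:  # extend the frontier with newly found URLs
        if link not in seen:
            queue.append(link)
    return page

# Offline demo with a canned page instead of a live HTTP fetch.
queue, seen = deque(["http://example.org/"]), set()
fetch = lambda u: '<html><a href="/a.html">A</a><a href="/b.html">B</a></html>'
page = crawl_step(queue, fetch, lambda u: True, seen)
print(list(queue))
```

A production crawler adds politeness delays, retry logic, duplicate-content detection and per-host throttling around this same skeleton.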
Some problems exist with general purpose engines despite their utility. A common one is the weakness of their query languages (they rarely allow regular expressions). As a result of this lack of expressiveness, the chance of getting bad hits in the query result greatly increases.

Another problem with general purpose search engines is the relatively slow update frequency of their internal indexes, caused by the vast number of web pages that need to be crawled.
The extensive coverage and weak query power of these engines mean that numerous matches are returned (often in the thousands). The engines attempt to score the matches and supply summaries of the pages to alleviate the user’s feeling of being swamped by data. Even so, the user usually has to perform a manual second search by scrolling through the returned matches, trying to select links which are really useful. At best this requires reading the summaries; at worst it involves downloading the full page of a result to examine it in detail, which is time-consuming and a waste of resources.

Still, these kinds of search engines are the most popular and commercially viable engines, answering millions of queries per day. With newer algorithms and refinements, the performance of these search engines keeps improving.
2.1.2 Specialized Domain Search Engines
These search engines focus on a particular domain; examples include medical search engines [4] (www.9-11.com, MedHunt.com, MediSearch), shopping search engines (www.mySimon.com, http://compare.net) and travel search engines (www.tripadvisor.com). For example, TripAdvisor, Inc. [8] provides a comprehensive travel search engine and directory that helps consumers research their travel plans via the web. TripAdvisor’s search technology filters out irrelevant search results to deliver only the most relevant and popular links associated with a given travel search term.

Advantages of specialized domain search engines include the reduction of bad hits, since topics outside the domain are not examined. This increases the chances of fetching relevant results for the user while reducing search time. Also, since the domain is limited, crawling can be more frequent, which keeps the index more current.
2.1.3 Meta Search Engines
A meta-search engine sends queries to multiple search engines and other data sources, then collates the results in some way and formats them for display. The data sources can be internal indexes, associated text search engines, database search engines, message archives, intranet or world wide web search engines, or even file servers. Meta-search engines simply read each search results page and extract the text from the HTML.

In a meta-search engine, the keywords submitted in its search box are transmitted simultaneously to several individual search engines and their databases of web pages. Within a few seconds, the results from all the search engines are obtained and presented to the user. Meta-search engines do not own a database of Web pages. Some examples are MetaCrawler [9], HuskySearch, and ProFusion.
They send the search terms to the databases maintained by other search engines. Good meta-search engines accept complex searches, integrate results well, eliminate duplicates, and offer additional features such as intelligent ranking or clustering of the search results by subject.
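One simple way to merge and rank results from several engines is a Borda-style positional count, a variant of which appears in the proposed Diogenes architecture (section 4.3.3). The sketch below is illustrative only; the URLs and engine result lists are hypothetical:

```python
def borda_merge(result_lists):
    """Merge ranked result lists with a Borda-style positional count:
    each URL earns points according to how high each engine ranks it."""
    scores = {}
    for results in result_lists:
        n = len(results)
        for pos, url in enumerate(results):
            # First place earns n points, second n-1, and so on.
            scores[url] = scores.get(url, 0) + (n - pos)
    # Sorting the score table also deduplicates: each URL appears once.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked lists returned by three source engines.
engine1 = ["u1", "u2", "u3"]
engine2 = ["u2", "u3", "u1"]
engine3 = ["u2", "u1", "u4"]
merged = borda_merge([engine1, engine2, engine3])
print(merged)  # u2 comes first: two of the three engines ranked it on top
```

Because only ranks are used, this method sidesteps the problem noted below that relevance scores from different engines cannot be compared directly.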
The working of a typical meta-search engine is shown in Figure 2.3.
Figure 2.3 Meta-search engine architecture
The main disadvantage of meta search engines is that they are largely dependent on general purpose search engines, and the accuracy of retrieval depends on the efficiency of those engines. Some of the other associated disadvantages are:
• Translation of query syntax and fields not exact
• May need to decode query forms
• Some sources don't sort by relevance
• No way to compare relevance scores
• Vulnerable to search spam
• Variable quality of search engines
• Must keep up with changes in source engines
2.1.4 Directories
Directories are topic-oriented catalogs of Internet sites. Directories are categorized by subject and are usually built by human selection, in contrast to regular search engines, wherein pages are indexed by robots or spiders. Examples include the Yahoo! Directory and the Google Directory.

Directories are organized into subject categories and web pages are classified by subject. These subjects are not standardized and vary according to the scope of each directory. Directory entries are often carefully evaluated and annotated.
One disadvantage is that users can only search what they see (titles, descriptions, subject categories, etc.), so broad or general terms work best. These directories are usually small and specialized, and sites may not always be correctly classified.
2.2 Multimedia Search Engines
The explosive growth of the World Wide Web has made a vast amount of information available to us, especially multimedia data such as images and graphics. New visual information in the form of images, graphics, animations and videos is being published on the Web at an incredible rate. Multimedia information is published both embedded in Web documents and as stand-alone objects. The visual information takes the form of images, graphics, bitmaps, animations, videos, etc. [10]
Multimedia data, in particular images, constitute important pieces of information on the Web, and image retrieval has been one of the key aspects of Web based search. Because of the unique properties of the WWW, image retrieval has always been a challenging task for researchers. The task of searching these images is difficult because the images tend to be poorly indexed. The collections are too large to be manually indexed with captions according to content, and even if they were indexed in this way, it is very unlikely that the author of a caption could make it detailed enough to anticipate all of the potential queries posed by users.
Today, advances in database technology have paved the way for database support of multimedia data types alongside the traditional ones. There are many content-based image retrieval (CBIR) systems which make use of these database capabilities to query images by content, regions and spatial layout [11]. IBM’s QBIC, Virage and Excalibur are some of the most widely used CBIR systems. These systems analyze the image and, based on a relevance rule, assign a weight to it. New search techniques for textual information have also aided developments in web based image retrieval. A wide array of techniques has been used in these image retrieval systems, ranging from systems that ask the user to draw a picture of the desired image to systems that take a simple text query and return the desired images.
Image content on the WWW is described by text in the HTML documents as well as by the image data itself. Thus an effective image retrieval system should make use of both text and image data, by integrating text-based and content-based image retrieval techniques.
A number of multimedia search engines are available on the web, including AltaVista PhotoFinder (http://www.altavista.com/image/default), Google Image search (http://www.images.google.com), WebSeer, Lycos (http://multimedia.lycos.com/) and AlltheWeb (http://multimedia.alltheweb.com/). AltaVista, Lycos and AlltheWeb can also search for audio and video files; the others offer image search capabilities only.
AltaVista [12] PhotoFinder is one such search engine. It measures the similarity of images based on visual characteristics such as dominant colors, shapes and textures. Images are tagged with textual information obtained by text analysis of the corresponding web page. The user cannot set the relative weights of these features, but judging from the results, color seems to be the predominant feature [8].

WebSeer, which was developed at the University of Chicago, makes use of a similar principle. The images collected from the Web are submitted to a number of color tests in order to separate photographs from drawings; some simple tests measure the number of different colors in the image. Keywords are extracted from the image file name, captions, hyperlinks, alternate text and HTML titles. Depending on the image and textual evidence, the images are indexed [13].
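A toy version of one such color test can be sketched as follows. The threshold and pixel data are invented for illustration (WebSeer’s actual tests are more involved [13]); the idea is simply that photographs typically contain far more distinct colors than line drawings or logos:

```python
import random

def distinct_colors(pixels):
    """Count distinct RGB triples in a flat list of pixels."""
    return len(set(pixels))

def looks_like_photograph(pixels, threshold=256):
    """Toy color test: classify an image as a photograph when it
    contains more distinct colors than the (invented) threshold."""
    return distinct_colors(pixels) > threshold

random.seed(0)
# Hypothetical pixel data: a noisy "photo" vs. a two-color "drawing".
photo = [(random.randrange(256), random.randrange(256), random.randrange(256))
         for _ in range(5000)]
drawing = [(0, 0, 0)] * 400 + [(255, 255, 255)] * 400

print(looks_like_photograph(photo), looks_like_photograph(drawing))
```

Cheap screening tests like this let an indexer discard obvious non-photographs before running expensive analysis such as face detection.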
Users searching the World Wide Web presently have a number of options available to them. Google, Excite and Yahoo! are but a few examples of useful search engines. All of these systems have been designed primarily to find text-based information on the Web, and this is the major drawback of these types of image search engines: with a few exceptions such as WebSEEk and WebSeer, they rarely analyze the content of the image. Because of the poor indexing based only on the textual information associated with the images, search engine users rarely get relevant images.
2.3 Search Engine Architectures
As the web has come to be seen as a prime source of information, internet users
have been flocking to search engine sites to retrieve all sorts of information. Nowadays,
popular search engines must answer millions of queries per day. They employ thousands
of computers and huge databases to crawl the entire web, build an index and efficiently
retrieve results for search queries. Scalability is a major issue, and it has led to the
development of improved architectures to serve these needs.
Search engine architectures vary widely. Thanks to extensive research over the
past few years, modern search engines employ thousands of computers and large
databases for efficient information retrieval. Earlier search engines such as Archie and
Gopher used fewer computational resources, since the web was small at the time. As the
web grew, the number of web pages increased, and so did the number of web users.
Today, search engines need vast computational power, huge storage space and high
network bandwidth. These requirements have led to a new breed of search engines
based on distributed architectures comprising thousands of computers.
Search engine architectures can be generally categorized as follows:
I. Centralized search architecture
II. Distributed search architecture
In simple terms, if data processing happens in numerous places, the architecture is
said to be distributed; if the vast majority of processing happens in one place, it is said
to be centralized. Each architecture has its unique strengths and weaknesses. Distributed
architectures arose partly because centralized mainframe models were deemed
inflexible and expensive. On the other hand, it is harder to manage distributed
architectures, fine-tune their performance and pinpoint trouble spots when something
goes wrong. Let us take a brief look at these architectures.
2.3.1 Centralized Search Architecture
As the name suggests, in a centralized search architecture all the essential
processing of the data is done on a central machine. The crawlers download the web
pages, which are then processed, stored and indexed in a single place. Some
characteristics of the centralized search architecture are:
• All computations (crawling, storing, indexing, ...) are done at a local site, not
distributed across machines on the network
• The robot causes heavy server load and heavy traffic
• Data is located centrally
• Low percentage of document coverage
The general form of a centralized search architecture is shown in Figure 2.4: a
crawler downloads web pages from the WWW, an indexer builds the index offline, and
a query engine serves end users online with retrieval and ranking.
Figure 2.4: Centralized Search Architecture
Some of the disadvantages are:
• Large indexes require big disks, memory, network connections
• Not fault tolerant
• Scaling up gets harder and harder
2.3.2 Distributed Search Architecture
A distributed system is “a collection of independent computers that appears to its
users as a single coherent system”. A distributed search architecture makes use of
distributed processing of data and objects across a network of connected systems.
Distributed processing allows us to harness the idle CPU cycles and storage space of
tens, hundreds or thousands of networked systems, working together on a particularly
processing-intensive problem. Increasing desktop CPU power and communication
bandwidth have also helped make distributed computing a practical idea for the
distributed search process.
Distributed search architectures can be categorized as
• Loosely coupled (P2P, Internet based…)
• Tightly coupled (Cluster computing, Grid…)
Loosely coupled systems communicate with each other over the internet and
usually have a decentralized architecture with one or no controllers; examples include
Napster and the Gnutella network. Tightly coupled systems are connected by a
high-speed intranet and are centrally managed. Most grid computing projects, such as
Globus and Atlas, are tightly coupled distributed architectures.
Large crawl-based search engines are typically based on scalable clusters
consisting of a large number of low-cost servers located at one or a few locations and
connected by high-speed LANs. Current commercial information retrieval systems,
such as the web search engines Google [16] and AltaVista, handle tremendous loads by
exploiting implicit parallelism and using SMPs to support their services. Although it is
clear that the more CPUs and disks one has, the more load the system can handle, the
important question is how much hardware and software is needed to exploit these
resources.
Google [17] has one of the largest distributed search architectures, housing over
10,000 computers connected by a high-speed intranet. Its fast search speed can be
attributed to thousands of low-cost PCs networked together (one of the world's largest
Linux clusters) to quickly find each query's answer.
In Google, the web crawling (downloading of web pages) is done by several
distributed crawlers. A URL server sends lists of URLs to be fetched to the crawlers.
The fetched web pages are then sent to the storeserver, which compresses and stores
them in a repository. A multi-threaded indexer analyzes these pages and indexes them,
and results are ranked using the PageRank algorithm.
Figure 2.5 Google’s Query Processing Architecture
The distributed architecture illustrated above has been highly successful for
Google. Even though the entire crawling and indexing process is done in a distributed
manner, there is always a central coordinator controlling the different operations,
resulting in a semi-distributed architecture.
Napster was a popular mp3 search engine based on a centralized P2P
server-client model. In this model, a central server manages traffic between registered
users and maintains a directory of the shared files stored on the users' machines. This
directory is updated frequently, typically every time a user logs on or off the network.
When a request arrives, the central server creates a list of all files matching it, cross-checks
the list against its own database of files, and displays the verified list to the user.
The user then simply selects the desired file, opening a direct HTTP link with the PC
holding that file. The download takes place directly between the network users; the data
is never stored on the central server or on any other intermediate device.
An overview of the Napster architecture is shown in Figure 2.6: user computers
query a centralized index server, then exchange files directly with other peers.
Figure 2.6 Napster Distributed Architecture (Source: www.searchtools.com)
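The centralized directory protocol described above can be sketched in Python as follows. This is a minimal illustration, not Napster's actual implementation; the class and method names are our own.

```python
class CentralIndex:
    """Sketch of a Napster-style central directory: peers register the
    files they share; a search returns (peer, file) matches, and the
    download itself happens peer-to-peer, never through the server."""

    def __init__(self):
        self.directory = {}                  # peer name -> list of file names

    def login(self, peer, files):
        """Called when a user logs on: register the peer's shared files."""
        self.directory[peer] = list(files)

    def logout(self, peer):
        """Called when a user logs off: drop the peer's entries."""
        self.directory.pop(peer, None)

    def search(self, term):
        """Return all (peer, file) pairs whose file name matches the term.
        The user then opens a direct HTTP link to the listed peer."""
        term = term.lower()
        return [(peer, f)
                for peer, files in self.directory.items()
                for f in files if term in f.lower()]
```

Because the directory lives only on the server while the files live only on the peers, the server's failure stops all searches even though no data is lost, which is exactly the single-point-of-failure weakness noted below.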
Advantages:
• The central server enables fast and efficient location of the desired data.
• Because the central server constantly updates the index, the latest files are
immediately available to users for downloading.
• Since every user must be registered on the server's network, the search is
comprehensive: all machines in the network are included.
Disadvantages:
• The central server constitutes a single point of failure.
• The entire network depends on the central servers and can collapse if any one
of them dies.
Distributed search offers many advantages. The most obvious is the ability to
provide supercomputer-level processing power, or better, for a fraction of the cost of a
typical supercomputer. CPU-intensive tasks can be distributed over a large number of
processors. Employing a large number of individual computers reduces the network
bandwidth needed by any single machine, enabling faster downloads. Scalability is
another great advantage of distributed computing.
Though they provide massive processing power, supercomputers are typically not
very scalable once installed, whereas a distributed computing installation is virtually
infinitely scalable: simply adding more systems to the environment increases the
computational capacity of the existing setup. A byproduct of distributed computing is
more efficient use of existing system resources. Fault tolerance of the whole setup also
improves: a distributed crawler will continue to work even if some of the clients die,
since other agents can carry on with the assigned work.
In the next chapter we will take a look at the concept and working of Diogenes and
its existing architecture.
CHAPTER 3
DIOGENES – INITIAL ARCHITECTURE
In this chapter, we discuss the initial architecture and working of Diogenes. We
begin with an overview of Diogenes and then analyze its different components. In
section 3.1.4, we take a look at evidence combination and the Dempster-Shafer
approach. We discuss the advantages and drawbacks of Diogenes in the subsequent
sections.
3.1 OVERVIEW OF DIOGENES
Diogenes is an automated web-based image retrieval agent developed by Dr. Alp
Aslandogan [1] as part of his doctoral research. This web search agent is designed
specifically to retrieve facial images from the web using efficient evidence combination
algorithms, face detection software and other approaches unique to web-based image
retrieval.
Let us look at the control flow of Diogenes in Figure 3.1. It consists of a web
crawler, an HTML analyzer, a face detection module, an evidence combination module,
an indexer and a face database to hold the images. Diogenes retrieves web pages and
associates a person name with each facial image on those pages. The search process is
initiated when a user enters the name of a person as a query. The user issues the query
through an HTML page, and a CGI script takes it up for further processing. The crawler
module issues this query to text search engines such as Yahoo!, Lycos and AltaVista,
obtains the resulting lists of URLs, and retrieves the corresponding web pages. To find
images of the queried person, Diogenes relies on two types of evidence, visual and
textual. A face detection module examines the images on each page for human faces. A
face recognition module identifies each face using a database of known person images.
A text/HTML analysis module analyzes the body of the text with the aim of finding
clues about the object (person) in each image. The outputs of face detection and
text/HTML analysis are merged using the Dempster-Shafer evidence combination
mechanism to classify each image as relevant or not, and the relevant images are
presented back to the user.
Figure 3.1 shows the working model: the input query (runsearch) produces a URL
file whose pages flow, on the text side, through the HTML tag stripper, part-of-speech
tagger (with its lexicon, bigram and rule files) and HTML analyzer, and, on the image
side, through the face detector (giftest), cropper, converter (ImageMagick) and wavelet
face recognizer (wfr.cpp) backed by the face database. Both sides feed the evidence
combination module and adjacency analyzer, whose output goes to the indexer and
HTML composer, with an automatic feedback process back into the face database.
Figure 3.1 Working Model of Diogenes – People Search Engine
Some of the components of Diogenes are explained in detail below.
3.1.1 Web Crawler
Web crawler formulates and sends the input query to different text search
engines like Yahoo!, AltaVista, HotBot, Lycos etc in a format as needed by the
respective search engines. It collects the search results (URL list) for the input query
from these search engines. HTTP requests to download the web pages in the URL list,
is sent to the remote web sites and the crawler gets back the web pages. It retrieves the
text of the web pages and then extracts each of the images referenced on those pages. It
then saves all this information under a unique directory whose name is generated from a
unique time-stamp.
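The crawler's control flow can be sketched as below. Diogenes itself is written in Perl; this Python sketch uses hypothetical `engines` and `fetch` callables standing in for the per-engine query formatting and the HTTP download.

```python
import os
import time

def crawl(query, engines, fetch, basedir="."):
    """Sketch of the crawler: ask each text search engine for result URLs,
    download each page, and save everything under a directory whose name
    is generated from a unique time-stamp.

    engines: dict of engine name -> function(query) -> list of URLs
    fetch:   function(url) -> page text (would issue the HTTP request)
    """
    workdir = os.path.join(basedir, "q%d" % int(time.time()))
    os.makedirs(workdir, exist_ok=True)
    urls = []
    for name, ask in engines.items():        # query each search engine
        for url in ask(query):
            if url not in urls:              # keep each URL only once
                urls.append(url)
    saved = {}
    for i, url in enumerate(urls):           # download and save each page
        path = os.path.join(workdir, "page%03d.html" % i)
        with open(path, "w") as f:
            f.write(fetch(url))
        saved[url] = path
    return workdir, saved
```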
3.1.2 Text/HTML Analysis Module
The text/HTML analysis module of Diogenes determines a degree of association
between each personal name on a web page and each facial image on that page. This
degree of association is based on two factors: Page level statistics and local (or
structural) statistics [2]. Page level statistics such as frequency of occurrence and
location within the page (title, keyword, body text etc...) are independent of any
particular image. Local statistics are those factors that relate a name to an image.
Diogenes takes advantage of the HTML structure of a web page in determining the
degree of association between a personal name and an image.
Some of the cues used for this purpose are:
• Frequency: The significance of a name is proportional to its number of
occurrences within the page and inversely proportional to its number of
occurrences on the whole web. It captures the premise that if a rare word appears
frequently on a page then it is very significant for that page. If a common word on
the other hand appears frequently on a page, it may not be as significant.
• Name or URL Match: A name that is a substring of the image name or the image
URL is assigned a higher significance.
• Shared HTML Tags: Names that are enclosed in the same HTML tags with an
image are more likely to be associated with that image. For instance, a caption for
an image is usually put in the same HTML table on the same column of adjacent
rows.
• Alternate Text: The alternate text identified by the “ALT” HTML tag generally
serves as a suitable textual replacement for an image or a description of it.
When a page is retrieved, a part-of-speech tagger (Brill's tagger [14]) tags all the
words that are part of a proper name on the page, and their occurrence frequencies are
recorded. For each such word and for each image on the page, a degree of association
is established, with the frequency of the word serving as the starting point for this
score.
Then the HTML analysis module analyzes the HTML structure of the page. If an
image and a word share some common tags, their degree of association is increased. If
the word is a substring of the image name or if the word is part of the alternate text for
the image, the association is increased further. Since the text/HTML analysis module
assigns degrees of association to individual words, at the time of evidence combination
the scores of the two words (the first name and the last name) that make up a personal
name are averaged to get a single text/HTML score.
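The cues above can be combined into a single degree of association, as in this illustrative sketch. The boost factors (2.0, 1.5) are our own assumptions for illustration, not Diogenes' actual weights.

```python
def association_score(word, page_freq, web_freq, image_url, shared_tags, alt_text):
    """Illustrative degree of association between a proper-name word and
    an image, following the cues listed in the text."""
    score = page_freq / float(web_freq)    # rare-but-frequent words score high
    if word.lower() in image_url.lower():  # name is substring of image name/URL
        score *= 2.0
    if shared_tags:                        # word and image share HTML tags
        score *= 1.5
    if word.lower() in alt_text.lower():   # ALT text often describes the image
        score *= 2.0
    return score

def name_score(first_word_score, last_word_score):
    """Per the text, the scores of the first-name and last-name words
    are averaged into a single text/HTML score."""
    return (first_word_score + last_word_score) / 2.0
```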
3.1.3 Image Analysis Module
The visual evidence used by the classifier of Diogenes consists of the outputs of
the face detection and face recognition modules. The neural-network-based face
detection module examines an image to find a human face [15]. The detector converts
the image to gray scale and checks for the presence of eyes in the image; if eyes are
found, the distance between them serves as a measure of confidence that a person's
image is present in the picture. The face location is indicated to an intermediate module
which crops (cuts out) the facial portion and submits it to the face recognition module.
Diogenes uses a face recognition module which implements the eigenface method.
This module uses a set of known facial images for training. Each of these training
images has an associated personal name with it. At recognition time a set of distance
values between the input image and those of the training set are reported. These
distances indicate how dissimilar the input image is to the training images. In addition a
global distance value called “Distance from Face Space” or DFFS is also reported. This
is the global distance of the input image from the facial image space spanned by the
training images. Diogenes uses this latter value to determine the uncertainty of the
recognition.
3.1.4 Evidence Combination
Evidence combination is a powerful tool used for managing uncertainty in fields
such as robot vision, remote surveillance and automated equipment monitoring and
medical diagnosis [3]. In information retrieval, evidence combination has been used
successfully in integrating the evidence of different query representations or different
retrieval and ranking strategies. Bayesian statistical models and Fuzzy sets are among
the means used by researchers to integrate different pieces of evidence. The ultimate
goal of evidence combination is to improve the accuracy of a classifier. Depending on
the context, the objects to be classified may be documents, images, landscapes or
physical objects.
In Diogenes, the Dempster-Shafer evidence combination mechanism is used for
combining the evidence obtained by image and textual analysis. It provides a method
for combining independent bodies of evidence using Dempster's rule: the different
pieces of evidence are combined into a single relevance value for a person image on a
web page with respect to the input user query. The output of the face recognition
module, which classifies the image, and the output of the text analysis module
constitute the evidence for this algorithm. These pieces of evidence are considered
independent, since the output of one module does not affect the other. The text analysis
module assigns a degree of association between the input query and the contents of the
web page, and the face recognition module gives a distance value according to the
similarity of the image to the images in the face database. These scores are used in the
simplified Dempster-Shafer evidence combination algorithm to obtain a degree of
relevance.
We have the output of a face recognition module (FR), which classifies the
image, and the output of a text/HTML analysis module (TA), which analyzes the text
that accompanies the image. Both modules attempt to identify the person in the image
based on different media. We designate the two bodies of evidence as m_FR and m_TA
respectively. The result of the face recognition module does not affect the text/HTML
score and vice versa.
Using the simplified formula [3] of Dempster-Shafer evidence combination
algorithm, the image ranking in our case can be represented as
rank(c) ∝ m_FR(c)·m_TA(c) + m_FR(c)·m_TA(θ) + m_TA(c)·m_FR(θ)
Here ∝ represents the “is proportional to” relationship, and m_FR(θ) and m_TA(θ)
represent the uncertainties in the bodies of evidence m_FR and m_TA respectively.
These are obtained as follows. For face recognition, we have a “distance from face
space” (DFFS)
value for each recognition. This value is the distance of the query image to the space of
eigen-faces formed from the training images. Diogenes uses the DFFS value to estimate
the uncertainty associated with face recognition. If the DFFS value is small, the
recognition is good (uncertainty is low) and vice versa. The following is Diogenes'
formula for the uncertainty in face recognition:
m_FR(θ) = 1 − 1/ln(e + DFFS)
For text analysis, uncertainty is inversely proportional to the maximum value among the
set of degree of association values assigned to name-image combinations.
m_TA(θ) = 1/ln(e + MDA)
where MDA is the maximum numeric “degree of association” value assigned to a
personal name with respect to a facial image among other names.
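The two uncertainty formulas and the simplified combination rule can be written directly in code. This is a sketch; the function names are ours, not Diogenes'.

```python
import math

def fr_uncertainty(dffs):
    """m_FR(theta) = 1 - 1/ln(e + DFFS): a small DFFS means a confident
    recognition, hence low uncertainty."""
    return 1.0 - 1.0 / math.log(math.e + dffs)

def ta_uncertainty(mda):
    """m_TA(theta) = 1/ln(e + MDA): a high maximum degree of association
    (MDA) means low textual uncertainty."""
    return 1.0 / math.log(math.e + mda)

def rank(m_fr_c, m_ta_c, dffs, mda):
    """Simplified Dempster-Shafer rank for a candidate person c:
    rank(c) ~ m_FR(c)m_TA(c) + m_FR(c)m_TA(theta) + m_TA(c)m_FR(theta)."""
    m_fr_theta = fr_uncertainty(dffs)
    m_ta_theta = ta_uncertainty(mda)
    return m_fr_c * m_ta_c + m_fr_c * m_ta_theta + m_ta_c * m_fr_theta
```

Note how the uncertainty terms let one body of evidence carry the rank when the other is unsure: if the text score is uncertain (large m_TA(θ)), a confident face match still contributes through the m_FR(c)·m_TA(θ) term.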
3.1.5 Indexer
This module reads in the degrees of association assigned by the evidence
combination module for each image and person name pair and produces an index file
for each person.
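The indexer step can be sketched as follows; the per-person `.idx` file layout shown here is an assumption for illustration, not Diogenes' exact format.

```python
import os

def build_index(records, outdir):
    """Sketch of the indexer: group (person, image, score) records from the
    evidence combination step and write one .idx file per person, with one
    'image score' line each, best score first."""
    os.makedirs(outdir, exist_ok=True)
    by_person = {}
    for person, image, score in records:
        by_person.setdefault(person, []).append((image, score))
    for person, entries in by_person.items():
        path = os.path.join(outdir, person.replace(" ", "_") + ".idx")
        with open(path, "w") as f:
            for image, score in sorted(entries, key=lambda e: -e[1]):
                f.write("%s %.3f\n" % (image, score))
    return sorted(by_person)
```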
3.2 Result Presentation
The images found to be relevant are cropped into thumbnails, and an HTML page
is generated containing these thumbnails. The images are displayed in decreasing order
of their total scores.
3.3 Automatic Feedback
One important feature of Diogenes is the automatic feedback of images
determined to be relevant by the user. After the search finishes, the user can mark the
images he or she found relevant in the resulting HTML page. These images are stored
in a face database that is used for future queries: newly downloaded images are
compared with the stored images for similarity, and if a downloaded image is similar
to the stored images of that particular person, it is given a higher weight. This process
yields better results.
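The feedback boost can be sketched as below. Both the similarity predicate and the 1.5 boost factor are assumptions made for illustration; Diogenes actually compares images through its wavelet face recognizer.

```python
def reweight(score, image, stored_faces, similar):
    """Sketch of the feedback boost: if a newly downloaded image is similar
    to an image the user previously marked relevant, raise its score.

    similar: predicate(image, known) -> bool, standing in for the face
    similarity test."""
    for known in stored_faces:
        if similar(image, known):
            return score * 1.5   # boost factor is an assumption
    return score
```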
3.4 Advantages of Diogenes
Compared to similar search engines, Diogenes achieves very high precision for
the same query. This can be attributed to the fact that Diogenes makes use of the image
path, the alternate text and the full text of the web pages, all of which help establish
relevance more accurately. Although some other search engines (WebSeer, WebSEEk)
analyze image content, they do so for different purposes; Diogenes' combination of
textual and visual analysis results in better precision. Since it uses face recognition
techniques to identify person images on the web, it eliminates non-person images,
greatly improving precision. It does not use a large database for storing all the images,
thereby avoiding the huge overhead of large databases.
Since Diogenes is a meta-search agent, it has all the advantages associated with
a meta-search engine. An important feature of Diogenes is the incorporation of face
recognition and the use of the Dempster-Shafer evidence combination method with
object recognition and automatic, local uncertainty assessment. When the text/HTML
analysis module or the visual analysis module assigns a degree of similarity between an
image and a person name, there is a degree of uncertainty associated with each value.
In Diogenes' case, both modules produce numeric values indicating their degrees of
uncertainty; these values are obtained automatically, without user interaction, and
locally, separately for each retrieval/classification.
3.5 Drawbacks of Diogenes
Diogenes is designed for person image search on the web; other types of image
search will not yield relevant images, and other “object detection” methods would need
to be employed to support them. The on-the-fly search of Diogenes also delays the
search process: it can take hours to search the web for relevant person images, because
Diogenes is a single-threaded system, crawling is done at query time, and there is no
internal database storing previously retrieved images. Furthermore, Diogenes is a
single-user system: it can process only one query at a time and lacks the ability for
simultaneous processing. The architecture runs on only one system and is not scalable.
Some technical details about the initial architecture of Diogenes:
• It is written in Perl
• It was originally built on the Solaris platform
• It makes use of a face detection program written by Henry A. Rowley of
Carnegie Mellon University [15]
• It makes use of a wavelet-based face recognition program
• It requires the ImageMagick software for image processing
• It makes use of a rule-based tagger for text processing, written by Eric Brill
of MIT [14]
CHAPTER 4
DIOGENES – PROPOSED DISTRIBUTED ARCHITECTURE
In this chapter, we propose a new multi-threaded distributed architecture for
Diogenes. Our aim is to drastically reduce the search time by making use of freely
available computers across the network. In section 4.1, we discuss the design
requirements for the new architecture. Section 4.2 explains the proposed distributed
architecture, and section 4.3 the individual components of the new architecture. In
section 4.4, we discuss some of the other implementation issues we tackled while
implementing the proposed architecture.
4.1 Design Assumptions, Requirements, and Goals
In this section we give a brief presentation of the most important design choices
which have guided the implementation of distributed Diogenes. More precisely, we
sketch general design goals and requirements, as well as assumptions made during our
implementation.
As seen in chapter 3, the initial architecture of Diogenes had drawbacks relating
to search time and scalability. When we analyzed its performance, we found that nearly
70% of the total search time was spent waiting for web pages to be downloaded from
remote web servers, owing to slow responses from remote sites, network congestion or
sites being down. This overhead, also called communication latency, is huge compared
to the rest of the search time, and it grows rapidly with the number of requests the
crawler sends to download web pages. Our main goal was to reduce this overhead.
Between the request and the response when downloading web pages, the CPU is
idle, wasting CPU cycles. By making Diogenes a multi-threaded system, we can
drastically reduce the download time. The principal operating objective for a
multi-threaded processor is to have a sufficient number of parallel tasks multiplexed
onto the hardware so as to eliminate or minimize idle time in the presence of
long-latency operations [18]. While one thread is waiting to download pages from a
site, another thread can process the already downloaded pages; multi-threading is also
the natural way of implementing non-blocking communication operations. Further,
threads provide greater concurrency within a single process and allow efficient use of
system resources (e.g. CPU, memory). By coordinating these threads efficiently, we
can greatly improve Diogenes' performance.
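The overlap of download latency and page processing can be sketched with a thread pool draining a shared work queue. This is an illustrative Python sketch, not the Perl implementation; `download` and `process` stand in for the crawler and analysis steps.

```python
import queue
import threading

def run_pipeline(urls, download, process, n_threads=4):
    """Sketch: while one thread waits on a long-latency download, the
    others keep the CPU busy processing already-downloaded pages."""
    todo = queue.Queue()
    for url in urls:
        todo.put(url)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                url = todo.get_nowait()
            except queue.Empty:
                return                      # no work left
            page = download(url)            # long-latency I/O
            out = process(page)             # CPU-bound analysis
            with lock:                      # thread-safe result collection
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```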
Information retrieval is a very computation-intensive task, and image retrieval
needs even more computational power because of the image analysis involved. Even if
we reduce the communication latency drastically, the final response time, dominated by
the total computation time, remains higher than that of similar web image search
agents: the image processing steps, which include face detection, face recognition,
cropping, conversion to gray scale and so on, need a great deal of processing power. To
achieve a response time in the range of several minutes instead of several hours, we
need a distributed architecture that spreads this huge processing requirement among
different machines for computational speed-up. We can distribute the total
computational load among inexpensive PCs to obtain faster responses from Diogenes
[19]. To this end, we design a distributed query processing system encompassing
several machines across the network; the more machines involved in the computation,
the faster the response time.
For Diogenes to act as an image search engine, it must maintain its own image
index and store the images locally, so that when another user later issues the same
query, it can retrieve and present these images directly. We therefore need to design an
indexing and storage architecture for Diogenes. A distributed storage architecture is
preferred, since the total disk space required is split across many machines on the
network, greatly reducing per-machine storage requirements. To achieve this
distributed storage goal, we assume that every worker agent machine runs a web server,
which makes it possible to avoid transferring images back to the central index server.
In the initial architecture, the results obtained from different text search engines
are simply concatenated and no re-ranking is performed. Re-ranking is necessary
because different text search engines have varying precision, and simply appending one
list after another may put more relevant results at the bottom of the URL list. More
relevant URLs would then be processed later, causing the online user to wait longer.
We need a URL merging and ranking scheme that takes into account the precision of
the different text search engines as well as the relative order of the URLs within each
engine's results.
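One possible merging scheme satisfying these two criteria is sketched below: each engine is weighted by an assumed precision, each URL's contribution decays with its position in that engine's list, and a URL returned by several engines accumulates score. The scheme and its weights are our assumptions for illustration, not the final design.

```python
def merge_rank(results, precision):
    """Sketch of a precision-weighted, position-discounted URL merge.

    results:   dict of engine name -> ordered list of result URLs
    precision: dict of engine name -> assumed precision weight
    Returns all URLs, best first."""
    scores = {}
    for engine, urls in results.items():
        weight = precision.get(engine, 1.0)
        for pos, url in enumerate(urls):
            # earlier positions count more; repeated URLs accumulate
            scores[url] = scores.get(url, 0.0) + weight / (pos + 1)
    return sorted(scores, key=scores.get, reverse=True)
```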
A web interface that can serve multiple users simultaneously will vastly improve
the usability of Diogenes. The system should be able to handle the requests of multiple
users concurrently while still maintaining a decent response time.
The index server only needs to know each worker agent's name (computer
name), the corresponding image path on that computer and the relevance score. When
this information is embedded in the HTML code, the user's browser will automatically
pull the images from the worker agents' web servers. For all practical purposes, we also
assume that all worker agent machines have the same configuration (CPU power,
memory, hard disk, ...), which greatly simplifies load balancing at this stage.
In the next section, we propose a new Distributed Multi-threaded Query
Processing architecture for Diogenes which will satisfy most of the above mentioned
design criteria.
4.2 Proposed Architecture – Distributed Multi-Threaded Query
Processing System
We propose a novel architecture for Diogenes to overcome some of its
drawbacks. The distributed memory architecture proposed here effectively makes use
of different computers across the network. In this architecture, each processor has its
own local memory and uses message passing to exchange data with the other
processors. Parallel computing techniques allow each processor to work on its own
section of the problem, exchanging the data in its local memory with the other
processors. Parallel computing thus makes it possible to solve problems that cannot be
solved on a single processor, or not within a reasonable amount of time.
There are two types of parallel programming paradigms:
1. Data parallel
- Each processor performs the same task on different data
- Example: grid problems
2. Task parallel
- Each processor performs a different task on the same data
- Example: signal processing
Our search process presents a classic opportunity for data-parallel programming:
the data (the URL list) can be processed concurrently, hiding communication latency.
The URL list is divided among different CPUs, which in turn process their URLs and
return the results; each processor performs the same analysis operation on different
URLs.
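The division of the URL list among worker agents can be sketched as a simple round-robin deal, which keeps the shares even; the choice of round-robin (rather than contiguous blocks) is our assumption, and it has the side benefit that the highest-ranked URLs are processed first on every worker.

```python
def split_urls(urls, n_workers):
    """Sketch of the data-parallel split: deal the ranked URL list out
    round-robin so each worker agent gets an even share and runs the
    same analysis on its own URLs."""
    chunks = [[] for _ in range(n_workers)]
    for i, url in enumerate(urls):
        chunks[i % n_workers].append(url)
    return chunks
```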
We employ this data-parallel, distributed-memory concept in our proposed
architecture for the Diogenes image search agent. The design is based on a centralized
P2P architecture (like Napster) with a centralized server storing the index. As shown in
Figure 4.1, the basic components of the system are a web server, a URL server, a set of
worker agents and an index server. All components are connected by a network; in this
work we use a local area network, though the architecture logically works over a wide
area network as well.
Figure 4.1 shows the architecture: the user's query enters through the web
interface; if an index entry is already present, cached images are returned as the result,
otherwise the URL server distributes URLs to the worker agents (on PC1, PC2,
PC3, ...), whose results are gathered by the index server to create the index and answer
the query.
Figure 4.1 Distributed Multi-Threaded Query Processing Architecture
The proposed system works as follows:
• The main server, which also hosts the web server, is the controlling module of
this architecture. It receives the input query from the user through a web
interface implemented with CGI scripts.
• The query is forwarded to a URL server, which sends it to text search engines
such as Yahoo!, Lycos and AltaVista and creates a URL file from the URLs
these engines return.
• These URLs are re-ranked based on their relative ranking within each search
result and on multiple occurrences across the results of different search engines.
• The URL server divides this URL list and sends the parts to worker agents
residing on different machines (PC1, PC2, PC3, ...) to be processed by these
clients.
• The worker agents each contain a copy of the Diogenes program. They create
multiple threads which take URLs from this list and run the Diogenes module
for each one, sending an HTTP request to the URL and downloading the web
page for further processing.
• Each thread processes its downloaded web page by analyzing the textual and
image content and applying the Dempster-Shafer evidence combination
algorithm to establish relevance.
• After finishing the analysis, the threads call the index server and send it the
relevance score and other details; the images themselves are kept locally on the
worker agent machine.
• The index server receives the index data and appends it to an index database
keyed on person name.
• The web server keeps polling the index database for information about the
query. When entries for the query are found, it creates an HTML page
comprising the paths of the relevant images on the different worker agents.
• We assume that all the worker agents run a web server, so all the relevant
images are pulled by the user's browser directly from those web servers.
4.3 Components of proposed architecture
In this section we describe in detail the functionality of the different
components of the new architecture.
4.3.1 Web Interface (Web Server) Module
The web interface is essentially a web server which provides a graphical user
interface for accepting the user's query. It also initiates a web search for
the requested person's images. The operation of the web interface is shown in
Figure 4.2.
When a query is entered through the web interface, a CGI script takes the query
input (a person name) and checks for its presence in the index database. If the
index database contains an entry (a file) for this person name, the script
opens the file, reads its contents and composes a result page. The actual
images are stored in a distributed fashion across different computers, and
their paths are included in the HTML code; the user's browser pulls these
images from the different computers, since a web server runs on each of these
client machines.
If the index database does not contain the query, the web interface forks off a
process which calls the URL server with the query as the argument. After
initiating the search process, this module checks the status of the process,
waits for a certain period of time and then checks the index database for the
query again. If an entry is found in the database, it composes a page and
returns it to the user. This page keeps refreshing until the search process is
finished.
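The check-wait-recheck loop described above can be sketched as a small polling routine. This is an illustrative sketch, not the actual CGI script: the index database is assumed to be a directory with one file per query (as Section 4.3.6 describes), and the method name is ours.

```java
import java.io.File;

// Polls the index database directory for a file named after the query,
// sleeping between checks, until the entry appears or we give up.
public class IndexPoller {

    public static boolean waitForIndex(File indexDir, String query,
                                       int maxTries, long intervalMs) {
        File entry = new File(indexDir, query);
        for (int i = 0; i < maxTries; i++) {
            if (entry.exists()) return true;   // index entry found: compose page
            try {
                Thread.sleep(intervalMs);      // wait, then check again
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return entry.exists();
    }
}
```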
[Figure: the query goes to the web server, which checks whether an index entry
is present in the index database; if not, a new search process sends the query
to the URL server; if so, a page is composed and output to the user.]
Figure 4.2 Web Interface Module of Distributed Architecture
The CGI script reads the index file and orders the images by their relevance
values, which are stored in the index file. Any URL repeated in the index file
is eliminated, so only one image per web site is selected for display.
4.3.2 URL Server
The URL server is one of the main modules of distributed Diogenes. Figure 4.3
illustrates its detailed workflow. This module carries out several tasks:
querying multiple text search engines, merging and ranking the URLs, and
distributing the URLs among the worker agents.
The URL server receives the person name as input. It sends this query to
multiple text search engines (Yahoo, AltaVista, HotBot, ...) and retrieves the
results. This is done in parallel to avoid communication latency: multiple
threads are created, and these threads issue requests to the text search
engines and build a URL list.
[Figure: the query dispatcher issues the query via multiple threads to Yahoo,
AltaVista and HotBot; the returned URL lists are merged and ranked, and the URL
distributor reads the agent list (Descartes.uta.edu, csl.uta.edu, ...) and
sends partial URL lists to Worker Agents 1-4.]
Figure 4.3 URL Server architecture
Duplicate URLs in this list are removed, and the remaining URLs are merged into
a single list by applying Borda's positional method [20]. This re-ranked URL
list is distributed among the worker agents across the network. The URL
distributor reads a file called "AgentList", which contains the names of the
computers hosting worker agents, and distributes the URLs to those machines.
In the URL distributor we follow an "interleaved data distribution" approach to
distribute the URLs among the worker agents. Interleaving ensures that URLs at
the top of the re-ranked list are processed earlier than URLs at the bottom, so
the order of the re-ranked URLs is preserved during processing. The
distribution is carried out by multiple threads which divide the data and send
partial URL lists to the worker agents.
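The interleaving step can be sketched as a simple round-robin split. This is a minimal sketch under the assumption (consistent with the text) that URL j of the re-ranked list goes to agent j mod numAgents; the class name is ours.

```java
import java.util.ArrayList;
import java.util.List;

// Interleaved data distribution: the top of the re-ranked list is spread
// across all agents, so every agent processes high-ranked URLs first.
public class InterleavedDistributor {

    public static List<List<String>> distribute(List<String> urls, int numAgents) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numAgents; i++) {
            parts.add(new ArrayList<>());
        }
        for (int j = 0; j < urls.size(); j++) {
            parts.get(j % numAgents).add(urls.get(j));  // round-robin assignment
        }
        return parts;
    }
}
```

With five URLs and two agents, agent 0 receives URLs 1, 3, 5 and agent 1 receives URLs 2, 4, each partial list still in rank order.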
4.3.3 Borda’s Positional Method of Ranking
Rank aggregation (RA) is the problem of collating a given set of rankings; for
example, it may be used to declare overall team positions based on the rankings
given by various judges. In our setting, we use rank aggregation methods to
reorder the URLs obtained from different search engines, based on the rankings
of the URLs returned by those engines. Other applications of rank aggregation
on the web include spam fighting, word association and search engine
comparison.
We employ Borda's positional method of ranking for URL rank aggregation in
Diogenes. Borda's method is a positional method in that it assigns a score
corresponding to the position in which a candidate appears within each voter's
ranked list of preferences, and the candidates are sorted by their total scores
[21]. A primary advantage of positional methods is that they are
computationally very easy; they can be implemented in linear time.
Borda's method is an example of a positional voting method, which assigns P_j
points to a voter's jth-ranked candidate, j = 1, ..., N, and then determines
the ranking of the candidates by evaluating the total number of points assigned
to each of them. Given k lists l_1, l_2, ..., l_k, for each candidate C_j in
list l_i we assign the score

    S_i(C_j) = |{ C_p : l_i(C_p) > l_i(C_j) }|

The candidates are then sorted in decreasing order of the total Borda score

    S(C_j) = \sum_{i=1}^{k} S_i(C_j)
Example:
Let,
l1 - URL list obtained by search engine 1
l2 - URL list obtained by search engine 2
a,b,c,d,e - represent URLs returned by text search engines
Given lists l1 = [c,d,b,a,e] and l2 = [b,d,e,c,a],
S1(a)=|e|=1, as l1(e)=5 > l1(a)=4.
Similarly,
S1(b)=|a,e|=2, as l1(e)=5 > l1(b)=3 and l1(a)=4 > l1(b)=3.
Proceeding this way, we get
S(a) = S1(a)+S2(a) = 1+0 = 1,
S(b) = S1(b)+S2(b) = 2+4 = 6,
S(c) = S1(c)+S2(c) = 4+1 = 5,
S(d) = S1(d)+S2(d) = 3+3 = 6,
S(e) = S1(e)+S2(e) = 0+2 = 2.
Now, sorting the elements based on their total scores, we get the combined ranking as
b = d > c > e > a. The '=' symbol indicates a tie.
The URLs are ordered in decreasing order of their total Borda scores; duplicate
URLs are eliminated during this process.
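The scoring above can be sketched directly in code. This sketch assumes, as in the worked example, that every list ranks the same set of candidates, so S_i(C_j) equals the number of candidates below C_j in list l_i; the class name is ours, not part of Diogenes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Borda scoring: for each list, a candidate at position pos (0-based) in a
// list of n candidates has n - 1 - pos candidates ranked below it, which is
// exactly S_i(C_j); total scores sum over all lists.
public class Borda {

    public static Map<String, Integer> totalScores(List<List<String>> lists) {
        Map<String, Integer> total = new HashMap<>();
        for (List<String> list : lists) {
            int n = list.size();
            for (int pos = 0; pos < n; pos++) {
                total.merge(list.get(pos), n - 1 - pos, Integer::sum);
            }
        }
        return total;
    }
}
```

Running this on l1 = [c,d,b,a,e] and l2 = [b,d,e,c,a] reproduces the totals of the worked example (a=1, b=6, c=5, d=6, e=2), giving the combined ranking b = d > c > e > a.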
4.3.4 Worker Agent
The worker agent is the main multi-threaded query processing module; it
actually downloads and analyzes the web pages. It is a multithreaded Java
application which accepts and processes the URLs sent by the URL server. Worker
agents reside on multiple computers across the network. Each agent combines
query processing and crawling, which makes good use of memory and CPU power:
since crawling has a high CPU but low memory requirement, combining the two
increases the efficient use of system resources. The components of a worker
agent are shown in Figure 4.4.
[Figure: the multi-threaded query processor draws URLs 1-6 from the URL list;
Threads 1-3 each run a Diogenes agent (get URL, produce output) and send their
index information, via multiple threads, to the index server to update the
index.]
Figure 4.4: Worker Agent Architecture
The worker agent creates multiple threads which pick up URLs from a global list
of URLs. Each thread runs a Diogenes module with a URL as the argument. The
output of this module (relevance, image path, etc.) is collected and passed to
the "index server" for updating the index database.
The worker agent implements a "dynamic data distribution" model for assigning
URLs to threads, using a global array of URLs. After finishing with a URL, each
thread picks up the next available URL from the global list, so any thread can
process any number of URLs; this results in very fast query processing.
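The dynamic data distribution can be sketched with a shared atomic cursor into the global URL array. This is an illustrative sketch, not the actual worker-agent code: the class name is ours, and the body of the loop stands in for the Diogenes module.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Threads share one cursor into the global URL list; each thread grabs the
// next unprocessed URL as soon as it finishes the previous one, so faster
// threads naturally take on more URLs.
public class DynamicDistribution {

    public static int processAll(List<String> urls, int numThreads)
            throws InterruptedException {
        AtomicInteger next = new AtomicInteger(0);       // shared cursor
        AtomicInteger processed = new AtomicInteger(0);
        Runnable worker = () -> {
            int i;
            while ((i = next.getAndIncrement()) < urls.size()) {
                urls.get(i);                 // would be handed to the Diogenes module
                processed.incrementAndGet();
            }
        };
        Thread[] threads = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            threads[t] = new Thread(worker);
            threads[t].start();
        }
        for (Thread t : threads) t.join();
        return processed.get();
    }
}
```

Because `getAndIncrement()` is atomic, no two threads ever receive the same index, so no lock around the URL list itself is needed.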
4.3.5 Index Server
The main task of the index server is to accept the results reported by the
different worker agents and to update an index database. The worker agent
threads call the index server and pass it their results; the index server
handles these calls in a synchronized manner, so that none of the results are
lost.
The component diagram of the index server is shown in Figure 4.5. The index
server accepts the following information from the worker agents: query name,
agent (machine) name, path to the image, URL and relevance.
The index server opens a file named after the query (the person name) and
appends this data to it. The index database is a collection of files named
after the queries. Each file contains the following information, appended
record by record.
55
Table 4.1: Contents of an Index file

Query Name        - Person name
                    (e.g., Bill Clinton)
Worker Agent Name - Machine name which processed the URL
                    (e.g., Descartes.uta.edu)
Image Path        - Absolute path of the image on the worker agent machine
                    (e.g., /home/httpd/html/Diogenes/dirspace/128222303/bill.gif)
URL               - URL containing the relevant image
                    (e.g., www.whitehouse.gov)
Relevance         - Associated relevance of the image
                    (e.g., 0.039)
All of the above information is used when retrieving the images for processed
queries.
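The index server's append step can be sketched as follows. This is a minimal sketch, not the actual Diogenes source: the class name, the tab-separated record layout and the one-file-per-query directory convention are our assumptions, following Table 4.1 and the synchronized-call requirement described above.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// One file per query name; the synchronized method serializes concurrent
// calls from worker-agent threads so no reported result is lost.
public class IndexServerSketch {

    private final File dbDir;

    public IndexServerSketch(File dbDir) { this.dbDir = dbDir; }

    public synchronized void updateIndex(String queryName, String agentName,
            String imagePath, String url, double relevance) throws IOException {
        File entry = new File(dbDir, queryName);
        try (PrintWriter out = new PrintWriter(new FileWriter(entry, true))) {
            // Fields follow Table 4.1: query, agent, image path, URL, relevance.
            out.println(queryName + "\t" + agentName + "\t" + imagePath
                    + "\t" + url + "\t" + relevance);
        }
    }
}
```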
4.3.6 Index Database
The index database is a collection of files named after person names. Each of
these files contains index information about the relevant images belonging to
that person; the content of such a file was described in the previous section.
The structure of the index database is shown in Figure 4.5.
[Figure: the index database is a collection of files named after queries (Bill
Gates, Billy Bob, Bill Clinton, Carl Lewis, Hillary Clinton, ...). The "Bill
Clinton" file contains records such as:
Bill Clinton  Machine4.uta.edu   /home/db/billphoto.jpg    www.clinton.nara.gov               0.033
Bill Clinton  Machine3.uta.edu   /home/db/clinton.gif      www.usapresidents.com              0.055
Bill Clinton  Csl.uta.edu        /home/db/billclinton.jpg  www.clintonpresidentialcenter.com  0.078
Bill Clinton  Descartes.uta.edu  /home/db/bill.jpg         www.whitehouse.gov                 0.092]
Figure 4.5 Structure of an Index Database
4.4 Other implementation issues
4.4.1 Concurrency Control and Thread Safety
Concurrent execution requires that threads be mutually exclusive where they
share data, so that no data loss occurs. The initial architecture of Diogenes
required minimal inter-process communication, and the number of shared
variables was very small. With proper planning we synchronized the use of
threads, making them thread-safe.
4.4.2 Shared Variable Access
Shared variables pose a threat of data loss and, when not handled carefully,
can cause many problems. We identified the shared variables and resources and
used "locks" to prevent data corruption.
4.4.3 Parallel & Distributed Computation
We have chosen Java 2 for performing most of the parallel and distributed
computation. It is a well-established, secure, and scalable development
environment equipped with many features tailored to the web. In particular,
instead of explicitly implementing a dedicated network protocol for inter-agent
communication, we adopt Remote Method Invocation (RMI), a technology which
enables us to create distributed applications in which the methods of remote
Java objects can be invoked from other Java virtual machines (residing on
different hosts), using object serialization to implicitly marshal and
unmarshal parameters.
Java 2 also provides a convenient way of creating and handling lightweight
processes (threads). The built-in synchronization features of Java greatly
reduce program complexity while keeping the coding flexible.
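The RMI approach described above can be illustrated with a skeletal remote interface. This is an illustrative sketch, not the actual Diogenes source: the interface and record names are ours, and the registry setup and stub lookup that a running deployment needs are omitted.

```java
import java.io.Serializable;
import java.rmi.Remote;
import java.rmi.RemoteException;

// A remote interface extends Remote and every method declares
// RemoteException; a worker-agent thread could invoke updateIndex()
// on the index server's stub as if it were a local call.
interface RemoteIndexServer extends Remote {
    void updateIndex(IndexRecord record) throws RemoteException;
}

// Arguments must implement Serializable so RMI can marshal and
// unmarshal them implicitly, as described in the text.
class IndexRecord implements Serializable {
    final String queryName, agentName, imagePath, url;
    final double relevance;

    IndexRecord(String queryName, String agentName, String imagePath,
                String url, double relevance) {
        this.queryName = queryName;
        this.agentName = agentName;
        this.imagePath = imagePath;
        this.url = url;
        this.relevance = relevance;
    }
}
```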
4.4.4 Simultaneous User Queries
Handling simultaneous user queries is a tricky business: care must be taken
when naming and creating the files and directories used to process each query.
We handle this problem by taking the time of the query (in milliseconds) and
associating it with the process ID for that query. Each time a query is
submitted, the system creates a new process to carry out the search operation.
The process ID for the query is unique, and combining it with the time yields
unique file and directory names. In addition, the thread ID associated with
each thread is used to obtain uniqueness when creating files.
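The naming scheme above can be sketched in a few lines. This is a sketch of the idea, not the actual implementation: the class and method names are ours, and `ProcessHandle` is used here simply as a modern way to obtain the process ID.

```java
// Combine the query time (milliseconds), the process ID and the thread ID
// to build a file or directory name that is unique across simultaneous
// queries, as described in the text.
public class UniqueNames {

    public static String uniqueDirName() {
        long timeMs = System.currentTimeMillis();        // time of query
        long pid = ProcessHandle.current().pid();        // per-query process ID
        long threadId = Thread.currentThread().getId();  // per-thread ID
        return timeMs + "_" + pid + "_" + threadId;
    }
}
```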
CHAPTER 5
Performance Evaluation
In this chapter, we describe the experimental evaluation of the new distributed
architecture and compare the results obtained by the distributed Diogenes with the
previous architecture (single threaded) of Diogenes.
5.1 Relevance Analysis
First, we calculate the "precision" obtained by Diogenes, defined as

    Precision = (Number of relevant images present) / (Total number of images retrieved)

A set of experimental retrievals was performed, and the search engines were
compared in terms of average precision in answering people-image queries. Only
the top 50 images retrieved by each search engine were considered for this
evaluation.
Table 5.1: Search Engine Precision Comparison

Query              Diogenes  Google  AltaVista
Bill Gates         0.90      0.58    0.78
Abraham Lincoln    0.96      0.62    0.86
Hillary Clinton    0.94      0.72    0.88
Dalai Lama         0.96      0.80    0.90
Average Precision  0.94      0.68    0.84
Table 5.1 shows the results of the precision evaluation. The precision of
Diogenes is better in every case, and its average precision is very high
compared to the other search engines.
We compare the results obtained by Diogenes with those obtained by Google, one
of the most popular image search engines on the web, checking the relevance of
the retrieved images by manual examination. The following snapshots show the
test results for person-name queries on Diogenes and Google.
Test 1: Results for query: HILLARY CLINTON
Figure 5.1 Diogenes results for query: HILLARY CLINTON
Figure 5.2 Google results for query: HILLARY CLINTON
Analysis:
Most of the images retrieved for the query "Hillary Clinton" are in fact images
of Hillary Clinton: the top 16 images returned by Diogenes are 100% accurate.
While the images returned by Google also contain images of Hillary Clinton, its
top 16 results contain at least 2 non-relevant, non-facial images. Diogenes
clearly provides better results in this case.
Test 2: Search results for query: ABRAHAM LINCOLN
Figure 5.3 Diogenes results for query: ABRAHAM LINCOLN
Figure 5.4 Google results for query: ABRAHAM LINCOLN
Analysis:
For the query "Abraham Lincoln", Google did not achieve very high accuracy,
while Diogenes achieved 100% accuracy. Google retrieved images of the USS
Lincoln, which do not correspond to the person-name query; Diogenes again
outperformed Google in terms of relevance.
Overall, Diogenes retrieved more relevant results for person-name queries than
Google, one of the most popular image search engines.
5.2 Performance Analysis: Initial Architecture
The following chart shows the performance of Diogenes with the initial,
single-threaded architecture. The initial architecture took nearly 30000
seconds (~8 hours) to process 2000 URLs. Nearly 70% of the total execution time
was spent waiting for web pages to download.
[Chart: Diogenes Initial Architecture Performance — time in seconds (0-35000)
versus number of URLs (0-2000), plotting total execution time, total wait time
and total CPU time.]
5.3 Performance Analysis: Multi-Threaded Architecture on a Single Agent
After employing the multi-threaded architecture on a single agent, the wait
time was reduced to essentially zero because of the parallel processing of the
threads; as desired, the communication latency is now almost zero. The total
CPU time was also almost halved compared to the initial architecture, which can
be attributed to the CPU being used to its fullest potential under the
multi-threaded architecture.
[Chart: Diogenes — Impact of Multi-Threading on Execution Time — time in
seconds (0-35000) versus number of URLs (0-2000), comparing search time using
multi-threading with single-threaded processing.]
5.4 Performance Analysis: Distributed Architecture with 4 Agents
As the chart shows, the total time taken to process different numbers of URLs
is drastically reduced when the workload is distributed among 4 agents. We
carried out this test with 4 agents, but the architecture can easily be scaled
to a higher number of agents. We can now obtain an initial response time of a
few seconds, which was our main goal. The initial response time is the time
taken by Diogenes to retrieve the first relevant image; this is on par with
many large-scale search engines employing thousands of computers for query
processing. Of course, in this case there is a single user, but the system can
easily be scaled to serve multiple users with similar response times.
[Chart: Diogenes — Impact of Distributed Architecture — time in seconds
(0-1600) versus number of URLs (0-2000), comparing the time taken by 4 worker
agents with the time taken by 2 worker agents.]
5.5 Conclusion
In this work, we have designed and developed a parallel and distributed
architecture for the Diogenes image search agent. The Diogenes meta-search
agent is now capable of handling multiple user queries and is scalable, and the
new architecture makes good use of system resources. The new system drastically
reduces Diogenes' search time, and its initial response time is nearly on par
with similar image search engines.
The open distributed search architecture reduces the centralized storage
requirement, and its modular design enables different components to be hosted
on different machines. With the precision of the initial Diogenes architecture
and the speed of the new distributed architecture, Diogenes promises to be a
popular image search engine.
One observed limitation of the new architecture is that it has a single point
of failure: the index server. If the index server, which also holds the index
database, fails, the entire search operation comes to a halt; and if the index
server becomes a bottleneck, the performance of Diogenes suffers. Also, if any
of the worker agents fails, all the URLs that were assigned to it but not yet
processed are lost, along with any relevant images those URLs might have
contained. Furthermore, if a worker agent crashes, the thumbnail images stored
on that machine become unavailable to the client browser; that part of the
distributed storage is lost.
5.6 Future Work
The present Diogenes is designed to retrieve only person images (facial
images): the face detection module employed in this work can detect only face
objects in the input image. New object detection modules could be incorporated
into Diogenes to detect other kinds of objects (cars, mountains, buildings,
etc.), which would greatly increase its usefulness.
A new design that makes use of a "Distributed Index Server" would remove the
single point of failure and reduce the bottleneck on any one machine, vastly
improving the fault tolerance of Diogenes.
A replication scheme could also be designed to replicate the images stored on
the different worker agents, so that in the case of a worker agent crash, all
the images contained on that agent remain available for retrieval.
REFERENCES
[1] http://searchengineshowdown.com
[2] Y. Alp Aslandogan, Clement T. Yu. Diogenes: A Web Search Agent for
    Content-Based Indexing of Personal Images.
[3] Y. Alp Aslandogan, Clement T. Yu. Multiple Evidence Combination in Image
    Retrieval: Diogenes Searches for People on the Web. In Proceedings of ACM
    SIGIR 2000, Athens, Greece, July 2000.
[4] Y. Alp Aslandogan, Clement T. Yu. Evaluating Strategies and Systems for
    Indexing Person Images on the Web. ACM Multimedia 2000.
[5] http://searchenginewatch.com/links/
[6] Search Engine History.
[7] Web Crawler Review: http://dev.funnelback.com/crawler-review.html
[8] http://www.tripadvisor.com/
[9] www.metacrawler.com
[10] Remco C. Veltkamp, Mirela Tanase. Content-Based Image Retrieval Systems:
     A Survey. Department of Computing Science, Utrecht University.
[11] John R. Smith, Shih-Fu Chang. Searching for Images and Videos on the
     World-Wide Web.
[12] http://www.altavista.com/image/default
[13] Michael J. Swain, Charles Frankel, Vassilis Athitsos. WebSeer: An Image
     Search Engine for the World Wide Web.
[14] Eric Brill. Some Advances in Transformation-Based Part of Speech Tagging.
     In Proceedings of the Twelfth National Conference on Artificial
     Intelligence, pages 722-727, 1994.
[15] Henry A. Rowley, Shumeet Baluja, Takeo Kanade. Neural Network-Based Face
     Detection. IEEE Transactions on Pattern Analysis and Machine
     Intelligence, 20(1):23-38, Jan 1998.
[16] Sergey Brin, Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web
     Search Engine. Computer Science Department, Stanford University.
[17] www.google.com
[18] Bernard K. Gunther. Multithreading with Distributed Functional Units.
     IEEE.
[19] Zhihong Lu. Scalable Distributed Architecture for Information Retrieval.
     University of Massachusetts, Amherst.
[20] Cynthia Dwork, Ravi Kumar, Moni Naor, D. Sivakumar. Rank Aggregation
     Methods for the Web. Compaq Systems Research Center, Palo Alto, CA, USA.
[21] Zachary F. Lansdowne. Ordinal Ranking Methods for Multicriterion Decision
     Making. Economic and Decision Analysis Center.
[22] Torsten Suel, Chandan Mathur, Jo-Wen Wu, Jiangong Zhang, Alex Delis,
     Mehdi Kharrazi, Xiaohui Long, Kulesh Shanmugasundaram. Towards an Open
     and Highly Distributed Web Information Retrieval Architecture. CIS
     Department, Polytechnic University, Brooklyn, NY 11201.
[23] Brandon Cahoon, Kathryn S. McKinley. Performance Evaluation of a
     Distributed Architecture for Information Retrieval. Dept. of CS, UMass,
     Amherst.
[24] OASIS Distributed Search Engine. An Insuma GmbH White Paper.
BIOGRAPHICAL INFORMATION
Ravishankar Mysore received his Bachelor of Engineering in Industrial
Engineering from the University of Mysore, Mysore, India in 1999. After
obtaining his Bachelor's degree, he worked for Tata Consultancy Services,
India, as a Software Engineer from October 1999 to July 2000. He then pursued
his Master of Science in Computer Science & Engineering at The University of
Texas at Arlington, receiving the degree in May 2003.