Department of Computer Science and Engineering University of Texas at Arlington
Arlington, TX 76019
DIOGENES: A Distributed Search Agent
Ravishankar P. Mysore, Y. Alp Aslandogan {rmysore,alp}@cse.uta.edu
Technical Report CSE-2003-24
This report was also submitted as an M.S. thesis
DIOGENES: A DISTRIBUTED IMAGE SEARCH AGENT
by
RAVISHANKAR PUTTAIAH MYSORE
Presented to the Faculty of the Graduate School of
The University of Texas at Arlington in Partial Fulfillment
of the Requirements
for the Degree of
Master of Science in Computer Science and Engineering
THE UNIVERSITY OF TEXAS AT ARLINGTON
May 2003
ACKNOWLEDGMENTS
First and foremost, I would like to express my sincere thanks to my advisor,
Dr. Alp Aslandogan for his contribution in the development of my research and his
guidance in my professional growth. His strong encouragement and broad vision, made
this work possible.
My parents have been a constant source of inspiration throughout my study and
their prayers have brought me till here. My sister and my brother-in-law, who always
believed in me, provided moral support when ever I needed. My nephew Chinnu’s
bright smile and curious voice over the phone kept me going even during long, sleepless
nights.
I would like to thank all the members of Infolab group who provided a friendly,
humorous and stimulating research environment. Also, I would like to thank my friends
Sriram and Swaroop for their all-round support during my entire master’s education.
This acknowledgement is incomplete without mentioning the support of my friends,
well wishers who helped me throughout my work. Thank you all.
April 1, 2003
ABSTRACT
DIOGENES: A DISTRIBUTED IMAGE SEARCH AGENT
Publication No. ______
Ravishankar Puttaiah Mysore, M.S.
The University of Texas at Arlington, 2003
Supervising Professor: Alp Aslandogan
In this work, we have proposed and implemented a new distributed multi-threaded architecture for reducing the response time of Diogenes, a meta-search agent for finding people images on the web.

Initially, Diogenes was designed as a single-threaded query processing system which ran on a single machine. The new architecture uses multiple machines across the network for efficient crawling and query processing and creates a centralized index to serve user queries. The new architecture is highly scalable and employs the distributed storage concept. We have also implemented a new URL ordering scheme to obtain faster results and a new web interface to serve multiple users simultaneously.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
ABSTRACT
LIST OF ILLUSTRATIONS
LIST OF TABLES
Chapter
1. INTRODUCTION
   1.1 Web based image retrieval
   1.2 Diogenes – Web based image retrieval agent
   1.3 Research Summary
   1.4 Organization of Thesis
2. BACKGROUND: SEARCH ENGINE CONCEPTS
   2.1 Search Tools
      2.1.1 General Purpose Search Engines
      2.1.2 Specialized Domain Search Engines
      2.1.3 Meta Search Engines
      2.1.4 Directories
   2.2 Multimedia Search Engines
   2.3 Search Engine Architectures
      2.3.1 Centralized Search Architecture
      2.3.2 Distributed Search Architecture
3. DIOGENES – INITIAL ARCHITECTURE
   3.1 Overview of Diogenes
      3.1.1 Web Crawler
      3.1.2 Text/HTML Analysis Module
      3.1.3 Image Analysis Module
      3.1.4 Evidence Combination
      3.1.5 Indexer
   3.2 Result Presentation
   3.3 Automatic Feedback
   3.4 Advantages of Diogenes
   3.5 Drawbacks of Diogenes
4. DIOGENES – PROPOSED DISTRIBUTED ARCHITECTURE
   4.1 Design Assumptions, Requirements, and Goals
   4.2 Proposed Architecture – Distributed Multi-Threaded Query Processing System
   4.3 Components of Proposed Architecture
      4.3.1 Web Interface (Web Server) Module
      4.3.2 URL Server
      4.3.3 Borda’s Positional Method of Ranking
      4.3.4 Worker Agent
      4.3.5 Index Server
      4.3.6 Index Database
   4.4 Other Implementation Issues
      4.4.1 Concurrency Control and Thread Safety
      4.4.2 Shared Variable Access
      4.4.3 Parallel & Distributed Computation
      4.4.4 Simultaneous User Queries
5. PERFORMANCE EVALUATION
   5.1 Relevance Analysis
   5.2 Performance Analysis: Initial Architecture
   5.3 Performance Analysis: Multi-Threaded Architecture on a Single Agent
   5.4 Performance Analysis: Distributed Architecture with 4 Agents
   5.5 Conclusion
   5.6 Future Work
REFERENCES
BIOGRAPHICAL INFORMATION
LIST OF ILLUSTRATIONS
Figure
2.1 Internet host numbers from 1989 to 2002
2.2 Working of a typical Web Crawler
2.3 Meta-search engine architecture
2.4 Centralized Search Architecture
2.5 Google’s Query Processing Architecture
2.6 Napster Distributed Architecture
3.1 Working Model of Diogenes
4.1 Distributed Multi-Threaded Query Processing Architecture
4.2 Web Interface Module of Distributed Architecture
4.3 URL Server Architecture
4.4 Worker Agent Architecture
4.5 Structure of an Index Database
5.1 Diogenes results for query: HILLARY CLINTON
5.2 Google results for query: HILLARY CLINTON
5.3 Diogenes results for query: ABRAHAM LINCOLN
5.4 Google results for query: ABRAHAM LINCOLN
LIST OF TABLES
Table
4.1 Contents of an Index file
5.1 Search Engine Precision comparison
CHAPTER I
INTRODUCTION
1.1 Web based image retrieval
It is estimated that, as of 2003, the Web contains more than 4 billion indexable pages and nearly as many images [1]. The World Wide Web has become a major information publishing and retrieval mechanism on the Internet. In principle, an immense amount of information is now accessible at our fingertips.
However, the major challenge is to efficiently locate the desired information in the massive amount of data available. A great deal of research has focused on this area, known as “Web Information Retrieval,” and has led to the development of powerful search engines and web directories that allow users to locate relevant data quickly. Multimedia data, in particular images, constitute important pieces of information on the Web, and image retrieval has been one of the key aspects of web based search. Because of the unique properties of the WWW, image retrieval has always been a challenging task for researchers. A plethora of smaller specialized search engines and directories, meta-search engines, image search engines and browsing assistants exists, many of which are built as meta tools using the data supplied by the major search engines.

Web based image retrieval is the process of extracting and indexing relevant image and multimedia content from the web using efficient image searching and indexing techniques. A key feature of image content on the web is that it is described by text in HTML documents as well as by the content of the image itself. An effective image retrieval system should make use of both text and image data, by integrating text-based and content-based image retrieval techniques.
A number of web image search engines have been built in recent years including
both research prototypes and commercial ones. Among the former category are
WebSeer, WebSEEk, ImageScape, Amore (http://www.ccrl.com/amore), WebHunter,
ImageRover and PicToSeek. Commercial web text search engines such as Google,
Lycos, AltaVista and Yahoo also offer image search facilities.
Both WebSEEk and, to a large extent, WebSeer rely on the words found in image paths and alternate texts. Commercial image search engines such as Google image search (www.google.com), Lycos (http://multimedia.lycos.com/) and AltaVista (http://www.altavista.com/image/default) do not perform any image analysis, but rely solely on image paths and alternate texts. Both WebSEEk and WebSeer organize their images into conceptual categories; a user interested in people images is directed to the people category. WebSEEk apparently uses only textual information for this conceptual categorization. WebSeer goes one step further in image analysis by integrating a face detector. Consequently, its accuracy on people queries is much better than WebSEEk’s.
1.2 Diogenes – A Web based image retrieval agent
Diogenes is an automated web-based image retrieval agent developed by Aslandogan et al. [2]. This web search agent is designed specifically to retrieve facial images from the web using efficient evidence combination algorithms, face detection software and other approaches unique to web-based image retrieval.

Diogenes relies on both textual and image evidence for indexing images on the web. A face detection module examines images on the web for faces. A
face recognition module identifies the face using a database of known person images. A text/HTML analysis module analyzes the HTML content of web pages to establish the relevance of a page to the query. Diogenes uses the Dempster-Shafer evidence combination algorithm to combine all these sources of evidence and classifies the image based on the combined evidence [3].
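The combination rule itself can be illustrated with a small sketch. This is not Diogenes’ actual implementation; the mass values and the two-hypothesis frame ({person, other}) are hypothetical, chosen only to show how evidence from, say, the text analysis and face detection modules could be fused with Dempster’s rule:

```python
def combine_dempster(m1, m2):
    """Combine two mass functions (dicts mapping frozenset hypotheses
    to masses) with Dempster's rule of combination."""
    combined = {}
    conflict = 0.0
    for h1, v1 in m1.items():
        for h2, v2 in m2.items():
            inter = h1 & h2
            if inter:
                combined[inter] = combined.get(inter, 0.0) + v1 * v2
            else:
                conflict += v1 * v2  # mass assigned to incompatible hypotheses
    if conflict >= 1.0:
        raise ValueError("total conflict: sources cannot be combined")
    norm = 1.0 - conflict
    return {h: v / norm for h, v in combined.items()}

# Frame of discernment: does the image show the queried person?
PERSON = frozenset(["person"])
OTHER = frozenset(["other"])
EITHER = PERSON | OTHER  # mass on the whole frame represents ignorance

# Hypothetical masses from text analysis and face detection.
m_text = {PERSON: 0.6, OTHER: 0.1, EITHER: 0.3}
m_face = {PERSON: 0.7, OTHER: 0.2, EITHER: 0.1}

m = combine_dempster(m_text, m_face)
print(round(m[PERSON], 3))  # belief in "person" rises above either source alone
```

Note how the combined mass on the person hypothesis exceeds what either source assigned on its own; this mutual reinforcement is what makes combined textual and visual evidence attractive.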
When compared to similar search engines, Diogenes achieves very high precision for the same query. This can be attributed to the fact that Diogenes makes use of both the reference text and the full text of web pages, as well as information about the image content. Diogenes uses the image path, the alternate text and the full text of the web pages [4]; these assist in establishing relevance more accurately. Even though other search engines (WebSeer, WebSEEk) analyze the content of the image, they do so for a different purpose; Diogenes’ combination of textual and visual analysis results in better precision. Since it makes use of face recognition techniques to identify person images on the web, it eliminates non-person images, greatly improving precision. It does not require a big database for storing all the images, thereby avoiding the huge overhead of large databases.
Even though Diogenes yielded more relevant results than comparable image search engines, the total search time was very high (many hours), beyond an online user’s tolerance for waiting for a response. The initial architecture of Diogenes had drawbacks related to search time and scalability. When we analyzed the performance of Diogenes, we found that nearly 70% of the total search time was spent waiting for web pages to be downloaded from remote web servers. This can be attributed to the slow response of remote web sites, network congestion or the remote site being down. This overhead, also called communication latency, is huge when compared to the total search time, and it grew rapidly with the number of requests sent by the crawler to download web pages. Our main goal was to reduce this overhead. Also, the initial architecture of Diogenes was a single-user system: it could process only one query at any given time and lacked the ability to process multiple user queries simultaneously. This architecture could run only on one system and was not scalable.
1.3 Research Summary
In this work, we address the response time and scalability issues associated with the Diogenes image search agent. The Internet has opened up distributed computation to the world: just about any computer may be invited to participate in a given task. The latest developments in cluster computing and P2P networks have paved the way for achieving supercomputing power using large numbers of regular desktop machines, and advances in Symmetric Multi-Processing (SMP) have enabled efficient use of system resources through parallel computing.

All of the above factors inspired us to propose and implement a “distributed multi-threaded” query processing architecture for Diogenes to overcome its drawbacks related to response time and scalability.
Some of the ideas employed in our work are:
• Exploiting the “data parallel” benefits associated with web-search process to
reduce the communication latency. Using multiple threads to simultaneously
download and process the web pages reduces the total response time.
• Multi-threaded query processing efficiently uses the “system resources” to
their fullest potential.
• Distributed search process facilitates the use of idle CPU cycles of freely
available computers over the network. This also makes use of available
network bandwidth.
• Distributed storage concept reduces the need for huge storage capacity at a
centralized location.
• A new URL ordering scheme makes it possible to obtain relevant results earlier, thus reducing the initial response time.
• A new web interface to allow multiple users to simultaneously use Diogenes
to query the web for person images.
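The first of these ideas, overlapping page downloads with multiple threads, can be sketched as follows. The fetch function below only simulates network latency and the URLs are placeholders; a real crawler would issue HTTP requests. The timing comparison shows why threading hides communication latency: while one thread waits on a slow remote server, the others make progress.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Stand-in for an HTTP download; a real crawler would use
    urllib.request or an HTTP client library here."""
    time.sleep(0.2)  # simulate waiting on the remote web server
    return url, "<html>...page body...</html>"

urls = ["http://example.org/page%d" % i for i in range(8)]

# Sequential crawling: total time is roughly the sum of all latencies.
start = time.time()
for u in urls:
    fetch(u)
sequential = time.time() - start

# Multi-threaded crawling: downloads overlap while threads wait on I/O.
start = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(fetch, urls))
threaded = time.time() - start

print("sequential %.2fs, threaded %.2fs" % (sequential, threaded))
```

With eight threads the simulated crawl finishes in roughly one latency period instead of eight, which is exactly the “data parallel” benefit exploited above.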
1.4 Organization of the Thesis
Chapter 2 of this document gives a brief introduction to search engine concepts: multimedia search engines, distributed search engines, and so on. We also discuss related work in that chapter. We take a look at the initial architecture of Diogenes in Chapter 3. The proposed architecture is discussed in Chapter 4. Finally, in Chapter 5, we discuss the performance of the new architecture compared to the earlier one, and we conclude by noting the scope for future work.
CHAPTER 2
BACKGROUND: SEARCH ENGINE CONCEPTS
It is estimated that, as of 2003, the Web contains more than 4 billion indexable pages and nearly half as many images [1]. Almost 3 million pages, or 59 gigabytes of text, are added daily, and the average life span of a web page is about 44 days [5]. The number of Internet hosts has increased exponentially; Figure 2.1 shows the growth of Internet hosts over the last decade. Thus the World Wide Web has become a major information publishing and retrieval mechanism on the Internet.
In principle, an immense amount of information is now accessible at our fingertips. However, the major challenge is to efficiently locate the desired information in the massive amount of data available. A great deal of research has focused on this area, known as Information Retrieval, and has led to the development of powerful search engines and web directories that allow users to locate relevant data quickly. In addition, there is a plethora of smaller specialized search engines, directories, and personal search and browsing assistants, many of which are built as meta tools using the data supplied by the major search engines.
Information retrieval is the process of identifying and retrieving relevant documents based on a user’s query. The document representation provides a formal description of the information contained in the documents; the query representation provides a formal description of the user’s information need; and the similarity measure defines the rules and procedures for matching the user’s requirement against candidate documents.
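As a minimal, illustrative instance of such a similarity measure (not the one any particular engine uses), documents and queries can be represented as term-frequency vectors and compared with the cosine measure; the sample documents and query below are invented:

```python
import math
from collections import Counter

def cosine_similarity(doc, query):
    """Score a document against a query by comparing their
    term-frequency vectors with the cosine measure."""
    d, q = Counter(doc.lower().split()), Counter(query.lower().split())
    dot = sum(d[t] * q[t] for t in q)          # shared-term weight
    norm = math.sqrt(sum(v * v for v in d.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

docs = ["images of abraham lincoln on the web",
        "distributed search engine architecture"]
query = "abraham lincoln images"

# Rank documents by similarity to the query, most relevant first.
ranked = sorted(docs, key=lambda d: cosine_similarity(d, query), reverse=True)
print(ranked[0])
```

Real systems refine the weighting (e.g. TF-IDF) and the matching, but the three roles named above (document representation, query representation, similarity measure) are all visible even in this toy version.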
Figure 2.1 Internet host numbers from 1989 to 2002 (data from Internet Software Consortium)
In 1990, Alan Emtage, a student at McGill University, created “Archie,” the first search tool, which used anonymous FTP servers to build a repository of Internet files. MIT student Matthew Gray created the World Wide Web Wanderer in June of 1993, the earliest widely acclaimed Web robot. During the spring of 1995, scientists at Digital Equipment Corporation’s research lab in Palo Alto, CA, devised a way to store every word of every page on the entire Internet in a fast, searchable index. This led to the development of the first searchable, full-text database on the World Wide Web. In 1996, Daniel Dreilinger at Colorado State University introduced SavvySearch, one of the first meta search engines [6].
As the web continues to grow, the major general-purpose search engines have faced serious problems. They are unable to index all the documents on the web because of the rapid growth in the amount of data and the number of publicly available documents. Their results may be out of date, and they do not index documents behind authentication requirements or search forms. As more people share their information with others, the need for better search services to locate interesting information becomes increasingly important. Let us analyze some of the different search tools available to us.
2.1 Search Tools
Search tools are sites on the Internet that enable users to locate desired information. Success with searches depends upon the kind of tool being used and the strategies employed for efficient searching. There are different kinds of search tools available; some of them are discussed below.
2.1.1 General Purpose Search Engines
General purpose search engines such as Google, AltaVista, and Yahoo are used for general queries and do not specialize in any particular kind of web data. These search engines usually index all kinds of web pages and are not limited to a particular domain. They have many advantages: they possess simple query interfaces (usually phrase based, with Boolean operators), and results are presented in a structured format, typically including the match’s title, a text summary and possibly an associated score. Their internal indexes are relatively current and are usually updated every 60-90 days, even though not every page is updated at the same frequency.
These general purpose search engines employ powerful crawlers to crawl the web. A web crawler is a program that automatically traverses the web by downloading documents and following links from page to page. Crawlers are mainly used by web search engines to gather data for indexing. Other possible applications include page validation, structural analysis and visualization, update notification, mirroring, and personal web assistants/agents. Web crawlers are also known as spiders, robots, or worms [7]. These crawlers are usually multi-threaded applications capable of downloading tens of pages simultaneously, given the required bandwidth.
A typical web crawling process is shown in Figure 2.2. A URL server maintains a queue of URLs to be processed and forwards them to multiple crawler processes. Each crawler process runs on a different machine, is single-threaded, and uses asynchronous I/O to fetch data from up to 300 web servers in parallel. Each crawler first checks whether crawling robots are allowed on the remote web server. If so, it downloads the page and saves it to disk. It then processes the web page by removing the HTML tags and indexing the words, phrases and images, extracts the URLs present in the page, and updates the URL queue. At this stage, the web crawler can be designed to perform additional processing on these pages.
Figure 2.2: Working of a typical Web Crawler
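The loop described above (pop a URL, check robots permission, download, extract links, grow the queue) can be sketched roughly as follows. The robots check is abstracted into a callback (a real crawler would consult robots.txt, for instance via urllib.robotparser), and the fetch returns a canned page so the example runs offline; everything here is illustrative rather than the code of any particular engine.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href targets so the crawler can grow its URL queue."""
    def __init__(self, base):
        super().__init__()
        self.base, self.links = base, []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base, value))

def crawl_step(queue, fetch, robots_ok, seen):
    """One iteration of the crawl loop: pop a URL, check robots
    permission, download, extract links, update the queue."""
    url = queue.popleft()
    if url in seen or not robots_ok(url):
        return None
    seen.add(url)
    page = fetch(url)          # download (a real crawler also saves to disk)
    parser = LinkExtractor(url)
    parser.feed(page)
    for link in parser.links:  # extend the frontier with newly found URLs
        if link not in seen:
            queue.append(link)
    return page

# Offline demo with a canned page instead of a live HTTP fetch.
queue, seen = deque(["http://example.org/"]), set()
fetch = lambda u: '<html><a href="/a.html">A</a><a href="/b.html">B</a></html>'
page = crawl_step(queue, fetch, lambda u: True, seen)
print(list(queue))
```

A production crawler adds politeness delays, retry logic, duplicate-content detection and per-host throttling around this same skeleton.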
Some problems exist with general purpose engines despite their utility. A common one is the weakness of their query languages (they rarely allow regular expressions). As a result of this lack of expressiveness, the chance of getting bad hits in the query result greatly increases.

Another problem with general purpose search engines is the relatively slow update frequency of their internal indexes, caused by the vast number of web pages that need to be crawled.
The extensive coverage and weak query power of these engines mean that numerous matches are returned (often in the thousands). The engines attempt to score the matches and supply summaries of the pages to alleviate the user’s feeling of being swamped by data. Even so, the user usually has to perform a manual second search by scrolling through the returned matches, trying to select links which are really useful. At best this requires reading the summaries; at worst it involves downloading the full page of a result to examine it in detail, which is time-consuming and a waste of resources.

Still, these kinds of search engines are the most popular and commercially viable engines, answering millions of queries per day. With newer algorithms and refinements, the performance of these search engines keeps improving.
2.1.2 Specialized Domain Search Engines
These search engines focus on a particular domain; examples include medical search engines [4] (www.9-11.com, MedHunt.com, MediSearch), shopping search engines (www.mySimon.com, http://compare.net) and travel search engines (www.tripadvisor.com). For example, TripAdvisor, Inc. [8] provides a comprehensive travel search engine and directory that helps consumers research their travel plans via the web. TripAdvisor’s search technology filters out irrelevant search results to deliver only the most relevant and popular links associated with a given travel search term.

Advantages of specialized domain search engines include the reduction of bad hits, since topics outside the domain are not examined. This increases the chances of fetching relevant results for the user while reducing search time. Also, since the domain is limited, crawling can be more frequent, which keeps the index more current.
2.1.3 Meta Search Engines
A meta-search engine sends queries to multiple search engines and other data sources, then collates the results in some way and formats them for display. The data sources can be internal indexes, associated text search engines, database search engines, message archives, intranet or world wide web search engines, or even file servers. Meta-search engines simply read each search results page and extract the text from the HTML.

In a meta-search engine, the keywords submitted in its search box are transmitted simultaneously to several individual search engines and their databases of web pages. Within a few seconds, the results from all the search engines are obtained and presented to the user. Meta-search engines do not own a database of Web pages. Some examples are MetaCrawler [9], HuskySearch, and ProFusion.
They send the search terms to the databases maintained by other search engines. Good meta-search engines accept complex searches, integrate results well, eliminate duplicates, and offer additional features such as intelligent ranking or clustering of the search results by subject.
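One simple way to merge and rank results from several engines is a Borda-style positional count, a variant of which appears in the proposed Diogenes architecture (section 4.3.3). The sketch below is illustrative only; the URLs and engine result lists are hypothetical:

```python
def borda_merge(result_lists):
    """Merge ranked result lists with a Borda-style positional count:
    each URL earns points according to how high each engine ranks it."""
    scores = {}
    for results in result_lists:
        n = len(results)
        for pos, url in enumerate(results):
            # First place earns n points, second n-1, and so on.
            scores[url] = scores.get(url, 0) + (n - pos)
    # Sorting the score table also deduplicates: each URL appears once.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked lists returned by three source engines.
engine1 = ["u1", "u2", "u3"]
engine2 = ["u2", "u3", "u1"]
engine3 = ["u2", "u1", "u4"]
merged = borda_merge([engine1, engine2, engine3])
print(merged)  # u2 comes first: two of the three engines ranked it on top
```

Because only ranks are used, this method sidesteps the problem noted below that relevance scores from different engines cannot be compared directly.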
The working of a typical meta-search engine is shown in Figure 2.3.
Figure 2.3 Meta-search engine architecture
The main disadvantage of meta search engines is that they are largely dependent on general purpose search engines, and the accuracy of retrieval depends on the efficiency of those engines. Some of the other associated disadvantages are:
• Translation of query syntax and fields not exact
• May need to decode query forms
• Some sources don't sort by relevance
• No way to compare relevance scores
• Vulnerable to search spam
• Variable quality of search engines
• Must keep up with changes in source engines
2.1.4 Directories
Directories are topic-oriented catalogs of Internet sites. Directories are categorized by subject and are usually built by human selection, in contrast to regular search engines, wherein pages are indexed by robots or spiders. Examples include the Yahoo! Directory and the Google Directory.

Directories are organized into subject categories and web pages are classified by subject. These subjects are not standardized and vary according to the scope of each directory. Directory entries are often carefully evaluated and annotated.
One disadvantage is that users can only search what they see (titles, descriptions, subject categories, etc.), so broad or general terms work best. These directories are usually small and specialized, and sites may not always be correctly classified.
2.2 Multimedia Search Engines
The explosive growth of the World Wide Web has made a vast amount of information available to us, especially multimedia data such as images and graphics. New visual information in the form of images, graphics, animations and videos is being published on the Web at an incredible rate. Multimedia information is published both embedded in Web documents and as stand-alone objects. The visual information takes the form of images, graphics, bitmaps, animations, videos, etc. [10]
Multimedia data, in particular images, constitute important pieces of information on the Web, and image retrieval has been one of the key aspects of Web based search. Because of the unique properties of the WWW, image retrieval has always been a challenging task for researchers. The task of searching these images is difficult because the images tend to be poorly indexed. The collections are too large to be manually indexed with captions according to content, and even if they were indexed in this way, it is very unlikely that the author of a caption could make it detailed enough to anticipate all of the potential queries posed by users.
Today, advances in database technology have paved the way for database support of multimedia data types alongside the traditional ones. There are many content-based image retrieval (CBIR) systems which make use of these database capabilities to query images by content, regions and spatial layout [11]. IBM’s QBIC, Virage and Excalibur are some of the most widely used CBIR systems. These systems analyze the image and, based on a relevance rule, assign a weight to it. New search techniques for textual information have also aided developments in web based image retrieval. A wide array of techniques has been used in these image retrieval systems, ranging from systems that ask the user to draw a picture of the desired image to systems that take a simple text query and return the desired images.
Image content on the WWW is described by text in the HTML documents as well as by the image data itself. Thus an effective image retrieval system should make use of both text and image data, by integrating text-based and content-based image retrieval techniques.
A number of multimedia search engines are available on the web, including AltaVista PhotoFinder (http://www.altavista.com/image/default), Google Image search (http://www.images.google.com), WebSeer, Lycos (http://multimedia.lycos.com/) and AlltheWeb (http://multimedia.alltheweb.com/). AltaVista, Lycos and AlltheWeb can also search for audio and video files; the others offer image search capabilities only.
AltaVista [12] PhotoFinder is one such search engine. It measures the similarity of images based on visual characteristics such as dominant colors, shapes and textures. Images are tagged with textual information obtained by text analysis of the corresponding web page. The user cannot set the relative weights of these features, but judging from the results, color seems to be the predominant feature [8].

WebSeer, which was developed at the University of Chicago, makes use of a similar principle. The images collected from the Web are submitted to a number of color tests in order to separate photographs from drawings; some simple tests measure the number of different colors in the image. Keywords are extracted from the image file name, captions, hyperlinks, alternate text and HTML titles. Depending on the image and textual evidence, the images are indexed [13].
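A toy version of one such color test can be sketched as follows. The threshold and pixel data are invented for illustration (WebSeer’s actual tests are more involved [13]); the idea is simply that photographs typically contain far more distinct colors than line drawings or logos:

```python
import random

def distinct_colors(pixels):
    """Count distinct RGB triples in a flat list of pixels."""
    return len(set(pixels))

def looks_like_photograph(pixels, threshold=256):
    """Toy color test: classify an image as a photograph when it
    contains more distinct colors than the (invented) threshold."""
    return distinct_colors(pixels) > threshold

random.seed(0)
# Hypothetical pixel data: a noisy "photo" vs. a two-color "drawing".
photo = [(random.randrange(256), random.randrange(256), random.randrange(256))
         for _ in range(5000)]
drawing = [(0, 0, 0)] * 400 + [(255, 255, 255)] * 400

print(looks_like_photograph(photo), looks_like_photograph(drawing))
```

Cheap screening tests like this let an indexer discard obvious non-photographs before running expensive analysis such as face detection.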
Users searching the World Wide Web presently have a number of options available to them. Google, Excite and Yahoo! are but a few examples of useful search engines. All of these systems have been designed primarily to find text-based information on the Web, and this is the major drawback of these types of image search engines: with a few exceptions such as WebSEEk and WebSeer, they rarely analyze the content of the image. Because of the poor indexing based only on the textual information associated with the images, search engine users rarely get relevant images.
2.3 Search Engine Architectures
As the web has come to be seen as a prime source of information, internet users
have been flocking to search engine sites to retrieve all sorts of information. Nowadays,
popular search engines must answer millions of queries per day. They employ thousands
of computers and huge databases to crawl the entire web, build an index and efficiently
retrieve results for search queries. Scalability is a major issue, and it has led to the
development of improved architectures to serve these needs.
Search engine architectures vary widely. Thanks to extensive research over the
past few years, modern search engines employ thousands of computers and large
databases for efficient information retrieval. Earlier search engines such as Archie and
Gopher used fewer computational resources, since the web was small at the time. As the
web grew, the number of web pages increased, and so did the number of web users.
Today, search engines need vast computational power, huge storage space and high
network bandwidth. These requirements have led to a new breed of search engines
based on distributed architectures comprising thousands of computers.
Search engine architectures can be generally categorized as follows:
I. Centralized search architecture
II. Distributed search architecture
In simple terms, if data processing happens in numerous places, the architecture is
said to be distributed; if the vast majority of processing happens in one place, it is said
to be centralized. Each architecture has its unique strengths and weaknesses. Distributed
architectures arose partly because centralized mainframe models were deemed
inflexible and expensive. On the other hand, it is harder to manage distributed
architectures, fine-tune their performance and pinpoint trouble spots when something
goes wrong. Let us take a brief look at these architectures.
2.3.1 Centralized Search Architecture
As the name suggests, in a centralized search architecture all the essential
processing of the data is done on a central machine. The crawlers download the web
pages, which are then processed, stored and indexed in a single place. Some
characteristics of the centralized search architecture are:
• All computations (crawling, storing, indexing, ...) are done at a local site, not
distributed across machines on the network
• The robot causes heavy server load and heavy traffic
• Data is located centrally
• Low percentage of document coverage
The general form of a centralized search architecture is shown in Figure 2.4: a
crawler downloads web pages from the WWW, an indexer builds the index offline, and
a query engine serves end users online with retrieval and ranking.
Figure 2.4: Centralized Search Architecture
Some of the disadvantages are:
• Large indexes require big disks, memory, network connections
• Not fault tolerant
• Scaling up gets harder and harder
2.3.2 Distributed Search Architecture
A distributed system is “a collection of independent computers that appears to its
users as a single coherent system”. A distributed search architecture makes use of
distributed processing of data and objects across a network of connected systems.
Distributed processing allows us to harness the idle CPU cycles and storage space of
tens, hundreds or thousands of networked systems, working together on a particularly
processing-intensive problem. Increasing desktop CPU power and communication
bandwidth have also helped make distributed computing a practical idea for the
distributed search process.
Distributed search architectures can be categorized as
• Loosely coupled (P2P, Internet based…)
• Tightly coupled (Cluster computing, Grid…)
Loosely coupled systems communicate with each other over the internet and
usually have a decentralized architecture with one or no controllers; examples include
Napster and the Gnutella network. Tightly coupled systems are connected by a
high-speed intranet and are centrally managed. Most grid computing projects, such as
Globus and Atlas, are tightly coupled distributed architectures.
Large crawl-based search engines are typically based on scalable clusters
consisting of a large number of low-cost servers located at one or a few locations and
connected by high-speed LANs. Current commercial information retrieval systems,
such as the web search engines Google [16] and AltaVista, handle tremendous loads by
exploiting implicit parallelism and using SMPs to support their services. Although it is
clear that the more CPUs and disks one has, the more load the system can handle, the
important question is how much hardware and software is needed to exploit these
resources.
Google [17] has one of the largest distributed search architectures, housing over
10,000 computers connected by a high-speed intranet. Its fast search speed can be
attributed to thousands of low-cost PCs networked together (one of the world's largest
Linux clusters) to quickly find each query's answer.
In Google, the web crawling (downloading of web pages) is done by several
distributed crawlers. A URL server sends lists of URLs to be fetched to the crawlers.
The fetched web pages are then sent to the storeserver, which compresses and stores
them in a repository. A multi-threaded indexer analyzes these pages and indexes them,
and results are ranked using the PageRank algorithm.
Figure 2.5 Google’s Query Processing Architecture
The distributed architecture illustrated above has been highly successful for
Google. Even though the entire crawling and indexing process is done in a distributed
manner, there is always a central coordinator controlling the different operations,
resulting in a semi-distributed architecture.
Napster was a popular mp3 search engine based on a centralized P2P
server-client model. In this model, a central server manages traffic between registered
users and maintains a directory of the shared files stored on the users' machines. This
directory is updated frequently, typically every time a user logs on or off the network.
When a request arrives, the central server creates a list of all files matching it, cross-checks
the list against its own database of files, and displays the verified list to the user.
The user then simply selects the desired file, opening a direct HTTP link with the PC
holding that file. The download takes place directly between the network users; the data
is never stored on the central server or on any other intermediate device.
An overview of the Napster architecture is shown in Figure 2.6: user computers
query a centralized index server, then exchange files directly with other peers.
Figure 2.6 Napster Distributed Architecture (Source: www.searchtools.com)
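The centralized directory protocol described above can be sketched in Python as follows. This is a minimal illustration, not Napster's actual implementation; the class and method names are our own.

```python
class CentralIndex:
    """Sketch of a Napster-style central directory: peers register the
    files they share; a search returns (peer, file) matches, and the
    download itself happens peer-to-peer, never through the server."""

    def __init__(self):
        self.directory = {}                  # peer name -> list of file names

    def login(self, peer, files):
        """Called when a user logs on: register the peer's shared files."""
        self.directory[peer] = list(files)

    def logout(self, peer):
        """Called when a user logs off: drop the peer's entries."""
        self.directory.pop(peer, None)

    def search(self, term):
        """Return all (peer, file) pairs whose file name matches the term.
        The user then opens a direct HTTP link to the listed peer."""
        term = term.lower()
        return [(peer, f)
                for peer, files in self.directory.items()
                for f in files if term in f.lower()]
```

Because the directory lives only on the server while the files live only on the peers, the server's failure stops all searches even though no data is lost, which is exactly the single-point-of-failure weakness noted below.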
Advantages:
• The central server enables fast and efficient location of the desired data.
• Because the central server constantly updates the index, the latest files are
immediately available to users for downloading.
• Since every user must be registered on the server's network, the search is
comprehensive: all machines in the network are included.
Disadvantages:
• The central server constitutes a single point of failure.
• The entire network depends on the central servers and can collapse if any one
of them dies.
Distributed search offers many advantages. The most obvious is the ability to
provide supercomputer-level processing power, or better, for a fraction of the cost of a
typical supercomputer. CPU-intensive tasks can be distributed over a large number of
processors. Employing a large number of individual computers reduces the network
bandwidth needed by any single machine, enabling faster downloads. Scalability is
another great advantage of distributed computing.
Though they provide massive processing power, supercomputers are typically not
very scalable once installed, whereas a distributed computing installation is virtually
infinitely scalable: simply adding more systems to the environment increases the
computational capacity of the existing setup. A byproduct of distributed computing is
more efficient use of existing system resources. Fault tolerance of the whole setup also
improves: a distributed crawler will continue to work even if some of the clients die,
since other agents can carry on with the assigned work.
In the next chapter we will take a look at the concept and working of Diogenes and
its existing architecture.
CHAPTER 3
DIOGENES – INITIAL ARCHITECTURE
In this chapter, we discuss the initial architecture and working of Diogenes. We
begin with an overview of Diogenes and then analyze its different components. In
section 3.1.4, we take a look at evidence combination and the Dempster-Shafer
approach. We discuss the advantages and drawbacks of Diogenes in the subsequent
sections.
3.1 OVERVIEW OF DIOGENES
Diogenes is an automated web-based image retrieval agent developed by Dr. Alp
Aslandogan [1] as part of his doctoral research. This web search agent is designed
specifically to retrieve facial images from the web using efficient evidence combination
algorithms, face detection software and other approaches unique to web-based image
retrieval.
Let us look at the control flow of Diogenes in Figure 3.1. It consists of a web
crawler, an HTML analyzer, a face detection module, an evidence combination module,
an indexer and a face database to hold the images. Diogenes retrieves web pages and
associates a person name with each facial image on those pages. The search process is
initiated when a user enters the name of a person as a query. The user issues the query
through an HTML page, and a CGI script takes it up for further processing. The crawler
module issues this query to text search engines such as Yahoo!, Lycos and AltaVista,
obtains the resulting lists of URLs, and retrieves the corresponding web pages. To find
images of the queried person, Diogenes relies on two types of evidence, visual and
textual. A face detection module examines the images on each page for human faces. A
face recognition module identifies each face using a database of known person images.
A text/HTML analysis module analyzes the body of the text with the aim of finding
clues about the object (person) in each image. The outputs of face detection and
text/HTML analysis are merged using the Dempster-Shafer evidence combination
mechanism to classify each image as relevant or not, and the relevant images are
presented back to the user.
Figure 3.1 shows the working model: the input query (runsearch) produces a URL
file whose pages flow, on the text side, through the HTML tag stripper, part-of-speech
tagger (with its lexicon, bigram and rule files) and HTML analyzer, and, on the image
side, through the face detector (giftest), cropper, converter (ImageMagick) and wavelet
face recognizer (wfr.cpp) backed by the face database. Both sides feed the evidence
combination module and adjacency analyzer, whose output goes to the indexer and
HTML composer, with an automatic feedback process back into the face database.
Figure 3.1 Working Model of Diogenes – People Search Engine
Some of the components of Diogenes are explained in detail below.
3.1.1 Web Crawler
Web crawler formulates and sends the input query to different text search
engines like Yahoo!, AltaVista, HotBot, Lycos etc in a format as needed by the
respective search engines. It collects the search results (URL list) for the input query
from these search engines. HTTP requests to download the web pages in the URL list,
is sent to the remote web sites and the crawler gets back the web pages. It retrieves the
text of the web pages and then extracts each of the images referenced on those pages. It
then saves all this information under a unique directory whose name is generated from a
unique time-stamp.
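The crawler's control flow can be sketched as below. Diogenes itself is written in Perl; this Python sketch uses hypothetical `engines` and `fetch` callables standing in for the per-engine query formatting and the HTTP download.

```python
import os
import time

def crawl(query, engines, fetch, basedir="."):
    """Sketch of the crawler: ask each text search engine for result URLs,
    download each page, and save everything under a directory whose name
    is generated from a unique time-stamp.

    engines: dict of engine name -> function(query) -> list of URLs
    fetch:   function(url) -> page text (would issue the HTTP request)
    """
    workdir = os.path.join(basedir, "q%d" % int(time.time()))
    os.makedirs(workdir, exist_ok=True)
    urls = []
    for name, ask in engines.items():        # query each search engine
        for url in ask(query):
            if url not in urls:              # keep each URL only once
                urls.append(url)
    saved = {}
    for i, url in enumerate(urls):           # download and save each page
        path = os.path.join(workdir, "page%03d.html" % i)
        with open(path, "w") as f:
            f.write(fetch(url))
        saved[url] = path
    return workdir, saved
```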
3.1.2 Text/HTML Analysis Module
The text/HTML analysis module of Diogenes determines a degree of association
between each personal name on a web page and each facial image on that page. This
degree of association is based on two factors: Page level statistics and local (or
structural) statistics [2]. Page level statistics such as frequency of occurrence and
location within the page (title, keyword, body text etc...) are independent of any
particular image. Local statistics are those factors that relate a name to an image.
Diogenes takes advantage of the HTML structure of a web page in determining the
degree of association between a personal name and an image.
Some of the cues used for this purpose are:
• Frequency: The significance of a name is proportional to its number of
occurrences within the page and inversely proportional to its number of
occurrences on the whole web. It captures the premise that if a rare word appears
frequently on a page then it is very significant for that page. If a common word on
the other hand appears frequently on a page, it may not be as significant.
• Name or URL Match: A name that is a substring of the image name or the image
URL is assigned a higher significance.
• Shared HTML Tags: Names that are enclosed in the same HTML tags with an
image are more likely to be associated with that image. For instance, a caption for
an image is usually put in the same HTML table on the same column of adjacent
rows.
• Alternate Text: The alternate text identified by the “ALT” HTML tag generally
serves as a suitable textual replacement for an image or a description of it.
When a page is retrieved, a part-of-speech tagger (Brill's tagger [14]) tags all the
words that are part of a proper name on the page, and their occurrence frequencies are
recorded. For each such word and for each image on the page, a degree of association
is established, with the frequency of the word serving as the starting point for this
score.
Then the HTML analysis module analyzes the HTML structure of the page. If an
image and a word share some common tags, their degree of association is increased. If
the word is a substring of the image name or if the word is part of the alternate text for
the image, the association is increased further. Since the text/HTML analysis module
assigns degrees of association to individual words, at the time of evidence combination
the scores of the two words (the first name and the last name) that make up a personal
name are averaged to get a single text/HTML score.
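The cues above can be combined into a single degree of association, as in this illustrative sketch. The boost factors (2.0, 1.5) are our own assumptions for illustration, not Diogenes' actual weights.

```python
def association_score(word, page_freq, web_freq, image_url, shared_tags, alt_text):
    """Illustrative degree of association between a proper-name word and
    an image, following the cues listed in the text."""
    score = page_freq / float(web_freq)    # rare-but-frequent words score high
    if word.lower() in image_url.lower():  # name is substring of image name/URL
        score *= 2.0
    if shared_tags:                        # word and image share HTML tags
        score *= 1.5
    if word.lower() in alt_text.lower():   # ALT text often describes the image
        score *= 2.0
    return score

def name_score(first_word_score, last_word_score):
    """Per the text, the scores of the first-name and last-name words
    are averaged into a single text/HTML score."""
    return (first_word_score + last_word_score) / 2.0
```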
3.1.3 Image Analysis Module
The visual evidence used by the classifier of Diogenes consists of the outputs of
the face detection and face recognition modules. The neural-network-based face
detection module examines an image to find a human face [15]. The detector converts
the image to gray scale and checks for the presence of eyes in the image; if eyes are
found, the distance between them serves as a measure of confidence that a person's
image is present in the picture. The face location is indicated to an intermediate module
which crops (cuts out) the facial portion and submits it to the face recognition module.
Diogenes uses a face recognition module which implements the eigenface method.
This module uses a set of known facial images for training. Each of these training
images has an associated personal name with it. At recognition time a set of distance
values between the input image and those of the training set are reported. These
distances indicate how dissimilar the input image is to the training images. In addition a
global distance value called “Distance from Face Space” or DFFS is also reported. This
is the global distance of the input image from the facial image space spanned by the
training images. Diogenes uses this latter value to determine the uncertainty of the
recognition.
3.1.4 Evidence Combination
Evidence combination is a powerful tool used for managing uncertainty in fields
such as robot vision, remote surveillance and automated equipment monitoring and
medical diagnosis [3]. In information retrieval, evidence combination has been used
successfully in integrating the evidence of different query representations or different
retrieval and ranking strategies. Bayesian statistical models and Fuzzy sets are among
the means used by researchers to integrate different pieces of evidence. The ultimate
goal of evidence combination is to improve the accuracy of a classifier. Depending on
the context, the objects to be classified may be documents, images, landscapes or
physical objects.
In Diogenes, the Dempster-Shafer evidence combination mechanism is used for
combining the evidence obtained by image and textual analysis. It provides a method
for combining independent bodies of evidence using Dempster's rule: the different
pieces of evidence are combined into a single relevance value for a person image on a
web page with respect to the input user query. The output of the face recognition
module, which classifies the image, and the output of the text analysis module
constitute the evidence for this algorithm. These pieces of evidence are considered
independent, since the output of one module does not affect the other. The text analysis
module assigns a degree of association between the input query and the contents of the
web page, and the face recognition module gives a distance value according to the
similarity of the image to the images in the face database. These scores are used in the
simplified Dempster-Shafer evidence combination algorithm to obtain a degree of
relevance.
We have the output of a face recognition module (FR), which classifies the
image, and the output of a text/HTML analysis module (TA), which analyzes the text
that accompanies the image. Both modules attempt to identify the person in the image
based on different media. We designate the two bodies of evidence as m_FR and m_TA
respectively. The result of the face recognition module does not affect the text/HTML
score and vice versa.
Using the simplified formula [3] of Dempster-Shafer evidence combination
algorithm, the image ranking in our case can be represented as
rank(c) ∝ m_FR(c)·m_TA(c) + m_FR(c)·m_TA(θ) + m_TA(c)·m_FR(θ)
Here ∝ represents the “is proportional to” relationship, and m_FR(θ) and m_TA(θ)
represent the uncertainties in the bodies of evidence m_FR and m_TA respectively.
These are obtained as follows. For face recognition, we have a “distance from face
space” (DFFS)
value for each recognition. This value is the distance of the query image to the space of
eigen-faces formed from the training images. Diogenes uses the DFFS value to estimate
the uncertainty associated with face recognition. If the DFFS value is small, the
recognition is good (uncertainty is low) and vice versa. The following is Diogenes'
formula for the uncertainty in face recognition:
m_FR(θ) = 1 − 1/ln(e + DFFS)
For text analysis, uncertainty is inversely proportional to the maximum value among the
set of degree of association values assigned to name-image combinations.
m_TA(θ) = 1/ln(e + MDA)
where MDA is the maximum numeric “degree of association” value assigned to a
personal name with respect to a facial image among other names.
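The two uncertainty formulas and the simplified combination rule can be written directly in code. This is a sketch; the function names are ours, not Diogenes'.

```python
import math

def fr_uncertainty(dffs):
    """m_FR(theta) = 1 - 1/ln(e + DFFS): a small DFFS means a confident
    recognition, hence low uncertainty."""
    return 1.0 - 1.0 / math.log(math.e + dffs)

def ta_uncertainty(mda):
    """m_TA(theta) = 1/ln(e + MDA): a high maximum degree of association
    (MDA) means low textual uncertainty."""
    return 1.0 / math.log(math.e + mda)

def rank(m_fr_c, m_ta_c, dffs, mda):
    """Simplified Dempster-Shafer rank for a candidate person c:
    rank(c) ~ m_FR(c)m_TA(c) + m_FR(c)m_TA(theta) + m_TA(c)m_FR(theta)."""
    m_fr_theta = fr_uncertainty(dffs)
    m_ta_theta = ta_uncertainty(mda)
    return m_fr_c * m_ta_c + m_fr_c * m_ta_theta + m_ta_c * m_fr_theta
```

Note how the uncertainty terms let one body of evidence carry the rank when the other is unsure: if the text score is uncertain (large m_TA(θ)), a confident face match still contributes through the m_FR(c)·m_TA(θ) term.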
3.1.5 Indexer
This module reads in the degrees of association assigned by the evidence
combination module for each image and person name pair and produces an index file
for each person.
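The indexer step can be sketched as follows; the per-person `.idx` file layout shown here is an assumption for illustration, not Diogenes' exact format.

```python
import os

def build_index(records, outdir):
    """Sketch of the indexer: group (person, image, score) records from the
    evidence combination step and write one .idx file per person, with one
    'image score' line each, best score first."""
    os.makedirs(outdir, exist_ok=True)
    by_person = {}
    for person, image, score in records:
        by_person.setdefault(person, []).append((image, score))
    for person, entries in by_person.items():
        path = os.path.join(outdir, person.replace(" ", "_") + ".idx")
        with open(path, "w") as f:
            for image, score in sorted(entries, key=lambda e: -e[1]):
                f.write("%s %.3f\n" % (image, score))
    return sorted(by_person)
```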
3.2 Result Presentation
The images found to be relevant are cropped into thumbnails, and an HTML page
is generated containing these thumbnails. The images are displayed in decreasing order
of their total scores.
3.3 Automatic Feedback
One important feature of Diogenes is the automatic feedback of images
determined to be relevant by the user. After the search finishes, the user can mark the
images he or she found relevant in the resulting HTML page. These images are stored
in a face database that is used for future queries: newly downloaded images are
compared with the stored images for similarity, and if a downloaded image is similar
to the stored images of that particular person, it is given a higher weight. This process
yields better results.
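The feedback boost can be sketched as below. Both the similarity predicate and the 1.5 boost factor are assumptions made for illustration; Diogenes actually compares images through its wavelet face recognizer.

```python
def reweight(score, image, stored_faces, similar):
    """Sketch of the feedback boost: if a newly downloaded image is similar
    to an image the user previously marked relevant, raise its score.

    similar: predicate(image, known) -> bool, standing in for the face
    similarity test."""
    for known in stored_faces:
        if similar(image, known):
            return score * 1.5   # boost factor is an assumption
    return score
```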
3.4 Advantages of Diogenes
Compared to similar search engines, Diogenes achieves very high precision for
the same query. This can be attributed to the fact that Diogenes makes use of the image
path, the alternate text and the full text of the web pages, all of which help establish
relevance more accurately. Although some other search engines (WebSeer, WebSEEk)
analyze image content, they do so for different purposes; Diogenes' combination of
textual and visual analysis results in better precision. Since it uses face recognition
techniques to identify person images on the web, it eliminates non-person images,
greatly improving precision. It does not use a large database for storing all the images,
thereby avoiding the huge overhead of large databases.
Since Diogenes is a meta-search agent, it has all the advantages associated with
a meta-search engine. An important feature of Diogenes is the incorporation of face
recognition and the use of the Dempster-Shafer evidence combination method with
object recognition and automatic, local uncertainty assessment. When the text/HTML
analysis module or the visual analysis module assigns a degree of similarity between an
image and a person name, there is a degree of uncertainty associated with each value.
In Diogenes' case, both modules produce numeric values indicating their degrees of
uncertainty; these values are obtained automatically, without user interaction, and
locally, separately for each retrieval/classification.
3.5 Drawbacks of Diogenes
Diogenes is designed for person image search on the web; other types of image
search will not yield relevant images, and other “object detection” methods would need
to be employed to support them. The on-the-fly search of Diogenes also delays the
search process: it can take hours to search the web for relevant person images, because
Diogenes is a single-threaded system, crawling is done at query time, and there is no
internal database storing previously retrieved images. Furthermore, Diogenes is a
single-user system: it can process only one query at a time and lacks the ability for
simultaneous processing. The architecture runs on only one system and is not scalable.
Some technical details about the initial architecture of Diogenes:
• It is written in Perl
• It was originally built on the Solaris platform
• It makes use of a face detection program written by Henry A. Rowley of
Carnegie Mellon University [15]
• It makes use of a wavelet-based face recognition program
• It requires the ImageMagick software for image processing
• It makes use of a rule-based tagger for text processing, written by Eric Brill
of MIT [14]
CHAPTER 4
DIOGENES – PROPOSED DISTRIBUTED ARCHITECTURE
In this chapter, we propose a new multi-threaded distributed architecture for
Diogenes. Our aim is to drastically reduce the search time by making use of freely
available computers across the network. In section 4.1, we discuss the design
requirements for the new architecture. Section 4.2 explains the proposed distributed
architecture, and section 4.3 the individual components of the new architecture. In
section 4.4, we discuss some of the other implementation issues we tackled while
implementing the proposed architecture.
4.1 Design Assumptions, Requirements, and Goals
In this section we give a brief presentation of the most important design choices
which have guided the implementation of distributed Diogenes. More precisely, we
sketch general design goals and requirements, as well as assumptions made during our
implementation.
As seen in chapter 3, the initial architecture of Diogenes had drawbacks relating
to search time and scalability. When we analyzed its performance, we found that nearly
70% of the total search time was spent waiting for web pages to be downloaded from
remote web servers, owing to slow responses from remote sites, network congestion or
sites being down. This overhead, also called communication latency, is huge compared
to the rest of the search time, and it grows rapidly with the number of requests the
crawler sends to download web pages. Our main goal was to reduce this overhead.
Between the request and the response when downloading web pages, the CPU is
idle, wasting CPU cycles. By making Diogenes a multi-threaded system, we can
drastically reduce the download time. The principal operating objective for a
multi-threaded processor is to have a sufficient number of parallel tasks multiplexed
onto the hardware so as to eliminate or minimize idle time in the presence of
long-latency operations [18]. While one thread is waiting to download pages from a
site, another thread can process the already downloaded pages; multi-threading is also
the natural way of implementing non-blocking communication operations. Further,
threads provide greater concurrency within a single process and allow efficient use of
system resources (e.g. CPU, memory). By coordinating these threads efficiently, we
can greatly improve Diogenes' performance.
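The overlap of download latency and page processing can be sketched with a thread pool draining a shared work queue. This is an illustrative Python sketch, not the Perl implementation; `download` and `process` stand in for the crawler and analysis steps.

```python
import queue
import threading

def run_pipeline(urls, download, process, n_threads=4):
    """Sketch: while one thread waits on a long-latency download, the
    others keep the CPU busy processing already-downloaded pages."""
    todo = queue.Queue()
    for url in urls:
        todo.put(url)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                url = todo.get_nowait()
            except queue.Empty:
                return                      # no work left
            page = download(url)            # long-latency I/O
            out = process(page)             # CPU-bound analysis
            with lock:                      # thread-safe result collection
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```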
Information retrieval is a very computation-intensive task, and image retrieval
needs even more computational power because of the image analysis involved. Even if
we reduce the communication latency drastically, the final response time, dominated by
the total computation time, remains higher than that of similar web image search
agents: the image processing steps, which include face detection, face recognition,
cropping, conversion to gray scale and so on, need a great deal of processing power. To
achieve a response time in the range of several minutes instead of several hours, we
need a distributed architecture that spreads this huge processing requirement among
different machines for computational speed-up. We can distribute the total
computational load among inexpensive PCs to obtain faster responses from Diogenes
[19]. To this end, we design a distributed query processing system encompassing
several machines across the network; the more machines involved in the computation,
the faster the response time.
For Diogenes to act as an image search engine, it must maintain its own image
index and store the images locally, so that when another user later issues the same
query, it can retrieve and present these images directly. We therefore need to design an
indexing and storage architecture for Diogenes. A distributed storage architecture is
preferred, since the total disk space required is split across many machines on the
network, greatly reducing per-machine storage requirements. To achieve this
distributed storage goal, we assume that every worker agent machine runs a web server,
which makes it possible to avoid transferring images back to the central index server.
In the initial architecture, the results obtained from different text search engines
are simply concatenated and no re-ranking is performed. Re-ranking is necessary
because different text search engines have varying precision, and simply appending one
list after another may put more relevant results at the bottom of the URL list. More
relevant URLs would then be processed later, causing the online user to wait longer.
We need a URL merging and ranking scheme that takes into account the precision of
the different text search engines as well as the relative order of the URLs within each
engine's results.
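One possible merging scheme satisfying these two criteria is sketched below: each engine is weighted by an assumed precision, each URL's contribution decays with its position in that engine's list, and a URL returned by several engines accumulates score. The scheme and its weights are our assumptions for illustration, not the final design.

```python
def merge_rank(results, precision):
    """Sketch of a precision-weighted, position-discounted URL merge.

    results:   dict of engine name -> ordered list of result URLs
    precision: dict of engine name -> assumed precision weight
    Returns all URLs, best first."""
    scores = {}
    for engine, urls in results.items():
        weight = precision.get(engine, 1.0)
        for pos, url in enumerate(urls):
            # earlier positions count more; repeated URLs accumulate
            scores[url] = scores.get(url, 0.0) + weight / (pos + 1)
    return sorted(scores, key=scores.get, reverse=True)
```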
A web interface that can serve multiple users simultaneously will vastly improve
the usability of Diogenes. The system should be able to handle the requests of multiple
users concurrently while still maintaining a decent response time.
The index server only needs to know each worker agent's name (computer
name), the corresponding image path on that computer and the relevance score. When
this information is embedded in the HTML code, the user's browser will automatically
pull the images from the worker agents' web servers. For all practical purposes, we also
assume that all worker agent machines have the same configuration (CPU power,
memory, hard disk, ...), which greatly simplifies load balancing at this stage.
In the next section, we propose a new Distributed Multi-threaded Query
Processing architecture for Diogenes which will satisfy most of the above mentioned
design criteria.
4.2 Proposed Architecture – Distributed Multi-Threaded Query
Processing System
We propose a novel architecture for Diogenes to overcome some of its
drawbacks. The distributed memory architecture proposed here effectively makes use
of different computers across the network. In this architecture, each processor has its
own local memory and uses message passing to exchange data with the other
processors. Parallel computing techniques allow each processor to work on its own
section of the problem, exchanging the data in its local memory with the other
processors. Parallel computing thus makes it possible to solve problems that cannot be
solved on a single processor, or not within a reasonable amount of time.
There are two types of parallel programming paradigms:
1. Data parallel
- Each processor performs the same task on different data
- Example: grid problems
2. Task parallel
- Each processor performs a different task on the same data
- Example: signal processing
Our search process presents a classic opportunity for data-parallel programming:
the data (the URL list) can be processed concurrently, hiding communication latency.
The URL list is divided among different CPUs, which in turn process their URLs and
return the results; each processor performs the same analysis operation on different
URLs.
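The division of the URL list among worker agents can be sketched as a simple round-robin deal, which keeps the shares even; the choice of round-robin (rather than contiguous blocks) is our assumption, and it has the side benefit that the highest-ranked URLs are processed first on every worker.

```python
def split_urls(urls, n_workers):
    """Sketch of the data-parallel split: deal the ranked URL list out
    round-robin so each worker agent gets an even share and runs the
    same analysis on its own URLs."""
    chunks = [[] for _ in range(n_workers)]
    for i, url in enumerate(urls):
        chunks[i % n_workers].append(url)
    return chunks
```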
We employ this data-parallel, distributed-memory concept in our proposed
architecture for the Diogenes image search agent. The design is based on a centralized
P2P architecture (like Napster) with a centralized server storing the index. As shown in
Figure 4.1, the basic components of the system are a web server, a URL server, a set of
worker agents and an index server. All components are connected by a network; in this
work we use a local area network, though the architecture logically works over a wide
area network as well.
Figure 4.1 shows the architecture: the user's query enters through the web
interface; if an index entry is already present, cached images are returned as the result,
otherwise the URL server distributes URLs to the worker agents (on PC1, PC2,
PC3, ...), whose results are gathered by the index server to create the index and answer
the query.
Figure 4.1 Distributed Multi-Threaded Query Processing Architecture
The proposed system works as follows:
• The main server, which also hosts the web server, is the controlling module of
this architecture. It receives the input query from the user through a web
interface implemented with CGI scripts.
• The query is forwarded to a URL server, which sends it to text search engines
such as Yahoo!, Lycos and AltaVista and creates a URL file from the URLs
these engines return.
• These URLs are re-ranked based on their relative ranking within each search
result and on multiple occurrences across the results of different search engines.
• The URL server divides this URL list and sends the parts to worker agents
residing on different machines (PC1, PC2, PC3, ...) to be processed by these
clients.
• The worker agents each contain a copy of the Diogenes program. They create
multiple threads which take URLs from this list and run the Diogenes module
for each one, sending an HTTP request to the URL and downloading the web
page for further processing.
• Each thread processes its downloaded web page by analyzing the textual and
image content and applying the Dempster-Shafer evidence combination
algorithm to establish relevance.
• After finishing the analysis, the threads call the index server and send it the
relevance score and other details; the images themselves are kept locally on the
worker agent machine.
• The index server receives the index data and appends it to an index database
keyed on person name.
• The web server keeps polling the index database for information about the
query. When entries for the query are found, it creates an HTML page
comprising the paths of the relevant images on the different worker agents.
• We assume that all the worker agents run a web server, so all the relevant
images are pulled by the user's browser directly from those web servers.
4.3 Components of proposed architecture
In this section we describe in detail the functionality of the different
components of the new architecture.
4.3.1 Web Interface (Web Server) Module
The web interface is essentially a web server which provides a graphical user
interface for accepting the user's query. It also initiates a web search for
the requested person's images. The operation of the web interface is shown in
Figure 4.2.
When a query is entered through the web interface, a CGI script takes the query
input (a person name) and checks for its presence in the index database. If the
index database contains an entry (a file) for this person name, the script
opens the file, reads its contents and composes a result page. The actual
images are stored in a distributed fashion across different computers, and
their paths are included in the HTML code; the user's browser pulls these
images from the different computers, since a web server runs on each of these
client machines.
If the index database does not contain the query, the web interface forks off a
process which calls the URL server with the query as the argument. After
initiating the search process, this module checks the status of the process,
waits for a certain period of time and then checks the index database for the
query again. If an entry is found in the database, it composes a page and
returns it to the user. This page keeps refreshing until the search process is
finished.
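The check-wait-recheck loop described above can be sketched as a small polling routine. This is an illustrative sketch, not the actual CGI script: the index database is assumed to be a directory with one file per query (as Section 4.3.6 describes), and the method name is ours.

```java
import java.io.File;

// Polls the index database directory for a file named after the query,
// sleeping between checks, until the entry appears or we give up.
public class IndexPoller {

    public static boolean waitForIndex(File indexDir, String query,
                                       int maxTries, long intervalMs) {
        File entry = new File(indexDir, query);
        for (int i = 0; i < maxTries; i++) {
            if (entry.exists()) return true;   // index entry found: compose page
            try {
                Thread.sleep(intervalMs);      // wait, then check again
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return entry.exists();
    }
}
```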
[Figure: the query goes to the web server, which checks whether an index entry
is present in the index database; if not, a new search process sends the query
to the URL server; if so, a page is composed and output to the user.]
Figure 4.2 Web Interface Module of Distributed Architecture
The CGI script reads the index file and orders the images by their relevance
values, which are stored in the index file. Any URL repeated in the index file
is eliminated, so only one image per web site is selected for display.
4.3.2 URL Server
The URL server is one of the main modules of distributed Diogenes. Figure 4.3
illustrates its detailed workflow. This module carries out several tasks:
querying multiple text search engines, merging and ranking the URLs, and
distributing the URLs among the worker agents.
The URL server receives the person name as input. It sends this query to
multiple text search engines (Yahoo, AltaVista, HotBot, ...) and retrieves the
results. This is done in parallel to avoid communication latency: multiple
threads are created, and these threads issue requests to the text search
engines and build a URL list.
[Figure: the query dispatcher issues the query via multiple threads to Yahoo,
AltaVista and HotBot; the returned URL lists are merged and ranked, and the URL
distributor reads the agent list (Descartes.uta.edu, csl.uta.edu, ...) and
sends partial URL lists to Worker Agents 1-4.]
Figure 4.3 URL Server architecture
Duplicate URLs in this list are removed, and the remaining URLs are merged into
a single list by applying Borda's positional method [20]. This re-ranked URL
list is distributed among the worker agents across the network. The URL
distributor reads a file called "AgentList", which contains the names of the
computers hosting worker agents, and distributes the URLs to those machines.
In the URL distributor we follow an "interleaved data distribution" approach to
distribute the URLs among the worker agents. Interleaving ensures that URLs at
the top of the re-ranked list are processed earlier than URLs at the bottom, so
the order of the re-ranked URLs is preserved during processing. The
distribution is carried out by multiple threads which divide the data and send
partial URL lists to the worker agents.
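The interleaving step can be sketched as a simple round-robin split. This is a minimal sketch under the assumption (consistent with the text) that URL j of the re-ranked list goes to agent j mod numAgents; the class name is ours.

```java
import java.util.ArrayList;
import java.util.List;

// Interleaved data distribution: the top of the re-ranked list is spread
// across all agents, so every agent processes high-ranked URLs first.
public class InterleavedDistributor {

    public static List<List<String>> distribute(List<String> urls, int numAgents) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numAgents; i++) {
            parts.add(new ArrayList<>());
        }
        for (int j = 0; j < urls.size(); j++) {
            parts.get(j % numAgents).add(urls.get(j));  // round-robin assignment
        }
        return parts;
    }
}
```

With five URLs and two agents, agent 0 receives URLs 1, 3, 5 and agent 1 receives URLs 2, 4, each partial list still in rank order.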
4.3.3 Borda’s Positional Method of Ranking
Rank aggregation (RA) is the problem of collating a given set of rankings; for
example, it may be used to declare overall team positions based on the rankings
given by various judges. In our setting, we use rank aggregation methods to
reorder the URLs obtained from different search engines, based on the rankings
of the URLs returned by those engines. Other applications of rank aggregation
on the web include spam fighting, word association and search engine
comparison.
We employ Borda's positional method of ranking for URL rank aggregation in
Diogenes. Borda's method is a positional method in that it assigns a score
corresponding to the position in which a candidate appears within each voter's
ranked list of preferences, and the candidates are sorted by their total scores
[21]. A primary advantage of positional methods is that they are
computationally very easy; they can be implemented in linear time.
Borda's method is an example of a positional voting method, which assigns P_j
points to a voter's jth-ranked candidate, j = 1, ..., N, and then determines
the ranking of the candidates by evaluating the total number of points assigned
to each of them. Given k lists l_1, l_2, ..., l_k, for each candidate C_j in
list l_i we assign the score

    S_i(C_j) = |{ C_p : l_i(C_p) > l_i(C_j) }|

The candidates are then sorted in decreasing order of the total Borda score

    S(C_j) = \sum_{i=1}^{k} S_i(C_j)
Example:
Let,
l1 - URL list obtained by search engine 1
l2 - URL list obtained by search engine 2
a,b,c,d,e - represent URLs returned by text search engines
Given lists l1 = [c,d,b,a,e] and l2 = [b,d,e,c,a],
S1(a)=|e|=1, as l1(e)=5 > l1(a)=4.
Similarly,
S1(b)=|a,e|=2, as l1(e)=5 > l1(b)=3 and l1(a)=4 > l1(b)=3.
Proceeding this way, we get
S(a) = S1(a)+S2(a) = 1+0 = 1,
S(b) = S1(b)+S2(b) = 2+4 = 6,
S(c) = S1(c)+S2(c) = 4+1 = 5,
S(d) = S1(d)+S2(d) = 3+3 = 6,
S(e) = S1(e)+S2(e) = 0+2 = 2.
Now, sorting the elements based on their total scores, we get the combined ranking as
b = d > c > e > a. The '=' symbol indicates a tie.
The URLs are ordered in decreasing order of their total Borda scores; duplicate
URLs are eliminated during this process.
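The scoring above can be sketched directly in code. This sketch assumes, as in the worked example, that every list ranks the same set of candidates, so S_i(C_j) equals the number of candidates below C_j in list l_i; the class name is ours, not part of Diogenes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Borda scoring: for each list, a candidate at position pos (0-based) in a
// list of n candidates has n - 1 - pos candidates ranked below it, which is
// exactly S_i(C_j); total scores sum over all lists.
public class Borda {

    public static Map<String, Integer> totalScores(List<List<String>> lists) {
        Map<String, Integer> total = new HashMap<>();
        for (List<String> list : lists) {
            int n = list.size();
            for (int pos = 0; pos < n; pos++) {
                total.merge(list.get(pos), n - 1 - pos, Integer::sum);
            }
        }
        return total;
    }
}
```

Running this on l1 = [c,d,b,a,e] and l2 = [b,d,e,c,a] reproduces the totals of the worked example (a=1, b=6, c=5, d=6, e=2), giving the combined ranking b = d > c > e > a.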
4.3.4 Worker Agent
The worker agent is the main multi-threaded query processing module; it
actually downloads and analyzes the web pages. It is a multithreaded Java
application which accepts and processes the URLs sent by the URL server. Worker
agents reside on multiple computers across the network. Each agent combines
query processing and crawling, which makes good use of memory and CPU power:
since crawling has a high CPU but low memory requirement, combining the two
increases the efficient use of system resources. The components of a worker
agent are shown in Figure 4.4.
[Figure: the multi-threaded query processor draws URLs 1-6 from the URL list;
Threads 1-3 each run a Diogenes agent (get URL, produce output) and send their
index information, via multiple threads, to the index server to update the
index.]
Figure 4.4: Worker Agent Architecture
The worker agent creates multiple threads which pick up URLs from a global list
of URLs. Each thread runs a Diogenes module with a URL as the argument. The
output of this module (relevance, image path, etc.) is collected and passed to
the "index server" for updating the index database.
The worker agent implements a "dynamic data distribution" model for assigning
URLs to threads, using a global array of URLs. After finishing with a URL, each
thread picks up the next available URL from the global list, so any thread can
process any number of URLs; this results in very fast query processing.
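The dynamic data distribution can be sketched with a shared atomic cursor into the global URL array. This is an illustrative sketch, not the actual worker-agent code: the class name is ours, and the body of the loop stands in for the Diogenes module.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Threads share one cursor into the global URL list; each thread grabs the
// next unprocessed URL as soon as it finishes the previous one, so faster
// threads naturally take on more URLs.
public class DynamicDistribution {

    public static int processAll(List<String> urls, int numThreads)
            throws InterruptedException {
        AtomicInteger next = new AtomicInteger(0);       // shared cursor
        AtomicInteger processed = new AtomicInteger(0);
        Runnable worker = () -> {
            int i;
            while ((i = next.getAndIncrement()) < urls.size()) {
                urls.get(i);                 // would be handed to the Diogenes module
                processed.incrementAndGet();
            }
        };
        Thread[] threads = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            threads[t] = new Thread(worker);
            threads[t].start();
        }
        for (Thread t : threads) t.join();
        return processed.get();
    }
}
```

Because `getAndIncrement()` is atomic, no two threads ever receive the same index, so no lock around the URL list itself is needed.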
4.3.5 Index Server
The main task of the index server is to accept the results reported by the
different worker agents and to update an index database. The worker agent
threads call the index server and pass it their results; the index server
handles these calls in a synchronized manner, so that none of the results are
lost.
The component diagram of the index server is shown in Figure 4.5. The index
server accepts the following information from the worker agents: query name,
agent (machine) name, path to the image, URL and relevance.
The index server opens a file named after the query (the person name) and
appends this data to it. The index database is a collection of files named
after the queries. Each file contains the following information, appended
record by record.
55
Table 4.1: Contents of an Index file

Query Name        - Person name
                    (e.g., Bill Clinton)
Worker Agent Name - Machine name which processed the URL
                    (e.g., Descartes.uta.edu)
Image Path        - Absolute path of the image on the worker agent machine
                    (e.g., /home/httpd/html/Diogenes/dirspace/128222303/bill.gif)
URL               - URL containing the relevant image
                    (e.g., www.whitehouse.gov)
Relevance         - Associated relevance of the image
                    (e.g., 0.039)
All of the above information is used when retrieving the images for processed
queries.
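The index server's append step can be sketched as follows. This is a minimal sketch, not the actual Diogenes source: the class name, the tab-separated record layout and the one-file-per-query directory convention are our assumptions, following Table 4.1 and the synchronized-call requirement described above.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// One file per query name; the synchronized method serializes concurrent
// calls from worker-agent threads so no reported result is lost.
public class IndexServerSketch {

    private final File dbDir;

    public IndexServerSketch(File dbDir) { this.dbDir = dbDir; }

    public synchronized void updateIndex(String queryName, String agentName,
            String imagePath, String url, double relevance) throws IOException {
        File entry = new File(dbDir, queryName);
        try (PrintWriter out = new PrintWriter(new FileWriter(entry, true))) {
            // Fields follow Table 4.1: query, agent, image path, URL, relevance.
            out.println(queryName + "\t" + agentName + "\t" + imagePath
                    + "\t" + url + "\t" + relevance);
        }
    }
}
```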
4.3.6 Index Database
The index database is a collection of files named after person names. Each of
these files contains index information about the relevant images belonging to
that person; the content of such a file was described in the previous section.
The structure of the index database is shown in Figure 4.5.
[Figure: the index database is a collection of files named after queries (Bill
Gates, Billy Bob, Bill Clinton, Carl Lewis, Hillary Clinton, ...). The "Bill
Clinton" file contains records such as:
Bill Clinton  Machine4.uta.edu   /home/db/billphoto.jpg    www.clinton.nara.gov               0.033
Bill Clinton  Machine3.uta.edu   /home/db/clinton.gif      www.usapresidents.com              0.055
Bill Clinton  Csl.uta.edu        /home/db/billclinton.jpg  www.clintonpresidentialcenter.com  0.078
Bill Clinton  Descartes.uta.edu  /home/db/bill.jpg         www.whitehouse.gov                 0.092]
Figure 4.5 Structure of an Index Database
4.4 Other implementation issues
4.4.1 Concurrency Control and Thread Safety
Concurrent execution requires that threads be mutually exclusive where they
share data, so that no data loss occurs. The initial architecture of Diogenes
required minimal inter-process communication, and the number of shared
variables was very small. With proper planning we synchronized the use of
threads, making them thread-safe.
4.4.2 Shared Variable Access
Shared variables pose a threat of data loss and, when not handled carefully,
can cause many problems. We identified the shared variables and resources and
used "locks" to prevent data corruption.
4.4.3 Parallel & Distributed Computation
We have chosen Java 2 for performing most of the parallel and distributed
computation. It is a well-established, secure, and scalable development
environment equipped with many features tailored to the web. In particular,
instead of explicitly implementing a dedicated network protocol for inter-agent
communication, we adopt Remote Method Invocation (RMI), a technology which
enables us to create distributed applications in which the methods of remote
Java objects can be invoked from other Java virtual machines (residing on
different hosts), using object serialization to implicitly marshal and
unmarshal parameters.
Java 2 also provides a convenient way of creating and handling lightweight
processes (threads). The built-in synchronization features of Java greatly
reduce program complexity while keeping the coding flexible.
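The RMI approach described above can be illustrated with a skeletal remote interface. This is an illustrative sketch, not the actual Diogenes source: the interface and record names are ours, and the registry setup and stub lookup that a running deployment needs are omitted.

```java
import java.io.Serializable;
import java.rmi.Remote;
import java.rmi.RemoteException;

// A remote interface extends Remote and every method declares
// RemoteException; a worker-agent thread could invoke updateIndex()
// on the index server's stub as if it were a local call.
interface RemoteIndexServer extends Remote {
    void updateIndex(IndexRecord record) throws RemoteException;
}

// Arguments must implement Serializable so RMI can marshal and
// unmarshal them implicitly, as described in the text.
class IndexRecord implements Serializable {
    final String queryName, agentName, imagePath, url;
    final double relevance;

    IndexRecord(String queryName, String agentName, String imagePath,
                String url, double relevance) {
        this.queryName = queryName;
        this.agentName = agentName;
        this.imagePath = imagePath;
        this.url = url;
        this.relevance = relevance;
    }
}
```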
4.4.4 Simultaneous User Queries
Handling simultaneous user queries is a tricky business: care must be taken
when naming and creating the files and directories used to process each query.
We handle this problem by taking the time of the query (in milliseconds) and
associating it with the process ID for that query. Each time a query is
submitted, the system creates a new process to carry out the search operation.
The process ID for the query is unique, and combining it with the time yields
unique file and directory names. In addition, the thread ID associated with
each thread is used to obtain uniqueness when creating files.
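The naming scheme above can be sketched in a few lines. This is a sketch of the idea, not the actual implementation: the class and method names are ours, and `ProcessHandle` is used here simply as a modern way to obtain the process ID.

```java
// Combine the query time (milliseconds), the process ID and the thread ID
// to build a file or directory name that is unique across simultaneous
// queries, as described in the text.
public class UniqueNames {

    public static String uniqueDirName() {
        long timeMs = System.currentTimeMillis();        // time of query
        long pid = ProcessHandle.current().pid();        // per-query process ID
        long threadId = Thread.currentThread().getId();  // per-thread ID
        return timeMs + "_" + pid + "_" + threadId;
    }
}
```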
CHAPTER 5
Performance Evaluation
In this chapter, we describe the experimental evaluation of the new distributed
architecture and compare the results obtained by the distributed Diogenes with the
previous architecture (single threaded) of Diogenes.
5.1 Relevance Analysis
First, we calculate the "precision" obtained by Diogenes, defined as

    Precision = (Number of relevant images present) / (Total number of images retrieved)

A set of experimental retrievals was performed, and the search engines were
compared in terms of average precision in answering people-image queries. Only
the top 50 images retrieved by each search engine were considered for this
evaluation.
Table 5.1: Search Engine Precision Comparison

Query              Diogenes  Google  AltaVista
Bill Gates         0.90      0.58    0.78
Abraham Lincoln    0.96      0.62    0.86
Hillary Clinton    0.94      0.72    0.88
Dalai Lama         0.96      0.80    0.90
Average Precision  0.94      0.68    0.84
Table 5.1 shows the results of the precision evaluation. The precision of
Diogenes is better in every case, and its average precision is very high
compared to the other search engines.
We compare the results obtained by Diogenes with those obtained by Google, one
of the most popular image search engines on the web, checking the relevance of
the retrieved images by manual examination. The following snapshots show the
test results for person-name queries on Diogenes and Google.
Test 1: Results for query: HILLARY CLINTON
Figure 5.1 Diogenes results for query: HILLARY CLINTON
Figure 5.2 Google results for query: HILLARY CLINTON
Analysis:
Most of the images retrieved for the query "Hillary Clinton" are in fact images
of Hillary Clinton: the top 16 images returned by Diogenes are 100% accurate.
While the images returned by Google also contain images of Hillary Clinton, its
top 16 results contain at least 2 non-relevant, non-facial images. Diogenes
clearly provides better results in this case.
Test 2: Search results for query: ABRAHAM LINCOLN
Figure 5.3 Diogenes results for query: ABRAHAM LINCOLN
Figure 5.4 Google results for query: ABRAHAM LINCOLN
Analysis:
For the query "Abraham Lincoln", Google did not achieve very high accuracy,
while Diogenes achieved 100% accuracy. Google retrieved images of the USS
Lincoln, which do not correspond to the person-name query; Diogenes again
outperformed Google in terms of relevance.
Overall, Diogenes retrieved more relevant results for person-name queries than
Google, one of the most popular image search engines.
5.2 Performance Analysis: Initial Architecture
The following chart shows the performance of Diogenes with the initial,
single-threaded architecture. The initial architecture took nearly 30000
seconds (~8 hours) to process 2000 URLs. Nearly 70% of the total execution time
was spent waiting for web pages to download.
[Chart: Diogenes Initial Architecture Performance — time in seconds (0-35000)
versus number of URLs (0-2000), plotting total execution time, total wait time
and total CPU time.]
5.3 Performance Analysis: Multi-Threaded Architecture on a Single Agent
After employing the multi-threaded architecture on a single agent, the wait
time was reduced to essentially zero because of the parallel processing of the
threads; as desired, the communication latency is now almost zero. The total
CPU time was also almost halved compared to the initial architecture, which can
be attributed to the CPU being used to its fullest potential under the
multi-threaded architecture.
[Chart: Diogenes — Impact of Multi-Threading on Execution Time — time in
seconds (0-35000) versus number of URLs (0-2000), comparing search time using
multi-threading with single-threaded processing.]
5.4 Performance Analysis: Distributed Architecture with 4 Agents
As the chart shows, the total time taken to process different numbers of URLs
is drastically reduced when the workload is distributed among 4 agents. We
carried out this test with 4 agents, but the architecture can easily be scaled
to a higher number of agents. We can now obtain an initial response time of a
few seconds, which was our main goal. The initial response time is the time
taken by Diogenes to retrieve the first relevant image; this is on par with
many large-scale search engines employing thousands of computers for query
processing. Of course, in this case there is a single user, but the system can
easily be scaled to serve multiple users with similar response times.
[Chart: Diogenes — Impact of Distributed Architecture — time in seconds
(0-1600) versus number of URLs (0-2000), comparing the time taken by 4 worker
agents with the time taken by 2 worker agents.]
5.5 Conclusion
In this work, we have designed and developed a parallel and distributed
architecture for the Diogenes image search agent. The Diogenes meta-search
agent is now capable of handling multiple user queries and is scalable, and the
new architecture makes good use of system resources. The new system drastically
reduces Diogenes' search time, and its initial response time is nearly on par
with similar image search engines.
The open distributed search architecture reduces the centralized storage
requirement, and its modular design enables different components to be hosted
on different machines. With the precision of the initial Diogenes architecture
and the speed of the new distributed architecture, Diogenes promises to be a
popular image search engine.
One observed limitation of the new architecture is that it has a single point
of failure: the index server. If the index server, which also holds the index
database, fails, the entire search operation comes to a halt; and if the index
server becomes a bottleneck, the performance of Diogenes suffers. Also, if any
of the worker agents fails, all the URLs that were assigned to it but not yet
processed are lost, along with any relevant images those URLs might have
contained. Furthermore, if a worker agent crashes, the thumbnail images stored
on that machine become unavailable to the client browser; that part of the
distributed storage is lost.
5.6 Future Work
The present Diogenes is designed to retrieve only person images (facial
images): the face detection module employed in this work can detect only face
objects in the input image. New object detection modules could be incorporated
into Diogenes to detect other kinds of objects (cars, mountains, buildings,
etc.), which would greatly increase its usefulness.
A new design that makes use of a "Distributed Index Server" would remove the
single point of failure and reduce the bottleneck on any one machine, vastly
improving the fault tolerance of Diogenes.
A replication scheme could also be designed to replicate the images stored on
the different worker agents, so that in the case of a worker agent crash, all
the images contained on that agent remain available for retrieval.
REFERENCES
[1] http://searchengineshowdown.com
[2] Y. Alp Aslandogan, Clement T. Yu. Diogenes: A Web Search Agent for
    Content-Based Indexing of Personal Images.
[3] Y. Alp Aslandogan, Clement T. Yu. Multiple Evidence Combination in Image
    Retrieval: Diogenes Searches for People on the Web. In Proceedings of ACM
    SIGIR 2000, Athens, Greece, July 2000.
[4] Y. Alp Aslandogan, Clement T. Yu. Evaluating Strategies and Systems for
    Indexing Person Images on the Web. ACM Multimedia 2000.
[5] http://searchenginewatch.com/links/
[6] Search Engine History.
[7] Web Crawler Review: http://dev.funnelback.com/crawler-review.html
[8] http://www.tripadvisor.com/
[9] www.metacrawler.com
[10] Remco C. Veltkamp, Mirela Tanase. Content-Based Image Retrieval Systems:
     A Survey. Department of Computing Science, Utrecht University.
[11] John R. Smith, Shih-Fu Chang. Searching for Images and Videos on the
     World-Wide Web.
[12] http://www.altavista.com/image/default
[13] Michael J. Swain, Charles Frankel, Vassilis Athitsos. WebSeer: An Image
     Search Engine for the World Wide Web.
[14] Eric Brill. Some Advances in Transformation-Based Part of Speech Tagging.
     In Proceedings of the Twelfth National Conference on Artificial
     Intelligence, pages 722-727, 1994.
[15] Henry A. Rowley, Shumeet Baluja, Takeo Kanade. Neural Network-Based Face
     Detection. IEEE Transactions on Pattern Analysis and Machine
     Intelligence, 20(1):23-38, Jan 1998.
[16] Sergey Brin, Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web
     Search Engine. Computer Science Department, Stanford University.
[17] www.google.com
[18] Bernard K. Gunther. Multithreading with Distributed Functional Units.
     IEEE.
[19] Zhihong Lu. Scalable Distributed Architecture for Information Retrieval.
     University of Massachusetts, Amherst.
[20] Cynthia Dwork, Ravi Kumar, Moni Naor, D. Sivakumar. Rank Aggregation
     Methods for the Web. Compaq Systems Research Center, Palo Alto, CA, USA.
[21] Zachary F. Lansdowne. Ordinal Ranking Methods for Multicriterion Decision
     Making. Economic and Decision Analysis Center.
[22] Torsten Suel, Chandan Mathur, Jo-Wen Wu, Jiangong Zhang, Alex Delis,
     Mehdi Kharrazi, Xiaohui Long, Kulesh Shanmugasundaram. Towards an Open
     and Highly Distributed Web Information Retrieval Architecture. CIS
     Department, Polytechnic University, Brooklyn, NY 11201.
[23] Brandon Cahoon, Kathryn S. McKinley. Performance Evaluation of a
     Distributed Architecture for Information Retrieval. Dept. of CS, UMass,
     Amherst.
[24] OASIS Distributed Search Engine. An Insuma GmbH White Paper.
BIOGRAPHICAL INFORMATION
Ravishankar Mysore received his Bachelor of Engineering in Industrial
Engineering from the University of Mysore, Mysore, India in 1999. After
obtaining his Bachelor's degree, he worked for Tata Consultancy Services,
India, as a Software Engineer from October 1999 to July 2000. He then pursued
his Master of Science in Computer Science & Engineering at The University of
Texas at Arlington, receiving the degree in May 2003.