
AN INTELLIGENT METASEARCH ENGINE FOR THE WORLD WIDE WEB

Andrew Agno

A thesis submitted in conformity with the requirements for the degree of Master of Science

Graduate Department of Computer Science University of Toronto

Copyright © 2000 by Andrew Agno

National Library of Canada

Acquisitions and Bibliographic Services

395 Wellington Street, Ottawa ON K1A 0N4, Canada

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

Abstract

An Intelligent Metasearch Engine for the World Wide Web

Andrew Agno

Master of Science

Graduate Department of Computer Science

University of Toronto

2000

Machine learning and information retrieval techniques are applied to metasearch on the World Wide Web as a means of providing user specific relevant documents in response to user queries. A metasearch agent works in conjunction with a user to provide daily sets of relevant documents. Users provide relevance feedback which is incorporated into future results by a choice of machine learning algorithms.

Using a fixed ranking method, the algorithms incorporating relevance feedback perform much better than those that do not. Furthermore, using heterogeneous information sources on the World Wide Web is shown to be effective in short and long term usage.

Acknowledgements

I would be much less proud of my work if it were not for the help of a number of people. I would like to first thank Grigoris Karakoulas and John Mylopoulos, my supervisors, for their guidance and support of my work. Without them, I would still be fishing for a perfect topic. Thank you also for making me look at questions instead of answers. I would also like to thank the [name illegible] group for their questions on my presentation of this thesis, which helped me focus on what questions other people would be interested in, upon seeing my work.

Some of the work in the implementation of my project used other people's software. In particular, I would like to thank Steven Brandt for the package com.stevesoft.pat, Doug Lea for util.concurrent, and Brian Chambers for his previous work in word stemming and document vectorization.

Last, but certainly not least, I would like to thank my wife Jobie, for coming to Toronto and staying with me these last two years.

3 Architecture
  3.1 Overall Architecture
  3.2 Global Data Structures
    3.2.1 Zipf's Law
    3.2.2 Stopword List
    3.2.3 Stemming
  3.3 Scalability
    3.3.1 Caching
  3.4 Term Weighting
  3.5 Topics

4 Experimental Results and Evaluation
  4.1 Description of Data Gathering Procedure
  4.2 Evaluation Framework
  4.3 TREC Measures
    4.3.1 The F3 Measure
    4.3.2 The T9U Measure
    4.3.3 The T9P Measure
  4.4 Precision of Learning Algorithms
    4.4.1 Continuous Learning vs Train/Test
  4.5 Daily Recall
  4.6 Spikes
    4.6.1 Data Gathering Gaps
    4.6.2 Flushing
  4.7 Individual Search Engine Recall

5 Conclusions and Future Directions
  5.1 Conclusions and Discussion
  5.2 Future Directions
    5.2.1 Implicit Ranking
    5.2.2 Ontologies
    5.2.3 Collaboration
    5.2.4 Alternate Document or Feature Space
    5.2.5 Thresholds
    5.2.6 Alternative Methods of Learning
    5.2.7 Miscellaneous Improvements and Directions

Bibliography

List of Figures

3.1 Architecture Diagram
4.1 Daily Document Counts
4.2 Precision for Plain Algorithm
4.3 Precision for Random Algorithm
4.4 F3 Measure
4.5 F3 Measure, Top 5
4.6 F3 Measure, Top 10
4.7 T9U Measure
4.8 T9U Measure, Top 5
4.9 T9U Measure, Top 10
4.10 T9P Measure
4.11 T9P Measure
4.12 T9P Measure
4.13 Running Average of Precision, All Topics
4.14 Precision Running Average, Various Topics, Rocchio & Grigoris
4.15 Running Average Precision, Student, Grigoris
4.16 Running Average Precision, All Topics, Continuous Training vs Train/Test
4.17 Running Average Precision, Various Topics, Continuous Training vs Train/Test
4.18 Daily Recall for Rocchio Algorithm
4.19 Daily Recall for Grigoris Algorithm
4.20 Daily Precision, All Topics, Top 10
4.21 Daily Precision, Various Topics, Top 10
4.22 Daily Precision on Students Topic, Top 30
4.23 Search Engine Recall, All Topics, Running Average
4.24 Search Engine Recall, All Topics, Running Average
4.25 Running Average of Recall of Yahoo on Palm Pilot
4.26 Running Average of Recall of Lycos on Student
4.27 Running Average Recall, MS/DOJ

Chapter 1

Introduction

1.1 Problem and Motivation

Finding information on the World Wide Web (WWW) can be difficult without some form of assistance. As estimated by Lawrence and Giles [LG99] in 1999, there were 800 million pages, an increase of 250% from their previous estimate in their 1998 study [LG98c]. Cyveillance [MM00] claims there are 2.1 billion unique and publicly available pages on the Internet. Given the size and the growth of the WWW, one can see that we need tools to help us find information. One would typically turn to a search engine, like Yahoo! [Yah] or Google [Goo]. Unfortunately, the most frequently used search engines [Sta00, Sul00a] do not always do an adequate job, due to their lack of coverage [LG99, LG98c] and their lack of ability to find the relevant documents in those that are covered.

One potential remedy is to enable searches with more "intelligence". Given a search engine, it may be imbued with "intelligence" in at least two ways: through the use of specialized, larger, or simply different information sources; or through the implementation of various machine learning or information filtering algorithms. The purpose of this work is to create an intelligent search engine by combining both of these approaches into a single search engine, drawing from machine learning and information retrieval techniques. The remainder of this chapter will deal with an overview of a particular technique for searching through heterogeneous information sources, called metasearch. The chapter also includes an explanation of various information retrieval and machine learning techniques that have been used in other work, the contributions of this work, as well as a layout of the remainder of this thesis.

1.2 Metasearch on the World Wide Web

One commonly used method for adding intelligence, even among search engines not normally thought of as such [Sul00c], is to use metasearch. For the purposes of this research, metasearch will refer to metasearch on the WWW.

1.2.1 A simple anatomy of a search engine

In the following discussion, it helps to have an idea of how search engines typically work. The portion of a search engine that the user sees is only one aspect of the entire system that makes up an engine. All the results given by a search engine come from some form of a database underlying the engine, which may list documents and information about those documents. This database is populated by certain software agents that visit WWW servers and index the documents that those servers contain. In the Harvest system [BDW95], these agents are called Gatherers, whereas in Google [BP98b], they are called crawlers. The purpose of these crawlers is to allow the collection of information about documents, also known as the indexing of documents, to proceed independently of any search that may be using the information. These crawlers must revisit documents periodically, to reindex them. This is because documents on the WWW have a tendency to change or even disappear over time. Reindexing must happen at the same time that new documents are being indexed. When a user gives a query to a search engine, the engine uses its database of rankings and documents to generate a new document that consists of a list of pointers, known as Uniform Resource Locators (URLs), to documents that have been ranked as most likely to fulfil the user's information need. The exact ranking depends on the search engine, and the ranking scheme is typically proprietary. However, search engines developed in academia do publish their ranking methods. For instance, Google uses something called a PageRank, which is defined by Brin and Page [BP98b]. This description of a search engine may also be applied to metasearch engines, with some changes.
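
The division of labour described above, in which a crawler populates an index ahead of time and queries are answered from that index alone, can be sketched roughly as follows. This is an illustrative toy only: the `TinyIndex` class, its term-frequency scoring, and the example URLs are assumptions made for this sketch, and do not represent the ranking scheme of any real engine.

```python
from collections import Counter, defaultdict


class TinyIndex:
    """Toy stand-in for the database underlying a search engine."""

    def __init__(self):
        # term -> {url: number of occurrences of the term in that document}
        self.postings = defaultdict(dict)

    def index(self, url, text):
        """Called by a crawler; reindexing a URL replaces its old entry."""
        self.remove(url)
        for term, count in Counter(text.lower().split()).items():
            self.postings[term][url] = count

    def remove(self, url):
        """Drop a document that has changed or disappeared."""
        for urls in self.postings.values():
            urls.pop(url, None)

    def search(self, query):
        """Return URLs ordered by summed term frequency (an assumed ranking)."""
        scores = Counter()
        for term in query.lower().split():
            for url, count in self.postings.get(term, {}).items():
                scores[url] += count
        return [url for url, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


index = TinyIndex()
index.index("http://example.org/a", "palm pilot software for the palm")
index.index("http://example.org/b", "palm trees and beach vacations")
print(index.search("palm software"))
```

Because `search` never touches the documents themselves, a page that changes after it was indexed keeps its old ranking until the crawler calls `index` on it again, which is the reindexing concern raised in the text.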

1.2.2 Metasearch and how it helps

The idea behind metasearch is to use multiple "helper" search engines to do the search, then to combine the results from these engines. Engines that use metasearch include Metacrawler, SavvySearch, MSN Search and Altavista, among others [Sul00c]. These helper engines form the metasearch engine's database. This approach differs from the individual search engines in that a metasearch engine does not need to crawl the WWW, although it may do so. A metasearch engine for the WWW may just verify that the documents returned by the search engines still exist. The problem of combining the results can be solved simply or by using a little cleverness. The simplistic solution has been used in search engines from Metacrawler to Dogpile. Metacrawler has changed in recent years and no longer appears to employ this simple method, but Dogpile continues to use it. The solution is to put results from separate engines under separate headings. This solution can still be seen in some engines that employ both a manually-maintained directory (in the style of Yahoo!) and a more traditional crawler-generated index of WWW documents (as in Inktomi). It is also seen in some of the less well known metasearch engines, such as SherlockHound [She] and All4one [All]. This approach has a number of problems:

• The individual search engines are treated equally, despite the fact that their results might not be equally relevant to the search at hand. This implies two things.

  - The only indication of relative importance of the results is within each search engine, not across search engines. This means that the user cannot judge which documents will be most relevant of all the results returned.

  - The number of results is typically restricted, with each search engine given a quota of documents that it can fill. This quota is the same for each search engine. If search engine A has m documents beyond its quota that are more relevant than the results from search engine B, then the m documents from search engine A that are better than the ones from B will not be shown to the user. Instead, all the results from search engine B will be shown.

• If a user searches once, then attempts the same search again, the results do not change, even though the user may have given some implicit information about documents that might be relevant. The user may have selected some documents to be viewed, and by doing so, may have given some information about potential relevance.
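
The quota problem in the list above can be illustrated with a small sketch. The engine names, the relevance scores, and the `per_engine_quota` helper are hypothetical, invented only for this example; real engines do not expose comparable relevance scores, which is part of the difficulty.

```python
def per_engine_quota(results_by_engine, quota):
    """Simple metasearch display: each engine's own top `quota` results,
    listed under a separate heading, with no comparison across engines."""
    return {engine: results[:quota] for engine, results in results_by_engine.items()}


# Hypothetical (url, true relevance) pairs, already ordered by each engine.
results_by_engine = {
    "A": [("a1", 0.9), ("a2", 0.8), ("a3", 0.7), ("a4", 0.6)],
    "B": [("b1", 0.3), ("b2", 0.2)],
}

shown = per_engine_quota(results_by_engine, quota=2)
shown_urls = {url for results in shown.values() for url, _ in results}

# Documents from A that beat every result from B yet are never displayed.
best_of_b = max(score for _, score in results_by_engine["B"])
hidden = [url for url, score in results_by_engine["A"]
          if score > best_of_b and url not in shown_urls]
print(hidden)
```

In this invented case both of engine B's weak results are displayed while two stronger results from engine A are dropped, simply because the quota is applied per engine.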

The other, more sophisticated, solution is to rerank the documents that the helper search engines returned, either by downloading the documents and analyzing them, or by using some existing preprocessed version of the document, such as the summary that occasionally accompanies the results from search engines. This approach is used in SavvySearch [HD97, DH96, Sav], in search engines from NEC Research Institute [LG98a, LG98b, GLG+99], and presumably in the new version of Metacrawler. This alleviates the problem of treating the search engines equally, but the other problem remains.
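
A minimal sketch of the reranking idea follows, assuming that each helper engine returns a URL together with a short summary and that summaries are compared to the query with a bag-of-words cosine similarity. The similarity measure, the `rerank` function, and the sample data are assumptions for illustration; the systems cited above use their own, different analyses.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rerank(query, results_by_engine):
    """Merge results from all helper engines into one list ordered by
    similarity to the query, ignoring which engine supplied each result."""
    q = Counter(query.lower().split())
    best = {}
    for results in results_by_engine.values():
        for url, summary in results:
            score = cosine(q, Counter(summary.lower().split()))
            best[url] = max(score, best.get(url, 0.0))  # de-duplicate URLs
    return sorted(best, key=lambda url: (-best[url], url))


results_by_engine = {
    "A": [("u1", "palm pilot software downloads"), ("u2", "palm beach vacations")],
    "B": [("u3", "free software for palm pilot"), ("u1", "palm pilot software downloads")],
}
print(rerank("free palm pilot software", results_by_engine))
```

The merged list places a result from engine B ahead of both results from engine A, which a per-engine listing could not express.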

As can be seen from the listing of alliances at Search Engine Watch [Sul00c], metasearch, even in its simple form, is a popular method of searching. The reasons for this include the user's desire to try multiple search engines, the user's lack of knowledge about existing search engines [GLG+99], and the user's unwillingness to visit multiple search engines. One other reason to use metasearch is that while most WWW users will try another search engine when they cannot find what they are looking for, 20% of them give up [Sul00d, Sul00b]. Using metasearch allows the user to view results from many search engines simultaneously, which may allow at least one search engine to give back relevant results before the user gives up.

Fortunately for those engines that do employ some form of metasearch, there is evidence for its efficacy. The reason that metasearch can be effective is that there are problems with plain search:

1. Individual search engines have poor indexing and reindexing times [LG99]. Reindexing time refers to the time between successive visits to a document by a search engine's crawler. This is a problem, because if a search engine's crawler does not reindex a frequently changing document, it may not be able to order URLs according to their actual data. Poor reindexing times also lead to the prevalence of dead links [LG99] in the results given by a search engine. These are pointers to documents that no longer exist, are no longer published, or are no longer accessible.

2. Ranking schemes may be poor or inconsistent [Kee92, LG98a]. In a talk given at the University of Toronto, Christian Collberg [Col00] claimed that one of the methods that Altavista used to rank documents was by age of the document. This was meant to reduce the propensity of artificially relevant documents (also known as "spam"). This leads to old documents being listed first, which is not usually the best heuristic for ranking relevant documents. The general problem of poor ranking schemes is exacerbated by users' tendencies to use short, one or two word queries [Sul00b, LG98a]. A vague query combined with a poor ranking scheme by a search engine can lead to frustration. In fact, the ranking scheme need not be poor overall, just poor on the topic that the user has in mind.

3. Individual search engines may have poor coverage due to specialization [DH96, HD97, BBC98, Leb97] or may just not have the resources to have wide coverage of the entire WWW. This means that no one search engine has a full index of the WWW. Also, since the WWW is growing exponentially [LG99], so do the computing and storage needs of search engines that maintain their own index.

4. Related to the previous point, the document crawlers that are used for Inktomi-style search engines may not be able to search the entire WWW because of its connectedness properties: as many as half the documents are not reachable from the "core" web [But00], which is presumably where most crawlers spend their time.

Metasearch can alleviate some of these problems:

1. Using multiple search engines should give overlaps in terms of the index times, translating into a shorter mean time between reindexing [LG99].

2. A metasearch engine can use its own ranking scheme, independent of the rankings of the individual search engines [DH96, HD97, LG98a]. It can also use ranking schemes that analyze the individual pages, just as individual search engine crawlers do. The metasearch engine can do this faster, since it has a smaller set of documents to analyze. Combined with document caching [BCF+98, ZFJ97], this can result in a fast metasearch using a custom, document based ranking that analyzes documents to a greater degree than the individual search engines used. Note that a problem may still exist since a metasearch engine must do the analysis in real time, instead of offline, as the crawlers do. Lawrence and Giles [LG98a] do show that it is possible to do this analysis in real time, without creating long waiting times for the user.

3. Coverage of a metasearch engine is greater because it uses multiple sources of information [LG98c, LG99], and a metasearch engine can use specialty search engines without having them adversely affect the results due to few results, or due to a slow network connection to the specialty engine. If the number of documents on the WWW continues to grow exponentially, then coverage may eventually be a problem even when using numerous search engines.

4. The last item cannot be fixed by using metasearch engines. Instead, individual search engines need to do random searching among IP addresses and ports. While the metasearch engine could do this itself, doing so removes one of its advantages from the perspective of resource usage.

Recent studies by Lawrence and Giles [LG99, LG98c] give more credence to the potential of metasearch. The papers show that individual search engine coverage is poor, covering no more than 16% of the WWW and as little as 2.2%. However, the overlap between search engines is also low, going from a range in 1998 whose values are illegible in the source to a range of [2.2%, 38.3%] in 1999. This bodes well for metasearch, despite the fact that the combined coverage appears to be decreasing over time. An estimate of the combined coverage in 1999, using 11 search engines, was only approximately 40% of the estimated size of the WWW. In 1998, with only 6 search engines, the estimated coverage was 59%. Given this, metasearch ought to be effective, as long as the individual engines have access to relevant documents and can return them to the metasearch engine.
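
The argument here rests on simple set arithmetic: if engines each index a small and mostly disjoint part of the Web, their union is much larger than any single index. The sketch below illustrates this with invented document identifiers and index contents. It does not reproduce the estimation method of Lawrence and Giles, and the `coverage` and `overlap` helpers are assumptions made for this example.

```python
def coverage(index, web):
    """Fraction of the (known) web contained in an index."""
    return len(index & web) / len(web)


def overlap(a, b):
    """Fraction of two engines' combined index that both of them share."""
    return len(a & b) / len(a | b)


# Hypothetical web of 20 documents and three hypothetical engine indexes.
web = set(range(20))
engines = {
    "A": {0, 1, 2, 3},
    "B": {3, 4, 5},
    "C": {5, 6, 7, 8, 9},
}

combined = set().union(*engines.values())
for name, idx in engines.items():
    print(name, coverage(idx, web))
print("combined", coverage(combined, web))
print("overlap A/B", overlap(engines["A"], engines["B"]))
```

With these made-up numbers no single engine covers more than a quarter of the documents, yet the three together cover half, because their indexes barely overlap.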

Metasearch is not without problems, however. For instance, metasearch can create increased network traffic, both on the global Internet and on the local network, especially for engines that perform their own ranking by downloading the actual documents that are found via the individual search engines [DH96, HD97]. The structure of the WWW [But00] is also a problem, because not all documents can be indexed by the engines used by a metasearch engine. The first problem can be alleviated by using a network bandwidth sensitive ranking scheme [DH96, HD97, MLY+99]. The second problem needs to be solved by having the WWW crawlers probe random IP addresses at port 80 (HTTP) and possibly other common or uncommon ports.

1.3 Intelligent Agents

Besides metasearch, a search engine can also turn to other algorithms for inspiration. The other source of intelligence can be obtained by looking at intelligent agents. These are software programs that undertake tasks on behalf of a user or users. Agents have been used to aid in sorting email [Mla99, Boo98], and for searching and "surfing" the WWW [Mla99, CS98, PB97, PB99, Joa97, BLG98]. The latter, the Web agents, can be categorized into ones that perform automatically without user assistance, and those that are designed to work in conjunction with the user in order to enhance the user. Those that work without feedback from the user, such as CiteSeer [BLG98] and WebMate [CS98], among others [Joa97], can be used as a search agent, where the program may learn user habits and preferences merely through observation, and thus alter search queries, or offer documents. Alternatively, as in CiteSeer, the program may still rely on user generated queries, but the user is assumed to want a specific type of information and the agent uses a specially built database to answer queries.

The other agents, those that work to enhance a user, or otherwise require user intervention or feedback, are of particular interest to this thesis. There have been a number of these systems made for viewing websites [PB99, PB97, SH99, Pol99, BSY95, BS95]. They typically require that the user give some feedback about the WWW documents that have been recommended. The agents adjust their future responses to take this new information into account. This data is stored in a profile of the user. This profile typically consists of a set of features that have some weight associated with them. Features include some form of a list of keywords [Mla99], although other possibilities exist. For example, agents may form part of a collaborative system, where a profile may simply consist of a list of documents that the user has ranked and those rankings. As Glover et al. [GLG+99] suggest, other features of a document may also be used to construct a profile. This could include preferences for length of documents, reading level, languages, age, images and URL:text ratio. The exact features used in the profile can be determined experimentally. Whatever the exact nature of the profile, it is used to give improved results to the user. However, none of the aforementioned agents has applied this learning from user feedback to the problem of metasearch.

The individual systems mentioned above vary in their learning algorithms, term weightings, feature space and intended use. For instance, Pazzani's Syskill and Webert [PB97] uses a Bayesian classifier to predict the probability of a user liking a document, whereas Balabanovic [BSY95, BS95] and Somlo [SH99] use similarity measures with respect to a profile. One interesting variation on searching the WWW is Pazzani's web site agent [PB99]. There, the agent learns only about people visiting a site, as well as the patterns of site navigation of all users and the patterns of the hypertext linkage structure to some degree. The linkage structure is merely the way in which documents refer to each other through hyperlinks. The narrow scope of that agent's responsibilities is not the purpose of this work. However, it does provide an interesting contrast in terms of usage. Polcicova [Pol99] demonstrates a collaborative recommender system in which user profiles are simply the relevance rankings of various documents. These profiles are compared between users, and for any given user, the system recommends documents that a similar user found relevant. None of the agents give the user access to a large number of information sources from which to gather relevant documents, using only a single search engine, or a single WWW site in each case.
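
As a rough illustration of a profile made of weighted features, the sketch below scores documents against a keyword profile with a cosine similarity, in the spirit of the similarity-based agents mentioned above. The profile weights, the documents, and the `score` function are hypothetical; none of the cited systems is claimed to work exactly this way.

```python
import math
from collections import Counter


def score(profile, text):
    """Cosine similarity between a weighted keyword profile and a document."""
    doc = Counter(text.lower().split())
    dot = sum(weight * doc[term] for term, weight in profile.items())
    norm = (math.sqrt(sum(w * w for w in profile.values()))
            * math.sqrt(sum(c * c for c in doc.values())))
    return dot / norm if norm else 0.0


# Hypothetical profile: positive weights for liked terms, negative for disliked.
profile = {"palm": 1.0, "pilot": 1.0, "software": 0.5, "vacations": -1.0}

documents = {
    "d1": "free software for the palm pilot",
    "d2": "palm beach vacations",
    "d3": "stock market news",
}
ranked = sorted(documents, key=lambda d: -score(profile, documents[d]))
print(ranked)
```

The negative weight on "vacations" cancels the match on "palm" in the second document, which is one simple way a profile can separate two senses of the same query term.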

1.3.1 Relevance Feedback

Relevance feedback is a term used in information retrieval to describe a special type of feedback loop. This feedback loop requires that the user of a system make judgements about the relevance of documents returned by the system. The system, in this case, is one that is designed to return documents to a user based on queries that the user gives to the system. The feedback obtained in this manner is used to determine the profile used in the next iteration of document retrieval. The exact nature of this feedback varies with different systems [Roc71, BBC98, CS98, BSY95, BS95, DH96, SB90, BSA94, Joa97] and may even be implicit [BBC98, KMM+97, DH96]. Explicit feedback may be a simple boolean, representing relevant and irrelevant documents, or may be a range of values beyond the boolean 0 and 1. Ranges of feedback values may present the user with a psychological or perhaps a user interface difficulty. Specifically, if some document, D, appears multiple times, it may be given rankings that actually depend on the relevance value of the documents that were listed alongside D at the time of ranking [BSY95]. If this is the case, the range of values may not give any more information than the boolean ranking system, and may in fact hamper learning due to the inconsistency of user rankings.

With implicit feedback, feedback is assigned by the system based on whether a document was viewed by the user and possibly based on how long a user viewed a document. The advantage of this implicit feedback is that, from the point of view of the user, the interface is transparent. The system can learn a user's preference without the user having to intervene. The problem with using implicit feedback is that the feedback is somewhat more difficult to interpret, as a visited document is not necessarily a relevant one. This causes potential inconsistencies in the rankings, which is why it is not implemented in the system described here.

One added complexity that may be introduced with relevance feedback is incremental feedback. In most of the previous work, feedback processing was done in a batch fashion. That is, the entire corpus, and its relevance rankings, were known before testing began, and the entire corpus (or some training subset) could be given to the feedback algorithm; the algorithm could have complete knowledge and could process the feedback in one pass. This is the context in which the original Rocchio algorithm for relevance feedback was created [Sal71]. Incremental feedback, on the other hand, requires only that a portion of the corpus be judged before applying changes to the profile. Additional portions may be judged and the profile changed as time progresses, with little or no knowledge of previous rankings. Fortunately, Rocchio and Ide [Ide71] style algorithms for relevance feedback (herein termed "traditional relevance feedback algorithms") may be applied in an incremental fashion [All96, Cal98, IJA92]. In fact, Allan found that "keeping a small number of terms can actually improve performance over full feedback ... almost any number of terms works well." Full feedback refers to feedback in which, at any time t, the system is given all ranked documents seen until time t. Harman [Har92] suggests the use of 20 terms in a full feedback environment. Allan shows that traditional relevance feedback style algorithms work well as long as some context, possibly a dynamically changing context, is maintained from previous iterations. This provides a method to perform online learning, instead of batch learning. This result is necessary in order to validate the use of relevance feedback as applied to metasearch.
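
To make the incremental setting concrete, here is a minimal sketch of a Rocchio-style update applied one batch of judgements at a time, with only a small number of terms carried forward as context. The coefficients `alpha`, `beta` and `gamma`, the default of 20 retained terms, and the function name are conventional or illustrative assumptions, not the parameters used in this thesis or in the cited experiments.

```python
from collections import Counter


def rocchio_update(profile, relevant, nonrelevant,
                   alpha=1.0, beta=0.75, gamma=0.25, keep=20):
    """One incremental Rocchio-style step.

    profile     -- dict term -> weight carried over from earlier iterations
    relevant    -- list of term-frequency dicts judged relevant this round
    nonrelevant -- list of term-frequency dicts judged irrelevant this round
    keep        -- number of highest-weighted terms retained as context
    """
    updated = Counter()
    for term, weight in profile.items():
        updated[term] += alpha * weight
    # Add the mean relevant vector and subtract the mean irrelevant vector.
    for docs, sign, coeff in ((relevant, 1, beta), (nonrelevant, -1, gamma)):
        for doc in docs:
            for term, tf in doc.items():
                updated[term] += sign * coeff * tf / len(docs)
    # Keep only the strongest positive terms for the next iteration.
    top = sorted(((t, w) for t, w in updated.items() if w > 0),
                 key=lambda tw: (-tw[1], tw[0]))[:keep]
    return dict(top)


profile = {"palm": 1.0}  # the user's initial, general query
day1 = rocchio_update(profile, [{"palm": 2, "pilot": 1}], [{"palm": 1, "beach": 3}])
day2 = rocchio_update(day1, [{"pilot": 1, "software": 2}], [], keep=2)
print(day1)
print(day2)
```

Each call sees only the current profile and the newly judged documents, so earlier documents never need to be stored; the truncation to `keep` terms is the "small number of terms" idea from the passage above.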

Query Expansion

Typically, relevance feedback results in an altered query. This query is identical to the profile mentioned above (earlier in section 1.3). The purpose of the expanded query is to give a more precise query, based on previous user rankings, for the next iteration. This query would also be used to order the pages in terms of relevance, for the user to view. This approach works well in certain experiments. However, in the context of metasearch, this approach does not work at all. For instance, some search engines limit queries to only 10 terms, and some seem to ignore long queries. Those that do accept long queries do not return many results. Expanded queries typically use stemmed versions of words. These are words that have had their suffixes stripped from them. Any stemming in the query assumes that the search engines will understand the stemmed term or terms, which is not necessarily true. In fact, search engines like Google do not do stemming, and others allow stemming only as an advanced option; even with stemming as an option, however, search engines cannot be relied upon to understand stemmed terms given as part of the query. Even if the stemmed terms were expanded into the words that were originally seen to create the stemmed terms, the queries would be increased in length, contributing to the previous problem. This is different from most of the literature on searching a corpus of documents, but its omission here is supported by the profile generating intelligent agents (early in section 1.3).
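
The two obstacles named above, stemmed terms that an engine does not recognize and limits on query length, can be shown with a toy example. The `crude_stem` function is a deliberately rough stand-in for a real stemmer, and `exact_match_engine` is a hypothetical engine that matches whole words only and ignores terms beyond a fixed limit; neither models any specific search engine.

```python
def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def exact_match_engine(query_terms, document, max_terms=10):
    """Hypothetical engine: whole-word matching, terms past the limit ignored."""
    words = set(document.lower().split())
    return [t for t in query_terms[:max_terms] if t in words]


document = "searching engines index pages"
original = ["searching", "engines"]
stemmed = [crude_stem(t) for t in original]

print(stemmed)
print(exact_match_engine(original, document))
print(exact_match_engine(stemmed, document))
```

In this sketch the stemmed query matches nothing, even though the unstemmed query matches the same document on both terms.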

1.3.2 Browsing and Searching

A user surfs the WWW. There are generally two different methods of using a search engine to do this surfing: browsing and searching. The difference between the two is identified by the user's interest. A user who wants detailed information on a specific topic is searching, whereas a user who wants an overview of a topic, or even multiple topics, is browsing. One may also distinguish the two by the user's conceptual model of a topic. If the user is looking for information on, for example, Palm Pilots¹, then this is probably browsing. A user interested in a specific subtopic, such as free productivity software for the Palm Pilot, is probably doing a search.

¹ All product names and company names mentioned in this document are the trademarks of their respective holders.

According to Search Engine Watch [Sul00d], 70% of people know specifically what they are looking for when they use a search engine. However, this does not mean that they can articulate this knowledge in the form of an appropriate query. Users typically have a specific meaning in mind when using a term in a query. This meaning is not necessarily the same one that search engines assign the word. For instance, the term "palm" might be found in a document on handheld computer organizers as well as a document on vacations. Furthermore, Butler [But00] says that most users only have a general query in mind. These two views, of a user who knows specifically what they are looking for and of one who does not, are not easily reconcilable. The evidence suggests, however, that most users only enter a general query. This is supported by the fact that 30% of searches are done using only a single word in the query [Sul00b], according to Search Engine Watch. In another study, Lawrence and Giles [LG98a] found that almost half of user queries contained one term, and almost 80% of queries were one or two terms. While the query entered into search engines may be general, this author is inclined to believe that the user has a fairly clear idea of what they are looking for. They may not always be able to express this in the form of a query, but they can certainly identify documents that are and are not relevant. Any other view of a user means that they are entering random queries to the search engine.

The work in this thesis assumes that the user does have a fairly good idea what she is looking for, and is able to identify these documents. This work also assumes that the user will only enter a general query. From there, the system will learn a more specific query that corresponds to the user's internalized and unasked query. This corresponds to the specific search aspect of surfing the WWW. It should be possible to use the system described here to perform the browsing aspect of surfing, but this was not one of the items investigated.

Our Approach

The system presented in this thesis is meant to address some of the shortcomings of other WWW search engines, and in particular, other metasearch engines for the WWW.

• We will use the more sophisticated of the metasearch techniques by using a unified ranking scheme across all search engines.

• We will allow the search engine to adapt to a user's preferences over time.

• The feedback from users will be evaluated on a daily basis, in an incremental fashion. Thus, there will be no strict tuning period to generate a correct profile, as in Somlo's engine [SH99].

• The system may be used in either a server based mode, or may be used as a single client on the user's machine.

• The search engine will generate a specific profile from a user formulated general query.

• User profiles will be generated which represent a prototypical document. The user profile will be used to determine similarity with documents that are retrieved by the search engine.

Usage Scenarios

The scenario in which the search engine described in this thesis was designed to be used is the following. A user has a query that they wish to make over a period of days or even weeks. The user may be looking for information on a topic she knows little about, or she may be looking for a certain type of information, but does not know how to specify the information as a search query. As an example of the former, suppose the user is looking for student life at the University of Toronto. She is not sure what this might entail, but is certain of what it does not, and hopes that the search engine will help to determine the scope of the topic. The other type of search is where a user knows what to look for, but enters only a vague query. For example, a user might be interested in looking for free productivity software for the Palm Pilot™, and might do so for three to five days; this user might only enter "palm pilot" as a query. As another example, a user might want to retrieve information about the most recent Microsoft antitrust trial in the USA, from its beginning to the current events. In this case, the user might enter a query such as "microsoft doj." In both these queries, the topic is general enough to include documents pertaining to other, nondesirable events or topics. The metasearch engine should be able to detect user preferences and dynamically alter the manner in which it orders pages for user viewing.

1.3.4 Questions

There are several questions that testing of the engine will answer:

• Can learning a user profile increase the relevance of the results as returned by the metasearch engine?

• Are certain search engines better to use than others on certain topics?

• Is metasearch more effective than ordinary search, in the context of relevance feedback?

1.3.5 Contribution

The work presented here will provide evidence for the efficacy of an adaptive learner for the problem of metasearch on the WWW. It will provide evidence that search engines can be used for more than merely one time searches, but can also be used for the purposes of tracking a topic over time. We will confirm that metasearch is actually effective, and show that even combining search engines that use similar techniques is worthwhile. Furthermore, some reasons that metasearch is effective in the temporal context of this work will be shown.

1.4 The Rest of the Story

The remainder of this thesis will discuss the user's interaction with the system, delving into representations of the user's ideas and methods by which one can alter these representations. Chapter 3 will examine the architecture and implementation details of the engine, including remarks about the data structures used and the scalability of the system as a whole. Certain shortcomings of the system will also be shown here, as well as some possible ways to overcome them. The results of a variety of experiments will be shown in Chapter 4, in an attempt to answer the questions posed in section 1.3.4. Chapter 4 also presents some analysis of the data. Chapter 5 summarizes the findings, answering the questions given above, answering possible exceptions to the work done, and comments on extending this work.

Chapter 2

User Interaction and Query

Processing

This chapter describes the user's interaction with the system, and the results of those interactions. It also describes representations of the user that the system maintains, and the methods through which these representations are maintained.

2.1 Profile and Document Representation

As in the other systems mentioned earlier in section 1.3, the system uses a profile to keep track of user preferences. In this thesis, a profile can be viewed as a prototype document, or perhaps a union of prototype documents. A document is represented by a set of weighted words and phrases (collectively called terms) coming from a document space [SM83]. This document space is a multidimensional space, having one axis per word or phrase that is accepted by the system. This space may either be known beforehand, or may grow over time, as new terms are encountered. Terms that are in the document space at time t are called the dictionary at time t; this dictionary is static if the document space is already known. Given a document space, $D$, consisting of terms, $D = (t_1, t_2, \ldots, t_{|D|})$, a document is a vector, $V$, in this space with non-negative weights $W = (w_1, w_2, \ldots, w_{|D|})$, $w_i \ge 0$: $V = W \cdot D$. It is useful to have the sparse representation of the vector. That is, take $W' = (w'_1, w'_2, \ldots, w'_n)$, $w'_i > 0$, where $n$ is the number of positive weights, $D' = (t'_1, t'_2, \ldots, t'_n)$, the terms with positive weights, and $V' = W' \cdot D'$. For a document, $d$, $V_d$ indicates the presence or absence of terms in the document, and may indicate the significance of those terms in the document. The manner in which weights are assigned can vary between implementations of any system employing such a representation [Mla99, HC93], and may range from a simple boolean only, to a real number representing the weight of the term. The exact method used in this thesis is described in Chapter 3.
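The sparse representation described above can be sketched as a mapping from terms to positive weights, so that only the n terms with nonzero weight are stored. This is a minimal illustration, not the thesis implementation: the function names are hypothetical, and raw term counts stand in for the actual weighting scheme, which is described in Chapter 3.

```python
from collections import Counter


def sparse_vector(tokens, dictionary=None):
    """Build a sparse document vector: term -> positive weight.

    Raw counts are placeholder weights (an assumption for illustration).
    If a dictionary is given, terms outside it are dropped, which models
    a document space that is known beforehand.
    """
    counts = Counter(tokens)
    if dictionary is not None:
        counts = Counter({t: c for t, c in counts.items() if t in dictionary})
    return dict(counts)


def to_dense(vector, dictionary):
    """Expand a sparse vector to one weight per axis of the document space."""
    return [vector.get(term, 0) for term in dictionary]


doc = sparse_vector(["palm", "pilot", "software", "palm"])
dense = to_dense(doc, ["free", "palm", "pilot", "software"])
```

Storing only the positive weights keeps each vector small even when the dictionary grows over time, since new axes never need to be added to existing documents.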

2.2 The User Query

The query sent to the individual search engines is always the original query that the user gives. This is partly due to the fact that query expansion does not work in the metasearch context, as explained in section 1.3.1. Also, given the assumption of a general query, this allows the individual search engines to retrieve many documents. Having many documents for the search engines to retrieve is important because of the repetitive nature of the query: the user will input the same query over a number of days, but only wants to see new or changed documents. A large pool of documents ensures that there will be many new documents, even if only a few of them change over time. Also, the purpose of this metasearch engine is not to learn the optimal ranking for a finite set of documents for a specific user. Rather, it should continually adjust to the user's preferences, and learn the best ranking for future documents that have not been seen. This can only be tested if the metasearch engine can examine only a portion of a large pool of documents at one time.

2.2.1 User Query and Ontologies

The user's query is expected to fit into some ontology. This is similar to Glover et al.'s "information needs" [GLG+99], and DualNAVI's "categories" [NIHT99]. The user manually selects an appropriate topic in which to place the query. These topics come from some existing ontology. Currently, this ontology exists only for those queries that base the categorization of queries. The use of this ontology also allows the system to generalize or specialize to a new profile, given the old ones. Some additional thought would be required to determine exactly how this would be done, as further detailed in Chapter 5, which describes possible future work.

2.3 Ranking

Two different entities rank documents: the user, and the metasearch engine. The metasearch engine ranks documents it receives from the individual search engines. It does this so that it may present the user with a list of ranked document URLs. The ranking is done by making comparisons of documents to the current profile. Since the profile is merely a vector in the document space, then the profile, $P$, and the document, $D$, may be compared by measuring the cosine of the angle between the two vectors:

$$\cos\theta = \frac{P \cdot D}{\|P\|\,\|D\|}$$

If the document vectors are normalized, relative document rankings for a specific profile are preserved with:

$$P \cdot D = \sum_i P_i D_i$$

Other transformations may be made prior to the normalization. For instance, in the engine described here, all vector weights are first made to sum to one by dividing each vector by the sum of the weights. Thus, the actual similarity measure used is:

where $P_i$ and $D_i$ are the weights for term $i$ in $P$ and $D$, respectively.
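The exact similarity formula is not recoverable from this copy of the text, so the following is only a plausible sketch of the ranking step: each vector is first scaled so that its weights sum to one, and the scaled vectors are then compared by cosine. The function names are hypothetical. Note that cosine is unaffected by scaling, so the unit-sum step changes the result only if the final measure is a plain dot product.

```python
import math


def scale_to_unit_sum(vector):
    """Divide every weight by the sum of weights so that they sum to one."""
    total = sum(vector.values())
    if total == 0:
        return dict(vector)
    return {term: w / total for term, w in vector.items()}


def cosine(p, d):
    """Cosine of the angle between two sparse vectors (term -> weight)."""
    dot = sum(w * d.get(term, 0.0) for term, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_p == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_p * norm_d)


def similarity(profile, document):
    """Scale both vectors to unit sum, then compare them by cosine."""
    return cosine(scale_to_unit_sum(profile), scale_to_unit_sum(document))


def rank(profile, documents):
    """Return document ids ordered from most to least similar to the profile."""
    return sorted(documents,
                  key=lambda doc_id: similarity(profile, documents[doc_id]),
                  reverse=True)
```

For a profile containing "palm" and "software" with equal weight, a document about Palm software ranks above one about palm-tree vacations, which in turn ranks above a document sharing no terms with the profile.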

The highest ranked documents are then shown to the user as a list of URLs; the exact number returned depends on the algorithm used (see section 2.4). Once this is done, the user may give rankings. The user looks at the documents given by the metasearch engine and ranks them as either relevant or nonrelevant. The user's ranking is based on a set of criteria which depends on the topic. One important global rule was that documents that had been seen before must have changed in a relevant manner in order for the document to be marked relevant again. This criterion is important, because otherwise, a fixed set of documents would always be returned to the user. These documents would be those that were accessed from a database, or otherwise dynamically generated, documents that had their date changed on a daily basis but no other changes, and documents on servers that gave incorrect dates for the document's datestamp. This is because the page fetchers, as described in Chapter 3, fetch those documents whose datestamp or checksum has changed, if the document has been seen before. The exact number of documents ranked varies according to the learner that is in use and the threshold the learner uses for determining when to stop giving documents to the user. This number is a maximum of 30 documents.
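The changed-document rule mentioned above (a previously seen document is fetched again when its datestamp or checksum differs) can be sketched as follows. The function names, the in-memory `seen` table, and the choice of SHA-256 are assumptions made for illustration; the text does not say which checksum the engine used.

```python
import hashlib


def checksum(content: bytes) -> str:
    """Hex digest of the document body (SHA-256 is an assumed choice)."""
    return hashlib.sha256(content).hexdigest()


def classify(url, datestamp, content, seen):
    """Return 'new', 'changed', or 'unchanged', and record the document.

    `seen` maps url -> (datestamp, checksum) from earlier fetches.
    A document counts as changed if either value differs.
    """
    digest = checksum(content)
    previous = seen.get(url)
    seen[url] = (datestamp, digest)
    if previous is None:
        return "new"
    if previous != (datestamp, digest):
        return "changed"
    return "unchanged"
```

Because a change in the datestamp alone is enough to mark a page as changed, pages that only update their date each day are re-presented, which is why the user-side rule about relevant changes is needed.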

2.4 Profile Adjustment

2.4.1 Relevance Ranking

The ways in which profiles are altered depend on the system in use, but are generally variations on Rocchio's algorithm [Roc71, SB90]: the profile at time $t + 1$, $Q_{t+1}$, can be obtained through a function, $f$: $Q_{t+1} = f(Q_t, R_t, S_t)$, where $R_t$ and $S_t$ are the sets of relevant and nonrelevant documents, respectively. The actual algorithm used in Rocchio's original formulation [Roc71] was:

$$Q_{t+1} = Q_t + \frac{1}{|R_t|} \sum_{R_i \in R_t} R_i - \frac{1}{|S_t|} \sum_{S_i \in S_t} S_i$$

Generally, variations of Rocchio's algorithm are variations on this formula, using different weights for $Q_t$ and the two sums. For instance, Allan [All96] uses 2 and 1/2 instead of the inverse cardinalities of the sets as the weights for the two sums:

$$Q_{t+1} = Q_t + 2 \sum_{R_i \in R_t} R_i - \frac{1}{2} \sum_{S_i \in S_t} S_i$$

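Aalbersberg [IJA92] cites Salton and Buckley [SB90] and Ide [Ide71] for the general forms:

$$Q_{t+1} = \alpha Q_t + \beta \sum_{R_i \in R_t} R_i - \gamma \sum_{S_i \in S_t} S_i$$

where $\alpha$, $\beta$, $\gamma$ are variables.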
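The family of updates described above, in which the previous query and the two sums each receive a weight, can be sketched over sparse vectors as follows. This is an illustrative sketch with hypothetical function names; `rocchio_original` instantiates the general form with the inverse cardinalities of the two sets, as the text describes for Rocchio's original formulation.

```python
def rocchio_update(query, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Q_{t+1} = alpha * Q_t + beta * sum(relevant) - gamma * sum(nonrelevant).

    All vectors are sparse mappings term -> weight.
    """
    updated = {term: alpha * w for term, w in query.items()}
    for doc in relevant:
        for term, w in doc.items():
            updated[term] = updated.get(term, 0.0) + beta * w
    for doc in nonrelevant:
        for term, w in doc.items():
            updated[term] = updated.get(term, 0.0) - gamma * w
    return updated


def rocchio_original(query, relevant, nonrelevant):
    """General form with beta = 1/|R_t| and gamma = 1/|S_t|."""
    beta = 1.0 / len(relevant) if relevant else 0.0
    gamma = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    return rocchio_update(query, relevant, nonrelevant, 1.0, beta, gamma)
```

Terms that appear mostly in nonrelevant documents end up with negative weights; what to do with such terms is a separate design decision that each variant handles in its own way.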

For each separate query that the user enters, a profile is created. Initially, the terms in the profile are just the terms in the user query. The weights in the terms of the profile always sum to one. The profile is limited to a maximum of 20 single word terms and 4 two word terms (ie: phrases). Documents are represented in a similar fashion, but are limited to a maximum of JO single word terms and 10 phrases. Depending on the learner, the profile gets updated in different ways. Two different learners were used:

Rocchio variant Here, the profile at time $t + 1$, $P_{t+1}$, may be obtained from $P_t$ and the sets of relevant and nonrelevant documents as follows:

$$P_{t+1} = P_t + |S_t| \sum_{R_i^t \in R_t} R_i^t - |R_t| \sum_{S_i^t \in S_t} S_i^t$$

where $R_t$, $S_t$ are, respectively, the sets of relevant and nonrelevant documents as judged by the user at time $t$. $R_i^t$ is the VSM for relevant document number $i$. Similarly, $S_i^t$ is the VSM for nonrelevant document number $i$. This is similar to Rocchio's original variant [Roc71], modified in the weight assigned to the previous query and in the fact that the algorithm is applied in an incremental fashion. When using this algorithm, up to 30 documents are returned to the user for relevancy ranking. Documents whose system ranking is less than or equal to 0 are not shown to the user.

The algorithm is further modified as suggested in Rocchio's original formulation [Roc71], by accepting a term for inclusion in $P_{t+1}$ if and only if its weight is greater than 0, and the term was either in $P_t$, or was in more relevant vectors than nonrelevant vectors. Both these measures have the same effect: they eliminate terms that have little discriminatory power. The former technique also eliminates those terms that are only able to identify irrelevant documents.

Ide variant This method is a variation due to Karakoulas [KF98] and is a generalization of other variations [All96, BSA94, Ide71]. Here, the new profile $P_{t+1}$ is obtained as follows:

$$P_{t+1} = \alpha P_t + \beta \sum_{R_i^t \in R_t} R_i^t - \gamma \sum_{S_i^t \in S_t} S_i^t$$

where $R_t$, $S_t$ are as before and $\alpha$, $\beta$, $\gamma$, $\delta$ are predetermined constants, and where $\beta = \delta$ for all terms $w_i \in R_t$ such that $w_i$ is not in the query at time $t$. The constants are determined through experimentation. In another variation, $\alpha$, $\beta$, $\gamma$, $\delta$ may actually be variables that are determined as time progresses. However, for the purposes of these experiments, they are constant. As before, when using this algorithm, up to 30 documents must be ranked by the user, using the same criteria as in the Rocchio style algorithm. Only positive, non-zero weighted terms are accepted in $P_{t+1}$. This method will be referred to by the name "Grigoris".
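The Rocchio variant described above, together with its term acceptance rule, can be sketched as follows. The sketch reads the update as P_{t+1} = P_t + |S_t| * sum(R_i) - |R_t| * sum(S_i), which is an interpretation of a formula that is partly illegible in this copy, and the function name is hypothetical. The limits on the number of single word terms and phrases are not enforced here.

```python
def rocchio_variant(profile, relevant, nonrelevant):
    """One incremental profile update, followed by term filtering.

    Assumed update: P_{t+1} = P_t + |S_t| * sum(R_i) - |R_t| * sum(S_i).
    A term is kept only if its weight is positive and it was either in the
    old profile or occurred in more relevant than nonrelevant vectors.
    The surviving weights are rescaled to sum to one.
    """
    n_rel, n_non = len(relevant), len(nonrelevant)
    updated = dict(profile)
    rel_count, non_count = {}, {}
    for doc in relevant:
        for term, w in doc.items():
            updated[term] = updated.get(term, 0.0) + n_non * w
            rel_count[term] = rel_count.get(term, 0) + 1
    for doc in nonrelevant:
        for term, w in doc.items():
            updated[term] = updated.get(term, 0.0) - n_rel * w
            non_count[term] = non_count.get(term, 0) + 1
    kept = {
        term: w for term, w in updated.items()
        if w > 0 and (term in profile
                      or rel_count.get(term, 0) > non_count.get(term, 0))
    }
    total = sum(kept.values())
    return {term: w / total for term, w in kept.items()} if total else {}
```

Under this reading, each sum is weighted by the size of the opposite set, so a day on which every document is judged relevant (or every document nonrelevant) leaves the profile weights unchanged apart from filtering and rescaling.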

Chapter 3

Architecture

This chapter explains the architecture of the system implemented, comments on the scalability of the system, explains some algorithms used, and lists the topics used for the next chapter.

3.1 Overall Architecture

Adaptive information filtering and metasearch are both asynchronous tasks. The architecture reflects this. Figure 3.1 shows the architecture of the metasearch engine. An explanation of the symbols used follows.

Circle An object in the system, operating in the same machine as all other circles; overlapping circles represent multiple objects working concurrently in separate threads of execution.

Square An object in the system, operating on a specific machine. Overlapping squares mean that several object instances are working concurrently, possibly on different machines. If on the same machine, instances work in different threads in the same process.

Rectangle Rectangles that are not also squares represent queues or user interfaces. Rectangles with a horizontal orientation are queues, while a vertical orientation indicates the user interface.

Oval The oval represents the main client; this is the client that exists on the same machine as the user interface.

As can be seen from the diagram, one client communicates with multiple processes on multiple machines in order to collect documents for user ranking. Each one of those processes creates a series of queues and queue managers to handle various data transformations. The queue managers operate in parallel with each other, processing data as it becomes available in the queues. Each queue manager actually passes the data in the queue to a multithreaded pool of workers. One worker in the pool will perform transformations on the data before putting it into the next queue. What follows is an explanation of figure 3.1.

1. The user interface is responsible for accepting the initial query of the user.

2. The query is passed to the client program.

3. The client program passes the query to the search engine selector, which selects a set of search engines to use. This allows for dynamic selection of search engines, possibly in conjunction with a learning algorithm.

4. The set of search engines is passed to multiple page retrievers, possibly on different machines. The machines are selected in a random order, and given a random number of search engines to query. The maximum number of search engines on a single machine is an adjustable parameter. Each page retriever handles only one engine. Each of the following steps until step 10 occurs in each page retriever, and each page retriever has its own copy of the various objects and queues, except for global data structures, which are outlined in section 3.2.

Figure 3.1: Overall architecture, explanation in section 3.1

5. The search engine extractor works in tandem with the page fetchers. It requests URLs corresponding to the user's query from a search engine, and continues to do so until it finds $0 new or changed documents, or no more documents are available from the individual search engines. Changed documents are those that have been fetched before, but have changed since last fetched. This is determined using a combination of a datestamp and a checksum for the document. The page fetchers tell the search engine extractor the number of new or changed documents that have been found. The search engine extractor puts URLs into the URL queue.

6. The URL manager removes URLs from the URL queue and passes them to a worker in the page fetcher pool. Java's threading model requires that an asynchronous page fetcher use a helper object, so that page downloading may be halted. The page fetcher asks a helper to download the document given the URL, and passes information about new or changed documents to the search engine extractor. The document information is then entered into the page information queue.

7. The page information manager removes documents from the page information queue and passes them to one of the workers in the page analyzer pool. The analyzers extract and rearrange some of the data from each document, transforming the document into an SGML document that the backend instances can use. The newly formatted document is entered into the backend queue. At this stage, documents with non-English characters are stripped to include only English characters, if any exist. Furthermore, the document is split into a series of URLs, the URL text, the title, and the rest of the document.

8. The backend manager extracts the document data from the backend queue and passes it to one of the workers in the backend instances pool. Here, a backend instance transforms the document into a vector representation of the document, along with various tags assigned to the document by the page analyzers. The vectors are then entered into the analyzed VSM queue. The backend was originally created by Brian Chambers [Cha99], and modified to allow for multithreaded use and WWW documents (ie: documents in HTML).

10. The page retriever collects the data in the analyzed VSM queue and sends the data to the client, based on a client request for the data.

11. The client then formats the data and orders it according to the learning algorithm in use, and presents it to the user for ranking.
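The chain of queues, queue managers, and worker pools in steps 6 to 8 can be sketched as a sequence of stages, each of which takes items from one queue, transforms them in a small pool of threads, and puts the results into the next queue. The engine itself was written in Java; the Python sketch below only illustrates the pattern, and every name in it (`start_stage`, `run_pipeline`, the three transform callbacks) is hypothetical.

```python
import queue
import threading

_STOP = object()  # sentinel telling a stage that no more input will arrive


def start_stage(transform, inbox, outbox, workers=2):
    """Start worker threads that move items from inbox to outbox.

    Each worker applies `transform` to an item and queues the result.
    Returns the list of started threads.
    """
    def work():
        while True:
            item = inbox.get()
            if item is _STOP:
                inbox.put(_STOP)  # let sibling workers see the sentinel too
                break
            outbox.put(transform(item))

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    return threads


def run_pipeline(urls, fetch, analyze, vectorize):
    """URL queue -> page information queue -> backend queue -> VSM queue."""
    url_q, page_q, backend_q, vsm_q = (queue.Queue() for _ in range(4))
    stages = [
        (start_stage(fetch, url_q, page_q), page_q),
        (start_stage(analyze, page_q, backend_q), backend_q),
        (start_stage(vectorize, backend_q, vsm_q), vsm_q),
    ]
    for url in urls:
        url_q.put(url)
    url_q.put(_STOP)
    # Shut the stages down in order: once a stage has finished,
    # its downstream queue receives the sentinel.
    for threads, downstream in stages:
        for thread in threads:
            thread.join()
        downstream.put(_STOP)
    results = []
    while True:
        item = vsm_q.get()
        if item is _STOP:
            break
        results.append(item)
    return results
```

All stages run concurrently, so a slow download in the first stage does not prevent documents that have already been fetched from being analyzed. The order of the results is not guaranteed when a stage has more than one worker.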

All communication between machines is done via a message passing system built on top of Java's Remote Method Invocation (RMI). This message passing system allows synchronous and asynchronous communication with mechanisms in place to allow agent communication languages such as KQML [LF97, FLM97, FFME]. However, such languages are not required for use, and a much simpler protocol was used here. The global data structures are shared through Java's RMI subsystem.

3.2 Global Data Structures

There are two global data structures that are used for all queries and across all machines. The first of these is the document frequency table and the second is the stopword list. The former is a list of all the terms that have been seen until the current time, t. It also has a mapping of terms to the frequency of the term in the documents seen until time t. This structure forms the dictionary that is used by the system, and is initially empty. If this dictionary represented the entire document space, and thus, all words and two word phrases in the English language, the dimensionality of the document space would inhibit useful learning in addition to causing scalability problems. This problem is handled first by the dynamic generation of the document frequency table, which ensures that only those words that have been seen before are included in the dictionary. There are three other methods that are used to handle dimensionality.

3.2.1 Zipf's Law

The first of these is an application of Zipf's law. This law, as cited by Sahami [Sah98], states that words that occur infrequently in a corpus of documents have little discriminating power between documents. Such words may help to identify individual documents, but do little else. Since the purpose of the document VSMs is to order documents relative to each other, these words may be discarded, as they provide no useful information in this context. As an example of this, Sahami estimates that words occurring only once in a corpus will provide half of the unique terms in the corpus. This is only an approximation, but aids in keeping the dictionary small. This application of Zipf's law may be done at intervals, when a large number of documents, such as 10000, have been seen. This ensures that such terms really do fall into the category of unique but nondiscriminatory. Terms that are in this category are placed into a dynamically generated stopword list.
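The periodic pruning described above can be sketched as follows. The figure of 10000 documents comes from the text; the cut-off of one document per term is an assumption based on the example of words that occur only once, and the function name is hypothetical.

```python
def prune_rare_terms(doc_freq, docs_seen, min_docs_seen=10000, max_rare_df=1):
    """Move rarely seen terms out of the document frequency table.

    `doc_freq` maps term -> number of documents containing the term and is
    modified in place. Nothing is pruned until `min_docs_seen` documents have
    been processed. Returns the set of terms to add to the dynamically
    generated stopword list.
    """
    if docs_seen < min_docs_seen:
        return set()
    rare = {term for term, df in doc_freq.items() if df <= max_rare_df}
    for term in rare:
        del doc_freq[term]
    return rare
```

Waiting until many documents have been seen matters because every term has a document frequency of one when it is first encountered, so pruning too early would discard terms that later turn out to be common.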

3.2.2 Stopword List

The stopword list is used when creating VSMs, and is used by the system to remove words from a document's vector representation that appear in the stopword list. Complementing the dynamically generated portion of the stopword list, described in section 3.2.1, is a static stopword list. This list is created before the dictionary is built, and consists of a list of words to exclude from the dictionary and thus the document space. This list is meant to consist of commonly used words that are known to provide little or no discriminatory power. It includes articles like "the" and "a," and common conjuncts like "and" and "or."

3.2.3 Stemming

The last mechanism to prevent excess dimensionality is stemming. Here, words with similar roots are combined in the document space. For instance, the words "accessory," "accessories," and "accessorized" are identified to the term "accessori" during the transformation of a document to a vector in the document space. Thus, the document space

has fewer dimensions. Terms that are

two word phrases have stemming applied to each individual word in the phrase. This is accomplished through a suffix stripping routine [Por80], where words are systematically reduced to a "root" word, which may or may not correspond to an actual word in the English language. The fact that terms in the profile consist of stemmed terms adds to the inability to use an expanded query in successive searches with search engines (see section 1.3.1).

3.3 Scalability

The architecture is scalable, allowing multiple machines to cooperate in analyzing and downloading documents. In fact, working on multiple machines was necessary, as the initial system reported memory errors with just one machine with 128MB of RAM in use. The use of Java's RMI subsystem allows multiple machines to coordinate, and work in an asynchronous manner. Unfortunately, this is not as completely scalable as it could be or appears to be. Since the engine was written in Java, it is subject to some of Java's faults. In this case, threads in Java cannot be interrupted immediately. This means that even though all the transformation performing objects (section 3.1) are given a limited amount of time in which to complete a data transformation, those that do not finish and are told to interrupt themselves will not stop consuming resources until the task has been completed, or the object checks for interruption of the thread in which it is operating. If the object happens to be in the midst of a blocking call, such as when downloading a

Gerard Salton and Chris Buckley. Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41(4):288-297, 1990.

Gabriel L. Somlo and Adele E. Howe. Agent-assisted internet browsing. In Proceedings of the Workshop on Intelligent Information Systems at the 16th National Conference on Artificial Intelligence (AAAI '99), 1999.

Sherlockhound. http://www.sherlockhound.com.

G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, New York, 1983.

StatMarket. StatMarket search engine ratings. June 2000. http://www.searchenginewatch.com/reports/statmarket.html.

Louise T. Su. The relevance of recall and precision in user evaluation. Journal of the American Society for Information Science, 45(3):207-217, 1994.

Danny Sullivan. Media Metrix search engine ratings. March 2000. http://www.searchenginewatch.com/reports/mediametrix.html.

Danny Sullivan. NPD search & navigation study. June 2000. http://searchenginewatch.com/reports/npd.html.

Danny Sullivan. Search engine alliances chart. June 2000. http://searchenginewatch.com/reports/alliances.html.

Danny Sullivan. Survey reveals search habits. June 2000. http://searchenginewatch.com/sereport/00/06-realnames.html.

run (that is, for steps 1 to 11 in figure 3.1) a page retriever will not have a complete document frequency table. Only those changes that are made locally are available. At the end of the run, changes to the local table are transferred to the master table, along with changes from all other page retrievers. This reduces much of the network traffic, except for the initial transfer of the table and stopword list. As a consequence of caching, the document frequency table is an estimate of the current knowledge about the frequency of terms in the documents seen. Thus, it will initially be inaccurate, and the estimate will get better as time goes on, until the caching effect becomes irrelevant. This is not a problem, as the table is inaccurate initially, whether or not caching is used. Furthermore, if no caching were used, and the initial table were changing rapidly, the knowledge would only allow better vector modelling of documents that were viewed later in the initial runs. Thus, documents would not be treated equally in the rankings because they would be ranked based on different amounts of knowledge about the terms in the document space.

The problem of a poor initial table may be reduced by using a bootstrap document frequency table, if there is knowledge about the dictionary that will likely be created. Such knowledge may come from existing analyses of the English language or from other information retrieval studies, for example. For this study, bootstrap tables were used from Brian Chambers' work [Cha99]. The document corpus and topics used in Chambers' work were different from the ones used here. However, this means that shared words would most likely be fairly common words in the English language that still had some discriminatory power.

In spite of the caching done, true scalability can only be achieved by using some distributed database, or other distributed data structure, which includes the use of a more intelligent caching scheme. This database or caching scheme would apply to both the document frequency table and the stopword list.
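The caching behaviour described in this section, where each page retriever works from a copy of the table taken at the start of a run and transfers its local changes to the master table at the end, can be sketched as follows. The class and method names are hypothetical, and the remote transfer (done through RMI in the engine) is replaced by a shared in-memory `Counter`.

```python
from collections import Counter


class CachedFrequencyTable:
    """Local cache of a master document frequency table.

    Lookups combine the snapshot taken at the start of the run with the
    changes made locally; changes reach the master only on flush().
    """

    def __init__(self, master):
        self.master = master              # shared Counter: term -> doc count
        self.snapshot = Counter(master)   # copied once, at the start of a run
        self.local = Counter()            # changes made during this run

    def add_document(self, terms):
        """Count each distinct term of a newly seen document once."""
        for term in set(terms):
            self.local[term] += 1

    def frequency(self, term):
        """Estimated document frequency as seen by this page retriever."""
        return self.snapshot[term] + self.local[term]

    def flush(self):
        """Transfer local changes to the master table at the end of a run."""
        self.master.update(self.local)
        self.snapshot = Counter(self.master)
        self.local = Counter()
```

During a run, two retrievers do not see each other's updates, which is the estimation effect the text describes; after both have flushed, the master table reflects all documents seen.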

3.4 Term Weighting

In the creation of VSMs in step 8 in figure 3.1, a number of different approaches may be used. In this engine, a term t in document d is given a weight w as follows.

# unique words in d    w =

avg # unique words per document

where $f_t$ is the term frequency, log is the logarithm to the base 2, $N$ is the total number of documents seen at the time that d is analyzed, and $f_d(t)$ is the number of documents in which term t occurs at least once. Other term weighting systems may also be used, as in [SM83, Sal71, LG98a, Mla99]. These are not the weights that are used during learning, however. When actually used, vectors are transformed so that the sum of their weights is 1. This includes the vectors that represent the profiles.

3.5 Topics

Four topics were chosen for queries:

Palm Pilot The palm pilot topic was meant to obtain documents pertaining to accessories and free productivity software for the Palm handheld computer line by Palm, Inc. All documents had to be about either accessories or free productivity software to be deemed relevant: no demos or shareware, and no list of URLs to other sites, among other criteria.

Robots This topic concerned research into autonomous robots, particularly with respect to courses, but any research would do. Documents about robot competitions (unless these were also courses), remote controlled robots, and toys were excluded from this topic.

Microsoft DOJ While Microsoft has had a number of cases with the Department of Justice (DOJ), documents were relevant to this topic only if they were applicable to the case which took place in the years 1998-2000, and resulted in the judge determining that Microsoft should be split into two companies. This excludes the case regarding the consent decree, circa 1997, and all other antitrust cases, such as the one regarding the purchase of Intuit.

Students The actual query used for this topic was "students university toronto." It was meant to obtain documents relating to student life at the University of Toronto: things such as clubs, organizations, student activities and student guides. It was meant to mimic a query by a potential undergraduate student to the University, who was interested in seeing what students did at the University, socially.

Chapter 4

Experimental Results and Evaluation

This chapter presents results of experiments run to answer the questions posed in the Introduction. The results are also evaluated for statistical significance, and evaluated with respect to the questions posed.

4.1 Description of Data Gathering Procedure

Data were gathered on as close to a daily basis as possible. Some days, data could not be obtained due to the fact that few documents were returned. This lack of data on a given day was due to changing conditions on the local and global Internet, and because of this, the data gathering procedures were either redone for those days or not done at all. As will be shown, a large gap between data gathering days did not appear to have an effect on the results. On those days when data were gathered, the queries were given in the order in which they were presented in section 3.5. For each topic, at least 50 documents were obtained on each day, with a mean of 180 documents gathered per topic, per day. The exact number of documents found each day may be seen in figure 4.1. An explanation of some of the larger spikes may be found in section 4.6.

Figure 4.1: Daily Document Counts. Panels: (a) Palm Pilot, (b) Robots, (c) Microsoft/DOJ, (d) Students.


4.2 Evaluation Framework

The evaluation of information retrieval systems typically involves some measure of precision and recall. In this domain, the former is the fraction of retrieved documents that are labelled as relevant, and the latter is the percentage of relevant documents that were retrieved, out of the whole population of relevant documents available. The total number of relevant documents for any given topic or query is unknown. Furthermore, a measure of recall should also take into account properties of the Internet. At any given moment, large portions of the Internet may be inaccessible or difficult to access due to failures of individual machines in the Internet, or unresolved congestion at some point in the Internet. Also, since users tend not to navigate beyond the first set of documents that a search engine displays [Nie00, Nie99b, Xie%, Sch00, TogSS] and since some search engines can give hundreds of thousands of documents, it is especially important to have more relevant documents in the top few URLs listed. Finally, as the WWW grows, recall becomes far less important than precision [Nie99a]. Many documents will be relevant, but the most relevant should be placed at the top of search lists. Some might argue that the need for better precision over recall already exists. However, another study [Su94] indicates that users may be more interested in absolute recall than precision; it is unclear whether this would still be true when using today's search engines on the vastness of the WWW, as Nielsen argues.

Whatever the case may be, measures of recall or precision that go beyond the first 10 to 20 documents that the metasearch engine displays are relatively useless because the user will only rarely see documents listed beyond those first 10 or 20. However, a measure of relative recall is useful when comparing individual search engines that make up part of the metasearch engine. Here, a measure of the number of relevant documents an individual search engine obtained relative to the number of relevant documents the metasearch engine obtained can be used to compare individual search engine performance over time. Statistics of this nature can be found in section 4.7. It is possible to estimate the number of relevant documents found each day. These results are presented in section 4.5.

For most of the other discussion, a measure of precision is used. This measure is given with respect to a certain number of documents: for example, the precision in the top 10 documents returned by the metasearch engine, or the precision in the top 1 document. These measures have been used before, in analyzing search engines [GLG+99, CSSS]. The TREC-7 filtering task also suggests a measure to use [Hul99], the F3 measure:

$$ F3 = 4R^{+} - N^{+} $$

where R+ is the number of relevant documents retrieved and N+ is the number of nonrelevant documents retrieved. Results from this measure are presented in section 4.3.1. The TREC-9 filtering task [RH] suggests the use of the T9U and T9P measures, presented in sections 4.3.2 and 4.3.3, respectively:

$$ T9U = \begin{cases} 2R^{+} - N^{+} & \text{if } 2R^{+} - N^{+} > MinU \\ MinU & \text{otherwise} \end{cases} $$

$$ T9P = \frac{R^{+}}{\max\left(MinD,\; R^{+} + N^{+}\right)} $$

$$ MinU = -400 \text{ for 4 years or pro-rata} $$

$$ MinD = 50 \text{ for 4 years or pro-rata} $$
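The three TREC utility measures can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: it takes F3 to be the TREC-7 utility 4R+ - N+ and T9U to be 2R+ - N+ bounded below by MinU, and the function and parameter names are hypothetical. The defaults are the four-year values quoted above, so a shorter run would need pro-rata values instead.

```python
def f3(rel_retrieved, nonrel_retrieved):
    """Assumed TREC-7 F3 utility: 4 * R+ - N+."""
    return 4 * rel_retrieved - nonrel_retrieved


def t9u(rel_retrieved, nonrel_retrieved, min_u=-400):
    """Assumed TREC-9 T9U utility: 2 * R+ - N+, bounded below by MinU."""
    return max(2 * rel_retrieved - nonrel_retrieved, min_u)


def t9p(rel_retrieved, nonrel_retrieved, min_d=50):
    """TREC-9 T9P: precision with a floor of MinD on the denominator, so
    retrieving fewer than MinD documents is penalized."""
    return rel_retrieved / max(min_d, rel_retrieved + nonrel_retrieved)


# Hypothetical counts: 10 relevant and 20 nonrelevant documents retrieved.
example = (f3(10, 20), t9u(10, 20), t9p(10, 20))
```

With these hypothetical counts, T9P divides by MinD rather than by the 30 documents actually retrieved, which is how the measure penalizes a system that selects too few documents.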

Establishing a performance baseline would also be useful. One baseline that could be used is having an unchanging profile order the documents. In other words, the profile and the query stay the same through time. In the following discussion, this profile, query and associated learning algorithm will be referred to as the plain profile, query, or learning algorithm. This baseline turns out to have properties very similar to choosing a random set of documents and presenting them to the user. One would expect that choosing a random set of documents would result in the running average of the precision in the top N being approximately the same throughout time, for any N. The running average precision in the top N is simply the precision in the top N over some number of days. This can be seen in figure 4.2. The differences in precision end up being no more than 4%. Also, the order in which the precisions appear on the graph seems random, with the precision in the top 30 being the best, while the precision in the top 3 is the worst. The difference between the precision in the top 30 and the next best precision suggests that the ordering may be poorer than random, and figure 4.3 confirms this. A Wilcoxon signed rank test [Gus97], performed due to the non-normality of the data, reveals the difference is significant with p < 0.001. This still presents a good baseline, however, based on the merits of the profile. All the other learning algorithms use this plain profile to start with. Any improvement over the plain algorithm is thus a result of learning. With this baseline measurement, 30 documents were ranked each day.

Figure 4.2: Running average of precision across all topics for Plain algorithm. Series: Plain top 1, top 3, top 5, top 10, top 30.

Another baseline was also used. This will be referred to by the random name. The random algorithm chose 30 random documents each day, from the documents retrieved on each day. Here, all the running averages converge to approximately 10%, which is slightly higher than the precision in the top 30 for the plain profile. This convergence is expected from a ranking of a random set of documents. The data from the random algorithm will be used to estimate the proportion of relevant documents per day and thus, the number of relevant documents per day. As with the plain algorithm, 30 documents were ranked on each day.

Figure 4.3: Running average of precision across all topics for Random algorithm. Series: Random top 1, top 3, top 5.

In the following discussion, it is often instructive to examine only a portion of the results returned by the algorithms used. For example, without any other restrictions, every algorithm could return up to 30 documents. Thus, the measures given, without other restrictions, could be called the measures in the top (up to) 30 documents. We can restrict the number of documents we allow in the measure, and call it the measure in the top n documents. These measures would include only the top n system ranked documents, or fewer than n if fewer documents were ranked by the system. For ease of notation, these measures will be referred to as the measure in the top n. Keep in mind that fewer than n documents may be included in the measure.
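The "measure in the top n" convention can be made concrete with a short Python sketch for precision. Two choices here are assumptions rather than details given in the text: the denominator is the number of documents actually included (so fewer than n when fewer were ranked), and the running average is taken as the mean of the daily values seen so far. The function names are hypothetical.

```python
def precision_in_top_n(ranked_relevance, n):
    """Precision over the top n ranked documents, or over all of them if
    fewer than n were ranked. ranked_relevance lists relevance judgments
    (True/False) in system rank order, best first."""
    top = ranked_relevance[:n]
    if not top:
        return 0.0
    return sum(top) / len(top)


def running_average(daily_values):
    """Mean of the daily values seen so far, one entry per day."""
    averages = []
    total = 0.0
    for day, value in enumerate(daily_values, start=1):
        total += value
        averages.append(total / day)
    return averages


# Hypothetical judgments for three days of ranked results.
days = [
    [True, False, True, False],
    [False, False, True],
    [True],
]
top2_daily = [precision_in_top_n(day, 2) for day in days]
top2_running = running_average(top2_daily)
```

On the third hypothetical day only one document was ranked, so the precision in the top 2 is computed over that single document.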

4.3 TREC Measures

4.3.1 The F3 Measure

The running average of the F3 measure is shown in figure 4.4, with various topics. Figure 4.4(b) shows an example of the daily F3 statistics. The line labelled as "Test" in this figure, and other figures, has the same profile as the line "Grigoris" until day 62. After day 62, the "Test" profile is frozen, while the "Grigoris" profile is allowed to change through continued learning. This was done to examine any differences that might arise, and results of this are presented in section 4.4.1.

The difference between the Grigoris algorithm and the Rocchio algorithm is statistically significant when taking into account all topics, with p < 0.0001, using the Wilcoxon signed rank test with a 0.5 continuity correction. The difference here is also clearly significant in terms of real world performance.
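For readers unfamiliar with the test, the following Python sketch shows one common form of the Wilcoxon signed rank test on paired daily scores: a two-sided test using the normal approximation with a 0.5 continuity correction. It is an illustration under assumptions, not necessarily the exact procedure used in this work. Zero differences are dropped and tied absolute differences share their average rank, and the function name and sample data are hypothetical.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (W+, two-sided p) for paired samples x and y, using the
    normal approximation with a 0.5 continuity correction."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        # Extend j over the run of equal absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        run_length = j - i + 1
        tie_term += run_length ** 3 - run_length
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = max((abs(w_plus - mean) - 0.5) / math.sqrt(variance), 0.0)
    p_value = math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))
    return w_plus, p_value


# Hypothetical daily scores for two algorithms over ten days.
algo_a = [float(d) for d in range(1, 11)]
algo_b = [0.0] * 10
w_plus, p_value = wilcoxon_signed_rank(algo_a, algo_b)
```

Because the test ranks the paired differences instead of assuming they are normally distributed, it suits daily measures like these, whose distribution is noted in section 4.2 to be non-normal.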

The graphs seem to indicate that, with the exception of the plain algorithm, all algorithms on all topics perform well in the first ten to twenty days, after which performance seems to plateau or decline. The exception to this is the performance on the Palm Pilot topic, where the Rocchio and Grigoris algorithms increase their performance continually.

Almost without exception, this measure indicates an increase in performance at or around day 70. This will be explained in section 4.6. The almost monotonically decreasing line of the plain profile indicates that the number of relevant documents is almost always decreasing. This is to be expected, as one would think that the individual search engines are fairly good at ranking documents. Thus, as time goes on, the individual search engines give back worse and worse documents, which are less and less relevant to the general query given, much less the implicit topic that the user has in mind. In light of this, any line segment in the cumulative plots which has greater slope than the corresponding line segment on the plain profile line is indicative of better than plain performance. Likewise with respect to the line for the random algorithm.

Figure 4.4: F3 Measure on Various Topics, Running Average. Panels: (a) Palm Pilot, (b) Palm Pilot, Daily, (c) Robots, (d) MS/DOJ, (e) Students, (f) All Topics.

Figure 4.5: F3 Measure on Various Topics, Running Average, based on top 5 documents returned only. Panels: (a) Palm Pilot, (b) Robots, (c) Microsoft/DOJ, (d) Students, (e) All Topics.

Figure 4.6: F3 Measure on Various Topics, Running Average, based on top 10 documents returned only. Panels: (d) Students, (e) All Topics.

The two traditional relevance feedback algorithms perform well on all but the student and robots topics. On the latter topic, the Rocchio algorithm generates approximately four times as many nonrelevant documents as relevant ones, whereas the Grigoris algorithm manages to get 50 more relevant documents than nonrelevant ones. Neither of the algorithms does well in the student topic, possibly due to the fact that few documents were relevant at all in that topic, as indicated by the line for the random measure. Note that the minimum value of this measure is -1300 and the random algorithm received -947, which translates to approximately 5.5% of the retrieved documents being relevant. This may indicate that the data or the generated profiles were noisy. In fact, the Rocchio algorithm had a recall of 24% (see section 4.5 below, on how recall is estimated in this work) but performed more poorly than the Grigoris algorithm because of the large number of irrelevant documents obtained. In comparison, the Grigoris algorithm had a 23% recall. It is also possible that the implicit topic behind the student query led to noisy results by virtue of the topic itself. For instance, suppose that the term "Association" were a good indicator of relevance, but only in conjunction with the terms "Toronto" and "Student." Furthermore, suppose that those words were poor indicators of relevance when found alone, or not near to the other words. The generated profiles do not take this into account, and thus cannot model these relationships accurately.

The F3 measure based only on the top 10 and top 5 documents ranked gives an interesting picture of the system's performance. These may be seen in figures 4.6 and 4.5. Judging from the shapes of the graphs¹, and in particular the shape of the graph showing the F3 measure across all topics, it would appear that the performance improves by just taking into account only those documents. This lends credence to the idea that the learning algorithms are working, because this type of improvement suggests that more relevant documents are concentrated in the top ranked documents. More discussion on this is presented in section 4.4.

¹ Unless we normalize the F3 measure, we cannot compare slopes or absolute numbers. This can be done, but this type of comparison is more clearly seen in section 4.4.

4.3.2 The T9U Measure

The T9U measure is meant to reduce the effect of selecting too many documents, particularly when those documents are irrelevant. It does this by introducing a lower bound on a measure similar to the F3 measure. This lower bound is called MinU. Results shown with this measure, in figure 4.7, should be a 'smoother' version of the F3 measure. This is borne out by examining the graphs. In fact, except for the palm pilot topic, the graphs for the learners exhibit a high degree of parallelism with the graphs for the plain and random algorithms. This means that the learning algorithms aid in the process of finding relevant documents only at specific points. This must be true, given the parallelism and the fact that the lines representing the learning algorithms are higher than those for the plain and random algorithms.

This behaviour is made more obvious by examining the graphs in figures 4.7(e), 4.8(e) and 4.9(e). It is easy to see that the 'bumpiness' of the graphs increases when taking into account fewer documents. The 'bumpiness' occurs where the learning algorithms actually perform better than the random and plain algorithms for a particular day. This behaviour cannot entirely be due to the performance on the palm pilot topic, as the graphs of the palm pilot are fairly consistent in terms of when performance is better than the plain and random algorithms. The consistency of the measure on the palm pilot topic and the fact that the measure was always rising indicate that, in fact, the learners on this topic did not select many more than 5 documents each day. While the other topics are also somewhat consistent, this same interpretation may not be given. The consistency there is a result of the performance being similar to the baseline measures for most of the run. While the performance on the other topics appears to be poor, the overall results still indicate that the learners were correctly ranking relevant documents higher than nonrelevant ones. If this were not the case, the graphs for the top 5 and 10 (figures 4.8 and 4.9) would look more like the graph in which no restrictions were made (figure 4.7). Examining the overall results on this measure one final time, it indicates that either there were not too many relevant documents or the learning algorithms work best only for producing relevant documents in the top 5 or fewer documents returned. If this were not the case, one would expect the performance to be similar on all graphs for a particular topic.

Figure 4.7: T9U Measure on Various Topics, Running Averages. Panels: (a) Palm Pilot, (b) Robots, (d) Students, (e) All Topics.

Figure 4.8: T9U Measure, Top 5 only. Panels: (a) Palm Pilot, (b) Robots, (d) Student, (e) All Topics.

Figure 4.9: T9U Measure, Top 10 only. Panels: (a) Palm Pilot, (b) Robots, (c) Microsoft/DOJ, (d) Student, (e) All Topics.

The data seem to indicate that the Grigoris algorithm is better than the Rocchio one using this measure. However, a Wilcoxon signed rank test on the daily differences reveals the difference to be insignificant, with 0.1977 < p < 0.2005. Examining figure 4.7(e), the difference seems almost entirely due to the difference occurring at days 2, 3 and 4. While this difference, taken on a daily basis, might not be statistically significant, the end results are certainly significant in the real world. This difference ends up causing a difference of 4.19% in terms of the precision in the top 30. Perhaps more importantly, however, the difference is almost entirely due to a six document difference in the first four days, with a five document difference in the second day. This is important because while the usage scenario of this system occurs over some time, it could be the case that a time period of only a few days was important. Thus, any improvement in those few days would be critical. Also, since the behaviour of the individual search engines being put to use is such that relevant links are highly likely to be present in the first few pages of hits that the search engines return, and that these first few pages are retrieved in the first few days, the importance of achieving a good profile in a short time increases.

4.3.3 The T9P Measure

The T9P measure is meant to "stress precision" according to the Filtering Track Guidelines [RH]. It uses a lower bound on the minimum number of documents that must be selected. It thus penalizes systems which retrieve fewer than this minimum number of documents. These results are presented in figures 4.10, 4.11 and 4.12. The results are clearly in favour of the Grigoris algorithm. In fact, the differences between it and the Rocchio algorithm are statistically significant on the palm pilot and robots topics, as well as the overall results. Again, the Wilcoxon signed rank test with the 0.5 continuity correction was used to compare daily versions of the measure. The p values were p < 0.0044, p < 0.0207, 0.1151 < p < 0.1170, 0.1003 < p < 0.1020, and p < 0.0013 for the graphs of figure 4.10, in order from left to right, top to bottom. In other words, the three graphs in which it looks like there may have been a statistically significant difference do in fact have one. This difference decreases as we include only those documents in the top 5 or 10, although the overall results maintain their statistical significance. This means that the algorithms tend to give the user the same number of relevant documents in the top few documents.

Figure 4.10: T9P Measure on various topics. Panels: (a) Palm Pilot, (b) Robots, (c) Microsoft/DOJ, (d) Students, (e) All Topics.

Figure 4.11: T9P Measure on various topics, including a maximum of 5 documents. Panels: (a) Palm Pilot, (b) Robots, (d) Students, (e) All Topics.

Figure 4.12: T9P Measure on various topics, including up to a maximum of 10 documents. Panels: (a) Palm Pilot, (b) Robots, (c) Microsoft/DOJ, (d) Students, (e) All Topics.

4.4 Precision of Learning Algorithms

The T9P measure in section 4.3.3 gives the precision in the top 30 for the plain and random algorithms, since those algorithms are always forced to return 30 documents. That measure does not completely represent the data, or accurately represent those individual algorithms. Figure 4.13 shows the running average of the precision across all topics, with precisions in the top 1, 5, 10 and 30 documents returned by the system. The figure shows that the Grigoris algorithm is better at discriminating between relevant and nonrelevant documents, since the line representing the results from the top n always ends up higher with the Grigoris algorithm than the Rocchio one.

The staggered positions of the lines corresponding to the different numbers of documents indicate that both traditional style algorithms are pushing relevant documents higher in the list of documents presented to the user. It may also imply that there are not enough relevant documents to have a high precision in the top 30 or the top 10. Given that the algorithms are working, which they appear to be, then if there were many relevant documents, one would expect that the lines for the different numbers of documents would be closer to each other, instead of having an eight percent difference between the top 1 and top 30 results. It is likely that the staggering is also a result of the algorithms' inability to both have relevant documents ranked highly, and to obtain all the relevant rankings from the set of documents available. However, it is difficult to estimate the recall of the algorithms, since no data exists about how many relevant articles exist in the entire WWW. Data about recall on a daily basis, with respect to the documents returned by the individual search engines, can be estimated by looking at the data from the random algorithm; this is presented in section 4.5.

Figure 4.13: Running average precision across all topics (x-axis: days from start).

Figure 4.14: Running average precision on various topics, showing precisions with various numbers of documents included. Rocchio and Grigoris algorithms. Panels: (a) Palm Pilot, (b) Robots, (c) Microsoft/DOJ, (d) Students.

Figure 4.14 shows the precision numbers for the individual topics. All the graphs exhibit stratification. That is, given a fixed number, m, of documents included in the measure, then including n documents, n < m, results in increased performance. The stratification is much less apparent on the robots and MS/DOJ topics than on the palm pilot topic. This is probably attributable to the profiles, and their ability to distinguish between documents. For instance, with the student topic, the term 'student life' is important, but only in conjunction with the term 'toronto'. However, the term 'toronto' by itself is actually a very poor indication of relevance. Similarly, the term 'student life' without the term 'toronto' is a poor indication of relevance. Thus, at certain times in the evaluations, the term 'student life' may be important in the profile, but it lacks the term 'toronto,' or vice versa. This may be due to a fault of the feature space, which only consists of the terms in the VSMs and does not add the correlations between terms.

The large gap in the Grigoris algorithm on the students topic appears unusual. The stratification seems to be extreme when looking from the top 3 to the top 1 data. This is probably due to the high precision at the beginning of the run.

Figure 4.15: Running average precision of Grigoris algorithm on the student topic.

Figure 4.16: Running average of the precision for continuous training versus test portion of train/test, across all topics. Grigoris algorithm used. Shown starting from test cycle at day 70.

Figure 4.17: Running average precision on various topics using the Grigoris algorithm, comparing continuous training with the test portion of a train/test cycle. Testing cycle begins at day 70. Panels: (a) Palm Pilot, (b) Robots, (d) Students.

4.4.1 Continuous Learning vs Train/Test

One other test that was performed was to examine the effects of continuous training versus having a training and testing period for the profile. This was done with the Grigoris algorithm, and the results may be seen in figures 4.16 and 4.17. With the exception of the results in the top 3, the differences in the daily data are statistically significant with p ∈ [O.OIO.L, 0.0418]. The differences in the top 3 result in p ≈ 0.0643. These results are only with 14 days' worth of data. Furthermore, these results may be due to the fact that the metasearch engine frequently returned pages that had been seen before, and which had only changed slightly since the last time they were seen (in the date, for example). Servers may also have returned incorrect dates for the date check, or pages may have been dynamically generated. All these factors result in a basically unchanged page being given to the user. Pages that had not changed in a relevant manner were marked irrelevant (see section 2.3). This puts any profile that has been frozen at some particular point in time at a disadvantage, because it cannot adjust to this method of ranking. At the same time, this method is necessary in order for the system to give new pages to the user, and for the system rankings not to be dominated by dynamically generated pages.

One other possible explanation of these results is that overtraining might have occurred. If this were the case, the test profile would only work well with the documents the system had already seen. This is not the case here. The spike in the performance at day 70 shows this. As explained in section 4.6, this spike is due to the system seeing relevant documents that had been seen before, but when the profile was still relatively poor.

Figure 4.18: Daily recall for the Rocchio algorithm. Panels: (a) Palm Pilot, (b) Robots, (d) Students, (e) All topics.

Figure 4.19: Daily recall for the Grigoris algorithm. Panels: (b) Robots, (d) Students, (e) All topics.

4.5 Daily Recall

The random algorithm allows estimation of the proportion of relevant documents available on each day. Note that this does not allow estimation of the proportion of relevant documents in the population of documents available on the WWW, but only in the population consisting of those documents retrieved on each day. This still provides useful data. For instance, the 95% confidence interval of the number of relevant documents each day can be estimated, and from there, a range of recall values for each algorithm may be obtained. This confidence interval is obtained using a 0.5 continuity correction to approximate a binomial distribution with a normal one [Gus97]. Using these intervals, it is possible to compute a range for the number of relevant documents that are expected to be present on each day. Given this, it is easy to compute the estimated recall per day. These data are presented in figure 4.18 for the Rocchio algorithm, and figure 4.19 for the Grigoris algorithm. Note that the estimated proportions were altered to be nonnegative, that the recall estimates shown were altered to be no more than 100%, and that a recall of 0/0 was given 100% recall. The 'mean recall' refers to the recall obtained when using the middle of the confidence interval as the estimate of the number of documents found.

The daily data can be used to estimate the overall recall. The ranges are given in table 4.1. These ranges were obtained by summing the found number of relevant documents each day and dividing by the sum of the estimated number of relevant documents found each day. This sum varied, depending on exactly which number in the 95% confidence interval was used in the summation: either the maximum, the minimum or the mean of the number of estimated documents in the confidence interval. The expected number of relevant documents is the sum of the means of the number of relevant documents, and gives an indication of the weight each topic is given in the 'All topics' topic.
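The daily estimate described above can be sketched in Python as follows. This is a minimal sketch under assumptions: it uses the standard normal approximation to the binomial with a 0.5/n continuity correction, takes the midpoint of the clamped interval as the "middle", and treats any day with zero expected relevant documents as 100% recall. The function names and sample numbers are hypothetical.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal


def proportion_interval(relevant_in_sample, sample_size, z=Z_95):
    """Confidence interval for the proportion of relevant documents, from
    the random algorithm's sample, clamped to the range [0, 1]."""
    p_hat = relevant_in_sample / sample_size
    std_err = math.sqrt(p_hat * (1 - p_hat) / sample_size)
    half_width = z * std_err + 0.5 / sample_size  # continuity correction
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


def recall_range(found_relevant, retrieved_today, interval):
    """Return (min, mean, max) estimated recall for one day."""
    low, high = interval
    middle = (low + high) / 2

    def recall(proportion):
        expected_relevant = proportion * retrieved_today
        if expected_relevant == 0:
            return 1.0  # 0/0 is reported as 100% recall
        return min(1.0, found_relevant / expected_relevant)

    # A larger expected number of relevant documents means lower recall.
    return recall(high), recall(middle), recall(low)


# Hypothetical day: 3 of the 30 random documents were judged relevant,
# 180 documents were retrieved, and the learner found 5 relevant ones.
interval = proportion_interval(3, 30)
daily_recall = recall_range(5, 180, interval)
```

With only 30 sampled documents per day the interval is wide, which is why the minimum and maximum recall estimates can differ considerably from the mean.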

Table 4.1: Overall estimated recall.

topic        Rocchio                      Grigoris                     Expected number of
             min      mean     max        min      mean     max        relevant documents
Palm Pilot   0.2672   0.3496   0.5039     0.2338   0.3059   0.4409
All topics   0.1984   0.2645   0.3851     0.1935   0.2580   0.3757

Table 4.2: Total number of documents retrieved for each topic. This is not the number retrieved by the learning algorithms to be presented to the user, but the documents that the metasearch obtains from the individual search engines.

topic        number of retrieved documents   expected % relevant
Robots       6453
Students     5365                            5.93
All topics   31664                           10.2

(Cell values with no recoverable row or column: S9'1, 32.9)


4.6 Spikes

There are two spikes, appearing at the beginning of the graphs, and on or around day 70 in the graphs, that are particularly striking, as they appear in all graphs, from the document count graph of figure 4.1, to the TREC measures in figures 4.4, 4.7 and 4.10, to the precision measurements in figure 4.13. The initial spike may be explained by the fact that the individual search engines that were being used had their best results at the beginning of their document lists, which were viewed by the metasearch engine first. In the case of the document counts, there were more documents because the system did not have to go far in the lists of documents returned by the individual search engines. Thus, the documents tended to be fairly popular, which tends to imply good network connections. It could also have resulted in more popular pages, particularly with DirectHit, HotBot and derivatives. One would expect more popular documents to be reachable across the Internet and conversely, that less popular documents would be less reachable. Since unreachable documents cause resource consumption, the system might not be able to collect as many documents in the later days as in the early days.

The other spike occurred after a gap in data gathering of eight days, but also after the table which keeps track of the seen documents had been flushed. This table kept track of which documents the metasearch engine had seen and ranked, not which documents the user had seen and ranked. Flushing the first 10 days of data from it resulted in many completely new documents being presented to the user, and some previously seen ones. Ignoring the previously seen ones unless they had changed, the metasearch engine still gave a large number of relevant documents, enough to show a difference even after 70 days of data gathering. This could either have been due to the gap in data gathering, or the flushing.

Figure 4.20: Daily precision across all topics, Top 10.

4.6.1 Data Gathering Gaps

Gaps occur in other places, such as the seven day gap starting at day 17, the 6 day gap starting at day 30, and the 15 day gap starting at day 73. Examining figure 4.1, it is not difficult to see where these gaps are. The largest gap, the 15 day one, results in a general decline across topics, while the other two gaps show a mix of shallow declines and ascensions, as can be seen on the various performance graphs, such as the F3, T9U and T9P measures. The document counts also do not appear to increase or decrease significantly, meaning that the number of documents that were found to be changed or new did not increase or decrease. The estimated number of relevant documents also does not appear to have any correlation with the data gathering gaps: the results of figure 4.18(e) show a mix of ascensions and declines associated with the gaps.

4.6.2 Flushing

Thus, the spikes were probably due to flushing. However, one would not expect that this
flushing would work more than once, because the first flushing would allow the user to
rank most of the pages that had been missed earlier due to a poor profile. If it did, one
would expect that the performance would not increase even to the extent that it did on

Figure 4.21: Daily precision on various topics, top 10. Panels: (a) Palm Pilot, (b) Robots,
(c) Microsoft/DOJ, (d) Students.

Figure 4.22: Daily precision in top 30 on students topic for the Rocchio algorithm.

day 70. An increase at least as large would mean that the individual search engines were
giving many good results over time, but the profile was not adapting quickly enough.
Flushing was done at days 70, 94, and 104; that is, 3 days, 7 days and 13 days before the
last data point. Figures 4.20, 4.21, and 4.18 confirm the hypothesis that flushing would
not work more than once: in one instance after the day 70 flushing, flushing results in
higher precision and in the other it results in lower precision. Also, the lower precision
due to flushing occurs earlier in time than the higher precision due to flushing (if flushing
was the cause of this at all, which it probably was not). Similar patterns are found in
the daily recall statistics.

This data also has some bearing on the issue of continuous learning, and therefore,
continuous adjustment of the profile. It shows that despite the necessity for continuous
learning as shown in section 4.4.1, the learned profile still performs well when presented
with a large number of relevant documents, even 60 days after the previous peak
performance (and thus, peak learning with the traditional style algorithms). The mainly
negative reinforcement that was received after the first spike in performance does not
seem to have a deleterious effect later on.

One peculiarity with the daily precision numbers is that the biggest spike in the
students topic occurs at day 50 with the Rocchio algorithm. This doesn't correspond to
any particularly special day. Only 200 documents were retrieved on that day, when the
average for the students topic was 190. To put that in perspective, 427 documents were
retrieved on day 70. The high performance seems to be due to the Snap and MSN search
engines, each obtaining four relevant results in the top 10.

4.7 Individual Search Engine Recall

The individual search engines do perform differently on certain topics than on others.
Figure 4.23 shows graphs of the running average of the search engine recall, for all
topics with the Rocchio algorithm. Figure 4.24 shows a similar graph for the Grigoris
algorithm. The recall numbers are given with respect to the relevant documents found
by the metasearch engine (ie: all search engines). Computing a line of regression for the
above graphs, and using a percentage for the recall, produces the slopes given in table 4.3.
This shows that the search engine recall is about the same over time across all topics.
The graphs show that no individual search engine is able to obtain more than 25% of the
relevant documents, over time. Examining results on the separate topics reveals that no
individual search engine obtains more than 40% of the relevant documents found by the
metasearch engine on any individual topic, over time.
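The quantities discussed above (relative recall as a percentage, its running average, and the slope of a regression line in %recall/day) can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and it assumes one recall value per day and an ordinary least-squares fit against the day index, neither of which is confirmed by the text.

```python
def engine_recall_percent(found_by_engine, found_by_metasearch):
    """Recall of one engine relative to all relevant documents found by
    the metasearch engine, expressed as a percentage."""
    if not found_by_metasearch:
        return 0.0
    hits = len(set(found_by_engine) & set(found_by_metasearch))
    return 100.0 * hits / len(found_by_metasearch)


def running_average(values):
    """Average of all values up to and including each day."""
    out, total = [], 0.0
    for count, value in enumerate(values, start=1):
        total += value
        out.append(total / count)
    return out


def regression_slope(ys):
    """Least-squares slope of ys against the day index 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return sxy / sxx
```

A slope near zero for an engine then corresponds to recall that is "about the same over time", while a clearly positive slope corresponds to an upward trend.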

Search engine recall does change on a per topic basis. Examining figure 4.25, there is
a clear upward trend. In fact, the slope of the line of regression is 0.1545%/day. Similar
cases may also be found in the other topics, as with Infoseek on the robots topic, with a
slope of the regression line of 0.1081%/day (see figure 4.26 and figure 4.27). The different
search engines perform differently on the various topics. Table 4.4 shows the engine with
the highest slope on the line of regression on the various topics for the two traditional
style algorithms, and table 4.6 shows the lowest slopes. In each topic, a single search
engine had the best or worst slope, independently of the learning algorithm used. The

Figure 4.23: Running average of individual search engine recall, across all topics for the
Rocchio algorithm. Recall is measured relative to the total number of relevant documents
found by the metasearch engine using the Rocchio algorithm for learning.

Figure 4.24: Running average of individual search engine recall, across all topics for the
Grigoris algorithm. Recall is measured relative to the total number of relevant documents
found by the metasearch engine using the Grigoris algorithm for learning.

Search Engine

AltaVista
DirectHit
Hotbot
Infoseek
Lycos
MSN
NationalDirectory
Snap
Thunderstone
Yahoo

Table 4.3: Slopes of regression lines where the y axis is given as a percentage recall; thus
the units are %recall/day from start of run. Results are across all topics.

Figure 4.25: Running average of recall for the Yahoo search engine on the palm pilot
topic using the Rocchio learning algorithm.

Figure 4.26: Running average of recall for the Lycos search engine on the student topic
using the Rocchio learning algorithm.

most and least improving engines do not necessarily match the best and worst performing
search engines at the end of the tests. These are given in tables 4.5 and 4.7. The best
engines for this job do not reflect the coverage each individual search engine has according
to Lawrence and Giles [LG99]. It is also interesting to note that search engines
that share common structures, such as the Hotbot and MSN search engines, which both use
DirectHit's search engine, have varying results, possibly due to different versions of the
search engine, different versions of databases, or different supplementary searches.

Figure 4.27: Running average of relative recall, Microsoft/DOJ topic, Rocchio algorithm.
These graphs show a variety of patterns in the recall, for various search engines.

Topic            Rocchio engine    Grigoris engine
Palm Pilot       (illegible)       MSN
Robots           Infoseek          Infoseek
Microsoft/DOJ    AltaVista         DirectHit
Students         MSN               MSN

Table 4.4: Best improving individual engine recall per topic, for the two traditional style
algorithms. Slopes are given as %recall/day.

Topic    Rocchio (Engine, Recall)    Grigoris (Engine, Recall)

MSN
Snap
AltaVista
Lycos
Snap

Students

Table 4.5: Best individual engine recall per topic, for the two traditional style algorithms.

Topic       Rocchio (Slope, Engine)    Grigoris (Engine)

Robots      DirectHit
            Snap
Students    DirectHit                  DirectHit

Table 4.6: Worst improvement in individual engine recall per topic, for the two traditional
style algorithms. Slopes are given as %recall/day.

Topic

Palm Pilot
Robots
Microsoft/DOJ
Students

Rocchio engine

NationalDirectory
Thunderstone
Yahoo

Grigoris engine

Thunderstone
NationalDirectory

Table 4.7: Worst individual engine recall per topic, for the two traditional style
algorithms.

Chapter 5

Conclusions and Future Directions

5.1 Conclusions and Discussion

To answer the questions posed in the Introduction, the results of the previous
chapter indicate that metasearch does seem to be effective, and learning a user profile
also appears to increase the relevance of returned documents. The difference in precision
between the baseline algorithms and the algorithms that use a changing user profile
ranges from 15% to 30%. The T9P measure resulted in differences of 9 to 10 points, the
T9U measure resulted in a difference of between 90 and 115, and the F3 measure resulted
in differences between 3000 and 4000 points. Metasearch in a relevance feedback context,
and with the implemented method for ranking, is much better than merely using any
individual search engine because certain search engines perform better than others on
certain topics, and because of the increased coverage one obtains with metasearch. Using
a user profile appears to help relevance, according to the various TREC measures and
comparisons with the plain algorithm, with some caveats presented below. The Grigoris
algorithm appears to perform at least as well as the Rocchio algorithm on the individual
topics and performs significantly better than the Rocchio algorithm on the T9U, T9P,
and F3 measures on all the topics taken together.
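For reference, the set-based measures named above can be computed from counts of relevant and nonrelevant retrieved documents. The sketch below uses the linear utility and target-precision forms as they are usually stated for the TREC-7 and TREC-9 filtering tracks (F3 = 4R+ - N+, T9U = 2R+ - N+, T9P = R+ / max(MinD, R+ + N+)). Whether this study used exactly these constants is an assumption, as are the function names and the `min_docs=50` default.

```python
def f3_utility(relevant, nonrelevant):
    """TREC-7 style linear utility F3 = 4*R+ - N+ (assumed form)."""
    return 4 * relevant - nonrelevant


def t9u(relevant, nonrelevant, min_u=None):
    """TREC-9 style utility T9U = 2*R+ - N+, optionally bounded below."""
    utility = 2 * relevant - nonrelevant
    return utility if min_u is None else max(utility, min_u)


def t9p(relevant, nonrelevant, min_docs=50):
    """TREC-9 style target precision: R+ / max(MinD, R+ + N+).

    min_docs (MinD) is an assumed target number of retrieved documents.
    """
    return relevant / max(min_docs, relevant + nonrelevant)


def precision(relevant, nonrelevant):
    """Plain precision over the documents actually retrieved."""
    retrieved = relevant + nonrelevant
    return relevant / retrieved if retrieved else 0.0
```

Because the utilities are unnormalized counts accumulated over a run, differences in the thousands for F3 and in the hundreds for T9U are on very different scales from the precision-like T9P.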

This study also obtained answers to questions that were unposed. For instance, the
importance of achieving a fairly good profile in the first two to five runs was shown in
the parameter tuning stage. There, a set of parameters that led a learner to obtain few
results in the first few days led to poor learned profiles. This caused the learner to never
see relevant results at all, because no relevant results were returned to the user. Thus, the
user would be forced to indicate that all the documents were nonrelevant. Naturally, this
left the learner with no way to predict which future documents would be relevant,
except through random selection of documents.

There are some objections that may be raised about the accuracy of the conclusions
presented above. In particular, the use of the plain algorithm and plain profile as a
baseline may be questionable. The same applies to the use of the random algorithm. Neither
algorithm gives particularly good rankings. Furthermore, some existing search engines
have their own method of 'one-step' relevance feedback. For instance, Google allows
a searcher to find 'similar pages' to any that they find relevant. Other search engines
have the potential to use implicit feedback, although this would probably occur in a batch
fashion, unlike the work presented here. Search engines that fall into this category are
those that present a list of links to other pages, but in which those links are actually links
to a server operated by the same entity as the search engine itself. This other server can
collect 'visited' statistics, and then redirect the user to the document they wish to view.
It is difficult to make any comparisons with individual search engines simply because
they are not designed to be used in the same scenario presented here.

As to the baseline measure, it would be difficult to come up with another benchmark.
Other metasearch engines cannot be used for various reasons.

• They may not return enough results to make a repetitive query feasible, even with
  a general query such as "palm pilot."

• They may not combine the results of all the search engines, instead interleaving
  results in some manner.

• They may not allow the user to specify exactly which search engines to use, or if
  they do, they may not allow certain search engines that were used in this study.

The metasearch engine closest to being feasible for benchmarking purposes would be
SavvySearch, which still fails for the first two reasons given. The main reason, however,
is the first: no metasearch engine returns enough results to make a repetitive query
worthwhile. There would be no new documents to view, and few, if any, changed ones.
Thus, they would tend to return few or no documents for user ranking. Using the ranking
schemes that the individual search engines use would form a good baseline, but these are
not available for the obvious reason that they are proprietary, and are the basis on which
people use a search engine. It would be possible to use SavvySearch (or some other
metasearch engine) as the only 'helper' search engine for the
metasearch engine described here, but that would not be fair to SavvySearch's ranking
algorithm.

Barring further objections, the results show that metasearch works well for a repetitive
query, using forced relevance feedback to adjust the system rankings, and that metasearch
works better than any individual search engine, in this context. A number of things may
be done in the future to improve the relevancy of the results presented here and to explore
other aspects of searching.

5.2 Future Directions

5.2.1 Implicit Ranking

One of the less savory aspects of using this system from the user's point of view is that
the user must provide relevance feedback, and in the case of the two traditional style
algorithms, had to do so for 30 documents. This does not fit well with the scanning
method that people use to view WWW documents [Nie97]. A better method would be
implicit ranking: ranking that is done without the user having to press a button. For
example, one could use the time between visits to the page of ranked documents; that is,
the time between when a user follows a link to a ranked document and the time that the user
next follows a link to another ranked document. Obviously, some maximum time would
have to be instituted. The assumption here is that users visit relevant pages for longer
periods of time than nonrelevant ones. Konstan et al. [KMM+97] show, with newsgroup
articles, that there is a high correlation between the time spent reading and the explicit
rating given to an article. This could even be used in addition to explicit measures of
relevance. SavvySearch [DH96] used a visited/not visited measure of performance. This
could easily be made into a boolean relevance ranking, and indicates the relevance of
words presented in the text of the page that displays the ranked documents. This latter
information could also be used as implicit relevance feedback.
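The time-between-visits idea can be sketched as follows. This is only an illustration of the proposal, not something the system implements: the click log format, the 600 second cap, and the 30 second relevance cutoff are all hypothetical values chosen for the example.

```python
def dwell_times(click_log, max_seconds=600.0):
    """Estimate time spent on each ranked document.

    click_log is a time-ordered list of (timestamp_seconds, url) pairs, one
    per link followed from the page of ranked documents. The time spent on
    a document is the gap until the next click, capped at max_seconds.
    The last click has no successor, so no estimate is made for it.
    """
    times = {}
    for (start, url), (next_start, _next_url) in zip(click_log, click_log[1:]):
        spent = min(next_start - start, max_seconds)
        times[url] = times.get(url, 0.0) + spent
    return times


def implicit_relevance(times, threshold_seconds=30.0):
    """Turn dwell times into boolean relevance judgements (assumed cutoff)."""
    return {url: spent >= threshold_seconds for url, spent in times.items()}
```

The cap plays the role of the maximum time mentioned above: a user who leaves for lunch should not make a page look highly relevant.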

5.2.2 Ontologies

As mentioned in Chapter 2, the user's query is expected to fit into some ontology. It might
prove interesting to obtain some ontology, such as that from the Library of Congress or
from an existing directory based search engine such as Open Directory or Yahoo!. One
could also use a narrower field such as Computer Science, using some existing ontology
(such as one from ResearchIndex, formerly CiteSeer [LGB99]). Using this ontology, one
could create a more general profile, P_g, for a topic g by combining existing profiles which
correspond to topics below g in the ontology. Similarly, one could create more specific
profiles by using existing general profiles. This would have to be done at the user's
discretion, since specific profiles would not necessarily generalize, nor vice versa. The
user could even specify the exact combination of profiles to use. The use of an ontology
would be even more powerful through collaboration.
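One possible way of forming the general profile P_g is sketched below. It assumes that a profile is a mapping from terms to weights, that the ontology is a mapping from each topic to its child topics, and that "combining" means averaging term weights over all descendant profiles. All three are assumptions for illustration; the text does not fix a combination rule.

```python
def descendants(ontology, topic):
    """All topics below `topic`; ontology maps a topic to its children."""
    stack, found = list(ontology.get(topic, [])), []
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(ontology.get(current, []))
    return found


def general_profile(ontology, profiles, topic):
    """Average the term weights of every existing profile below `topic`."""
    members = [profiles[t] for t in descendants(ontology, topic) if t in profiles]
    if not members:
        return {}
    combined = {}
    for profile in members:
        for term, weight in profile.items():
            combined[term] = combined.get(term, 0.0) + weight
    return {term: weight / len(members) for term, weight in combined.items()}
```

A user-specified combination, as suggested above, could be obtained by replacing the uniform average with per-profile weights.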


5.2.3 Collaboration

Collaborative learning and recommendation has been used in a number of different
systems [CGM+99, BP98a, KSS97, BS97, KMM+97], and has shown good results. With
ontologies, collaboration with respect to profiles might produce good results as well.
The collaboration would involve having a common ontology used by all people using the
metasearch engine. Different profiles from different people created under a topic in the
ontology could be combined to produce a prototype profile, which could be of general use,
particularly for new users, or for those users who do not wish to have their own, separate
profile. People with similar profiles could also recommend other profiles to each other.
Clustering techniques could be used on the profiles to generate a dynamic ontology to
use, or merely be used to create a dynamic set of bootstrap profiles for new users.
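Finding "people with similar profiles" requires some similarity measure between profiles. A minimal sketch using cosine similarity over term-weight mappings is given below. The choice of cosine similarity, the function names, and the profile representation are assumptions; a clustering step would build on the same measure.

```python
import math


def cosine(p, q):
    """Cosine similarity between two term -> weight profiles."""
    dot = sum(weight * q.get(term, 0.0) for term, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def most_similar_users(profiles, user, k=3):
    """Names of the k users whose profiles are closest to `user`'s."""
    target = profiles[user]
    scored = [(cosine(target, profile), other)
              for other, profile in profiles.items() if other != user]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _score, other in scored[:k]]
```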

5.2.4 Alternate Document or Feature Space

The current metasearch engine uses only the document space as the set of features to use
when comparing documents to profiles. Other features could be used in the profiles and
document representations, such as the linkage structure of retrieved documents. This is
used in search engines such as Google [BP98b], Clever [KRRT99, KKR+99, CDK+99],
and DirectHit [Dir], and is used to identify interesting documents, called authoritative
and hub sites, based on how many documents link to them via URL references, and
how many authoritative and hub sites the document itself links to. This creates a graph
structure, which is analyzed to produce metrics that can be used as features.
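As an illustration of how such link-based metrics might be produced, the sketch below implements the well-known hubs-and-authorities iteration (Kleinberg's HITS) on a small link graph. It is not part of the metasearch engine described here; the graph representation and the fixed iteration count are assumptions.

```python
import math


def hits(links, iterations=20):
    """Hub and authority scores for a link graph.

    links maps each page to the pages it links to. A page is a good
    authority if good hubs link to it, and a good hub if it links to
    good authorities. Scores are L2-normalized after every step.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {page: 1.0 for page in pages}
    auth = {page: 0.0 for page in pages}
    for _ in range(iterations):
        auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {page: v / norm for page, v in auth.items()}
        hub = {page: sum(auth[t] for t in links.get(page, ())) for page in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {page: v / norm for page, v in hub.items()}
    return hub, auth
```

The resulting hub and authority scores for a retrieved document could then be appended to its feature vector alongside the term weights.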

One could also use features such as the grade level of a document, the number of
words in the document, the number of links and images, the recency of the document,
and indications of whether the paper is a research paper, among others [GLG+99, Mla99].
These would provide additional information and could provide additional insight into a
user's criteria for relevance, which likely include things other than the text of a document.

5.2.5 Thresholds

The system described here uses a static threshold to determine when to stop giving
documents to the user for ranking. Documents with a system ranking below this threshold
are never shown to the user, even if fewer than 30 documents had been collected to be
shown. This threshold could be dynamically generated. This could be done by monitoring
current performance, such as one of the F3, T9P or T9U measures, and altering the
threshold based on the value of or the changes in those measures.
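A minimal sketch of both the static cutoff and one possible dynamic adjustment follows. The function names, the step size, the bounds, and in particular the direction of adjustment (stricter when the monitored measure falls, looser when it rises) are assumptions chosen for illustration, not a tested policy.

```python
def documents_to_show(scored_docs, threshold, limit=30):
    """Apply the cutoff: at most `limit` documents, none below threshold.

    scored_docs is a list of (document, system_score) pairs.
    """
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [doc for doc, score in ranked if score >= threshold][:limit]


def adjust_threshold(threshold, recent_scores, step=0.05, low=0.0, high=1.0):
    """Move the threshold according to the latest change in performance.

    recent_scores holds values of the monitored measure (for example T9P),
    oldest first.
    """
    if len(recent_scores) < 2:
        return threshold
    change = recent_scores[-1] - recent_scores[-2]
    if change < 0:
        threshold += step   # performance fell: be stricter
    elif change > 0:
        threshold -= step   # performance rose: admit more documents
    return min(high, max(low, threshold))
```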

5.2.6 Alternative Methods of Learning

Other learners might prove to be more effective on this task. For instance, the palm pilot
topic might be more easily learned by a system that used several learning agents, each
of which would learn a specialized profile. One could be good at retrieving results on
hardware accessories, while another could be good at retrieving results on free productivity
software. Each agent would learn a local version of the more general profile. This
could lead to a better ability to discriminate between relevant and irrelevant documents,
because each agent would have a full sized profile representing a local version of the
general one. This would also provide a kind of symmetry: the metasearch uses metasearch
as the learning component. One candidate for this type of learning is SIGMA [KF96].

5.2.7 Miscellaneous Improvements and Directions

In the course of using the metasearch engine, and in analyzing the results, several points
of potential improvement have come to light.

1. To increase the precision at the beginning, it might be useful to implement a system
   in which, having reached a peak (as detected by the subsequent decline in precision),
   the system would rerank those documents that had been seen before, but were
   unranked by the user. This accounts for the second spike as detailed in section 4.6.

2. An alternative to the above would be to have a system that always reranked those
   documents that had been seen before, but were unranked by the user.

3. There needs to be a better mechanism to detect changed documents such that a
   document would be regarded as unchanged if it were changed in a trivial manner,
   such as a date change, or a single number or word change. This would prevent some
   nonrelevant documents from affecting the precision measures. One such mechanism
   might be to only use a sample of the data in the checksum, such as the 30 bytes
   of data surrounding the most common terms of the document. Alternatively, the
   similarity measure between the VSM versions of a potentially changed document
   could also be used.

4. Aalbersberg [IJA92] obtained favorable results with incremental relevance feedback
   where the user only gave a relevance ranking for a single document at a time. This
   might alleviate some of the strain mentioned in section 5.2.1. The context of the
   problem presented in that paper was slightly different, however, so it might not readily
   apply to the situation outlined here.

5. It is possible that using the random algorithm at the beginning of a run would
   always produce better results than using the plain algorithm. Certainly, the graphs
   in figures 4.3 and 4.2 suggest that using the random algorithm for at least the first
   day would be better than using the initial query as the profile (ie: using the plain
   profile).

6. It should be possible for learners to escape from any local minima they encounter
   when using a poor profile. This means that every learner needs to have the ability
   to revert to old profiles or use old profiles in some way in order to explore the space
   of possible profiles as a means of escaping the local minima.

7. Examination of the feature space used might prove fruitful. For instance, instead
   of merely using two word phrases, one could determine the average distance, in
   the document, between words that are in the profile. A low average distance could
   indicate increased relevance. Of course, as mentioned in the introduction, other
   systems have used other features, as well.

8. To better use the resources available, the interruption mechanism mentioned in
   section 3.3 could be implemented for interrupting calls that blocked on socket
   communications (ie: network communications).
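The average-distance feature suggested in item 7 could be computed along the following lines. This is a sketch under stated assumptions: words are lower-cased alphanumeric tokens, distance is measured in word positions, and the average is taken over gaps between consecutive occurrences of profile words. The function name is hypothetical.

```python
import re


def average_profile_term_distance(text, profile_terms):
    """Mean gap, in word positions, between consecutive profile words.

    Returns None when fewer than two profile words occur in the text,
    since no distance can be measured in that case.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    positions = [i for i, token in enumerate(tokens) if token in profile_terms]
    if len(positions) < 2:
        return None
    gaps = [later - earlier for earlier, later in zip(positions, positions[1:])]
    return sum(gaps) / len(gaps)
```

A document in which the profile words cluster together yields a small value, which under the hypothesis above would suggest increased relevance.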

Bibliography

[All96] J. Allan. Incremental relevance feedback for information filtering. In ACM
SIGIR Conf., August 1996. Zurich, Switzerland.

[Bar94] Carol L. Barry. User-defined relevance criteria: An exploratory study. Journal
of the American Society for Information Science, 45(3):149-159, 1994.

[BBC98] Ana B. Benitez, Mandis Beigi, and Shih-Fu Chang. Using relevance feedback
in content-based image metasearch. IEEE Internet Computing, 2(4):59-69,
July/August 1998.

[BCF+98] Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker. Web
caching and Zipf-like distributions: Evidence and implications. Technical
report, University of Wisconsin-Madison, April 1998. Technical Report 1371,
Computer Sciences Dept.

[BDM+95] Mic Bowman, Peter B. Danzig, Udi Manber, Michael F. Schwartz, Darren R.
Hardy, and Duane P. Wessels. Harvest: A scalable, customizable discovery
and access system. Technical report, University of Colorado-Boulder, 1995.

[BLG98] Kurt D. Bollacker, Steve Lawrence, and C. Lee Giles. CiteSeer: An
autonomous web agent for automatic retrieval and identification of interesting
publications. In Autonomous Agents 98. ACM, 1998.

[Boo98] Gary Boone. Concept features in ReAgent, an intelligent email agent. In
Proceedings of the Second International Conference on Autonomous Agents,
pages 141-143, 1998.

[BP98a] D. Billsus and M. Pazzani. Learning collaborative information filters. In
Proceedings of the Fifteenth International Conference on Machine Learning,
pages 46-54. Morgan Kaufmann, 1998.

[BP98b] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual
Web search engine. In Seventh International World Wide Web Conference,
Brisbane, Australia, 1998.

[BS95] M. Balabanović and Y. Shoham. Learning information retrieval agents:
Experiments with automated web browsing. In AAAI Spring Symposium on
Information Gathering from Heterogeneous, Distributed Environments, March
1995.

[BS97] M. Balabanović and Y. Shoham. Fab: Content-based, collaborative
recommendation. Communications of the ACM, 40(3):66-72, March 1997.

[BSA94] C. Buckley, G. Salton, and J. Allan. The effect of adding relevance
information in a relevance feedback environment. In Proceedings of the Seventeenth
Annual International ACM-SIGIR Conference on Research and Development
in Information Retrieval. Springer-Verlag, 1994.

[BSY95] M. Balabanović, Y. Shoham, and Y. Yun. An adaptive agent for automated
web browsing. Journal of Visual Communication and Image Representation,
6(4), 1995. http://www-diglib.stanford.edu/cgi-bin/WP/get/SIDL-WP-1995-0023.

[But00] Declan Butler. Souped-up search engines. Nature, 405:112-115, May 2000.

[Cal98] J. Callan. Learning while filtering documents. In Proceedings of the ACM
SIGIR Conference, 1998.

[CDK+99] Soumen Chakrabarti, Byron E. Dom, S. Ravi Kumar, Prabhakar Raghavan,
Sridhar Rajagopalan, Andrew Tomkins, David Gibson, and Jon Kleinberg.
Mining the Web's link structure. IEEE Computer, 32(8):60-67, August 1999.

[CGM+99] Mark Claypool, Anuja Gokhale, Tim Miranda, Pavel Murnikov, Dmitry
Netes, and Matthew Sartin. Combining content-based and collaborative
filters in an online newspaper. ACM SIGIR Workshop on Recommender
Systems, August 1999. Berkeley, CA.

[Cha99] Brian D. Chambers. Adaptive bayesian information filtering. Master's thesis,
University of Toronto, 1999.

[Col00] Christian Collberg. 2000. Colloquium at University of Toronto.

[CS98] Liren Chen and Katia Sycara. WebMate: A personal agent for browsing and
searching. In Autonomous Agents '98, pages 132-139. ACM, 1998.

[DH96] Daniel Dreilinger and Adele E. Howe. An information gathering agent for
querying web search engines. Technical Report CS-96-111,
Computer Science Department, Colorado State University, 1996.

[Dir] DirectHit. http://www.directhit.com.

[FFM92] Tim Finin, Rich Fritzson, and Don McKay. A language and protocol to
support intelligent agent interoperability. In Proceedings of the CE & CALS
Washington '92 Conference, June 1992.

[FLM97] Tim Finin, Yannis Labrou, and James Mayfield. KQML as an agent
communication language. In Software Agents. MIT Press, Cambridge, 1997.

[GLG+99] Eric J. Glover, Steve Lawrence, Michael D. Gordon, William P. Birmingham,
and C. Lee Giles. Recommending Web documents based on user preferences.
In Proceedings of the ACM SIGIR '99 Workshop on Recommender Systems:
Algorithms and Evaluation, 1999.

[Goo] Google. http://www.google.com.

[Gus97] Paul Gustafsen. November 1997. Lecture notes from Statistics 303, UBC.

[Har92] Donna Harman. Relevance feedback revisited. In Proceedings of the Fifteenth
Annual International ACM SIGIR Conference on Research and Development
in Information Retrieval, June 1992.

[HC93] D. Haines and W. B. Croft. Relevance feedback and inference networks. In
Proceedings of the Sixteenth Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval, pages 2-11, 1993.

[HD97] A. Howe and D. Dreilinger. SavvySearch: A metasearch engine that learns
which search engines to query. AI Magazine, 18(3), 1997.

[Hul99] David A. Hull. The TREC-7 filtering track: Description and analysis. In
E. M. Voorhees and D. Harman, editors, The Seventh Text REtrieval Conference
(TREC-7), pages 33-56. Department of Commerce, National Institute
of Standards and Technology, 1999.

[Ide71] E. Ide. New experiments in relevance feedback. In Salton [Sal71], pages 337-354.

[IJA92] IJsbrand Jan Aalbersberg. Incremental relevance feedback. In Proceedings
of the Fifteenth Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval, June 1992.

[Jan97] James Jansen. Using an intelligent agent to enhance search
engine performance. First Monday, 2(3), March 1997.
http://www.firstmonday.dk/issues/issue2_3/jansen/index.html.

[Joa97] T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF
for text categorization. In Proc. of the 14th International Conference on
Machine Learning ICML 97, pages 143-151, 1997.

[Kee92] E. Keen. Term position ranking: Some new test results. In Proceedings
of the Fifteenth Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval, pages 66-76, 1992. Available at
http://www.acm.org/pubs/contents/proceedings/ir/133160/.

[KF96] Grigoris J. Karakoulas and Innes A. Ferguson. A computational market
model for multi-agent learning. In AAAI 96 Fall Symposium on Learning
Complex Behaviors in Adaptive Intelligent Systems. AAAI Press, 1996.

[KF98] Grigoris J. Karakoulas and Innes A. Ferguson. Applying SIGMA to the
TREC-7 filtering track. Unpublished paper obtained from Grigoris
Karakoulas, 1998.

[KKR+99] J. Kleinberg, S. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. The
web as a graph: Measurements, models and methods. In Proceedings of the
International Conference on Combinatorics and Computing, 1999.

[KMM+97] J. Konstan, B. Miller, D. Maltz, J. Herlocker, L. Gordon, and J. Riedl.
GroupLens: Applying collaborative filtering to Usenet news. Communications
of the ACM, 40(3):77-87, March 1997.

[KRRT99] S. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Extracting
large-scale knowledge bases from the web. In Proceedings of the International
Conference on Very Large Databases, Edinburgh, Scotland, 1999.

[KSS97] H. Kautz, B. Selman, and M. Shah. Referral Web: Combining social networks
and collaborative filtering. Communications of the ACM, 40(3), March 1997.

[Leb97] Alexander Lebedev. Best search engines for finding scientific
information in the web. Web authored, May 1997. Web address
http://www.chem.msu.su/eng/comparison.html.

[LF97] Yannis Labrou and Tim Finin. A proposal for a new KQML specification.
Technical report, Computer Science and Electrical Engineering Department,
University of Maryland, Baltimore County, Baltimore, MD 21250, February
1997. TR CS-97-03.

[LG98a] Steve Lawrence and C. Lee Giles. Context and page analysis for improved
Web search. IEEE Internet Computing, 2(4), 1998.

[LG98b] Steve Lawrence and C. Lee Giles. Inquirus, the NECI meta search engine. In
Seventh International World Wide Web Conference, pages 95-105, Brisbane,
Australia, 1998. Elsevier Science.

[LG98c] Steve Lawrence and C. Lee Giles. Searching the World Wide Web. Science,
280(5360):98, 1998.

[LG99] Steve Lawrence and C. Lee Giles. Accessibility of information on the web.
Nature, 400(6740):107-109, 1999.

[LGB99] Steve Lawrence, C. Lee Giles, and Kurt Bollacker. Digital libraries and
autonomous citation indexing. IEEE Computer, 32(6):67-71, 1999. Working
system available at http://citeseer.nj.nec.com/cs.

[Mla99] Dunja Mladenic. Text-learning and related intelligent agents: A survey.
IEEE Intelligent Systems, pages 44-54, July 1999.

[MLY+99] W. Meng, K. Liu, C. Yu, W. Wu, and N. Rishe. Estimating the usefulness
of search engines. In 15th International Conference on Data Engineering
(ICDE '99), Sydney, Australia, March 1999.

[MM00] Alvin Moore and Brian H. Murray. Sizing the internet. July 2000.
http://www.cyveillance.com/newsroom/pressr/000710.asp.

[Nie97] Jakob Nielsen. How users read on the web. October 1997.
http://www.useit.com/alertbox/9710a.html.

[Nie98] Jakob Nielsen. Why Yahoo is good (but may get worse). November 1998.
http://www.useit.com/alertbox/981101.html.

[Nie99a] Jakob Nielsen. July 1999.
http://www.useit.com/hotlist/spotlight1999q234.html.

[Nie99b] Jakob Nielsen. 'Top ten mistakes' revisited three years later. May 1999.
http://www.useit.com/alertbox/990502.html.

[Nie00] Jakob Nielsen. Is navigation useful? January 2000.
http://www.useit.com/alertbox/20000109.html.

[NIH+99] Yoshiki Niwa, Makoto Iwayama, Toru Hisamitsu, Shingo Nishioka, Akihiko
Takano, Hirofumi Sakurai, and Osamu Imaichi. Interactive document search
with DualNAVI. In Proceedings of the First NTCIR Workshop on Research
in Japanese Text Retrieval and Term Recognition, pages 123-130, August
1999. Tokyo, Japan.

[Ope] Open directory. http://www.dmoz.org.

Taemin Kim Park. Toward a theory of user-based rele\ance: -4 c d for a

nen paradigm of inquîry. Journal of the Arnerican Society for Information

Science. 45(3):135-141. 1994.

M. Pazzani and D. Billsus. Leaming and revising user profiles: The identifi-

car ion of interesting web sites. k?achine Learning. 27:3 13-331. 1997.

Michael J. Pazzani and Daniel Billsus. Evaluating adaptive web site agents.

Workshop on Recommender Systems Algorithms and Evaluation, 22nd

International Conference on Research and Development in Information Retrieval,

1999.

Gabriela Polcicova. Recommending HTML-documents using feature guided

automated collaborative filtering. In Johann Eder, Ivan Rozman, and Tatjana

Welzer, editors, ADBIS Short Papers, pages 81-87. Institute of

Informatics, Faculty of Electrical Engineering and Computer Science, Smetanova 17,

SI-2000 Maribor, Slovenia, 1999.

M. Porter. An algorithm for suffix stripping. Program: Automated Library

and Information Systems, 14(3):130-137, 1980.

Stephen Robertson and David A. Hull. Guidelines for the TREC-9 filtering

track. http://www.soi.city.ac.uk/~ser/filterguide.htm.

Joseph J. Rocchio. Relevance feedback in information retrieval. In Gerard

Salton, editor, The SMART retrieval system: experiments in automatic

document processing, pages 313-323. Prentice-Hall, Englewood Cliffs, US, 1971.

Mehran Sahami. Using Machine Learning to Improve Information Access.

PhD thesis, Stanford University, December 1998.

Gerard Salton, editor. The SMART retrieval system: experiments in

automatic document processing. Prentice-Hall, Englewood Cliffs, US, 1971.

Gerard Salton and Chris Buckley. Improving retrieval performance by

relevance feedback. Journal of the American Society for Information Science,

41(4):288-297, 1990.

Mathew Schwartz. Sharper staples. June 2000.

http://www.computerworld.com/cwi/story/0,1199,S~~~~~~STO457S~,00.html.

Gabriel L. Somlo and Adele E. Howe. Agent-assisted internet browsing. In

Proceedings of the Workshop on Intelligent Information Systems at the 16th

National Conference on Artificial Intelligence (AAAI '99), 1999.

G. Salton and M.J. McGill. Introduction to Modern Information Retrieval.

McGraw-Hill, New York, New York, 1983.

StatMarket. StatMarket search engine ratings. June 2000.

http://www.searchenginewatch.com/reports/statmarket.html.

Louise T. Su. The relevance of recall and precision in user evaluation. Journal

of the American Society for Information Science, 45(3):207-217, 1994.

Danny Sullivan. Media Metrix search engine ratings. March 2000.

Danny Sullivan. NPD search & navigation study. June 2000.

Danny Sullivan. Search engine alliances chart. June 2000.

http://searchenginewatch.com/reports/~~

Danny Sullivan. Survey reveals search habits. June 2000.

http://searchenginewatch.com/sereport/00/06-r .html.

[Tog98] Bruce Tognazzini. Scaling information access. August 1998.

http://www.asktog.com/columns/008scaleInfo.html.

[ZFJ97] L. Zhang, S. Floyd, and V. Jacobson. Adaptive web caching. In NLANR

Web Cache Workshop, June 1997. http://www-nrg.ee.lbl.gov/floyd.