oasis at ntcir-5: web navigational retrieval...

OASIS at NTCIR-5: WEB Navigational Retrieval Subtask

Vitaly KLYUEVSoftware Engineering Lab

The University of AizuTsuruga, Ikki-machi, Aizu-Wakamatsu city, Fukushima, 965-8580, Japan

[email protected]

Abstract

We experienced negative results participating in theNTCIR-5 Web Navigational Retrieval Subtask: theOASIS system, which is a distributed search systembased on VSM and full text indexing, failed to retrieverelevant documents from the huge data set of JapaneseWeb pages when the number of relevant documentsin the collection was relatively small. This paper de-scribes our approach and the techniques used.Keywords: search engine, distributed system, full-textphrasal indexing, vector space model.

1 Introduction

At the��

NTCIR Workshop, OASIS participatedin the NTCIR-5 Web Navigational Retrieval Subtask(the Navi-2 Subtask). OASIS stands for the Open Ar-chitecture Server for Information Search and Deliv-ery [4]. Our aim was to test the ability of the systemto search huge datasets of Japanese documents.

The basic idea of the OASIS approach is the fol-lowing. The OASIS service presents a distributedsystem of search engines in the Internet. The stan-dard server usually creates and supports one or moresubject-specific (topic related) indices. The user queryis only propagated to a subset, which possibly containsthe requested information. After merging, returned re-sults are presented to the user in the usual search en-gine manner [6].

To meet all requirements of the Navi-2 Subtask, thecore of the system was changed. The Vector SpaceModel behind the system and an idea to use the dis-tributed approach to index the data remained.

A test dataset provided by the organizers was di-vided into three parts without any topical principle anddistributed between three different servers. As a re-sult, the propagation and merging mechanisms werechanged: Every query was propagated to all sets ofservers. An approach adapted to merge retrieval re-sults is discussed in section 2.

We applied a classical variant of the Vector SpaceModel (VSM) to represent documents and queries inthe memory of a computer. They were considered asvectors in the term space. A measure of similarity be-tween the two vectors was computed. To calculate thevalue of the jth entry in the vector corresponding todocument �� , the following formula was used [2]:

� �� where �� is the number of occurrences of term indocument �� , � � � ��! #"%$

&.�

is the total num-ber of documents in the data set and

� � is the numberof documents which contain . A similarity betweenquery ' and document �(� is defined by the product oftwo vectors in VSM. To compute its value, we appliedthe standard formula as follows:

) � '+*,�-� & ��.0/21

��354 � � �� &

where 364 and� � are term weights of vectors ' and

�+� respectively; t is the total number of terms.The full-text indexing technique was applied to the

dataset offered by the organizers.

2 Methods Used

We applied the techniques characterized below toenhance the vector space model implemented:

1. The indexing units used by our system are wordsextracted by the Mecab lexical analyzer andpseudo-collocations generated automatically. Weutilized the following simple approach to com-pute them. Every term generated by Mecab hasa part of speech mark. Our algorithm employednoun, adjective and unknown word marks. Eachpseudo-collocation, consisting of two or threeterms, was constructed from two to three nounsor unknown words located next to one anotherin a document. Two term collocations were builtfrom an adjective and a noun or an adjective and

Proceedings of NTCIR-5 Workshop Meeting, December 6-9, 2005, Tokyo, Japan

an unknown word. If their position in the doc-ument was next to each other, a new collocationwas taken into account.

Japanese stop words from the list provided in [1]were applied to eliminate common terms.

2. We took into consideration how often each termappears in document headings (in the title sec-tion) and enhanced weights (1.5 times) accord-ingly.

3. Weights of the terms gathered around URLs wereincreased in a ‘wave’ style manner. The follow-ing formula was applied:

3�� 3 �� 3 �� *�� * �� *

where � as a distance in words from the URL � ;3 �� is a weight of the term.

4. A two-stage retrieval technique was used to getand merge results obtained from the servers:First, terms from the NARR element of the cor-responding topic description were submitted toeach server. The aim was to get documents in-cluding at least one searched term. Second, alldocuments after retrieval were put into one pool.An examination of this pool gave the final resultof the search. The latter query included termsfrom the TITLE element of the topic description.

5. Alternatively, the Borda-fuse algorithm intro-duced in [3] was adapted to the configuration ofour system to merge the search results and assignthe final relevant score. As we mentioned in sec-tion 1, there was no topical distribution of databetween servers: They served different parts ofthe dataset. Each of our three servers retrieved upto �� documents in response to a query submitted.The top ten documents from each returned listwere given a new score according the formula.

) � �� & � � � ��

* � � � *��* � �

where � �� is ith ranked document retrieved byserver �k;

�� is a number of documents retrieved

by server �k. Because each server kept differentsized data, the damper coefficients were utilizedto adjust the obtained score. These coefficientsreflect the size of the data of each server: Theyare �� , �� , and �� .The rest of documents fromthe retrieved lists kept their initial score assignedby the corresponding server. Finally, all retrieveddocuments were ranked in order of the points cal-culated. This scoring mechanism requests theservers to get only a small number of documents

Table 1. Parameters of the ServersNumber Processor Memory Hard Drive1 Dual Intel 2GB 500GB

Xeon 2.8GHz2 Intel 1.7GHz 1GB 1TB3 Intel 1.8GHz 2GB 300GB

because fewer hits returned by a server, the higherthe score given to them, and the higher the placeon the list presented to the end user they receive.

3 System Description

The prototype which was used in the Navi-2 Sub-task included three servers on the base of PC compati-ble computers running Linux 9.0 and Linux 7.3. Theirparameters are presented in Table 1.

Servers were attached to each other using a gigabitnetwork connection. Full-text indexing took approxi-mately 2 months. Half of the data set was put on server2. The second half was distributed between server 1and 3. Server 1 kept a slightly larger portion of data.

4 Search Results

Only mandatory queries were submitted to the sys-tem: They were taken from topics � �� !� �#"�" . Aretrieval process was carried out in a fully automaticway. The results of three official runs were submit-ted as an outcome of our tests. They are OASIS-01,OASIS-02, and OASIS-03. The search techniques ex-plained in section 2 (items 1 to 4) were applied to runOASIS-02 and other sets of techniques (items 1, 2, 3and 5) were activated for run OASIS-03. During theOASIS-01 run, one of our servers generated an errorand stopped. This run did not make a retrieval fromthe whole data set. Affected topics were � �� $�%�#��&#' .

The system failed in the retrieval of relevant docu-ments. Table 2 gives some examples. It introduces thenumber of retrieved documents by OASIS (OASIS-03run) in response to queries submitted and the numberof relevant documents (A and B categories: relevantand partially relevant) defined by assessors.

5 Failure Analysis

The problems we faced in conducting this researchwere as follows:

( The amount of data was critical for the hardwarewe used: a deficit of the hard drive space. A por-tion of data (about

� ��#) * ) was not indexed. Wewere worried about overrunning indexes and al-locating working space big enough on hard disks.


Table 2. Retrieval Statistics: Some Exam-ples

Query Number of Number of relevantnumber retrieved documents in the

documents collection1003 5 21004 40 31005 22 11006 14 121008 3 51010 10 121394 21 61395 40 211397 37 11398 38 1

( Indexing and searching such a dataset were test-ing the system itself and checking its ability tomanage large amounts of data: Some bugs werefound and fixed.

( We dealt with a deficit of time to test differentapproaches.

( Indexing raw data was not successful for our sys-tem because of a large amount of errors on Webpages in the corresponding data set. In connec-tion with this issue the following note is still im-portant: Any parser which is designed to run onthe entire Web must be capable of handling ahuge array of possible errors. These include ty-pos in the HTML tags, kilobytes of zeros in themiddle of tags, HTML tags nested hundreds deep,etc [5]. The cooked data prepared by the organiz-ers did not include the HTML information. Wedid not take into account the HTML structure ofthe data. The title tag was only an exception.

( Link information was not taken into account be-cause of problems with their application.

We failed to retrieve relevant documents applyingan improved VSM approach to the huge dataset whenthe number of relevant documents in the collectionwas relatively small. The results should be investi-gated carefully. It was not a problem of the mergingtechnique we used: A merging process did not discardrelevant pages. Our servers did not retrieve relevantdocuments. We see the following reasons for this:

( Some bugs remain in our software.

( Probably, adding heuristics and incorporatingnegligible improvements to the commonly usedmodels (VSM, probabilistic, etc.) cannot be an

efficient solution to discover a small amount rel-evant documents and retrieve them successfullyfrom very huge data collections. We think theycannot work well for this task.

( Scientists working in the text information re-trieval area need a new language model applica-ble to natural languages (Japanese, English, etc.)to design new systems which can be tools for theNavi-2 Subtask.

References

[1] Mnogosearch web search engine software (a list ofjapanese stopwords). http://www.mnogosearch.org/.

[2] David A. Grossman and Ophir Frieder. Information Re-trieval: Algorithms and Heuristics. Kluwer AcademicPublishers, 2000. (ISBN: 0-7923-8271-4).

[3] J.S. Asiam and M. Montague. Models for metasearch.In SIGIR 2001, pages 276–284, New Orleans, Luisiana,USA, 2001.

[4] V. Kluev. Compiling document collections from the in-terner. SIGIR Forum, 34(2):9–14, Fall 2000.

[5] Sergey Brin and Lawrence Page. The anatomy of alarge–scale hypertexual web search engine. In Proceed-ings of 7th International World Wide Web Conference(WWW-7), 1998.

[6] V. Kluev. Web search experiments using oasis. In Pro-ceedings of the Third NTCIR Workshop Meeting, Tokyo,Japan, 2002.


oasis at ntcir-5: web navigational retrieval...

Documents