
Project Report

Comparative Analysis of Data Structures for Inverted File Indexing in Web Search Engines

Ingrid Biswas, Vikram Phadke

CSE 598: Design and Analysis of Algorithms Project

Computer Science & Engineering Department, Arizona State University

[email protected],[email protected]


ABSTRACT

1 INTRODUCTION

2 BACKGROUND: INFORMATION RETRIEVAL SYSTEMS

3 WEB SEARCH ENGINE ARCHITECTURE

3.1 Crawler

3.2 Repository

3.3 Parser

3.4 Indexer

3.5 Page ranking module and Query Engine

4 THE GOOGLE SEARCH ENGINE

5 TEXT INDEXING AND RETRIEVAL

5.1 Signature files

5.2 Vector space models

5.3 Latent semantic indexing (LSI)

5.4 Inverted File Indexing

5.5 Inverted File Compression

5.6 Representing and Accessing Lexicons

6 IMPLEMENTATION

7 ANALYSIS AND RESULTS

7.1 Inverted File Indexing Using Sorted Array

7.2 Inverted File Indexing Using Hash Table

7.3 Inverted File Indexing Using BTrees

7.4 Comparative analysis of all three data structures

7.5 Search and retrieval efficiency of the data structures

7.6 BTrees and External memory

8 FUTURE RESEARCH DIRECTIONS

9 CONCLUSION

10 References


ABSTRACT

Search engines of today serve as portals to the millions of web pages that form the WWW (World Wide Web). They are probably the most popular examples of information retrieval tools. They contain several major components that interact with one another, namely the crawler, storage module, parser, indexer, query processor, and ranking module. Efficient algorithms and data structures can make the difference between an average and an exceptional search engine. Search engines today have to index millions of pages. Our work studies text indexing in the context of web search engines. In particular, the inverted file indexing algorithm is studied in detail. Different data structures are compared in terms of the time required to create the index, the time required to query the index, and the space footprint.

1 INTRODUCTION

Search engines are extremely useful information retrieval tools. They are used for just about everything, from shopping for electronics to looking for research papers. With the size of the WWW growing rapidly, search engine technology faces increasing challenges. Our work had the following objectives: (1) gain an in-depth understanding of search engine technology, (2) look at search engines from the perspective of algorithms and data structures, and (3) study the different modules of search engines in detail, analyzing their algorithms and data structures.

We focus on the indexing module of search engines and analyze the inverted file indexing algorithm. Different kinds of data structures can be used to implement the index; sorted arrays, tries, BTrees, and hash tables can all be used to create it. Various issues, such as the time required to create the index, the space footprint of the index, and the time required for retrieval, arise when discussing efficient data structures for the indexing algorithm. Our work focuses on comparing data structures for the inverted file indexing algorithm in terms of the time required to create the index. An outline of this report follows. Section 2 provides some background on information retrieval techniques. Section 3 discusses web search engines and their various modules. Section 4 describes in detail the working of the Google search engine. Section 5 describes the various algorithms used for text indexing; it describes in detail the inverted file indexing algorithm and the data structures that can be used to store the index. Section 6 describes the design of the "evaluation environment" that was used for comparing the performance of the inverted file indexing algorithm when different data structures are used. Section 7 explains the results of the experiments. Section 8 discusses future research directions based on the experiences with this work.

2 BACKGROUND: INFORMATION RETRIEVAL SYSTEMS

Information retrieval (IR) is a general term used to identify all those activities that enable us to choose documents from a given collection. These could be documents that belong to a particular domain of interest or a particular topic. The activities we are concerned with are those that permit us to choose, in an automatic way, the documents that are probably relevant to the initial information need. The main criterion for automatic information retrieval is that the collections of documents available are in digital form. In traditional IR, the collection is a set of documents that has been put together because it is related to a specific context of interest for the users who are going to use it. An IR collection is a set of documents that have certain properties or features in common. These features are used to cluster similar documents, enabling faster retrieval of documents pertaining to the user query.

It is possible to use a traditional IR system and its document collection in a web-based IR system, but there are issues that need to be addressed. The IR system needs to be made available to the end user through a program that connects the IR system sitting on the web server to a web page that acts as an interface between the user and the IR system.

Retrieving information from the Internet is a common practice for Internet users. However, the size and heterogeneity of the web make it very challenging, and they reduce the effectiveness of information retrieval techniques designed for traditional data sources. Many software tools are available these days for web information retrieval, such as search engines (Google, AltaVista), hierarchical directories (Yahoo), and many other software agents.

Web users started to have proper tools for accessing documents on the Internet in 1994. Before that year, it was only possible to use tools that indexed and managed the title, the URL, and some small parts of web pages [Maud98]. Since then there have been so many advances in this field that it can be looked at as a big event in the history of information retrieval technology. WebCrawler, developed at the University of Washington (USA) and made available in April 1994, was the first tool that allowed the user to search the full text of entire web documents [Maud98]. Lycos, another web search engine, was developed at Carnegie Mellon University (USA) and released in July 1994 [Herr99]. So we can say that from 1994 on it has been possible to have web tools with effective IR functionalities.

Since 1994, IR systems for the web have flourished with innovative and better tools for effective and faster information retrieval. AltaVista entered the scene in 1995 with a number of innovative features, and in the following years many other search tools were made available.

3 WEB SEARCH ENGINE ARCHITECTURE

3.1 Crawler

The crawler module retrieves pages from the Web. It typically starts with an initial set of URLs, which is fed into the crawler in a queue structure. The crawler then gets a URL from this queue one at a time. There are different ways to choose which URL to visit next, namely depth-first, breadth-first, or randomly, depending on the implementation of the crawler. The crawler downloads the web page, extracts any URLs in the downloaded page, and adds the URLs it found to the same queue. This action continues until the crawler decides to stop; the crawler will stop once it has visited all the web page URLs in its queue. There are several issues that need to be taken into consideration regarding how the crawler behaves. The main issues stem from the enormous size of the Internet. It is impossible for the crawler to download all pages on the Web; even the most comprehensive search engine can index only a small fraction of the entire Internet. Based on this fact, it is necessary for the crawler to prioritize the URLs in such a way that it visits "important" pages first. This ensures that the part of the Web that is visited by the crawler is more meaningful.

The operation of the crawler can be summarized in the following steps. First, it is fed a URL or a set of URLs. The crawler picks a URL from this queue and fetches the web page at that URL. It then parses this page and extracts links to other URLs from it. It filters out unwanted links and links that it has already visited, and adds the remaining URLs to the queue. This is the basic working of all crawlers. The main difference between crawlers lies in the algorithm they use for choosing the next URL. Some crawlers use simple policies to pick the next URL, such as random, FIFO, or LIFO ordering; others use a priority algorithm such as the one in [Kwon00], discussed below.
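As a small illustration of this basic loop (a sketch only, not the crawler used in our implementation; fetchPage and extractLinks are hypothetical helpers standing in for the HTTP and HTML-parsing code), a breadth-first crawler can be written as follows:

import java.util.*;

// Minimal breadth-first crawler loop (illustrative only; fetchPage and
// extractLinks are assumed helpers, not part of any real library).
public class SimpleCrawler {
    private final Queue<String> frontier = new LinkedList<>();
    private final Set<String> visited = new HashSet<>();

    public void crawl(List<String> seedUrls, int maxPages) {
        frontier.addAll(seedUrls);
        int fetched = 0;
        while (!frontier.isEmpty() && fetched < maxPages) {
            String url = frontier.poll();          // FIFO order => breadth-first visiting
            if (!visited.add(url)) continue;       // skip URLs already visited
            String page = fetchPage(url);          // download the page
            fetched++;
            for (String link : extractLinks(page)) {
                if (!visited.contains(link)) frontier.add(link);
            }
        }
    }

    private String fetchPage(String url) { /* HTTP GET omitted */ return ""; }
    private List<String> extractLinks(String page) { return Collections.emptyList(); }
}

Swapping the FIFO queue for a stack or a priority queue changes the visit order to depth-first or priority-driven crawling.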

After the crawler has downloaded a number of pages, it sends the downloaded web pages to the repository module to be stored. It then needs to make sure that the repository of web pages it has stored stays fresh. For this the crawler needs to revisit the same URLs in order to detect changes in the downloaded pages and refresh the collection. Because web pages change at very different rates [Cho00], and because of the enormous size of the web, the crawler is not able to revisit all the web pages quickly enough to refresh them all. Hence, it needs to decide which pages to revisit and which pages to skip. This decision significantly impacts the "freshness" of the downloaded documents. As an example, if a certain page changes rarely, the crawler may not want to revisit that page very often; that way it is able to visit more pages that change more frequently.

[Kwon00] gives a prediction algorithm that can be used to estimate when a particular web page will be updated, helping the crawler decide when to visit the page. The paper describes how to calculate the update frequency of each page using three main factors. First, we need LA(P), the local average of page P, i.e., the average update frequency of the pages that are in proximity to P and whose frequencies lie within a certain threshold of each other. Second, we need the history average of the page, HA(P), which gives the average frequency calculated from the page's modification history. Third, we need to calculate the tolerance of the page, which defines how close this page is to other pages; this value is used in calculating LA(P). The formula used to calculate the update frequency FR(P) of a given page P is given below. It uses the terms we calculated above:

FR(P) = HA(P) * (1 - LW(n)) + LA(P) * LW(n)

where LW is a weight factor associated with the local average LA(P) and n is the number of history records. The algorithm makes a few simple but useful assumptions. First, recent history is much more important than old history. Second, history data of the page are more trustworthy than locality data, provided that we have enough history records. The equations for the history average and the local weight are defined based on these two assumptions.
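As a minimal sketch of how this estimate could be computed, assuming HA(P), LA(P), and LW(n) have already been obtained (their exact definitions are given in [Kwon00] and are not reproduced here):

// Illustrative computation of the update-frequency estimate FR(P) from [Kwon00];
// historyAverage, localAverage, and localWeight are assumed to be computed elsewhere.
public final class UpdateFrequency {
    private UpdateFrequency() {}

    /**
     * @param historyAverage HA(P): average frequency from the page's modification history
     * @param localAverage   LA(P): average frequency of pages in P's proximity
     * @param localWeight    LW(n): weight given to the local average, based on n history records
     */
    public static double estimate(double historyAverage, double localAverage, double localWeight) {
        return historyAverage * (1.0 - localWeight) + localAverage * localWeight;
    }
}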

Due to the enormous size of the Web, crawlers often run on multiple machines and download pages in parallel [Cho00]. This parallel processing is needed so that the crawler is able to download a substantial number of pages in a reasonable amount of time. These parallel crawlers need to be coordinated with each other so that multiple crawlers do not visit the same URL multiple times.

3.2 Repository

The page repository is a scalable storage system for managing large collections of web pages. The repository needs to perform two main functions. Firstly, it needs to provide an interface for the crawler to store the web pages it has crawled. Secondly, it must provide an efficient access API that the indexer module can use to retrieve the pages. There are a few challenges that the storage module needs to address. Due to the large volume of data involved, it needs to be scalable and distributed, so that the data it stores can be spread over a network of servers. The repository must also support different access modes, namely random access and streaming access.

Random access is used to quickly retrieve a specific web page, given the page's unique identifier. The query engine module needs random access to the repository to serve out web pages to the end user depending on their query string. Streaming access is used to receive the entire collection, or a significant subset, as a stream of pages; the indexer module uses streaming access to process and analyze pages in bulk. The repository also needs to deal with issues regarding updates to newer versions of the web pages. In particular, it needs to be able to identify pages that are obsolete (deleted from their websites). When web pages are removed from their websites, the repository is not informed, so it needs a mechanism to identify and remove obsolete pages from its storage.

3.3 Parser

The parser module is an intermediate module between the repository and the indexer. The indexer module uses this module to extract the web pages from the repository and process them to remove the HTML tags. The parser then takes this page content, i.e. the web page without all the tags, and parses the page again to remove any stop list words. Stop list words are words that occur very frequently and do not help in any way to differentiate between the documents; in other words, they appear in almost all the documents. Examples of stop list words are a, and, the, if, how, etc. The indexer module will take the page content left by the parser and use it to index the text. The parser then extracts the keywords from the page content and creates a forward index for each page. A forward index is a structure that stores a list of all the keywords that appear in a web page along with the number of occurrences of each keyword.


3.4 Indexer

The indexer module builds a variety of indexes on the pages in the repository. It takes the forward index structure built by the parser module and creates an inverted index structure. An inverted index structure contains, for each keyword, the list of URLs that the keyword appears in, and it is indexed on the keywords. The indexer module creates two main indexes: a text index to index all the keywords and a link index to index all the links on the web pages.

Text-based retrieval, namely searching for pages containing some keywords, is the main method for identifying pages relevant to a query. Various methods have been used to implement support for text-based retrieval over text document collections. Examples include suffix arrays [Manb90], inverted files or inverted indexes [Salt89, Witt94], and signature files [Falo84]. Inverted indexes have traditionally been the index structure of choice on the Web; they are discussed in detail in Section 5.

The whole Web is modeled as a graph with nodes and edges [Brod00]. Each node in the graph is a web page, and a directed edge from node A to node B represents a hypertext link in page A that points to page B. A link index is a subset of this graph that contains the web pages (nodes) that have been visited and the links (edges) that have been found on those pages. The most common structural information used by search algorithms [Brin98] is neighborhood information: for a given page P, the outward links are the set of pages pointed to by P, and the incoming links are the set of pages pointing to P. Neighborhood information of the original graph and its subgraphs can be easily retrieved using adjacency list representations [Aho83] of the graph. The information stored in these adjacency lists can be used to extract other structural properties of the Web graph. For example, if we need to retrieve pages that are related to a given page, the notion of sibling pages is often used; this information about siblings can be easily derived from the adjacency list structures described above.
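As a small illustration of this idea (a sketch only, not the structure used by any particular engine), a link index can keep one adjacency list for outgoing links and one for incoming links, from which siblings are easy to derive:

import java.util.*;

// Illustrative adjacency-list link index: outgoing and incoming neighbour
// lists per page, keyed by an integer document ID.
public class LinkIndex {
    private final Map<Integer, List<Integer>> outLinks = new HashMap<>();
    private final Map<Integer, List<Integer>> inLinks = new HashMap<>();

    // Record a hyperlink from page 'from' to page 'to'.
    public void addLink(int from, int to) {
        outLinks.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        inLinks.computeIfAbsent(to, k -> new ArrayList<>()).add(from);
    }

    public List<Integer> outgoing(int page) {
        return outLinks.getOrDefault(page, Collections.emptyList());
    }

    public List<Integer> incoming(int page) {
        return inLinks.getOrDefault(page, Collections.emptyList());
    }

    // Sibling pages of p: other pages pointed to by some page that also points to p.
    public Set<Integer> siblings(int p) {
        Set<Integer> result = new HashSet<>();
        for (int parent : incoming(p)) {
            result.addAll(outgoing(parent));
        }
        result.remove(p);
        return result;
    }
}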

Small graphs of hundreds or even thousands of nodes can be efficiently represented by any one of a variety of well-known data structures [Aho83]. However, the biggest challenge is to do the same for a graph with several million nodes and edges. The Connectivity Server in the AltaVista search engine, which is used to deliver linkage information for all pages retrieved and indexed, is described in [Bhar98]. Even though link-based techniques are used to enhance the quality and relevance of search results, text-based structures remain the most important ones.

3.5 Page ranking module and Query Engine

The query engine takes the query string from the user containing the terms to search for and retrieves pages that are likely to be relevant to the query. The relevant pages that are retrieved need to be ranked. Traditional information retrieval (IR) techniques do not have an effective algorithm for ranking query results, for the reasons listed below. Firstly, the Web is very large and has great variation in the content, amount, and quality of information present in its pages. Hence, many pages that contain the search terms may not be relevant to the user or could be of poor quality. Secondly, most web pages are not very self-descriptive, so the traditional IR techniques that examine the contents of a page do not work very well. An often cited example to illustrate this issue is the search for "search engines" [Klei99]: the homepages of most of the important search engines do not contain the text "search engine". Spamming is also a big issue when ranking pages. Web developers have started adding misleading terms to their pages so that search engines will rank them higher. This is another reason the content of pages alone cannot be used to rank them.

As we have mentioned earlier, the web is viewed as having a graph structure. The information carried by this link structure can be used in ranking pages. For example, if there is a link to page B in a web page A, then it implies that page A is recommending page B. This recommendation can be used to assign an importance to a web page based on how many pages refer to it. Some new algorithms have been proposed that make use of this link structure. These algorithms are based not only on the content of the page but also on the link structure, and hence they are generally better than the traditional IR algorithms. Spamming has entered even this aspect of the web, with web developers adding extra links to particular pages. The advantage, however, is that they are not able to influence the link structure at a global level; hence link analysis algorithms working at a global level are relatively robust against spamming.


Page and Brin describe a global ranking scheme, called PageRank, in [Page98] that tries to capture the notion of the "importance" of a page. The rank of a page can be defined based on the number of pages that link to it; in other words, a page is more important than another page if it has more incoming links. The rank of a web page A can thus be defined as the number of pages in the Web that point to A, and could be used to rank the results of a search query. This is known as citation ranking. It does not work very well against spamming, as it is very easy to artificially create a huge number of pages that point to the desired page.

The PageRank algorithm extends the basic citation-ranking algorithm. It takes into consideration how important the pages are that point to this web page. Thus if an important web page points to a page A, then A receives more importance in its ranking than if an unimportant page pointed to it.

The definition of PageRank is recursive, and the importance of a page both depends on and influences the importance of other pages. A simple definition of the PageRank algorithm that captures the above intuition is given below. Let us denote the pages on the Web as 1, 2, ..., m. forward(i) denotes the number of outgoing (forward) links from a page i, and back(i) denotes all the pages that contain a link to page i (back links). In this algorithm we assume that we can reach every page from any given page, i.e., the web forms a strongly connected graph. A simple formula to calculate the PageRank of page i, denoted by rank(i), is given by

rank(i) = Σ_{j ∈ back(i)} rank(j) / forward(j)

The division by forward(j) captures the intuition that pages which point to page i distribute their rank evenly as a boost to all of the pages they point to.
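A minimal power-iteration sketch of this simplified formula (no damping factor, illustrative names, and relying on the strong-connectivity assumption above) might look as follows:

import java.util.*;

// Power-iteration sketch of the simplified PageRank formula
// rank(i) = sum over j in back(i) of rank(j) / forward(j).
// No damping factor; assumes the graph is strongly connected.
public class SimplePageRank {
    public static double[] compute(List<List<Integer>> outLinks, int iterations) {
        int n = outLinks.size();
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);               // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                int forward = outLinks.get(j).size();
                if (forward == 0) continue;        // dangling pages contribute nothing here
                double share = rank[j] / forward;  // page j splits its rank evenly
                for (int i : outLinks.get(j)) {
                    next[i] += share;
                }
            }
            rank = next;
        }
        return rank;
    }
}

In practice a damping factor is added and dangling pages are handled explicitly, but the division of each page's rank among its forward links is the same.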

4 THE GOOGLE SEARCH ENGINE

In this section we look at the architecture and working of a very popular search engine, Google. Most of Google is implemented in C and C++ for efficiency, and it can run on Linux and Solaris servers.


Google has several distributed servers for web crawling, i.e. finding URLs and downloading web pages from the Internet. This helps with parallel processing, as there are millions of web pages all over the Internet. At the start of each run, the URL Server has a list of URLs that need to be crawled. The URL Server sends a list of these URLs to the crawlers. The crawlers use these URLs as a starting point to go and fetch more URLs from the web pages. The fetched web pages are then sent to the store server, where they are compressed and sent to the repository for storage. This function of the crawler is an ongoing process: it keeps revisiting the same URLs to see whether the web pages have been updated since it last fetched them. If they have been updated, the crawler gets the new web page and sends it to the store server. Every web page is given a document ID number that is assigned when its URL is parsed out of a web page.

The next step in the process is to index and sort the web pages; this is done by the indexer and the sorter. The indexing module takes the web pages from the repository, uncompresses them, and parses them. Each web page is then converted to a forward index structure that contains all the words in that web page along with their occurrences, i.e. the number of times each word occurs in the document. The position of the word in the document along with the font size and capitalization are also stored in the forward index. The indexer then distributes this structure into a set of barrels, creating a forward index that is partially sorted. The indexer also parses out all the links in the web page and stores them in the anchors file. Information from this file can be used to easily determine where each link points to and from, and the text that is part of the link.

The URL Resolver reads the URLs from the anchors file and converts them to absolute document IDs. It puts the anchor text into the forward index associated with the document ID. It also generates a links database consisting of pairs of document IDs; this links database is used later in the page ranking algorithm.

The sorter takes the documents in the barrels, which are sorted by document ID, and creates an inverted index. The inverted index contains, for each word, the documents it is associated with and its occurrences in those documents. DumpLexicon takes this inverted index along with the lexicon produced by the indexer module and creates a new lexicon to be used by the searcher. The searcher, which is run by the web server, takes the query words from the user and uses the lexicon produced by DumpLexicon, the inverted index, and PageRank to answer the query.

Figure 1. High level Google Architecture [Huan00]


5 TEXT INDEXING AND RETRIEVAL

Indexing addresses the issue of how information from a collection of documents should

be organized so that queries can be resolved efficiently and relevant portions of the data

extracted quickly. We will describe a variety of indexing methods. To be as general as

possible, a document collection or document database can be treated as a set of separate

documents, each described by a set of representative terms, or simply terms (each term

might have additional information, such as its location within the document).

An index must be capable of identifying all documents that contain combinations of

specified terms, or that are in some other way judged to be relevant to the set of query

terms. The process of identifying the documents based on the terms is called a search or

query of the index.

Applications of indexing

Indexing has been used for many years in a wide variety of applications. It has gained particular recent interest in the area of web searching (e.g. AltaVista, HotBot, Lycos, Excite). Some applications include web searches, library article and catalog searches, law and patent searches, and information filtering (e.g. retrieving interesting New York Times articles).

The goals of these applications are:

Speed -- minimal information retrieval latency

Space -- storing the documents and indexing information in minimal space

Accuracy -- returning the "right" set of documents

Updates -- the ability to modify the index on the fly (only required by some applications)

Figure 2 provides an overview of the indexing and searching process.

Figure 2: Overview of indexing and searching

The main approaches that are used for Text Indexing are as follows:

Full text scanning (e.g. grep, egrep)

Inverted file indexing (most web search engines)

Signature files

Vector space model

Each one of these approaches will be explained in detail in the following sections. Our

work focuses on Inverted file indexing and efficient data structures that can be used.

The different types of queries that an index may have to support are boolean (and, or, not), proximity (adjacent, within), keyword set, and queries relative to other documents (relevance feedback). The index should also allow for prefix matches (AltaVista does this), wildcards, and edit distance bounds (egrep).

There are some general techniques that are used by all indexing approaches irrespective of the algorithm or data structures. These are:

case folding: London = london

stemming: compress = compression = compressed (several off-the-shelf English language stemmers are available)

ignoring stop words: to, the, it, be, or, ... (problems arise when searching on "To be or not to be" or "the month of May")


thesaurus: fast = rapid (hand-built clustering)

Granularity of Index

The granularity of the index refers to the resolution to which term locations are recorded within each document. This might be at the document level, at the sentence level, or at the level of exact locations. For proximity searches, the index must know exact (or near exact) locations.

5.1 Signature files

Signature files are an alternative to inverted file indexing. The main advantage of signature files is that they do not require that a lexicon be kept in memory during query processing; in fact they do not require a lexicon at all. If the vocabulary of the stored documents is rich, then the amount of space occupied by a lexicon may be a substantial fraction of the amount of space filled by the documents themselves.

Signature files are a probabilistic method for indexing documents. Each term in a document is assigned a random signature, which is a bit vector; these assignments are made by hashing. The descriptor of a document is the bitwise logical OR of the signatures of its terms. As we will see, queries to signature files sometimes respond that a term is present in a document when in fact the term is absent. Such false matches necessitate a three-valued query logic.

There are three main issues with respect to signature files: (1) generating signatures, (2) searching on signatures, and (3) query logic on signature files.
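As a small sketch of the first two issues (the signature width and the number of bits per term are illustrative parameters, not values recommended by any particular paper):

import java.util.*;

// Illustrative signature-file descriptor: each term hashes to a few bit
// positions, and a document's descriptor is the OR of its term signatures.
public class SignatureFile {
    private static final int WIDTH = 64;        // signature width in bits (illustrative)
    private static final int BITS_PER_TERM = 3; // bits set per term (illustrative)

    // Signature of a single term: a bit vector with a few pseudo-random bits set.
    static long termSignature(String term) {
        long sig = 0L;
        Random rnd = new Random(term.hashCode()); // seed by the term so the signature is repeatable
        for (int i = 0; i < BITS_PER_TERM; i++) {
            sig |= 1L << rnd.nextInt(WIDTH);
        }
        return sig;
    }

    // Descriptor of a document: bitwise OR of the signatures of its terms.
    static long documentDescriptor(Collection<String> terms) {
        long descriptor = 0L;
        for (String term : terms) {
            descriptor |= termSignature(term);
        }
        return descriptor;
    }

    // "Maybe present": all bits of the term's signature are set in the descriptor.
    static boolean maybeContains(long descriptor, String term) {
        long sig = termSignature(term);
        return (descriptor & sig) == sig;
    }
}

Because distinct terms can collide on the same bits, a positive answer from maybeContains only means the document is a candidate and must still be checked against the text, which is exactly where the false matches and the three-valued query logic come from.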

Page 18: Abstract - Gupta Lab€¦  · Web viewTo create an inverted index from the above structure, we visit each URL in the URL list, and then every word contained in the URL, if the word

5.2 Vector space models

Boolean queries are useful for detecting boolean combinations of the presence and absence of terms in documents. However, boolean queries never yield more information than a yes or no answer. In contrast, vector space models allow search engines to quantify the degree of similarity between a query and a set of documents. The uses of vector space models include:

Ranked keyword searches, in which the search engine generates a list of documents that are ranked according to their relevance to a query.

Relevance feedback, where the user specifies a query and the search engine returns a set of documents; the user then tells the search engine which documents among the set are relevant, and the search engine returns a new set of documents. This process continues until the user is satisfied.

Semantic indexing, a type of indexing in which search engines are able to return a set of documents whose "meaning" is similar to the meanings of the terms in a user's query.

In vector space models, documents are treated as vectors in which each term is a separate dimension. Queries are also modeled as vectors, typically 0-1 vectors. Vector space models are often used in conjunction with clustering to accelerate searches.
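As a small illustration of the vector-space idea (plain term-frequency weights and cosine similarity; real systems typically use tf-idf or similar weighting schemes):

import java.util.*;

// Illustrative vector-space similarity: documents and queries are term-frequency
// maps, and similarity is the cosine of the angle between the two vectors.
public class VectorSpace {
    static Map<String, Integer> termFrequencies(List<String> terms) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : terms) tf.merge(t, 1, Integer::sum);
        return tf;
    }

    static double cosineSimilarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double normA = 0.0, normB = 0.0;
        for (int v : a.values()) normA += (double) v * v;
        for (int v : b.values()) normB += (double) v * v;
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}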

5.3 Latent semantic indexing (LSI)

All of the methods we have explained so far to search a collection of documents have

matched words in users' queries to words in documents. These approaches all have two

drawbacks. First, since there are usually many ways to express a given concept, there

may be no document that matches the terms in a query even if there is a document that

matches the meaning of the query. Second, since a given word may mean many things, a

term in a query may retrieve irrelevant documents. In contrast, latent semantic indexing

allows users to retrieve information on the basis of the conceptual content or meaning of

a document. For example, the query automobile will pick up documents that do not

contain automobile, but that do contain car or perhaps driver.

Page 19: Abstract - Gupta Lab€¦  · Web viewTo create an inverted index from the above structure, we visit each URL in the URL list, and then every word contained in the URL, if the word

5.4 Inverted File Indexing

Inverted file indices are probably the most common method used for indexing

documents. Figure 3 shows the structure of an inverted file index. It consists first of a

lexicon with one entry for every term that appears in any document. We will discuss later

how the lexicon can be organized. For each item in the lexicon the inverted file index has

an inverted file entry (or posting list) that stores a list of pointers (also called postings) to

all occurrences of the term in the main text. Thus to find the documents with a given term

we need only look for the term in the lexicon and then grab its posting list. Boolean

queries involving more than one term can be answered by taking the intersection

(conjunction) or union (disjunction) of the corresponding posting lists.
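As a small illustration of answering a conjunctive (AND) query, assuming each posting list is kept as a sorted list of document numbers (as in the index built later in this report):

import java.util.*;

// Illustrative conjunctive (AND) query: intersect two sorted posting lists
// of document IDs with a linear merge.
public class PostingListOps {
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int da = a.get(i), db = b.get(j);
            if (da == db) { result.add(da); i++; j++; }
            else if (da < db) i++;   // advance the list with the smaller document ID
            else j++;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> quick = Arrays.asList(1, 3, 5, 7);
        List<Integer> brown = Arrays.asList(2, 3, 5, 8);
        System.out.println(intersect(quick, brown)); // [3, 5]
    }
}

A disjunctive (OR) query is answered with the analogous merge that keeps every document ID appearing in either list.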

We will consider the following important issues in implementing inverted file indices:

How to minimize the space taken by the posting lists?

How to access the lexicon efficiently and allow for prefix and wildcard queries?

How to take the union and intersection of posting lists efficiently?

Figure 3: Structure of Inverted Index


5.5 Inverted File Compression

The total size of the posting lists can be as large as the document data itself. In fact, if the granularity of the posting lists is such that each pointer points to the exact location of the term in the document, then we can in effect recreate the original documents from the lexicon and posting lists (i.e., they contain the same information). By compressing the posting lists we can reduce the total storage required by the index and, at the same time, potentially reduce access time, since fewer disk accesses will be required and/or the compressed lists can fit in faster memory. This has to be balanced against the fact that any compression of the lists is going to require on-the-fly decompression, which might increase access times. In this section we discuss compression techniques that are quite cheap to decompress on the fly. The key to compression is the observation that each posting list is an ascending sequence of integers (assume each document is indexed by an integer). The list can therefore be represented by an initial position followed by a list of gaps or deltas between adjacent locations.

For example:

original posting list: elephant: [3, 5, 20, 21, 23, 76, 77, 78]

posting list with deltas: elephant: [3, 2, 15, 1, 2, 53, 1, 1]
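A minimal sketch of this transformation and its inverse follows (the actual coding of the gaps with a Huffman or arithmetic code, as discussed next, is omitted):

import java.util.*;

// Illustrative gap (delta) encoding of a sorted posting list. The gaps would
// then be fed to an entropy coder (Huffman or arithmetic coding).
public class GapEncoding {
    static int[] toGaps(int[] postings) {
        int[] gaps = new int[postings.length];
        int previous = 0;
        for (int i = 0; i < postings.length; i++) {
            gaps[i] = postings[i] - previous;  // first gap is the initial position itself
            previous = postings[i];
        }
        return gaps;
    }

    static int[] fromGaps(int[] gaps) {
        int[] postings = new int[gaps.length];
        int running = 0;
        for (int i = 0; i < gaps.length; i++) {
            running += gaps[i];
            postings[i] = running;
        }
        return postings;
    }

    public static void main(String[] args) {
        int[] elephant = {3, 5, 20, 21, 23, 76, 77, 78};
        System.out.println(Arrays.toString(toGaps(elephant)));   // [3, 2, 15, 1, 2, 53, 1, 1]
        System.out.println(Arrays.toString(fromGaps(toGaps(elephant))));
    }
}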

The advantage of using the deltas is that they can usually be compressed much better than the indices themselves, since their entropy is lower.

deltas we need some model describing the probabilities of the deltas. Based on these

probabilities we can use a standard Huffman or Arithmetic coding to code the deltas in

each posting list. Models for the probabilities can be divided into global or local models

(whether the same probabilities are given to all lists or not) and into fixed or dynamic

(whether the probabilities are fixed independent of the data or whether they change based

on the data).


5.6 Representing and Accessing Lexicons

There are many ways to store the lexicon. Here we list some of them:

Sorted array -- just store the terms one after the other in a sorted array

Tries -- store the terms in a trie data structure

BTrees -- well suited for disk storage

Perfect hashing -- assuming the lexicon is fixed, a perfect hash function can be calculated

Front coding -- stores the terms sorted but does not repeat the front part of each term; requires much less space than a simple sorted array (see the sketch below)
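A minimal sketch of front coding over a sorted term list (each entry stores how many leading characters it shares with the previous term, plus the remaining suffix; the sample terms are purely illustrative):

import java.util.*;

// Illustrative front coding: since the lexicon is sorted, consecutive terms
// share prefixes; store only the shared-prefix length and the new suffix.
public class FrontCoding {
    // Encode the lexicon as pairs (sharedPrefixLength, suffix).
    static List<Map.Entry<Integer, String>> encode(List<String> sortedTerms) {
        List<Map.Entry<Integer, String>> out = new ArrayList<>();
        String previous = "";
        for (String term : sortedTerms) {
            int shared = 0;
            int limit = Math.min(previous.length(), term.length());
            while (shared < limit && previous.charAt(shared) == term.charAt(shared)) shared++;
            out.add(new AbstractMap.SimpleEntry<>(shared, term.substring(shared)));
            previous = term;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lexicon = Arrays.asList("jezebel", "jezer", "jezerit", "jeziah", "jeziel");
        System.out.println(encode(lexicon)); // prints [0=jezebel, 4=r, 5=it, 3=iah, 4=el]
    }
}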

When choosing among the methods one needs to consider both the space taken by the data structure and the access time. Another consideration is whether the structure allows for easy prefix queries (e.g., all terms that start with wux). Of the above methods, all except perfect hashing allow for easy prefix searching, since terms with the same prefix appear adjacently in the structure. Wildcard queries (e.g., w*x) can be handled in two ways. One way is to use n-grams, by which fragments of the terms are indexed (adding a level of indirection). Another way is to use a rotated lexicon.


6 IMPLEMENTATION

The main idea behind this project is to create indexes of keywords from web pages and store them in different data structures. We then measure the time and space costs of storing and searching for the keywords in these data structures. The project contains three main modules, the crawler module, the parser module, and the indexer module, as shown in figure 4.

Figure 4: Architecture of the implementation of the indexer module

[Figure 4 shows the pipeline: the crawler takes a URL and writes the list of URLs crawled from that website to out.txt; the HTML parser takes each URL and parses the HTML page to remove all HTML tags; the text parser gets the page contents from the HTML parser and retrieves the keywords; and the indexer gets the index of documents with keywords and occurrences and creates the inverted index, whose keywords are then indexed into the data structures under study.]

The crawler module is given a root URL and the number of levels that it needs to go down. In order to test our system with a changing number of keywords, we use the option of crawling to different depths to obtain a varying number of web pages. This module uses a breadth-first search approach to obtain URLs and visits these URLs to get the web page content. The crawler visits the root URL and looks for links on the web page at that URL. It stores the links found on that page in a queue and visits these pages in sequence. The crawler then outputs the visited URLs into a file, "out.txt". In traditional search engines, the web page content is also downloaded, compressed, and stored in a repository. Our crawler only gathers the URLs and does not store the web page content, as we have very few web pages and we are not implementing a query processor that would need the documents to return them to the user.

The next module is the parser module. The parser module reads the file output by the crawler containing all the URLs visited. It takes each URL and first processes the page to remove the HTML tags. This content is then text-parsed to extract all the words on that page. The extracted words are then processed further to remove stop words, and stemmed so that each word is saved as its root. Stop words are words that occur very often in the documents and do not assist in any way in discriminating one document from another. The Porter stemming algorithm is used to stem the ends of the words; we have not used any algorithm to remove prefixes from the words. These words are then stored in the forward indexing structure. The forward indexing structure stores the URL of the page visited, the list of keywords processed, and the number of times each keyword occurs in that document.

The indexer module takes the forward indexing structure and processes it to create an inverted index structure that stores the keywords along with the list of documents that each keyword occurs in. Ideally the location of the keyword in each document should also be stored, but we have ignored that aspect as we are not interested in displaying results to the user. Our main aim is to study the time taken to build this inverted index and the time it takes to search for keywords in this structure.
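The conversion step itself is simple; a minimal sketch, assuming the forward index is held as a map from URL to keyword counts (the names here are illustrative, not the exact ones in our code), is:

import java.util.*;

// Illustrative conversion from a forward index (URL -> keyword counts)
// to an inverted index (keyword -> list of URLs containing it).
public class Indexer {
    static Map<String, List<String>> invert(Map<String, Map<String, Integer>> forwardIndex) {
        Map<String, List<String>> inverted = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> page : forwardIndex.entrySet()) {
            String url = page.getKey();
            for (String keyword : page.getValue().keySet()) {
                inverted.computeIfAbsent(keyword, k -> new ArrayList<>()).add(url);
            }
        }
        return inverted;
    }
}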

We have used three indexing structures. The first structure is a simple sorted array which uses the keyword as the key for sorting; we use a binary search technique to find keywords in the array. The second index structure is a hash table, where again the keyword is used as the key for hashing into the table. The third data structure is the BTree. We have implemented a BTree that stores a minimum of 2 keywords in each of its nodes and a maximum of 4.

7 ANALYSIS AND RESULTS

We compared the efficiency of three different data structures with respect to the inverted file indexing algorithm. As explained in the implementation section, a crawler is used that can retrieve the links contained in web pages. We first use the crawler to retrieve a set of pages that shall be indexed. An HTML parser is given the list of URLs (web pages); it parses the files and retrieves only the strings in each web page. This set of strings from a web page is then passed on to a text parser that performs stemming and uses a stop list to remove words. The text parser then yields the set of words that shall be indexed using the inverted file indexing algorithm.

We can set the crawler to visit pages to varying depths, thus allowing us to vary the number of keywords that are indexed. We initiated the crawler with different starting URLs like http://www.cnn.com, http://www.nbc.com, etc. The depth used by the crawler is also varied so as to vary the number of keywords. With a depth of 1, the crawler generates the set of all URLs available on the home page of the website.

Once a set of keywords is generated, the number of keywords is calculated. We compare the performance of the different data structures based on the time needed to create the index. A tester program retrieves a set of keywords from a set of URLs and uses three different data structures (sorted array, hash table, and BTree) to create indexes using the inverted file indexing algorithm. In the following sections the results on the performance of each of the data structures are presented.

7.1 Inverted File Indexing Using Sorted Array

To create an inverted index using a sorted array, initially a forward index is created with the format shown in figure 5.


URL   | Words found on the web page
URL1  | word1, word2, ..., word n
URL2  | word1, word2, ..., word n
URL3  | ...
URL n | word1, word2, ..., word n

Figure 5. Forward index of URLs with the list of keywords that appear in them

To create an inverted index from the above structure, we visit each URL in the URL list and then every word contained in that URL. If the word has never been indexed, it is inserted into a sorted array and a pointer is placed from it to the list of URLs that contain this word. If the word is already present in the sorted array, then the URL list that it points to is updated to include the new URL. Eventually the inverted index has the structure shown in figure 6.

[Figure 6 shows a sorted array of words (quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party), each entry pointing to its postings list, i.e. a list of indexes into the URL list (for example, 4, 8 or 2, 4, 6).]


Figure 6: Inverted index using sorted array
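A minimal sketch of this insertion procedure (binary search for the slot, then either extend an existing postings list or shift elements to insert a new entry; the class and method names are illustrative, not our exact code):

import java.util.*;

// Illustrative sorted-array inverted index: keywords kept in sorted order,
// with a parallel list of postings (document IDs) for each keyword.
public class SortedArrayIndex {
    private final List<String> keywords = new ArrayList<>();
    private final List<List<Integer>> postings = new ArrayList<>();

    public void add(String keyword, int docId) {
        int pos = Collections.binarySearch(keywords, keyword); // O(log n) search
        if (pos >= 0) {
            postings.get(pos).add(docId);          // keyword seen before: extend its postings
        } else {
            int insertAt = -(pos + 1);             // convert to the insertion point
            keywords.add(insertAt, keyword);       // O(n) shift to keep the array sorted
            postings.add(insertAt, new ArrayList<>(List.of(docId)));
        }
    }

    public List<Integer> lookup(String keyword) {
        int pos = Collections.binarySearch(keywords, keyword);
        return pos >= 0 ? postings.get(pos) : Collections.emptyList();
    }
}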

The plot shown in figure 7 shows the performance of the sorted array structure when used

in the inverted file indexing algorithm. The X-axis represents the number of keywords

indexed, and the Y-axis represents the time required to create an inverted index based on

the sorted array data structure.


Figure 7: Plot of performance of the sorted array

As can be seen from the plot, the curve is close to linear. The time required to create the sorted array index is quite large because every time a keyword needs to be indexed, a binary search decides where the word should be placed, and the array elements after that position must then be shifted to make room; this must be done for every keyword and is quite an expensive operation. The other operation that needs to be performed is retrieving and updating the postings list for a keyword that has already been indexed, or creating and initializing a postings list for a new keyword.

7.2 Inverted File Indexing Using Hash Table

To create an inverted file index using a hash table, the Java class library was used. The class Hashtable implements a hash table, which maps keys to values. Here the keywords represent the keys and the values are the lists of URLs that contain each keyword. Any non-null object can be used as a key or as a value. The hash table is open: in the case of a "hash collision", a single bucket stores multiple entries, which must be searched sequentially. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. To create the inverted index, a forward index structure like the one in figure 5 is used for each URL. Each URL is visited sequentially and the words contained in it are read. Each keyword, which represents a key, is then hashed using the hashing function. The Java hashing function appears to satisfy the requirements of a good hash function. The structure of an inverted file index with a hash table is shown in figure 8.

Figure 8: Hash Table Implementation

Figure 9: Inverted index structure of the hash table

[Figure 8 shows the flow: key (keyword) -> hash code H(K) -> insert element at (hash code) = posting list. Figure 9 shows the keywords being indexed, the hash codes (0 to 16) generated by hashing, and the posting list stored at each slot.]
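A minimal sketch of this construction using java.util.Hashtable (not our exact code; the posting list here is simply a list of URLs):

import java.util.*;

// Illustrative inverted index built on java.util.Hashtable: the keyword is the
// key, and the value is the list of URLs (posting list) containing it.
public class HashTableIndex {
    private final Hashtable<String, List<String>> index = new Hashtable<>();

    public void add(String keyword, String url) {
        List<String> postings = index.get(keyword);   // hash the key and look up its bucket
        if (postings == null) {                       // new keyword: create its posting list
            postings = new ArrayList<>();
            index.put(keyword, postings);
        }
        postings.add(url);                            // existing keyword: extend the list
    }

    public List<String> lookup(String keyword) {
        return index.getOrDefault(keyword, Collections.emptyList());
    }
}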


The following plot shows the performance of the hash table when used in the inverted file

indexing algorithm. The X axis represents the number of keywords indexed, and the Y

axis represents the time required to create an inverted index based on the Hash table data

structure.

Figure 10: Plot of performance of the hash table

As can be seen from the plot, the curve is sublinear. The time required to create the hash table index is small because, as opposed to the sorted array, where every keyword to be indexed requires a binary search to decide where the word should be placed, the hash table only requires a simple hash function to compute the hash code, which serves as an index into an array where the posting list is stored. The other operation that needs to be performed is retrieving and updating the postings list for a keyword that has already been indexed, or creating and initializing a postings list for a new keyword. Because of the absence of the binary search operation, this is a very efficient data structure for creating the inverted index.


7.3 Inverted File Indexing Using BTrees

A BTree is a balanced tree data structure (all leaf nodes are at the same level) in which the keys within each node are kept in sorted order. Each node, except the root, stores a maximum of m keys and a minimum of m/2 keys. If a node has t keys in it, then it has t+1 children, which represent the ranges of values that its subtrees can store.

At each insertion and deletion, the tree is restructured so that it stays height balanced. There are two ways that the tree can be restructured. The first is to add the new keyword into the slot that it belongs in; if the node then exceeds its size limit, the node is split into two nodes and the middle keyword is passed up into the parent of the full node. This keeps going up to the root, so that none of the nodes are over-full and the end result is a balanced tree. This method makes two passes over the tree, as in an AVL tree, where the node is first added and then split if it is full. Another way of inserting a new key into the tree is to start at the root and keep splitting any node encountered on the way down that is already at its size limit, i.e. m keys. This way we traverse the tree only once while creating a place for the new key. The BTree class we have implemented uses this method. We have not concentrated on deleting keywords from the tree. In the index structure that we have created using the BTree, the keyword is used as the key on which the nodes are sorted. Each keyword has two lists associated with it: one list contains indexes to all the documents that contain that keyword, and the second list contains the occurrence list, i.e. the number of times the keyword occurs in each of those documents.

The index in the BTree is shown in the diagram. Each node contains a minimum of 2 keywords and a maximum of 4 keywords. We have not been able to show the associated document and occurrence lists for each of the keywords. The structure of the BTree is shown in figure 11. An example of a node entry would be:

keyword: document

urlList: [doc1, doc3, doc4]

occurList: [1, 2, 1]

This implies that the keyword "document" occurs once in doc1, twice in doc3, and once in doc4.
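A minimal sketch of this node and entry layout (the order and field names are illustrative, and the top-down split logic described above is omitted for brevity):

import java.util.*;

// Illustrative BTree node layout for the inverted index: each entry keeps a
// keyword, its list of documents, and the per-document occurrence counts.
public class BTreeIndex {
    static final int MAX_KEYS = 4;   // maximum keywords per node
    static final int MIN_KEYS = 2;   // minimum keywords per non-root node

    static class Entry {
        final String keyword;
        final List<String> urlList = new ArrayList<>();    // documents containing the keyword
        final List<Integer> occurList = new ArrayList<>(); // occurrences in each document
        Entry(String keyword) { this.keyword = keyword; }
    }

    static class Node {
        final List<Entry> entries = new ArrayList<>();   // kept sorted by keyword
        final List<Node> children = new ArrayList<>();   // entries.size() + 1 children when internal
        boolean isLeaf() { return children.isEmpty(); }
        boolean isFull() { return entries.size() == MAX_KEYS; }
    }
}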

Figure 11 Structure of BTree created

The following plot in figure 12 shows the performance of the BTree when used in the

inverted file indexing algorithm. The X-axis represents the number of keywords indexed,

and the Y-axis represents the time required to create an inverted index based on the

BTree data structure.

[Figure 11 shows a small BTree built over sample keywords such as doc, cours, alpha, cut, bat, abet, wet, util, zip, vest, and sit.]


Figure 12: Plot of performance of the BTree

As can be seen from the plot, the curve is sublinear. The time required to create the index using the BTree is governed by the per-keyword insertion cost, which is of the order of O(m log_m n), where m is the order of the BTree. The maximum depth of the BTree is always log_[m/2] n, where n is the total number of keywords that we have indexed, since each node has to have a minimum of m/2 keywords.

7.4 Comparative analysis of all three data structures

The following plot shows the performance of the hash table, sorted array and the BTree

when used in the inverted file indexing algorithm. The X-axis represents the number of

keywords indexed, and the Y-axis represents the time required to create an inverted index

when using each of the three data structures. Three different colors distinguish the lines corresponding to the hash table, the sorted array, and the BTree from one another.


Figure 13: Comparative plot of all three data structures

As is quite evident from the comparison plot, the hash table outperforms the sorted array and the BTree in terms of the time required to create the inverted index. This can be attributed to the fact that inserting a new keyword into the hash table index requires a minimal amount of computing time. Compare this to the binary search that the sorted array requires to find the correct place to insert a keyword: binary search repeatedly halves the search interval, beginning with an interval covering the whole array. If the search key is less than the item in the middle of the interval, the interval is narrowed to the lower half; otherwise it is narrowed to the upper half. This is repeated until the value is found or the interval becomes empty. Binary search runs in O(log N) time, where N is the size of the array, in this case the size of the lexicon in the index.
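A minimal sketch of this search for the insertion position in the sorted lexicon (the function name and sample lexicon are illustrative):

def insertion_index(sorted_keys, keyword):
    # Classic binary search: O(log N) comparisons to find where
    # keyword belongs in the ascending list sorted_keys.
    lo, hi = 0, len(sorted_keys)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_keys[mid] < keyword:
            lo = mid + 1          # keyword lies in the upper half
        else:
            hi = mid              # keyword lies in the lower half (or at mid)
    return lo

lexicon = ["abet", "bat", "cut", "wet"]
pos = insertion_index(lexicon, "doc")   # -> 3
lexicon.insert(pos, "doc")              # the shift itself costs a further O(N)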

The hash table complexity depends on the hash function and the collision resolution strategy, but in this case it is constant, O(1). Some open addressing schemes suffer from clustering more than others, so if we use a hashing function that minimizes collisions together with a good collision resolution strategy, outperforming the sorted array is straightforward.
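For comparison, a minimal sketch of the hash-based approach, using Python's built-in hash table; the names are illustrative, not those of our implementation.

from collections import defaultdict

index = defaultdict(dict)     # keyword -> {doc_id: occurrence count}

def add_occurrence(index, keyword, doc_id):
    # Expected O(1): one hash lookup to find the keyword's postings,
    # then a constant-time update of the count for this document.
    postings = index[keyword]
    postings[doc_id] = postings.get(doc_id, 0) + 1

for doc_id, words in [("doc1", ["document"]), ("doc3", ["document", "document"])]:
    for w in words:
        add_occurrence(index, w, doc_id)
# index["document"] == {"doc1": 1, "doc3": 2}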

A BTree is a balanced search tree in which every internal node has between ceiling(m/2) and m children, where m > 1 is a fixed integer. The root may have as few as 2 children, and the leaf nodes have no children. Inserting a keyword into a BTree has complexity O(m log_m n), where m is the order of the tree and n is the total number of keywords being indexed.

7.5 Search and retrieval efficiency of the data structures

Even though the project has focused on comparing the data structures in terms of the time required to create the index, the search and retrieval efficiency and the memory requirements of the data structures also warrant discussion. Searching in a hash table is O(1), i.e. constant time, and is extremely efficient.

Searching in a BTree is O(log_[m/2] n), where n is the total number of keywords indexed. The advantage of using the BTree is that it is balanced, so the height of the tree never exceeds log_[m/2] n regardless of the order in which keywords are inserted.
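A minimal sketch of this search, on a hand-built tree represented with plain dictionaries (illustrative only; our BTree class stores posting lists alongside each keyword):

def btree_search(node, keyword):
    # node is a dict {"keys": [... sorted ...], "children": [...]}; leaves have no children.
    # Visits at most log_[m/2] n nodes because the tree is balanced.
    keys = node["keys"]
    i = 0
    while i < len(keys) and keyword > keys[i]:
        i += 1
    if i < len(keys) and keys[i] == keyword:
        return node, i                      # found in this node
    if not node["children"]:
        return None                         # reached a leaf without finding it
    return btree_search(node["children"][i], keyword)

root = {"keys": ["cut"],
        "children": [{"keys": ["abet", "bat"], "children": []},
                     {"keys": ["doc", "wet"], "children": []}]}
print(btree_search(root, "doc") is not None)   # True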

Searching and retrieval on a sorted array requires O(log n) operations, since it relies on the binary search algorithm described above.

7.6 BTrees and external memory

The payoff of the BTree insert and delete rules is that BTrees are always "balanced". Searching an unbalanced tree may require traversing an arbitrary and unpredictable number of nodes and pointers. In a balanced tree, by contrast, all leaves are at the same depth, so there is no runaway pointer overhead. Indeed, even very large BTrees guarantee that only a small number of nodes must be retrieved to find a given key. For example, a BTree of 10,000,000 keys with 50 keys per node never needs to retrieve more than about five nodes to find any key (four if the root is kept in main memory).
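This figure can be checked with a short calculation; the helper below assumes fully packed nodes (50 keys and therefore 51 children each), which is the best case for the height.

def btree_min_levels(total_keys, keys_per_node):
    # Smallest number of levels needed when every node is packed full:
    # h fully packed levels hold (keys_per_node + 1)**h - 1 keys in total.
    branching = keys_per_node + 1
    levels = 1
    while branching ** levels - 1 < total_keys:
        levels += 1
    return levels

print(btree_min_levels(10_000_000, 50))   # -> 5, i.e. 4 disk reads if the root stays in memory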


This is a good structure if much of the tree is in slow memory (disk), since the height, and

hence the number of accesses, can be kept small, say one or two, by picking a large m.

BTrees are especially useful for search structures stored on disk. Disks have different

retrieval characteristics than internal memory (RAM).

Obviously, disk access is much, much slower. Furthermore, data is arranged in

concentric circles (called tracks) on each side of a disk “platter”. (Most disks these days

have a single platter, but some disks are a stack of platters.) A disk is read by read/write

heads mounted on an arm that is moved in and out from track to track. Moving that arm

takes time, so there is a real timing benefit to grouping data so that it can be read without

moving the arm. The amount of data that can be read without moving the arm (from both

sides of all platters) is called a cylinder. It's much faster to read an entire cylinder than to

read a little, move the arm, read a little more, move the arm, etc., even if the total amount

of data in a cylinder is much more than we need.

BTrees are a good match for on-disk storage and searching because we can choose the node size to match the cylinder size. In doing so, we store many keys in each node, making the tree flatter, so fewer node-to-node transitions, and hence fewer disk accesses, are needed.

8 FUTURE RESEARCH DIRECTIONS

This work has analyzed the performance of different data structures when used to build an index for text using the inverted file indexing algorithm. The metric used for the comparison was the time required to build the index. The ways in which this work could be expanded are as follows:

- Using other metrics to compare the data structures, for example the space footprint of the index or the time required to search for a keyword in the index.

- Analyzing the efficiency of data structures such as kd-trees and tries within the context of the inverted file indexing algorithm.

- Evaluating different text indexing algorithms such as signature files, LSI, and the vector space model; different metrics can be used for this analysis.

- Analysis of indexing algorithms for image and video retrieval.


- Text indexes are compressed to save space; analysis of compression algorithms such as Huffman coding, and of searching compressed indexes, is another interesting research topic.

9 CONCLUSION

We achieved the goals that we had set for this project. We have gained a sound understanding of search engine technology and of information retrieval techniques, particularly text indexing. We have studied in depth the inverted file indexing algorithm and related data structures such as hash tables, BTrees, and sorted arrays. From the performance analysis of the inverted file indexing algorithm and these data structures, we can conclude that efficient algorithms and data structures are the key to efficient search engines. Google's PageRank algorithm, which revolutionized search engine technology, also bears testimony to this fact. This report also enumerates directions for future work based on this project.

