
CS401 PROJECT - web.mst.edu/~madrias/cs401-01/CS401PROJECT.doc

CS401 PROJECT

Ranking in WHOWEDA

Project Members

Ritesh Sagi

Srikanth Bolledula

Mohammed Abdul Baseer


Ranking in WHOWEDA

Abstract:

The rate of growth of data on the WWW is more than exponential, and a huge amount of data on any single topic is available on the web. Almost all of today's search engines are based on keyword search, and the result sets they return are huge. The user is not interested in looking at all of the search results; he would like to find the data or document of interest within the top few results. Hence, there is a need to rank the search results. Keyword search techniques, however, give the user little freedom to specify his constraints. WHOWEDA is a data model that searches for the documents of interest to the user based on a query graph. The results obtained from the query-graph search satisfy the constraints specified in the query graph, and each result resembles the others in its structure. The results are called web tuples and are stored in a web table. This paper discusses the issues involved in ranking these web tuples and presents an algorithm to rank them. It also includes a prototype implementation of one of the several cases of ranking.

Introduction:

The World Wide Web has become one of the fastest growing applications on the Internet. It provides a powerful and easy-to-set-up medium for almost any user on the Internet to disseminate information. More and more information has become available online through the WWW, from personal data to scientific reports to up-to-the-minute satellite images. The information growth rate is so alarming that it has become very difficult to find or search for the information that is relevant. This information explosion leads to a problem commonly known as the resource discovery problem. In order to find interesting WWW pages, a user has to browse through many WWW sites, which is a very time-consuming process.

Currently, information on the web is discovered primarily by two mechanisms: browsers and search engines. Existing search engines such as Yahoo, AltaVista and Google service millions of queries a day, yet it is clear that they are less than ideal for retrieving an ever-growing body of information on the web. Whenever a user searches for some required data, the search engine provides a list of URLs that contain relevant results. The results may number in the thousands, and the user has neither the patience nor the time to look into each of them.


Thus there is a growing need to rank web search results. Almost all of the search engines existing today perform their search using a few important keywords. This method of search requires the user to know keywords that might be present in the document of his interest, and the user does not always have such knowledge. He might instead want to pose an SQL-like query with additional constraints on the results of the search: for example, he might be interested only in search results obtained from a particular website, or in finding web pages that have some kind of relationship between them, and so on. Present-day search engines are not capable of such a search. The Warehouse for Web Data model tries to fulfill these needs. It allows the user to pose an SQL-style query and also to place extra constraints on the links between pages of his interest. The complete details of the WHOWEDA model are beyond the scope of this paper; we introduce only the basic concepts of the data model.

Review of other related work:

Because of the growing demand for ranking search results, a lot of work has been done and various algorithms have been developed to tackle this problem. Information retrieval techniques such as the Boolean retrieval model, the vector space model and the probabilistic retrieval model have existed for a long time. Latent semantic indexing (LSI), a somewhat different approach, addresses the drawbacks of these information retrieval techniques. Boolean retrieval is the simplest of these retrieval methods and relies on the use of Boolean operators: the terms in a query are linked together with AND, OR and NOT [1]. This method is often used in search engines on the Internet because it is fast and can therefore be used online. The vector space model can be divided into three stages. The first stage is document indexing, where content-bearing terms are extracted from the document text. The second stage is the weighting of the indexed terms to enhance retrieval of documents relevant to the user. The last stage ranks the documents with respect to the query according to a similarity measure. The vector space model has been criticized for being ad hoc [2]. In the probabilistic retrieval method [3], the probability that a specific document will be judged relevant to a specific query is based on the assumption that the terms are distributed differently in relevant and non-relevant documents. The probability formula is usually derived from Bayes' theorem [4]. Extensions of the probabilistic retrieval model incorporate relationships between the document descriptors [5,6].
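The three stages of the vector space model can be sketched in a few lines of Python. This is a minimal illustration on made-up documents; the TF-IDF weighting and cosine similarity used here are one common choice, not the only one:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Stage 1 (indexing) and stage 2 (weighting): term counts scaled by IDF."""
    tfs = [Counter(d.lower().split()) for d in docs]
    df = Counter(t for tf in tfs for t in tf)            # document frequency
    idf = {t: math.log(len(docs) / df[t]) for t in df}
    return [{t: f * idf[t] for t, f in tf.items()} for tf in tfs]

def cosine(u, v):
    """Stage 3 (ranking): cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    nu, nv = norm(u), norm(v)
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["ranking web search results", "ranking soccer teams", "web data warehouse"]
vecs = tf_idf_vectors(docs)
query = {"web": 1.0, "ranking": 1.0}                     # raw-count query vector
order = sorted(range(len(docs)), key=lambda i: cosine(query, vecs[i]), reverse=True)
```

The first document, which contains both query terms, comes out on top of the ordering.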

These retrieval methods suffer from two well-known language-related problems called synonymy and polysemy [7]. Synonymy means that an object can be referred to in many ways, i.e., people use different words to search for the same object; an example is the pair of words car and automobile. Polysemy is the problem of words having more than one meaning; an example is the word jaguar, which can mean a well-known make of car or an animal. Latent Semantic Indexing (LSI) [7] offers a dampening or weakening of synonymy. By applying a Singular Value Decomposition (SVD) to a term-by-document matrix of term frequencies, the dimension of the transformed space is reduced by selecting the highest singular values, where most of the variance of the original space lies. Using the SVD, the major associative patterns are extracted from the document space and the small patterns are ignored. Query terms can also be transformed into this subspace, where they can lie close to documents in which the terms do not appear. The advantage of LSI is that it is fully automatic and does not require language expertise; a positive side effect is that the document vectors become much shorter. All of the searching and ranking techniques discussed above base their search on keyword input from the user. As discussed earlier, the user may not always be able to provide keywords that appear in the document of his interest, and he may want to specify more constraints on his query than keywords alone.
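The SVD truncation described above can be sketched with NumPy on a toy term-by-document matrix (the terms, documents and counts are invented for illustration). After keeping only the largest singular values, the term car acquires a positive weight in the automobile document even though it never occurs there:

```python
import numpy as np

# Rows = terms (car, automobile, engine, jaguar, animal); columns = documents.
A = np.array([
    [1., 0., 1., 0.],   # car
    [0., 1., 1., 0.],   # automobile
    [1., 1., 1., 0.],   # engine
    [0., 0., 0., 1.],   # jaguar
    [0., 0., 0., 1.],   # animal
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                     # keep the k highest singular values
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A[0, 1] is 0 (car never occurs in document 1), but Ak[0, 1] > 0:
# the car/automobile association is surfaced by the truncated SVD.
```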

The most recent work in this field has been done by the Web Warehousing group at Nanyang Technological University, which investigates issues in web databases and web information mining. The Whoweda project aims to design and implement a warehousing capability for an organization that materializes and manages useful information from the World Wide Web (WWW) in order to support strategic decision-making. It aims to build a data warehouse containing strategic information derived from the WWW that may also interoperate with conventional data warehouses. Some interesting work on ranking web tuples (search results, in WHOWEDA terminology) has been done by Saurov Bhowmick in the paper titled "Ranking of web pages using a global ranking operator", which uses WHOWEDA's concept of global coupling [9].

WHOWEDA model:

Although most users obtain WWW information using a combination of search engines and browsers, these two types of retrieval mechanism do not necessarily address all of a user's information needs. Present-day search engines fail to fulfill user needs because they are purely resource locators, with no capability to reliably suggest the contents of the websites they return in response to a query. Furthermore, the task of information retrieval still burdens the user, who has to manually sift through 'potential' sites to discover the relevant information. The presence of mirror sites can also make the task of finding the document of interest tedious.

To overcome the limitations of search engines and provide the user with a powerful and friendly query mechanism for accessing information on the web, the critical problem is to find effective ways to build web data models of the information of interest, and to provide a mechanism to manipulate this information to derive additional useful information. Until now, knowledge discovery on the WWW has been limited to mining path-traversal patterns by analyzing server access logs (web log mining) and to extracting semi-structured information from HTML documents. This leaves much to be desired in deriving interesting, non-explicit patterns from the web information base.

Whoweda (Warehouse of Web Data), a project started by the Web Warehousing group at Nanyang Technological University, tries to address this field. The objective of the group is to design and implement a web warehouse that materializes and manages useful information from the web to support strategic decision-making. It aims to build a web warehouse containing strategic information coupled from the web that may also interoperate with conventional data warehouses.

Architecture overview of Whoweda

Whoweda is a metadata repository of useful, relevant web information, available for querying and analysis. As relevant information becomes available on the WWW, it is coupled from various sources, translated into a common web data model (the Web Information Coupling Model), and integrated with existing data in Whoweda. At the warehouse, queries can be answered and web data analysis can be performed quickly and efficiently, since the information is directly available. Accessing data at the warehouse does not incur the costs that may be associated with accessing data from information sources scattered at different geographical locations, and in a web warehouse the data remains available even when the WWW sources are inaccessible. It is not possible to discuss the complete WHOWEDA model in this paper, as we are concerned with ranking the search results obtained by WHOWEDA.

The model consists of a collection of web objects known as nodes and links. Nodes represent web pages and links represent the hyperlinks connecting these web pages. A node is characterized by the attributes {url, title, format, size, date, text}, and a link is characterized by the attributes {source-url, target-url, label, link-type}. The format attribute of a node represents the file type of the node, which can be HTML, Postscript, PDF, etc. The link-type indicates whether the link between pages is an interior, local or exterior link. A web tuple is a connected, directed graph consisting of nodes and links. A collection of web tuples described by a web schema is called a web table.
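The node and link attributes above map naturally onto record types. A minimal Python sketch (the field types are our guesses; the model itself does not prescribe them):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Node:
    """A web page, with the WHOWEDA node attributes."""
    url: str
    title: str
    format: str        # e.g. "HTML", "Postscript", "PDF"
    size: int          # in bytes
    date: str          # date of last modification
    text: str          # textual contents

@dataclass
class Link:
    """A hyperlink between two pages."""
    source_url: str
    target_url: str
    label: str
    link_type: str     # "interior", "local", or "exterior"

@dataclass
class WebTuple:
    """A connected, directed graph of Node and Link instances."""
    nodes: List[Node]
    links: List[Link]

# A web table is then simply a list of WebTuple objects sharing one schema.
```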

An example web schema written as a file is shown in Fig 1, and the same schema is represented graphically in Fig 2.

[nodes]
'a' URL Equals "http://www.umr.edu/"
'b' title contains "computer science"
'c' title contains "madria"

[links]
'e' label contains "department"
'f' label contains "faculty"

Fig 1. An example web schema

a ("http://www.umr.edu") --e ("department")--> b ("computer science") --f ("faculty")--> c ("madria")

Fig 2. Graphical representation of the web schema


The types of links present between the nodes (web pages) are:

Interior Link: a link that points to a different portion of the same web page; the URL of the source and the target documents is the same.

Local link: the target URL is on the same server as the source URL.

Exterior Link: the target URL is completely external to the source URL; in other words, the target URL is not on the same server as the source URL.

Formally, a web tuple tw is a triplet,

tw = (Nw, Lw, Ew)

Nw is a set of nodes,

Lw is a set of links,

Ew is the set of connectivities in web tuple tw.

Formally, a web schema M is represented by,

M = (Xn, Xl, C, P)

Xn – set of node variables

Xl – set of link variables

C – set of connectivities

P – set of predicates

The connectivities and predicates of the WHOWEDA model are beyond the scope of this paper.

The query from the user resembles the web schema shown in Fig 2 and is called the query graph. In Fig 2, a constraint on the starting node specifies that its URL should be http://www.umr.edu; the link from that page, labelled "department", points to a web page containing "computer science" in its title, and so on. As can be observed, the user is given more flexibility to specify constraints in order to obtain results of his interest. The results obtained from this query graph are stored in a web table as web tuples. In this project we are not concerned with searching for the results of the query graph; we assume we are provided with the results of a search that match the query graph, and we are concerned only with ranking those results.


Problem Statement and Analysis:

The number of possible matches for a search query increases with time as more and more documents are put on the web. The user's ability to look at documents does not scale up in this fashion: people are still only willing to look at the first few tens of results. It is necessary to extract the "most relevant" web tuples for the user. Our notion of "relevant" web tuples includes only the very best tuples (hot tuples), since there may be a large number of "slightly relevant" tuples. It is necessary to rank the web tuples based on user-specified conditions; this enables the user to browse the web tables efficiently. There is a subtle difference between ranking and searching. We adopt an index-and-ranking (IR) system for ranking web tuples.

Indexing is the first task an index-ranking (IR) system has to accomplish; a document can only be searched and retrieved if it has been indexed into a database. Each word (index term) is represented by a set of coordinates that describe where the word is located: in which document, sentence, paragraph, heading, and so on. The resolution of the index controls whether retrieved pointers point to whole documents, to subsections within documents, or to exact word locations.
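A minimal inverted index with positional coordinates can be sketched in Python at word-level resolution (document id and word offset; finer coordinates such as sentence or paragraph would extend the tuple):

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the coordinates (doc_id, position) where it occurs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return index

docs = ["ranking of web tuples", "web tables store web tuples"]
index = build_index(docs)
# index["web"] is [(0, 2), (1, 0), (1, 3)]
```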

Ultimately, IR systems analyze queries to see whether the words in the queries are in the indexes. If they are, the system retrieves the matching documents and ranks or scores them according to various scoring methods and algorithms. The documents that receive the highest score represent the best match, and the system presents them at the top of the list of retrieved documents.

As specified earlier, in our project we are concerned only with ranking the web tuples returned by a query; it is assumed that they have been obtained from an index server that performs the indexing described above. Indexing in an IR system is performed according to the following model:

[Figure: the IR indexing model. A user client issues a query to an index server; queries may be redirected among index servers, each of which indexes a set of WWW sites, and results flow back to the client.]


The various factors to be considered while ranking the web pages are:

1. The existence of the query terms in the document or web page.
2. The number of occurrences of the query terms in the document.
3. The total size of the document.
4. The frequency of occurrence of the query terms in the document.
5. The number of links between pages containing the relevant document, i.e., whether a document is referred to by, or refers to, some other page.
6. The importance of the place of occurrence of the query terms: in the URL, in the title, or inside the web page.
7. The same object being referred to in different ways by different people, i.e., synonyms of the query words.
8. Duplicated web pages, i.e., replicas of the same web page.
9. Domain restriction, i.e., the document is to be found within a particular domain of servers.

Ranking web pages so that the user finds his data of interest in minimum time has been a recent and growing area of interest in the web warehousing environment. Ranking is performed using a scoring function.

Scoring Function: A scoring function assigns weights to the components of a query based on the factors specified above. Each factor assigns a weight to a web page, and the resulting score, used for ranking, is the sum of these weights. The values of the weights differ among document types and languages. When weighting, words that are indexed or used in a query are assigned a numerical value, usually based on the frequency of the words in the stored documents.

Ranking models can be probabilistic. By using information about word frequency, IR systems can assign a probability of relevance to all the documents that can be retrieved; the documents are then ranked according to this probability. Ranking is very useful because document databases tend to be very large, and large result sets can be daunting if not sorted.


Approach/Model:

The ranking algorithm uses a scoring function to assign ranks to the different web pages resulting from the query search, and sorts the web pages in decreasing order of their ranks. The user may supply different criteria for ranking the pages; we consider some of them below.

Advanced Search Criteria:

a. Standard Condition:

A user may be interested in a subset of the original results, as search engines provide a very broad category of results. For example, if a user wants to search for SATURN in senses other than the planet, the query can be written as:

saturn - planet

Assumption: if the rank of a page turns out to be zero, the page is eliminated.

Let k1, k2, ... be the keywords the user wants to be present and a1, a2, ... be the keywords the user wants to be avoided; the condition is then specified as

k1, k2 - a1, a2, a3

Once the tuples are obtained, the ranks need to be defined, so the rank of a web page u is

R(u) = c · f(k1) · f(k2) · ... · q(a1) · q(a2) · q(a3) · ...

where c is a constant, f(k1) is the frequency of the word k1 in the document, and q(a1) is a boolean function that helps eliminate pages containing the keywords to be avoided:

q(a1) = 1 if a1 does not occur in the document,
q(a1) = 0 if a1 occurs once or more in the document.

Thus if any of the keywords a1, a2, ... occurs even once, the rank becomes zero.
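This rank function can be sketched directly. Note that the multiplicative form also zeroes the rank of a page missing any wanted keyword, consistent with the elimination assumption above (the sample texts are invented):

```python
def rank_standard(text, wanted, avoided, c=1.0):
    """R(u) = c * f(k1) * f(k2) * ... * q(a1) * q(a2) * ..."""
    words = text.lower().split()
    rank = c
    for k in wanted:
        rank *= words.count(k.lower())          # f(k): frequency of k
    for a in avoided:
        rank *= 0 if a.lower() in words else 1  # q(a): boolean eliminator
    return rank

# "saturn - planet": keep pages about Saturn that never mention the planet.
r1 = rank_standard("saturn the car saturn dealers", ["saturn"], ["planet"])  # 2.0
r2 = rank_standard("saturn is the sixth planet", ["saturn"], ["planet"])     # 0.0
```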

b. Location Criteria:

A user may also specify the position of the keywords he is looking for, such as occurrences in the title, in the URL, or inside the document. If a user enters a query containing the keywords k1, k2, ..., km, we define the following quantities:

Wtitle: the weight assigned to a keyword present in the title of the document
Wtext: the weight assigned to a keyword present in the text of the document
Wurl: the weight assigned to a keyword present in the URL of the document
Ftitle(k): the frequency of the keyword k in the title of the document
Ftext(k): the frequency of the keyword k in the text of the document
Furl(k): the frequency of the keyword k in the URL of the document

The rank can then be approximately assigned, summed over the keywords kn, as

Rank = Wtitle(kn) · Ftitle(kn) + Wurl(kn) · Furl(kn) + Wtext(kn) · Ftext(kn)

Wtitle(kn), Wurl(kn) and Wtext(kn) are the constant weights assigned to occurrences of the keyword kn in the title, URL and text of the document respectively. We assign these values such that

Wurl > Wtitle > Wtext.

This is the default criterion we use in ranking the web pages. If a user is interested in criteria other than the default, we can change the default by changing the weights. For example, if a user is interested only in keywords occurring in the titles of pages, then

Wurl(kn) = 0; Wtext(kn) = 0;

and the rank is calculated automatically from the pages that have the keywords in the title; the page with the highest number of keyword occurrences in the title gets the highest rank.
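A sketch of the location criterion in Python. The pages are plain dicts, and the specific default weight values are arbitrary; only their ordering Wurl > Wtitle > Wtext follows the text:

```python
def rank_location(page, keywords, w_url=3.0, w_title=2.0, w_text=1.0):
    """Sum of W_loc * F_loc(k) over the keywords, for url, title and text."""
    score = 0.0
    for k in keywords:
        k = k.lower()
        score += w_url * page["url"].lower().count(k)              # Furl(k)
        score += w_title * page["title"].lower().split().count(k)  # Ftitle(k)
        score += w_text * page["text"].lower().split().count(k)    # Ftext(k)
    return score

page = {"url": "http://espn.com/soccer",
        "title": "soccer index",
        "text": "soccer news and more soccer"}
full = rank_location(page, ["soccer"])        # 3*1 + 2*1 + 1*2 = 7.0
# Restrict to title-only matches by zeroing the other weights:
title_only = rank_location(page, ["soccer"], w_url=0.0, w_text=0.0)
```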

c. Domain Restriction:

The user may be interested in looking only at web pages from a particular server. For example, suppose the user is interested only in UMR-related web pages; in this case the node variable of interest is the url. Assuming the keyword is k1,

Rank = c · Wurl(k1)

where Wurl is a weight assigned as a function of the frequency of the keyword in the url:

Wurl = 0 if the keyword is not present in the url,
Wurl = w(f1) if the keyword is present,

where f1 is the frequency, i.e., the number of times the keyword appears in the url.
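The domain restriction can be sketched the same way; w(f1) is taken here to be simply proportional to the frequency, which is our choice, as the function w is left unspecified above:

```python
def rank_domain(url, domain_keyword, c=1.0):
    """Rank = c * Wurl(k1); Wurl is 0 when the keyword is absent from the
    url, and grows with its frequency f1 otherwise."""
    f1 = url.lower().count(domain_keyword.lower())
    return c * f1                      # 0.0 when the keyword never appears

pages = ["http://www.umr.edu/cs/",
         "http://espn.com/",
         "http://www.umr.edu/~madrias/umr.html"]
ranked = sorted(pages, key=lambda u: rank_domain(u, "umr"), reverse=True)
```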

Web Tuple, Web Table and Query Graph:

A web tuple is a directed, connected graph, and a web table is a collection of web tuples. In the graphical representation, boxes and directed lines correspond to node and link variables. Keywords express the content of the web documents or the labels of the hyperlinks between the web documents. The web warehouse model consists of a hierarchy of web objects: nodes correspond to HTML or plain-text documents, and links correspond to the hyperlinks between web pages. These objects consist of a set of attributes:

Node = [url, title, format, size, date, text]

Link = [source-url, target-url, label, link-type]

For the Node object, the attributes are the URL of a node instance and its title, document format, size (in bytes), date of last modification and textual contents. For the Link object, the attributes are the URLs of the source and target documents, the anchor or label of the link, and the type of the link.

A web tuple is thus a directed, connected graph consisting of a set of nodes and links which are instances of Node and Link respectively.

A query graph is defined as a 5-tuple G = (Xn, Xl, C, P, Q) where

Xn is a set of node variables,
Xl is a set of link variables,
C is a set of connectivities,
P is a set of predicates over the node and link variables, and
Q is a set of predicates on the complete query graph.

Example: The following is an example of a query graph; the global web coupling operator can be applied to retrieve the sets of related documents that match it. Observe that some nodes and links have keywords imposed on them. These keywords express the content of the web documents or the labels of the hyperlinks between the web documents.


The query graph for this example:

a ("news") --sports--> b --soccer--> c

Example of web tuples:

Tuple 1:
a: url cnn.com, title "cnn.com", news: 5 times
b: url sportsillustrated.cnn.com, title "cnnsi.com from cnn and sportsillustrated", sports: 11 times
c: url sportsillustrated.cnn.com/sports/soccer, title "cnnsi.com from cnn and sportsillustrated", soccer: 7 times

Tuple 2:
a: url abcnews.com, title "abcnews.com" (home page), news: 11 times
b: url abcnews.go.com/section/sports/, title "abcnews.com sports index", sports: 4 times
c: url espn.go.com/soccer/index.html, title "espn.com - soccer - index", soccer: 3 times

Tuple 3:
a: url foxnews.com, title "foxnews.com", news: 7 times
b: url foxsports.com, title "foxsports.com", sports: 11 times
(more sports) url foxsports.com/more/home/index.cfm…, title "foxsports.com more sports"
c: url foxsports.com/mls/home/index.cfm, title "foxsports.com – soccer", soccer: 3 times

Tuple 4:
a: url msnbc.com, title "msnbc" (cover page), news: 8 times
b: url msnbc.com/news/spt_front.asp, title "msnbc sports", sports: 5 times
(more sports) url msnbc.com/news/xmoresports-front.asp, title "other sports front page…"
c: url msnbc.com/news/561402.asp, title "Mia Hamm: selling her dream" (link label: "soccer star helping launch new womens pro league"), soccer: 9 times

Tuple 5:
a: url usatoday.com/news/nfront.htm, title "usatoday.com", news: 12 times
b: url usatoday.com/sports/sfront.htm, title "usatoday.com", sports: 10 times
c: url usatoday.com/sports/soccer/sos.htm, title "usatoday.com", soccer: 4 times

Tuple 6:
a: url nytimes.com, title "new york times on the web", news: 13 times
b: url nytimes.com/pages/sports/index/html, title "the new york times sports", sports: 9 times
c: url nytimes.com/pages/sports/soccer/index.html, title "the new york times soccer", soccer: 11 times

The most relevant tuples must be ranked highest, where "most relevant" means the tuples the user finds most interesting.

Ranking Operator and Ranking Function:

Let ρ denote the ranking operator, W the web table consisting of web tuples, and WR the ranked set of web tuples:

WR = ρ(W, condition(s))

The operator takes the web table to be ranked, together with user-specified node variables and keyword conditions on which ranking is based. The output is a ranked set of web tuples.
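A sketch of the operator's shape in Python, with tuples as lists of page dicts and a keyword-frequency condition standing in for the full scoring function (both simplifications are ours):

```python
def rho(web_table, score):
    """Ranking operator: web table in, tuples sorted by descending rank out."""
    return sorted(web_table, key=score, reverse=True)

def keyword_condition(keyword):
    """Condition: total frequency of the keyword across a tuple's nodes."""
    return lambda tup: sum(p["text"].lower().split().count(keyword) for p in tup)

web_table = [
    [{"text": "soccer news"}, {"text": "more sports"}],             # score 1
    [{"text": "soccer scores soccer tables"}, {"text": "soccer"}],  # score 3
]
ranked = rho(web_table, keyword_condition("soccer"))
```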


Specification of the Ranking Condition, Type 1:

The user specifies a keyword condition for a single node variable. The keywords specified by the user can be present in the url, the title or the text of the web page. The query can be expressed as

RANK web_table_name
WHERE (node_variable, keyword)

The WHERE clause is specified by the user, and ranking is performed based on these conditions. Assume the user enters the keywords k1, k2, ..., km and consider a web page u. With Wtitle, Wurl and the frequency functions as explained previously, the rank can be approximately assigned as

Rank(u) = Wtitle(kn) · Ftitle(kn) + Wurl(kn) · Furl(kn) + Wtext(kn) · Ftext(kn)

For easier terminology, we use the weight factor w1 to denote Rank(u).

Wtitle is a constant weight depending on the user's options. Suppose the user selects an option to consider only pages that have the keyword kn in the title; assuming ε1 and ε2 are small compared to Wtitle, we set

Wurl(kn) = ε1; Wtext(kn) = ε2;

This is done so that if two tuples receive the same rank, these small weights help decide the rank by looking at the url and the text. The rank is then calculated automatically, with the pages having the highest number of keyword occurrences in the title getting the highest rank.

If two or more web tuples have nodes with the same number of keyword occurrences, the following conditions are taken into account to rank them: web pages in which the keywords appear in the url or the title are considered more accurate than others.


Since we are looking at only one node and its conditions, the concept of incoming links does not arise here: the user specifies only a single node variable and the keyword condition imposed on it, so we do not care about the hyperlinks and their labels.

Specification of the Ranking Condition, Type 2:

Here the user specifies a start node variable and an end node variable; he may also enter a set of keywords with conditions on which nodes they should appear in and where. The primary idea is to calculate the ranks of all the nodes between the start and end nodes, inclusive, and sum them up to get the total rank of the web tuple. The query can be expressed as

RANK web_table_name
WHERE (start_node_variable, end_node_variable, keyword)

Unlike the first condition, which was the simplest case, we now have to take into consideration the interrelationship between the different nodes as depicted by the query graph. We therefore assign four scores to each node and sum them up to get the actual rank of the node; these four scores also take into account the interconnection between the tuples, so that the most relevant tuples are brought up in the ranking.


Algorithm :

For (i = 1 to m)

{

Webtuple(i) = weight(start node) + weight(next node) + ... + weight(end node);

// Webtuple(i) is the ranking score of the ith web tuple.

}

// m is the number of web tuples in the web table.

Rank the web tuples in descending order of ranking score.

The weight of a single node is calculated as follows:

Weight(node) = w1 + w2 + w3 + w4;
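The loop above can be sketched in Python. This is a minimal illustration, not the project's ASP implementation; the node representation and all names are assumptions, and the factors w1 to w4 are assumed to be computed elsewhere.

```python
# Minimal sketch of the tuple-ranking algorithm: each web tuple's score is
# the sum of its node weights, and tuples are ranked in descending order.

def node_weight(w):
    """Weight(node) = w1 + w2 + w3 + w4."""
    return w["w1"] + w["w2"] + w["w3"] + w["w4"]

def rank_web_tuples(web_table):
    """Return (tuple_index, score) pairs sorted by descending score."""
    scored = [(i, sum(node_weight(w) for w in nodes))
              for i, nodes in enumerate(web_table)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

For example, a table with two tuples whose node weights sum to 1 and 4 is returned with the second tuple first.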

Weight w1 has been explained previously; we proceed to explain the other three weight factors.

Weight W2 :

It should be pointed out that the ranking scores are assigned in such a way that the web tuple that satisfies the user query graph, has the highest occurrences of the keywords in the locations specified by the user, and follows the shortest path satisfying the query graph will get the highest rank.

That means the weights based on the various other factors make a difference only when two web tuples get almost the same ranking score.

W2 = some constant * f(number of occurrences of the keyword specified for the previous node in the current node),

where f is a function of that frequency (or, for labelled links, of the frequency of the label on the incoming link). An example will make this clear.


Supposing the user query graph is:

[Query graph figure: nodes a → b → c → d, with the keywords "Panacea.org" on node a, "drug list" on node b, "issues" on node c, and a link labelled "side effects".]

For convenience, we consider only part of the query graph. Here the user wants a web page containing the keyword "drug list" followed by a web page containing the keyword "issues".

It is clear that the user is looking for issues related to the drug list. But if we look carefully at web tuple 2, it is quite possible that its second page addresses other health issues not necessarily related to drug lists. This is because node c is unbound, in the sense that it is not defined precisely by any predicate. In the case of the first web tuple, however, the second page contains the "drug list" keyword in addition to the "issues" keyword.

[Figure: two example web tuples. Tuple 1: a page beginning "The drug list for the disease myopia ..." linking to a page beginning "Issues concerning the drugs in the drug list ...". Tuple 2: a page beginning "Coronary failure requires the following drugs specified in the drug list" linking to a page beginning "The primary health issues to be concerned about ...".]

We can conclude that the first web tuple more accurately reflects the information the user is looking for, so web tuple 1 will be ranked higher.
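As a hedged sketch (not from the report), W2 for this example can be taken as simply proportional to the number of occurrences of the previous node's keyword in the current node's text; the constant C2 and the plain-string text representation are illustrative assumptions.

```python
# Illustrative sketch of W2: count how often the keyword specified for the
# previous node occurs in the current node's text. C2 is an assumed constant.

C2 = 0.5

def weight_w2(prev_keyword, current_text):
    """W2 = C2 * (occurrences of the previous node's keyword here)."""
    return C2 * current_text.lower().count(prev_keyword.lower())
```

For the example above, the second page of tuple 1 ("Issues concerning the drugs in the drug list") gets a positive W2 for "drug list", while the second page of tuple 2 gets zero.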

Weight W3 :

The query graph does not explicitly say anything about the number of nodes to be traversed to reach the end node variable. But when a user specifies a query graph and a set of keyword conditions for the nodes, it is clear that the user does not want to follow a long path, in which case he loses interest and may move on to other pages.

In one of the previous examples, the query graph was:

[Query graph figure: nodes a → b → c, with the keywords "news" on node a, "sports" on node b, and "soccer" on node c.]

The user is interested in news: he wants to browse through the sports page and find some news on soccer. As you can observe from the web tuples, cnn has a direct link to soccer from its sports page, whereas the foxnews and msnbc pages do not; soccer is listed on a "more sports" page, from which there is a link to the soccer page. This clearly indicates that these pages give less importance to soccer, so they should appear lower in the ranked set of tuples than those with direct links.


Let Pq be the least path length satisfying all the keyword conditions given in the query graph, and let Pw be the actual path length, i.e. the number of nodes traversed to reach the end node from the start node.

W3 = some constant * f(path difference),

where f is a function of the path difference and can be defined as

f = 1 if Pw = Pq

f = 1/(Pw - Pq)**2 if Pw > Pq

This pulls longer paths down in the ranked set of web tuples.
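The piecewise f above can be written down directly; the constant C3 is an illustrative assumption.

```python
# W3 penalises tuples whose actual path length Pw exceeds the shortest
# qualifying length Pq: f = 1 when Pw == Pq, else 1/(Pw - Pq)**2.

C3 = 1.0  # assumed tuning constant

def weight_w3(pq, pw):
    if pw == pq:
        return C3 * 1.0
    return C3 / (pw - pq) ** 2
```

A tuple two nodes longer than the shortest qualifying path thus gets a quarter of the full W3 score.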

Weight W4:

This factor primarily takes into account the importance of the links to the next nodes. For example, in the previous query graph the user specifies the keyword "side effects". Suppose he explicitly specifies that the link label must contain this keyword. If two tuples both contain "side effects" in their labels, the one that points to a node whose URL also contains the keyword will be ranked higher, because a web page with "side effects" in its URL is more likely to describe the side effects accurately than a page without the keyword in its URL.

W4 = some constant * (frequency of the keyword in the URL of the next node);

If we look at the sports-related tuples, foxnews does not have the keyword explicitly in its URL, but pages such as cnn.com and nytimes.com have the keyword "soccer" in their URLs, which gives them a definite advantage over the other pages.
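W4 can be sketched along the same lines; the constant C4 and the plain substring count over the URL are assumptions for illustration.

```python
# Illustrative sketch of W4: frequency of the user's keyword in the URL of
# the node that the current link points to. C4 is an assumed constant.

C4 = 0.8

def weight_w4(keyword, next_url):
    """W4 = C4 * (frequency of the keyword in the next node's URL)."""
    return C4 * next_url.lower().count(keyword.lower())
```

A link into http://www.cnn.com/sports/soccer/ scores positively for the keyword "soccer", while a URL without the keyword scores zero.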


Weight W5 :

This factor takes into account the fact that if many web tuples go through a particular path, that piece of path could be quite important, so all the tuples passing through it should be given higher importance.

But a question arises: if there are two or more such paths, which should be given more importance? In that case we look at the query graph and the keywords, and give the paths relative importance based on the location and frequency of the keywords.

W5 = some constant * (relative weight of a particular path)

The weight of a particular path can be calculated using the first weight factor.
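A rough sketch of W5 follows, under the simplifying assumption that a tuple's "path" is its whole sequence of node ids and that the relative weight is just the fraction of tuples sharing that path; the report's own definition (via the first weight factor) is richer, and C5 is an assumed constant.

```python
from collections import Counter

# Simplified sketch of W5: tuples whose path is shared by many other tuples
# in the web table score higher. C5 and the path representation are assumed.

C5 = 1.0

def w5_scores(paths):
    """paths: one node-id sequence per web tuple. Returns a W5 per tuple."""
    counts = Counter(tuple(p) for p in paths)
    m = len(paths)
    return [C5 * counts[tuple(p)] / m for p in paths]
```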

Specification of Ranking condition :

Type 3 : In this case the user specifies only the depth of the tuple: he would like to browse through a specified number of pages starting from a particular node, and the tuples that interest the user most within that depth must be ranked high.

The query can be expressed as:

RANK web_table_name

WHERE (start_node_variable, depth, keyword)

In this case we do not take the weight W3 into account: because the user explicitly specifies the depth, it is not necessary to consider the path length required to traverse the path.

Algorithm :


For (i = 1 to m)

{

Endnode = startnode + depth;

Webtuple(i) = weight(start node) + weight(next node) + ... + weight(end node);

// Webtuple(i) is the ranking score of the ith web tuple.

}

// m is the number of web tuples in the web table.

Rank the web tuples in descending order of ranking score.

The weight of a single node is calculated as follows:

Weight(node) = w1 + w2 + w4 + w5;

Repetitive Paths :

In some web tuples a particular path may form a repetitive loop; this is taken care of by the weight factor W5. One such example is shown below:

[Figure: two web tuples. Tuple 1: a "sanjay madria" page linking to a "webdb" page at http://www.informatik.uni-trier.de/~ley/db/conf/webdb/index.html. Tuple 2: a "webdb" page linking to a "sanjay madria" page at http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/m/Madria:Sanjay_Kumar.html.]


Suppose the first graph is the user query graph. The user does not specify the end node variable, only the depth, with the condition that the first page must contain the keyword "madria" and must point to some page containing the keyword "webdb".

If we look carefully at the path, there is a link to webdb from Dr. Madria's home page, and as we browse through the webdb pages, one of them has a link back to Dr. Madria's home page: a repetitive pattern of the user query graph. This repetitiveness obviously indicates a strong correlation between the web pages.

Specification of System Defined Condition :

In this case there are no explicit criteria for the location of the keywords, so the default values of the weights must be taken into account; there are only some keyword conditions based on which the system has to rank the tuples.

The query can be expressed as:

RANK web_table_name

WHERE (startnode specification, keyword conditions)

Weight(node) = w1 + w2 + w3 + w4 + w5;

w1 = Wtitle(kn)**Ftitle(kn) + Wurl(kn)**Furl(kn) + Wtext(kn)**Ftext(kn) + Wolink(kn)**Folink(kn)

Wolink indicates the weight assigned to a keyword present in an outgoing link. It generally has a small value, as a keyword in an outgoing link does not accurately describe the page it appears in.

The weights can then be selected as

Wurl > Wtitle > Wtext > Wolink

In the default case, keywords in the URL are given the most importance, followed by the title and then the text of the document.
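Assuming the "**" in the w1 formula denotes weight multiplied by frequency (the .doc conversion makes the operator ambiguous), the default case can be sketched as follows; the numeric weights respect Wurl > Wtitle > Wtext > Wolink but are otherwise illustrative.

```python
# Sketch of the default w1 computation: sum of (location weight * keyword
# frequency in that location). Numeric defaults are illustrative only.

DEFAULT_WEIGHTS = {"url": 4.0, "title": 3.0, "text": 2.0, "olink": 1.0}

def weight_w1(freqs, weights=DEFAULT_WEIGHTS):
    """freqs maps 'url', 'title', 'text', 'olink' to keyword frequencies."""
    return sum(weights[loc] * freqs.get(loc, 0) for loc in weights)
```

A keyword appearing once in the URL and twice in the title would contribute 4.0 + 6.0 = 10.0 under these assumed weights.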


All other factors remain the same in this case.

Conclusions:

We have discussed the ranking of web tuples containing single web pages separately from that of tuples containing a group of related web pages; that is, the ranking of web pages as distinct from the ranking of web graphs. We expect the algorithms and their implementation to provide a basis for future improvements, such as merging both ranking cases into one whole, and to lay a foundation for better ranking algorithms. Here we implement already existing algorithms in different situations and examine how they actually work.

Implementation:

The implementation of the ranking algorithm involves the storage structure for the nodes and the links, as well as the choice of language and platform. We took a dummy example consisting of the query graph given above and stored the results obtained from it. All the nodes in the resultant web table are stored in one table and all the links in another. We maintained a third table of (TupleID, NodeID) pairs to represent the relationship between the nodes connected to each other by links in each web tuple obtained from the search results. We stored these tables in MS Access and then implemented the default-user ranking of the web table in ASP.


The user-defined ranking was not implemented; it is similar to the default-user ranking, with only a few additional conditions.




GLOSSARY :

WHOWEDA : Warehouse of web data

Node : A web page is considered as a node variable

Link : A hyperlink from one page to another

Query Graph : A coupling framework specified by the user

Unbound Node : A node which is not defined by any predicate

Web Tuple : It is a connected directed graph which satisfies a given query graph

Web Table: A set of web tuples is materialized in a table called the web table.

Web Schema : schema that contains meta information that binds a set of web tuples in a web

table.

Expressions :

Wtitle(K) : Weight assigned for a keyword present in the title .

Wurl(K) : Weight assigned for a keyword present in the url .

Wtext(K) : Weight assigned for a keyword present in the text .

furl(K) : frequency of a keyword k in the url

ftitle(K) : frequency of a keyword k in the title

ftext(K) : frequency of a keyword k in the text

Pq : the least path length satisfying all the keyword conditions given in the query graph

Pw : the actual path length, i.e. the number of nodes traversed to reach the end node from the start node
