file · web viewntroduction. information ... word frequencies in individual documents and...
TRANSCRIPT
A Method to Improve Effectiveness of Information Retrieval using
Link Information
Yong Kim*, Won-Kyun Joo**
*Department of Library & Information Science, Chonbuk National University, 567 Barkje-daero, Deokjin-gu, Jeonju-si, Jeollabukdo, South Korea**Department of R&D System Development, KISTI, 245 Daehak-ro, Yuseong-gu, Daejeon, South Korea (Corresponding Author)
Abstract
A scheme for using various types of hyperlinks for the purpose of improving information retrieval
effectiveness. The proposed scheme is based on the distinction between incoming and outgoing links by
directionality as well as the distinction between direct and indirect links by directness of links. various types of
hyperlinks have different impact on enhancement of retrieval effectiveness and that the improvement can be as
much as 44.8% in 11-point average precision when right combinations of weights are given to the different link
types The proposed method serves as a useful reference for electronic libraries or IR system considering the
improvement of IR service.
Keywords: Boolean Model, Information Retrieval, Directionality and directness of links, Hyperlink, Role of
Link, Retrieval effectiveness.
I. INTRODUCTIONInformation retrieval has dealt with a collection of documents whose contents are predominantly texts, linear
sequence of characters, words, and sentences. With the growing popularity of World-Wide-Web (www) where
texts are inter-linked in a complex way to form a hypertext, however, it has become a common practice to
access texts in a non-sequential way by means of navigation or browsing. Few search engines on the web
employ new methods, whereas, the others do traditional information retrieval techniques that work under the
assumption that documents are independent units. Word frequencies in individual documents and in the entire
collection are the basis for determining the degree of association between a word and a document.
In this paper, we take a stance that the meaning of a document is not just restricted to the sequential text itself
but can be affected by and derived from its relationship with others. We hypothesize that when a source text (or
a node in hypertext) is linked to a destination text by means of an anchor in the source and a link pointing to the
destination, they each reveal the content of the other to some extent. Besides, hypertext can enhance notions of
relevance and hence the performance of retrieval heuristics (Frisse, 1998). In most cases, hyperlinks encode a
considerable amount of latent human judgment. In fact, the creators of web pages create link to other pages,
usually with an idea in mind that the linked pages are relevant to the linking pages. Therefore, a hyperlink, if
reasonably created, reflects the human semantic judgment and this judgment is objective and independent of the
synonymy and polysemy of the words in the pages. This latent semantics, once revealed, could be used to find
deeper relationships among the pages, as well as to find the relevant pages for a given page (Hou, 2003).
The main thrust of this paper is to present a way in which this link information can be used for retrieval
purposes and show that this information in fact improves retrieval effectiveness and reveals the role of each type
of links defined in continuing sections. This paper presents a generalization and an extension of these works.
To accomplish these goals, this paper focuses on the followings.
0) Classification of Links: Links in hypertext have many different characteristics and can be associated with
many different meanings. Depending on the nature of the objects at the source and destination points,
first of all, links can connect a document with another document or a term with a document. On the other
hand, the directionality of a link can determine whether it is an outgoing or incoming link with respect to
a given text. Although not common, it is also possible to build a one-to-many link so that more than one
destination can be associated with the source. Furthermore, an indirect link between two objects can be
defined when they are pointed to by a single source, or when they point to a single destination. Finally,
an explicit meaning such as ’example-of’ or ’definition-of’, or a numerical weight can be attached along
a link to indicate its strength.
1) Investigation on the roles of link types: While the long-term goal here is to investigate the roles of
various types of links and their properties in improving retrieval effectiveness, the current work has a
focus on directionality and the distinction between direct and indirect links.
This paper gives suggestion on ways to make use of such link information in computing the retrieval status
value (RSV) or similarity value between a document and query. Results from the experiments carried out to
test the value and impact of different link information is also reported herein.
II. RELATED WORKSResearch on the usage of hypertext links in information retrieval started from the study on scientific citations.
The most well-known measure in this field is Garfield’s impact factor (Garfield, 1972), which is a ranking
measure based fundamentally on pure counting of the in-degrees of nodes in the network. Pinski and Narin
(1976) proposed a more subtle citation-based measure of standing, stemming from the observation that not all
citations are equally important.
Hypertexts allows for browsing, an effective way of acquiring information from a structured information
space and of finding unexpected items in serendipity. However, since browsing is not an efficient means with a
large search space, researchers have attempted to combine browsing and searching functions in a single retrieval
environment. Lucarella (1990), for example, proposed a model for retrieval of hypertexts and its
implementation. More recently, researchers have attempted to integrate browsing and searching for hypermedia
documents. Croft and Turtle (1989) developed a method by which hypertext links are regarded as evidence in
answering a query. Savoy (1996) suggested a scheme where a vector space model was extended to include
citation links to improve overall retrieval effectiveness. Brin and Page (1998) have proposed a ranking measure
based on a node-to-node weight propagation scheme and its analysis via eigenvectors. In order to measure the
relative importance of web pages, they proposed PageRank, a method for computing a ranking for every web
page based on the graph of the web. It is the first application of these approaches to www search. Kleinberg
(1999), developed a model of the web as Hubs and Authorities, based on eigenvector calculation on the co-
citation matrix of the web.
As the amount of information on the World Wide Web (WWW) increases rapidly, search engines such as
Google, Yahoo, MS Live Search and so on are becoming powerful tools for retrieving information on the web.
Among the typically huge amount of relevant web pages satisfying a query, users only want to browse the most
important ones (Bar-Ilan et al, 2006). Thus, web search engines need to use effective ranking methods for
ordering the search results according to their “importance” (Arasu et al, 2001). There have been a number of
methods proposed to effectively rank the relevant web pages (Devanshu, 2002). These methods are classified
into two categories: content-based methods and link-based methods. Content-based methods exploit the
similarity between contents of web pages and given queries whereas link-based methods exploit the link
structure of web pages. Currently, link-based methods are prevalent in prominent web search engines such as
Google, Yahoo, and MS Live Search (Yoshida, 2008). Among link-based methods, PageRank, which was first
introduced by Google, has been known as one of the most popular method. As PageRank is an important
component for ranking web pages in Google and other web search engines (Berkhin, 2005), many variations of
the PageRank algorithm have been proposed in order to improve the ranking quality (Haveliwala, 2002).
However, it is unclear which variations (or their combinations) provide the “best” ranked results since existing
research neither explicitly explains which ranking algorithm are used (Can et al., 2004) nor provides sufficient
experimental results for comparing the ranking quality among those variations (Yi et al., 2006). As one of the
several studies, Eiron et al. (2004) proposed four algorithms ― jump-weighting, push-back, self-loop, and
BHITS ― in order to demote the ranks of bad pages. The jump-weighting algorithm in particular demotes the
rank of a bad page by distributing the PageRank value of a dangling page to all non-dangling pages in
proportion to the fraction of the number of its good links over its out degree. Here, bad links are the links
pointing to the dangling pages that cannot be crawled because they are not available or not downloadable.
Acharyya and Ghosh (2004) also proposed three algorithms ― dangling link estimation, naive link estimation,
and clustered link estimation ― in order to estimate the outlink distributions from dangling pages to non-
dangling pages. The dangling link estimation algorithm exploits the link information of dangling pages
estimated.
Meanwhile, some studies focused on investigating influence between web sites according to the frequency of
inlinks and outlinks. Adekannbi (2012) revealed that African universities have more interest in linking to the
world class universities as analyzed web links to investigate interrelationship between top ten African
universities and world universities based on the frequency and percentage distributions of outlinks and inlinks.
Islam and Alam (2011) revealed that some private universities in Bangladesh have higher number of web pages
but their link pages are very small in number, thus the websites fall behind in their Overall WIF, self link,
external links and Absolute WIF. The study showed that these universities did not have much impact factor on
the web and were not known internationally as examined in the 44 private university websites in Bangladesh. It
also identified the number of web pages and link pages, and calculated the Overall Web Impact Factor (WIF)
and Absolute Web Impact Factor (WIF).
III. DEFINITIONS ON LINK TYPES
3.1. Directionality and Directness of Links
Among many different types of link characteristics mentioned above, only two aspects are focused upon in
this section: directionality and directness of links. It appears on general hypertext document.
Directionality: This notion explains the relation between two documents actually connected with links. For
directionality, a distinction is made between outgoing and incoming links. When there is a link from document
A to B, it serves as an outgoing link for A and as an incoming link for B. These links can be expressed as out-
degree and in-degree in graph theory. Outgoing and incoming links described below in Figure 1 are all examples
of a direct link.
Directness: In this case, an indirect link that is assumed to exist between documents A and B is defined when
they each have an outgoing link to a common document C, or when they each have an incoming link from the
same document C. So, indirect link is not an actual connection but a virtual connection between two documents.
The definition of indirect links is similar to co-citing and co-cited reference (Savoy 1996) in the science
citations, but the usage of links differ. Description on the directness is explained on the right side of Figure 1.
From the two points of view, thus, we derive three basic types of link: incoming link, outgoing link and
indirect link. Based on these basic link types, we can define more link types applied to some circumstance. More
specified link types will be shown in the following section.
Figure 1. Directionality and directness of links
3.2. Considering Search Environment
Considering search environment constructed by query, documents, links and sets, the types of direct links
including incoming links and outgoing links can be more specified. For this, two additional distinctions are
made on links according to the relations of the above four objects; 1) Whether the anchor terms on the links
have appeared on query terms or not. 2) The location of a current link node in the set. Figure 2 illustrates such a
specified environment.
Figure 2. The links considering search environment
In Figure 2 above, the rectangles and arrowed lines each represent documents and links. Thus, there are 5
documents and 10 kinds of links. Q and N represents query and non-query term anchors respectively. All links
are attached with circled number indicating link types.
Query term link vs. Non-Query term link: The links in a document can be separated into query anchor and
non-query anchor. If all terms in the anchor appeared on query terms, it is referred to as query term link,
otherwise, it would be a non-query term link. For example, document A has one query term link and one non-
query term link.
Internal vs. External: There are 3 kinds of sets in the search environment. The initial set consists of search
results for a query, which are the results obtained by basic search methods like vector model (Witten et al.,
1999; Frakes & Baeza-Yates, 1997). The extended set is a growing set of the initial set using outgoing links
circled 1 and 5. Lastly, the total set consists of a group of all the documents. For example, there are 2, 4, 5
documents for each of the three sets. For these sets, the initial set was redefined to internal set while the
extended set to external set which have documents except internal ones.
Finally, eight different types of links were derived based on the two classes above. The description is shown
in Table I below. Each case matches with the circled number in Figure 2. The extended set is the basic unit for
the re-ranking process. Therefore, case 9 and 10 is out of our consideration hence excluded.
Table 1. Eight different types of links considered
Source Page Internal Internal External ExternalTarget Page External Internal Internal External
Query Term Type
Query Case1 Case2 Case3 Case4Non-Query Case5 Case6 Case7 Case8
IV. ALGORITHM FOR LINK BASED RETRIEVAL
The overall algorithm for link-based retrieval works as follows.
1. A set of documents (internal documents) is retrieved using a method based on the vector space model to
form an initial retrieval set.
2. By using link information, the set is expanded to include additional documents (external documents),
forming an extended set.
3. The documents in the extended set are re-evaluated for their relevancy based on the link information that
exists among the documents.
The purpose of the second step is to enhance recall by retrieving those documents that would not be retrieved
by the given query but are linked with the initial retrieval set. The new external documents as well as the
internal documents are re-evaluated at the third step, where the locations of the source and the destination of the
links between any two documents in the extended set are taken into account. Table I shows 8 different types of
links, depending on whether the source of a link is from a query term or from a non-query term and on where
the source and the destination of the link are located. For example, case 3 refers to a link whose source is a
query term in an external document, and whose destination is an internal document. It should be noted that an
external document may contain a query term when its rank is lower than the cut-off point and therefore excluded
from the initial set. Each of the three steps is explained in detail in the following subsections.
4.1. The First Step: Construction Of The Initial Retrieval Set
This step is basically the same as the ordinary process of retrieving documents using a vector space model.
We employed the following formula to calculate a weight wij for a term j appearing in document I:
(1)
Where nt is the total frequency of a term normalized by the maximum term frequency in the document, nidf is
the normalized inverse document frequency for the term, and n is the total number of documents in the
collection. The retrieval status value (RSV) for document Di and query Q is defined to be a cosine similarity ㅍ
value:
(2)
Where t means the total term frequency in document Di, Wd means the specific term weight in document Di,
and Wq means the specific term weight of the query term Q. |Di| and |Q| is defined as the length of the
document and query respectively.
When this step is completed, the documents that contain at least one of the query terms are listed above a
certain cut-off point.
4.2. The Second Step: Determination Of The Extended Set
This step handles cases 1 and 5 in Table I. The documents in the initial retrieval set are examined for their
association with those outside the set, via an outgoing link from either a query or non-query term. Each
document connected to a link from a term in one of the internal documents becomes an element in the extended
set. This process resembles a form of blind relevance feedback, where top-ranked documents in the initial
retrieval set are combined with the original query to form an extended query, in the sense that additional
documents are retrieved based on their connection to some of the initially retrieved documents. The difference
lies in the fact that human-generated connections (i.e. links) are used to retrieve additional documents as
opposed to word occurrences in initially retrieved documents, which are the basis for adding new terms in the
extended query.
Since external documents do not ordinarily contain query terms, it is not possible to calculate RSV for them
by simply applying the formula as in (2). The strategy used in assigning RSV for an external document is to let it
inherit the RSV from the source document containing the anchor term of the link, reflecting the similarity
between them. In other words, it is computed by first calculating the similarity between the two (i.e. an external
document and the corresponding internal document) and propagating the RSV of the internal document i
accordingly based on the similarity measure.
More formally, the RSV for the external More formally:
(3)
Where 0<= Sim(Din, Dex) <= 1. When there are more than one incoming links to the external document, the
RSV’s for individual links is calculated using the formula in (3) and the maximum value chosen. It is also
possible to combine them with the Dempster-Shafer combination rule (Rich & Knight, 1991) under the
assumption that the more the incoming links from the initial retrieval set, the more likely the external document
would satisfy the query (i.e. higher RSV). Since this aspect of accumulating evidence is incorporated when the
final RSV is calculated as described below, we take a conservative approach of giving a RSV that is not larger
than that of the most relevant document among those that point to the external document at hand.
4.3. The Third Step: Incorporation Of Link Information
After generating all the candidate documents for the extended set, the next step is to re-rank them using all
the link information across documents in the set. Not only can new documents (i.e. external documents) be
inserted at various places in the list of retrieved internal documents, but also the original ranks of the internal
documents can change with newly calculated RSV’s. From this step on, the distinction between internal and
external documents is no longer necessary. The basic scheme is to make the documents influence each other
when they are connected with links. It is conjectured that the more links candidate documents share, the more
likely they form a coherent body of documents for the given query since the candidates are supposed to be more
or less relevant to the query in the first place. Since different link types are likely to have different impact on the
bondage between documents (Pinski, 1976), they have been analyzed separately based on direct and indirect
links.
4.3.1. Impact By Direct Links
A direct link can be further classified based on the directionality (i.e. whether a link is an incoming or
outgoing link) and whether the anchor is a query term or non-query term. Considering the four cases, the
formula (4) below is used to calculate the effect of direct links on a given extended document D. Here the four
terms represent the following, respectively: 1) the effect of outgoing links starting from query terms in D, 2)
outgoing links starting from non-query terms in D, 3) incoming links from query terms in other documents, and
4) incoming links from non-query terms in other documents. They are combined with the Dempster-Shafer
combination rule represented by the operator.
Evidence Edl for an extended document ei by direct links is defined as follows:
(4)
with:
Where t are the parameters indicating the strength or the importance of the four types of links, and r, s,
t, and w are the numbers of the four different types of links starting from or pointing to D, respectively. Also, k,
l, m, and n are the documents corresponding to the four terms represented above.
It should be noted that the symbol does not have the usual meaning of summation but the meaning of the
operation based on Dempster-Shafer combination rule. This not only ensures that the value of each term
never exceeds 1, but also reflects the intuition that the link information provides additional evidence that the
document satisfies the query, which should be gathered in a principled way. The new RSV value for Considering
both the direct and indirect
w
nnin
t
mmim
s
llil
r
kkik
QDRSVE
QDRSVE
QDRSVE
QDRSVE
1
1
1
1
,
,
,
,
(5)
extended document ei can be computed in the following way:
4.3.2. Impact By Indirect Links
When two documents A and B have links to the same document, an association between them is assumed.
Likewise, to a document that has separate links to the two documents (see Figure 1). In other words, a bi-
directional indirect link can be established between two documents A and B by a common destination from the
links starting from them, or by a common source of two (or more) links starting from another document C,
which point to A and B, respectively. In order to compute the strength of the indirect link between Di and Dj, two
aspects are taken into account. The first is to do with how many links leave from Di and Dj separately and how
many of them point to the same destination Dk. The second aspect is about how many links arrive in Di and Dj
starting from Dk at the same time. The evidence Eil for an extended document ei by indirect links can be
calculated as follows:calculated as follows:
(6)
with:
Where gives the number of links. Li and Lj reflects each link starting from document i and j. Lij indicates a
pair of two links leaving from Di and Dj and arriving at the same destination document k. Indirect links(Mij)
created by a document k pointing to Di and Dj can be calculated in a similar way. M is the same as L except that
it arrives at Di and Dj instead of leaving from them.
(7)
Considering both the direct and indirect effect from links, the final RSV for D is calculated as follows:
It should be noted that while the parameter 4 is explicitly shown in this formula, other parameters are
hidden in Edl(Dei) as in the formula (4). The exact values for t ’s determined by experiments.
V. EXPERIEMENTS AND EVALUATIONS
5.1. Test Collections
The re-ranking scheme described above was implemented and a series of experiments run to understand the
effects of using link information in document retrieval. There are a myriad of hypertext documents available in
the World Wide Web, which contain interesting links, but it is not possible to use them for the experiments
because there are no relevant judgments for them. Although the CACM collection could be used by treating
references as links to other documents as in (Savoy, 1996), they would probably not serve the purpose of this
research because the nature of the links in CACM are quite different from typical hypertext links.
The choice made here was a test collection developed by ETRI (Electronics and Telecommunications
Research Institute) in Korea, which consists of encyclopedia documents containing a large number of hypertext
links published by Kyemong Publishing Co. The collection cosists of 23,113 documents (9.4 MB) in Korean,
which are basically explanations of titles such as ”comet” and ”gene”, and 46 natural language queries. There
are many hypertext links whose anchor words are the titles of other documents. In other words, following a link
by clicking on an anchor word will lead the user to a new document whose title is the same as the anchor word.
Considering that one of the main purposes of this paper is to improve retrieval performance by taking advantage
of link information, a survey on link status for data sets is strongly recommended. The general status of link
information in KYEMONG data set is shown in Table 2 below.
Table 2. Different link status information by link occurrence
Information on minimum, maximum, whole and average link frequency per document can be obtained from
Table II above. The frequencies follow the way like this (outgoing link incominglink≪indirect link). In
direct link, the difference of frequency between minimum and maximum value, relatively, is not larger than
indirect link.
Link frequency by document id and document frequency by each link frequency will be shown for direct and
indirect link in Figure 3, 4 and 5.
Figure 3. Link frequency by documents and document frequency by link frequency for outgoing links
After arranging each document by link frequency in an increasing order, a maximum link frequency is
obtained from among all documents corresponding to high percentage and the value plotted. For example, it is 6
at 40% in outgoing links, as will be seen in Figure 6.
By capturing link information for data set through the above Figures and Tables, roughly, the conclusions below
can be made;
- In the frequency level, the direct links have a lot of resemblances (see Table 2). In the case of direct link,
there is a similar link frequency distribution for about 90% of documents, but that of indirect link vary from
0 to about 800,000 (see Figure 6).
- When we take into account that the size of average document is 411KB, the number of indirect link is so
high. It may cause the opposite effects.
- There’s no relation on frequency in the three types of links. For example, the document id having
maximum frequency for outgoing link is 5018 named title ”korea”, which has the incoming frequency of
66 much lower than maximum value 1,918, same as in the case of indirect link.
Figure 4. Link frequency by documents and document frequency by link frequency for in-coming links
Figure 5. Link frequency by documents and document frequency by link frequency for in-direct links
Figure 6. Maximum link frequency for the top percentage of whole document
5.2. Experiments And Results
Getting better search performance by using link information is simply not the final purpose of this experiment
rather, it is finding the best parameter values for the algorithm and showing its superiority. The final goal was to
demonstrate that utilization of newly identified 5-types link information improves retrieval effectiveness and
understand the roles of 5 different types of links considered in the retrieval scheme.
In order to understand the roles played by each of the link types, different numbers for the five parameters
were plugged in and the 11-point average precision measured. The performance of the information retrieval can
be divided into recall and precision. You can use a wide variety of assessment methods depending on your point
of view. 11-point average precision is used for a global evaluation of the entire test documents. It was calculated
by two steps, thus measuring the precision based on the recall to each document and computing an average
value of the entire documents set. This method is widely used to evaluate the general-purpose information
retrieval performance in a reasonable way.
In Table 3, the results are summarized for different numbers of link types used in formula (4), (7).
α t Here’s indicates different link types considered: outgoing links from query and non-query terms,
incoming links from query and non-query terms, and indirect links. The first group of rows from the upper part
of the Table shows a case where only one of the link types was used. For example, the very first row shows the
11-poin taverage precision when only outgoing links from a query term are considered in re-ranking documents.
The values shown in the Table (e.g.0.2 in the first case) are the ones with the best performance in the particular
case. For example, when only α0 and α1 were given a non-zero value(i.e. when only outgoing links were
considered), the combination of 0.6 and 0.8 gave the best results among all the other combinations of values
considered between 0.1 and 1.0 with an interval 0.1.
As can be seen in Table III, the effectiveness increased as the right combination of link types was
progressively added. When only one link type was considered, which is the incoming link from a query term,
the 11-point average precision was 0.4591, but when another link type (outgoing links from a query term) was
added, the value increased to 0.5034. The best value in each category increased to 0.5086, 0.5123, and 0.5126,
showing steady improvements. This indicates that when the right parameter values are used, the five link types
together contribute to enhancement of retrieval effectiveness. However, it was found that the extent of its
influence is different.
A close examination of the results indicates that different contribution is made by different link types,
however. Incoming links from a query term (signified by a non-zero value for 2) are shown to have the strongest
impact in all the five groups. This seems to be due to the general characteristics of the links indicating that the
target page is important. Outgoing links from a query term are the next most important type. This can be seen
most clearly in the second group, where the best score was obtained when α0 was given a non-zero value in
addition to α2. These results exemplify that both link characteristics and query words have a great influence on
the search result.
In the other groups with three, four, and five non-zero parameters used, the value for α0 is the second biggest.
Similarly, we can see the next one is incoming links with non-query terms. In summary, links leaving from a
query term are more important than those leaving from a non-query term, and incoming links are more
important than outgoing links. That is, it is much safer to increase RSV for a document that is pointed to by
links whose anchor is a query term. Indirect links seem to have the least impact on retrieval effectiveness,
because of having a lot of links as explained in Table 2, which gives misleading effects on ranking value.
Table 3. Comparisons among contributions made by different link types
There are some idiosyncratic aspects in these experiments steps. Because of the characteristics of the
hypertexts in the encyclopedia collection, a link leaving from a query term seldom retrieves a new external
document. This is because most of them should have been retrieved by the title word that is the same as the
query word. The only documents that have been left out would be those below the cutoff point (30 in the
experiments). This observation indicates that most of the improvement we obtained seems to stem from the re-
ranking scheme rather than from newly retrieved documents. Nonetheless, many cases were observed where
some relevant documents that would never be retrieved by query terms only were actually retrieved because of
the link information.
To get the results for the baseline, KRISTAL-2002 retrieval system using morphological analyzer for Korean
text retrieval is used. This system developed by KISTI(Korea Institute of Science and Technology Information)
isb ased on the vector space model and includes most of the well-tested weighting schemes. The search
performance of KRISTAL is proven through man yresearches (Kim,2005). We can use that result as a baseline.
Table 4 shows the comparison between the baseline case and the best case that took into account the entire five
different link types at various recall levels. As shown in the Table, the average improvement obtained over the
base line was about 44.8% when an appropriate combination of hypertext links wa sused.
Table 4. Comparison between the best case and the baseline
This study is focused on understanding the characteristics of each link. To understand the characteristics of
the algorithm proposed in this study, a comparison with PageRank methods was performed. The experiment was
carried out from the perspective of both literature and algorithm implementation. The algorithm and parameter
values used by Brin and Page (1998) were adopted in this study.
Table 5. Comparison between the proposed method and PageRank
Division The proposed method PageRank
Processing method Create and expand the result setRe-ranking the expanded result set. Re-ranking for the result set
Link characteristics
Consideration of the 4 kinds of link characteristics – Link direction (input / output), Direct or indirect query word,
Division of expanded set (internal / external)
Link direction (input / output)
Link type used
5 kinds of link(Outgoing query word, outgoing non-
query word, incoming query word, incoming non-query word, indirect link)
1 kind of link (input link)
Calculation of link relationships
Considering the relationship between the number of links and the linked document Depends on the number of links
Ranking priority Reflect the relevance between documents and link impact
The importance of a page based on the number of input links
(PageRank value)
Calculation complexity Slightly higher normal
Applicable target Small web Massive web
The maximum performance of the 11-
point average value0.5126 0.4730
In this study, effectiveness of five types of links was measured depending on the nature of the link. Because
extended result set was used in re-ranking process and the weight of the extended result set is computed
beforehand like equation (2), link types were not distinguished as extended set. Indirect links are related to each
other too much and have low degree of influence. The kinds of indirect link were not classified because it was
meaningless to distinguish between the query word or indirect links. PageRank shows the maximum
performance of the 11-point average value of 0.4730. This is similar to the value of the maximum performance
when using only the incoming links proposed algorithm in this paper. The incoming links is important in the
influence of the link. But how to distinguish the types of links and application seems to be more important. In
particular, as shown in Table 3, whether or not containing the query words in link title have a big impact on the
search performance.
VI. CONCLUSION AND FUTURE WORKSWith the growing popularity of web documents, hypertext documents are ubiquitous. As such, it becomes
more and more important to make use of link information existing in such documents for information retrieval.
The main goal of this paper is to present a way in which this link information can be used for retrieval purposes,
and show that this information in fact improves retrieval effectiveness and reveals the role of 5 types of links
defined in previous sections. This study emphasizes that direct rather than indirect links are more important, and
incoming links are more important that outgoing links in direct links, and separation of the link title using the
words of the query is more important. For more details, refer to the experimental section. To accomplish the
goal, this paper proposes a scheme for using link information to improve retrieval effectiveness and
experimental results that proved the hypothesis that such information would be useful. The scheme has been
implemented first in the context of the efforts to build a fully-distributed, agent-based digital library prototype
(Joo, 1998), which can be modified continuously.
In the experiments using the ETRI-Kyemong test collection containing a large number of hyperlinks, as much
as 44.8 % increase was obtained in 11-point average precision, over the baseline where no link information was
used. It was observed that by using both incoming and outgoing links, we could get documents that would have
never been retrieved by matching them against the query terms. At the same time, it was possible to re-rank
those that had already been retrieved so that relevant ones could be moved toward the top of the list. This
method was compared with PageRank and results of a similar level as that used in the input link were
confirmed.
While the results show that this approach to using link information is promising, there are some limitations.
First of all, the size of the documents in the collection used was small, and all the anchors are important key
terms that correspond directly to the titles of other documents. Additional valuations should be done with more
typical document sets with hypertext, but such has not yet been found. Even with the same collection, it would
be desirable to compare these results with the blind relevance feedback strategy where a set of top-ranked
documents from the initial retrieval are combined with the original query to generate an expanded one. Still
another extension to the current work is to devise a variety of schemes for indirect links and test their efficacy.
REFERENCES
Acharyya, S, and Ghosh, J. (2004). Outlink Estimation For PageRank Computation Under Missing Data.
Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters,
486-487.
Adekannbi, Janet. (2012). Motivation and Disciplinary Distribution of Web Links Between African And World
Universities: An Exploratory Study. Journal of Emerging Trends in Computing and Information Sciences,
3(3), 373-383.
Arasu, A. et al. (2001), Searching the Web. ACM Transaction on Internet Technology (TOIT), 1(1), 2-43, Aug.
2001.
Bar-Ilan, J., Mat-Hassan, M., and Levene, M. (2006). Methods For Comparing Rankings of Search Engine
Results. Computer Networks, 50(10), 1448-1463.
Berkhin, P. (2005), A Survey on PageRank Computing. Internet Mathematics, 2(1), 73-120.
Brin, S. and Page, L. (1998). Anatomy of a large-scale hypertextual web search engine. Proceedings of the 7th
international World Wide Web Conference, Brisbane, Australlia, 107-117.
Can, F., Nuray, R., and Sevdik, A. (2004). Automatic Performance Evaluation of Web Search Engines.
Information Processing and Management, 40(3), 495-514.
Croft, W. B. and Turtle, H. (1989). A retrieval model for incorporating hypertext links. Proceedings of the 1989
the second annual ACM conference on Hypertext, 213-224.
Dhyani, D., Keong N. W., and Bhowmick, S. (2002). A Survey of Web Metrics. ACM Computing Surveys,
34(4), 469-503.
Dunlop, M. D. and Van Rijsbergen, C. J. (1993). Hypermedia and free text retrieval. Information Processing and
Management, 29(3), 287-298.
Eiron, N., McCurley, K., and Tomlin, J. (2004). Ranking the Web Frontier. Proceeding of the 13th Int'l Conf. on
World Wide Web (WWW), 309 - 318.
Frakes, William B. and Baeza-Yates, Ricardo. (1997). Information Retrieval: Data Structures and Algorithms.
Prentice Hall, Upper Saddle River, New Jersey, 363-392.
Frisse, M. E. (1998). Searching for informations in a hypertext medical handbook. Communications of ACM.
31(7), 880-886.
Garfield, E. (1972). Citation analysis as a tool in journal evaluation. Science, No. 178, 471-479.
Google, http://www.google.com.
Haveliwala, T. H. (2003). Topic-sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search.
IEEE Transactions on Knowledge and Data Engineering, 15(4), 784-796.
Hou, J. and Zhang, Y. (2003). Effectively Finding Relevant Web Pages from Linkage Information. IEEE
Transaction on knowledge and data engineering, 15(4), 940–951.
Islam, Anwarul and Md. Saiful Alam. 2011. Webometric study of private universities in Bangladesh. Malaysian
Journal of Library & Information Science, 16(2), 115-126.
Joo, W. K. et al. (1998). Separating Link Information from Documents in a Digital Library. Proceedings of the
25th KISS Spring Conference, Seoul, Korea.
Kleinberg, Jon M.. (1999). Authoritative Sources in a Hypertlinked Environment. Journal of the ACM, 46(5),
604-632.
KRISTAL, http://giis.kisti.re.kr.
Kim, KwangYoung et al. (2005). An Improvement of Information Retrieval System using Word Location
Information. Proceedings of the IEEE International Conference on Intelligent Agents, Web Technology
and Internet Commerce (IAWTIC-05), 90-95.
Lee, Kyung-Soon, Park, Y. C. and Choi, K. S. (2001). Re-ranking model based on document clusters.
Information Processing and Management, 37(1), 1-14.
Lucarella, Dario. (1990). A model for hypertext-based information retrieval. in Hypertext Concepts, Systems,
and Applications, (ed) Rizk, Streitz, and Andrie, 81-94.
Pinski, G. and Narin, F. (1976). Citation influence for journal aggregates of scientific publications: Theory, with
application to the literature of physics. Information Processing and Management, 12(5), 297-312.
Rich, Elaine and Knight, K. (1991). Artificial Intelligence 2nd edition, McGraw-Hill, New York, NY.
Savoy, Jacques. (1996). An extended vector-processing scheme for searching information in hypertext systems.
Information Processing and Management, 32(2), 155-170.
Witten, Ian H., Moffat, A. and Bell, T. C. (1999). Managing Gigabytes: Compressing and Indexing Document
and Images. Morgan Kaufmann Publishers, San Francisco, CA.
Zhang, Yi, et al. (2006). XRank: Learning More from Web User Behaviors. Proceedings of The Sixth IEEE
International Conference on Computer and Information Technology, 36.
Yoshida, Y. et al. (2008). What's Going on in Search Engine Rankings. Proceeding of the 22nd International
Conference on Advanced Information Networking and Applications - Workshops, 1199-1204.
KRISTAL, http://giis.kisti.re.kr