A Relation-Based Page Rank Algorithm for Semantic Web Search Engines
a formal point of view) the steps followed by a user
during the process of query definition. Let us imagine
that a user is interested in pages containing three generic
keywords k1, k2, and k3 (associated to as many generic
concepts c1, c2, and c3). The user begins query definition
by specifying a pair including a keyword and its related
concept. Let us assume that he or she starts with k1 and
c1. It is reasonable to assume that after specifying
keyword k1, the user inserts a second keyword (for
example, k2, together with concept c2) expecting either to
find pages where k1 and k2 (that is, c1 and c2) are related
in some way or to find pages where k1 is linked to some
other keywords/concepts that will be specified later. In a
similar way, when he or she specifies k3 and c3, he or
she would be expecting to further adjust the result set in
order to find pages showing also relations between k3
and k1 (not k2 since in the ontology, there is no relation
linking c3 with c2). Let us consider a very trivial example
assuming that there exist only two pages p1 and p2
containing all the keywords (and associated concepts)
specified by the user. This represents the (initial) result
set for the given query. We want to rank those pages in
order to present to the user first the page that best fits
his or her query. The semantic annotations and page
subgraphs for these pages are illustrated in Figs. 4c, 4d,
4e, and 4f. In the first page, both c2 and c3 are linked to
c1 through a single relation (Fig. 4c), while in the second
page there exist two relations linking c3 to c1. However,
c2 is not linked in any way to c1 (Fig. 4f). Since we
cannot assume which concepts or relations are more
important with respect to the user query, we
can provide a significant measure of page relevance by
computing the probability that a page is the one of
interest to the user (that is, its relevance) by calculating
the probability that c2 is linked to c1 and c3 is linked to
c1 through the relations in the user’s mind (either $r^{1}_{12}$ or
$r^{2}_{12}$ and $r^{1}_{13}$ or $r^{2}_{13}$, respectively). Let us compute
$P(\bar{r}_{ij}, Q, p)$, which is the probability of finding in a
particular page p a relation $\bar{r}_{ij}$ between concepts i and j
that could be the one of interest to the user (because of
query Q). According to probability theory, this can be
defined as $P(\bar{r}_{ij}, p) = \delta_{ij}/\eta_{ij} = \tau_{ij}$ (note that it does not
depend on Q). We call it the relation probability. Thus, for
the first page, we have $P(\bar{r}_{12}, p_1) = \delta_{12}/\eta_{12} = \tau_{12} = 1/2$
and $P(\bar{r}_{13}, p_1) = \delta_{13}/\eta_{13} = \tau_{13} = 1/2$. For the second page,
we have $P(\bar{r}_{12}, p_2) = \delta_{12}/\eta_{12} = \tau_{12} = 0$ and
$P(\bar{r}_{13}, p_2) = \delta_{13}/\eta_{13} = \tau_{13} = 1$. Based on the considerations above,
we can compute the joint probability
$P(Q, p) = P((\bar{r}_{12}, p) \cap (\bar{r}_{13}, p))$. The dependency on Q is due to the
fact that only concepts given in Q are taken into account.
Since the events $(\bar{r}_{12}, p)$ and $(\bar{r}_{13}, p)$ are not correlated,
$P(Q, p)$ can be rewritten as $P(Q, p) = P(\bar{r}_{12}, p) \cdot P(\bar{r}_{13}, p)$.
Thus, for the specific example being considered, it is
$P(Q, p_1) = 1/4$ and $P(Q, p_2) = 0$, respectively, for the first
and second page. This allows placing the first page
before the second one in the ordered result set. However,
to preserve the behavior of common search strategies, a
way of assigning a nonzero score to pages in
which there exist concepts not related to other concepts
will have to be identified.

Another critical situation is illustrated in Fig. 5. In this
case, the user specifies a query composed of concepts c1, c2,
and c3 over a novel ontology. Based on the considerations
above, a measure of page relevance can be computed by
estimating, for each concept, the probability of having a
relation between that concept and another concept and that
such relation is exactly the one in the user’s mind. However,
it can be demonstrated that this probability can also be
expressed in different terms, capable of taking into
account situations in which a particular concept can be
related to more than one concept (that is, the case of the
specific example being considered, as well as of common
situations in any concrete search scenario). Specifically, the
probability that each concept is related to other concepts is
given by the probability of having c1 linked to c2 and c2
linked to c3, or c1 linked to c2 and c1 linked to c3, or c2 linked
to c3 and c1 linked to c3. The situations above can be
modeled again by using graph theory. In fact, having each
concept related to at least another concept in the query is
equivalent to considering all the possible spanning forests
(a collection of spanning trees, one for each connected
component in the graph) for page subgraph $G_{Q,p}$ given the
query Q. In Fig. 6, all the possible spanning forests (trees, in
this case) of the page subgraph in Fig. 5d are shown. We call
$SF^{f}_{Q,p}$ the fth page spanning forest computed over $G_{Q,p}$. We
define $P(SF^{f}_{Q,p})$ as the probability that $SF^{f}_{Q,p}$ is the spanning
forest of interest to the user. By simplifying the notation and
LAMBERTI ET AL.: A RELATION-BASED PAGE RANK ALGORITHM FOR SEMANTIC WEB SEARCH ENGINES 129
Fig. 5. (a) An ontology graph. (b) Query subgraph. (c) An example of an annotated page. (d) Page subgraph built upon the given ontology/query.
Authorized licensed use limited to: UNIVERSITY OF PLYMOUTH. Downloaded on October 7, 2009 at 02:00 from IEEE Xplore. Restrictions apply.
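replacing $\bar{r}_{ij}, p$ with $\bar{r}^{p}_{ij}$, the probability for page p can be
computed as

\[
P(Q,p) = P\Big( \big( (\bar{r}^{p}_{12} \cap \bar{r}^{p}_{23}) \cap SF^{1}_{Q,p} \big) \cup \big( (\bar{r}^{p}_{12} \cap \bar{r}^{p}_{13}) \cap SF^{2}_{Q,p} \big) \cup \big( (\bar{r}^{p}_{23} \cap \bar{r}^{p}_{13}) \cap SF^{3}_{Q,p} \big) \Big). \tag{1}
\]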
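Since the events are not correlated, it is also

\[
\begin{aligned}
P(Q,p) &= P(\bar{r}^{p}_{12} \cap \bar{r}^{p}_{23}) \cdot P(SF^{1}_{Q,p}) + P(\bar{r}^{p}_{12} \cap \bar{r}^{p}_{13}) \cdot P(SF^{2}_{Q,p}) + P(\bar{r}^{p}_{23} \cap \bar{r}^{p}_{13}) \cdot P(SF^{3}_{Q,p}) \\
&= P(\bar{r}^{p}_{12}) \cdot P(\bar{r}^{p}_{23}) \cdot P(SF^{1}_{Q,p}) + P(\bar{r}^{p}_{12}) \cdot P(\bar{r}^{p}_{13}) \cdot P(SF^{2}_{Q,p}) + P(\bar{r}^{p}_{23}) \cdot P(\bar{r}^{p}_{13}) \cdot P(SF^{3}_{Q,p}),
\end{aligned}
\tag{2}
\]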
where $P(\bar{r}_{ij}, p)$ can be replaced with $\tau_{ij} = \delta_{ij}/\eta_{ij}$.
Since the probability for a single page spanning forest
to be the one of interest to the user is the same with
respect to the remaining ones, if we define $\sigma_{Q,p}$ as
the number of spanning forests for $G_{Q,p}$, we have
$P(SF^{1}_{Q,p}) = P(SF^{2}_{Q,p}) = P(SF^{3}_{Q,p}) = 1/\sigma_{Q,p}$. Thus, the
expression for $P(Q,p)$ can be rewritten again as

\[
P(Q,p) = \frac{P(\bar{r}^{p}_{12}) \cdot P(\bar{r}^{p}_{23}) + P(\bar{r}^{p}_{12}) \cdot P(\bar{r}^{p}_{13}) + P(\bar{r}^{p}_{23}) \cdot P(\bar{r}^{p}_{13})}{\sigma_{Q,p}}, \tag{3}
\]
and according to the definition of relation probability, it is

\[
P(Q,p) = \left[ \tau_{12} \cdot \tau_{23} + \tau_{12} \cdot \tau_{13} + \tau_{23} \cdot \tau_{13} \right] / \sigma_{Q,p}. \tag{4}
\]
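As a purely illustrative instance of (4), assume hypothetical relation probabilities of 1/2 for the pair (c1, c2), 1 for the pair (c2, c3), and 1/2 for the pair (c1, c3). These values are not taken from the paper. With three spanning forests, as in (1), the score would work out as follows:

```latex
\[
P(Q,p) = \frac{\tfrac{1}{2} \cdot 1 + \tfrac{1}{2} \cdot \tfrac{1}{2} + 1 \cdot \tfrac{1}{2}}{3}
       = \frac{5/4}{3}
       = \frac{5}{12}
\]
```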
Given the ontology and the query selected for the considered example, (4) can be used to compute a relevance score for each page in the result set and to provide a ranking within the result set itself. As expected, (4) works well also for the example in Fig. 4, where $\sigma_{Q,p} = 1$ (since the page subgraph already constitutes the only spanning forest). Nevertheless, $P(Q,p)$ can still assume a value of zero for all those pages in which there exist concepts that do not show any relation with other concepts but are still present, as keywords, in the annotated page. In the following, we will analyze this issue in detail, and we will show how to extend the methodology above in order to come to a general rule for ranking all the pages in the (initial) result set.
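The computation behind (4) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `spanning_trees` and `relevance` are hypothetical, the graph is assumed to be connected (so every spanning forest is a single tree), and concept pairs whose relation probability is zero are assumed to stay in the enumeration so that a missing relation yields a zero term. Spanning trees are found by brute force, which is only reasonable for the handful of concepts in a query.

```python
from fractions import Fraction
from itertools import combinations
from math import prod


def spanning_trees(nodes, edges):
    """Yield every subset of `edges` that forms a spanning tree over `nodes`.

    Brute force: any acyclic set of len(nodes) - 1 edges is a spanning tree.
    """
    nodes = list(nodes)
    for subset in combinations(edges, len(nodes) - 1):
        parent = {n: n for n in nodes}  # union-find forest, reset per subset

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        acyclic = True
        for u, v in subset:
            ru, rv = find(u), find(v)
            if ru == rv:  # u and v already connected: this edge closes a cycle
                acyclic = False
                break
            parent[ru] = rv
        if acyclic:
            yield subset


def relevance(nodes, tau):
    """Average, over all spanning trees, of the product of the relation
    probabilities on the tree's edges (assumed reading of Eq. (4)).

    `tau` maps a concept pair (u, v) to its relation probability.
    """
    trees = list(spanning_trees(nodes, list(tau)))
    if not trees:
        return Fraction(0)
    total = sum(prod((tau[e] for e in tree), start=Fraction(1)) for tree in trees)
    return total / len(trees)


concepts = ["c1", "c2", "c3"]

# Relation probabilities of the two pages in the Fig. 4 example.
fig4_p1 = {("c1", "c2"): Fraction(1, 2), ("c1", "c3"): Fraction(1, 2)}
fig4_p2 = {("c1", "c2"): Fraction(0), ("c1", "c3"): Fraction(1)}

print(relevance(concepts, fig4_p1))  # 1/4
print(relevance(concepts, fig4_p2))  # 0
```

For the Fig. 4 pages there is a single spanning tree, so the score reduces to the product of the two relation probabilities and reproduces the values 1/4 and 0 derived earlier. Passing a triangle of three concept pairs makes the function average over three trees, which matches the three-term sum in (4).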
We consider again an example represented by two pages (depicted in Fig. 7 and based on the same ontology as in Fig. 5a), where concept c4 (in the first page) and concept c2 (in the second page) do not show any relations with the remaining concepts. If we compute $P(Q, p_1)$ and $P(Q, p_2)$ using (4) (which is still valid since the page annotation refers to the same ontology), we get a relevance score equal to zero. Based on the definition of relevance score provided above, in order to find a nonzero score allowing each page to be ranked with respect to other pages, we have to relax the condition of having each concept related to each other concept. Since, by definition, a spanning forest contains no cycles, removing one edge means removing a link between a pair of concepts. That is, edges from all the page spanning forests have to be progressively removed, thus obtaining constrained page spanning forests composed of a decreasing number of edges (and, equivalently, of connected concepts). We maintain the term “spanning” in order to recall that each constrained page spanning forest originates from a true spanning forest in which for all the connected components of the graph, all the
130 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 1, JANUARY 2009
Fig. 6. All the possible spanning forests (trees) that could be obtained from GQ;p in Fig. 5d.
Fig. 7. (a) An annotated page p1 where concept c4 is not linked to any other concepts. (b) Page subgraph for a query Q specifying c1, c2, c3, and c4.
(c) Annotation of a second page p2, where c2 is not linked to any other concepts. (d) Page subgraph for the same query.