VIPAS: Virtual Link Powered Authority Search in the Web
Chi-Chun Lin and Ming-Syan Chen
Network Database Laboratory
National Taiwan University
M.-S. Chen NTU 2
Outline
- Motivation and Goal
- Preliminaries and Related Work
  - Introduction to Link-analysis
  - Defects of Traditional Link-analysis and Ideas for Improvement
- System Framework and Algorithms
- Implementation and Experimental Results
- Conclusions
Motivation and Goal
- To find the pages in the Web that best satisfy the user’s information need
- Traditional means for this task
  - Keyword-based search engines
- Problems
  - Some relevant pages do not contain the keywords in the page text
- An alternative method
  - Analyze the links contained in Web pages instead of ranking by keywords
HITS (1/3)
- Authority page: a page pointed to by many other pages
- Hub page: a page pointing to many other pages
- Mutual reinforcement
  - An authority pointed to by many hub pages is an even better authority
  - A hub pointing to many authority pages is an even better hub
- Based on this argument, the goal of HITS is to find the set of best authority pages
HITS (2/3)
Let $x_p$ and $y_p$ denote the authority and hub score of page $p$, respectively.
- Authority update (pages $q_1, q_2, q_3$ point to page $p$): $x_p := \sum_{q \to p} y_q$
- Hub update (page $p$ points to pages $q_1, q_2, q_3$): $y_p := \sum_{p \to q} x_q$
HITS (3/3)
Iterative algorithm
1. Obtain a set of Web pages using a keyword-based query and expand it to form a base set
2. Assign each page of the base set an initial authority and hub score of 1
3. According to its links, update the scores of each page
4. Normalize the scores so that $\sum_p (x_p)^2 = 1$ and $\sum_p (y_p)^2 = 1$, summing over all pages $p$ in the base set
5. Repeat steps 3 and 4 until the scores converge
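The iterative steps can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `hits`, the `(source, target)` link representation, and the convergence tolerance are assumptions made here. The slides do not say whether the hub update uses the old or the freshly updated authority scores; this sketch uses the updated ones.

```python
import math


def hits(links, max_iterations=100, tol=1e-9):
    """Run HITS on a base set given as (source, target) link pairs.

    Returns two dicts, (authority, hub), keyed by page.
    """
    links = set(links)
    pages = {page for link in links for page in link}
    if not pages:
        return {}, {}
    authority = {p: 1.0 for p in pages}  # step 2: all scores start at 1
    hub = {p: 1.0 for p in pages}

    def normalize(scores):
        # Step 4: scale so that the squared scores sum to 1.
        norm = math.sqrt(sum(v * v for v in scores.values()))
        return {p: (v / norm if norm else 0.0) for p, v in scores.items()}

    for _ in range(max_iterations):
        # Step 3a: x_p := sum of y_q over all links q -> p.
        new_authority = {p: 0.0 for p in pages}
        for source, target in links:
            new_authority[target] += hub[source]
        new_authority = normalize(new_authority)

        # Step 3b: y_p := sum of x_q over all links p -> q
        # (assumption: the updated authority scores are used).
        new_hub = {p: 0.0 for p in pages}
        for source, target in links:
            new_hub[source] += new_authority[target]
        new_hub = normalize(new_hub)

        change = max(
            abs(new_authority[p] - authority[p]) + abs(new_hub[p] - hub[p])
            for p in pages
        )
        authority, hub = new_authority, new_hub
        if change < tol:  # step 5: stop once the scores have converged
            break
    return authority, hub
```

For example, with links `h1 -> a`, `h2 -> a`, and `h1 -> b`, page `a` ends up with a higher authority score than `b`, and `h1` with a higher hub score than `h2`.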
The Problem with HITS
- Links in Web pages only reflect page creators’ judgment
- Sometimes a link will not be put in a page even though its destination is very relevant
  - E.g., a company’s homepage will not link to a competitor in the same industry
- We argue: page readers’ consideration should be of equal importance
The Notion of Virtual Links
The basic idea
- Identify pages that are heavily accessed within a period, and form a “hot set” from these pages
- Create “virtual links” for pages in the hot set and incorporate them into the computation of authority scores
- Design a Web warehouse for this task and utilize it to identify authoritative Web pages
System Framework
[Figure: system framework. Components: Web Pages, Page Archive, Keyword & Ranking Database, Authority Evaluator, Query Interface, Clickstream Database, Clicking Observer, Virtual Link Creator. Data-flow labels: page content & links, keywords, scores, query results, virtual links.]
Creating Virtual Links
Scenario:
- A user interested in Java-related Web pages comes to our system
- She submits a query with the keyword “java”
- Assume that the query result contains 100 URLs
- She clicks the top 10 of the 100 URLs, except the 6th
- The hot set consists of the 9 URLs clicked
Creating Virtual Links (cont’d)
[Figure: the 2 criteria for creating virtual links. Left: a single Virtual Hub shown with URL 1, URL 2, URL 5, URL 6, URL 7, URL 10. Right: Hub 1, Hub 2, ..., Hub n shown with the same URLs.]
Algorithm VIPAS (Virtual Link Powered Authority Search)
Initialization Phase
1. For a query term, perform the regular HITS analysis
2. Collect a base set of pages with computed authority and hub scores and store them in the database
Virtual Link Collection Phase
3. Monitor user behavior to see whether each URL in the result list is clicked by the user
4. After a period of user behavior observation, put URLs that are often accessed into the “hot set”
5. Create virtual links for pages in the hot set
Algorithm VIPAS (cont’d)
Refinement Phase
6. For each page in the hot set, compute its new authority and hub scores
7. Run several iterations of score updating for pages in the base set
2 flavors
- VIPAS-VH (VIPAS with virtual links from a Virtual Hub)
- VIPAS-TH (VIPAS with virtual links from Top Hubs)
Finding Hot Sets
1. In an observing period, pay attention to clicks on consecutive URLs in the list
2. When a user clicks several consecutive URLs and then skips some of the URLs that follow, we mark those that have been skipped
3. Pages marked with a frequency greater than a given threshold are excluded when forming hot sets
4. Among the pages left, those accessed by at least a given percentage of users are put into the hot set
Note: some relevant URLs that the user has already browsed will be skipped
Finding Hot Sets (cont’d)
Query result list:
1. http://java.sun.com/
2. http://www.sun.com/java/
3. http://www.javaworld.com/
4. http://java.oreilly.com/
5. http://www.jars.com/
6. …………..
Example 1: URLs 1, 2, 3, and 5 are clicked; URL 4 is skipped. URL 4 is marked.
Example 2: URLs 2, 3, and 5 are clicked; URLs 1 and 4 are skipped. URL 4 is marked, but URL 1 is not.
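The marking rule and the two thresholds can be sketched in Python as below. This is a hedged illustration, not the authors' implementation: the function names `marked_urls` and `hot_set` and the parameter names `max_mark_freq` and `min_access_frac` are hypothetical. It also assumes that a skipped URL is marked only when it lies between the first and the last clicked URL of a session, so that URLs below the point where the user stopped browsing are not marked, and it treats each session as one user.

```python
from collections import Counter


def marked_urls(result_list, clicked):
    """Return the URLs skipped between the first and last click of one session."""
    positions = [i for i, url in enumerate(result_list) if url in clicked]
    if not positions:
        return set()
    first, last = positions[0], positions[-1]
    return {url for url in result_list[first:last + 1] if url not in clicked}


def hot_set(result_list, sessions, max_mark_freq, min_access_frac):
    """Form the hot set from the click sets of several sessions.

    max_mark_freq and min_access_frac are fractions in [0, 1]
    (hypothetical names for the two thresholds on the slide).
    """
    n = len(sessions)
    mark_count = Counter()
    click_count = Counter()
    for clicked in sessions:
        clicked = set(clicked)
        click_count.update(url for url in result_list if url in clicked)
        mark_count.update(marked_urls(result_list, clicked))
    return [
        url for url in result_list
        if mark_count[url] / n <= max_mark_freq      # step 3: drop often-marked pages
        and click_count[url] / n >= min_access_frac  # step 4: keep often-accessed pages
    ]


urls = [
    "http://java.sun.com/",
    "http://www.sun.com/java/",
    "http://www.javaworld.com/",
    "http://java.oreilly.com/",
    "http://www.jars.com/",
]
session_1 = {urls[0], urls[1], urls[2], urls[4]}  # URL 4 skipped
session_2 = {urls[1], urls[2], urls[4]}           # URLs 1 and 4 skipped
```

With these two sessions, `marked_urls` marks only URL 4 in both cases. The threshold values passed to `hot_set` are up to the caller; the values 0.2 and 0.4 used in a quick check here are illustrative only.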
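Assigning Weights to Virtual Links
n pages in the hot set: $t_1, t_2, \ldots, t_n$
$w_{k,h}$ is the weight that clickstream $k$ assigns to hot-set page $t_h$.
Clickstream 1: $(t_1, t_2, t_3, t_4, x_1, x_2)$
$w_{1,1} = \frac{4}{6} \cdot \frac{4}{1+2+3+4} \; (= 0.267)$
$w_{1,2} = \frac{4}{6} \cdot \frac{3}{1+2+3+4} \; (= 0.200)$
$w_{1,3} = \frac{4}{6} \cdot \frac{2}{1+2+3+4} \; (= 0.133)$
$w_{1,4} = \frac{4}{6} \cdot \frac{1}{1+2+3+4} \; (= 0.067)$
$w_{1,5} = w_{1,6} = \cdots = w_{1,n} = 0$
Clickstream 2: $(t_3, x_1, t_1)$
$w_{2,3} = \frac{2}{3} \cdot \frac{2}{1+2} \; (= 0.444)$
$w_{2,1} = \frac{2}{3} \cdot \frac{1}{1+2} \; (= 0.222)$
$w_{2,2} = w_{2,4} = w_{2,5} = \cdots = w_{2,n} = 0$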
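The two worked clickstreams suggest a general rule, sketched below in Python. The rule is inferred from the examples rather than stated on the slide: if a clickstream contains $m$ hot-set pages among $c$ clicked pages, the hot-set page at position $j$ among them (in click order) gets weight $\frac{m}{c} \cdot \frac{m - j + 1}{1 + 2 + \cdots + m}$, so earlier clicks weigh more. The function name `clickstream_weights` is hypothetical, and each page is assumed to appear at most once per clickstream.

```python
def clickstream_weights(clickstream, hot_set):
    """Weights that one clickstream assigns to every page of the hot set."""
    hot_clicked = [page for page in clickstream if page in hot_set]
    m = len(hot_clicked)      # hot-set pages in this clickstream
    total = len(clickstream)  # all pages clicked in this clickstream
    weights = {page: 0.0 for page in hot_set}  # unclicked hot pages get 0
    if m == 0:
        return weights
    rank_sum = m * (m + 1) / 2  # 1 + 2 + ... + m
    for position, page in enumerate(hot_clicked):
        # Earlier clicks get larger shares: m, m-1, ..., 1 out of rank_sum.
        weights[page] = (m / total) * (m - position) / rank_sum
    return weights


hot = {"t1", "t2", "t3", "t4", "t5"}
w1 = clickstream_weights(["t1", "t2", "t3", "t4", "x1", "x2"], hot)
w2 = clickstream_weights(["t3", "x1", "t1"], hot)
```

Under this reading, `w1["t1"]` is about 0.267 and `w2["t3"]` is about 0.444, matching the values on the slide.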
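Assigning Weights to Virtual Links (cont’d)
Final weight for the first period $T_1$, where $N(T_1)$ is the number of clickstreams observed in $T_1$:
$w_h(T_1) = \frac{1}{N(T_1)} \sum_{k=1}^{N(T_1)} w_{k,h}$
$w_1(T_1) = \frac{0.267 + 0.222}{2} = 0.245$
$w_2(T_1) = \frac{0.200 + 0}{2} = 0.100$
For period $T_i$ where $i \ge 2$:
$w'_h(T_i) = \frac{1}{N(T_i)} \sum_{k=1}^{N(T_i)} w_{k,h}$
$w_h(T_i) = \frac{1}{3} w_h(T_{i-1}) + \frac{2}{3} w'_h(T_i)$
(1/3 is the degeneration factor)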
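The per-period averaging and the carry-over between periods can be sketched in Python as follows. The function names `period_average` and `combine_with_previous` are hypothetical, and the sketch assumes that the degeneration factor 1/3 multiplies the previous period's weight while the remaining 2/3 multiplies the current period's average.

```python
def period_average(stream_weights):
    """Average one page's per-clickstream weights over a period's clickstreams."""
    return sum(stream_weights) / len(stream_weights)


def combine_with_previous(previous_weight, current_average, degeneration=1 / 3):
    """Blend the previous period's final weight with the current period's average.

    Assumption: `degeneration` scales the old weight, and the new average
    gets the remaining share.
    """
    return degeneration * previous_weight + (1 - degeneration) * current_average


# Worked values from the slide: two clickstreams in period T1.
w1_T1 = period_average([0.267, 0.222])  # page t1
w2_T1 = period_average([0.200, 0.0])    # page t2
```

Here `w1_T1` is 0.2445, which the slide rounds to 0.245, and `w2_T1` is 0.100.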
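Computing the New Scores
Let $x_p$ and $y_p$ denote the authority and hub score of page $p$, respectively.
For each page $p$, we update $p$’s authority score by
$x_p = \sum_{q:(q,p) \in E} y_q + \lambda_A \sum_{q':(q',p) \in E'} w_{q',p} \, y_{q'}$
Similarly, we update $p$’s hub score by
$y_p = \sum_{q:(p,q) \in E} x_q + \lambda_H \sum_{q':(p,q') \in E'} w_{p,q'} \, x_{q'}$
Here $E$ is the set of real links, $E'$ the set of virtual links, $w$ the virtual-link weight, and $\lambda_A$, $\lambda_H$ the factors with subscripts A and H from the parameter list.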
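One round of this update can be sketched in Python as below. It is an illustration under assumptions, not the authors' code: the function name `update_scores`, the dictionary representation of virtual links, and the parameter names `factor_a` and `factor_h` (standing for the factors with subscripts A and H) are hypothetical. The sketch computes both new scores from the old scores and leaves normalization to the caller.

```python
def update_scores(authority, hub, links, virtual_links, factor_a, factor_h):
    """One update of authority and hub scores with real and virtual links.

    links: iterable of (source, target) real links.
    virtual_links: dict mapping (source, target) to the virtual-link weight.
    Pages missing from the score dicts are treated as having score 0.
    """
    new_authority = {}
    new_hub = {}
    for p in authority:
        # Authority: hubs pointing to p via real links, plus weighted virtual links.
        real = sum(hub.get(q, 0.0) for q, target in links if target == p)
        virtual = sum(
            w * hub.get(q, 0.0)
            for (q, target), w in virtual_links.items() if target == p
        )
        new_authority[p] = real + factor_a * virtual
    for p in hub:
        # Hub: authorities p points to via real links, plus weighted virtual links.
        real = sum(authority.get(q, 0.0) for source, q in links if source == p)
        virtual = sum(
            w * authority.get(q, 0.0)
            for (source, q), w in virtual_links.items() if source == p
        )
        new_hub[p] = real + factor_h * virtual
    return new_authority, new_hub


scores = {"a": 1.0, "b": 1.0, "h": 1.0, "v": 1.0}
new_x, new_y = update_scores(
    scores, scores,
    links=[("h", "a")],
    virtual_links={("v", "b"): 0.5},  # "v" plays the role of a virtual hub
    factor_a=10, factor_h=10,
)
```

In this toy case page `b` has no real in-links, yet it receives an authority score of 5.0 through the virtual link, compared with 1.0 for page `a`.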
User-behavior Observation
- Use an ASP script
- Query result page, e.g. the query result for keyword “Java”:
  1. The Source of Java(TM) Technology http://java.sun.com/
  2. …………………. http://….
  3. ……… http://…
- Each plain URL, e.g. http://java.sun.com/, is replaced by wrapper.asp?URL=http://java.sun.com/
- When the user clicks the wrapped link, the script will
  1. Increment the click count of http://java.sun.com/
  2. Record the time
  3. Redirect the user to http://java.sun.com/
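The wrapper logic can be sketched in Python as follows (the original system uses an ASP script, which is not shown on the slide). The names `wrap_url` and `ClickLog` are hypothetical. Unlike the slide's example, this sketch percent-encodes the target URL inside the query string, which is an assumption made here for safe parsing.

```python
import time
from urllib.parse import parse_qs, quote, urlsplit


def wrap_url(url, wrapper="wrapper.asp"):
    """Rewrite a plain result URL so that clicks pass through the wrapper."""
    return f"{wrapper}?URL={quote(url, safe='')}"


class ClickLog:
    """In-memory stand-in for the clickstream database."""

    def __init__(self):
        self.counts = {}  # URL -> number of clicks
        self.events = []  # (timestamp, URL) in click order

    def handle(self, wrapped_url, now=None):
        """Process one click on a wrapped link and return the redirect target."""
        target = parse_qs(urlsplit(wrapped_url).query)["URL"][0]
        self.counts[target] = self.counts.get(target, 0) + 1  # 1. count the click
        timestamp = time.time() if now is None else now
        self.events.append((timestamp, target))               # 2. record the time
        return target                                         # 3. redirect here


log = ClickLog()
wrapped = wrap_url("http://java.sun.com/")
redirect_to = log.handle(wrapped, now=0.0)
```

After this call, `redirect_to` is the original URL and its click count in `log.counts` is 1. A real deployment would issue an HTTP redirect to `redirect_to` and persist the log.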
Implementation and Experiments
- Experimental testbed: NTUEE website (http://www.ee.ntu.edu.tw/)
- Data collection: 03/28/’02 ~ 05/31/’02
- Parameters
Parameter Value
20%
40%
A 10
H 10
Evaluation Method
- For a keyword, we manually select a list of authority pages and compare it with the output of each algorithm
SN URL (H denotes http://www.ee.ntu.edu.tw) Title
5633 H/www/faculty/rb-wu/rb-wu.htm Homepage of professor Ruey-Beei Wu
7228 H/html_2000/WWW/faculty/english/Wu-Rei-Bei.html [no title]
8682 H/html_2000/www/faculty/rb-wu/rb-wu.htm Homepage of professor Ruey-Beei Wu
- Discrepancy coefficient: $\frac{1}{n} \sum_{k=1}^{n} (R_k - k)$, where $n$ is the number of selected authority pages and $R_k$ is the $k$-th smallest rank at which one of them appears in the algorithm’s output
Discrepancy Coefficient – Regular HITS
Rank SN URL (H denotes http://www.ee.ntu.edu.tw) Title
1 5633 H/www/faculty/rb-wu/rb-wu.htm Homepage of professor Ruey-Beei Wu
2 93 H/professor_c.html Faculty members of NTUEE
3 34 H/prodata_c.html Faculty members of NTUEE
4 94 H/professor_e.html Faculty members of NTUEE
5 8682 H/html_2000/www/faculty/rb-wu/rb-wu.htm Homepage of professor Ruey-Beei Wu
6 7229 H/html_2000/WWW/faculty/english/Cao-Heng-Wei.html [no title]
7 7269 H/html_2000/WWW/faculty/english/Chen-Qiu-Lin.html [no title]
8 5892 H/html_2000/WWW/faculty/NoSort.html [no title]
9 4959 H/content/chinese/required/differential_equations.html Engineering Mathematics I: Diff….
10 8904 H/html_2000/content/chinese/required/differential_equations.html Engineering Mathematics I: Diff….
41 7228 H/html_2000/WWW/faculty/english/Wu-Rei-Bei.html [no title]
$R_1 = 1$ (SN 5633), $R_2 = 5$ (SN 8682), $R_3 = 41$ (SN 7228)
$\frac{(1-1) + (5-2) + (41-3)}{3} = 13.67$
Discrepancy Coefficient – VIPAS-VH
Rank SN URL (H denotes http://www.ee.ntu.edu.tw) Title
1 5633 H/www/faculty/rb-wu/rb-wu.htm Homepage of professor Ruey-Beei Wu
2 93 H/professor_c.html Faculty members of NTUEE
3 34 H/prodata_c.html Faculty members of NTUEE
4 94 H/professor_e.html Faculty members of NTUEE
5 8682 H/html_2000/www/faculty/rb-wu/rb-wu.htm Homepage of professor Ruey-Beei Wu
6 7228 H/html_2000/WWW/faculty/english/Wu-Rei-Bei.html [no title]
7 7229 H/html_2000/WWW/faculty/english/Cao-Heng-Wei.html [no title]
8 7269 H/html_2000/WWW/faculty/english/Chen-Qiu-Lin.html [no title]
9 5892 H/html_2000/WWW/faculty/NoSort.html [no title]
10 4959 H/content/chinese/required/differential_equations.html Engineering Mathematics I: Diff….
$R_1 = 1$ (SN 5633), $R_2 = 5$ (SN 8682), $R_3 = 6$ (SN 7228)
$\frac{(1-1) + (5-2) + (6-3)}{3} = 2$
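The discrepancy coefficient of the two worked examples can be reproduced with a few lines of Python. The function name `discrepancy_coefficient` is hypothetical; the input is the list of ranks at which the manually selected authority pages appear in an algorithm's output.

```python
def discrepancy_coefficient(ranks):
    """Mean of (R_k - k), with the ranks R_1 <= R_2 <= ... sorted ascending."""
    ordered = sorted(ranks)
    return sum(r - k for k, r in enumerate(ordered, start=1)) / len(ordered)


hits_dc = discrepancy_coefficient([1, 5, 41])   # regular HITS example
vipas_dc = discrepancy_coefficient([1, 5, 6])   # VIPAS-VH example
```

A value of 0 means the selected pages occupy exactly the top $n$ positions; larger values mean they are pushed further down the ranking.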
Evaluation Method
- Grouping coefficient: $\sqrt{\frac{1}{n} \sum_{k=1}^{n} [(R_k - k) - \mu]^2}$, where $\mu$ is the discrepancy coefficient
- Stability: the standard deviation of each algorithm’s discrepancy coefficients for all of the keywords
Grouping Coefficient – Regular HITS
$R_1 = 1$ (SN 5633), $R_2 = 5$ (SN 8682), $R_3 = 41$ (SN 7228)
$\sqrt{\frac{[(1-1) - 13.67]^2 + [(5-2) - 13.67]^2 + [(41-3) - 13.67]^2}{3}} = 17.25$
Grouping Coefficient – VIPAS-VH
$R_1 = 1$ (SN 5633), $R_2 = 5$ (SN 8682), $R_3 = 6$ (SN 7228)
$\sqrt{\frac{[(1-1) - 2]^2 + [(5-2) - 2]^2 + [(6-3) - 2]^2}{3}} = 1.41$
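The grouping coefficient is the population standard deviation of the offsets $R_k - k$ around the discrepancy coefficient. The Python sketch below reproduces the two worked values; the function name `grouping_coefficient` is hypothetical.

```python
import math


def grouping_coefficient(ranks):
    """Standard deviation of the offsets (R_k - k) around their mean."""
    offsets = [r - k for k, r in enumerate(sorted(ranks), start=1)]
    mean = sum(offsets) / len(offsets)  # this mean is the discrepancy coefficient
    return math.sqrt(sum((d - mean) ** 2 for d in offsets) / len(offsets))


hits_gc = grouping_coefficient([1, 5, 41])   # regular HITS example
vipas_gc = grouping_coefficient([1, 5, 6])   # VIPAS-VH example
```

A small value means the selected authority pages sit close together in the ranking; the regular HITS example is large because one page is ranked 41st.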
Experimental Results
[Figure: two plots over keywords 1 to 8 comparing HITS, VIPAS-VH, and VIPAS-TH. Top: discrepancy coefficient (axis 0 to 25). Bottom: grouping coefficient (axis 0 to 20).]
Experimental Results (cont’d)
[Figure: stability of HITS, VIPAS-VH, and VIPAS-TH (axis 0 to 9).]
Conclusions
- Link-analysis algorithms are popular in Web information retrieval
  - But they need further improvement
- In our work, we built a Web warehouse
  - It incorporates user feedback into the identification of authoritative resources (Algorithm VIPAS)
- Experimental results show that VIPAS is effective: in the worked example, VIPAS-VH lowers the discrepancy coefficient from 13.67 to 2 and the grouping coefficient from 17.25 to 1.41, so the warehouse retrieves more valuable information for users