evaluation of bipartite-graph-based web page clustering shim wonbo m1 chikayama-taura lab
DESCRIPTION
Open Directory Project Used by Google, Lycos, etc. Categorizing Web pages by hand Accurate Lately updated UnscalableTRANSCRIPT
Evaluation of Bipartite-graph-based Web Page Clustering
Shim WonboM1 Chikayama-Taura Lab
Background
Open Directory Project
Used by Google, Lycos, etc. Categorizing Web pages by hand
Accurate
Lately updated Unscalable
World Wide Web
Rapid increase (= # of clusters changes) Daily updated (= cluster centers move)
Due to these two properties of the Web.. A Web page clustering system without
human effort is needed.
Purpose
Constructing a Web page clustering system which finds clusters without human help is scalable clusters Web pages in high speed clusters Web pages accurately
Agenda
Introduction Related Work Proposal Comparison Conclusion
Clustering Algorithm
Text-based clustering Use of word as feature Generally used algorithm
Link-based clustering Focus on link structure Especially used in clustering Web pages
k-means Algorithm
k = 3point: vector expressionof each document
Problems of k-means Algorithm
k depends on the data set. Outliers sensitively effect clustering
result.
Hierarchical Clustering
BIRCH [Zhang ’96], CURE [Guha ’98], Chameleon [Karypis ’99], ROCK [Guha ’00]
Hierarchical Clustering
# of clusters can be determined by condition.
Clustering a large number of points (pages) results in many I/O accesses.
Use of Link Structure
Web pages include not only text but also links.
People link Web pages to other related pages.
Linked Web pages may share the same topic
Extraction of Web Community based on Link Analysis
An Approach to Find Related Communities Based on Bipartite Graphs [P.Krishna Reddy et al., 2001]
Terminology
Fans and Centers
Bipartite Graph Complete BG Dense BG
Fan Center
(a) CBG (b) DBG
pq
),( tt qqpp
An Approach to Find Related Communities Based on Bipartite Graphs
DefinitionThe set T contains the members of the community if there exist a dense bipartite graph DBG(T, I, p, q) where T: Fans I: Centers p: # of out-link q: # of in-link
tt qqpp ,
pq
DBG(T, I, 2, 3)
DBG Extraction Algorithm (pt = 2, qt = 3)
1. Gathering related nodes
threshold = 1
DBG Extraction Algorithm(pt = 2, qt = 3)
2. Extracting a DBG
1 2
1
21
0
2
3
2
2
3
3
3
DBG-based Web Community
O High speed (O( #links ))O Finding out topics over the Web
X Possibility of extracting disrelated Web page group
Comparison
Text-based clustering Accurate Difficult to determine the center of cluster
Community topology based on DBG Inaccurate Can be used as topic selection
Refined Web Community Center of Cluster
Agenda
Introduction Related Word Proposal Comparison Conclusion
Proposal
1. Extract DBGs through link analysis2. Refine communities and fix centers wi
th DBSCAN3. Partition other pages to the nearest c
enter
Community Extraction
Extract DBGs from the Web Graph Disallow the same page to be included in
more than one Web community
Web Graph
Cluster Center Refinement
Find meaningful page sets1. Does the DBGs really have a topic?2. Is there any page in the community that is not r
elated the topic?
Feature: terms of extracted pages DBSCAN [Martin Easter et al., A Density-Based Algorith
m for Discovering Clusters in Large Spatial Databases with Noise, 1999]
DBSCANradius: rminP: m
r
Core
Density reachable
Community(Center of cluster)
Partitioning Remaining Pages
Feature: term’s appearance
1. Calculate distance between a remaining page and each center
2. If the distance to the nearest center is shorter than threshold, attach the page to that cluster
3. Otherwise, attach the page to “Unclassified cluster”
Agenda
Introduction Related Word Proposal Experimental Result Conclusion
Target
Seed: 3,000 pages categorized to Computer/Software by ODP
70,000 pages departed from seed pages by 2 hops
Preprocess
Word ID Use words of a dictionary as base vectors Attribute the same ID to words sharing the same
derivation Add terms which appear in many documents
(IDF <= 8) Total: 29347
Link Extraction Elimination of links to pages which are not
collected.
# Communities
0
200
400
600
800
1000
1200
0 1 2 3 4 5 6 7 8 9 10Threshold
# Com
munit
ies
(2,2)(2,3)(2,4)(3,2)(3,3)(3,4)(4,2)(4,3)(4,4)
# Community Members (pt=3, qt=3)
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
0 2 4 6 8 10 12 14Nth Community
# Com
munit
y Mem
bers (3,3,2)
(3,3,3)(3,3,4)(3,3,5)(3,3,6)(3,3,7)(3,3,8)(3,3,9)
# Community Members
0
10000
20000
30000
40000
50000
60000
0 1 2 3 4 5 6 7 8 9 10Threshold
# Com
munit
y Mem
bers
(2,2)(2,3)(2,4)(3,2)(3,3)(3,4)(4,2)(4,3)(4,4)
Variance of Terms
After DBSCAN
Conclusion
Future Work
Applying to more large data set This may need parallel processing
Analyzing with