
Evaluation of Bipartite-graph-based Web Page Clustering

Shim Wonbo, M1, Chikayama-Taura Lab

Background

Open Directory Project

Used by Google, Lycos, etc.

Web pages are categorized by hand

Accurate

Slow to update

Not scalable

World Wide Web

Rapid increase (= # of clusters changes)

Daily updates (= cluster centers move)

Because of these two properties of the Web, a Web page clustering system that needs no human effort is required.

Purpose

Constructing a Web page clustering system which

finds clusters without human help

is scalable

clusters Web pages at high speed

clusters Web pages accurately

Agenda

Introduction Related Work Proposal Comparison Conclusion

Clustering Algorithm

Text-based clustering

Uses words as features

Generally used algorithms

Link-based clustering

Focuses on link structure

Used especially for clustering Web pages

k-means Algorithm

k = 3

point: vector representation of each document

Problems of k-means Algorithm

The appropriate k depends on the data set.

Outliers strongly affect the clustering result.
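
Both problems can be seen in a minimal sketch of k-means (Lloyd's algorithm). This is an illustration only: the function name `kmeans`, the 2-D toy points, the fixed initial centers, and the use of Euclidean distance are all assumptions, not part of the original slides.

```python
import math

def kmeans(points, centers, iterations=20):
    """Plain k-means; k is fixed by the number of initial centers."""
    centers = [tuple(c) for c in centers]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        assignment = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its points.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                new_centers.append(old)  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return centers, assignment

# Two obvious groups (toy data, assumed for illustration).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
initial = [(0, 0), (10, 10)]  # k = 2 must be chosen in advance

_, labels_clean = kmeans(points, initial)
_, labels_outlier = kmeans(points + [(100, 100)], initial)
print(labels_clean)
print(labels_outlier)
```

In this toy run a single far-away point ends up alone in its own cluster, and the two real groups are merged into the other one, because k stays fixed at 2.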

Hierarchical Clustering

BIRCH [Zhang ’96], CURE [Guha ’98], Chameleon [Karypis ’99], ROCK [Guha ’00]

Hierarchical Clustering

The # of clusters can be determined by a stopping condition.

Clustering a large number of points (pages) results in many I/O accesses.

Use of Link Structure

Web pages include not only text but also links.

People link Web pages to other related pages.

Linked Web pages may share the same topic

Extraction of Web Community based on Link Analysis

An Approach to Find Related Communities Based on Bipartite Graphs [P.Krishna Reddy et al., 2001]

Terminology

Fans and Centers

Bipartite Graph (BG)

Complete BG (CBG)

Dense BG (DBG)

[Figure: fans linking to centers in (a) a CBG and (b) a DBG, labeled with p, q and the condition p >= pt, q >= qt]

An Approach to Find Related Communities Based on Bipartite Graphs

Definition: The set T contains the members of the community if there exists a dense bipartite graph DBG(T, I, p, q), where

T: fans

I: centers

p: # of out-links from each fan to the centers

q: # of in-links to each center from the fans

p >= pt, q >= qt

DBG(T, I, 2, 3)
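
The definition can be checked mechanically. The sketch below assumes one reading of the slide: every fan in T has at least p out-links into I, and every center in I has at least q in-links from T. The function name `is_dbg` and the toy link set are hypothetical.

```python
def is_dbg(fans, centers, links, p, q):
    """Check the DBG(T, I, p, q) condition on directed links (fan, center).

    Assumed reading: each fan has >= p out-links to centers and
    each center has >= q in-links from fans.
    """
    fans, centers = set(fans), set(centers)
    out_deg = {u: 0 for u in fans}
    in_deg = {v: 0 for v in centers}
    for u, v in links:
        # Only links that go from a fan to a center count.
        if u in fans and v in centers:
            out_deg[u] += 1
            in_deg[v] += 1
    return (all(d >= p for d in out_deg.values())
            and all(d >= q for d in in_deg.values()))

# Toy example: dense but not complete ("a" does not link to "z").
links = {
    ("a", "x"), ("a", "y"),
    ("b", "x"), ("b", "y"), ("b", "z"),
    ("c", "x"), ("c", "z"),
    ("d", "y"), ("d", "z"),
}
T = {"a", "b", "c", "d"}
I = {"x", "y", "z"}
print(is_dbg(T, I, links, 2, 3))
print(is_dbg(T, I, links, 3, 3))
```

With these links the pair (T, I) satisfies DBG(T, I, 2, 3) but not DBG(T, I, 3, 3), since three of the four fans have only two out-links.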

DBG Extraction Algorithm (pt = 2, qt = 3)

1. Gathering related nodes

threshold = 1

DBG Extraction Algorithm (pt = 2, qt = 3)

2. Extracting a DBG

[Figure: worked example of DBG extraction on a small graph; nodes carry numeric labels]
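
The two steps can be sketched as follows. The slides give only the step names and the parameter values, so the details here are assumptions: "gathering" is read as collecting pages that share at least `threshold` out-link targets with a seed page, and "extracting" as repeatedly pruning centers with fewer than qt in-links and fans with fewer than pt out-links. The names `gather_related` and `extract_dbg` and the toy graph are hypothetical, and the code is not optimized to the O(# of links) cost mentioned for the original method.

```python
def gather_related(seed, out_links, threshold=1):
    """Pages sharing at least `threshold` out-link targets with the seed."""
    seed_targets = out_links.get(seed, set())
    return {
        page for page, targets in out_links.items()
        if len(targets & seed_targets) >= threshold
    }

def extract_dbg(fans, out_links, pt, qt):
    """Prune fans and centers until every remaining node meets pt / qt."""
    fans = set(fans)
    centers = set()
    for f in fans:
        centers |= out_links[f]
    changed = True
    while changed:
        changed = False
        # Drop centers that receive fewer than qt links from current fans.
        weak_centers = {
            c for c in centers
            if sum(c in out_links[f] for f in fans) < qt
        }
        if weak_centers:
            centers -= weak_centers
            changed = True
        # Drop fans that point to fewer than pt of the current centers.
        weak_fans = {f for f in fans if len(out_links[f] & centers) < pt}
        if weak_fans:
            fans -= weak_fans
            changed = True
    return fans, centers

# Toy Web graph: page -> set of pages it links to (assumed data).
out_links = {
    "a": {"x", "y"},
    "b": {"x", "y", "z"},
    "c": {"x", "z"},
    "d": {"y", "z"},
    "e": {"z", "w"},   # shares no target with the seed "a"
    "g": {"x", "u"},   # gathered, but pruned during extraction
}
related = gather_related("a", out_links, threshold=1)
fans, centers = extract_dbg(related, out_links, pt=2, qt=3)
print(sorted(related), sorted(fans), sorted(centers))
```

In this toy graph, page "g" is gathered because it shares target "x" with the seed, but it is pruned once its other target "u" is dropped for having too few in-links.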

DBG-based Web Community

Pros:

High speed (O(# of links))

Finds topics across the Web

Cons:

May extract groups of unrelated Web pages

Comparison

Text-based clustering

Accurate

Difficult to determine the center of a cluster

Community topology based on DBG

Inaccurate

Can be used for topic selection

Refined Web community: used as the center of a cluster

Agenda

Introduction Related Work Proposal Comparison Conclusion

Proposal

1. Extract DBGs through link analysis

2. Refine communities and fix centers with DBSCAN

3. Partition other pages to the nearest center

Community Extraction

Extract DBGs from the Web graph

Do not allow the same page to be included in more than one Web community
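
One simple way to enforce the disjointness constraint is sketched below. The slides do not say how conflicts are resolved, so the first-come rule used here (a page stays in the first community that claims it) is purely an assumption, as is the function name `make_disjoint`.

```python
def make_disjoint(communities):
    """Keep each page only in the first community that contains it."""
    claimed = set()
    result = []
    for members in communities:
        fresh = [page for page in members if page not in claimed]
        claimed.update(fresh)
        if fresh:  # drop communities that lost all of their pages
            result.append(fresh)
    return result

# Toy input: three extracted page groups with overlaps (assumed data).
extracted = [["a", "b", "c"], ["c", "d"], ["a", "b"]]
disjoint = make_disjoint(extracted)
print(disjoint)
```

A community that loses pages this way may no longer satisfy the DBG condition, so a real system would probably re-check it or exclude claimed pages before extraction.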

Web Graph

Cluster Center Refinement

Find meaningful page sets

1. Does the DBG really have a topic?

2. Is there any page in the community that is not related to the topic?

Feature: terms of extracted pages

DBSCAN [Martin Ester et al., A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996]

DBSCAN

radius: r

minP: m (minimum # of points within radius r)

[Figure: core point with radius r; density-reachable points]

Community (center of cluster)
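
A minimal DBSCAN sketch shows how the two parameters act. The slides use page terms as features but do not name the distance, so Euclidean distance on small 2-D toy points is an assumption here; the function name `dbscan` and the `NOISE` label are hypothetical as well.

```python
import math

NOISE = -1

def dbscan(points, r, min_pts):
    """Label each point with a cluster index, or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within radius r of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= r]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # not a core point (may become a border point)
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point: keep expanding
        cluster += 1
    return labels

# Toy data: two dense groups and one isolated point (assumed).
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (10, 10), (10, 11), (11, 10)]
labels = dbscan(pts, r=1.5, min_pts=3)
print(labels)
```

Applied to the pages of one extracted community, points labeled as noise would correspond to pages unrelated to the topic, and a community with no dense group at all would have no clear topic. That mapping is an interpretation of the slides, not a stated rule.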

Partitioning Remaining Pages

Feature: term’s appearance

1. Calculate distance between a remaining page and each center

2. If the distance to the nearest center is shorter than a threshold, attach the page to that cluster

3. Otherwise, attach the page to the “Unclassified” cluster
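
The three steps above can be sketched as below. The slides do not define the distance or how a center is represented, so this sketch assumes each center is a single vector (for example, a mean term vector of the refined community) and uses Euclidean distance. The name `partition` and the toy vectors are hypothetical.

```python
import math

def partition(pages, centers, threshold):
    """Attach each page to its nearest center, or to "Unclassified"."""
    clusters = {name: [] for name in centers}
    clusters["Unclassified"] = []
    for page_id, vec in pages.items():
        # Step 1: distance between the page and each center.
        nearest = min(centers, key=lambda name: math.dist(vec, centers[name]))
        # Step 2: accept only if the nearest center is close enough.
        if math.dist(vec, centers[nearest]) < threshold:
            clusters[nearest].append(page_id)
        else:
            # Step 3: too far from every center.
            clusters["Unclassified"].append(page_id)
    return clusters

centers = {"c0": (0.0, 0.0), "c1": (10.0, 10.0)}
pages = {"p1": (1.0, 0.0), "p2": (9.0, 10.0), "p3": (5.0, 5.0)}
result = partition(pages, centers, threshold=3.0)
print(result)
```

Here "p3" lies about 7.07 away from both centers, which exceeds the threshold of 3.0, so it goes to the "Unclassified" cluster.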

Agenda

Introduction Related Work Proposal Experimental Result Conclusion

Target

Seed: 3,000 pages categorized under Computer/Software by ODP

70,000 pages reachable from the seed pages within 2 hops

Preprocess

Word ID

Use the words of a dictionary as base vectors

Assign the same ID to words sharing the same derivation

Add terms which appear in many documents (IDF <= 8)

Total: 29347

Link Extraction

Eliminate links to pages which were not collected
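
Two of the preprocessing steps can be sketched in a few lines. The slides do not give the IDF formula or the logarithm base, so `log(N / df)` with base 2 is an assumption, and the function names `frequent_terms` and `keep_internal_links` and the toy data are hypothetical.

```python
import math

def frequent_terms(doc_terms, max_idf=8.0, base=2):
    """Terms whose IDF = log(N / df) is at most max_idf (assumed formula)."""
    n = len(doc_terms)
    df = {}
    for terms in doc_terms:
        for t in set(terms):  # count each term once per document
            df[t] = df.get(t, 0) + 1
    return {t for t, c in df.items() if math.log(n / c, base) <= max_idf}

def keep_internal_links(out_links, collected):
    """Drop links whose source or target page was not collected."""
    collected = set(collected)
    return {
        page: {t for t in targets if t in collected}
        for page, targets in out_links.items()
        if page in collected
    }

# Toy data (assumed): four tiny documents and a three-page crawl.
docs = [["web", "page"], ["web", "link"], ["web"], ["graph"]]
common = frequent_terms(docs, max_idf=1.0)

raw_links = {"p1": {"p2", "p9"}, "p2": {"p1", "p3"}, "p3": {"p8"}}
links = keep_internal_links(raw_links, collected={"p1", "p2", "p3"})
print(common, links)
```

A low IDF means the term occurs in many documents, so the `<=` comparison selects frequent terms, matching the direction of the condition IDF <= 8.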

# Communities

[Figure: # of communities (0 to 1200) plotted against threshold (0 to 10); series (2,2), (2,3), (2,4), (3,2), (3,3), (3,4), (4,2), (4,3), (4,4)]

# Community Members (pt=3, qt=3)

[Figure: # of community members (0 to 45000) plotted against Nth community (0 to 14); series (3,3,2), (3,3,3), (3,3,4), (3,3,5), (3,3,6), (3,3,7), (3,3,8), (3,3,9)]

# Community Members

[Figure: # of community members (0 to 60000) plotted against threshold (0 to 10); series (2,2), (2,3), (2,4), (3,2), (3,3), (3,4), (4,2), (4,3), (4,4)]

Variance of Terms

After DBSCAN

Conclusion

Future Work

Applying to a larger data set

This may need parallel processing

Analyzing with
