Evaluation of Bipartite-graph-based Web Page Clustering

Shim Wonbo, M1, Chikayama-Taura Lab



Post on 20-Jan-2018


Page 1: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Evaluation of Bipartite-graph-based Web Page Clustering

Shim Wonbo, M1, Chikayama-Taura Lab

Page 2: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Background

Page 3: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Open Directory Project

Used by Google, Lycos, etc.
Categorizing Web pages by hand
  Accurate
  Slow to update
  Unscalable

Page 4: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

World Wide Web

Rapid increase (= the number of clusters changes)
Daily updates (= cluster centers move)

Due to these two properties of the Web, a Web page clustering system that needs no human effort is required.

Page 5: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Purpose

Constructing a Web page clustering system which
  finds clusters without human help
  is scalable
  clusters Web pages at high speed
  clusters Web pages accurately

Page 6: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Agenda

Introduction
Related Work
Proposal
Comparison
Conclusion

Page 7: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Clustering Algorithm

Text-based clustering
  Uses words as features
  Generally used algorithm

Link-based clustering
  Focuses on link structure
  Used especially for clustering Web pages

Page 8: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

k-means Algorithm

k = 3
Point: vector representation of each document

Page 9: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Problems of k-means Algorithm

The appropriate k depends on the data set.
Outliers strongly affect the clustering result.
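The outlier problem can be illustrated with a minimal k-means sketch. This is a generic Lloyd-style iteration on 1-D points, not the implementation used in the talk; the function name `kmeans_1d`, the sample data, and the initial centers are all illustrative assumptions.

```python
def kmeans_1d(points, centers, iterations=20):
    """Plain Lloyd iteration on 1-D points with fixed initial centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            groups[nearest].append(x)
        # Update step: each center moves to the mean of its group
        # (an empty group keeps its previous center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


# Two well-separated groups (hypothetical data).
clean = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
# The same data plus a single far-away outlier.
with_outlier = clean + [100.0]

centers_clean = kmeans_1d(clean, centers=[0.0, 10.0])            # [2.0, 12.0]
centers_outlier = kmeans_1d(with_outlier, centers=[0.0, 10.0])   # [7.0, 100.0]
```

With k = 2 and these starting centers, the single outlier ends up with a center of its own and the two real groups are merged into one cluster.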

Page 10: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Hierarchical Clustering

BIRCH [Zhang ’96], CURE [Guha ’98], Chameleon [Karypis ’99], ROCK [Guha ’00]

Page 11: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Hierarchical Clustering

The number of clusters can be determined by a condition.

Clustering a large number of points (pages) results in many I/O accesses.

Page 12: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Use of Link Structure

Web pages include not only text but also links.

People link Web pages to other related pages.

Linked Web pages may share the same topic

Page 13: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Extraction of Web Community based on Link Analysis

An Approach to Find Related Communities Based on Bipartite Graphs [P. Krishna Reddy et al., 2001]

Page 14: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Terminology

Fans and Centers

Bipartite Graph (BG)
  Complete BG (CBG)
  Dense BG (DBG)

[Figure: fans and centers in (a) a CBG and (b) a DBG, labeled with p, q and (p ≥ pt, q ≥ qt)]

Page 15: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

An Approach to Find Related Communities Based on Bipartite Graphs

Definition
The set T contains the members of the community if there exists a dense bipartite graph DBG(T, I, p, q), where
  T: Fans
  I: Centers
  p: # of out-links
  q: # of in-links

[Figure: example DBG(T, I, 2, 3), labeled with p, q and (p ≥ pt, q ≥ qt)]
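A small sketch of how the DBG(T, I, p, q) condition could be checked. It assumes p and q are lower bounds, that is, each fan links to at least p centers and each center is linked from at least q fans; the slide text does not spell this out. The function name `is_dbg` and the `links` dictionary format are hypothetical.

```python
def is_dbg(links, fans, centers, p, q):
    """Check whether (fans, centers) satisfies the assumed DBG conditions.

    links maps each page to the set of pages it links to.
    """
    fans = set(fans)
    centers = set(centers)
    # Each fan must link to at least p of the centers.
    for fan in fans:
        if len(links.get(fan, set()) & centers) < p:
            return False
    # Each center must be linked from at least q of the fans.
    for center in centers:
        in_links = sum(1 for fan in fans if center in links.get(fan, set()))
        if in_links < q:
            return False
    return True


# Hypothetical toy graph: three fans, two centers.
links = {"a": {"x", "y"}, "b": {"x", "y"}, "c": {"x"}}

strict = is_dbg(links, {"a", "b", "c"}, {"x", "y"}, p=2, q=3)   # False: c has one out-link
relaxed = is_dbg(links, {"a", "b", "c"}, {"x", "y"}, p=1, q=2)  # True
```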

Page 16: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

DBG Extraction Algorithm (pt = 2, qt = 3)

1. Gathering related nodes

threshold = 1
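One way to read the gathering step is sketched here. It assumes that a page is "related" to a seed page when the two share at least `threshold` common link targets; the slide gives only the threshold value, so this criterion, the function name `gather_related`, and the toy graph are assumptions.

```python
def gather_related(links, seed, threshold=1):
    """Collect pages sharing at least `threshold` link targets with the seed.

    links maps each page to the set of pages it links to.
    """
    seed_targets = links.get(seed, set())
    related = {seed}
    for page, targets in links.items():
        # Count the link targets that this page has in common with the seed.
        if page != seed and len(targets & seed_targets) >= threshold:
            related.add(page)
    return related


# Hypothetical toy graph.
links = {
    "seed": {"x", "y"},
    "a": {"x"},          # shares x with the seed
    "b": {"x", "y"},     # shares x and y
    "c": {"z"},          # shares nothing
}

related_1 = gather_related(links, "seed", threshold=1)  # seed, a, b
related_2 = gather_related(links, "seed", threshold=2)  # seed, b
```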

Page 17: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

DBG Extraction Algorithm (pt = 2, qt = 3)

2. Extracting a DBG

[Figure: example graph with a number on each node during DBG extraction]
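The extraction step can be sketched as iterative pruning: repeatedly drop centers with fewer than qt in-links and fans with fewer than pt out-links until nothing changes. This is an assumed reading of the step, not code from the talk; `extract_dbg` and the toy graph are hypothetical.

```python
def extract_dbg(links, fans, pt, qt):
    """Prune fans and centers until every remaining node meets pt / qt.

    links maps each page to the set of pages it links to.
    Returns (fans, centers); both are empty if no DBG survives.
    """
    fans = set(fans)
    centers = set()
    for fan in fans:
        centers |= links.get(fan, set())

    changed = True
    while changed:
        # Drop centers linked from fewer than qt remaining fans.
        weak_centers = {
            c for c in centers
            if sum(1 for f in fans if c in links.get(f, set())) < qt
        }
        centers -= weak_centers
        # Drop fans linking to fewer than pt remaining centers.
        weak_fans = {f for f in fans if len(links.get(f, set()) & centers) < pt}
        fans -= weak_fans
        changed = bool(weak_centers or weak_fans)
    return fans, centers


# Hypothetical gathered nodes; d and z are only loosely connected.
links = {"a": {"x", "y"}, "b": {"x", "y"}, "c": {"x", "y", "z"}, "d": {"z"}}
dbg_fans, dbg_centers = extract_dbg(links, {"a", "b", "c", "d"}, pt=2, qt=3)
```

Each pass only removes nodes, so the loop terminates after at most as many passes as there are nodes.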

Page 18: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

DBG-based Web Community

O High speed (O(#links))
O Finds topics across the Web

X May extract groups of unrelated Web pages

Page 19: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Comparison

Text-based clustering
  Accurate
  Difficult to determine the center of a cluster

Community topology based on DBG
  Inaccurate
  Can be used for topic selection

Refined Web community: center of cluster

Page 20: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Agenda

Introduction
Related Work
Proposal
Comparison
Conclusion

Page 21: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Proposal

1. Extract DBGs through link analysis
2. Refine communities and fix centers with DBSCAN
3. Partition other pages to the nearest center

Page 22: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Community Extraction

Extract DBGs from the Web graph
Do not allow the same page to be included in more than one Web community

[Figure: Web graph]
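The slide does not say how overlaps between communities are resolved. One simple policy, shown here purely as an assumption, is to keep each page in the first community that claims it; `make_disjoint` is a hypothetical helper.

```python
def make_disjoint(communities):
    """Keep each page only in the first community that contains it."""
    used = set()
    result = []
    for members in communities:
        # Pages already claimed by an earlier community are skipped.
        fresh = [page for page in members if page not in used]
        if fresh:
            result.append(fresh)
            used.update(fresh)
    return result


# Hypothetical candidate communities in extraction order.
candidates = [["a", "b", "c"], ["c", "d"], ["a", "b"]]
disjoint = make_disjoint(candidates)  # [["a", "b", "c"], ["d"]]
```

A community trimmed this way may no longer satisfy the DBG thresholds, so a real system would probably exclude claimed pages before extracting the next DBG rather than afterwards.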

Page 23: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Cluster Center Refinement

Find meaningful page sets
1. Does the DBG really have a topic?
2. Is there any page in the community that is not related to the topic?

Feature: terms of extracted pages
DBSCAN [Martin Ester et al., A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996]

Page 24: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

DBSCAN
  radius: r
  minP: m

[Figure: core point with radius r, density-reachable points, and the resulting community (center of cluster)]
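For reference, a compact DBSCAN sketch using the two parameters named on the slide (radius r, minimum number of points m). It uses Euclidean distance and a brute-force neighbor search, both of which are assumptions; the talk clusters term vectors, while the toy data below is 2-D.

```python
import math


def dbscan(points, r, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # Brute-force search; the point itself counts as its own neighbor.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= r]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1        # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels


# Hypothetical data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, r=1.5, min_pts=3)  # [0, 0, 0, 1, 1, 1, -1]
```

Points labeled -1 would correspond to community members that do not belong to any dense topic group.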

Page 25: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Partitioning Remaining Pages

Feature: term’s appearance

1. Calculate distance between a remaining page and each center

2. If the distance to the nearest center is shorter than the threshold, attach the page to that cluster

3. Otherwise, attach the page to the “Unclassified” cluster
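The three steps can be sketched as follows. The sketch assumes each center is represented by a single feature vector and uses Euclidean distance; the slides do not state either choice, and `assign_pages` with its argument names is hypothetical.

```python
import math


def assign_pages(pages, centers, threshold):
    """Map each page id to its nearest center id, or to "unclassified".

    pages and centers map ids to equal-length feature vectors.
    """
    assignment = {}
    for page_id, vector in pages.items():
        best_id, best_dist = None, math.inf
        # Step 1: distance between the page and each center.
        for center_id, center_vec in centers.items():
            d = math.dist(vector, center_vec)
            if d < best_dist:
                best_id, best_dist = center_id, d
        # Steps 2 and 3: accept the nearest center only under the threshold.
        assignment[page_id] = best_id if best_dist < threshold else "unclassified"
    return assignment


# Hypothetical 2-D feature vectors.
centers = {"c1": (0.0, 0.0), "c2": (10.0, 10.0)}
pages = {"p1": (1.0, 0.0), "p2": (9.0, 10.0), "p3": (5.0, 50.0)}
assignment = assign_pages(pages, centers, threshold=3.0)
```

Here p1 goes to c1, p2 goes to c2, and p3 is left unclassified because its nearest center is farther away than the threshold.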

Page 26: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Agenda

Introduction
Related Work
Proposal
Experimental Result
Conclusion

Page 27: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Target

Seed: 3,000 pages categorized under Computer/Software by ODP

70,000 pages reachable from the seed pages within 2 hops

Page 28: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Preprocess

Word ID
  Use the words of a dictionary as base vectors
  Assign the same ID to words sharing the same derivation
  Add terms which appear in many documents (IDF <= 8)
  Total: 29347

Link Extraction
  Elimination of links to pages which are not collected
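The two filtering rules can be sketched as below. The IDF formula log2(N / df) is an assumption, since the slides give only the cutoff and not the log base or any smoothing; `frequent_terms`, `keep_internal_links`, and the toy data are hypothetical.

```python
import math


def idf(term, documents):
    """Assumed IDF: log2(N / df); infinite for terms that never occur."""
    df = sum(1 for doc in documents if term in doc)
    return math.log2(len(documents) / df) if df else math.inf


def frequent_terms(documents, max_idf=8.0):
    """Return the terms whose IDF is at most max_idf (terms in many documents)."""
    vocabulary = set().union(*documents)
    return {term for term in vocabulary if idf(term, documents) <= max_idf}


def keep_internal_links(links, collected):
    """Drop links whose source or target page was not collected."""
    return {page: targets & collected
            for page, targets in links.items() if page in collected}


# Hypothetical documents, each given as a set of terms.
documents = [{"web", "page"}, {"web", "graph"}, {"web"}, {"page"}]
kept_terms = frequent_terms(documents, max_idf=1.0)  # web (IDF ~0.42), page (IDF 1.0)

# Hypothetical crawl in which page z was not collected.
links = {"a": {"b", "z"}, "b": {"a"}, "z": {"a"}}
internal = keep_internal_links(links, collected={"a", "b"})
```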

Page 29: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

# Communities

[Figure: number of communities (y-axis, 0 to 1200) versus threshold (x-axis, 0 to 10), one series for each (pt, qt) in (2,2), (2,3), (2,4), (3,2), (3,3), (3,4), (4,2), (4,3), (4,4)]

Page 30: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

# Community Members (pt=3, qt=3)

[Figure: number of community members (y-axis, 0 to 45000) versus Nth community (x-axis, 0 to 14), one series for each (pt, qt, threshold) from (3,3,2) to (3,3,9)]

Page 31: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

# Community Members

[Figure: number of community members (y-axis, 0 to 60000) versus threshold (x-axis, 0 to 10), one series for each (pt, qt) in (2,2), (2,3), (2,4), (3,2), (3,3), (3,4), (4,2), (4,3), (4,4)]

Page 32: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Variance of Terms

Page 33: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

After DBSCAN

Page 34: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Conclusion

Page 35: Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

Future Work

Applying to a larger data set
  This may need parallel processing

Analyzing with