
Web Document Clustering

By Sang-Cheol Seok

1. Introduction: Web document clustering?

Why?

Two results for the same query ‘amazon’

Google: currently the most powerful search engine

MetaCrawler: a search engine that clusters the retrieved web documents.

2. Approaches

(1) Using contents of documents
(2) Using user’s usage logs
(3) Using current search engines
(4) Using hyperlinks
(5) Other classical methods

(1) Using Contents of Documents

Creating clusters based on snippets returned by web search engines.

Clusters based on snippets are almost as good as clusters created using the full text of Web documents.

Suffix Tree Clustering (STC): an incremental, O(n) time algorithm

Three logical steps: (1) document “cleaning”, (2) identifying base clusters using a suffix tree, and (3) combining these base clusters into clusters
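The three steps can be sketched roughly as follows. This is a simplified illustration, not the STC implementation: a dictionary of word n-grams stands in for the suffix tree, so it finds shared phrases but is neither linear-time nor incremental. The stop-word list, `max_len`, and all function names are assumptions made for the example; the merge rule (link two base clusters when more than half of each one's documents are shared) follows the commonly described STC criterion.

```python
import re
from itertools import combinations

STOPWORDS = {"a", "an", "the", "of", "and", "in", "to", "for"}  # illustrative only

def clean(snippet):
    """Step 1: lowercase the snippet and split it into words."""
    return re.findall(r"[a-z0-9]+", snippet.lower())

def base_clusters(docs, max_len=3):
    """Step 2: map each phrase to the set of documents that contain it.

    A plain dictionary of n-grams replaces the suffix tree here.
    Only phrases shared by at least two documents are kept.
    """
    phrase_docs = {}
    for doc_id, words in enumerate(docs):
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase = tuple(words[i:i + n])
                if all(w in STOPWORDS for w in phrase):
                    continue
                phrase_docs.setdefault(phrase, set()).add(doc_id)
    return {p: d for p, d in phrase_docs.items() if len(d) >= 2}

def merge(bases, threshold=0.5):
    """Step 3: link base clusters with high mutual overlap (union-find)."""
    phrases = list(bases)
    parent = list(range(len(phrases)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(phrases)), 2):
        a, b = bases[phrases[i]], bases[phrases[j]]
        common = len(a & b)
        if common / len(a) > threshold and common / len(b) > threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, p in enumerate(phrases):
        clusters.setdefault(find(i), set()).update(bases[p])
    return sorted(sorted(c) for c in clusters.values())

def stc_sketch(snippets):
    """Run the three steps and return clusters as lists of snippet indices."""
    return merge(base_clusters([clean(s) for s in snippets]))
```

Because a document can share phrases with several groups, the resulting clusters may overlap, which is one of the properties attributed to STC.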

(2) Using user’s usage logs

Advantage: relevancy information is objectively reflected by the usage logs

An experimental result on www.nasa.gov/

Cluster 1

/shuttle/missions/41-c/news
/shuttle/missions/61-b…

Cluster 2

/history/apollo/sa-2/news/
/history/apollo/sa-2/images…

Cluster 3

/software/winvn/userguide/3_3_2.htm
/software/winvn/userguide/3_3_4.htm…

… ….
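One way such clusters could be derived from a log is sketched below. The slides do not say which algorithm produced the result above, so everything here is an assumption: the log is taken to be a list of hypothetical `(visitor_id, path)` pairs, two pages are linked when the Jaccard similarity of their visitor sets reaches a threshold, and clusters are the connected components of those links.

```python
from collections import defaultdict
from itertools import combinations

def cluster_by_usage(log, threshold=0.5):
    """Group pages that tend to be requested by the same visitors.

    log: iterable of (visitor_id, path) pairs (a hypothetical log format).
    Returns a list of clusters, each a sorted list of paths.
    """
    visitors = defaultdict(set)
    for visitor, path in log:
        visitors[path].add(visitor)
    paths = sorted(visitors)

    # Link two pages when their visitor sets are similar enough (Jaccard).
    neighbours = {p: set() for p in paths}
    for p, q in combinations(paths, 2):
        shared = len(visitors[p] & visitors[q])
        total = len(visitors[p] | visitors[q])
        if shared / total >= threshold:
            neighbours[p].add(q)
            neighbours[q].add(p)

    # Each connected component of the link graph becomes one cluster.
    seen, clusters = set(), []
    for start in paths:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            current = stack.pop()
            component.append(current)
            for nb in neighbours[current] - seen:
                seen.add(nb)
                stack.append(nb)
        clusters.append(sorted(component))
    return clusters
```

The point of the sketch is that similarity comes from what visitors actually requested rather than from page text, which is the advantage the slide attributes to usage logs.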

(3) Using current web search engines: MetaCrawler

Step 1: When MetaCrawler receives a query, it posts the query to multiple search engines in parallel.

Step 2: It performs sophisticated pruning on the responses returned (75% of the returned responses are pruned as irrelevant, outdated, or unavailable).
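The two steps can be illustrated with a small sketch. It is not MetaCrawler's code: each search engine is modelled as a hypothetical callable that takes a query and returns a list of URLs, and pruning is reduced to dropping duplicates and URLs rejected by a caller-supplied `is_available` check (also hypothetical). Relevance and freshness filtering are left out.

```python
from concurrent.futures import ThreadPoolExecutor

def metasearch(query, engines, is_available):
    """Query several engines in parallel, then prune the merged results.

    engines: non-empty list of callables, each mapping a query to a list of URLs.
    is_available: callable returning False for URLs that should be pruned.
    """
    # Step 1: post the query to every engine at the same time.
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))

    # Step 2: prune duplicates and unavailable pages, keeping first-seen order.
    merged, seen = [], set()
    for results in result_lists:
        for url in results:
            if url in seen or not is_available(url):
                continue
            seen.add(url)
            merged.append(url)
    return merged
```

In a real system each callable would wrap a network request, which is why the queries are dispatched through a thread pool instead of one after another.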

MetaCrawler was developed at the University of Washington.

(4) Using hyperlinks

Consider web documents as vertices and the hyperlinks as directed edges in a directed graph.

A similarity-based clustering method was successfully used in image segmentation.

Kleinberg’s HITS algorithm is based purely on hyperlink information.

It finds authority and hub documents for a user query.

It covers only the most popular topics and leaves out the less popular ones.
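The hub and authority scores can be computed by a simple iteration, sketched below under a few assumptions: the graph is given as a list of hypothetical `(source, target)` link pairs, the number of iterations is fixed instead of testing for convergence, and scores are L2-normalised after each round. A page's authority score is the sum of the hub scores of pages linking to it; its hub score is the sum of the authority scores of pages it links to.

```python
import math

def hits(edges, iterations=50):
    """Return (hub, authority) score dictionaries for a directed link graph.

    edges: list of (source, target) pairs, one per hyperlink.
    """
    nodes = sorted({node for edge in edges for node in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages pointing here.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]

        # Hub update: sum the authority scores of all pages pointed to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        hub = new_hub

        # Normalise so the scores stay bounded across iterations.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm

    return hub, auth
```

Since the scores concentrate on the most densely linked part of the graph, pages on less popular topics end up with scores near zero, which matches the limitation noted above.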

(4) Using Hyperlinks: continued

Cluster web documents based on both textual and hyperlink information.

The hyperlink structure is used as the dominant factor in the similarity metric.

(5) Other classical clustering methods

K-means method
HAC (hierarchical agglomerative clustering)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Single-link, group-average, complete-link, and single-pass methods, as well as Buckshot and Fractionation, have also been used.

3. Key requirements and future challenges

(1) Key requirements for Web document clustering methods

Relevance
Browsable summaries
Overlap
Speed
Incrementality (for some methods)

3. Key requirements and future challenges: continued

(2) Concerns about current methods

Each method has pros and cons.

Using hyperlinks: the best accuracy, with some room still to improve, but it does not produce overlapping clusters.

STC: best for browsing and for incrementality.

MetaCrawler: best at pruning.

3. Key requirements and future challenges: continued

(3) Future challenges

We cannot take advantage of all the pros of each method, because some pros work against others, so we have to make trade-offs. Moreover, we need to find improvements.