Web Document Clustering By Sang-Cheol Seok

Post on 29-Dec-2015


1. Introduction: Web document clustering?

Why?

Two results for the same query ‘amazon’:

Google: currently the most powerful search engine

MetaCrawler: a search engine that clusters the retrieved web documents

2. Approaches

Using contents of documents

Using users’ usage logs

Using current search engines

Using hyperlinks

Other classical methods

(1) Using Contents of Documents

Create clusters based on the snippets returned by web search engines.

Clusters based on snippets are almost as good as clusters created using the full text of Web documents.

Suffix Tree Clustering (STC): an incremental, O(n)-time algorithm

Three logical steps: (1) document “cleaning”, (2) identifying base clusters using a suffix tree, and (3) combining these base clusters into clusters
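
The three steps can be illustrated with a small sketch. This is not the STC implementation itself: for brevity it replaces the suffix tree with a plain dictionary of word n-grams (so it is not O(n)), and the names `clean`, `base_clusters`, `merge_clusters`, the parameters `max_len` and `min_docs`, and the 0.5 overlap threshold are assumptions made for illustration.

```python
import re
from collections import defaultdict
from itertools import combinations

def clean(snippet):
    """Step 1: lowercase the snippet and keep only word tokens."""
    return re.findall(r"[a-z0-9]+", snippet.lower())

def base_clusters(snippets, max_len=3, min_docs=2):
    """Step 2 (stand-in for the suffix tree): map each phrase of up to
    max_len words to the set of documents that contain it."""
    phrase_docs = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        words = clean(snippet)
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase_docs[tuple(words[i:i + n])].add(doc_id)
    # A base cluster needs a phrase shared by at least min_docs documents.
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}

def merge_clusters(base, threshold=0.5):
    """Step 3: merge base clusters whose document sets overlap by more than
    the (assumed) threshold in both directions, via union-find."""
    phrases = list(base)
    parent = {p: p for p in phrases}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(phrases, 2):
        common = len(base[a] & base[b])
        if common > threshold * len(base[a]) and common > threshold * len(base[b]):
            parent[find(a)] = find(b)

    merged = defaultdict(set)
    for p in phrases:
        merged[find(p)] |= base[p]
    return sorted(sorted(docs) for docs in merged.values())

# Hypothetical snippets for the query 'amazon'.
snippets = [
    "Amazon rain forest wildlife",
    "Amazon rain forest travel guide",
    "Amazon online book store",
    "Amazon online book store deals",
]
clusters = merge_clusters(base_clusters(snippets))
```

With these four made-up snippets the sketch groups the two rain-forest snippets and the two book-store snippets, and also keeps an overlapping cluster for the word shared by all four, which shows why a document may belong to more than one cluster.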

(2) Using users’ usage logs

Advantage: relevancy information is objectively reflected by the usage logs

An experimental result on www.nasa.gov/

Cluster 1

/shuttle/missions/41-c/news
/shuttle/missions/61-b…

Cluster 2

/history/apollo/sa-2/news/
/history/apollo/sa-2/images…

Cluster 3

/software/winvn/userguide/3_3_2.htm
/software/winvn/userguide/3_3_4.htm…

…
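
The slides do not say how the clusters were computed from the logs. As one possible reading, the sketch below groups pages that tend to be visited in the same sessions. The function name `cluster_by_usage`, the Jaccard measure, the 0.5 threshold, and the session data are all assumptions for illustration, not the method behind the result above.

```python
from collections import defaultdict
from itertools import combinations

def cluster_by_usage(sessions, threshold=0.5):
    """Group pages whose sets of visiting sessions are similar (Jaccard)."""
    visitors = defaultdict(set)
    for session_id, pages in enumerate(sessions):
        for page in pages:
            visitors[page].add(session_id)

    pages = sorted(visitors)
    cluster_of = {p: {p} for p in pages}
    for a, b in combinations(pages, 2):
        shared = len(visitors[a] & visitors[b])
        jaccard = shared / len(visitors[a] | visitors[b])
        if jaccard >= threshold and cluster_of[a] is not cluster_of[b]:
            merged = cluster_of[a] | cluster_of[b]
            for p in merged:
                cluster_of[p] = merged

    unique = {frozenset(c) for c in cluster_of.values()}
    return sorted(sorted(c) for c in unique)

# Hypothetical sessions: each inner list is the pages one visitor requested.
sessions = [
    ["/docs/a.htm", "/docs/b.htm"],
    ["/docs/a.htm", "/docs/b.htm"],
    ["/news/c.htm", "/news/d.htm"],
    ["/news/c.htm", "/news/d.htm", "/docs/a.htm"],
]
usage_clusters = cluster_by_usage(sessions)
```

The single session that touches both groups is not enough to merge them, because the similarity between those pages stays below the threshold.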

(3) Using current web search engines: MetaCrawler

Step 1: When MetaCrawler receives a query, it posts the query to multiple search engines in parallel.

Step 2: It performs sophisticated pruning on the responses returned (it prunes 75% of the returned responses as irrelevant, outdated, or unavailable).

MetaCrawler was developed at the U. of Washington.
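
The two steps can be sketched as a parallel fan-out followed by a filter. This is only an illustration of the idea, not MetaCrawler's code: `engine_a` and `engine_b` are stand-in functions that return canned results instead of calling real search engines, and the pruning rule (drop unavailable pages and duplicates) is a simplified assumption.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in "search engines" (hypothetical); a real system would issue HTTP requests.
def engine_a(query):
    return [{"url": "http://example.com/1", "available": True},
            {"url": "http://example.com/2", "available": False}]

def engine_b(query):
    return [{"url": "http://example.com/1", "available": True},
            {"url": "http://example.com/3", "available": True}]

def metasearch(query, engines):
    # Step 1: post the query to every engine in parallel.
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))

    # Step 2: prune unavailable and duplicate responses.
    seen, merged = set(), []
    for results in result_lists:
        for hit in results:
            if hit["available"] and hit["url"] not in seen:
                seen.add(hit["url"])
                merged.append(hit["url"])
    return merged

hits = metasearch("amazon", [engine_a, engine_b])
```

`pool.map` returns results in the order the engines were given, so the merged list is deterministic even though the calls run concurrently.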

(4) Using hyperlinks

Consider web documents as vertices and the hyperlinks as directed edges in a directed graph.

A similarity-based clustering method was successfully used in image segmentation.

Kleinberg’s HITS algorithm:

Based purely on hyperlink information.

Finds authority and hub documents for a user query.

Covers only the most popular topics and leaves out the less popular ones.
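
A minimal sketch of the hub and authority iteration in HITS follows. The graph, the page names, and the fixed iteration count are assumptions for illustration; a real run would first build the graph from the pages retrieved for a query.

```python
def hits(links, iterations=50):
    """links maps each page to the list of pages it points to."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {n: sum(hub[s] for s in nodes if n in links.get(s, ())) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # A page's hub score is the sum of the authority scores of pages it links to.
        hub = {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

# Hypothetical link graph: h1 and h2 are link pages, a1 and a2 are content pages.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(links)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

Here `a1` ends up as the top authority because both link pages point to it, and `h1` as the top hub because it points to both content pages. The same mutual reinforcement is what pulls the result toward the most heavily linked topic.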

(4) Using hyperlinks: continued

Cluster web documents based on both textual and hyperlink information.

The hyperlink structure is used as the dominant factor in the similarity metric.
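
One way to picture such a metric is a weighted sum in which the link term carries more than half of the weight. The sketch below is an assumption for illustration only: the slides do not give the actual formula, and the weight of 0.7, the Jaccard link similarity, and the cosine text similarity are stand-ins.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between word-count vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def jaccard(links_a, links_b):
    """Share of hyperlinks that two documents have in common."""
    union = links_a | links_b
    return len(links_a & links_b) / len(union) if union else 0.0

def combined_similarity(doc_a, doc_b, link_weight=0.7):
    # link_weight > 0.5 makes the hyperlink structure the dominant factor.
    return (link_weight * jaccard(doc_a["links"], doc_b["links"])
            + (1 - link_weight) * cosine(doc_a["text"], doc_b["text"]))

# Hypothetical documents.
d1 = {"text": "amazon river basin", "links": {"x", "y"}}
d2 = {"text": "amazon river tours", "links": {"x", "y"}}
d3 = {"text": "amazon river basin", "links": {"z"}}
same_links = combined_similarity(d1, d2)
same_text = combined_similarity(d1, d3)
```

With these weights, two documents that share all their links but only part of their text score higher than two documents with identical text and no shared links.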

(5) Other classical clustering methods

K-means method

HAC (hierarchical agglomerative clustering)

DBSCAN (density-based spatial clustering of applications with noise)

Single-link, group-average, and complete-link methods, single-pass methods, and Buckshot and Fractionation have also been used.
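
As an example of the first method in this list, here is a minimal K-means sketch on small document vectors. The vectors are made up, and seeding the centroids with the first k points is a simplification chosen to keep the run deterministic; practical implementations usually use random or smarter seeding.

```python
def kmeans(points, k, iterations=20):
    """Plain K-means with squared Euclidean distance."""
    centroids = [list(p) for p in points[:k]]  # assumed seeding: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical 2-term document vectors (for example, weights of two index terms).
points = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.1, 0.9)]
labels, centroids = kmeans(points, k=2)
```

Each document receives exactly one label, so the resulting clusters never overlap.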

3. Key requirements and future challenges

(1) Key requirements for Web document clustering methods

Relevance

Browsable summaries

Overlap

Speed

Incrementality (for some methods)

3. Key requirements and future challenges: continued

(2) Concerns on current methods

Each method has pros and cons.

Using hyperlinks: the best accuracy, with some room still to improve, but its clusters do not overlap.

STC: best for browsing and for incrementality.

MetaCrawler: best at pruning.

3. Key requirements and future challenges: continued

Future challenges

We cannot take advantage of all the pros of every method at once.

Some pros work against others, so we have to make trade-offs.

Moreover, we need to find further improvements.