Web Document Clustering

Department of Computer Science and Engineering

Southern Methodist University

Wenyi Ni

Why is web document clustering needed?

There are 3.3 billion web pages on the internet.

Every time you post a query, the search engine returns thousands of records. Did you efficiently find what you wanted?

Web document clustering is a good choice. An example: www.metacrawler.com

How to represent a web document in a general model?

1. TF-IDF

Each web document consists of words. The more words two documents share, the more likely they are to be similar.

Each web document D can be represented in the following form: D = {d1, d2, …, dn}

where n is the total number of different words in the document collection.

di represents the appearance of the ith word in the document (1 means it exists, 0 means it does not). The order of the di is determined by the weight.

How to calculate the weight?

$$w_{ij} = tf_{ij} \times idf_j, \qquad idf_j = \log\frac{n}{df_j}$$

$tf_{ij}$ is the number of occurrences of the word $t_j$ in the web document $D_i$.

$idf_j$ is the inverse document frequency.

$df_j$ is the number of web documents in the document collection in which the word $t_j$ occurs.

$n$ is the total number of web documents in the document collection.
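The weighting above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: documents are already tokenized by whitespace, the logarithm is the natural log (the slides do not fix a base), and the function name `tfidf_vectors` is made up for this example. The three sample documents are the ones used in the suffix tree example later in these slides.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, weight vectors) for a list of tokenized documents."""
    n = len(docs)  # n: total number of documents
    vocab = sorted({word for doc in docs for word in doc})
    # df_j: number of documents that contain word t_j
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # tf_ij: occurrences of t_j in D_i
        # w_ij = tf_ij * idf_j, with idf_j = log(n / df_j) (natural log assumed)
        vectors.append([tf[w] * math.log(n / df[w]) for w in vocab])
    return vocab, vectors

docs = [
    "cat ate cheese".split(),
    "mouse ate cheese too".split(),
    "cat ate mouse too".split(),
]
vocab, vectors = tfidf_vectors(docs)
```

Note that a word occurring in every document (here "ate") gets weight 0, because log(n / n) = 0.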

How to calculate the similarity between two web documents

Jaccard similarity measure:

$$sim(D_i, D_j) = \frac{\sum_{h=1}^{n} d_{ih} d_{jh}}{\sum_{h=1}^{n} d_{ih}^2 + \sum_{h=1}^{n} d_{jh}^2 - \sum_{h=1}^{n} d_{ih} d_{jh}}$$

Other common measures: Cosine, Dice, Overlap
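As a small illustration, the Jaccard measure for two weight vectors of equal length might be computed as follows. The function name and the choice to return 0.0 for two all-zero vectors are assumptions made for this sketch.

```python
def jaccard_similarity(di, dj):
    """Extended Jaccard similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(di, dj))   # sum of d_ih * d_jh
    norm_i = sum(a * a for a in di)            # sum of d_ih^2
    norm_j = sum(b * b for b in dj)            # sum of d_jh^2
    denom = norm_i + norm_j - dot
    # Assumption: two all-zero vectors are treated as having similarity 0.
    return dot / denom if denom else 0.0
```

For binary vectors this reduces to the number of shared words divided by the number of words that appear in either document.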

Agglomerative Hierarchical Clustering

1. Start by regarding each document as an individual cluster.

2. Merge the most similar pair of documents or document clusters (using the similarity measure).

3. Repeat step 2 until all objects are contained within a single cluster, which becomes the root of the tree.
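The three steps can be sketched naively in Python. The slides do not say how the similarity between two clusters is derived from document similarities, so this sketch assumes the average pairwise Jaccard similarity; the function names are illustrative, and the quadratic search for the best pair is kept deliberately simple.

```python
from itertools import combinations

def jaccard(di, dj):
    dot = sum(a * b for a, b in zip(di, dj))
    denom = sum(a * a for a in di) + sum(b * b for b in dj) - dot
    return dot / denom if denom else 0.0

def cluster_similarity(ca, cb, vectors):
    # Assumed linkage: average pairwise similarity between members.
    total = sum(jaccard(vectors[a], vectors[b]) for a in ca for b in cb)
    return total / (len(ca) * len(cb))

def agglomerative(vectors):
    """Return the merges (left, right) in the order they were performed."""
    # Step 1: every document starts as its own cluster (tuple of doc ids).
    clusters = [(i,) for i in range(len(vectors))]
    merges = []
    # Step 3: repeat until a single root cluster remains.
    while len(clusters) > 1:
        # Step 2: find and merge the most similar pair of clusters.
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: cluster_similarity(pair[0], pair[1], vectors),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return merges

merges = agglomerative([[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1]])
```

The returned merge list describes the tree bottom-up: the last merge is the root.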

K-means clustering

1. Arbitrarily select K documents as seeds; they are the initial centroids of the clusters.

2. Assign all other documents to the closest centroid.

3. Recompute the centroid of each cluster.

4. Repeat steps 2 and 3 until the centroid of each cluster no longer changes.
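A minimal pure-Python sketch of these four steps follows. It assumes "closest" means smallest squared Euclidean distance (the slides do not name the distance), uses a seeded random generator for the arbitrary choice of seeds, and adds an iteration cap as a safety net; all names are illustrative.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(vectors, k, seed=0, max_iter=100):
    """Return (labels, centroids) for a list of equal-length vectors."""
    rng = random.Random(seed)
    # Step 1: arbitrarily pick K documents as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every document to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Step 4: unchanged assignments mean the centroids no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute the centroid of each non-empty cluster.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids

labels, centroids = kmeans([[0, 0], [1, 0], [10, 0], [11, 0]], k=2)
```

Each pass costs time proportional to the number of documents times K times the vector length, which is where the linear behaviour in the number of documents comes from.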

Some other refinement algorithms using the TF-IDF model

Bisecting K-means

Scatter/Gather

Bisecting K-means

1. Select a cluster to split. (There are several ways to select which cluster to split; no significant difference exists among them in terms of clustering accuracy.) We normally choose the largest cluster or the one with the least overall similarity.

2. Use the basic K-means algorithm to subdivide the chosen cluster into two.

3. Repeat step 2 a constant number of times, then perform the split that produces the clusters with the highest overall similarity.

4. Repeat steps 1, 2 and 3 until the desired number of clusters is reached.
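The procedure can be sketched as below. Several choices are assumptions of this sketch rather than part of the slides: the largest cluster is always the one selected, "highest overall similarity" is approximated by the lowest total squared distance to the cluster centroids, and the number of trial splits defaults to 5. All function names are illustrative.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sse(cluster):
    c = centroid(cluster)
    return sum(sq_dist(v, c) for v in cluster)

def two_means(vectors, rng, max_iter=50):
    """Basic K-means with K = 2; returns two lists, or None if no split was found."""
    centers = rng.sample(vectors, 2)
    parts = None
    for _ in range(max_iter):
        new_parts = ([], [])
        for v in vectors:
            idx = 1 if sq_dist(v, centers[0]) > sq_dist(v, centers[1]) else 0
            new_parts[idx].append(v)
        if new_parts == parts or not all(new_parts):
            break
        parts = new_parts
        centers = [centroid(p) for p in parts]
    return parts

def bisecting_kmeans(vectors, k, trials=5, seed=0):
    rng = random.Random(seed)
    clusters = [list(vectors)]
    while len(clusters) < k:
        # Step 1: select the largest cluster (one of the options on the slide).
        target = max(clusters, key=len)
        if len(target) < 2:
            break  # nothing left to split
        clusters.remove(target)
        # Steps 2-3: try several 2-means splits and keep the tightest one.
        candidates = [two_means(target, rng) for _ in range(trials)]
        candidates = [c for c in candidates if c is not None]
        if not candidates:
            clusters.append(target)
            break
        best = min(candidates, key=lambda p: sse(p[0]) + sse(p[1]))
        clusters.extend(best)
    return clusters  # Step 4 is the enclosing while loop

clusters = bisecting_kmeans(
    [[0, 0], [1, 0], [100, 0], [101, 0], [102, 0]], k=3
)
```

Swapping in a different selection rule or quality measure only requires changing the `max` and `min` key functions.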

How to represent a web document in the STC model

What is STC? Suffix Tree Clustering.

The whole web document is treated as a string.

The identification of base clusters is the creation of an inverted index of strings for the web document collection.

A suffix tree example (courtesy of Zamir):

Three strings. Each string is a document.

1. Cat ate cheese

2. Mouse ate cheese too

3. Cat ate mouse too

STC algorithm

1. Document cleaning. Strip word prefixes and suffixes and reduce plurals to singular. Mark sentence boundaries and strip non-word tokens (such as numbers, HTML tags and most punctuation).

2. Identify base clusters. Create an inverted index of strings from the web document collection using a suffix tree. Each node of the suffix tree represents a group of documents and a string that is common to all of them. The label of the node represents the common string. Each node represents a base cluster.
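To make step 2 concrete, the sketch below builds the inverted index of shared phrases for the three example documents. It is not a suffix tree: a plain dictionary keyed by word sequences stands in for it, which is simpler but quadratic in document length, and the function name is made up for this example.

```python
from collections import defaultdict

def base_clusters(docs):
    """Map each phrase shared by two or more documents to the set of document ids."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for start in range(len(words)):                   # every suffix
            for end in range(start + 1, len(words) + 1):  # every prefix of that suffix
                index[tuple(words[start:end])].add(doc_id)
    # A base cluster needs a phrase common to at least two documents.
    return {phrase: ids for phrase, ids in index.items() if len(ids) >= 2}

docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
clusters = base_clusters(docs)
```

This naive index lists "cat" and "cat ate" separately even though they cover the same documents; a compact suffix tree would likely keep them on a single edge and so report one node for both.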

STC algorithm (cont.)

3. Score base clusters. Each base cluster is assigned a score.

The score formula: S(B) = |B| * f(|P|)

|B| is the number of documents in base cluster B.

|P| is the number of words in string P that have a non-zero score.

The function f penalizes single-word strings, is linear for strings that are two to six words long, and becomes constant for longer strings.
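A possible reading of the score formula is sketched below. The slides give only the shape of f, so the concrete values (0.5 for a single word, the word count from two to six, 6 beyond that) are assumptions, as is the small stop-word list used to decide which words carry a non-zero score.

```python
# Hypothetical stop-word list: these words are assumed to have a zero score.
STOPWORDS = {"a", "an", "the", "of", "to", "too"}

def f(length):
    """Assumed shape: penalty for one word, linear for 2-6 words, constant after."""
    if length <= 1:
        return 0.5
    return float(min(length, 6))

def score(doc_ids, phrase):
    """S(B) = |B| * f(|P|) for a base cluster with documents doc_ids and phrase P."""
    effective_length = sum(1 for word in phrase if word not in STOPWORDS)
    return len(doc_ids) * f(effective_length)
```

With these assumed values, the base cluster for "cat ate" in the running example (two documents, two scoring words) gets a score of 4.0.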

STC algorithm (cont.)

4. Combine base clusters. The similarity measure used to combine base clusters is based on the overlap of their document sets.

Given two base clusters Bx and By with sizes |Bx| and |By|, |Bx ∩ By| represents the number of documents common to both.

Define the similarity of Bx and By to be 1 if |Bx ∩ By| / |Bx| > 0.5 and |Bx ∩ By| / |By| > 0.5; otherwise it is 0.

Two base clusters are connected if they have a similarity of 1. Using a single-link clustering algorithm, all connected base clusters are clustered together. All the documents in these base clusters constitute a web document cluster.
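The merge step amounts to finding connected components in the base-cluster graph. A minimal sketch, assuming each base cluster is given as a non-empty set of document ids and using a small union-find structure (the function names are illustrative):

```python
def similar(bx, by):
    """Binary similarity: 1 (True) if the overlap exceeds half of both clusters."""
    common = len(bx & by)
    return common / len(bx) > 0.5 and common / len(by) > 0.5

def merge_base_clusters(base):
    """base: list of non-empty sets of document ids. Returns merged document sets."""
    parent = list(range(len(base)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Connect every pair of base clusters with similarity 1.
    for i in range(len(base)):
        for j in range(i + 1, len(base)):
            if similar(base[i], base[j]):
                parent[find(i)] = find(j)

    # Each connected component becomes one web document cluster.
    merged = {}
    for i, docs in enumerate(base):
        merged.setdefault(find(i), set()).update(docs)
    return list(merged.values())

result = merge_base_clusters([{0, 1}, {0, 1, 2}, {5, 6}])
```

Because the linkage is single-link, two base clusters can end up together through a chain of intermediate clusters even when they share no documents directly.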

Link-Based Model

Idea: web pages that share common links with each other are very likely to be tightly related.

Each web document P is represented as 2 vectors: Pout (N-dimensional) and Pin (M-dimensional).

1) Pout,i represents whether the web document P has the ith item of vector Pout as an out-link.

2) Pin,j represents whether the web document P has the jth item of vector Pin as an in-link.

For example: Pout (link1, link2, …, linkn) represents all the out-links in the web document collection. Pout,2 = 1 means this document has link2 as an out-link.
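The vector representation can be illustrated with a short Python sketch. The link names and the helper `link_vector` are hypothetical and serve only to show how a page's links map onto the shared dimensions.

```python
def link_vector(page_links, all_links):
    """Binary vector: 1 where the page has the link at that position, else 0."""
    return [1 if link in page_links else 0 for link in all_links]

# Hypothetical out-links found across the whole document collection.
all_out_links = ["link1", "link2", "link3"]

# A page whose only out-link is link2.
p_out = link_vector({"link2"}, all_out_links)
```

The in-link vector Pin would be built the same way over the list of all in-links in the collection.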

Link-based algorithm

1. Filter irrelevant web documents. A document is regarded as irrelevant if the sum of its in-links and out-links is less than 2.

2. Use the near-common links of a cluster to guarantee intra-cluster cohesiveness. Every cluster should have at least one 30% near-common link.

3. Assign each web document to a cluster and generate base clusters. The similarity between the document and the corresponding cluster is above the similarity threshold. The document has a link in common with the near-common links of the corresponding cluster.

4. Generate final clusters by merging base clusters.

How to evaluate the quality of the result clusters (cont.)

Entropy

1) For each cluster, the class distribution of the data (we usually use the TREC5 and TREC6 document collections) is calculated first.

2) Using this class distribution, the entropy of each cluster j is calculated:

$$E_j = -\sum_i p_{ij} \log(p_{ij})$$

where $p_{ij}$ is the probability that a member of cluster j belongs to class i.

3) The best quality is reached when all the documents in a cluster fall into the same class, which is known before clustering.
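As a small worked example, the entropy of one cluster can be computed from the known class labels of its documents. The sketch assumes the natural logarithm (the slides do not fix a base) and a non-empty cluster; the function name is illustrative.

```python
import math
from collections import Counter

def cluster_entropy(class_labels):
    """Entropy E_j of one cluster, given the class label of each member."""
    n = len(class_labels)
    # p_ij is the fraction of the cluster's members that belong to class i.
    return sum(
        -(count / n) * math.log(count / n)
        for count in Counter(class_labels).values()
    )

pure = cluster_entropy(["sports", "sports", "sports"])
mixed = cluster_entropy(["sports", "politics"])
```

A cluster whose documents all share one class has entropy 0, the best possible value; an even two-class mix gives log 2.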

How to evaluate the quality of the result clusters

F-measure

1) Calculate the recall and precision of each cluster for each given class.

2) For cluster j and its corresponding class i:

Recall(i, j) = nij / ni

Precision(i, j) = nij / nj

F(i, j) = (2 * Recall(i, j) * Precision(i, j)) / (Precision(i, j) + Recall(i, j))

where nij is the number of members of class i in cluster j, ni is the size of class i, and nj is the size of cluster j.
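The three formulas translate directly into code. In this sketch the arguments are the counts nij, ni and nj, assumed to be non-negative with ni and nj positive, and returning 0.0 when the class and cluster share no documents is a convention chosen here.

```python
def f_measure(n_ij, n_i, n_j):
    """F(i, j) from the overlap n_ij, class size n_i and cluster size n_j."""
    recall = n_ij / n_i
    precision = n_ij / n_j
    if recall + precision == 0:
        return 0.0  # assumed convention when the overlap is empty
    return 2 * recall * precision / (precision + recall)

# 8 of the 10 documents of a class land in a cluster of 16 documents.
value = f_measure(8, 10, 16)
```

Here recall is 0.8 and precision is 0.5, so F = 0.8 / 1.3 = 8/13, roughly 0.62.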

Algorithm evaluation and comparison

TF-IDF based AHC: good cluster quality; time complexity O(n²).

TF-IDF based K-means: linear time complexity O(Kmn); sensitive to outliers.

STC: best suited to incremental clustering; linear time complexity O(n); has a memory problem.

Link based: linear time complexity O(mn); low dimension; good cluster quality.

Future work

Each algorithm has its advantages and disadvantages. We need to refine these algorithms, and sometimes we need to make trade-offs.

There is still room for improvement:

1. Improve the entropy or F-measure value of the result clusters (the evaluation value is under 0.6 for almost all algorithms, while the best is 1).

2. Decrease the response time (we often need to process a large document collection, so we need a fast algorithm).
