Web Document Clustering
Department of Computer Science and Engineering
Southern Methodist University
Wenyi Ni
Why is web document clustering needed?
There are 3.3 billion web pages on the internet. Every time you post a query, the search engine returns thousands of records. Did you efficiently find what you wanted? Web document clustering is a good choice.
An example: www.metacrawler.com
How to represent a web document in a general model?
1. TF-IDF
Each web document consists of words. The more words two documents share, the more likely they are to be similar.
Each web document D can be represented in the following form: D = {d1, d2, …, dn}, where n is the total number of different words in the document collection.
di represents the appearance of the i-th word in the document (1 means it exists, 0 means it does not). The order of the di is determined by the weight.
How to calculate the weight?
tfij is the number of occurrences of the word tj in the web document Di.
idfj is the inverse document frequency of the word tj.
dfj is the number of web documents in the collection in which the word tj occurs.
n is the total number of web documents in the document collection (here n counts documents, not words).
$w_{ij} = tf_{ij} \times idf_j$
$idf_j = \log \frac{n}{df_j}$
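The weighting above can be sketched in a few lines of Python. This is only an illustration: the whitespace tokenizer, the natural logarithm, and the function name `tfidf_vectors` are assumptions, not part of the original slides.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with w_ij = tf_ij * log(n / df_j)."""
    # Assumption: words are separated by whitespace and case is ignored.
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)  # total number of documents
    # df_j: number of documents in which word t_j occurs
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vocab = sorted(df)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # tf_ij: occurrences of t_j in document D_i
        vectors.append([tf[w] * math.log(n / df[w]) for w in vocab])
    return vocab, vectors

docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
vocab, vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, [round(w, 3) for w in vec])
```

Note that a word occurring in every document (here "ate") gets weight 0, because log(n / n) = 0.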
How to calculate the similarity between two web documents
Jaccard similarity measure:
$sim(D_i, D_j) = \frac{\sum_{h=1}^{n} d_{ih} d_{jh}}{\sum_{h=1}^{n} d_{ih}^2 + \sum_{h=1}^{n} d_{jh}^2 - \sum_{h=1}^{n} d_{ih} d_{jh}}$
Other common measures: Cosine, Dice, Overlap
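A minimal sketch of three of these measures on weighted document vectors follows. It assumes the extended (vector) forms of Jaccard and Dice, which reduce to the set versions on 0/1 vectors; the function names are illustrative.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def jaccard(a, b):
    # Shared weight divided by total weight minus shared weight.
    shared = dot(a, b)
    denom = dot(a, a) + dot(b, b) - shared
    return shared / denom if denom else 0.0

def cosine(a, b):
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom else 0.0

def dice(a, b):
    denom = dot(a, a) + dot(b, b)
    return 2 * dot(a, b) / denom if denom else 0.0

d1 = [1, 1, 0]  # binary word-presence vectors
d2 = [1, 0, 1]
print(jaccard(d1, d2), cosine(d1, d2), dice(d1, d2))
```

All three return 1 for identical non-zero vectors and 0 for vectors that share no words.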
Agglomerative hierarchical clustering
1. Start by regarding each document as an individual cluster.
2. Merge the most similar pair of documents or document clusters (using the similarity measure).
3. Repeat step 2 until all objects are contained within a single cluster, which becomes the root of the tree.
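The three steps above can be sketched as follows. The slides do not say how similarity between two clusters is computed, so this sketch assumes the average pairwise cosine similarity between their members; it is a naive implementation meant for small inputs with at least one document.

```python
import math
from itertools import combinations

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0

def agglomerative(vectors, sim=cosine):
    """Return the merge tree as nested tuples of document indices."""
    # Step 1: every document starts as its own cluster: (tree, member indices).
    clusters = [(i, [i]) for i in range(len(vectors))]

    def cluster_sim(c1, c2):
        # Assumption: average similarity over all cross-cluster pairs.
        pairs = [(i, j) for i in c1[1] for j in c2[1]]
        return sum(sim(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

    # Step 3: repeat until a single cluster (the root) remains.
    while len(clusters) > 1:
        # Step 2: find and merge the most similar pair of clusters.
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]))
        merged = ((clusters[a][0], clusters[b][0]),
                  clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0][0]

tree = agglomerative([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(tree)
```

In the example, documents 0 and 1 are merged first because they are the most similar pair, and document 2 joins them at the root.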
K-means clustering
1. Arbitrarily select K documents as seeds; they are the initial centroids of the clusters.
2. Assign all other documents to the closest centroid.
3. Recompute the centroid of each cluster.
4. Repeat steps 2 and 3 until the centroid of each cluster no longer changes.
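A minimal sketch of these four steps is shown below. It assumes squared Euclidean distance for "closest" (the slides do not name a distance), and the `seed` and `max_iter` parameters are added for illustration only.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, seed=0, max_iter=100):
    """Return (labels, centroids) for a list of equal-length vectors."""
    rng = random.Random(seed)
    # Step 1: arbitrarily pick k documents as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every document to its closest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                      for v in vectors]
        # Step 4: unchanged assignments imply unchanged centroids, so stop.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

The result depends on the seeds in general, which is one reason refinements such as bisecting K-means exist.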
Some other refinement algorithms using the TF-IDF model
Bisecting K-means, Scatter/Gather
Bisecting K-means
1. Select a cluster to split. There are several ways to choose it, with no significant difference in clustering accuracy; we normally choose the largest cluster or the one with the least overall similarity.
2. Use the basic K-means algorithm to split the chosen cluster into two.
3. Repeat step 2 a constant number of times, then perform the split that produces the clusters with the highest overall similarity.
4. Repeat steps 1, 2 and 3 until the desired number of clusters is reached.
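The procedure can be sketched as below. Two choices here are assumptions rather than part of the slides: the largest cluster is always the one split, and "highest overall similarity" is approximated by the lowest within-cluster sum of squared distances. The `trials` parameter is the constant number of repetitions from step 3.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def sse(points):
    # Within-cluster sum of squared distances (lower means more cohesive).
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)

def two_means(points, rng, max_iter=50):
    """Basic K-means with K = 2; returns a pair of point lists."""
    centers = [list(p) for p in rng.sample(points, 2)]
    halves = None
    for _ in range(max_iter):
        new_halves = ([], [])
        for p in points:
            nearer = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            new_halves[nearer].append(p)
        converged = new_halves == halves
        halves = new_halves
        if converged or not all(halves):
            break
        centers = [centroid(h) for h in halves]
    return halves

def bisecting_kmeans(points, k, trials=5, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        # Step 1: pick the largest cluster (assumed selection rule).
        clusters.sort(key=len)
        target = clusters.pop()
        if len(target) < 2:
            clusters.append(target)
            break
        # Steps 2-3: try several 2-means splits and keep the most cohesive.
        splits = [two_means(target, rng) for _ in range(trials)]
        splits = [s for s in splits if s[0] and s[1]]
        if not splits:
            clusters.append(target)
            break
        best = min(splits, key=lambda s: sse(s[0]) + sse(s[1]))
        clusters.extend(best)
    return clusters

points = [[0], [1], [10], [11], [20], [21]]
for cluster in bisecting_kmeans(points, k=3):
    print(cluster)
```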
How to represent a web document in the STC model
What is STC?
Suffix Tree Clustering.
The whole web document is treated as a string.
The identification of base clusters is the creation of an inverted index of strings for the web document collection.
A suffix tree example (courtesy of Zamir):
Three strings. Each string is a document.
1. Cat ate cheese.
2. Mouse ate cheese too.
3. Cat ate mouse too.
STC algorithm
1. Document cleaning. Strip word prefixes and suffixes and reduce plurals to singular. Sentence boundaries are marked, and non-word tokens (such as numbers, HTML tags and most punctuation) are stripped.
2. Identify base clusters. Create an inverted index of strings from the web document collection using a suffix tree. Each node of the suffix tree represents a group of documents and a string that is common to all of them; the label of the node is the common string. Each node represents a base cluster.
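To make the base-cluster idea concrete, the sketch below builds the inverted index of strings for the three example documents. It is a simplification: a plain dictionary of word-level phrases stands in for the suffix tree (so it is not linear time), document cleaning is reduced to lowercasing, and the name `base_clusters` and the `min_docs` threshold are assumptions.

```python
from collections import defaultdict

def base_clusters(docs, min_docs=2):
    """Map each shared phrase (tuple of words) to the set of document ids
    that contain it. A dict replaces the suffix tree in this sketch."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        # Enumerate every contiguous word sequence of the document.
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                index[tuple(words[start:end])].add(doc_id)
    # Keep only phrases common to at least min_docs documents.
    return {p: ids for p, ids in index.items() if len(ids) >= min_docs}

docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
for phrase, ids in sorted(base_clusters(docs).items()):
    print(" ".join(phrase), sorted(ids))
```

Unlike a real suffix tree, this index also lists a phrase such as "cat" separately even though, in these documents, it only ever occurs as the start of "cat ate"; a suffix tree would hold both in a single node.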
STC algorithm (cont)
3. Score base clusters. Each base cluster is assigned a score.
The score formula: S(B) = |B| * f(|P|)
|B| is the number of documents in base cluster B.
|P| is the number of words in string P that have a non-zero score.
The function f penalizes single-word strings, is linear for strings that are two to six words long, and becomes constant for longer strings.
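A small sketch of the scoring step follows. The slides give only the shape of f, so the concrete values (0.5 for a single word, the word count for two to six words, 6 beyond that) are assumptions, as is treating words from a hypothetical stop list as the zero-score words.

```python
# Hypothetical stop list: these words are assumed to have a zero score.
STOP_WORDS = {"the", "a", "of", "to", "and"}

def effective_length(phrase):
    """|P|: number of words in the phrase that have a non-zero score."""
    return sum(1 for word in phrase if word not in STOP_WORDS)

def f(length):
    if length <= 0:
        return 0.0
    if length == 1:
        return 0.5  # assumed penalty for single-word phrases
    return float(min(length, 6))  # linear up to six words, then constant

def score(doc_ids, phrase):
    """S(B) = |B| * f(|P|) for a base cluster with documents doc_ids."""
    return len(doc_ids) * f(effective_length(phrase))

print(score({0, 2}, ("cat", "ate")))   # two documents, two-word phrase
print(score({0, 1, 2}, ("ate",)))      # three documents, single word
```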
STC algorithm (cont)
4. Combine base clusters. The similarity measure used to combine base clusters is based on the overlap of their document sets.
Given two base clusters Bx and By with sizes |Bx| and |By|, |Bx ∩ By| is the number of documents common to both.
The similarity of Bx and By is defined to be 1 if |Bx ∩ By| / |Bx| > 0.5 and |Bx ∩ By| / |By| > 0.5; otherwise it is 0.
Two base clusters are connected if they have a similarity of 1. Using a single-link clustering algorithm, all connected base clusters are clustered together. All the documents in these base clusters constitute a web document cluster.
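Single-link clustering over a 0/1 similarity amounts to finding connected components of the base-cluster graph. The sketch below does this with a small union-find; the function names are illustrative and base clusters are assumed to be non-empty sets of document ids.

```python
def similar(bx, by):
    """Similarity is 1 (True) when the overlap exceeds half of each cluster."""
    common = len(bx & by)
    return common / len(bx) > 0.5 and common / len(by) > 0.5

def merge_base_clusters(base):
    """Union the document sets of all connected base clusters."""
    parent = list(range(len(base)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Connect every pair of base clusters whose similarity is 1.
    for i in range(len(base)):
        for j in range(i + 1, len(base)):
            if similar(base[i], base[j]):
                parent[find(i)] = find(j)

    # Each connected component becomes one web document cluster.
    merged = {}
    for i, doc_ids in enumerate(base):
        merged.setdefault(find(i), set()).update(doc_ids)
    return list(merged.values())

base = [{0, 1}, {0, 1, 2}, {5, 6}, {5, 6, 7}]
print(merge_base_clusters(base))
```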
Link-based model
Idea: web pages that share common links with each other are very likely to be tightly related.
Each web document P is represented as two vectors: Pout (N-dimensional) and Pin (M-dimensional).
1) Pout,i indicates whether the web document P has the i-th item of vector Pout as an out-link.
2) Pin,j indicates whether the web document P has the j-th item of vector Pin as an in-link.
For example, Pout (link1, link2, …, linkn) represents all the out-links in the web document collection. Pout,2 = 1 means that this document has link2 as an out-link.
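The two vectors can be built as in the sketch below. The input format (one dictionary of out-links and one of in-links per page) and all names are assumptions made for illustration.

```python
def binary_vector(links, dimensions):
    """1 where the page has the link, 0 otherwise, in dimension order."""
    return [1 if d in links else 0 for d in dimensions]

def link_vectors(out_links, in_links):
    """Return {page: (Pout, Pin)} as binary vectors.

    out_links / in_links map a page to the set of URLs it links to /
    is linked from (hypothetical input format).
    """
    # The N out-link and M in-link dimensions cover the whole collection.
    out_dims = sorted(set().union(*out_links.values()))
    in_dims = sorted(set().union(*in_links.values()))
    return {
        page: (binary_vector(out_links[page], out_dims),
               binary_vector(in_links.get(page, set()), in_dims))
        for page in out_links
    }

out_links = {"p1": {"a", "b"}, "p2": {"b"}}
in_links = {"p1": {"x"}, "p2": {"x", "y"}}
print(link_vectors(out_links, in_links))
```

Once pages are in this form, the same vector similarity measures used for word vectors can be applied to the link vectors.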
Link-based algorithm
1. Filter irrelevant web documents. A document is regarded as irrelevant if the sum of its in-links and out-links is less than 2.
2. Use the near-common links of a cluster to guarantee intra-cluster cohesiveness. Every cluster should have at least one 30% near-common link.
3. Assign each web document to a cluster, generating base clusters.
- Similarity between the document and the corresponding cluster is above the similarity threshold.
- The document has a link in common with the near-common links of the corresponding cluster.
4. Generate the final clusters by merging base clusters.
How to evaluate the quality of the result clusters
Entropy
1) For each cluster, the class distribution of the data is calculated first (we usually use the TREC5 and TREC6 document collections).
2) Using this class distribution, the entropy of each cluster j is calculated:
$E_j = -\sum_i p_{ij} \log(p_{ij})$
where p_ij is the probability that a member of cluster j belongs to class i.
3) The best quality is reached when all the documents in a cluster fall into the same class, which is known before clustering.
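The entropy of a single cluster can be computed as in this sketch, which takes the known class label of every document in the cluster. The natural logarithm and the function name are assumptions.

```python
import math
from collections import Counter

def cluster_entropy(class_labels):
    """E_j = -sum_i p_ij * log(p_ij) over the classes present in cluster j."""
    n = len(class_labels)
    counts = Counter(class_labels)  # documents of each class in the cluster
    return -sum((c / n) * math.log(c / n) for c in counts.values())

print(cluster_entropy(["sports", "sports", "sports"]))  # pure cluster
print(cluster_entropy(["sports", "politics"]))          # evenly mixed cluster
```

A pure cluster has entropy 0, so lower values mean better clusters.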
How to evaluate the quality of the result clusters
F-measure
1) Calculate the recall and precision of each cluster for each given class.
2) For cluster j and its corresponding class i:
Recall(i, j) = nij / ni
Precision(i, j) = nij / nj
F(i, j) = (2 * Recall(i, j) * Precision(i, j)) / (Precision(i, j) + Recall(i, j))
where nij is the number of members of class i in cluster j, ni is the number of members of class i, and nj is the number of members of cluster j.
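As a worked example of these formulas, the sketch below computes F(i, j) from the three counts. The argument names are illustrative, and returning 0 when the class and cluster share no documents is an added convention.

```python
def f_measure(n_ij, n_i, n_j):
    """F(i, j) from the number of class-i documents in cluster j (n_ij),
    the size of class i (n_i) and the size of cluster j (n_j)."""
    recall = n_ij / n_i
    precision = n_ij / n_j
    if recall + precision == 0:
        return 0.0  # convention: no overlap between class and cluster
    return 2 * recall * precision / (precision + recall)

# Class i has 10 documents; cluster j has 16 documents, 8 of them from class i.
# Recall = 8/10 = 0.8, Precision = 8/16 = 0.5.
print(round(f_measure(8, 10, 16), 4))
```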
Algorithm evaluation and comparison
TF-IDF based AHC: good cluster quality; time complexity O(n²).
TF-IDF based K-means: linear time complexity O(Kmn); sensitive to outliers.
STC: best for incremental clustering; linear time complexity O(n); has a memory problem.
Link based: linear time complexity O(mn); low dimension; good cluster quality.
Future work
Each algorithm has its advantages and disadvantages. We need to refine these algorithms, and sometimes we need to make trade-offs.
There is still room for improvement:
1. Improve the entropy or F-measure value of the result clusters (the evaluation value is under 0.6 for almost all algorithms, while the best is 1).
2. Decrease the response time (we often need to process a large document collection, so we need a fast algorithm).
End