Clustering Techniques and IR
CSC 575
Intelligent Information Retrieval
Clustering Techniques and IR

Today:
- Clustering Problem and Applications
- Clustering Methodologies and Techniques
- Applications of Clustering in IR
What is Clustering?

Cluster: a collection of data objects
- that are "similar" to one another and thus can be treated collectively as one group
- but, as a collection, sufficiently different from other groups

Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters. It helps users understand the natural grouping or structure in a data set.
Clustering in IR

Objective of Clustering:
- assign items to automatically created groups based on similarity or association between items and groups
- also called "automatic classification"
- "The art of finding groups in data." -- Kaufman and Rousseeuw

Clustering in IR:
- automatic thesaurus generation by clustering related terms
- automatic concept indexing (concepts are clusters of terms)
- automatic categorization of documents
- information presentation and browsing
- query generation and search refinement
Applications of Clustering

Clustering has wide applications in:
- Pattern Recognition
- Spatial Data Analysis
  - create thematic maps in GIS by clustering feature spaces
  - detect spatial clusters and explain them in spatial data mining
- Image Processing
- Market Research
- Information Retrieval
  - document or term categorization
  - information visualization and IR interfaces
- Web Mining
  - cluster Web usage data to discover groups of similar access patterns
  - Web personalization
Clustering Methodologies

Two general methodologies:
- Partitioning Based Algorithms
- Hierarchical Algorithms

Partitioning Based:
- divide a set of N items into K clusters (top-down)

Hierarchical:
- agglomerative: pairs of items or clusters are successively linked to produce larger clusters
- divisive: start with the whole set as one cluster and successively divide sets into smaller partitions
Clustering Algorithms

Similarity Measures and Features:
- most clustering algorithms are based on some measure of similarity (or distance) between items
  - in IR these measures could be based on co-occurrence of terms, citations, or hyperlinks in documents
  - terms can be clustered based on the documents in which they co-occur, or based on lexical or semantic similarity measures
- clustering requires the selection of features over which similarity among items is computed
  - in document clustering, features are generally some or all of the terms in the collection
  - often a small number of features must be selected, because many clustering algorithms break down in a "high-dimensional" space
- similarity measures among the items can be represented as a symmetric similarity matrix, in which each entry is the similarity value between two items
Distance or Similarity Measures

Measuring Distance:
- In order to group similar items, we need a way to measure the distance between objects (e.g., records)
- Note: distance is the inverse of similarity
- Often based on the representation of objects as "feature vectors"

An Employee DB:

  ID  Gender  Age  Salary
   1  F       27    19,000
   2  M       51    64,000
   3  M       52   100,000
   4  F       33    55,000
   5  M       45    45,000

Term Frequencies for Documents:

        T1  T2  T3  T4  T5  T6
  Doc1   0   4   0   0   0   2
  Doc2   3   1   4   3   1   2
  Doc3   3   0   0   0   3   0
  Doc4   0   1   0   3   0   0
  Doc5   2   2   2   3   1   4

Which objects are more similar?
Distance or Similarity Measures

Pearson Correlation:
- works well in the case of user ratings (where there is at least a range, e.g., 1-5)
- not always possible (in some situations we may only have implicit binary values, e.g., whether a user did or did not select a document)
- alternatively, a variety of distance or similarity measures can be used

Common Distance Measures, for X = <x_1, x_2, ..., x_n> and Y = <y_1, y_2, ..., y_n>:

- Manhattan distance:

    dist(X, Y) = \sum_{i=1}^{n} |x_i - y_i|

- Euclidean distance:

    dist(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

- Cosine similarity, with dist(X, Y) = 1 - sim(X, Y):

    sim(X, Y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2 \cdot \sum_i y_i^2}}
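To make these measures concrete, here is a minimal Python sketch (illustrative, not from the slides) that computes all three for two of the document vectors from the term-frequency table above:

```python
import math

def manhattan(x, y):
    # Sum of absolute coordinate differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine(x, y):
    # Dot product normalized by the two vector lengths
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norms = math.sqrt(sum(xi ** 2 for xi in x)) * math.sqrt(sum(yi ** 2 for yi in y))
    return dot / norms if norms else 0.0

doc1 = [0, 4, 0, 0, 0, 2]   # Doc1 from the table above
doc5 = [2, 2, 2, 3, 1, 4]   # Doc5 from the table above
print(manhattan(doc1, doc5), euclidean(doc1, doc5), cosine(doc1, doc5))
```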
Clustering Similarity Measures

In the vector-space model, any of the similarity measures discussed before can be used in clustering:

- Simple matching:

    sim(X, Y) = \sum_i x_i y_i

- Cosine coefficient:

    sim(X, Y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2} \cdot \sqrt{\sum_i y_i^2}}

- Dice's coefficient:

    sim(X, Y) = \frac{2 \sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2}

- Jaccard's coefficient:

    sim(X, Y) = \frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2 - \sum_i x_i y_i}
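A short Python sketch of the two coefficients not covered in the earlier snippet (again illustrative, not from the slides); simple matching is just the raw dot product:

```python
def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def dice(x, y):
    # Dice's coefficient: twice the dot product over the summed squared norms
    return 2 * dot(x, y) / (sum(xi ** 2 for xi in x) + sum(yi ** 2 for yi in y))

def jaccard(x, y):
    # Jaccard's coefficient: dot product over the "union" of the squared norms
    d = dot(x, y)
    return d / (sum(xi ** 2 for xi in x) + sum(yi ** 2 for yi in y) - d)
```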
Distance (Similarity) Matrix

Similarity (Distance) Matrix:
- based on the distance or similarity measure, we can construct a symmetric matrix of distance (or similarity) values
- the (i, j) entry in the matrix is the distance (similarity) between items i and j

         I_1    I_2    ...   I_n
  I_1    d_11   d_12   ...   d_1n
  I_2    d_21   d_22   ...   d_2n
  ...    ...    ...    ...   ...
  I_n    d_n1   d_n2   ...   d_nn

where d_ij = similarity (or distance) of D_i to D_j.

Note that d_ij = d_ji (i.e., the matrix is symmetric), so we only need the lower triangle of the matrix. The diagonal is all 1's (similarity) or all 0's (distance).
Example: Term Similarities in Documents

Suppose we want to cluster terms that appear in a collection of documents with different frequencies. Each term can be viewed as a vector of term frequencies (weights).

We need to compute a term-term similarity matrix. For simplicity we use the dot product as the similarity measure (note that this is the non-normalized version of cosine similarity):

    sim(T_i, T_j) = \sum_{k=1}^{N} (w_{ik} \cdot w_{jk})

where N = total number of dimensions (in this case, documents) and w_{ik} = the weight of term i in document k.

Example:

        T1  T2  T3  T4  T5  T6  T7  T8
  Doc1   0   4   0   0   0   2   1   3
  Doc2   3   1   4   3   1   2   0   1
  Doc3   3   0   0   0   3   0   3   0
  Doc4   0   1   0   3   0   0   2   0
  Doc5   2   2   2   3   1   4   0   2

  sim(T1, T2) = <0,3,3,0,2> · <4,1,0,1,2> = 0x4 + 3x1 + 3x0 + 0x1 + 2x2 = 7
Similarity Matrix - Example

Applying sim(T_i, T_j) = \sum_{k=1}^{N} (w_{ik} \cdot w_{jk}) to every pair of terms in the document-term matrix above gives the term-term similarity matrix (lower triangle only, since it is symmetric):

       T1  T2  T3  T4  T5  T6  T7
  T2    7
  T3   16   8
  T4   15  12  18
  T5   14   3   6   6
  T6   14  18  16  18   6
  T7    9   6   0   6   9   2
  T8    7  17   8   9   3  16   3
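The whole matrix is one matrix product. A NumPy sketch (illustrative, not from the slides): if D is the document-by-term matrix, then D.T @ D contains the dot product of every pair of term vectors:

```python
import numpy as np

# Rows are Doc1..Doc5, columns are terms T1..T8 (the table above)
D = np.array([[0, 4, 0, 0, 0, 2, 1, 3],
              [3, 1, 4, 3, 1, 2, 0, 1],
              [3, 0, 0, 0, 3, 0, 3, 0],
              [0, 1, 0, 3, 0, 0, 2, 0],
              [2, 2, 2, 3, 1, 4, 0, 2]])

S = D.T @ D        # S[i, j] = sim(T(i+1), T(j+1))
print(S[0, 1])     # 7, matching sim(T1, T2) above
```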
Similarity Thresholds

A similarity threshold is used to mark pairs that are "sufficiently" similar; the threshold value is application and collection dependent. Using a threshold value of 10 in the previous example:

       T1  T2  T3  T4  T5  T6  T7
  T2    0
  T3    1   0
  T4    1   1   1
  T5    1   0   0   0
  T6    1   1   1   1   0
  T7    0   0   0   0   0   0
  T8    0   1   0   0   0   1   0
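Continuing the NumPy sketch, thresholding is a single vectorized comparison (the slide does not say whether the bound is strict; >= is assumed here, and no entry equals exactly 10 anyway):

```python
A = (S >= 10).astype(int)   # 1 where the pair of terms is "sufficiently" similar
np.fill_diagonal(A, 0)      # ignore each term's similarity to itself
```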
Graph Representation

The similarity matrix can be visualized as an undirected graph:
- each item is represented by a node, and edges represent the fact that two items are similar (a one in the similarity threshold matrix)

[Figure: graph over nodes T1-T8 with an edge for each 1 in the threshold matrix above; T7 has no edges]

If no threshold is used, the matrix can be represented as a weighted graph.
Graph-Based Clustering Algorithms

If we are interested only in the threshold (and not in the degree of similarity or distance), we can use the graph directly for clustering.

Clique Method (complete link):
- all items within a cluster must be within the similarity threshold of all other items in that cluster
- clusters may overlap
- generally produces small but very tight clusters

Single Link Method:
- any item in a cluster must be within the similarity threshold of at least one other item in that cluster
- produces larger but weaker clusters

Other methods:
- star method: start with an item and place all related items in that cluster
- string method: start with an item; place one related item in that cluster; then place another item related to the last item entered, and so on
Graph-Based Clustering Algorithms: Clique Method

- a clique is a completely connected subgraph of a graph
- in the clique method, each maximal clique in the graph becomes a cluster

[Figure: the same graph over nodes T1-T8]

Maximal cliques (and therefore the clusters) in the previous example are:

  {T1, T3, T4, T6}
  {T2, T4, T6}
  {T2, T6, T8}
  {T1, T5}
  {T7}

Note that, for example, {T1, T3, T4} is also a clique, but it is not maximal.
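With a graph library this becomes a one-liner. A sketch using networkx (an assumed dependency, not mentioned in the slides), continuing from the thresholded matrix A above:

```python
import networkx as nx

G = nx.from_numpy_array(A)            # node i corresponds to term T(i+1)

for clique in nx.find_cliques(G):     # enumerates the maximal cliques
    print(sorted(f"T{i + 1}" for i in clique))
# prints the five clusters listed above, including the singleton {T7}
```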
Graph-Based Clustering Algorithms: Single Link Method

1. Select an item not in a cluster and place it in a new cluster.
2. Place all other items similar to it in that cluster.
3. Repeat step 2 for each item in the cluster until nothing more can be added.
4. Repeat steps 1-3 for each item that remains unclustered.

[Figure: the same graph over nodes T1-T8]

In this case the single link method produces only two clusters:

  {T1, T2, T3, T4, T5, T6, T8}   {T7}

Note that the single link method does not allow overlapping clusters, thus partitioning the set of items.
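Under a fixed threshold, single link clusters are exactly the connected components of the same graph, so the networkx sketch extends naturally:

```python
for component in nx.connected_components(G):
    print(sorted(f"T{i + 1}" for i in component))
# prints {T1, T2, T3, T4, T5, T6, T8} and {T7}, as above
```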
Clustering with Existing Clusters

The notion of comparing item similarities can be extended to clusters themselves, by focusing on a representative vector for each cluster:
- cluster representatives can be actual items in the cluster, or other "virtual" representatives such as the centroid
- this methodology reduces the number of similarity computations in clustering
- clusters are revised successively until a stopping condition is satisfied, or until no more changes to clusters can be made

Partitioning Methods:
- reallocation method: start with an initial assignment of items to clusters, and then move items from cluster to cluster to obtain an improved partitioning
- single pass method: simple and efficient, but produces large clusters and depends on the order in which items are processed

Hierarchical Agglomerative Methods:
- start with individual items and combine them into clusters
- then successively combine smaller clusters to form larger ones
- grouping of individual items can be based on any of the methods discussed earlier
Partitioning Algorithms: Basic Concept

Partitioning method: construct a partition of a database D of n objects into a set of k clusters.

Given a k, find the partition of k clusters that optimizes the chosen partitioning criterion:
- global optimum: exhaustively enumerate all partitions
- heuristic methods: the k-means and k-medoids algorithms

k-means (MacQueen, 1967):
- each cluster is represented by the center of the cluster

k-medoids (Kaufman & Rousseeuw, 1987):
- each cluster is represented by one of the objects in the cluster
K-Means Algorithm

The basic algorithm (based on the reallocation method):

1. Select K initial clusters by (possibly) random assignment of some items to clusters, and compute each of the cluster centroids.
2. Compute the similarity of each item x_i to each cluster centroid and (re-)assign each item to the cluster whose centroid is most similar to x_i.
3. Re-compute the cluster centroids based on the new assignments.
4. Repeat steps 2 and 3 until there is no change in clusters from one iteration to the next.
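A minimal NumPy sketch of these four steps, using the dot product as the similarity measure (as in the worked example that follows). Unlike the slide example, this sketch assumes every item starts in some cluster and that no cluster ever becomes empty:

```python
import numpy as np

def kmeans_dot(X, assign, max_iter=100):
    """X: item-by-feature matrix; assign: initial cluster index per item."""
    k = assign.max() + 1
    for _ in range(max_iter):
        # Steps 1/3: centroid = mean vector of the items in each cluster
        centroids = np.array([X[assign == c].mean(axis=0) for c in range(k)])
        # Step 2: dot-product similarity of every item to every centroid
        new_assign = (X @ centroids.T).argmax(axis=1)
        # Step 4: stop when no item changes cluster
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return assign, centroids
```

Production implementations (e.g., scikit-learn's KMeans) minimize Euclidean distance instead of maximizing dot-product similarity, but the reallocation structure is the same.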
Example: Clustering Documents

Initial (arbitrary) assignment: C1 = {D1, D2}, C2 = {D3, D4}, C3 = {D5, D6}

Document-term matrix and the resulting cluster centroids:

        T1   T2   T3   T4   T5
  D1     0    3    3    0    2
  D2     4    1    0    1    2
  D3     0    4    0    0    2
  D4     0    3    0    3    3
  D5     0    1    3    0    1
  D6     2    2    0    0    4
  D7     1    0    3    2    0
  D8     3    1    0    0    2

  C1   4/2  4/2  3/2  1/2  4/2
  C2   0/2  7/2  0/2  3/2  5/2
  C3   2/2  3/2  3/2  0/2  5/2
Example: K-Means

Now compute the similarity (or distance) of each item to each cluster, resulting in a cluster-document similarity matrix (here we use the dot product as the similarity measure):

        D1    D2    D3    D4    D5    D6    D7    D8
  C1  29/2  29/2  24/2  27/2  17/2  32/2  15/2  24/2
  C2  31/2  20/2  38/2  45/2  12/2  34/2   6/2  17/2
  C3  28/2  21/2  22/2  24/2  17/2  30/2  11/2  19/2

For each document, reallocate the document to the cluster with which it has the highest similarity (the largest entry in its column). After the reallocation we have the following new clusters. Note that the previously unassigned D7 and D8 have now been assigned, and that D1 and D6 have been reallocated from their original assignment:

  C1 = {D2, D7, D8}, C2 = {D1, D3, D4, D6}, C3 = {D5}

This is the end of the first iteration (i.e., the first reallocation). Next, we repeat the process for another reallocation.
Example: K-Means (continued)

Starting from C1 = {D2, D7, D8}, C2 = {D1, D3, D4, D6}, C3 = {D5}, compute new cluster centroids using the original document-term matrix:

        T1    T2    T3    T4    T5
  C1   8/3   2/3   3/3   3/3   4/3
  C2   2/4  12/4   3/4   3/4  11/4
  C3   0/1   1/1   3/1   0/1   1/1

This leads to a new cluster-document similarity matrix, computed as before. Again, each item is reallocated to the cluster with which it has the highest similarity:

        D1     D2     D3     D4     D5     D6     D7     D8
  C1   7.67  15.01   5.34   9.00   5.00  12.00   7.67  11.34
  C2  16.75  11.25  17.50  19.50   8.00   6.68   4.25  10.00
  C3  14.00   3.00   6.00   6.00  11.00   9.34   9.00   3.00

New assignment: C1 = {D2, D6, D8}, C2 = {D1, D3, D4}, C3 = {D5, D7}

Note: this process is now repeated with the new clusters. However, the next iteration in this example shows no change to the clusters, thus terminating the algorithm.
K-Means Algorithm

Strengths of k-means:
- Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally k, t << n.
- Often terminates at a local optimum.

Weaknesses of k-means:
- Applicable only when the mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance.
- Unable to handle noisy data and outliers.

Variations of k-means usually differ in:
- selection of the initial k means
- dissimilarity calculations
- strategies to calculate cluster means
Single Pass Method

The basic algorithm:

1. Assign the first item T1 as the representative of cluster C1.
2. For the next item Ti, calculate its similarity S to the centroid of each existing cluster.
3. If the highest such similarity Smax is greater than a threshold value, add the item to the corresponding cluster and recalculate that centroid; otherwise use the item to initiate a new cluster.
4. If any item remains unclustered, go to step 2.

See: Example of Single Pass Clustering Technique

This algorithm is simple and efficient, but has some problems:
- generally does not produce optimum clusters
- order dependent: using a different order of processing items will result in a different clustering
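A compact Python sketch of these four steps; the cosine similarity measure and all names here are illustrative assumptions, not from the slides, and zero vectors are assumed not to occur:

```python
import numpy as np

def single_pass(items, threshold):
    clusters, centroids = [], []                  # members and running centroids
    for x in np.asarray(items, dtype=float):
        # Step 2: similarity of the item to every existing centroid
        sims = [x @ c / (np.linalg.norm(x) * np.linalg.norm(c))
                for c in centroids]
        if sims and max(sims) > threshold:
            # Step 3a: join the most similar cluster and update its centroid
            best = int(np.argmax(sims))
            clusters[best].append(x)
            centroids[best] = np.mean(clusters[best], axis=0)
        else:
            # Steps 1/3b: the item starts a new cluster
            clusters.append([x])
            centroids.append(x)
    return clusters
```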
Hierarchical Clustering Algorithms

Two main types of hierarchical clustering:
- Agglomerative:
  - start with the points as individual clusters
  - at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
- Divisive:
  - start with one, all-inclusive cluster
  - at each step, split a cluster until each cluster contains a single point (or there are k clusters)

Traditional hierarchical algorithms use a similarity or distance matrix:
- merge or split one cluster at a time
Hierarchical Algorithms

Use the distance matrix as the clustering criterion:
- does not require the number of clusters as input, but needs a termination condition

[Figure: five items a-e. Agglomerative clustering runs left to right (step 0 to step 4): a and b merge into ab; c and d merge into cd; cd and e merge into cde; finally ab and cde merge into abcde. Divisive clustering traverses the same hierarchy in reverse (step 4 down to step 0).]
Hierarchical Agglomerative Clustering

HAC starts with unclustered data and performs successive pairwise joins among items (or previous clusters) to form larger ones:
- this results in a hierarchy of clusters which can be viewed as a dendrogram
- useful in pruning search in a clustered item set, or in browsing clustering results

[Figure: a dendrogram over items A through I]
Hierarchical Agglomerative Clustering

Some commonly used HACM methods:
- Single Link: at each step, join the most similar pair of objects that are not yet in the same cluster
- Complete Link: use the least similar pair between each cluster pair to determine inter-cluster similarity; all items within one cluster are linked to each other within a similarity threshold
- Group Average (Mean): use the average value of the pairwise links within a cluster to determine inter-cluster similarity (i.e., all objects contribute to inter-cluster similarity)
- Ward's method: at each step, join the cluster pair whose merger minimizes the increase in the total within-group error sum of squares (based on distance between centroids); also called the minimum variance method
Hierarchical Agglomerative Clustering

Basic procedure:

1. Place each of N documents into a class of its own.
2. Compute all pairwise document-document similarity coefficients (a total of N(N-1)/2 coefficients).
3. Form a new cluster by combining the most similar pair of current clusters i and j:
   - use one of the methods described in the previous slide (e.g., complete link, Ward's, etc.)
   - update the similarity matrix by deleting the rows and columns corresponding to i and j
   - calculate the entries in the row corresponding to the new cluster i+j
4. Repeat step 3 if the number of clusters left is greater than 1.
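SciPy ships exactly this procedure. A sketch on toy data (the random matrix is a placeholder): linkage supports the 'single', 'complete', 'average', and 'ward' methods described above, and dendrogram draws the resulting hierarchy:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.random((10, 5))                 # 10 items with 5 features (toy data)

dists = pdist(X, metric='euclidean')    # the N(N-1)/2 pairwise distances (step 2)
Z = linkage(dists, method='ward')       # steps 3-4: successive pairwise merges
dendrogram(Z)                           # visualize the cluster hierarchy
plt.show()
```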
Clustering Application: Discovery of Content Profiles

Content Profiles:
- Goal: automatically group together pages which partially deal with similar concepts
- Method:
  - identify concepts by clustering features (keywords) based on their common occurrences among pages (this can also be done using association discovery or correlation analysis)
  - cluster centroids represent pages in which the features in the cluster appear frequently
- Content profiles are derived from centroids after filtering out low-weight pages in each centroid
- The weight of a page in a profile represents the degree to which the features in the corresponding cluster appear in that page
Keyword-Based Representation

Pages as vectors of keywords:

       w1  w2  w3
  P1    1   0   1
  P2    1   0   0
  P3    0   1   1
  P4    1   0   0
  P5    1   1   1
  P6    1   1   0
  P7    0   1   0
  P8    0   1   0

Terms as vectors of pages (the transpose):

        P1  P2  P3  P4  P5  P6  P7  ...
  w1     1   1   0   1   1   1   0
  w2     0   0   1   0   1   1   1
  w3     1   0   1   0   1   0   0

Keyword weights can be:
- binary (as in this example)
- raw (or normalized) term frequency
- TF x IDF

Mining tasks can be performed on either of these matrices.
Content Profiles - An Example (filtering threshold = 0.5)

PROFILE 0 (Cluster Size = 3)
  1.00  C.html  (web, data, mining)
  1.00  D.html  (web, data, mining)
  0.67  B.html  (data, mining)

PROFILE 1 (Cluster Size = 4)
  1.00  B.html  (business, intelligence, marketing, ecommerce)
  1.00  F.html  (business, intelligence, marketing, ecommerce)
  0.75  A.html  (business, intelligence, marketing)
  0.50  C.html  (marketing, ecommerce)
  0.50  E.html  (intelligence, marketing)

PROFILE 2 (Cluster Size = 3)
  1.00  A.html  (search, information, retrieval)
  1.00  E.html  (search, information, retrieval)
  0.67  C.html  (information, retrieval)
  0.67  D.html  (information, retrieval)
Example: Assoc. for Consumer Research (ACR)
How Content Profiles Are Generated

1. Extract important features (e.g., word stems) from each document:

  icmd.html:               jcp.html:
  Feature   Freq           Feature     Freq
  confer      12           psychologi    11
  market       9           consum         9
  develop      9           journal        6
  intern       5           manuscript     5
  ghana        3           cultur         5
  ismd         3           special        4
  contact      3           issu           4
  ...                      paper          4
                           ...

2. Build a global dictionary of all features (words) along with relevant statistics (Total Documents = 41):

  Feature-id  Doc-freq  Total-freq  Feature
     0            6         44      1997
     1           12         59      1998
     2           13         76      1999
     3            8         41      2000
   ...          ...        ...      ...
   123           26        271      confer
   124            9         24      consid
   125           23        165      consum
   ...          ...        ...      ...
   439            7         45      psychologi
   440           14         78      public
   441           11         61      publish
   ...          ...        ...      ...
   549            1          6      vision
   550            3          8      volunt
   551            1          9      vot
   552            4         23      vote
   553            3         17      web
   ...          ...        ...      ...
How Content Profiles Are Generated (continued)

3. Construct a document-word matrix with normalized tf-idf weights:

  doc-id \ feature-id     0     1     2     3     4     5   ...
   0                   0.27  0.43  0.00  0.00  0.00  0.00  ...
   1                   0.07  0.10  0.00  0.00  0.00  0.00  ...
   2                   0.00  0.06  0.07  0.03  0.00  0.00  ...
   3                   0.00  0.00  0.00  0.00  0.00  0.00  ...
   4                   0.00  0.00  0.00  0.00  0.00  0.00  ...
   5                   0.00  0.00  0.05  0.06  0.00  0.00  ...
   6                   0.17  0.10  0.07  0.03  0.03  0.00  ...
   7                   0.14  0.09  0.08  0.02  0.02  0.00  ...
   8                   0.00  0.00  0.10  0.00  0.00  0.00  ...
   9                   0.00  0.07  0.00  0.00  0.00  0.00  ...
  10                   0.02  0.02  0.00  0.00  0.00  0.00  ...
  11                   0.00  0.00  0.00  0.00  0.00  0.00  ...
  12                   0.00  0.00  0.00  0.00  0.00  0.00  ...
  13                   0.00  0.00  0.00  0.00  0.00  0.00  ...
  14                   0.00  0.00  0.00  0.00  0.00  0.00  ...
  15                   0.00  0.00  0.32  0.38  0.00  0.00  ...
  ...                   ...   ...   ...   ...   ...   ...  ...

4. Now we can perform clustering on the words (or the documents) using one of the techniques described earlier (e.g., k-means clustering on features).
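Steps 3 and 4 map directly onto scikit-learn. A sketch with placeholder documents standing in for the 41 call-for-papers pages (the strings and cluster count are illustrative assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["journal of consumer psychology special issue manuscripts",
        "international conference on marketing and development",
        "call for papers consumer research marketing journal"]

vec = TfidfVectorizer()                 # step 3: normalized tf-idf weights
D = vec.fit_transform(docs)             # rows = documents, columns = words

km = KMeans(n_clusters=2, n_init=10)    # step 4: to cluster the words rather
labels = km.fit_predict(D.T.toarray())  # than the documents, transpose D
for word, label in zip(vec.get_feature_names_out(), labels):
    print(label, word)
```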
How Content Profiles Are Generated (continued)

Examples of feature (word) clusters obtained using k-means:

  CLUSTER 0:  anthropologi, anthropologist, appropri, associ, behavior, ...
  CLUSTER 4:  consum, issu, journal, market, psychologi, special
  CLUSTER 10: ballot, result, vot, vote, ...
  CLUSTER 11: advisori, appoint, committe, council, ...

5. Content profiles are now generated from the feature clusters, based on the centroid of each cluster (similar to usage profiles, but with words instead of users/sessions):

  Weight  Pageview ID                                       Significant Features (stems)
  1.00    CFP: One World One Market                         world, challeng, busi, co, manag, global
  0.63    CFP: Int'l Conf. on Marketing & Development       challeng, co, contact, develop, intern
  0.35    CFP: Journal of Global Marketing                  busi, global
  0.32    CFP: Journal of Consumer Psychology               busi, manag, global

  Weight  Pageview ID                                       Significant Features (stems)
  1.00    CFP: Journal of Psych. & Marketing                psychologi, consum, special, market
  1.00    CFP: Journal of Consumer Psychology I             psychologi, journal, consum, special, market
  0.72    CFP: Journal of Global Marketing                  journal, special, market
  0.61    CFP: Journal of Consumer Psychology II            psychologi, journal, consum, special
  0.50    CFP: Society for Consumer Psychology              psychologi, consum, special
  0.50    CFP: Conf. on Gender, Market., Consumer Behavior  journal, consum, market
Scatter/Gather
(Cutting, Pedersen, Tukey & Karger 1992, 1993; Hearst & Pedersen 1995)

A cluster-based browsing technique for large text collections:
- cluster sets of documents into general "themes", like a table of contents
- display the contents of the clusters by showing topical terms and typical titles
- the user may then select (gather) clusters that seem interesting
- these clusters can then be re-clustered (scattered) to reveal more fine-grained clusters of documents
- with each successive iteration of scattering and gathering, the clusters become smaller and more detailed, eventually bottoming out at the level of individual documents
- clustering and re-clustering is entirely automated

Originally used to give a collection overview; evidence suggests the technique is more appropriate for displaying retrieval results in context.
Scatter/Gather Interface
Scatter/Gather Clusters
Clustering and Collaborative Filtering: clustering based on ratings (MovieLens)

Clustering and Collaborative Filtering: tag clustering example

Hierarchical Clustering: example of clustered search results
- can drill down within clusters to view sub-topics or to view the relevant subset of results