I256 Applied Natural Language Processing
Fall 2009
Lecture 15: (Text) Clustering
Barbara Rosario
Outline
• Motivation and applications for text clustering
• Hard vs. soft clustering
• Flat vs. hierarchical clustering
• Similarity measures
• Flat clustering
  – K-means
• Hierarchical clustering
  – Agglomerative clustering
Text Clustering
• Finds overall similarities among groups of documents
• Finds overall similarities among groups of tokens (words, adjectives…)
• Goal is to place similar objects in the same groups and to assign dissimilar objects to different groups
Motivation
• Smoothing for statistical language models
  – Generalization
• Forming bins (by inducing the bins from the data)
From Michael Collins’s slides (MIT 6.864 NLP course)
Motivation
• Aid for Question-Answering and Information Retrieval
From Michael Collins’s slides (MIT 6.864 NLP course)
Word Similarity
Find semantically related words by combining similarity evidence from multiple indicators
From Michael Collins’s slides (MIT 6.864 NLP course)
Word clustering
From Michael Collins’s slides (MIT 6.864 NLP course)
Distributional Clustering of English Words - Pereira, Tishby and Lee, ACL 93
Clustering of nouns
Clustering of adjectives

• Cluster adjectives based on the nouns they modify
• Multiple syntactic clues for modification

Predicting the semantic orientation of adjectives, V. Hatzivassiloglou and K. R. McKeown, EACL 1997
Document clustering
Classification
Scatter/Gather: Clustering a Large Text Collection
Cutting, Pedersen, Tukey & Karger 92, 93
Hearst & Pedersen 95
• Cluster sets of documents into general “themes”, like a table of contents
• Display the contents of the clusters by showing topical terms and typical titles
• User chooses subsets of the clusters and re-clusters the documents within
• Resulting new groups have different “themes”
From http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
S/G Example: query on “star”

Encyclopedia text. Initial clusters:

• 14 sports
• 8 symbols
• 47 film, tv
• 68 film, tv (p)
• 7 music
• 97 astrophysics
• 67 astronomy (p)
• 12 stellar phenomena
• 10 flora/fauna
• 49 galaxies, stars
• 29 constellations
• 7 miscellaneous

Clustering and re-clustering is entirely automated.
Motivation: Visualization & EDA

• Exploratory data analysis (EDA), related to visualization
  – Get a feeling for what the data look like
  – Try to find overall trends or patterns in text collections
Visualization
• Use clustering to map the entire huge multidimensional document space into a large number of small clusters
• “Project” these onto a 2D graphical representation
• Looks neat, but it is difficult to detect patterns
  – Usefulness debatable
Motivation: Clustering for Information Retrieval

• The cluster hypothesis states the fundamental assumption we make when using clustering in information retrieval:
  – Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs
• Tends to place similar docs together
Search result clustering
• Instead of lists, cluster the search results, so that similar documents appear together
• It is often easier to scan a few coherent groups than many individual documents
  – Particularly useful if a search term has different word senses
  – Vivísimo search engine (http://vivisimo.com)
Motivation: unsupervised classification
• Classification when labeled data is not available
  – Also called unsupervised classification
  – Results of clustering depend only on the natural divisions in the data, not on any pre-existing categorization scheme
Classification
Class1
Class2
Clustering
Methods
• Hard/soft clustering
• Flat/hierarchical clustering
• Similarity measures
• Merging methods
Text Clustering
Clustering is “the art of finding groups in data.” (Kaufman and Rousseeuw)

[Scatter plot of points in a Term 1 vs. Term 2 space]
Hard/soft Clustering
– Hard clustering: each object belongs to a single cluster
– Soft clustering: each object is probabilistically assigned to clusters
Soft clustering
• A variation of many clustering methods
• Instead of assigning each data sample to one and only one cluster, it calculates probabilities of membership for all clusters
  – A sample might belong to cluster A with probability 0.4 and to cluster B with probability 0.6
• More appropriate for NLP tasks
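As a toy illustration (not from the slides), one common way to obtain soft memberships is to turn a sample's distances to the cluster centers into probabilities. The function name and the softmax-style weighting here are my own choices:

```python
import math

def soft_assign(distances, temperature=1.0):
    """Turn a sample's distances to each cluster center into membership
    probabilities: closer centers get higher weight, and weights sum to 1."""
    weights = [math.exp(-d / temperature) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

# A sample slightly closer to the first center gets a split like 0.6 / 0.4,
# in the spirit of the slide's cluster-A/cluster-B example.
probs = soft_assign([1.0, 1.5])
```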
Flat Vs. Hierarchical
• Flat clustering creates a flat set of clusters without any explicit structure that would relate clusters to each other
• Hierarchical clustering produces a hierarchy of nodes
  – Leaves are the single objects of the clustered set
  – Each node represents the cluster that contains all the nodes of its descendants
From http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
Flat Vs. Hierarchical
• Flat
  – Preferable if efficiency is a consideration or data sets are very large
  – K-means is a very simple method that should probably be used first on a new data set, because its results are often sufficient
  – K-means assumes a simple Euclidean representation and so cannot be used for many data sets, for example nominal data like colors
  – In such cases, use EM (expectation-maximization)
Flat Vs. Hierarchical
• Hierarchical
  – Preferable for detailed data analysis
  – Provides more information than flat clustering
  – Does not require us to pre-specify the number of clusters
  – Less efficient: the most common hierarchical clustering algorithms have a complexity that is at least quadratic in the number of documents, compared to the linear complexity of most flat clustering methods
Clustering issues
• Two main issues:
  – Similarity measure
  – How to cluster data points together (or not)
    • Clustering algorithms
    • Merging criteria
Similarity
• Vector-space representation and similarity computation
• Select important distributional properties of a word
• Create a vector of length n for each word to be classified
• Viewing the n-dimensional vector as a point in an n-dimensional space, cluster points that are near one another
Similarity
From Michael Collins’s slides (MIT 6.864 NLP course)
Pair-wise Document Similarity
|   | nova | galaxy | heat | h’wood | film | role | diet | fur |
|---|------|--------|------|--------|------|------|------|-----|
| A | 1    | 3      | 1    |        |      |      |      |     |
| B | 5    | 2      |      |        |      |      |      |     |
| C |      |        |      | 2      | 1    | 5    |      |     |
| D |      |        |      | 4      | 1    |      |      |     |

(Column placement for B and D is inferred from the similarity computations that follow; the original table’s alignment was lost in extraction.)

How to compute document similarity?
Pair-wise Document Similarity (no normalization for simplicity)

|   | nova | galaxy | heat | h’wood | film | role | diet | fur |
|---|------|--------|------|--------|------|------|------|-----|
| A | 1    | 3      | 1    |        |      |      |      |     |
| B | 5    | 2      |      |        |      |      |      |     |
| C |      |        |      | 2      | 1    | 5    |      |     |
| D |      |        |      | 4      | 1    |      |      |     |

With documents represented as term-weight vectors

D_1 = (w_{11}, w_{12}, \ldots, w_{1t}), \quad D_2 = (w_{21}, w_{22}, \ldots, w_{2t})

the similarity is the dot product

\mathrm{sim}(D_1, D_2) = \sum_{i=1}^{t} w_{1i}\, w_{2i}

sim(A, B) = (1×5) + (3×2) = 11
sim(A, C) = 0
sim(A, D) = 0
sim(B, C) = 0
sim(B, D) = 0
sim(C, D) = (2×4) + (1×1) = 9
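The computation above can be checked in a few lines. The sparse term-weight dictionaries below use the column placement inferred for B and D (the original table's alignment was lost), and `sim` is the plain dot product from the slide:

```python
# Term weights for the four example documents; only nonzero entries stored.
docs = {
    "A": {"nova": 1, "galaxy": 3, "heat": 1},
    "B": {"nova": 5, "galaxy": 2},
    "C": {"h'wood": 2, "film": 1, "role": 5},
    "D": {"h'wood": 4, "film": 1},
}

def sim(d1, d2):
    """Unnormalized dot-product similarity over shared terms."""
    return sum(w * d2[t] for t, w in d1.items() if t in d2)

print(sim(docs["A"], docs["B"]))  # 11
print(sim(docs["C"], docs["D"]))  # 9
print(sim(docs["A"], docs["C"]))  # 0
```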
Pair-wise Document Similarity (cosine normalization)

D_1 = (w_{11}, w_{12}, \ldots, w_{1t}), \quad D_2 = (w_{21}, w_{22}, \ldots, w_{2t})

\mathrm{sim}(D_1, D_2) = \sum_{i=1}^{t} w_{1i}\, w_{2i} \quad \text{(unnormalized)}

\mathrm{sim}(D_1, D_2) = \frac{\sum_{i=1}^{t} w_{1i}\, w_{2i}}{\sqrt{\sum_{i=1}^{t} w_{1i}^2}\; \sqrt{\sum_{i=1}^{t} w_{2i}^2}} \quad \text{(normalized, cosine)}
Document/Document Matrix
|      | D_1    | D_2    | ... | D_n    |
|------|--------|--------|-----|--------|
| D_1  | d_11   | d_12   | ... | d_1n   |
| D_2  | d_21   | d_22   | ... | d_2n   |
| ...  | ...    | ...    | ... | ...    |
| D_n  | d_n1   | d_n2   | ... | d_nn   |

where d_ij = similarity of D_i to D_j
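The matrix can be built directly from any pairwise measure. The three tiny documents below are made up for illustration; with a symmetric measure like the dot product, d_ij = d_ji:

```python
def dot(d1, d2):
    """Dot-product similarity over shared terms."""
    return sum(w * d2.get(t, 0) for t, w in d1.items())

docs = {"D1": {"star": 1, "film": 2}, "D2": {"star": 3}, "D3": {"film": 1}}

# matrix[i][j] = d_ij = similarity of D_i to D_j
matrix = {i: {j: dot(di, dj) for j, dj in docs.items()} for i, di in docs.items()}
```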
Similarity
• And many other similarity measures!
Flat Clustering: K-means
• K-means is the most important flat clustering algorithm.
• The objective is to minimize the average squared Euclidean distance of documents from their cluster centers, where a cluster center is defined as the mean, or centroid, μ of the documents in a cluster ω:

  \vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}
K-Means Clustering
• Decide on a pair-wise similarity measure
1. Compute K centroids
2. Assign each document to the nearest centroid, forming new clusters
3. Unless the termination condition is met, repeat steps 1-2
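The loop above can be sketched compactly, assuming 2-D points, Euclidean distance, and randomly sampled initial centroids (all names here are my own; this is a sketch, not the slide's exact pseudocode):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain K-means: assign each point to the nearest centroid,
    recompute centroids as cluster means, repeat until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        new = [(sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
               if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # termination condition: centroids stopped moving
            break
        centroids = new
    return centroids, clusters

# Two well-separated blobs; K = 2 recovers them.
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(points, 2)
```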
K-means algorithm

A K-means example for K = 2 in R²
From http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
K-means algorithm
• Convergence of the position of the two centroids
From http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
K-means
• Residual sum of squares (RSS): a measure of how well the centroids represent the members of their clusters
  – RSS is the squared distance of each vector from its centroid, summed over all vectors
  – RSS is the objective function in K-means, and our goal is to minimize it
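Written out, the objective described above (following the IR-book definitions the slides draw on) is

```latex
\mathrm{RSS} \;=\; \sum_{k=1}^{K} \sum_{\vec{x} \in \omega_k}
  \left\lVert \vec{x} - \vec{\mu}(\omega_k) \right\rVert^{2},
\qquad
\vec{\mu}(\omega) \;=\; \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}
```

where ω_k is the k-th cluster and μ(ω_k) its centroid.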
Model-based clustering
• Model-based clustering assumes that the data were generated by a model and tries to recover the original model from the data. (Flat)
• The model that we recover from the data then defines clusters and an assignment of documents to clusters.
• EM (expectation-maximization)
Hierarchical Clustering
• Agglomerative or bottom-up:
  – Initialization: start with each sample in its own cluster
  – Each iteration: find the two most similar (closest) clusters and merge them
  – Termination: all the objects are in the same cluster
• Divisive or top-down:
  – Start with all elements in one cluster
  – Partition one of the current clusters in two
  – Repeat until all samples are in singleton clusters
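The bottom-up loop can be sketched as follows, using single-link distance as the merging criterion (one of the criteria discussed below); the nested search over cluster pairs is written for clarity, not efficiency:

```python
def agglomerative(items, dist, target=1):
    """Bottom-up clustering sketch: start with singleton clusters and
    repeatedly merge the two closest, until `target` clusters remain."""
    clusters = [[x] for x in items]
    while len(clusters) > target:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link: distance between the two closest members
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# 1-D toy data: three close points and a separate pair.
points = [0.0, 0.1, 0.2, 5.0, 5.1]
two = agglomerative(points, dist=lambda a, b: abs(a - b), target=2)
```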
Agglomerative Clustering
A B C D E F G H I
Merging nodes/Clustering function
• Each node is a combination of the documents combined below it
• We represent the merged nodes as a vector of term weights
• This vector is referred to as the cluster centroid
Clustering functions (aka merging criteria)

• Extend the distance measure from samples to sets of samples:
  – Single link: similarity of the two most similar members
  – Complete link: similarity of the two least similar members
  – Average link: average similarity between members
From Michael Collins’s slides (MIT 6.864 NLP course)
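As code, the three criteria look like this, phrased here as distances between clusters rather than similarities (function names my own):

```python
def single_link(A, B, dist):
    """Distance via the two closest members (slide: 'most similar members')."""
    return min(dist(a, b) for a in A for b in B)

def complete_link(A, B, dist):
    """Distance via the two farthest members (slide: 'least similar members')."""
    return max(dist(a, b) for a in A for b in B)

def average_link(A, B, dist):
    """Average pairwise distance between members."""
    return sum(dist(a, b) for a in A for b in B) / (len(A) * len(B))

d = lambda a, b: abs(a - b)
A, B = [0, 1], [4, 6]
print(single_link(A, B, d), complete_link(A, B, d), average_link(A, B, d))
```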
Single-link merging criteria
Merge the closest pair of clusters.
Single-link: clusters are close if any of their points are close:

  dist(A, B) = min dist(a, b) for a ∈ A, b ∈ B

Initially, each word type is a single-point cluster; then merge.
Bottom-Up Clustering – Single-Link

Fast, but tends to get long, stringy, meandering clusters...
Bottom-Up Clustering – Complete-Link
Again, merge the closest pair of clusters.
Complete-link: dist(A, B) = max dist(a, b) for a ∈ A, b ∈ B
(The distance between clusters is the distance between their farthest members.)
Bottom-Up Clustering – Complete-Link
Slow to find the closest pair: needs quadratically many distances.
Choosing k
• How to select an appropriate level of granularity?
• Too small, and clusters provide insufficient generalization
• Too large, and they are inappropriately generalized
Choosing k
• In both hierarchical and k-means/medians, we need to be told where to stop, i.e., how many clusters to form
• This is partially alleviated by visual inspection of the hierarchical tree (the dendrogram)
• It would be nice if we could find an optimal k from the data
• We can do this by trying different values of k and seeing which produces the best separation among the resulting clusters.
• And there are some theoretical measures
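One crude way to compare values of k, as suggested above: run the clustering at each k and watch how much the objective (RSS) improves. The centroids below are hypothetical stand-ins for the output of a k-means run at each k; in the classic "elbow" heuristic, the improvement flattens once k matches the natural grouping:

```python
def rss(points, centroids):
    """Sum of squared distances of each (1-D) point to its nearest centroid."""
    return sum(min((p - c) ** 2 for c in centroids) for p in points)

points = [0.0, 0.1, 0.2, 5.0, 5.1]

# RSS always drops as k grows; look for where adding a cluster stops helping.
curve = {
    1: rss(points, [2.08]),          # overall mean
    2: rss(points, [0.1, 5.05]),     # the two natural blobs
    3: rss(points, [0.0, 0.15, 5.05]),  # an unnecessary extra split
}
```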
How to evaluate clusters?
• In practice, it’s hard to do
  – Different algorithms’ results look good and bad in different ways
  – It’s difficult to distinguish their outcomes
• In theory, define an evaluation function
  – Typically choose something easy to measure (e.g., the sum of the average distance in each class)
How to evaluate clusters?
• Perform task-based evaluation
• Test the resulting clusters intuitively, i.e., inspect them and see if they make sense. (Not advisable.)
• Have an expert generate clusters manually, and test the automatically generated ones against them.
• Test the clusters against a predefined classification, if there is one.
From Michael Collins’s slides (MIT 6.864 NLP course)
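The last option is easy to quantify. Purity is one standard measure for comparing clusters against a predefined classification (my choice of metric here, not named on the slide): each cluster is scored by its majority label, and the majority counts are summed over all clusters:

```python
from collections import Counter

def purity(clusters, labels):
    """Fraction of samples whose cluster's majority label matches their own."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(labels[i] for i in c).most_common(1)[0][1]
                   for c in clusters)
    return majority / total

# Made-up gold labels and induced clusters: one misplace out of five samples.
labels = {0: "sports", 1: "sports", 2: "film", 3: "film", 4: "film"}
clusters = [[0, 1, 4], [2, 3]]
print(purity(clusters, labels))  # 0.8
```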
Resources
• FCLUSTER: a tool for fuzzy cluster analysis
• LNKnet pattern classification software
• Principal Direction Divisive Partitioning
• k-means clustering
• Text clustering: http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf (Chapters 16 and 17)