I256 Applied Natural Language Processing
Fall 2009
Lecture 15: (Text) Clustering
Barbara Rosario
Outline
• Motivation and applications for text clustering
• Hard vs. soft clustering
• Flat vs. hierarchical clustering
• Similarity measures
• Flat clustering
  – K-means
• Hierarchical clustering
  – Agglomerative clustering
Text Clustering
• Finds overall similarities among groups of documents
• Finds overall similarities among groups of tokens (words, adjectives…)
• Goal is to place similar objects in the same groups and to assign dissimilar objects to different groups
Motivation
• Smoothing for statistical language models
  – Generalization
• Forming bins (by inducing the bins from the data)
From Michael Collins’s slides (MIT 6.864 NLP course)
Motivation
• Aid for Question-Answering and Information Retrieval
From Michael Collins’s slides (MIT 6.864 NLP course)
Word Similarity
Find semantically related words by combining similarity evidence from multiple indicators
From Michael Collins’s slides (MIT 6.864 NLP course)
Word clustering
From Michael Collins’s slides (MIT 6.864 NLP course)
Distributional Clustering of English Words - Pereira, Tishby and Lee, ACL 93
Clustering of nouns
Clustering of adjectives

• Cluster adjectives based on the nouns they modify
• Multiple syntactic clues for modification

Predicting the semantic orientation of adjectives, V. Hatzivassiloglou and K. R. McKeown, EACL 1997
Document clustering
Classification
Scatter/Gather: Clustering a Large Text Collection
Cutting, Pedersen, Tukey & Karger 92, 93
Hearst & Pedersen 95
• Cluster sets of documents into general “themes”, like a table of contents
• Display the contents of the clusters by showing topical terms and typical titles
• User chooses subsets of the clusters and re-clusters the documents within
• Resulting new groups have different “themes”
From http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
S/G Example: query on “star”

Encyclopedia text. Initial clusters:

• 14 sports
• 8 symbols
• 47 film, tv
• 68 film, tv (p)
• 7 music
• 97 astrophysics
• 67 astronomy (p)
• 12 stellar phenomena
• 10 flora/fauna
• 49 galaxies, stars
• 29 constellations
• 7 miscellaneous

Clustering and re-clustering is entirely automated.
Motivation: Visualization & EDA

• Exploratory data analysis (EDA), related to visualization
  – Get a feeling for what the data look like
  – Try to find overall trends or patterns in text collections
Visualization
• Use clustering to map the entire huge multidimensional document space into a large number of small clusters
• “Project” these onto a 2D graphical representation
• Looks neat, but it is difficult to detect patterns
  – Usefulness debatable
Motivation: Clustering for Information Retrieval

• The cluster hypothesis states the fundamental assumption we make when using clustering in information retrieval:
  – Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs
• Tends to place similar docs together
Search result clustering
• Instead of lists, cluster the search results, so that similar documents appear together
• It is often easier to scan a few coherent groups than many individual documents
  – Particularly useful if a search term has different word senses
  – Vivísimo search engine (http://vivisimo.com)
Motivation: unsupervised classification
• Classification when labeled data is not available
  – Also called unsupervised classification
  – Results of clustering depend only on the natural divisions in the data, not on any pre-existing categorization scheme
Classification
Class1
Class2
Clustering
Methods
• Hard/soft clustering
• Flat/hierarchical clustering
• Similarity measures
• Merging methods
Text Clustering
Clustering is “the art of finding groups in data.” (Kaufman and Rousseeuw)

[Scatter plot of points in a Term 1 vs. Term 2 space]
Hard/soft Clustering
– Hard clustering: each object belongs to a single cluster
– Soft clustering: each object is probabilistically assigned to clusters
Soft clustering
• A variation of many clustering methods
• Instead of assigning each data sample to one and only one cluster, it calculates probabilities of membership for all clusters
  – A sample might belong to cluster A with probability 0.4 and to cluster B with probability 0.6
• More appropriate for NLP tasks
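As a toy illustration (not from the slides), one common way to obtain soft memberships is to turn a sample's distances to the cluster centers into probabilities. The function name and the softmax-style weighting here are my own choices:

```python
import math

def soft_assign(distances, temperature=1.0):
    """Turn a sample's distances to each cluster center into membership
    probabilities: closer centers get higher weight, and weights sum to 1."""
    weights = [math.exp(-d / temperature) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

# A sample slightly closer to the first center gets a split like 0.6 / 0.4,
# in the spirit of the slide's cluster-A/cluster-B example.
probs = soft_assign([1.0, 1.5])
```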
Flat Vs. Hierarchical
• Flat clustering creates a flat set of clusters without any explicit structure that would relate clusters to each other
• Hierarchical clustering produces a hierarchy of nodes
  – Leaves are the single objects of the clustered set
  – Each node represents the cluster that contains all the nodes of its descendants
From http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
Flat Vs. Hierarchical
• Flat
  – Preferable if efficiency is a consideration or data sets are very large
  – K-means is a very simple method that should probably be used first on a new data set, because its results are often sufficient
  – K-means assumes a simple Euclidean representation and so cannot be used for many data sets, for example nominal data like colors
  – In such cases, use EM (expectation-maximization)
Flat Vs. Hierarchical
• Hierarchical
  – Preferable for detailed data analysis
  – Provides more information than flat clustering
  – Does not require us to pre-specify the number of clusters
  – Less efficient: the most common hierarchical clustering algorithms have a complexity that is at least quadratic in the number of documents, compared to the linear complexity of most flat clustering methods
Clustering issues
• Two main issues:
  – Similarity measure
  – How to cluster data points together (or not)
    • Clustering algorithms
    • Merging criteria
Similarity
• Vector-space representation and similarity computation
• Select important distributional properties of a word
• Create a vector of length n for each word to be classified
• Viewing the n-dimensional vector as a point in an n-dimensional space, cluster points that are near one another
Similarity
From Michael Collins’s slides (MIT 6.864 NLP course)
Pair-wise Document Similarity
|   | nova | galaxy | heat | h’wood | film | role | diet | fur |
|---|------|--------|------|--------|------|------|------|-----|
| A | 1    | 3      | 1    |        |      |      |      |     |
| B | 5    | 2      |      |        |      |      |      |     |
| C |      |        |      | 2      | 1    | 5    |      |     |
| D |      |        |      | 4      | 1    |      |      |     |

(Column placement for B and D is inferred from the similarity computations that follow; the original table’s alignment was lost in extraction.)

How to compute document similarity?
Pair-wise Document Similarity (no normalization for simplicity)

|   | nova | galaxy | heat | h’wood | film | role | diet | fur |
|---|------|--------|------|--------|------|------|------|-----|
| A | 1    | 3      | 1    |        |      |      |      |     |
| B | 5    | 2      |      |        |      |      |      |     |
| C |      |        |      | 2      | 1    | 5    |      |     |
| D |      |        |      | 4      | 1    |      |      |     |

With documents represented as term-weight vectors

D_1 = (w_{11}, w_{12}, \ldots, w_{1t}), \quad D_2 = (w_{21}, w_{22}, \ldots, w_{2t})

the similarity is the dot product

\mathrm{sim}(D_1, D_2) = \sum_{i=1}^{t} w_{1i}\, w_{2i}

sim(A, B) = (1×5) + (3×2) = 11
sim(A, C) = 0
sim(A, D) = 0
sim(B, C) = 0
sim(B, D) = 0
sim(C, D) = (2×4) + (1×1) = 9
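The computation above can be checked in a few lines. The sparse term-weight dictionaries below use the column placement inferred for B and D (the original table's alignment was lost), and `sim` is the plain dot product from the slide:

```python
# Term weights for the four example documents; only nonzero entries stored.
docs = {
    "A": {"nova": 1, "galaxy": 3, "heat": 1},
    "B": {"nova": 5, "galaxy": 2},
    "C": {"h'wood": 2, "film": 1, "role": 5},
    "D": {"h'wood": 4, "film": 1},
}

def sim(d1, d2):
    """Unnormalized dot-product similarity over shared terms."""
    return sum(w * d2[t] for t, w in d1.items() if t in d2)

print(sim(docs["A"], docs["B"]))  # 11
print(sim(docs["C"], docs["D"]))  # 9
print(sim(docs["A"], docs["C"]))  # 0
```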
Pair-wise Document Similarity (cosine normalization)

D_1 = (w_{11}, w_{12}, \ldots, w_{1t}), \quad D_2 = (w_{21}, w_{22}, \ldots, w_{2t})

\mathrm{sim}(D_1, D_2) = \sum_{i=1}^{t} w_{1i}\, w_{2i} \quad \text{(unnormalized)}

\mathrm{sim}(D_1, D_2) = \frac{\sum_{i=1}^{t} w_{1i}\, w_{2i}}{\sqrt{\sum_{i=1}^{t} w_{1i}^2}\; \sqrt{\sum_{i=1}^{t} w_{2i}^2}} \quad \text{(normalized, cosine)}
Document/Document Matrix
|      | D_1    | D_2    | ... | D_n    |
|------|--------|--------|-----|--------|
| D_1  | d_11   | d_12   | ... | d_1n   |
| D_2  | d_21   | d_22   | ... | d_2n   |
| ...  | ...    | ...    | ... | ...    |
| D_n  | d_n1   | d_n2   | ... | d_nn   |

where d_ij = similarity of D_i to D_j
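The matrix can be built directly from any pairwise measure. The three tiny documents below are made up for illustration; with a symmetric measure like the dot product, d_ij = d_ji:

```python
def dot(d1, d2):
    """Dot-product similarity over shared terms."""
    return sum(w * d2.get(t, 0) for t, w in d1.items())

docs = {"D1": {"star": 1, "film": 2}, "D2": {"star": 3}, "D3": {"film": 1}}

# matrix[i][j] = d_ij = similarity of D_i to D_j
matrix = {i: {j: dot(di, dj) for j, dj in docs.items()} for i, di in docs.items()}
```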
Similarity
• And many other similarity measures!
Flat Clustering: K-means
• K-means is the most important flat clustering algorithm.
• The objective is to minimize the average squared Euclidean distance of documents from their cluster centers, where a cluster center is defined as the mean, or centroid, μ of the documents in a cluster ω:

  \vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}
K-Means Clustering
• Decide on a pair-wise similarity measure
1. Compute K centroids
2. Assign each document to the nearest centroid, forming new clusters
3. Unless the termination condition is met, repeat steps 1-2
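The loop above can be sketched compactly, assuming 2-D points, Euclidean distance, and randomly sampled initial centroids (all names here are my own; this is a sketch, not the slide's exact pseudocode):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain K-means: assign each point to the nearest centroid,
    recompute centroids as cluster means, repeat until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        new = [(sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
               if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # termination condition: centroids stopped moving
            break
        centroids = new
    return centroids, clusters

# Two well-separated blobs; K = 2 recovers them.
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(points, 2)
```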
K-means algorithm

A K-means example for K = 2 in R²
From http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
K-means algorithm
• Convergence of the position of the two centroids
From http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
K-means
• Residual sum of squares (RSS): a measure of how well the centroids represent the members of their clusters
  – RSS is the squared distance of each vector from its centroid, summed over all vectors
  – RSS is the objective function in K-means, and our goal is to minimize it
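Written out, the objective described above (following the IR-book definitions the slides draw on) is

```latex
\mathrm{RSS} \;=\; \sum_{k=1}^{K} \sum_{\vec{x} \in \omega_k}
  \left\lVert \vec{x} - \vec{\mu}(\omega_k) \right\rVert^{2},
\qquad
\vec{\mu}(\omega) \;=\; \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}
```

where ω_k is the k-th cluster and μ(ω_k) its centroid.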
Model-based clustering
• Model-based clustering assumes that the data were generated by a model and tries to recover the original model from the data. (Flat)
• The model that we recover from the data then defines clusters and an assignment of documents to clusters.
• EM (expectation-maximization)
Hierarchical Clustering
• Agglomerative or bottom-up:
  – Initialization: start with each sample in its own cluster
  – Each iteration: find the two most similar (closest) clusters and merge them
  – Termination: all the objects are in the same cluster
• Divisive or top-down:
  – Start with all elements in one cluster
  – Partition one of the current clusters in two
  – Repeat until all samples are in singleton clusters
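The bottom-up loop can be sketched as follows, using single-link distance as the merging criterion (one of the criteria discussed below); the nested search over cluster pairs is written for clarity, not efficiency:

```python
def agglomerative(items, dist, target=1):
    """Bottom-up clustering sketch: start with singleton clusters and
    repeatedly merge the two closest, until `target` clusters remain."""
    clusters = [[x] for x in items]
    while len(clusters) > target:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link: distance between the two closest members
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# 1-D toy data: three close points and a separate pair.
points = [0.0, 0.1, 0.2, 5.0, 5.1]
two = agglomerative(points, dist=lambda a, b: abs(a - b), target=2)
```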
Agglomerative Clustering
A B C D E F G H I
Merging nodes/Clustering function
• Each node is a combination of the documents combined below it
• We represent the merged nodes as a vector of term weights
• This vector is referred to as the cluster centroid
Clustering functions (aka merging criteria)

• Extend the distance measure from samples to sets of samples:
  – Single link: similarity of the two most similar members
  – Complete link: similarity of the two least similar members
  – Average link: average similarity between members
From Michael Collins’s slides (MIT 6.864 NLP course)
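As code, the three criteria look like this, phrased here as distances between clusters rather than similarities (function names my own):

```python
def single_link(A, B, dist):
    """Distance via the two closest members (slide: 'most similar members')."""
    return min(dist(a, b) for a in A for b in B)

def complete_link(A, B, dist):
    """Distance via the two farthest members (slide: 'least similar members')."""
    return max(dist(a, b) for a in A for b in B)

def average_link(A, B, dist):
    """Average pairwise distance between members."""
    return sum(dist(a, b) for a in A for b in B) / (len(A) * len(B))

d = lambda a, b: abs(a - b)
A, B = [0, 1], [4, 6]
print(single_link(A, B, d), complete_link(A, B, d), average_link(A, B, d))
```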
Single-link merging criteria
Merge the closest pair of clusters.
Single-link: clusters are close if any of their points are close:

  dist(A, B) = min dist(a, b) for a ∈ A, b ∈ B

Initially, each word type is a single-point cluster; then merge.
Bottom-Up Clustering – Single-Link

Fast, but tends to get long, stringy, meandering clusters...
Bottom-Up Clustering – Complete-Link
Again, merge the closest pair of clusters.
Complete-link: dist(A, B) = max dist(a, b) for a ∈ A, b ∈ B
(The distance between clusters is the distance between their farthest members.)
Bottom-Up Clustering – Complete-Link
Slow to find the closest pair: needs quadratically many distances.
Choosing k
• How to select an appropriate level of granularity?
• Too small, and clusters provide insufficient generalization
• Too large, and they are inappropriately generalized
Choosing k
• In both hierarchical and k-means/medians, we need to be told where to stop, i.e., how many clusters to form
• This is partially alleviated by visual inspection of the hierarchical tree (the dendrogram)
• It would be nice if we could find an optimal k from the data
• We can do this by trying different values of k and seeing which produces the best separation among the resulting clusters.
• And there are some theoretical measures
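One crude way to compare values of k, as suggested above: run the clustering at each k and watch how much the objective (RSS) improves. The centroids below are hypothetical stand-ins for the output of a k-means run at each k; in the classic "elbow" heuristic, the improvement flattens once k matches the natural grouping:

```python
def rss(points, centroids):
    """Sum of squared distances of each (1-D) point to its nearest centroid."""
    return sum(min((p - c) ** 2 for c in centroids) for p in points)

points = [0.0, 0.1, 0.2, 5.0, 5.1]

# RSS always drops as k grows; look for where adding a cluster stops helping.
curve = {
    1: rss(points, [2.08]),          # overall mean
    2: rss(points, [0.1, 5.05]),     # the two natural blobs
    3: rss(points, [0.0, 0.15, 5.05]),  # an unnecessary extra split
}
```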
How to evaluate clusters?
• In practice, it’s hard to do
  – Different algorithms’ results look good and bad in different ways
  – It’s difficult to distinguish their outcomes
• In theory, define an evaluation function
  – Typically choose something easy to measure (e.g., the sum of the average distance in each class)
How to evaluate clusters?
• Perform task-based evaluation
• Test the resulting clusters intuitively, i.e., inspect them and see if they make sense. (Not advisable.)
• Have an expert generate clusters manually, and test the automatically generated ones against them.
• Test the clusters against a predefined classification, if there is one.
From Michael Collins’s slides (MIT 6.864 NLP course)
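The last option is easy to quantify. Purity is one standard measure for comparing clusters against a predefined classification (my choice of metric here, not named on the slide): each cluster is scored by its majority label, and the majority counts are summed over all clusters:

```python
from collections import Counter

def purity(clusters, labels):
    """Fraction of samples whose cluster's majority label matches their own."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(labels[i] for i in c).most_common(1)[0][1]
                   for c in clusters)
    return majority / total

# Made-up gold labels and induced clusters: one misplace out of five samples.
labels = {0: "sports", 1: "sports", 2: "film", 3: "film", 4: "film"}
clusters = [[0, 1, 4], [2, 3]]
print(purity(clusters, labels))  # 0.8
```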
Resources
• FCLUSTER: a tool for fuzzy cluster analysis
• LNKnet pattern classification software
• Principal Direction Divisive Partitioning
• k-means clustering
• Text clustering: http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf (Chapters 16 and 17)