Data Mining Summer School
CLUSTER ANALYSIS
Assoc. Prof. NGUYEN Tri Thanh
University of Engineering and Technology
August 11, 2016
Outline
1. Introduction
2. Clustering applications
3. A Categorization of Major Clustering Methods
4. Data representation
5. Flat clustering
6. Hierarchical clustering
7. Cluster evaluation
8. Summary
9. Discussion
Introduction
- Cluster analysis (clustering)
  - Given a set of objects, group similar data into clusters according to their characteristics
- Cluster: a collection of data objects
  - similar (or related) to one another within the same group
  - dissimilar (or unrelated) to the objects in other groups
- How to identify the similarity?
- How to identify the number of clusters?
Introduction
- Clustering comes naturally to humans
  - Grouping animals, plants
  - Grouping students
  - Grouping customers
  - Facebook groups
  - Interest groups
  - ...
Clustering vs Classification
- Clustering
  - No predefined classes (unknown number of clusters)
  - Unlabeled data objects
  - Unsupervised learning
- Classification
  - Predefined classes
  - Labeled data objects
  - Predict/identify the class of an unlabeled object
  - Supervised learning
Clustering applications
- Image processing and pattern recognition
- Spatial data analysis
  - Create thematic maps in GIS by clustering feature spaces
  - Detect spatial clusters, or support other spatial mining tasks
- Economic science (especially market research)
- WWW
  - Document classification
  - Cluster Weblog data to discover groups of similar access patterns
Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults
Categorization of Clustering Approaches
- Partitioning approach:
  - Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach:
  - Create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
- Density-based approach:
  - Based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DenClue
- Grid-based approach:
  - Based on a multiple-level granularity structure
  - Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches (II)
- Model-based:
  - A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model
  - Typical methods: EM, SOM, COBWEB
- Frequent pattern-based:
  - Based on the analysis of frequent patterns
  - Typical methods: p-Cluster
- User-guided or constraint-based:
  - Clustering by considering user-specified or application-specific constraints
  - Typical methods: COD (obstacles), constrained clustering
- Link-based clustering:
  - Objects are often linked together in various ways
  - Massive links can be used to cluster objects: SimRank, LinkClus
Flat clustering
Data object representation
- A set/vector of features/attributes
  - Person: name, age, sex, job, ...
  - Text: set of distinct words
- Similarity
  - Distance: the smaller, the more similar
  - Similarity: the bigger, the more similar
Cosine similarity between two document vectors d1 and d2:

$$sim(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\|\,\|d_2\|} = \frac{\sum_{i=1}^{n} d_{1i}\,d_{2i}}{\sqrt{\sum_{i=1}^{n} d_{1i}^2}\,\sqrt{\sum_{i=1}^{n} d_{2i}^2}}$$

Euclidean distance:

$$dis(d_1, d_2) = \sqrt{\sum_{i=1}^{n} (d_{1i} - d_{2i})^2}$$
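To make these two measures concrete, here is a minimal Python sketch; the function names (cosine_sim, euclidean_dist) are illustrative, not from the slides:

```python
import math

def cosine_sim(d1, d2):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(d1, d2))
    norm1 = math.sqrt(sum(x * x for x in d1))
    norm2 = math.sqrt(sum(y * y for y in d2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def euclidean_dist(d1, d2):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(d1, d2)))

# Identical vectors have similarity 1 and distance 0
print(cosine_sim([1, 2, 0], [1, 2, 0]))      # 1.0
print(euclidean_dist([1, 2, 0], [2, 0, 1]))  # sqrt(1 + 4 + 1) ~ 2.449
```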
K-means Clustering
- The number of clusters (k) is known in advance
- Each cluster is represented by the centroid of the documents that belong to it:

$$c = \frac{1}{|S|} \sum_{d \in S} d$$

- Cluster membership is determined by the most similar cluster centroid

Algorithm:
1. Select k documents from S to be used as cluster centroids. This is usually done at random.
2. Assign documents to clusters according to their similarity to the cluster centroids, i.e. for each document find the most similar centroid and assign the document to the corresponding cluster.
3. For each cluster, recompute the cluster centroid using the newly computed cluster members.
4. Go to Step 2 until the process converges, i.e. the same documents are assigned to each cluster in two consecutive iterations, or the cluster centroids remain the same.
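A minimal Python sketch of these four steps, reusing cosine_sim from the earlier sketch; the names and the max_iter safeguard are illustrative additions:

```python
import random

def kmeans(docs, k, sim=cosine_sim, max_iter=100):
    """Basic k-means over a list of numeric vectors (Steps 1-4 above)."""
    centroids = random.sample(docs, k)   # Step 1: random initial centroids
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each document to its most similar centroid
        new_assignment = [max(range(k), key=lambda i: sim(centroids[i], d))
                          for d in docs]
        if new_assignment == assignment:  # Step 4: assignments stable, converged
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members
        for i in range(k):
            members = [d for d, a in zip(docs, assignment) if a == i]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[i] = [sum(v) / len(members) for v in zip(*members)]
    return assignment, centroids
```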
The K-Means Clustering Method
- Example (K = 2)

[Figure: scatter plots of successive k-means iterations on a 10x10 grid: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects and update the means again until assignments stabilize]
K-means Clustering Discussion
- In Step 2, documents are moved between clusters in order to maximize the intra-cluster similarity
- The clustering maximizes the criterion function (a measure for evaluating clustering quality)
- In distance-based k-means clustering, the criterion function is the sum of squared errors (based on Euclidean distance and means)
- For k-means clustering of documents, a function based on centroids and similarity is used:

$$J = \sum_{i=1}^{k} \sum_{d_j \in D_i} sim(c_i, d_j)$$
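As a sketch, this criterion can be computed directly from an assignment; criterion_j and its arguments are illustrative names, and cosine_sim is the helper defined earlier:

```python
def criterion_j(docs, assignment, centroids, sim=cosine_sim):
    """Sum, over all documents, of the similarity to the assigned centroid."""
    return sum(sim(centroids[a], d) for d, a in zip(docs, assignment))
```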
K-means Clustering Discussion (cont'd)
- A clustering that maximizes this function is called minimum variance clustering
- The k-means algorithm produces a minimum variance clustering, but does not guarantee that it always finds the global maximum of the criterion function
- After each iteration the value of J increases, but it may converge to a local maximum
- The result greatly depends on the initial choice of cluster centroids
Sample data
K-means Clustering Example (result)

[Figure: clustering of the CCSU Departments data with 6 TF-IDF attributes (k = 2)]
Hierarchical clustering
Hierarchical Partitioning
- Produces a nested sequence of partitions of the set of data points
- Can be displayed as a tree (called a dendrogram) with a single cluster including all points at the root and singleton clusters (individual points) at the leaves
- Example: hierarchical partitioning of the set of numbers {1, 2, 4, 5, 8, 10}
  - The similarity measure used in this example is computed as (10 - d)/10, where d is the distance between data points or cluster centers
Approaches to Hierarchical Partitioning
- Agglomerative
  - Starts with the individual data points and at each step merges the two closest (most similar) points (or clusters at later steps) until a single cluster remains
- Divisive
  - Starts with the original set of points and at each step splits a cluster until only individual points remain
  - To implement this approach we need to decide which cluster to split and how to perform the split
Approaches to Hierarchical Partitioning (cont'd)
- The agglomerative approach is more popular, as it needs only the definition of a distance or similarity function on clusters/points
Approaches to Hierarchical Partitioning (cont'd)
- For data points in Euclidean space, the Euclidean distance is the best choice
- For documents represented as TF-IDF vectors, the preferred measure is the cosine similarity, defined as follows:

$$sim(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\|\,\|d_2\|} = \frac{\sum_{i=1}^{n} d_{1i}\,d_{2i}}{\sqrt{\sum_{i=1}^{n} d_{1i}^2}\,\sqrt{\sum_{i=1}^{n} d_{2i}^2}}$$
Agglomerative Hierarchical Clustering
- There are several versions of this approach, depending on how the similarity sim(S1, S2) between clusters is defined (S1, S2 are sets of documents)
- Similarity between cluster centroids, i.e. sim(S1, S2) = sim(c1, c2), where the centroid c of cluster S is

$$c = \frac{1}{|S|} \sum_{d \in S} d$$

- Maximum similarity between documents from each cluster (nearest neighbor clustering):

$$sim(S_1, S_2) = \max_{d_1 \in S_1,\, d_2 \in S_2} sim(d_1, d_2)$$
Agglomerative Hierarchical Clustering (cont'd)
- Minimum similarity between documents from each cluster (farthest neighbor clustering):

$$sim(S_1, S_2) = \min_{d_1 \in S_1,\, d_2 \in S_2} sim(d_1, d_2)$$

- Average similarity between documents from each cluster:

$$sim(S_1, S_2) = \frac{1}{|S_1|\,|S_2|} \sum_{d_1 \in S_1,\, d_2 \in S_2} sim(d_1, d_2)$$
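The three definitions translate directly into code; a sketch with illustrative names, reusing cosine_sim from the earlier sketch:

```python
def max_sim(s1, s2, sim=cosine_sim):
    """Nearest neighbor (single-link) similarity between two clusters."""
    return max(sim(d1, d2) for d1 in s1 for d2 in s2)

def min_sim(s1, s2, sim=cosine_sim):
    """Farthest neighbor (complete-link) similarity between two clusters."""
    return min(sim(d1, d2) for d1 in s1 for d2 in s2)

def avg_sim(s1, s2, sim=cosine_sim):
    """Average pairwise similarity between two clusters."""
    return sum(sim(d1, d2) for d1 in s1 for d2 in s2) / (len(s1) * len(s2))
```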
Agglomerative Clustering Algorithm
- S is the initial set of documents and G is the clustering tree
- k and q are control parameters that stop merging
  - when a desired number of clusters (k) is reached
  - or when the similarity between the clusters to be merged falls below a specified threshold (q)
Agglomerative Clustering Algorithm (cont'd)
For n documents, both the time and the space complexity of the algorithm are O(n^2).

1. G ← {{d} | d ∈ S} (initialize G with singleton clusters, each containing one document from S)
2. If |G| ≤ k then exit (stop if the desired number of clusters is reached)
3. Find S_i, S_j ∈ G such that (i, j) = argmax_{(i,j)} sim(S_i, S_j) (find the two closest clusters)
4. If sim(S_i, S_j) < q then exit (stop if the similarity of the closest clusters is less than q)
5. Remove S_i and S_j from G
6. G ← G ∪ {S_i ∪ S_j} (merge S_i and S_j, and add the new cluster to the cluster hierarchy)
7. Go to 2
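A rough Python sketch of this loop, assuming the avg_sim helper above; it records merges in a flat list rather than building an explicit tree, and all names are illustrative:

```python
def agglomerative(docs, k=1, q=0.0, cluster_sim=avg_sim):
    """Merge the two most similar clusters until k clusters remain
    or the best similarity drops below the threshold q."""
    clusters = [[d] for d in docs]   # step 1: singleton clusters
    merges = []                      # recorded hierarchy of merges
    while len(clusters) > k:         # step 2
        # step 3: find the two closest clusters
        i, j = max(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]))
        best = cluster_sim(clusters[i], clusters[j])
        if best < q:                 # step 4: similarity below threshold
            break
        # steps 5-6: replace S_i and S_j with their union
        merges.append((clusters[i], clusters[j], best))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges
```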
Agglomerative Clustering Example 1
DIANA clustering algorithm
0. G ← {{d_1, d_2, ..., d_n}} (initialize G with a single cluster containing all documents)
1. Find the document d that is most distinct from the others in its cluster S, and start a splinter group S' ← {d}
2. For the remaining documents:
   2.1. For all d_i ∉ S', calculate l_i = avg_{d_j ∉ S'}(‖d_i − d_j‖) − avg_{d_h ∈ S'}(‖d_i − d_h‖)
   2.2. Find d_l having the maximal l_l; if l_l > 0, move it to the splinter group: S' ← S' ∪ {d_l}
3. Repeat Step 2 until there is no l_i > 0, then update G with the resulting split
4. Select the cluster having the biggest diameter

$$g = \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{j=1}^{m} \|d_i - d_j\|^2$$

and go to Step 1
5. Stop when all clusters contain a single document
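A rough sketch of a single DIANA-style split under these steps, assuming Euclidean distance (euclidean_dist from the earlier sketch); all names are illustrative:

```python
def diana_split(cluster, dist=euclidean_dist):
    """Split one cluster: seed a splinter group with the most distinct
    document, then move over documents closer to the splinter group."""
    if len(cluster) < 2:
        return cluster, []

    def avg_dist(d, group):
        return sum(dist(d, x) for x in group) / len(group)

    # Step 1: the seed has the largest average distance to the rest
    seed = max(cluster,
               key=lambda d: avg_dist(d, [x for x in cluster if x is not d]))
    splinter = [seed]
    rest = [d for d in cluster if d is not seed]

    moved = True
    while moved and len(rest) > 1:   # Steps 2-3
        moved = False
        # l_i > 0 means d_i is, on average, closer to the splinter group
        scores = [(avg_dist(d, [x for x in rest if x is not d])
                   - avg_dist(d, splinter), d) for d in rest]
        best_l, best_d = max(scores, key=lambda t: t[0])
        if best_l > 0:
            splinter.append(best_d)
            rest = [d for d in rest if d is not best_d]
            moved = True
    return splinter, rest
```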
Clustering evaluation
Web Content Mining
- Evaluating Clustering
  - Similarity-Based Criterion Functions
  - Probabilistic Criterion Functions
  - MDL-Based Model and Feature Evaluation
  - Classes to Clusters Evaluation
  - Precision, Recall and F-measure
  - Entropy
Similarity-Based Criterion Functions (distance)
- Basic idea: the cluster center c_i (centroid or mean in case of numeric data) best represents cluster D_i if it minimizes the sum of the lengths of the "error" vectors x - c_i for all x ∈ D_i:

$$J_e = \sum_{i=1}^{k} \sum_{x \in D_i} \|x - c_i\|^2, \qquad c_i = \frac{1}{|D_i|} \sum_{x \in D_i} x$$

- Alternative formulation based on the pairwise distance between cluster members:

$$J_e = \sum_{i=1}^{k} \frac{1}{2|D_i|} \sum_{x_j, x_l \in D_i} \|x_j - x_l\|^2$$
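Both forms can be computed and cross-checked with a short sketch (illustrative names; clusters is a list of clusters, each a list of numeric vectors, and euclidean_dist is the helper defined earlier):

```python
def sse(clusters):
    """Sum of squared errors J_e: squared distance of each member to its centroid."""
    total = 0.0
    for members in clusters:
        centroid = [sum(vals) / len(members) for vals in zip(*members)]
        total += sum(euclidean_dist(x, centroid) ** 2 for x in members)
    return total

def sse_pairwise(clusters):
    """Equivalent pairwise form: (1 / 2|D_i|) * sum of squared pairwise distances."""
    return sum(sum(euclidean_dist(xj, xl) ** 2 for xj in m for xl in m)
               / (2 * len(m)) for m in clusters)
```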
Similarity-Based Criterion Functions (cosine similarity)
- For document clustering, the (centroid) cosine similarity is used:

$$J_s = \sum_{i=1}^{k} \sum_{d_j \in D_i} sim(c_i, d_j), \qquad sim(c_i, d_j) = \frac{c_i \cdot d_j}{\|c_i\|\,\|d_j\|}, \qquad c_i = \frac{1}{|D_i|} \sum_{d_j \in D_i} d_j$$

- Equivalent form based on pairwise similarity:

$$J_s = \sum_{i=1}^{k} \frac{1}{2|D_i|} \sum_{d_j, d_k \in D_i} sim(d_j, d_k)$$

- Another formulation is based on the intracluster similarity sim(D_i), the average pairwise similarity within cluster D_i (used to control the merging of clusters in hierarchical agglomerative clustering):

$$J_s = \frac{1}{2} \sum_{i=1}^{k} |D_i|\, sim(D_i), \qquad sim(D_i) = \frac{1}{|D_i|^2} \sum_{d_j, d_l \in D_i} sim(d_j, d_l)$$
Classes to Clusters Evaluation
- Assume that the classification of the documents in a sample is known, i.e. each document has a class label
- Cluster the sample without using the class labels
- Assign to each cluster the class label of the majority of documents in it
- Compute the error as the proportion of documents with different class and cluster labels
- Or compute the accuracy as the proportion of documents with the same class and cluster label (see the sketch below)
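A small sketch of this procedure; the function name is illustrative, and class_labels and cluster_ids are parallel lists:

```python
from collections import Counter

def classes_to_clusters_accuracy(class_labels, cluster_ids):
    """Label each cluster with its majority class, then score the match."""
    majority = {}
    for cid in set(cluster_ids):
        members = [c for c, k in zip(class_labels, cluster_ids) if k == cid]
        majority[cid] = Counter(members).most_common(1)[0][0]
    correct = sum(1 for c, k in zip(class_labels, cluster_ids)
                  if c == majority[k])
    accuracy = correct / len(class_labels)
    return accuracy, 1 - accuracy   # accuracy and error

# Example: two clusters, one document disagrees with its cluster's majority
acc, err = classes_to_clusters_accuracy(['A', 'A', 'B', 'B'], [0, 0, 0, 1])
print(acc, err)  # 0.75 0.25
```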
Classes to Clusters Evaluation (Example)
Confusion matrix (contingency table)
- TP (True Positive), FN (False Negative), FP (False Positive), TN (True Negative)

$$Error = \frac{FP + FN}{TP + FP + TN + FN} \qquad Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$$
Precision and Recall

$$Precision = \frac{TP}{TP + FP} \qquad Recall = \frac{TP}{TP + FN}$$
F-Measure
- Combining precision and recall. For a generalized confusion matrix with m classes and k clusters, let n_ij be the number of documents of class i in cluster j, with n_j = Σ_i n_ij, n_i = Σ_j n_ij, and the total number of documents n = Σ_{i=1}^m Σ_{j=1}^k n_ij:

$$P(i, j) = \frac{n_{ij}}{n_j} \qquad R(i, j) = \frac{n_{ij}}{n_i}$$

$$F(i, j) = \frac{2\,P(i, j)\,R(i, j)}{P(i, j) + R(i, j)}$$

- Evaluating the whole clustering:

$$F = \sum_{i=1}^{m} \frac{n_i}{n} \max_{j=1,\dots,k} F(i, j)$$
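A sketch computing the overall F-measure from the generalized confusion matrix (illustrative names; n is an m x k list of lists with n[i][j] the number of documents of class i in cluster j, assumed nonzero per class):

```python
def clustering_f_measure(n):
    """Overall F-measure from an m x k matrix of class/cluster counts."""
    m, k = len(n), len(n[0])
    class_totals = [sum(row) for row in n]                               # n_i
    cluster_totals = [sum(n[i][j] for i in range(m)) for j in range(k)]  # n_j
    total = sum(class_totals)                                            # n

    def f(i, j):
        if n[i][j] == 0:
            return 0.0
        p = n[i][j] / cluster_totals[j]   # P(i, j)
        r = n[i][j] / class_totals[i]     # R(i, j)
        return 2 * p * r / (p + r)

    # weight each class by its share of documents, take its best cluster
    return sum(class_totals[i] / total * max(f(i, j) for j in range(k))
               for i in range(m))
```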
F-Measure (Example)
Entropy
- Consider the class label as a random event and evaluate its probability distribution in each cluster
- The probability of class i in cluster j is estimated by the proportion of occurrences of class label i in cluster j:

$$p_{ij} = \frac{n_{ij}}{n_j}$$

- The entropy is a measure of "impurity" and accounts for the average information in an arbitrary message about the class label:

$$H_j = -\sum_{i=1}^{m} p_{ij} \log p_{ij}$$
Entropy (cont'd)
- To evaluate the whole clustering, we sum up the entropies of the individual clusters, weighted by the proportion of documents in each:

$$H = \sum_{j=1}^{k} \frac{n_j}{n} H_j$$
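A short sketch of the weighted entropy (illustrative names; clusters is a list of lists of class labels, using log base 2 as in the examples below):

```python
import math
from collections import Counter

def clustering_entropy(clusters):
    """Weighted average of per-cluster class-label entropies."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for members in clusters:
        counts = Counter(members)
        h = -sum((c / len(members)) * math.log2(c / len(members))
                 for c in counts.values())
        total += len(members) / n * h
    return total

# A 50-50 cluster has entropy 1; a pure cluster has entropy 0
print(clustering_entropy([['A', 'B'], ['A', 'A']]))  # 0.5 * 1 + 0.5 * 0 = 0.5
```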
Entropy (Examples)
- A "pure" cluster, where all documents have a single class label, has entropy 0
- The highest entropy is achieved when all class labels have the same probability
- For example, for a two-class problem the 50-50 situation has the highest entropy: -0.5 log 0.5 - 0.5 log 0.5 = 1
Entropy (Examples) (cont'd)
- Compare the entropies of the previously discussed clusterings for the attributes offers and students:

$$H(\text{offers}) = \frac{11}{20}\left(-\frac{8}{11}\log\frac{8}{11} - \frac{3}{11}\log\frac{3}{11}\right) + \frac{9}{20}\left(-\frac{3}{9}\log\frac{3}{9} - \frac{6}{9}\log\frac{6}{9}\right) = 0.878176$$

$$H(\text{students}) = \frac{15}{20}\left(-\frac{10}{15}\log\frac{10}{15} - \frac{5}{15}\log\frac{5}{15}\right) + \frac{5}{20}\left(-\frac{1}{5}\log\frac{1}{5} - \frac{4}{5}\log\frac{4}{5}\right) = 0.869204$$
Summary
- Cluster analysis groups objects based on their similarity and has wide applications
- Measures of similarity can be computed for various types of data
- Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
- Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches
- There are still many open research issues in cluster analysis
Reference
- J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006
- Z. Markov and D. T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, John Wiley & Sons, 2007
- Nguyễn Hà Nam, Nguyễn Trí Thành, Hà Quang Thụy, Giáo trình khai phá dữ liệu [Data Mining Textbook], NXB ĐHQGHN [VNU Publishing House], 2014
Discussion