Data Mining Summer School
CLUSTER ANALYSIS
Assoc. Prof. NGUYEN Tri Thanh
University of Engineering and Technology
August 11, 2016
Outline
1. Introduction
2. Clustering applications
3. A Categorization of Major Clustering Methods
4. Data representation
5. Flat clustering
6. Hierarchical clustering
7. Cluster evaluation
8. Summary
9. Discussion
Introduction
- Cluster analysis (clustering)
  - Given a set of objects, group similar data into clusters according to their characteristics
- Cluster: a collection of data objects
  - similar (or related) to one another within the same group
  - dissimilar (or unrelated) to the objects in other groups
- How to identify the similarity?
- How to identify the number of clusters?
Introduction
- Clustering comes naturally to humans
  - Grouping animals, plants
  - Grouping students
  - Grouping customers
  - Facebook groups
  - Interest groups
  - ...
Clustering vs Classification
- Clustering
  - No predefined classes (unknown number of clusters)
  - Unlabeled data objects
  - Unsupervised learning
- Classification
  - Predefined classes
  - Labeled data objects
  - Predict/identify the class of an unlabeled object
  - Supervised learning
Clustering applications
- Image processing and pattern recognition
- Spatial data analysis
  - Create thematic maps in GIS by clustering feature spaces
  - Detect spatial clusters, or support other spatial mining tasks
- Economic science (especially market research)
- WWW
  - Document classification
  - Cluster Weblog data to discover groups of similar access patterns
Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults
Categorization of Clustering Approaches
- Partitioning approach:
  - Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach:
  - Create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
- Density-based approach:
  - Based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DenClue
- Grid-based approach:
  - Based on a multiple-level granularity structure
  - Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches (II)
- Model-based:
  - A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model
  - Typical methods: EM, SOM, COBWEB
- Frequent pattern-based:
  - Based on the analysis of frequent patterns
  - Typical methods: p-Cluster
- User-guided or constraint-based:
  - Clustering by considering user-specified or application-specific constraints
  - Typical methods: COD (obstacles), constrained clustering
- Link-based clustering:
  - Objects are often linked together in various ways
  - Massive links can be used to cluster objects: SimRank, LinkClus
Flat clustering
Data object representation
- A set/vector of features/attributes
  - Person: name, age, sex, job, ...
  - Text: set of distinct words
- Similarity
  - Distance: the smaller, the more similar
  - Similarity: the bigger, the more similar
Cosine similarity between two document vectors d1 and d2:

$$sim(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\|\,\|d_2\|} = \frac{\sum_{i=1}^{n} d_{1i}\,d_{2i}}{\sqrt{\sum_{i=1}^{n} d_{1i}^2}\,\sqrt{\sum_{i=1}^{n} d_{2i}^2}}$$

Euclidean distance:

$$dis(d_1, d_2) = \sqrt{\sum_{i=1}^{n} (d_{1i} - d_{2i})^2}$$
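To make these two measures concrete, here is a minimal Python sketch; the function names (cosine_sim, euclidean_dist) are illustrative, not from the slides:

```python
import math

def cosine_sim(d1, d2):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(d1, d2))
    norm1 = math.sqrt(sum(x * x for x in d1))
    norm2 = math.sqrt(sum(y * y for y in d2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def euclidean_dist(d1, d2):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(d1, d2)))

# Identical vectors have similarity 1 and distance 0
print(cosine_sim([1, 2, 0], [1, 2, 0]))      # 1.0
print(euclidean_dist([1, 2, 0], [2, 0, 1]))  # sqrt(1 + 4 + 1) ~ 2.449
```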
K-means Clustering
- The number of clusters (k) is known in advance
- Each cluster is represented by the centroid of the documents that belong to it:

$$c = \frac{1}{|S|} \sum_{d \in S} d$$

- Cluster membership is determined by the most similar cluster centroid

Algorithm:
1. Select k documents from S to be used as cluster centroids. This is usually done at random.
2. Assign documents to clusters according to their similarity to the cluster centroids, i.e. for each document find the most similar centroid and assign the document to the corresponding cluster.
3. For each cluster, recompute the cluster centroid using the newly computed cluster members.
4. Go to Step 2 until the process converges, i.e. the same documents are assigned to each cluster in two consecutive iterations, or the cluster centroids remain the same.
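A minimal Python sketch of these four steps, reusing cosine_sim from the earlier sketch; the names and the max_iter safeguard are illustrative additions:

```python
import random

def kmeans(docs, k, sim=cosine_sim, max_iter=100):
    """Basic k-means over a list of numeric vectors (Steps 1-4 above)."""
    centroids = random.sample(docs, k)   # Step 1: random initial centroids
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each document to its most similar centroid
        new_assignment = [max(range(k), key=lambda i: sim(centroids[i], d))
                          for d in docs]
        if new_assignment == assignment:  # Step 4: assignments stable, converged
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members
        for i in range(k):
            members = [d for d, a in zip(docs, assignment) if a == i]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[i] = [sum(v) / len(members) for v in zip(*members)]
    return assignment, centroids
```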
The K-Means Clustering Method
- Example (K = 2)

[Figure: scatter plots of successive k-means iterations on a 10x10 grid: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects and update the means again until assignments stabilize]
K-means Clustering Discussion
- In Step 2, documents are moved between clusters in order to maximize the intra-cluster similarity
- The clustering maximizes the criterion function (a measure for evaluating clustering quality)
- In distance-based k-means clustering, the criterion function is the sum of squared errors (based on Euclidean distance and means)
- For k-means clustering of documents, a function based on centroids and similarity is used:

$$J = \sum_{i=1}^{k} \sum_{d_j \in D_i} sim(c_i, d_j)$$
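As a sketch, this criterion can be computed directly from an assignment; criterion_j and its arguments are illustrative names, and cosine_sim is the helper defined earlier:

```python
def criterion_j(docs, assignment, centroids, sim=cosine_sim):
    """Sum, over all documents, of the similarity to the assigned centroid."""
    return sum(sim(centroids[a], d) for d, a in zip(docs, assignment))
```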
K-means Clustering Discussion (cont'd)
- A clustering that maximizes this function is called minimum variance clustering
- The k-means algorithm produces a minimum variance clustering, but does not guarantee that it always finds the global maximum of the criterion function
- After each iteration the value of J increases, but it may converge to a local maximum
- The result greatly depends on the initial choice of cluster centroids
Sample data
K-means Clustering Example (result)

[Figure: clustering of the CCSU Departments data with 6 TF-IDF attributes (k = 2)]
Hierarchical clustering
Hierarchical Partitioning
- Produces a nested sequence of partitions of the set of data points
- Can be displayed as a tree (called a dendrogram) with a single cluster including all points at the root and singleton clusters (individual points) at the leaves
- Example: hierarchical partitioning of the set of numbers {1, 2, 4, 5, 8, 10}
  - The similarity measure used in this example is computed as (10 - d)/10, where d is the distance between data points or cluster centers
Approaches to Hierarchical Partitioning
- Agglomerative
  - Starts with the individual data points and at each step merges the two closest (most similar) points (or clusters at later steps) until a single cluster remains
- Divisive
  - Starts with the original set of points and at each step splits a cluster until only individual points remain
  - To implement this approach we need to decide which cluster to split and how to perform the split
Approaches to Hierarchical Partitioning (cont'd)
- The agglomerative approach is more popular, as it needs only the definition of a distance or similarity function on clusters/points
Approaches to Hierarchical Partitioning (cont'd)
- For data points in Euclidean space, the Euclidean distance is the best choice
- For documents represented as TF-IDF vectors, the preferred measure is the cosine similarity, defined as follows:

$$sim(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\|\,\|d_2\|} = \frac{\sum_{i=1}^{n} d_{1i}\,d_{2i}}{\sqrt{\sum_{i=1}^{n} d_{1i}^2}\,\sqrt{\sum_{i=1}^{n} d_{2i}^2}}$$
Agglomerative Hierarchical Clustering
- There are several versions of this approach, depending on how the similarity sim(S1, S2) between clusters is defined (S1, S2 are sets of documents)
- Similarity between cluster centroids, i.e. sim(S1, S2) = sim(c1, c2), where the centroid c of cluster S is

$$c = \frac{1}{|S|} \sum_{d \in S} d$$

- Maximum similarity between documents from each cluster (nearest neighbor clustering):

$$sim(S_1, S_2) = \max_{d_1 \in S_1,\, d_2 \in S_2} sim(d_1, d_2)$$
Agglomerative Hierarchical Clustering (cont'd)
- Minimum similarity between documents from each cluster (farthest neighbor clustering):

$$sim(S_1, S_2) = \min_{d_1 \in S_1,\, d_2 \in S_2} sim(d_1, d_2)$$

- Average similarity between documents from each cluster:

$$sim(S_1, S_2) = \frac{1}{|S_1|\,|S_2|} \sum_{d_1 \in S_1,\, d_2 \in S_2} sim(d_1, d_2)$$
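The three definitions translate directly into code; a sketch with illustrative names, reusing cosine_sim from the earlier sketch:

```python
def max_sim(s1, s2, sim=cosine_sim):
    """Nearest neighbor (single-link) similarity between two clusters."""
    return max(sim(d1, d2) for d1 in s1 for d2 in s2)

def min_sim(s1, s2, sim=cosine_sim):
    """Farthest neighbor (complete-link) similarity between two clusters."""
    return min(sim(d1, d2) for d1 in s1 for d2 in s2)

def avg_sim(s1, s2, sim=cosine_sim):
    """Average pairwise similarity between two clusters."""
    return sum(sim(d1, d2) for d1 in s1 for d2 in s2) / (len(s1) * len(s2))
```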
Agglomerative Clustering Algorithm
- S is the initial set of documents and G is the clustering tree
- k and q are control parameters that stop merging
  - when a desired number of clusters (k) is reached
  - or when the similarity between the clusters to be merged falls below a specified threshold (q)
Agglomerative Clustering Algorithm (cont'd)
For n documents, both the time and the space complexity of the algorithm are O(n^2).

1. G ← {{d} | d ∈ S} (initialize G with singleton clusters, each containing one document from S)
2. If |G| ≤ k then exit (stop if the desired number of clusters is reached)
3. Find S_i, S_j ∈ G such that (i, j) = argmax_{(i,j)} sim(S_i, S_j) (find the two closest clusters)
4. If sim(S_i, S_j) < q then exit (stop if the similarity of the closest clusters is less than q)
5. Remove S_i and S_j from G
6. G ← G ∪ {S_i ∪ S_j} (merge S_i and S_j, and add the new cluster to the cluster hierarchy)
7. Go to 2
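A rough Python sketch of this loop, assuming the avg_sim helper above; it records merges in a flat list rather than building an explicit tree, and all names are illustrative:

```python
def agglomerative(docs, k=1, q=0.0, cluster_sim=avg_sim):
    """Merge the two most similar clusters until k clusters remain
    or the best similarity drops below the threshold q."""
    clusters = [[d] for d in docs]   # step 1: singleton clusters
    merges = []                      # recorded hierarchy of merges
    while len(clusters) > k:         # step 2
        # step 3: find the two closest clusters
        i, j = max(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]))
        best = cluster_sim(clusters[i], clusters[j])
        if best < q:                 # step 4: similarity below threshold
            break
        # steps 5-6: replace S_i and S_j with their union
        merges.append((clusters[i], clusters[j], best))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges
```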
Agglomerative Clustering Example 1
DIANA clustering algorithm
0. G ← {{d_1, d_2, ..., d_n}} (initialize G with a single cluster containing all documents)
1. Find the document d that is most distinct from the others in its cluster S, and start a splinter group S' ← {d}
2. For the remaining documents:
   2.1. For all d_i ∉ S', calculate l_i = avg_{d_j ∉ S'}(‖d_i − d_j‖) − avg_{d_h ∈ S'}(‖d_i − d_h‖)
   2.2. Find d_l having the maximal l_l; if l_l > 0, move it to the splinter group: S' ← S' ∪ {d_l}
3. Repeat Step 2 until there is no l_i > 0, then update G with the resulting split
4. Select the cluster having the biggest diameter

$$g = \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{j=1}^{m} \|d_i - d_j\|^2$$

and go to Step 1
5. Stop when all clusters contain a single document
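A rough sketch of a single DIANA-style split under these steps, assuming Euclidean distance (euclidean_dist from the earlier sketch); all names are illustrative:

```python
def diana_split(cluster, dist=euclidean_dist):
    """Split one cluster: seed a splinter group with the most distinct
    document, then move over documents closer to the splinter group."""
    if len(cluster) < 2:
        return cluster, []

    def avg_dist(d, group):
        return sum(dist(d, x) for x in group) / len(group)

    # Step 1: the seed has the largest average distance to the rest
    seed = max(cluster,
               key=lambda d: avg_dist(d, [x for x in cluster if x is not d]))
    splinter = [seed]
    rest = [d for d in cluster if d is not seed]

    moved = True
    while moved and len(rest) > 1:   # Steps 2-3
        moved = False
        # l_i > 0 means d_i is, on average, closer to the splinter group
        scores = [(avg_dist(d, [x for x in rest if x is not d])
                   - avg_dist(d, splinter), d) for d in rest]
        best_l, best_d = max(scores, key=lambda t: t[0])
        if best_l > 0:
            splinter.append(best_d)
            rest = [d for d in rest if d is not best_d]
            moved = True
    return splinter, rest
```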
Clustering evaluation
Web Content Mining
- Evaluating Clustering
  - Similarity-Based Criterion Functions
  - Probabilistic Criterion Functions
  - MDL-Based Model and Feature Evaluation
  - Classes to Clusters Evaluation
  - Precision, Recall and F-measure
  - Entropy
Similarity-Based Criterion Functions (distance)
- Basic idea: the cluster center c_i (centroid or mean in case of numeric data) best represents cluster D_i if it minimizes the sum of the lengths of the "error" vectors x - c_i for all x ∈ D_i:

$$J_e = \sum_{i=1}^{k} \sum_{x \in D_i} \|x - c_i\|^2, \qquad c_i = \frac{1}{|D_i|} \sum_{x \in D_i} x$$

- Alternative formulation based on the pairwise distance between cluster members:

$$J_e = \sum_{i=1}^{k} \frac{1}{2|D_i|} \sum_{x_j, x_l \in D_i} \|x_j - x_l\|^2$$
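Both forms can be computed and cross-checked with a short sketch (illustrative names; clusters is a list of clusters, each a list of numeric vectors, and euclidean_dist is the helper defined earlier):

```python
def sse(clusters):
    """Sum of squared errors J_e: squared distance of each member to its centroid."""
    total = 0.0
    for members in clusters:
        centroid = [sum(vals) / len(members) for vals in zip(*members)]
        total += sum(euclidean_dist(x, centroid) ** 2 for x in members)
    return total

def sse_pairwise(clusters):
    """Equivalent pairwise form: (1 / 2|D_i|) * sum of squared pairwise distances."""
    return sum(sum(euclidean_dist(xj, xl) ** 2 for xj in m for xl in m)
               / (2 * len(m)) for m in clusters)
```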
Similarity-Based Criterion Functions (cosine similarity)
- For document clustering, the (centroid) cosine similarity is used:

$$J_s = \sum_{i=1}^{k} \sum_{d_j \in D_i} sim(c_i, d_j), \qquad sim(c_i, d_j) = \frac{c_i \cdot d_j}{\|c_i\|\,\|d_j\|}, \qquad c_i = \frac{1}{|D_i|} \sum_{d_j \in D_i} d_j$$

- Equivalent form based on pairwise similarity:

$$J_s = \sum_{i=1}^{k} \frac{1}{2|D_i|} \sum_{d_j, d_k \in D_i} sim(d_j, d_k)$$

- Another formulation is based on the intracluster similarity sim(D_i), the average pairwise similarity within cluster D_i (used to control the merging of clusters in hierarchical agglomerative clustering):

$$J_s = \frac{1}{2} \sum_{i=1}^{k} |D_i|\, sim(D_i), \qquad sim(D_i) = \frac{1}{|D_i|^2} \sum_{d_j, d_l \in D_i} sim(d_j, d_l)$$
Classes to Clusters Evaluation
- Assume that the classification of the documents in a sample is known, i.e. each document has a class label
- Cluster the sample without using the class labels
- Assign to each cluster the class label of the majority of documents in it
- Compute the error as the proportion of documents with different class and cluster labels
- Or compute the accuracy as the proportion of documents with the same class and cluster label (see the sketch below)
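A small sketch of this procedure; the function name is illustrative, and class_labels and cluster_ids are parallel lists:

```python
from collections import Counter

def classes_to_clusters_accuracy(class_labels, cluster_ids):
    """Label each cluster with its majority class, then score the match."""
    majority = {}
    for cid in set(cluster_ids):
        members = [c for c, k in zip(class_labels, cluster_ids) if k == cid]
        majority[cid] = Counter(members).most_common(1)[0][0]
    correct = sum(1 for c, k in zip(class_labels, cluster_ids)
                  if c == majority[k])
    accuracy = correct / len(class_labels)
    return accuracy, 1 - accuracy   # accuracy and error

# Example: two clusters, one document disagrees with its cluster's majority
acc, err = classes_to_clusters_accuracy(['A', 'A', 'B', 'B'], [0, 0, 0, 1])
print(acc, err)  # 0.75 0.25
```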
Classes to Clusters Evaluation (Example)
Confusion matrix (contingency table)
- TP (True Positive), FN (False Negative), FP (False Positive), TN (True Negative)

$$Error = \frac{FP + FN}{TP + FP + TN + FN} \qquad Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$$
Precision and Recall

$$Precision = \frac{TP}{TP + FP} \qquad Recall = \frac{TP}{TP + FN}$$
F-Measure
- Combining precision and recall. For a generalized confusion matrix with m classes and k clusters, let n_ij be the number of documents of class i in cluster j, with n_j = Σ_i n_ij, n_i = Σ_j n_ij, and the total number of documents n = Σ_{i=1}^m Σ_{j=1}^k n_ij:

$$P(i, j) = \frac{n_{ij}}{n_j} \qquad R(i, j) = \frac{n_{ij}}{n_i}$$

$$F(i, j) = \frac{2\,P(i, j)\,R(i, j)}{P(i, j) + R(i, j)}$$

- Evaluating the whole clustering:

$$F = \sum_{i=1}^{m} \frac{n_i}{n} \max_{j=1,\dots,k} F(i, j)$$
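A sketch computing the overall F-measure from the generalized confusion matrix (illustrative names; n is an m x k list of lists with n[i][j] the number of documents of class i in cluster j, assumed nonzero per class):

```python
def clustering_f_measure(n):
    """Overall F-measure from an m x k matrix of class/cluster counts."""
    m, k = len(n), len(n[0])
    class_totals = [sum(row) for row in n]                               # n_i
    cluster_totals = [sum(n[i][j] for i in range(m)) for j in range(k)]  # n_j
    total = sum(class_totals)                                            # n

    def f(i, j):
        if n[i][j] == 0:
            return 0.0
        p = n[i][j] / cluster_totals[j]   # P(i, j)
        r = n[i][j] / class_totals[i]     # R(i, j)
        return 2 * p * r / (p + r)

    # weight each class by its share of documents, take its best cluster
    return sum(class_totals[i] / total * max(f(i, j) for j in range(k))
               for i in range(m))
```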
F-Measure (Example)
Entropy
- Consider the class label as a random event and evaluate its probability distribution in each cluster
- The probability of class i in cluster j is estimated by the proportion of occurrences of class label i in cluster j:

$$p_{ij} = \frac{n_{ij}}{n_j}$$

- The entropy is a measure of "impurity" and accounts for the average information in an arbitrary message about the class label:

$$H_j = -\sum_{i=1}^{m} p_{ij} \log p_{ij}$$
Entropy (cont'd)
- To evaluate the whole clustering, we sum up the entropies of the individual clusters, weighted by the proportion of documents in each:

$$H = \sum_{j=1}^{k} \frac{n_j}{n} H_j$$
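A short sketch of the weighted entropy (illustrative names; clusters is a list of lists of class labels, using log base 2 as in the examples below):

```python
import math
from collections import Counter

def clustering_entropy(clusters):
    """Weighted average of per-cluster class-label entropies."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for members in clusters:
        counts = Counter(members)
        h = -sum((c / len(members)) * math.log2(c / len(members))
                 for c in counts.values())
        total += len(members) / n * h
    return total

# A 50-50 cluster has entropy 1; a pure cluster has entropy 0
print(clustering_entropy([['A', 'B'], ['A', 'A']]))  # 0.5 * 1 + 0.5 * 0 = 0.5
```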
Entropy (Examples)
- A "pure" cluster, where all documents have a single class label, has entropy 0
- The highest entropy is achieved when all class labels have the same probability
- For example, for a two-class problem the 50-50 situation has the highest entropy: -0.5 log 0.5 - 0.5 log 0.5 = 1
Entropy (Examples) (cont'd)
- Compare the entropies of the previously discussed clusterings for the attributes offers and students:

$$H(\text{offers}) = \frac{11}{20}\left(-\frac{8}{11}\log\frac{8}{11} - \frac{3}{11}\log\frac{3}{11}\right) + \frac{9}{20}\left(-\frac{3}{9}\log\frac{3}{9} - \frac{6}{9}\log\frac{6}{9}\right) = 0.878176$$

$$H(\text{students}) = \frac{15}{20}\left(-\frac{10}{15}\log\frac{10}{15} - \frac{5}{15}\log\frac{5}{15}\right) + \frac{5}{20}\left(-\frac{1}{5}\log\frac{1}{5} - \frac{4}{5}\log\frac{4}{5}\right) = 0.869204$$
Summary
- Cluster analysis groups objects based on their similarity and has wide applications
- Measures of similarity can be computed for various types of data
- Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
- Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches
- There are still many open research issues in cluster analysis
Reference
- J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006
- Z. Markov and D. T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, John Wiley & Sons, 2007
- Nguyễn Hà Nam, Nguyễn Trí Thành, Hà Quang Thụy, Giáo trình khai phá dữ liệu [Data Mining Textbook], NXB ĐHQGHN [VNU Publishing House], 2014
Discussion