TRANSCRIPT
Statistics 202: Data Mining
Hierarchical clustering
© Jonathan Taylor
Based in part on slides from the textbook and slides of Susan Holmes
December 2, 2012
Hierarchical clustering
Description
Produces a set of nested clusters organized as a hierarchical tree.
Can be visualized as a dendrogram: a tree-like diagram that records the sequence of merges or splits (a minimal plotting sketch follows).
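As a concrete illustration (not from the slides), here is a minimal sketch of producing a dendrogram with SciPy; the data are randomly generated for the example:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))      # 20 illustrative points in the plane

Z = linkage(X, method="single")   # record the sequence of merges
dendrogram(Z)                     # tree-like diagram of those merges
plt.xlabel("point index")
plt.ylabel("merge distance")
plt.show()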
Hierarchical clustering
A clustering and its dendrogram.
Hierarchical clustering
Strengths
Do not have to assume any particular number of clusters: each horizontal cut of the tree yields a clustering.
The tree may correspond to a meaningful taxonomy (e.g., the animal kingdom, phylogeny reconstruction, ...).
Need only a similarity or distance matrix for implementation (see the sketch below).
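A minimal sketch of these two points, assuming SciPy: the tree is built from pairwise distances alone, and fcluster cuts it at a chosen height. The data and the threshold are illustrative:

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))

D = pdist(X)                       # condensed pairwise distance matrix
Z = linkage(D, method="average")   # needs only distances, not the raw data
labels = fcluster(Z, t=1.5, criterion="distance")  # horizontal cut at height 1.5
print(labels)                      # one cluster label per point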
Hierarchical clustering
Agglomerative
Start with the points as individual clusters.
At each step, merge the closest pair of clusters, until only one cluster (or some fixed number k of clusters) remains.
Hierarchical clustering
Divisive
Start with one, all-inclusive cluster.
At each step, split a cluster, until each cluster contains a single point (or there are k clusters).
Hierarchical clustering
Agglomerative Clustering Algorithm
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. While there is more than one cluster:
   1. Merge the two closest clusters.
   2. Update the proximity matrix.
The major difference between schemes is how the proximity of two clusters is computed; a naive sketch of the loop itself follows.
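This sketch assumes a precomputed symmetric distance matrix D and uses single linkage (MIN) as the cluster proximity; the function name is ours, and real implementations are far more efficient:

import numpy as np

def agglomerative(D, k=1):
    """Merge clusters greedily until only k remain."""
    n = D.shape[0]
    clusters = [[i] for i in range(n)]       # step 2: each point is a cluster
    while len(clusters) > k:                 # step 3
        best = (np.inf, None, None)
        for i in range(len(clusters)):       # step 3.1: find the closest pair
            for j in range(i + 1, len(clusters)):
                d = min(D[p, q] for p in clusters[i] for q in clusters[j])
                if d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]           # merge the two closest clusters
        del clusters[j]                      # recomputing distances each pass
    return clusters                          # stands in for the matrix update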
Hierarchical clustering
Starting point for agglomerative clustering.
Hierarchical clustering
Intermediate point: after some merging steps, we have 5 clusters (C1, ..., C5) and their proximity matrix.
Hierarchical clustering
We will merge C2 and C5.
Hierarchical clustering
After merging C2 and C5, the question is: how do we update the proximity matrix, i.e., how do we fill in the entries for the new cluster C2 ∪ C5?
Hierarchical clustering
How to define inter-cluster similarity? Candidate notions:
MIN
MAX
Group average
Distance between centroids
Other methods driven by an objective function (Ward's method uses squared error)
We need a notion of similarity between clusters.
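Sketches of the candidate notions above, written directly from their verbal descriptions (the function names are ours; A and B are NumPy arrays of points):

import numpy as np

def d_min(A, B):             # MIN / single linkage
    return min(np.linalg.norm(a - b) for a in A for b in B)

def d_max(A, B):             # MAX / complete linkage
    return max(np.linalg.norm(a - b) for a in A for b in B)

def d_group_average(A, B):   # group average
    return float(np.mean([np.linalg.norm(a - b) for a in A for b in B]))

def d_centroid(A, B):        # distance between centroids
    return float(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))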
Hierarchical clustering
Single linkage uses the minimum distance.
Hierarchical clustering
Complete linkage uses the maximum distance.
Hierarchical clustering
Group average linkage uses the average distance between groups.
Hierarchical clustering
Centroid linkage uses the distance between the centroids of the clusters (presumes one can compute centroids...).
Hierarchical clustering
Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters: it is determined by one pair of points, i.e., by one link in the proximity graph.
Proximity matrix and dendrogram of single linkage.
Hierarchical clustering
Distance matrix for nested clusterings
      1     2     3     4     5     6
1  0.00  0.24  0.22  0.37  0.34  0.23
2  0.24  0.00  0.15  0.20  0.14  0.25
3  0.22  0.15  0.00  0.15  0.28  0.11
4  0.37  0.20  0.15  0.00  0.29  0.22
5  0.34  0.14  0.28  0.29  0.00  0.39
6  0.23  0.25  0.11  0.22  0.39  0.00
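A sketch of how one might run single linkage on this matrix with SciPy (squareform converts the square matrix into the condensed form that linkage expects):

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

D = np.array([[0.00, 0.24, 0.22, 0.37, 0.34, 0.23],
              [0.24, 0.00, 0.15, 0.20, 0.14, 0.25],
              [0.22, 0.15, 0.00, 0.15, 0.28, 0.11],
              [0.37, 0.20, 0.15, 0.00, 0.29, 0.22],
              [0.34, 0.14, 0.28, 0.29, 0.00, 0.39],
              [0.23, 0.25, 0.11, 0.22, 0.39, 0.00]])

Z = linkage(squareform(D), method="single")
print(Z)   # each row: the pair of clusters merged and the merge distance;
           # the first merge is points 3 and 6 (0-based indices 2 and 5),
           # the closest pair at distance 0.11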
Hierarchical clustering
Nested cluster representation and dendrogram of single linkage.
Hierarchical clustering
Single linkage
Can handle irregularly shaped regions fairly naturally.
Sensitive to noise and outliers in the form of “chaining”.
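A sketch of both properties using scikit-learn's agglomerative clustering; the dataset and noise levels are illustrative, not from the lecture:

from sklearn.datasets import make_moons
from sklearn.cluster import AgglomerativeClustering

# Single linkage follows the two crescent-shaped regions naturally...
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
clean = AgglomerativeClustering(n_clusters=2, linkage="single").fit(X)

# ...but noise points between the moons can "chain" the clusters together.
X_noisy, _ = make_moons(n_samples=200, noise=0.2, random_state=0)
noisy = AgglomerativeClustering(n_clusters=2, linkage="single").fit(X_noisy)

print(clean.labels_)   # typically two balanced crescents
print(noisy.labels_)   # often one big chained cluster plus a stray point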
The Iris data (single linkage)
The Iris data (single linkage)
Hierarchical clustering
Cluster Similarity: MAX or Complete Linkage
Similarity of two clusters is based on the two least similar (most distant) points in the different clusters: it is determined by all pairs of points in the two clusters.
Proximity matrix and dendrogram of complete linkage.
Hierarchical clustering
Nested cluster representation and dendrogram of complete linkage.
Hierarchical clustering
Complete linkage
Less sensitive to noise and outliers than single linkage.
Regions are generally compact, but may violate "closeness": a point may be much closer to some points in a neighbouring cluster than to points in its own cluster.
This manifests itself as breaking large clusters.
Clusters are biased to be globular.
The Iris data (complete linkage)
The Iris data (complete linkage)
The Iris data (complete linkage)
Hierarchical clustering
Nested cluster representation and dendrogram of group average linkage.
Hierarchical clustering
Average linkage
Given two elements of the partition $C_r, C_s$, we might consider
$$d_{GA}(C_r, C_s) = \frac{1}{|C_r|\,|C_s|} \sum_{x \in C_r,\; y \in C_s} d(x, y).$$
A compromise between single and complete linkage.
Shares the globular clusters of complete linkage, but is less sensitive to noise than single linkage.
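For instance, with the distance matrix from the single-linkage example above, the group average distance between the clusters $\{3, 6\}$ and $\{2, 5\}$ would be
$$d_{GA}(\{3,6\}, \{2,5\}) = \frac{0.15 + 0.28 + 0.25 + 0.39}{2 \cdot 2} = 0.2675.$$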
The Iris data (average linkage)
The Iris data (average linkage)
Hierarchical clustering
Ward’s linkage
Similarity of two clusters is based on the increase in squared error when the two clusters are merged.
Similar to average linkage if the dissimilarity between points is squared distance; hence, it shares many properties of average linkage.
A hierarchical analogue of K-means.
Sometimes used to initialize K-means, as in the sketch below.
The Iris data (Ward’s linkage)
The Iris data (Ward’s linkage)
NCI data (complete linkage)
NCI data (single linkage)
NCI data (average linkage)
NCI data (Ward’s linkage)
Hierarchical clustering
Computational issues
$O(n^2)$ space, since it uses the proximity matrix.
$O(n^3)$ time in many cases: there are $n$ merge steps, and at each step a matrix of size $O(n^2)$ must be updated and/or searched.
Hierarchical clustering
Statistical issues
Once a decision is made to combine two clusters, it cannot be undone.
No objective function is directly minimized.
Different schemes have problems with one or more of the following:
Sensitivity to noise and outliers.
Difficulty handling different-sized clusters and convex shapes.
Breaking large clusters.