
Unsupervised learning: Clustering


Partitional clustering: basic idea

- Each data vector x_i is assigned to one of K clusters

- Typically K and a proximity measure are selected by the user, while the chosen algorithm then learns the actual partitions

- In the example below, K = 3 and the partitions are shown using color (red, green, blue)

[Figure: scatter plot of the data set partitioned into three colored clusters]


Hierarchical clustering: basic idea

[Figure: a data set X and (⇒) the corresponding cluster tree; the leaves are the individual data vectors (e.g. x_1, x_6, x_14, x_19, x_25), and three subtrees are shown in color]

- In this approach, data vectors are arranged in a tree, where nearby ('similar') vectors x_i and x_j are placed close to each other in the tree

- Any horizontal cut corresponds to a partitional clustering

- In the example above, the 3 colors have been added manually for emphasis (they are not produced by the algorithm)


Motivation for clustering

Understanding the data:

- Information retrieval:

  organizing a set of documents for easy browsing (for example a hierarchical structure over the documents), as we saw in the Carrot2 application


- Biology:

  creating a taxonomy of species, finding groups of genes with similar function, etc.


- Medicine:

  understanding the relations among diseases or psychological conditions, to aid in discovering the most useful treatments

[Embedded textbook page (Ch. 14, Unsupervised Learning, p. 522) showing Figure 14.12: "Dendrogram from agglomerative hierarchical clustering with average linkage to the human tumor microarray data." The leaves are labeled by tumor type: CNS, RENAL, BREAST, NSCLC, OVARIAN, MELANOMA, PROSTATE, LEUKEMIA, COLON, UNKNOWN, and the K562/MCF7 repro samples.]

From the accompanying textbook text: hierarchical methods impose hierarchical structure whether or not such structure actually exists in the data. The extent to which the hierarchical structure produced by a dendrogram actually represents the data itself can be judged by the cophenetic correlation coefficient. This is the correlation between the N(N-1)/2 pairwise observation dissimilarities d_{ii'} input to the algorithm and their corresponding cophenetic dissimilarities C_{ii'} derived from the dendrogram. The cophenetic dissimilarity C_{ii'} between two observations (i, i') is the intergroup dissimilarity at which observations i and i' are first joined together in the same cluster. The cophenetic dissimilarity is a very restrictive dissimilarity measure: the C_{ii'} over the observations must contain many ties, since only N-1 of the total N(N-1)/2 values can be distinct, and these dissimilarities obey the ultrametric inequality

  C_{ii'} \le \max\{C_{ik}, C_{i'k}\}.   (14.40)


- Business:

  grouping customers by their preferences or shopping behavior, for instance for targeted advertisement campaigns

For example:

- Customers who follow advertisements carefully, and when in the shop buy only what is on sale

- Customers who do not seem to react to advertisements at all

- Customers who are attracted by advertisements, and also buy other things in the store while there...

To whom should you send advertisements?


- Other motivations: simplifying the data for further processing/transmission

- Summarization:

  reduce the effective amount of data by considering only the prototypes rather than the original data vectors

- 'Lossy' compression:

  saving disk space by only storing a prototype vector which is 'close enough'


What is a cluster?

- Clusters are called well separated if every point is closer (more similar) to all other points in its cluster than to any point in some other cluster.

- Commonly, clusters are represented by 'cluster prototypes' or 'centers'. In this case it makes sense to require that each point is closer to its cluster prototype than to any other prototype.

[Embedded textbook Figure 8.2, "Different types of clusters as illustrated by sets of two-dimensional points":
 (a) Well-separated clusters. Each point is closer to all of the points in its cluster than to any point in another cluster.
 (b) Center-based clusters. Each point is closer to the center of its cluster than to the center of any other cluster.
 (c) Contiguity-based clusters. Each point is closer to at least one point in its cluster than to any point in another cluster.
 (d) Density-based clusters. Clusters are regions of high density separated by regions of low density.
 (e) Conceptual clusters. Points in a cluster share some general property that derives from the entire set of points. (Points in the intersection of the circles belong to both.)]


- We can also define clusters based on contiguity, requiring only that there are no 'gaps' in the clusters, as in (c) of Figure 8.2 above.

- Alternatively, we can use the density of various regions of the space to define clusters, as in (d) of Figure 8.2 above.


- Finally, we can use more sophisticated (perhaps application-specific) notions of clusters, though in high-dimensional cases such notions may be difficult to identify.


K-means

- We now describe a simple and often used partitional clustering method: K-means

- For simplicity, we will here describe it in Euclidean space, but extensions are possible (see textbook and exercise set 6)

- Notation:

    x_i  the i:th data vector, i = 1, ..., N
    N    the number of data vectors
    n    the number of attributes, i.e. the length of vector x_i
    K    the number of clusters (user-specified)
    c_j  the prototype vector for the j:th cluster
    a_i  the cluster assignment of data vector x_i, a_i ∈ {1, ..., K}
    C_j  the set of indices i of the x_i belonging to cluster j, i.e. C_j = {i : a_i = j}


K-means: pseudocode

- Input: A set of N points x_i, and the desired number of clusters K

- Output: A partition of the points into K clusters, i.e. an assignment a_i ∈ {1, ..., K} corresponding to each x_i, defining to which cluster each data vector belongs. Also returns the K centroids c_j, j = 1, ..., K.

- Pseudocode:

From the textbook (Section 8.2): K-means defines a prototype in terms of a centroid, usually the mean of a group of points, and is typically applied to objects in a continuous n-dimensional space. K-medoid defines a prototype in terms of a medoid, the most representative point of a group, and can be applied to a wide range of data since it requires only a proximity measure for a pair of objects; a centroid almost never corresponds to an actual data point, whereas a medoid by definition must be one. Here we focus solely on K-means, one of the oldest and most widely used clustering algorithms.

The basic algorithm: first choose K initial centroids, where K is the user-specified number of clusters. Each point is then assigned to the closest centroid, and each collection of points assigned to a centroid is a cluster. The centroid of each cluster is then updated based on the points assigned to it. The assignment and update steps are repeated until no point changes clusters, or equivalently, until the centroids remain the same.

Algorithm 8.1 Basic K-means algorithm.
1: Select K points as initial centroids.
2: repeat
3:   Form K clusters by assigning each point to its closest centroid.
4:   Recompute the centroid of each cluster.
5: until Centroids do not change.


Details:

- In line 1, the simplest solution is to initialize the c_j to equal K random vectors from the input data

- In line 3, for each datapoint i, set

    a_i := \arg\min_j \|x_i - c_j\|_2^2

- In line 4, for each cluster j = 1, ..., K we set

    c_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i,

  i.e. each cluster centroid is set to the mean of the data vectors which were assigned to that cluster in line 3.
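To make lines 1, 3, and 4 concrete, here is a minimal NumPy sketch of the algorithm as just described: centroids are initialized to K random data vectors, points are assigned by squared Euclidean distance, and centroids are recomputed as cluster means. The function name kmeans and the handling of empty clusters are illustrative choices, not part of the course material.

```python
import numpy as np

def kmeans(X, K, max_iter=100, rng=None):
    """Basic K-means. X: (N, n) data matrix, K: number of clusters.
    Returns the assignments a (length N, values 0..K-1) and the centroids c (K, n)."""
    rng = np.random.default_rng(rng)
    N = X.shape[0]
    # Line 1: initialize the centroids to K random data vectors.
    c = X[rng.choice(N, size=K, replace=False)].astype(float)
    a = None
    for _ in range(max_iter):
        # Line 3: assign each point to its closest centroid (squared Euclidean distance).
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)   # (N, K)
        a_new = d2.argmin(axis=1)
        if a is not None and np.array_equal(a_new, a):
            break                      # no assignment changed, so the centroids stay the same
        a = a_new
        # Line 4: recompute each centroid as the mean of the points assigned to it.
        for j in range(K):
            members = X[a == j]
            if len(members) > 0:       # keep the old centroid if a cluster becomes empty
                c[j] = members.mean(axis=0)
    return a, c
```

For example, a, c = kmeans(X, K=3) partitions the rows of a data matrix X into three clusters.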


K-means: 2D example

[Figure: panels (a)-(f) showing successive assignment and centroid-update steps of the K-means algorithm on a two-dimensional data set; both axes run from about -2 to 2.]

- Data from the 'Old Faithful' geyser (horizontal axis is duration of eruption, vertical axis is waiting time to the next eruption, both scaled to zero mean and unit variance)
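The Old Faithful measurements themselves are not reproduced here, so the sketch below instead runs the kmeans function from the previous sketch on synthetic two-cluster data standardized to zero mean and unit variance, roughly mimicking the setting of the figure (matplotlib is assumed to be available; all names and constants are illustrative).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Two synthetic clusters standing in for the standardized Old Faithful measurements.
X = np.vstack([rng.normal([-1.2, -1.0], 0.4, size=(120, 2)),
               rng.normal([0.9, 0.8], 0.5, size=(150, 2))])
X = (X - X.mean(axis=0)) / X.std(axis=0)      # zero mean, unit variance per attribute

a, c = kmeans(X, K=2, rng=0)                  # kmeans() from the sketch above
plt.scatter(X[:, 0], X[:, 1], c=a, s=10)      # points colored by cluster assignment
plt.scatter(c[:, 0], c[:, 1], marker='+', s=200, color='black')   # centroids
plt.xlabel('duration (standardized)')
plt.ylabel('waiting time (standardized)')
plt.show()
```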


K-means: objective function

- Consider the following measure of the goodness of the clustering:

    SSE = \sum_{j=1}^{K} \sum_{x_i \in C_j} \|c_j - x_i\|_2^2

  that is, take the sum of the squared Euclidean distances from each datapoint x_i to the prototype vector c_j of the cluster to which it belongs.

- We will show that in each step of the K-means algorithm the SSE value either stays the same or decreases. (Note: here we aim for ease of understanding rather than give a formal proof with lots of notation.)
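For a given clustering, the SSE is straightforward to compute; here is a small sketch using the same conventions as the kmeans sketch above (the function name sse is illustrative):

```python
import numpy as np

def sse(X, a, c):
    """Sum of squared Euclidean distances from each point to the centroid of its cluster.
    X: (N, n) data, a: length-N assignments in 0..K-1, c: (K, n) centroids."""
    return float(((X - c[a]) ** 2).sum())
```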


- At any point in the algorithm, we have two sets of parameters: the N cluster assignments a_i (which directly determine the C_j), and the K centroids c_j.

- First, we see that, while holding the centroids c_j fixed, recomputing the assignments a_i such that each datapoint x_i is assigned to the cluster j with the closest cluster centroid c_j, i.e.

    a_i = \arg\min_j \|x_i - c_j\|_2^2,

  is the optimal clustering of the datapoints in terms of minimizing the SSE, for fixed c_j, j = 1, ..., K.


- Hence, regardless of what the cluster assignments were at the beginning of step 3 of the algorithm, at the end of that step the SSE cannot have increased (as it is now optimal for the given cluster centroids).

- Next, we show a similar property for step 4, namely that for a given cluster assignment, the centroid given by the mean of the data vectors belonging to the cluster,

    c_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i,

  is optimal in terms of minimizing the SSE objective.


Isolate a given cluster j. Denote the p:th component of c_j by c_{jp} and similarly the p:th component of x_i by x_{ip}. The SSE for this cluster is equal to

  SSE_j = \sum_{x_i \in C_j} \|c_j - x_i\|_2^2 = \sum_{x_i \in C_j} \sum_{p=1}^{n} (c_{jp} - x_{ip})^2

Now take the partial derivative of SSE_j with respect to c_{jp'} and set it to zero:

  \frac{\partial SSE_j}{\partial c_{jp'}}
    = \sum_{x_i \in C_j} \sum_{p=1}^{n} \frac{\partial}{\partial c_{jp'}} (c_{jp} - x_{ip})^2
    = \sum_{x_i \in C_j} \frac{\partial}{\partial c_{jp'}} (c_{jp'} - x_{ip'})^2
    = \sum_{x_i \in C_j} 2 (c_{jp'} - x_{ip'}) = 0

  \Rightarrow\; c_{jp'} = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_{ip'}
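As a quick numerical sanity check of this result (an illustration only, not part of the original derivation): for a randomly generated cluster, no slightly perturbed candidate centroid achieves a smaller SSE_j than the mean.

```python
import numpy as np

rng = np.random.default_rng(1)
Xj = rng.normal(size=(50, 3))          # the data vectors of one cluster C_j
mean = Xj.mean(axis=0)                 # the candidate centroid derived above

def sse_j(cj):
    # SSE of cluster j for a candidate centroid cj
    return ((Xj - cj) ** 2).sum()

for _ in range(1000):
    candidate = mean + rng.normal(scale=0.1, size=3)   # random perturbation of the mean
    assert sse_j(mean) <= sse_j(candidate)
```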


- Thus, line 3 selects the optimal assignments a_i given the centroids c_j, and line 4 selects the optimal centroids c_j given the assignments a_i (where optimality is with respect to minimizing the SSE)

- Hence, the SSE never increases during the course of the algorithm

- Given that there is a finite number (K^N) of possible assignments, the algorithm is guaranteed to converge to a stable state in a finite number of steps. (In practice, the number of iterations to convergence is typically much smaller than this!)


Space and running time complexity

- Space requirements are modest, as (in addition to the data itself) we only need to store:

  1. The index of the assigned cluster for each datapoint x_i

  2. The cluster centroid for each cluster

- The running time is linear in all the relevant parameters, i.e. O(INKn), where I is the number of iterations, N the number of samples, K the number of clusters, and n the number of dimensions (attributes).

  (The number of iterations I typically does not depend heavily on the other parameters.)


Influence of initialization

- The algorithm only guarantees that the SSE is non-increasing. It is still local search, and does not in general reach the global minimum.

Example 1:

[Embedded textbook Figure 8.5, "Poor starting centroids for K-means": panels (a)-(d) show iterations 1-4.]

From the accompanying textbook text (the page begins mid-sentence): ... the centroids will redistribute themselves so that the "true" clusters are found. However, Figure 8.7 shows that if a pair of clusters has only one initial centroid and the other pair has three, then two of the true clusters will be combined and one true cluster will be split.

Note that an optimal clustering will be obtained as long as two initial centroids fall anywhere in a pair of clusters, since the centroids will redistribute themselves, one to each cluster. Unfortunately, as the number of clusters becomes larger, it is increasingly likely that at least one pair of clusters will have only one initial centroid. In this case, because the pairs of clusters are farther apart than clusters within a pair, the K-means algorithm will not redistribute the centroids between pairs of clusters, and thus only a local minimum will be achieved.

Because of the problems with using randomly selected initial centroids, which even repeated runs may not overcome, other techniques are often employed for initialization. One effective approach is to take a sample of points and cluster them using a hierarchical clustering technique. K clusters are extracted from the hierarchical clustering, and the centroids of those clusters are used as the initial centroids. This approach often works well, but is practical only if (1) the sample is relatively small, e.g., a few hundred to a few thousand points (hierarchical clustering is expensive), and (2) K is relatively small compared to the sample size.

The following procedure is another approach to selecting initial centroids. Select the first point at random or take the centroid of all points. Then, for each successive initial centroid, select the point that is farthest from any of the initial centroids already selected. In this way, we obtain a set of initial ...


Example 2:

[Embedded textbook Figure 8.7, "Two pairs of clusters with more or fewer than two initial centroids within a pair of clusters": panels (a)-(d) show iterations 1-4.]

From the accompanying textbook text (continuing mid-sentence, and from its "Time and Space Complexity" paragraph): ... is less susceptible to initialization problems (bisecting K-means) and using postprocessing to "fix up" the set of clusters produced. The space requirements for K-means are modest because only the data points and centroids are stored; specifically, the storage required is O((m + K)n), where m is the number of points and n is the number of attributes. The time requirements are also modest, basically linear in the number of data points: O(I × K × m × n), where I is the number of iterations required for convergence. As mentioned, I is often small and can usually be safely bounded, as most changes typically occur in the ...

- One possible solution: Run the algorithm from many random initial conditions, and select the end result with the smallest SSE. (Nevertheless, it may still find very 'bad' solutions almost all the time.)
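A minimal sketch of this restart strategy, reusing the kmeans and sse functions sketched earlier (the number of restarts is an arbitrary illustrative choice):

```python
def kmeans_restarts(X, K, n_restarts=20):
    """Run K-means from several random initializations and keep the lowest-SSE result."""
    best = None
    for seed in range(n_restarts):
        a, c = kmeans(X, K, rng=seed)          # different random initialization per run
        err = sse(X, a, c)
        if best is None or err < best[0]:
            best = (err, a, c)
    return best                                # (smallest SSE found, assignments, centroids)
```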


How to select the number of clusters?

- Not a priori clear what the 'optimal' number of clusters is:

[Embedded textbook Figure 8.1, "Different ways of clustering the same set of points": (a) original points, (b) two clusters, (c) four clusters, (d) six clusters.]

From the accompanying textbook text: classification in the sense of Chapter 4 is supervised classification, i.e. new, unlabeled objects are assigned a class label using a model developed from objects with known class labels. For this reason, cluster analysis is sometimes referred to as unsupervised classification; when the term classification is used without qualification within data mining, it typically refers to supervised classification. The terms segmentation and partitioning are sometimes used as synonyms for clustering, but are frequently used for approaches outside the traditional bounds of cluster analysis (e.g. partitioning graphs into subgraphs, or segmenting an image based only on pixel intensity and color). An entire collection of clusters is commonly referred to as a clustering, and one can distinguish various types of clusterings: hierarchical (nested) versus partitional (unnested), exclusive versus overlapping versus fuzzy, and complete versus partial.

- The more clusters, the lower the SSE, so we need some form of 'model selection' approach

- We will discuss this a bit more in the context of clustering validation strategies later
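One common heuristic along these lines (shown here only as an illustration of the idea; the course returns to validation strategies later) is to plot the best SSE found against K and look for the point where the curve levels off. X stands for the data matrix, and the helper functions are from the earlier sketches:

```python
import matplotlib.pyplot as plt

Ks = range(1, 11)
errors = [kmeans_restarts(X, K)[0] for K in Ks]   # best SSE found for each K
plt.plot(list(Ks), errors, marker='o')
plt.xlabel('number of clusters K')
plt.ylabel('SSE')
plt.show()
```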


Hierarchical clustering

- Dendrogram representation:

  - Nested cluster structure

  - Binary tree with datapoints (objects) as leaves

  - Cutting the tree at any height produces a partitional clustering

- Example 1:

[Embedded textbook Figure 8.13, "A hierarchical clustering of four points shown as a dendrogram and as nested clusters": (a) a dendrogram over points p1-p4, (b) the corresponding nested cluster diagram.]

From the accompanying textbook text: a hierarchical clustering records the cluster-subcluster relationships and the order in which the clusters were merged (agglomerative view) or split (divisive view). For sets of two-dimensional points, a hierarchical clustering can also be graphically represented using a nested cluster diagram; Figure 8.13 shows these two types of figures for a set of four two-dimensional points, clustered using the single link technique. Many agglomerative hierarchical clustering techniques are variations on a single approach: starting with individual points as clusters, successively merge the two closest clusters until only one cluster remains. This is expressed more formally in Algorithm 8.3.

Algorithm 8.3 Basic agglomerative hierarchical clustering algorithm.
1: Compute the proximity matrix, if necessary.
2: repeat
3:   Merge the closest two clusters.
4:   Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.
5: until Only one cluster remains.


- Example 2:

[Embedded textbook Figure 8.17, "Complete link clustering of the six points shown in Figure 8.15": (a) the nested complete-link clusters, (b) the corresponding dendrogram, with merge heights between 0 and 0.4.]

- Heights of the horizontal connectors indicate the dissimilarity between the combined clusters (details a bit later)


General approaches to hierarchical clustering:

- Divisive approach:

  1. Start with one cluster containing all the datapoints.

  2. Repeat for all non-singleton clusters:

     - Split the cluster in two using some partitional clustering approach (e.g. K-means)

- Agglomerative approach:

  1. Start with each datapoint being its own cluster

  2. Repeat until there is just one cluster left:

     - Select the pair of clusters which are most similar and join them into a single cluster

(The agglomerative approach is much more common, and we will exclusively focus on it in what follows; see the code sketch below.)
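A minimal sketch of the agglomerative approach using SciPy (assuming SciPy and matplotlib are available; the data matrix, linkage method, and cut level are illustrative placeholders). The different choices of cluster-to-cluster dissimilarity passed as method correspond to the linkage criteria defined on the following slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(30, 2))   # placeholder data; any (N, n) matrix

D = pdist(X, metric='euclidean')                    # condensed matrix of pairwise dissimilarities
Z = linkage(D, method='single')                     # or 'complete', 'average', 'centroid', 'ward'
labels = fcluster(Z, t=3, criterion='maxclust')     # a horizontal cut giving 3 clusters
dendrogram(Z)                                       # plot the tree (uses matplotlib)
```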


Need a similarity/proximity measure for pairs of clusters (in addition to the similarity of pairs of datapoints). E.g. we need to compare d(C_red, C_green), d(C_red, C_blue), and d(C_green, C_blue):

- Notation

    x_i          the i:th data vector, i = 1, ..., N
    C_a          the set of indices i of the x_i belonging to cluster a
    d(C_a, C_b)  the dissimilarity between clusters C_a and C_b


- 'Single-link' (= 'MIN')

    d(C_a, C_b) = \min_{i \in C_a,\, j \in C_b} d(x_i, x_j),

  where d(x_i, x_j) is the dissimilarity between the two datapoints (objects) x_i and x_j.

[Embedded textbook Figure 8.14, "Graph-based definitions of cluster proximity": (a) MIN (single link), (b) MAX (complete link), (c) group average.]

From the accompanying textbook text: the key operation of Algorithm 8.3 is the computation of the proximity between two clusters, and it is this definition that differentiates the various agglomerative hierarchical techniques. MIN defines cluster proximity as the proximity between the closest two points that are in different clusters, or, using graph terms, the shortest edge between two nodes in different subsets of nodes; this yields contiguity-based clusters. MAX takes the proximity between the farthest two points in different clusters, i.e. the longest edge between two nodes in different subsets. (If the proximities are distances, the names MIN and MAX are short and suggestive; for similarities, where higher values indicate closer points, the names seem reversed, so the alternative names single link and complete link are usually preferred.) The group average technique defines cluster proximity as the average pairwise proximity (average edge length) over all pairs of points from different clusters. In a prototype-based view, in which each cluster is represented by a centroid, cluster proximity is commonly defined as the proximity between cluster centroids; Ward's method also represents a cluster by its centroid, but measures the proximity between two clusters in terms of the increase in SSE that results from merging them.

(Note that when working with similarity measures s(·, ·) we instead take the object pair with maximum similarity!)


- Alternatively, we can try to enforce that clusters should have all pairs of points reasonably close to each other. This gives

  'Complete-link' (= 'MAX')

    d(C_a, C_b) = \max_{i \in C_a,\, j \in C_b} d(x_i, x_j),

  where d(x_i, x_j) is the dissimilarity between the two datapoints (objects) x_i and x_j.

  (Again, for similarity measures s(·, ·) we instead take the minimum of the objectwise similarities!)


- An intermediate criterion is 'Group average':

    d(C_a, C_b) = \frac{1}{|C_a||C_b|} \sum_{i \in C_a,\, j \in C_b} d(x_i, x_j)

  (With similarity measures s(·, ·) we also just take the average value.)


- Centroid-based hierarchical clustering:

    d(C_a, C_b) = d(c_a, c_b),

  where c_a and c_b are the cluster prototypes given by the means of the vectors in each cluster:

    c_a = \frac{1}{|C_a|} \sum_{i \in C_a} x_i   and   c_b = \frac{1}{|C_b|} \sum_{i \in C_b} x_i

- Ward's method is based on using prototypes (centroids) for each cluster, and measuring the dissimilarity between clusters as the increase in SSE (sum of squared errors from datapoints to their prototype) resulting from combining the two clusters


Example 1:

[Embedded textbook Figure 8.15, "Set of 6 two-dimensional points", together with:

Table 8.3. xy coordinates of the 6 points.
  Point  x Coordinate  y Coordinate
  p1     0.40          0.53
  p2     0.22          0.38
  p3     0.35          0.32
  p4     0.26          0.19
  p5     0.08          0.41
  p6     0.45          0.30

Table 8.4. Euclidean distance matrix for the 6 points.
        p1    p2    p3    p4    p5    p6
  p1    0.00  0.24  0.22  0.37  0.34  0.23
  p2    0.24  0.00  0.15  0.20  0.14  0.25
  p3    0.22  0.15  0.00  0.15  0.28  0.11
  p4    0.37  0.20  0.15  0.00  0.29  0.22
  p5    0.34  0.14  0.28  0.29  0.00  0.39
  p6    0.23  0.25  0.11  0.22  0.39  0.00 ]

From the accompanying textbook text (single link or MIN): the proximity of two clusters is defined as the minimum of the distance (maximum of the similarity) between any two points in the two different clusters. In graph terminology, starting with all points as singleton clusters and adding links between points one at a time, shortest links first, these single links combine the points into clusters. The single link technique is good at handling non-elliptical shapes, but is sensitive to noise and outliers.

- Single-link:

[Embedded textbook Figure 8.16, "Single link clustering of the six points shown in Figure 8.15": (a) the nested clusters, with the numbers indicating the order of the merges, and (b) the corresponding dendrogram (leaves ordered 3, 6, 2, 5, 4, 1; heights between 0 and 0.2).]

The height at which two clusters are merged in the dendrogram reflects their distance. For instance, from Table 8.4 the distance between points 3 and 6 is 0.11, and that is the height at which they are joined into one cluster. As another example, the distance between clusters {3, 6} and {2, 5} is

  dist({3, 6}, {2, 5}) = min(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5))
                       = min(0.15, 0.25, 0.28, 0.39)
                       = 0.15.

(The heights in the dendrogram correspond to the dissimilarities d(C_a, C_b) when clusters C_a and C_b are combined.)
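The single-link example can be checked with SciPy using the coordinates printed in Table 8.3. Note that the printed coordinates are rounded to two decimals, so the recomputed distances differ slightly from Table 8.4 (for instance, the first merge of points 3 and 6 comes out at about 0.10 rather than 0.11):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# p1 ... p6 as printed in Table 8.3 (rounded to two decimals)
P = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
              [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

print(np.round(squareform(pdist(P)), 2))   # pairwise distances, close to Table 8.4

Z = linkage(pdist(P), method='single')
print(np.round(Z[:, 2], 2))                # single-link merge heights; the first merge joins
                                           # points 3 and 6 (0-based indices 2 and 5)
```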


Example 2:

- Complete-link:

[Embedded textbook Figure 8.17, "Complete link clustering of the six points shown in Figure 8.15": (a) the nested clusters, with the numbers indicating the order of the merges, and (b) the corresponding dendrogram (leaves ordered 3, 6, 4, 1, 2, 5; heights between 0 and 0.4).]

(The heights in the dendrogram correspond to the dissimilarities d(C_a, C_b) when clusters C_a and C_b are combined.)

From the accompanying textbook text (complete link or MAX or CLIQUE): the proximity of two clusters is defined as the maximum of the distance (minimum of the similarity) between any two points in the two different clusters. In graph terms, a group of points is not a cluster until all the points in it are completely linked, i.e. form a clique. Complete link is less susceptible to noise and outliers, but it can break large clusters and it favors globular shapes. As with single link, points 3 and 6 are merged first. However, {3, 6} is then merged with {4}, instead of with {2, 5} or {1}, because

  dist({3, 6}, {4})    = max(dist(3, 4), dist(6, 4))
                       = max(0.15, 0.22)
                       = 0.22,

  dist({3, 6}, {2, 5}) = max(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5))
                       = max(0.15, 0.25, 0.28, 0.39)
                       = 0.39,

  dist({3, 6}, {1})    = max(dist(3, 1), dist(6, 1))
                       = max(0.22, 0.23)
                       = 0.23.
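These complete-link cluster distances can be verified directly from the distance matrix in Table 8.4 (a small sketch; indices are 0-based, so points 3 and 6 become indices 2 and 5):

```python
import numpy as np

# Table 8.4: Euclidean distance matrix for p1 ... p6 (0-based indices 0 ... 5)
D = np.array([[0.00, 0.24, 0.22, 0.37, 0.34, 0.23],
              [0.24, 0.00, 0.15, 0.20, 0.14, 0.25],
              [0.22, 0.15, 0.00, 0.15, 0.28, 0.11],
              [0.37, 0.20, 0.15, 0.00, 0.29, 0.22],
              [0.34, 0.14, 0.28, 0.29, 0.00, 0.39],
              [0.23, 0.25, 0.11, 0.22, 0.39, 0.00]])

def complete_link(D, A, B):
    """Complete-link (MAX) dissimilarity between clusters A and B, given as index lists."""
    return max(D[i, j] for i in A for j in B)

print(complete_link(D, [2, 5], [3]))      # dist({3,6}, {4})   -> 0.22
print(complete_link(D, [2, 5], [1, 4]))   # dist({3,6}, {2,5}) -> 0.39
print(complete_link(D, [2, 5], [0]))      # dist({3,6}, {1})   -> 0.23
```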


- Cluster shapes:

  - Single-link can produce arbitrarily shaped clusters (joining quite different objects which have some intermediate links that connect them)

  - Complete-link tends to produce fairly compact, globular clusters. Problems with clusters of different sizes.

  - Group average is a compromise between the two

  [Figure: two example clusterings, labeled 'single link' and 'complete link']

- Lack of a global objective function:

  - In contrast to methods such as K-means, the agglomerative hierarchical clustering methods do not have a natural objective function that is being optimized. Even Ward's method does not even give local minima in terms of minimizing the SSE!


- Monotonicity:

  If the dissimilarity of the pair of clusters merged at any point in the algorithm is always at least as large as the dissimilarity of the pair of clusters merged in the previous step, the clustering is monotonic.

  - Single-link, complete-link, and group average: Yes!

  - Centroid-based hierarchical clustering: No! Example:

    Take d_1 = (1 + ε, 1), d_2 = (5, 1), d_3 = (3, 1 + 2√3). The first merge (of d_1 and d_2) occurs at a distance of 4 − ε, producing the centroid o = (3 + ε/2, 1). The next merge (of {d_1, d_2} with d_3) occurs at a distance of

      √((2√3)² + (ε/2)²) ≈ 2√3 ≈ 3.4641 < 4 − ε,

    i.e. at a smaller dissimilarity than the previous merge.
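A quick numerical check of this counterexample (with the illustrative choice ε = 0.1):

```python
import numpy as np

eps = 0.1
d1 = np.array([1 + eps, 1.0])
d2 = np.array([5.0, 1.0])
d3 = np.array([3.0, 1 + 2 * np.sqrt(3)])

first_merge = np.linalg.norm(d1 - d2)     # 4 - eps = 3.9
o = (d1 + d2) / 2                         # centroid of {d1, d2} = (3 + eps/2, 1)
second_merge = np.linalg.norm(d3 - o)     # ~ 2*sqrt(3) ~ 3.465

print(first_merge, second_merge, second_merge < first_merge)   # the second merge is closer
```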


- Computational complexity

  - The main storage requirement is the matrix of pairwise proximities, containing a total of N(N-1)/2 entries for N datapoints. So the space complexity is O(N^2).

  - Computing the proximity matrix takes O(N^2). Next, there are O(N) iterations, where in each one we need to find the minimum of the pairwise dissimilarities between the clusters. Trivially implemented this would lead to an O(N^3) algorithm, but techniques exist to avoid exhaustive search at each step, yielding complexities in the range O(N^2) to O(N^2 log N).

  (Compare this to K-means, which only requires O(NK) for K clusters.)

  Hence, hierarchical clustering is directly applicable only to relatively small datasets.
