
Page 1:

CE-717: Machine Learning
Sharif University of Technology
Fall 2020
Soleymani

Clustering

Page 2:

Unsupervised Learning

Clustering: partitioning of the data into groups of similar data points.

Density estimation: parametric and non-parametric density estimation.

Dimensionality reduction: representing the data using a smaller number of dimensions while preserving (perhaps approximately) some properties of the data.

Page 3:

Clustering: Definition

We have a set of unlabeled data points $\{\mathbf{x}^{(i)}\}_{i=1}^{N}$ and we intend to find groups of similar objects (based on the observed features):

high intra-cluster similarity

low inter-cluster similarity

[Figure: groups of points in the $(x_1, x_2)$ plane.]

Page 4:

Clustering: Another Definition

Density-based definition: clusters are regions of high density that are separated from one another by regions of low density.

[Figure: dense regions of points in the $(x_1, x_2)$ plane.]

Page 5:

Difficulties

Clustering is not as well-defined as classification.

Clustering is subjective.

Natural grouping may be ambiguous.

Page 6:

Clustering Purpose

Preprocessing stage to index, compress, or reduce the data

Representing high-dimensional data in a low-dimensional space (e.g., for visualization purposes)

Knowledge discovery from data: a tool to understand the hidden structure of the data or to group it

To gain insight into the structure of the data (prior to classifier design); provides information about the internal structure of the data

To group or partition the data when no label is available

Page 7:

Clustering Applications

Information retrieval (search and browsing): cluster text documents or images based on their content

Cluster groups of users based on their access patterns on webpages

Page 8:

Clustering of docs

Google News

Page 9:

Clustering Applications

Information retrieval (search and browsing): cluster text documents or images based on their content

Cluster groups of users based on their access patterns on webpages

Cluster users of social networks by interest (community detection)

Page 10:

Social Network: Community Detection

Page 11:

Clustering Applications

Information retrieval (search and browsing): cluster text documents or images based on their content

Cluster groups of users based on their access patterns on webpages

Cluster users of social networks by interest (community detection)

Bioinformatics: cluster similar proteins together (similarity w.r.t. chemical structure and/or functionality, etc.), or cluster similar genes according to microarray data

Page 12:

Gene Clustering

Microarrays measure the expression of all genes.

Clustering genes can help determine new functions for unknown genes by grouping them with genes of known function.

Page 13:

Clustering Applications

Information retrieval (search and browsing): cluster text documents or images based on their content; cluster groups of users based on their access patterns on webpages; cluster users of social networks by interest (community detection)

Bioinformatics: cluster similar proteins together (similarity w.r.t. chemical structure and/or functionality, etc.) or similar genes according to microarray data

Market segmentation: cluster customers based on their purchase history and their characteristics

Image segmentation

Many more applications

Page 14:

Categorization of Clustering Algorithms

[Diagram: clustering algorithms split into hierarchical and partitional.]

Partitional algorithms: construct various partitions and then evaluate them by some criterion; the desired number of clusters K must be specified.

Hierarchical algorithms: create a hierarchical decomposition of the set of objects using some criterion.

Page 15:

Clustering methods we will discuss

Objective-based clustering

K-means

EM-style algorithm for clustering with a mixture of Gaussians (in the next lecture)

Hierarchical clustering

Page 16:

Partitional Clustering

$\mathcal{X} = \{\mathbf{x}^{(i)}\}_{i=1}^{N}$

$\mathcal{C} = \{\mathcal{C}_1, \mathcal{C}_2, \dots, \mathcal{C}_K\}$

$\forall j,\ \mathcal{C}_j \neq \emptyset$

$\bigcup_{j=1}^{K} \mathcal{C}_j = \mathcal{X}$

$\forall i \neq j,\ \mathcal{C}_i \cap \mathcal{C}_j = \emptyset$ (disjoint partitioning for hard clustering)

Since the output is only one set of clusters, the user has to specify the desired number of clusters K.

Hard clustering: each data point can belong to one cluster only.

Nonhierarchical: each instance is placed in exactly one of K non-overlapping clusters.

Page 17:

Partitioning Algorithms: Basic Concept

Construct a partition of a set of $N$ objects into a set of $K$ clusters.

The number of clusters $K$ is given in advance.

Each object belongs to exactly one cluster in hard clustering methods.

K-means is the most popular partitioning algorithm.

Page 18:

Objective Based Clustering

Input: a set of $N$ points and a distance/dissimilarity measure.

Output: a partition of the data.

k-median: find centers $\mathbf{c}_1, \mathbf{c}_2, \dots, \mathbf{c}_K$ to minimize $\sum_{i=1}^{N} \min_{j \in \{1,\dots,K\}} d(\mathbf{x}^{(i)}, \mathbf{c}_j)$

k-means: find centers $\mathbf{c}_1, \mathbf{c}_2, \dots, \mathbf{c}_K$ to minimize $\sum_{i=1}^{N} \min_{j \in \{1,\dots,K\}} d^2(\mathbf{x}^{(i)}, \mathbf{c}_j)$

Page 19:

Distance Measure

Let $O_1$ and $O_2$ be two objects from the universe of possible objects. The distance (dissimilarity) between $O_1$ and $O_2$ is a real number denoted by $d(O_1, O_2)$.

Specifying the distance $d(\mathbf{x}, \mathbf{x}')$ between pairs $(\mathbf{x}, \mathbf{x}')$:

E.g., for texts: number of keywords in common, edit distance

Example: Euclidean distance in the feature space

Page 20:

K-means Clustering

Input: a set $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}$ of data points (in a $d$-dimensional feature space) and an integer $K$.

Output: a set of $K$ cluster representatives $\mathbf{c}_1, \mathbf{c}_2, \dots, \mathbf{c}_K \in \mathbb{R}^d$. Data points are assigned to the clusters according to their distances to $\mathbf{c}_1, \mathbf{c}_2, \dots, \mathbf{c}_K$: each data point is assigned to the cluster whose representative is nearest to it.

Objective: choose $\mathbf{c}_1, \mathbf{c}_2, \dots, \mathbf{c}_K$ to minimize

$\sum_{i=1}^{N} \min_{j \in \{1,\dots,K\}} d^2(\mathbf{x}^{(i)}, \mathbf{c}_j)$

Page 21:

Euclidean k-means Clustering

Input: a set $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}$ of data points (in a $d$-dimensional feature space) and an integer $K$.

Output: a set of $K$ cluster representatives $\mathbf{c}_1, \mathbf{c}_2, \dots, \mathbf{c}_K \in \mathbb{R}^d$; each point is assigned to its closest cluster representative.

Objective: choose $\mathbf{c}_1, \mathbf{c}_2, \dots, \mathbf{c}_K$ to minimize

$\sum_{i=1}^{N} \min_{j \in \{1,\dots,K\}} \|\mathbf{x}^{(i)} - \mathbf{c}_j\|^2$

Page 22:

Euclidean k-means Clustering: Computational Complexity

To find the optimal partition, we would need to exhaustively enumerate all partitions. (In how many ways can we assign $k$ labels to $N$ observations?)

NP-hard: even for $k = 2$ or $d = 2$.

For $k = 1$: $\min_{\mathbf{c}} \sum_{i=1}^{N} \|\mathbf{x}^{(i)} - \mathbf{c}\|^2$ is solved by $\mathbf{c} = \boldsymbol{\mu} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}^{(i)}$.

For $d = 1$: dynamic programming in time $O(N^2 K)$.

Page 23:

Common Heuristic in Practice: Lloyd's Method

Input: a set $\mathcal{X}$ of $N$ data points $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}$ in $\mathbb{R}^d$.

Initialize centers $\mathbf{c}_1, \mathbf{c}_2, \dots, \mathbf{c}_K \in \mathbb{R}^d$ in any way.

Repeat until there is no further change in the cost:

For each $j$: $\mathcal{C}_j \leftarrow \{\mathbf{x} \in \mathcal{X} \mid \mathbf{c}_j \text{ is the closest center to } \mathbf{x}\}$

For each $j$: $\mathbf{c}_j \leftarrow$ mean of the members of $\mathcal{C}_j$

Holding the centers $\mathbf{c}_1, \dots, \mathbf{c}_K$ fixed, find the optimal assignments $\mathcal{C}_1, \dots, \mathcal{C}_K$ of data points to clusters.

Holding the cluster assignments $\mathcal{C}_1, \dots, \mathcal{C}_K$ fixed, find the optimal centers $\mathbf{c}_1, \dots, \mathbf{c}_K$.

Page 24:

K-means Algorithm (Lloyd's Method)

Select $k$ random points $\mathbf{c}_1, \mathbf{c}_2, \dots, \mathbf{c}_k$ as the clusters' initial centroids.

Repeat until convergence (or another stopping criterion):

Assign data based on the current centers: for $i = 1$ to $N$, assign $\mathbf{x}^{(i)}$ to the closest cluster, so that $\mathcal{C}_j$ contains all data points that are closer to $\mathbf{c}_j$ than to any other center.

Re-estimate centers based on the current assignment: for $j = 1$ to $k$,

$\mathbf{c}_j = \frac{1}{|\mathcal{C}_j|} \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_j} \mathbf{x}^{(i)}$

(A minimal sketch follows.)
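A minimal NumPy sketch of this procedure. The function name, the random-data-point initialization, and the convergence test are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Lloyd's method: alternate between assigning each point to its
    nearest center and recomputing each center as its cluster mean."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initial centroids
    for _ in range(max_iters):
        # Assignment step: index of the nearest center for each point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # no further change: converged
            break
        centers = new_centers
    return centers, labels
```

Usage is simply `centers, labels = kmeans(X, k=3)` for an $N \times d$ array `X`.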

Page 25:

[Figure from Bishop: k-means iterations alternating between assigning data to clusters and updating the means.]

Page 26:

Intra-cluster Similarity View

k-means optimizes intra-cluster similarity:

$J(\mathcal{C}) = \sum_{j=1}^{K} \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_j} \|\mathbf{x}^{(i)} - \mathbf{c}_j\|^2$, where $\mathbf{c}_j = \frac{1}{|\mathcal{C}_j|} \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_j} \mathbf{x}^{(i)}$

$\sum_{\mathbf{x}^{(i)} \in \mathcal{C}_j} \|\mathbf{x}^{(i)} - \mathbf{c}_j\|^2 = \frac{1}{2|\mathcal{C}_j|} \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_j} \sum_{\mathbf{x}^{(i')} \in \mathcal{C}_j} \|\mathbf{x}^{(i)} - \mathbf{x}^{(i')}\|^2$

i.e., each cluster's cost equals (up to a constant factor) the average squared distance between members of the same cluster.

Page 27:

K-means: Convergence

It always converges. Why should the K-means algorithm ever reach a state in which the clustering doesn't change?

The reassignment stage monotonically decreases $J$, since each vector is assigned to the closest centroid.

The centroid update stage also decreases $J$: for each cluster, the mean minimizes the sum of squared distances of the assigned points to the cluster's center.

(Sec. 16.4)

[Figure from Bishop: the cost $J$ after each E-step and each M-step.]

Page 28:

Local Optimum

It always converges, but it may converge to a local optimum that is different from the global optimum, and may be arbitrarily worse in terms of the objective score.


Page 30:

Local Optimum

It always converges, but it may converge to a local optimum that is different from the global optimum, and may be arbitrarily worse in terms of the objective score.

Local optimum: every point is assigned to its nearest center and every center is the mean value of its points.

Page 31:

K-means: Local Minimum Problem

[Figure: original data, with the obtained clustering compared against the optimal clustering.]

Page 32:

Lloyd's Method: Initialization

Initialization is crucial (it affects how fast the algorithm converges and the quality of the clustering):

Random centers from the data points

Multiple runs, selecting the best one

Initialize with the result of another method

Select good initial centers using a heuristic:

Furthest-point traversal

K-means++ (works well and has provable guarantees)

Page 33:

Another Initialization Idea: Furthest-Point Heuristic

Choose $\mathbf{c}_1$ arbitrarily (or at random).

For $j = 2, \dots, K$: select $\mathbf{c}_j$ as the data point among $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}$ that is farthest from the previously chosen $\mathbf{c}_1, \dots, \mathbf{c}_{j-1}$. (A short sketch follows.)
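A short NumPy sketch of this heuristic (the function name is illustrative):

```python
import numpy as np

def furthest_point_init(X, k, seed=0):
    """Pick c_1 at random, then repeatedly pick the data point
    farthest from all previously chosen centers."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Distance of every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])  # the farthest point becomes c_j
    return np.array(centers)
```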

Page 34:

Another Initialization Idea: Furthest-Point Heuristic

It is sensitive to outliers.

Page 35:

K-means++ Initialization: D² Sampling [D. Arthur and S. Vassilvitskii, 2007]

Combines the random-initialization and furthest-point-initialization ideas: let the probability of selecting a point be proportional to the squared distance between that point and its nearest already-chosen center, $D^2(\mathbf{x}) = \min_{k < j} \|\mathbf{x} - \mathbf{c}_k\|^2$.

Choose $\mathbf{c}_1$ arbitrarily (or at random).

For $j = 2, \dots, K$: select $\mathbf{c}_j$ among the data points $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}$ according to the distribution

$\Pr(\mathbf{c}_j = \mathbf{x}^{(i)}) \propto \min_{k < j} \|\mathbf{x}^{(i)} - \mathbf{c}_k\|^2$

Theorem: K-means++ always attains an $O(\log k)$ approximation to the optimal k-means solution in expectation.
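A sketch of D² sampling in NumPy (the function name is illustrative):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-means++: each new center is a data point drawn with probability
    proportional to its squared distance to the nearest chosen center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # c_1 chosen at random
    for _ in range(1, k):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()            # Pr(c_j = x_i) proportional to D^2(x_i)
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

The returned centers can then seed Lloyd's method in place of purely random initialization.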

Page 36:

How Many Clusters?

The number of clusters $k$ is given in advance in the k-means algorithm. However, finding the "right" number of clusters is part of the problem.

Tradeoff between having better focus within each cluster and having too many clusters.

Page 37:

How Many Clusters?

Heuristic: find a large gap between the $(k-1)$-means cost and the $k$-means cost ("knee finding" or "elbow finding"); see the sketch below.

Hold-out validation / cross-validation on an auxiliary task (e.g., a supervised learning task).

Optimization problem: penalize having lots of clusters; some criteria can be used to automatically estimate k. Penalize the number of bits needed to describe the extra parameters:

$J'(\mathcal{C}) = J(\mathcal{C}) + |\mathcal{C}| \times \log N$

Hierarchical clustering
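A sketch of the elbow heuristic, assuming the kmeans() sketch from the Page 24 example is available:

```python
import numpy as np

def kmeans_cost(X, centers):
    """k-means cost J: sum of squared distances to the nearest center."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.sum(d.min(axis=1) ** 2)

def elbow_costs(X, k_max):
    """Cost of the clustering returned by kmeans() for k = 1..k_max;
    plot these against k and look for the largest gap (the "knee")."""
    costs = []
    for k in range(1, k_max + 1):
        centers, _ = kmeans(X, k)  # kmeans() from the Page 24 sketch
        costs.append(kmeans_cost(X, centers))
    return costs
```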

Page 38:

K-means: Advantages and Disadvantages

Strengths:

It is a simple method and easy to implement.

Relatively efficient: $O(tKNd)$, where $t$ is the number of iterations.

K-means typically converges quickly; usually $t \ll N$ (but it can take an exponential number of rounds in the worst case [Andrea Vattani, 2009]).

Weaknesses:

Need to specify K, the number of clusters, in advance.

Often terminates at a local optimum; initialization is important.

Not suitable for discovering clusters with arbitrary shapes.

Works for numerical data. What about categorical data?

Noise and outliers can be considerable trouble for K-means.

Page 39:

k-means Algorithm: Limitation

In general, k-means is unable to find clusters of arbitrary shapes, sizes, and densities, except for very distant clusters.

Page 40:

K-means

K-means was proposed nearly 60 years ago, and thousands of clustering algorithms have been published since then. However, K-means is still widely used.

This speaks to the difficulty of designing a general-purpose clustering algorithm and to the ill-posed nature of the clustering problem.

A.K. Jain, "Data Clustering: 50 Years Beyond K-means," 2010.

Page 41:

K-means: Vector Quantization

Data compression

Vector quantization: construct a codebook using k-means; the cluster means serve as prototypes representing the examples assigned to their clusters. (A sketch of the idea follows.)

[Figure: images quantized with k = 3, 5, and 15 codebook vectors.]

Page 42:

K-means: Image Segmentation

[Bishop]

Page 43:

Hierarchical Clustering

The notion of a cluster can be ambiguous: how many clusters are there?

Hierarchical clustering: clusters contain sub-clusters, the sub-clusters themselves can have sub-sub-clusters, and so on.

Several levels of detail in the clustering; a hierarchy might be more natural.

Different levels of granularity.

Page 44:

Categorization of Clustering Algorithms

[Diagram: clustering algorithms split into hierarchical (agglomerative vs. divisive) and partitional.]

Page 45:

Hierarchical Clustering

Agglomerative (bottom-up): starts with each data point in a separate cluster, and repeatedly joins the closest pair of clusters until there is only one cluster (or another stopping criterion is met).

Divisive (top-down): starts with the whole data set as one cluster, and repeatedly divides one of the clusters until there is only one data point in each cluster (or another stopping criterion is met).

Page 46:

Hierarchical Agglomerative Clustering (HAC) Algorithm

1. Maintain a set of clusters.

2. Initially, each instance forms its own cluster.

3. While there is more than one cluster: pick the two closest clusters and merge them into a new cluster.

(A SciPy sketch follows.)
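A sketch of this merge loop using SciPy's agglomerative clustering (the toy data is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))  # toy data points

# linkage() runs the agglomerative loop: start with singleton clusters
# and repeatedly merge the two closest ones; `method` selects the
# cluster-pair distance ('single', 'complete', 'average', 'ward').
Z = linkage(X, method='average')

# Cut the resulting hierarchy into a flat clustering with 3 clusters.
labels = fcluster(Z, t=3, criterion='maxclust')
```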

Page 47:

Hierarchical Agglomerative Clustering (HAC) Algorithm

1. Maintain a set of clusters.

2. Initially, each instance forms its own cluster.

3. While there is more than one cluster: pick the two closest clusters and merge them into a new cluster.

[Figure: dendrogram over seven points; the height of each merge represents the distance at which it occurs.]

Page 48:

Distances between Cluster Pairs

There are many variants for defining the distance between a pair of clusters:

Single-link: minimum distance between pairs of data points from the two clusters

Complete-link: maximum distance between pairs of data points from the two clusters

Centroid (Ward's): distance between centroids (centers of gravity)

Average-link: average distance between pairs of elements

Page 49:

Distances between Cluster Pairs

Single-link: $dist_{SL}(\mathcal{C}_i, \mathcal{C}_j) = \min_{\mathbf{x} \in \mathcal{C}_i,\, \mathbf{x}' \in \mathcal{C}_j} dist(\mathbf{x}, \mathbf{x}')$

Complete-link: $dist_{CL}(\mathcal{C}_i, \mathcal{C}_j) = \max_{\mathbf{x} \in \mathcal{C}_i,\, \mathbf{x}' \in \mathcal{C}_j} dist(\mathbf{x}, \mathbf{x}')$

Ward's: $dist_{Ward}(\mathcal{C}_i, \mathcal{C}_j) = \frac{|\mathcal{C}_i||\mathcal{C}_j|}{|\mathcal{C}_i| + |\mathcal{C}_j|}\, dist(\mathbf{c}_i, \mathbf{c}_j)$

Average-link: $dist_{AL}(\mathcal{C}_i, \mathcal{C}_j) = \frac{1}{|\mathcal{C}_i \cup \mathcal{C}_j|} \sum_{\mathbf{x} \in \mathcal{C}_i \cup \mathcal{C}_j} \sum_{\mathbf{x}' \in \mathcal{C}_i \cup \mathcal{C}_j} dist(\mathbf{x}, \mathbf{x}')$

Page 50:

Single-Link

[Figure: single-link dendrogram over seven points.]

Single-link keeps the maximum "bridge length" as small as possible.

Page 51:

Complete-Link

[Figure: complete-link dendrogram over seven points.]

Complete-link keeps the maximum cluster diameter as small as possible.

Page 52:

Ward's Method

The distance between the centers of the two clusters, weighted to also account for cluster sizes:

$dist_{Ward}(\mathcal{C}_i, \mathcal{C}_j) = \frac{|\mathcal{C}_i||\mathcal{C}_j|}{|\mathcal{C}_i| + |\mathcal{C}_j|}\, dist(\mathbf{c}_i, \mathbf{c}_j)$

Merge the two clusters for which the increase in the k-means cost is as small as possible.

Works well in practice.

Page 53:

Distances between Clusters: Summary

Which distance is best? Complete linkage prefers compact clusters, while single linkage can produce long, stretched clusters. The choice depends on what you need; expert opinion is helpful.

Page 54:

Data Matrix vs. Distance Matrix

Data (or pattern) matrix, $N \times d$ (features of the data):

$\mathbf{X} = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}$

Distance matrix, $N \times N$ (distances between each pair of patterns):

$\mathbf{D} = \begin{bmatrix} d(\mathbf{x}^{(1)}, \mathbf{x}^{(1)}) & \cdots & d(\mathbf{x}^{(1)}, \mathbf{x}^{(N)}) \\ \vdots & \ddots & \vdots \\ d(\mathbf{x}^{(N)}, \mathbf{x}^{(1)}) & \cdots & d(\mathbf{x}^{(N)}, \mathbf{x}^{(N)}) \end{bmatrix}$

Single-link, complete-link, and average-link only need the distance matrix. (A small sketch follows.)
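A small SciPy sketch of both matrices (the toy data is illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(0).normal(size=(10, 4))  # N x d data matrix

d_condensed = pdist(X, metric='euclidean')  # all pairwise distances
D = squareform(d_condensed)                 # full N x N distance matrix

# Single-link needs only the pairwise distances, not the features:
Z = linkage(d_condensed, method='single')
```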

Page 55:

Computational Complexity

In the first iteration, all HAC methods compute the similarity of all pairs of the $N$ individual instances, which is $O(N^2)$ similarity computations.

In each of the $N - 2$ merging iterations, compute the distance between the most recently created cluster and all other existing clusters.

Done naively this is $O(N^3)$, but done more cleverly it is $O(N^2 \log N)$.

Page 56:

Dendrogram: Hierarchical Clustering

A clustering is obtained by cutting the dendrogram at a desired level:

Cut at a pre-specified level of similarity

Cut where the gap between two successive combination similarities is largest

Select the cutting point that produces K clusters

Where to "cut" the dendrogram is user-determined.

Page 57:

Outliers

We can detect outliers (points that are very different from all others) by finding isolated branches of the dendrogram.

[Figure: dendrogram in which an outlying point forms an isolated branch.]

Page 58:

K-means vs. Hierarchical

Time cost: K-means is usually fast, while hierarchical methods do not scale well.

Human intuition: a hierarchical structure provides more natural output, compatible with human intuition in some domains.

Local minimum problem: it is very common for k-means. Hierarchical methods, like any heuristic search algorithm, also suffer from local optima, since they merge clusters greedily and can never undo what was done previously.

Choosing the number of clusters: there is no need to specify the number of clusters in advance for hierarchical methods.

Page 59:

Clustering Validity

We need to determine whether the found clusters are real, or to compare different clustering methods. What is a good clustering? This calls for clustering quality measurement.

Main approaches:

Internal index: evaluates how well the clustering fits the data, without reference to external information.

External index: evaluates how well the clustering result agrees with known categories (assumption: ground-truth labels are available).

(Sec. 16.3)

Page 60:

Internal Index: Coherence

An internal criterion is usually based on coherence:

Compactness of the data in the clusters: high intra-cluster similarity (closeness of cluster elements)

Separability of distinct clusters: low inter-cluster similarity

Some internal indices: Davies-Bouldin (DB), Silhouette, Dunn, Bayesian information criterion (BIC), Calinski-Harabasz (CH). (A usage sketch follows.)
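A usage sketch for a few of these indices in scikit-learn (the random data is purely illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X = np.random.default_rng(0).normal(size=(100, 2))
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))          # higher is better
print(davies_bouldin_score(X, labels))      # lower is better
print(calinski_harabasz_score(X, labels))   # higher is better
```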

Page 61:

Internal Index: Stability

Evaluate cluster stability under minor perturbations of the data. For example, evaluate a clustering result by comparing it with the result obtained after subsampling the data (e.g., subsampling 80% of the data).

To measure stability, we need a measure of similarity between two k-clusterings: this is based on comparing two k-clusterings, similar to the external indices that compare a clustering result with the ground truth.

Page 62:

External Index: Rand Index and Clustering F-measure

For a clustering $\mathcal{C}$ and a reference clustering $\tilde{\mathcal{C}}$, count pairs of points:

$TP$: # pairs that cluster together in both $\mathcal{C}$ and $\tilde{\mathcal{C}}$
$TN$: # pairs that are in separate clusters in both $\mathcal{C}$ and $\tilde{\mathcal{C}}$
$FN$: # pairs that cluster together in $\mathcal{C}$ but not in $\tilde{\mathcal{C}}$
$FP$: # pairs that cluster together in $\tilde{\mathcal{C}}$ but not in $\mathcal{C}$

(rows: a pair's status in $\mathcal{C}$; columns: its status in $\tilde{\mathcal{C}}$)

                 Same    Different
Same              TP        FN
Different         FP        TN

$RI = \frac{TP + TN}{TP + TN + FP + FN}$

$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}$

$F_\beta = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}$

The F-measure in addition supports differential weighting of P and R.

$Jaccard = \frac{TP}{TP + FP + FN}$

(Sec. 16.3)
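A brute-force $O(N^2)$ sketch of these pair counts, written straight from the definitions above (the example labels are illustrative):

```python
from itertools import combinations

def pair_confusion(labels_C, labels_ref):
    """Count pair-level TP/TN/FN/FP for a clustering C against a
    reference clustering (the slide's C-tilde)."""
    tp = tn = fn = fp = 0
    for i, j in combinations(range(len(labels_C)), 2):
        same_C = labels_C[i] == labels_C[j]
        same_ref = labels_ref[i] == labels_ref[j]
        if same_C and same_ref:
            tp += 1
        elif not same_C and not same_ref:
            tn += 1
        elif same_C:            # together in C but not in the reference
            fn += 1
        else:                   # together in the reference but not in C
            fp += 1
    return tp, tn, fn, fp

tp, tn, fn, fp = pair_confusion([0, 0, 1, 1], [0, 0, 1, 2])
rand_index = (tp + tn) / (tp + tn + fp + fn)
precision, recall = tp / (tp + fp), tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```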

Page 63:

Major Dilemma [Jain, 2010]

What is a cluster?

What features should be used?

Should the data be normalized?

How do we define the pairwise similarity?

Which clustering method should be used?

How many clusters are present in the data?

Does the data contain any outliers?

Does the data have any clustering tendency?

Are the discovered clusters and partition valid?