
DATA ANALYSIS I

Foundations of Data Analysis I: Clustering

Sources

• Bramer, M. (2013). Principles of Data Mining. Springer. [pp. 221–238]

• Zaki, M. J., & Meira Jr., W. (2014). Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press. [pp. 335–377]

Basics of Clustering

• Clustering is concerned with grouping together objects that are similar to each other and dissimilar to objects belonging to other clusters.

• We focus on methods in which the similarity between objects is based on a measure of the distance between them:

– k-Means algorithm

– Agglomerative Hierarchical Clustering

Applications

• Countries whose economies are similar

• Companies that have similar financial performance

• Customers with similar buying behavior

• Patients with similar symptoms

• Documents with related content

Dimension

• If objects are described by the values of just two attributes, we can represent them as points in a two-dimensional space.

• In the case of three attributes, we can think of the objects as points in a three-dimensional space, and visualizing clusters is still straightforward.

• For larger dimensions it becomes impossible to visualize the points, let alone the clusters.

2D Example

• How many clusters?

Centroid

• A distance measure commonly used when clustering is the Euclidean distance.

• We will assume that all attribute values are continuous.

• The centroid of a cluster is the point for which each attribute value is the average of the values of the corresponding attribute for all the points in the cluster.
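To make the definitions concrete, here is a minimal Python sketch (the function names are ours, not from the slides) of the Euclidean distance between two points and the centroid of a cluster:

```python
import numpy as np

def euclidean(p, q):
    """Euclidean distance between two points given as coordinate sequences."""
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def centroid(points):
    """Centroid: the per-attribute average over all points in the cluster."""
    return np.asarray(points).mean(axis=0)

cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]
print(centroid(cluster))          # [3. 2.]
print(euclidean((0, 0), (3, 4)))  # 5.0
```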

Centroid Example

How to form clusters?

• There are many ways in which k clusters might potentially be formed.

• We can measure the quality of a set of clusters using the value of an objective function.

• The objective function takes the sum of the squares of the distances of each point from the centroid of the cluster to which it is assigned.
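Written out (the notation is ours), for clusters C_1, …, C_k with centroids μ_1, …, μ_k, this objective is the sum of squared errors (SSE):

```latex
\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
```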

Clusters

k-Means

• First described by Stuart Lloyd in 1957; the name "k-means" was introduced by James MacQueen in 1967.

• Exclusive clustering algorithm; each object is assigned to precisely one of a set of clusters.

• The algorithm starts by deciding how many clusters we would like to form from our data (k).

• The value of k is generally a small integer, such as 2, 3, 4 or 5.

k-Means Algorithm

• k-means employs a greedy iterative approach to find a clustering that minimizes

the objective function.

• It can converge to a local optima instead of a globally optimal clustering.

• Because the method starts with a random guess for the initial centroids, k-

Means is typically run several times, and the run with the lowest SSE value is

chosen to report the final clustering.
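A minimal sketch of this loop in Python (with NumPy; all names are ours), including the multiple-restart strategy just described:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """A minimal k-means: pick k random points as initial centroids,
    then alternate assignment and update steps until convergence."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, sse

# Run several times with different random starts; keep the lowest-SSE result.
X = [(1, 1), (1.5, 2), (8, 8), (8.5, 9), (0.5, 1.5), (9, 8)]
best = min((kmeans(X, k=2, seed=s) for s in range(10)), key=lambda r: r[2])
print(best[2])  # SSE of the best of 10 runs
```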

1D Data (worked example, shown as figures)

2D Data (worked example over successive iterations, shown as figures)

Hierarchical Clustering

• Hierarchical Clustering seeks to build a hierarchy of clusters.

– Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

– Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Agglomerative Hierarchical Clustering

• Algorithm (the basic agglomerative procedure):

1. Assign each object to its own single-object cluster and calculate the distances between all pairs of clusters.

2. Choose the closest pair of clusters and merge them into a single cluster.

3. Recalculate the distances between the new cluster and each of the old clusters.

4. Repeat steps 2 and 3 until all objects are in one cluster (or a stopping criterion is met).
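As a practical aside (our sketch, not part of the original slides), SciPy implements this procedure directly:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Six 2-D points; linkage() runs the agglomerative procedure above
# and returns the full merge history.
X = np.array([(1.0, 1.0), (1.5, 2.0), (8.0, 8.0),
              (8.5, 9.0), (0.5, 1.5), (9.0, 8.0)])
Z = linkage(X, method='single')                    # single-link distances
labels = fcluster(Z, t=2, criterion='maxclust')    # cut the hierarchy at 2 clusters
print(labels)
# dendrogram(Z) would plot the merge tree like the one shown on the next slides.
```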

Example

How to…

Clusters After Two Steps

Dendrogram

Distance Matrix

• It would be very inefficient to calculate the distance between each pair of clusters for each pass through the algorithm.

• The usual approach is to generate and maintain a distance matrix giving the distance between each pair of clusters.

• Note that the matrix is symmetric, so not all values have to be calculated.

• The values on the diagonal from the top-left corner to the bottom-right corner must always be zero.
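A quick way to build such a matrix in Python (our illustration): pdist computes only one triangle, since the matrix is symmetric, and squareform expands it to the full matrix with zeros on the diagonal.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (8.5, 9.0)])
D = squareform(pdist(X, metric='euclidean'))  # full symmetric distance matrix
print(np.round(D, 2))
```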

Example

How to…

• How should we measure the distance between a cluster and a single point (a one-member cluster), or between two clusters?

– Single-link clustering: The distance between two clusters is taken to be the shortest distance from any member of one cluster to any member of the other cluster.

– Complete-link clustering: The distance between two clusters is taken to be the longest distance from any member of one cluster to any member of the other cluster.

– Average-link clustering: The distance between two clusters is taken to be the average distance from any member of one cluster to any member of the other cluster.
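These three measures are easy to express directly. A small sketch (our code), reusing a pairwise point-distance matrix D like the one built above:

```python
import numpy as np

def cluster_distance(D, a, b, link='single'):
    """Distance between clusters a and b (lists of point indices),
    computed from the pairwise point-distance matrix D."""
    pair_dists = D[np.ix_(a, b)]       # all member-to-member distances
    if link == 'single':
        return pair_dists.min()        # shortest member-to-member distance
    if link == 'complete':
        return pair_dists.max()        # longest member-to-member distance
    if link == 'average':
        return pair_dists.mean()       # average member-to-member distance
    raise ValueError(link)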

Distance Matrix After Mergers

Dendrogram

When to Stop…

• We can merge clusters until only some pre-defined number remain.

• We can stop merging when a newly created cluster fails to meet some criterion for its compactness.

– The average distance between the objects in the cluster is too high.
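A possible compactness check in code (our sketch; the 5.0 threshold is an arbitrary assumption):

```python
import numpy as np
from scipy.spatial.distance import pdist

def average_intra_distance(points):
    """Average pairwise distance between the objects in a cluster,
    used as a simple compactness criterion."""
    return pdist(np.asarray(points)).mean()

# Reject a merge if the merged cluster would be too spread out.
merged = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0)]
if average_intra_distance(merged) > 5.0:
    print("reject merge: cluster too spread out")
```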