Foundations of Data Analysis I: Clustering
(homel.vsb.cz/~kud007/files/madi5.pdf)
Sources
• Bramer, M. (2013). Principles of Data Mining. Springer. [pp. 221–238]
• Zaki, M. J., Meira Jr., W. (2014). Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press. [pp. 335–377]
Basics of Clustering
• Clustering is concerned with grouping together objects that are similar to each other and dissimilar to objects belonging to other clusters.
• We consider methods for which the similarity between objects is based on a measure of the distance between them:
– k-Means algorithm
– Agglomerative Hierarchical Clustering
Applications
• Countries whose economies are similar
• Companies that have similar financial performance
• Customers with similar buying behavior
• Patients with similar symptoms
• Documents with related content
Dimension
• If objects are described by the values of just two attributes,
we can represent them as points in a two-dimensional
space.
• In the case of three attributes we can think of the objects as points in a three-dimensional space, and visualizing clusters is still straightforward.
• For larger dimensions it becomes impossible to visualize the points, let alone the clusters.
Centroid
• A measure commonly used when clustering is the Euclidean
distance.
• We will assume that all attribute values are continuous.
• The centroid of a cluster is the point for which each attribute
value is the average of the values of the corresponding
attribute for all the points in the cluster.
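Both notions have compact standard formulations; the symbols below (m attributes, cluster C, centroid μ_C) are conventional notation, not taken verbatim from the slides:

```latex
% Euclidean distance between points x and y with m attributes
\operatorname{dist}(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}

% Centroid of cluster C: the attribute-wise mean of its points
\boldsymbol{\mu}_C = \frac{1}{|C|} \sum_{\mathbf{x} \in C} \mathbf{x}
```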
How to form clusters?
• There are many ways in which k clusters might potentially be formed.
• We can measure the quality of a set of clusters using the value of an
objective function.
• The objective function is the sum of the squares of the distances of each point from the centroid of the cluster to which it is assigned (written out below).
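With clusters C_1, …, C_k and centroids μ_1, …, μ_k, this objective is commonly called the sum of squared errors (SSE); the notation here is standard rather than the slides' own:

```latex
% Sum of squared errors: lower values indicate tighter clusters
\mathrm{SSE} = \sum_{j=1}^{k} \sum_{\mathbf{x} \in C_j} \left\lVert \mathbf{x} - \boldsymbol{\mu}_j \right\rVert^{2}
```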
k-Means
• Proposed by Stuart Lloyd in 1957; the term "k-means" was coined by James MacQueen in 1967.
• Exclusive clustering algorithm; each object is assigned to
precisely one of a set of clusters.
• The algorithm starts by deciding how many clusters we
would like to form from our data (k).
• The value of k is generally a small integer, such as 2, 3, 4 or
5.
k-Means Algorithm
• k-means employs a greedy iterative approach to find a clustering that minimizes
the objective function.
• It can converge to a local optimum instead of the globally optimal clustering.
• Because the method starts with a random guess for the initial centroids, k-means is typically run several times, and the run with the lowest SSE value is chosen as the final clustering (see the sketch below).
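A minimal NumPy sketch of this restart strategy; the function name `kmeans` and its parameters are illustrative assumptions, not code from the slides:

```python
import numpy as np

def kmeans(X, k, n_runs=10, max_iter=100, seed=0):
    """Minimal k-means: returns (labels, centroids, sse) of the best run."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_runs):
        # Start from k distinct data points chosen at random as centroids.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # Assign each point to its nearest centroid (Euclidean distance).
            d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Recompute each centroid as the mean of its assigned points;
            # keep the old centroid if a cluster becomes empty.
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(k)])
            if np.allclose(new, centroids):
                break
            centroids = new
        sse = ((X - centroids[labels]) ** 2).sum()
        # Keep the run with the lowest SSE across all restarts.
        if best is None or sse < best[2]:
            best = (labels, centroids, sse)
    return best

# Tiny usage example on two visually obvious groups of 2-D points.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 9.0], [7.8, 8.2]])
labels, centroids, sse = kmeans(X, k=2)
print(labels, sse)
```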
Hierarchical Clustering
• Hierarchical Clustering seeks to build a hierarchy of
clusters.
– Agglomerative: This is a "bottom up" approach: each observation
starts in its own cluster, and pairs of clusters are merged as one
moves up the hierarchy.
– Divisive: This is a "top down" approach: all observations start in one
cluster, and splits are performed recursively as one moves down the
hierarchy.
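As a quick illustration of the agglomerative approach, SciPy's hierarchical-clustering routines can be used as follows (this relies on SciPy's public API, not code from the slides; the data points are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D points forming two visually separated groups.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 9.0], [7.8, 8.2]])

# Agglomerative clustering; 'single', 'complete', and 'average'
# correspond to the linkage criteria discussed below.
Z = linkage(X, method='single')

# Cut the hierarchy so that exactly two flat clusters remain.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```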
Distance Matrix
• It would be very inefficient to calculate the distance between each pair of
clusters for each pass through the algorithm.
• The usual approach is to generate and maintain a distance matrix giving
the distance between each pair of clusters.
• Note that the matrix is symmetric, so not all values have to be
calculated.
• The values on the diagonal from the top-left corner to the bottom-right corner must always be zero.
• How do we measure the distance between two clusters, each of which may consist of a single point? Three standard criteria are listed below and written as formulas after this list.
– Single-link clustering: The distance between two clusters is taken to be the
shortest distance from any member of one cluster to any member of the
other cluster.
– Complete-link clustering: The distance between two clusters is taken to be
the longest distance from any member of one cluster to any member of the
other cluster.
– Average-link clustering: The distance between two clusters is taken to be
the average distance from any member of one cluster to any member of the
other cluster.
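Writing dist(x, y) for the Euclidean distance defined earlier, the three criteria can be stated as follows (standard notation, not verbatim from the slides):

```latex
% Single link: closest cross-cluster pair
d_{\min}(C_a, C_b) = \min_{\mathbf{x} \in C_a,\, \mathbf{y} \in C_b} \operatorname{dist}(\mathbf{x}, \mathbf{y})

% Complete link: farthest cross-cluster pair
d_{\max}(C_a, C_b) = \max_{\mathbf{x} \in C_a,\, \mathbf{y} \in C_b} \operatorname{dist}(\mathbf{x}, \mathbf{y})

% Average link: mean over all cross-cluster pairs
d_{\mathrm{avg}}(C_a, C_b) = \frac{1}{|C_a|\,|C_b|} \sum_{\mathbf{x} \in C_a} \sum_{\mathbf{y} \in C_b} \operatorname{dist}(\mathbf{x}, \mathbf{y})
```

The symmetric distance matrix itself can be precomputed once, e.g. with SciPy (an illustration with made-up points, not the slides' code):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0]])
# pdist returns only the strict upper triangle (the matrix is symmetric);
# squareform expands it to the full matrix with zeros on the diagonal.
D = squareform(pdist(X, metric='euclidean'))
print(D)
```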