K-Means Algorithm
Each cluster is represented by the mean value of the objects in the cluster.
Input: a set of n objects; the number of clusters, k
Output: a set of k clusters
Algorithm:
Randomly select k samples and mark them as the initial cluster centers.
Repeat:
- Assign/reassign each sample to the cluster to which it is most similar, based on the mean value of the cluster.
- Update each cluster's mean.
Until there is no change.
K-Means (graphical view)
Step 1: Form k centroids, randomly.
Step 2: Calculate the distance between each centroid and each object. Use the Euclidean distance to determine the minimum distance:
d(A, B) = √((x2 - x1)² + (y2 - y1)²)
Step 3: Assign each object to the cluster whose centroid is nearest.
Step 4: Calculate the centroid of each cluster:
C = ((x1 + x2 + … + xn)/n, (y1 + y2 + … + yn)/n)
Go to Step 2. Repeat until there is no change in the centroids.
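The loop above translates directly into code. Below is a minimal NumPy sketch, assuming numeric data in an array X; the sample points and the choice k = 2 are illustrative assumptions, not part of the slides.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k samples at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 2-3: assign each object to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each cluster's mean (keep the old centroid if a
        # cluster happens to go empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # no change -> stop
            break
        centroids = new_centroids
    return labels, centroids

# Illustrative data: two loose groups in the plane.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 1.8], [5.0, 7.0], [4.5, 5.0], [5.5, 6.0]])
labels, centroids = k_means(X, k=2)
```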
K-Medoids (PAM)
Also called Partitioning Around Medoids.
Step 1: Choose k medoids.
Step 2: Assign all points to the closest medoid.
Step 3: Form the distance matrix for each cluster and choose the next best medoid, i.e., the point closest to all other points in the cluster.
Go to Step 2. Repeat until there is no change in any medoid.
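A rough sketch of this PAM-style loop under the same assumptions (numeric data in X); Step 3 is implemented literally, making each cluster's new medoid the member point with the smallest total distance to the others.

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k medoids at random.
    medoid_idx = rng.choice(len(X), size=k, replace=False)
    # Precompute the full point-to-point distance matrix.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    for _ in range(max_iter):
        # Step 2: assign every point to its closest medoid.
        labels = dist[:, medoid_idx].argmin(axis=1)
        # Step 3: in each cluster, pick the point closest to all other points.
        new_idx = medoid_idx.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) > 0:
                sub = dist[np.ix_(members, members)]
                new_idx[j] = members[sub.sum(axis=1).argmin()]
        if np.array_equal(np.sort(new_idx), np.sort(medoid_idx)):  # no change
            break
        medoid_idx = new_idx
    return labels, medoid_idx
```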
What are Hierarchical Methods?
Group data objects into a tree of clusters.
Classified as:
- Agglomerative (bottom-up)
- Divisive (top-down)
Once a merge or split decision is made, it cannot be backtracked.
Types of hierarchical clustering
Agglomerative (bottom-up), e.g., AGNES:
- Places each object in its own cluster, then merges these atomic clusters into larger and larger clusters.
- Variants differ in how they define inter-cluster similarity.
Divisive (top-down), e.g., DIANA:
- All objects are initially in one cluster.
- Subdivides the cluster into smaller and smaller pieces until each object forms a cluster of its own or some termination condition is satisfied.
In both methods, a typical termination condition is the desired number of clusters.
Dendrogram
[Figure: a dendrogram showing the nested sequence of merges, from level 0 (every object in its own cluster) up to level 4 (one cluster).]
Measures of Distance
- Minimum distance (nearest neighbor): single linkage; closely related to the minimum spanning tree.
- Maximum distance (farthest neighbor): complete linkage.
- Mean distance: avoids the outlier-sensitivity problem.
- Average distance: can handle categorical as well as numeric data.
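These measures can be written out directly. A sketch, assuming each cluster is a NumPy array of numeric points (the function names are ours, not a standard API):

```python
import numpy as np

def pairwise(A, B):
    """All pairwise Euclidean distances between the points of clusters A and B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_link(A, B):     # minimum distance (nearest neighbor)
    return pairwise(A, B).min()

def complete_link(A, B):   # maximum distance (farthest neighbor)
    return pairwise(A, B).max()

def average_link(A, B):    # average of all pairwise distances
    return pairwise(A, B).mean()

def mean_distance(A, B):   # distance between the cluster means
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```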
Euclidean Distance
d(A, B) = √(Σi (ai - bi)²), the straight-line distance between two points.
Agglomerative Algorithm
Step 1: Make each object a cluster of its own.
Step 2: Calculate the Euclidean distance from every point to every other point, i.e., construct a distance matrix.
Step 3: Identify the two clusters with the shortest distance and merge them.
Go to Step 2. Repeat until all objects are in one cluster.
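A from-scratch sketch of these steps, using single link (minimum distance) as the inter-cluster measure; the cubic-time loop is for clarity, not efficiency, and numeric data X is assumed.

```python
import numpy as np

def agglomerative_single_link(X):
    # Step 1: every object starts as its own cluster (lists of row indices).
    clusters = [[i] for i in range(len(X))]
    # Step 2: Euclidean distance matrix over all objects.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    merges = []
    while len(clusters) > 1:
        # Step 3: find and merge the two clusters at the shortest distance.
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges   # the merge sequence defines the dendrogram
```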
Agglomerative Algorithm Approaches
- Single link: quite simple, but not very efficient; suffers from the chaining effect.
- Complete link: produces clusters that are more compact than those found using the single-link technique.
- Average link: uses the average distance between all pairs of points in the two clusters.
Simple Example
Given the distance matrix below, single link first merges the pairs at the smallest distance: E-A and C-B (both at distance 1).
Item E A C B D
E 0 1 2 2 3
A 1 0 2 5 3
C 2 2 0 1 6
B 2 5 1 0 3
D 3 3 6 3 0
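The same merges can be obtained from this matrix with SciPy, as a cross-check (a sketch; SciPy's linkage expects the condensed form of the matrix, hence squareform):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

labels = ["E", "A", "C", "B", "D"]
M = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 5, 3],
              [2, 2, 0, 1, 6],
              [2, 5, 1, 0, 3],
              [3, 3, 6, 3, 0]], dtype=float)

Z = linkage(squareform(M), method="single")
print(Z)                                        # each row: clusters merged, distance, size
print(fcluster(Z, t=2, criterion="maxclust"))   # cluster labels when cut into 2 clusters
```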
Another Example
Use the single-link technique to find clusters in the given data set.
Item   X     Y
1      0.40  0.53
2      0.22  0.38
3      0.35  0.32
4      0.26  0.19
5      0.08  0.41
6      0.45  0.30
Plot the given data.
Identify the two nearest clusters and merge them.
Repeat the process until all objects are in the same cluster.
Average link
Uses the average distance matrix between clusters; the procedure is otherwise the same.
Construct a distance matrix:
     1     2     3     4     5     6
1    0
2    0.24  0
3    0.22  0.15  0
4    0.37  0.20  0.15  0
5    0.34  0.14  0.28  0.29  0
6    0.23  0.25  0.11  0.22  0.39  0
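As a sketch, this matrix can be reproduced from the six points with SciPy's pdist (the entries agree with the table above up to the slide's rounding), and both single and average link can then be run on it:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

pts = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

D = squareform(pdist(pts))        # full Euclidean distance matrix
print(np.round(D, 2))             # matches the table above up to rounding

Z_single  = linkage(pdist(pts), method="single")   # single link merges
Z_average = linkage(pdist(pts), method="average")  # average link merges
```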
Divisive Clustering
All items are initially placed in one cluster.
The clusters are repeatedly split in two until every item ends up in a cluster of its own.
[Figure: items A, B, C, D, E are split step by step until each forms its own cluster.]
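One common way to realize the repeated two-way split is to bisect the largest remaining cluster with 2-means; this sketch uses SciPy's kmeans2 and is illustrative only, not necessarily DIANA's exact splitting rule.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def divisive(X):
    clusters = [list(range(len(X)))]          # start with one cluster holding everything
    splits = []
    while any(len(c) > 1 for c in clusters):
        c = max(clusters, key=len)            # split the largest cluster next
        clusters.remove(c)
        _, labels = kmeans2(X[c], 2, minit="points")   # 2-means bisection
        left  = [i for i, l in zip(c, labels) if l == 0]
        right = [i for i, l in zip(c, labels) if l == 1]
        if not left or not right:             # degenerate split: just cut in half
            left, right = c[: len(c) // 2], c[len(c) // 2:]
        clusters += [left, right]
        splits.append((left, right))
    return splits                             # the split sequence, top-down
```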
Difficulties in Hierarchical Clustering
- Difficulty in selecting merge or split points. This decision is critical because further merge or split decisions are based on the newly formed clusters.
- The method does not scale well.
- Hierarchical methods are therefore often integrated with other clustering techniques to form multiple-phase clustering.
Types of hierarchical clustering techniques
- BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies.
- ROCK: Robust Clustering using Links; explores the concept of links.
- CHAMELEON: a hierarchical clustering algorithm using dynamic modeling.
Outlier Analysis
Outliers are data objects that are different from, or inconsistent with, the remaining set of data.
Outliers can be caused by:
- Measurement or execution error
- Inherent data variability
Outlier detection can be used in fraud detection.
Outlier detection and analysis is referred to as outlier mining.
Applications of outlier mining
- Fraud detection
- Customized marketing: identifying the spending behavior of customers with extremely low or high incomes
- Medical analysis: finding unusual responses to various medical treatments
What is outlier mining?
Given a set of n data points or objects and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data.
This breaks into two subproblems:
- Define what data can be considered inconsistent in a given data set.
- Find a method to mine the outliers so defined.
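Stated this way, the task has a direct sketch: score each object by how dissimilar it is from the rest and return the k highest scorers. Using the mean distance to all other points as the score is an assumed, simple choice, not the only one.

```python
import numpy as np

def top_k_outliers(X, k):
    # Pairwise Euclidean distances between all n objects.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Dissimilarity score: mean distance to the other n - 1 objects.
    score = D.sum(axis=1) / (len(X) - 1)
    return np.argsort(score)[::-1][:k]   # indices of the k most dissimilar objects
```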
Methods of outlier detection
- Statistical approach
- Distance-based approach
- Density-based local outlier approach
- Deviation-based approach
Statistical Distribution Approach
Identifies outliers with respect to a discordancy test.
A discordancy test examines a working hypothesis and an alternative hypothesis.
It verifies whether an object o_i is significantly large in relation to the distribution F.
The result either retains the working hypothesis or rejects it in favor of an alternative distribution:
- Inherent alternative distribution
- Mixture alternative distribution
- Slippage alternative distribution
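A minimal sketch of a discordancy-style test, assuming a normal working hypothesis and the conventional (assumed) threshold of three standard deviations:

```python
import numpy as np

def discordant(values, t=3.0):
    # Working hypothesis: the values come from one normal distribution F.
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std(ddof=1)
    # Objects lying more than t standard deviations out are flagged as discordant.
    return np.where(np.abs(z) > t)[0]
```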
Procedures for detecting outliers
- Block procedures: all suspect objects are treated as outliers, or all of them are accepted as consistent.
- Consecutive procedures: the object least likely to be an outlier is tested first. If it is found to be an outlier, then all of the more extreme values are also considered outliers; otherwise, the next most extreme object is tested, and so on.