An Improved K-medoid Clustering Algorithm

DESCRIPTION
The k-medoids algorithm is a clustering algorithm related to the k-means algorithm and the medoidshift algorithm.

TRANSCRIPT
An Improved K-medoid Clustering Algorithm
By
Mohammad Imran Kabir: 090911543
Birendra Singh Airy: 100911335
Under the guidance of
Ms. Aparna Nayak, Assistant Professor, Dept. of ICT, MIT, Manipal
CONTENTS
• Introduction
• Literature Survey
• Design
• Implementation
• Results
• Conclusion
• References
INTRODUCTION
What exactly is K-medoid Clustering?
• K-medoid is a classical partitioning technique of clustering that partitions a data set of n objects into k clusters, with k known a priori.
• A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e., it is the most centrally located point in the given data set.
• Actual objects are chosen to represent the clusters, using one representative object per cluster. Each remaining object is clustered with the representative object to which it is the most similar.
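As a toy illustration of the definition above (assuming 2-D points and Euclidean dissimilarity, which the slides do not specify), the medoid can be found by minimizing the average distance over the actual data points:

```python
import math

def medoid(points):
    """Return the actual point whose average distance to all points is smallest."""
    def avg_dist(p):
        return sum(math.dist(p, q) for q in points) / len(points)
    return min(points, key=avg_dist)

points = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
print(medoid(points))  # → (1, 1): the most centrally located actual point
```

Unlike a k-means centroid, the result is always one of the original objects, so it stays meaningful even with outliers such as (5, 5).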
LITERATURE SURVEY
• A hybrid algorithm for k-medoid clustering of large datasets, published by Weiguo Sheng, Dept. of Information Systems and Computing, Brunel University, London, UK, in June 2004, where a local search heuristic selects the k medoids from the data set and tries to efficiently minimize the total dissimilarity within each cluster.
• Parallelization of the k-medoid clustering algorithm, implemented by W. Aljoby from Queen Arwa University, Yemen, in March 2013, where the k-medoid algorithm is divided into tasks that are mapped onto a multiprocessor system.
DESIGN
Class Diagram
Data Flow Diagram
Activity Diagram
IMPLEMENTATION
BASIC PAM
The most common realization of k-medoids clustering is the Partitioning Around Medoids (PAM) algorithm.

Input: k: the number of clusters; Dataset: a data set containing n objects.
Output: A set of k clusters.
1. Arbitrarily choose k objects in Dataset as the initial representative objects or seeds.
2. Repeat
3. Assign each remaining object to the cluster with the nearest representative object.
4. Randomly select a non-representative object, Orandom.
5. Compute the total cost, S, of swapping a representative object, Oj, with Orandom.
6. If S < 0 then swap Oj with Orandom to form the new set of k representative objects.
7. Until no change.
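The loop above can be sketched in Python. This is a minimal illustration, not the project's code: it assumes 2-D points, Euclidean distance as the dissimilarity, and a fixed trial budget standing in for the literal "until no change" test.

```python
import math
import random

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k, iterations=200, seed=0):
    """Random-swap PAM following the numbered steps above."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)          # step 1: arbitrary seeds
    cost = total_cost(points, medoids)
    for _ in range(iterations):              # steps 2/7: repeat ... until no change
        o_random = rng.choice([p for p in points if p not in medoids])  # step 4
        j = rng.randrange(k)
        candidate = medoids[:j] + [o_random] + medoids[j + 1:]
        s = total_cost(points, candidate) - cost                        # step 5
        if s < 0:                            # step 6: keep the cheaper medoid set
            medoids, cost = candidate, cost + s
    clusters = {m: [] for m in medoids}      # step 3: final assignment
    for p in points:
        clusters[min(medoids, key=lambda m: math.dist(p, m))].append(p)
    return medoids, clusters
```

Run on two well-separated blobs of three points each, the swap loop settles on one medoid per blob, since any swap that puts both medoids in one blob raises the total cost and is rejected.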
Distribution of Data
Cluster formation after initial medoid assumption
Cluster formation after swapping
IMPROVEMENT OF THE ALGORITHM
• To resolve the drawbacks of the traditional PAM algorithm, we introduce a new, improved k-medoid clustering algorithm based on a CF-tree.
• This CF-tree algorithm operates on the clustering features of the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm.
Methodology
• We store all the training sample data in a CF-Tree (Clustering Feature Tree), then use the k-medoids method to cluster the CFs in the leaf nodes of the CF-Tree.
• Eventually, we obtain k clusters from the root of the CF-Tree.
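In BIRCH, a clustering feature summarizes a set of points as the triple CF = (N, LS, SS): the count, the linear sum, and the square sum of the points. The centroid and radius can then be computed from the CF alone, without rereading the raw data. A minimal 2-D sketch, assuming the standard BIRCH radius (average distance of members from the centroid); this is illustrative, not the project's code:

```python
import math

class CF:
    """Clustering feature: N (count), LS (linear sum), SS (square sum)."""
    def __init__(self):
        self.n, self.ls, self.ss = 0, (0.0, 0.0), 0.0

    def add(self, p):
        """Fold one 2-D point into the summary; raw points are not kept."""
        self.n += 1
        self.ls = (self.ls[0] + p[0], self.ls[1] + p[1])
        self.ss += p[0] ** 2 + p[1] ** 2

    def centroid(self):
        return (self.ls[0] / self.n, self.ls[1] / self.n)

    def radius(self):
        # R = sqrt(SS/N - ||LS/N||^2): avg. distance of members from centroid
        cx, cy = self.centroid()
        return math.sqrt(max(self.ss / self.n - (cx * cx + cy * cy), 0.0))

cf = CF()
for p in [(0, 0), (2, 0)]:
    cf.add(p)
print(cf.centroid(), cf.radius())  # → (1.0, 0.0) 1.0
```

Two CFs can also be merged by adding their components, which is what lets assignment results propagate up the tree cheaply.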
ALGORITHM
• Input: k: the number of clusters; Dataset: a data set containing n objects; B: the maximum number of children for nonleaf nodes in the CF-Tree, set B = k; L: the maximum number of entries for leaf nodes in the CF-Tree; T: the threshold for the maximum diameter of an entry.
• Output: A set of k clusters.
1. Use the data points in Dataset to create a CF-Tree, tree*.
2. Arbitrarily choose k leaf nodes in tree* as the initial representative objects or seeds.
3. Repeat
4. Assign each remaining leaf node to the cluster {Oj} (j = 1, 2, ..., k) with the nearest representative object, based on formula (4).
5. The assignments result in updated CF values, which have to be propagated back to the root of the CF-Tree.
6. Recompute the radius of all nodes based on formula (3); if the radius of any node exceeds the threshold value T, one or several splits of that node can happen.
7. Randomly select a non-representative object in the leaf nodes, Orandom, with Orandom ≠ Oj.
8. Compute the total cost, S, of swapping representative object Oj with Orandom.
9. If S < 0 then swap Oj with Orandom to form the new set of k representative objects.
10. Until no change.
In this way we get k clusters from the root of tree*, because B = k.
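Step 6's threshold test can be illustrated on a single leaf entry. The helper below is hypothetical (the slides define no interface): it recomputes the would-be radius from the entry's (N, LS, SS) summary and absorbs the point only if the radius stays within T; otherwise the caller creates a new entry, which is what can trigger a node split.

```python
import math

def absorb_or_split(entry, p, T):
    """entry = (n, ls, ss) summarizing the current 2-D points of a leaf entry.
    Return the updated entry if adding p keeps the radius <= T, else None
    (signalling the caller to open a new entry / split the node)."""
    n = entry[0] + 1
    ls = (entry[1][0] + p[0], entry[1][1] + p[1])
    ss = entry[2] + p[0] ** 2 + p[1] ** 2
    cx, cy = ls[0] / n, ls[1] / n
    radius = math.sqrt(max(ss / n - (cx * cx + cy * cy), 0.0))
    return (n, ls, ss) if radius <= T else None

leaf_entry = (1, (0.0, 0.0), 0.0)                 # one point at the origin
print(absorb_or_split(leaf_entry, (1, 0), 1.0))   # absorbed: (2, (1.0, 0.0), 1.0)
print(absorb_or_split(leaf_entry, (10, 0), 1.0))  # None -> new entry / split
```

A smaller T therefore yields more, tighter leaf entries, and a larger T yields fewer, coarser ones, which is the main tuning knob of the CF-Tree stage.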
RESULTS
Operation On A Small Dataset
• When we tested the basic PAM algorithm on a small dataset of about 100 entries, the total time to cluster the points was around 2 milliseconds, and the post-swap cost was around 585, which indicates the optimal cost between the data points and the medoids.
• When we tested the improved CF-tree algorithm on a small dataset, we got the following result:
• It took around the same time, i.e., about 2 milliseconds. As the result shows, there is no difference in the total computation time of cluster formation even after using the enhanced CF-tree algorithm.
Operation On A Large Dataset
• As we increase the size of the dataset to about 3000 entries, the time taken by the basic PAM algorithm is about 49 ms.
• The time taken by the improved CF-tree algorithm is about 42 ms, which is significantly less than that of the basic PAM algorithm.
• Plotting the total number of data points against the total time taken by the two algorithms gives the following curve:
Conclusion
• The experimental results show that the CFk-medoids algorithm reduces the running time and increases the accuracy of the result.
• CFk-medoids would benefit even more from larger datasets.
REFERENCES
• J. Han, M. Kamber. Data Mining: Concepts and Techniques, Beijing: China Machine Press, 2006.
• Rui Xu, D. Wunsch. "Survey of clustering algorithms", IEEE Transactions on Neural Networks, 2005, Vol. 16, No. 3, pp. 645-678.
• C. Ordonez. "Clustering binary data streams with K-means", Proceedings of DMKD'03, June 2003, Vol. 13, pp. 12-19.
• A. K. Jain, M. N. Murty, P. J. Flynn. "Data clustering: a review", ACM Computing Surveys, 1999, Vol. 31, No. 3, pp. 264-323.
• Stephen Johnson. "Hierarchical clustering schemes", Psychometrika, 1967, Vol. 32, No. 3, pp. 241-254.
• D. Barbara. "Requirements for clustering data streams", ACM SIGKDD Explorations Newsletter, 2003, Vol. 3, No. 2, pp. 23-27.
• G. Karypis, E. H. Han, V. Kumar. "CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling", IEEE Computer, 1999, Vol. 32, No. 8, pp. 68-75.
THANK YOU