project 0th review
Post on 23-Jan-2015
Data Mining / Clustering
A Combined Approach for Clustering based on the GSA-KM and Genetic Algorithms
Divakar Raj.M (0901016)
Dilip.M (0901015)
Kishore Kumar.C (0901036)
IV CSE - A
Under the guidance of
Mr.P.Perumal
Associate Professor
Department of Computer Science and Engineering (UG)
Introduction about Data Mining
• Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) information or patterns from data in large databases
• Potential Applications
– Market analysis and management
– Risk analysis and management
– Fraud detection and management
– Text mining (news group, email, documents) and Web analysis
– Intelligent query answering
Data Mining: A KDD Process
– Data mining: the core of knowledge discovery process.
[Figure: the KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Architecture of a Typical Data Mining System
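[Figure: layered architecture, bottom to top. Databases and Data Warehouse (data cleaning, data integration, filtering) → Database or data warehouse server → Data mining engine → Pattern evaluation → Graphical user interface, with a Knowledge-base alongside the data mining engine and pattern evaluation layers]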
Data Mining Functionalities
• Concept description: Characterization and discrimination
– Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
• Association (correlation and causality)
– Multi-dimensional vs. single-dimensional association
– age(X, “20..29”) ^ income(X, “20..29K”) → buys(X, “PC”)
– contains(T, “computer”) → contains(T, “software”)
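A rule such as contains(T, “computer”) → contains(T, “software”) is usually judged by how often it applies and how often it holds. The sketch below is illustrative only: the measures (support and confidence), the function name `rule_stats`, and the toy transactions are assumptions for the example and are not taken from the slides.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = fraction of all transactions containing both items
    confidence = fraction of transactions with the antecedent that
                 also contain the consequent
    """
    n = len(transactions)
    with_antecedent = [t for t in transactions if antecedent in t]
    with_both = [t for t in with_antecedent if consequent in t]
    support = len(with_both) / n if n else 0.0
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Hypothetical toy transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

support, confidence = rule_stats(transactions, "computer", "software")
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Here two of the four transactions contain both items, and two of the three transactions with a computer also contain software.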
Data Mining Functionalities
• Classification and Prediction
– Finding models (functions) that describe and distinguish classes or concepts for future prediction
– E.g., classify countries based on climate, or classify cars based on gas mileage
– Presentation: decision-tree, classification rule, neural network
– Prediction: Predict some unknown or missing numerical values
• Cluster analysis
– Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
– Clustering based on the principle: maximizing the intra-class similarity and minimizing the inter-class similarity
Data Mining Functionalities
• Outlier analysis
– Outlier: a data object that does not comply with the general behavior
of the data
– It can be considered as noise or exception but is quite useful in fraud
detection, rare events analysis
• Trend and evolution analysis
– Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
– Similarity-based analysis
Issues in Data mining
• Individual Privacy
• Data Integrity
• Relational Database Structure (vs) Multidimensional One
• Issue of Cost
• Mining methodology and user interaction issues
• Performance issues
• Issues relating to the diversity of database types
Applications
• Database analysis and decision support
– Market analysis and management
• Target Marketing, Customer Relation Management, Market
Basket Analysis, Cross Selling, Market Segmentation
– Risk analysis and management
• Forecasting, Customer Retention, Improved Underwriting,
Quality Control, Competitive Analysis
Applications
• Text mining (news group, email, documents) and Web analysis
• Intelligent query answering
• Sports
• Astronomy
• Internet Web Surf-Aid
Clustering
• Clustering is a data mining (machine learning) technique used to place data elements into related groups without advance knowledge of the group definitions
• The result is a set of meaningful subclasses, called clusters
Cluster Analysis
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms
What Is Good Clustering?
• A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the similarity
measure used by the method and its implementation.
• The quality of a clustering method is also measured by its ability
to discover some or all of the hidden patterns
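The two quality criteria above can be made concrete with a small calculation. This is a minimal sketch, assuming Euclidean distance as the (inverse) similarity measure: a low average distance to the cluster mean stands in for high intra-class similarity, and a large distance between cluster means stands in for low inter-class similarity. The function names and the toy clusters are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def mean_intra_distance(clusters):
    """Average distance from each point to the mean of its own cluster."""
    total, count = 0.0, 0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += math.dist(p, c)
            count += 1
    return total / count


def min_inter_distance(clusters):
    """Smallest distance between the means of two different clusters."""
    cents = [centroid(points) for points in clusters]
    return min(
        math.dist(a, b)
        for i, a in enumerate(cents)
        for b in cents[i + 1:]
    )


# Two tight, well-separated toy clusters.
clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
print(mean_intra_distance(clusters), min_inter_distance(clusters))
```

A good clustering under these assumptions keeps the first number small relative to the second.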
Requirements of Clustering in Data Mining
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and Usability
Major Clustering Approaches
• Partitioning algorithms: Construct various partitions and then
evaluate them by some criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the
set of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the clusters
and the idea is to find the best fit of the data to that model
Issues of Clustering
• Assessment of results
• Choice of appropriate number of clusters
• Data preparation
• Proximity measures
• Handling outliers
General Applications of Clustering
• Pattern Recognition
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use: Identification of areas of similar land use in an earth observation database
• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
• City-planning: Identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
Literature Survey
[1] An Architecture for Component-Based Design of Representative-Based Clustering Algorithms
Boris Delibašić, Milan Vukićević, Miloš Jovanović, Kathrin Kirchner, Johannes Ruhland, Milija Suknović (2012)
[2] The Research of Imbalanced Data Set of Sample Sampling Method Based on K-Means Cluster and Genetic Algorithm
Yang Yong (2012)
[3] A Combined Approach for Clustering based on K-means and Gravitational Search Algorithms
Abdolreza Hatamlou, Salwani Abdullah, Hossein Nezamabadi-pour (2012)
An Architecture for Component-Based Design of Representative-Based Clustering Algorithms
• Based on reusable components
• Components derived from K-Means like algorithms and their extensions
• The new algorithm is built by exchanging components from the original algorithm and their improvements
• The component-based design makes it possible to compare and evaluate representative-based clustering algorithms
The Research of Imbalanced Data Set of Sample Sampling Method
Based on K-Means Cluster and Genetic Algorithm
• K-means is used to cluster the data; within each cluster, a GA validates candidates and generates new samples
• Improves classification performance on imbalanced datasets
• Generates new samples for the minority class of the imbalanced dataset
• Focuses on the classification accuracy of the minority classes
A Combined Approach for Clustering based on K-means and
Gravitational Search Algorithms
• A hybrid data clustering algorithm based on GSA and k-means (GSA-KM) is presented
• It uses the advantages of both algorithms
• Comparison of the performance of GSA-KM with other well-known algorithms
– K-means
– Genetic Algorithm (GA)
– Simulated Annealing (SA)
– Ant Colony Optimization (ACO)
– Honey Bee Mating Optimization (HBMO)
– Particle Swarm Optimization (PSO)
– Gravitational Search Algorithm (GSA)
• Comparison based on real and standard datasets from the UCI repository
Existing System
K-Means
• One of the most widely used and computationally efficient clustering algorithms
• Starts with some random or heuristic-based centroids for the desired
clusters
• Assigns every data object to the closest centroid
• Iteratively refines the current centroids to reach the (near) optimal ones by
calculating the mean value of data objects within their respective clusters
• The algorithm will terminate when any one of the specified termination
criteria is met (i.e., a predetermined maximum number of iterations is
reached, a (near) optimal solution is found or the maximum search time is
reached)
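The steps above can be sketched in a few lines. This is a minimal illustration rather than the project's implementation: it assumes Euclidean distance, random initial centroids drawn from the data, and two of the termination criteria (a maximum number of iterations, or centroids that no longer change). The function name `kmeans` and its parameters are hypothetical.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of numeric tuples.

    Returns (centroids, clusters) where clusters[i] holds the points
    assigned to centroids[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

Because the result depends on the initial centroids, different seeds can give different partitions on less clearly separated data.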
Existing System
Gravitational Search Algorithm
• Inspired by the physical phenomenon of gravity
• Based on the interaction of masses in the universe via the Newtonian law of gravity
• Attraction depends on the masses and the distance between them
• $F = G \frac{M_1 M_2}{R^2}$
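A short numeric check shows the two dependencies stated above: the force grows with the product of the masses and falls with the square of the distance. This sketch only evaluates Newton's formula as written on the slide; it is not the force rule of the GSA optimizer itself, and the function name and the choice of `g=1.0` in the usage lines are assumptions made for readability.

```python
import math

G = 6.674e-11  # gravitational constant in N·m²/kg²


def gravitational_force(m1, m2, r, g=G):
    """Magnitude of the Newtonian attraction between masses m1 and m2 at distance r."""
    if r <= 0:
        raise ValueError("distance r must be positive")
    return g * m1 * m2 / r ** 2


# With g set to 1 for readability: doubling the distance quarters the force.
near = gravitational_force(2.0, 3.0, 1.0, g=1.0)
far = gravitational_force(2.0, 3.0, 2.0, g=1.0)
print(near, far)
```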
Drawbacks of Existing System
K-Means
• Performance is highly dependent on the initial state of the centroids
• May converge to a local optimum rather than the global optimum
• The number of clusters is needed as input to the algorithm, i.e. the number of clusters is assumed to be known
GSA-KM
• Built on three main steps
1. GSA-KM applies the k-means algorithm to the selected dataset to produce near-optimal centroids for the desired clusters
2. The approach then produces an initial population of candidate solutions
3. GSA is applied to this population
Producing the initial population
• One candidate solution is the output of the k-means algorithm obtained in the previous step
• Three candidate solutions are created from the dataset itself, and the remaining solutions are produced randomly
• GSA is then employed to determine an optimal solution for the clustering problem
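The seeding scheme above can be sketched as follows. The slides do not say how the three dataset-based solutions are derived, so this example assumes each one is a set of k data points sampled from the dataset; the random solutions are assumed to be drawn uniformly within the per-attribute bounds of the data. The function name `initial_population` and all parameters are hypothetical.

```python
import random


def initial_population(points, kmeans_centroids, size, seed=0):
    """Build `size` candidate solutions, each a list of k centroids (tuples)."""
    if size < 4:
        raise ValueError("size must allow 1 k-means and 3 dataset-based solutions")
    rng = random.Random(seed)
    k = len(kmeans_centroids)
    dim = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dim)]
    hi = [max(p[d] for p in points) for d in range(dim)]

    # 1 solution: the centroids already found by k-means.
    population = [list(kmeans_centroids)]
    # 3 solutions from the dataset (ASSUMPTION: k sampled data points each).
    for _ in range(3):
        population.append(rng.sample(points, k))
    # Remaining solutions: random centroids inside the data's bounding box.
    while len(population) < size:
        population.append([
            tuple(rng.uniform(lo[d], hi[d]) for d in range(dim))
            for _ in range(k)
        ])
    return population


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
seed_centroids = [(0.0, 0.5), (10.0, 10.5)]  # e.g. output of a k-means run
population = initial_population(points, seed_centroids, size=6)
print(len(population), population[0])
```

Placing the k-means result in the population means the search starts with at least one candidate that is already near a good solution.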
Reasons for Efficiency
• Decreases the number of iterations and function evaluations to find a near global optimum compared to the original GSA alone
• With a good candidate solution already in the initial population, GSA can search for near-global optima in a promising region of the search space and therefore find a higher-quality solution than the original GSA alone
Proposed System
• In addition to GSA-KM, we intend to implement a Genetic Algorithm to further increase the efficiency and speed of the clustering
• The proposed system is expected to combine the advantages of these methods and to be faster and more efficient than both the traditional clustering algorithms and GSA-KM
Implementation Details
• Programming language: C#
• Database: MS Access
• The given repository is clustered using K-Means and GSA, together called GSA-KM, and a Genetic Algorithm is used to enhance the performance
• The performance is measured and compared with that of other clustering algorithms
Thank You !!!