a novel clustering algorithm based on weighted support and its application
Post on 03-Jan-2016
A novel clustering algorithm based on weighted support and its application
Authors: Xiang-Rong Yang, Jun-Yi Shen, Qiang Liu
Graduate: Chien-Ming Hsiao
Outline
Motivation
Objective
Introduction
Description of some Terms
Algorithm and Analysis
Experimental results
Conclusions
Personal opinion
Motivation
Many efficient clustering algorithms have been proposed but most of these works focus on numerical data.
Objective
To present WeiSC, a novel and efficient algorithm for clustering categorical data.
Introduction
Clustering is an important KDD problem.
Objective: to group data into sets such that
Intra-cluster similarity is maximized
Inter-cluster similarity is minimized
Most of these works focus on numerical data whose inherent geometric properties can be exploited naturally to define distance functions between data points.
The basic idea of WeiSC
It reads tuples from the dataset one by one.
When the first tuple arrives, it forms a cluster on its own.
Each subsequent tuple is either put into an existing cluster or, if it is rejected by all existing clusters, forms a new cluster. The decision is made by a given similarity function defined between a tuple and a cluster.
It makes only one scan over the dataset.
Description of some Terms
Let $D$ be a set of tuples over the attributes $A_1, A_2, \ldots, A_m$, where each $A_i$ ($1 \le i \le m$) is a categorical attribute with domain $D_i$.
Let $TID$ be the set of unique IDs of every tuple.
For each $tid \in TID$, the value for attribute $A_i$ of the corresponding tuple is represented as $val(tid, A_i)$.
DEFINITION 1
A cluster $Cluster = \{tid \mid tid \in TID\}$ is a subset of $TID$.
DEFINITION 2
Given a cluster $C$, the set of attribute values on $A_i$ with respect to $C$ is defined as: $VAL_i(C) = \{val(tid, A_i) \mid tid \in C\}$.
DEFINITION 3
Let $CONT(A_i) = |D_i|$, i.e. the count of distinct attribute values of $A_i$, and let $SUM\_CONT = \sum_{1 \le i \le m} CONT(A_i)$. The weight of attribute $A_i$ is $WEI(A_i) = CONT(A_i) / SUM\_CONT$.
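The attribute weights of Definition 3 can be illustrated with a short Python sketch. It assumes the weight of an attribute is its distinct-value count divided by the sum of those counts over all attributes, and that the counts are taken from the values observed in a small in-memory table rather than from the declared domains. The function name `attribute_weights` and the sample data are hypothetical.

```python
from fractions import Fraction


def attribute_weights(tuples):
    """Return one weight per attribute: distinct-value count / sum of counts.

    Assumption: distinct values are counted from the data itself.
    """
    m = len(tuples[0])
    # CONT(A_i): number of distinct values observed in column i
    cont = [len({t[i] for t in tuples}) for i in range(m)]
    sum_cont = sum(cont)  # SUM_CONT
    # WEI(A_i) = CONT(A_i) / SUM_CONT, kept exact with Fraction
    return [Fraction(c, sum_cont) for c in cont]


# Hypothetical categorical data: (color, shape, size)
data = [
    ("red", "round", "small"),
    ("red", "flat", "large"),
    ("blue", "round", "medium"),
]
weights = attribute_weights(data)
print(weights)
```

Under this reading the weights always sum to 1, and attributes with more distinct values receive a larger weight.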
DEFINITION 4
Given a cluster $C$, let $a \in D_i$. The weighted support of $a$ in $C$ with respect to $A_i$ is defined as: $wei\_sp(a) = WEI(A_i) \cdot |\{tid \mid tid.A_i = a, tid \in C\}|$.
DEFINITION 5
Given a cluster $C$, the summary for $C$ is defined as: $Summary = \{CID, VS_i \mid 1 \le i \le m\}$, where $VS_i = \{(a, Cont(a), wei\_sp(a)) \mid tid.A_i = a, tid \in C\}$.
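As a rough illustration of Definitions 4 and 5, the sketch below builds a cluster summary in Python. It assumes the weighted support of a value is the attribute weight multiplied by the number of tuples in the cluster that carry that value, and it stores each value set as a dictionary from value to a (count, weighted support) pair. The names `weighted_support` and `build_summary`, the weights, and the sample cluster are all hypothetical.

```python
from collections import Counter


def weighted_support(cluster, i, a, weight_i):
    """wei_sp(a): weight of attribute i times the number of tuples in the
    cluster whose i-th attribute equals a (assumed reading of Definition 4)."""
    count = sum(1 for t in cluster if t[i] == a)
    return weight_i * count


def build_summary(cid, cluster, weights):
    """Return (CID, [VS_1, ..., VS_m]); each VS_i maps a value to
    (count in cluster, weighted support)."""
    vs = []
    for i, w in enumerate(weights):
        counts = Counter(t[i] for t in cluster)
        vs.append({a: (n, w * n) for a, n in counts.items()})
    return cid, vs


cluster = [("red", "round"), ("red", "flat"), ("blue", "round")]
weights = [0.5, 0.5]  # hypothetical attribute weights
cid, vs = build_summary(0, cluster, weights)
print(vs[0]["red"])
```

Keeping only counts per value means the summary can be updated in place when a tuple joins the cluster, without revisiting earlier tuples.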
Algorithm and Analysis
Overview
Initially, the first tuple in the database is read and a cluster is constructed. Then the subsequent tuples are read iteratively.
The similarity between the new tuple and each existing cluster is computed according to
$sim(C, tid) = \frac{1}{|C|} \sum_{i=1}^{m} wei\_sp(a_i)$, where $|C|$ is the count of tuples in cluster $C$ and $a_i$ is the value of the tuple on attribute $A_i$.
The similarity must be above the threshold, denoted as σ, for the tuple to join a cluster.
When computing the similarity, we use the clusters' summaries instead of the clusters themselves, since the needed information is contained in the summaries.
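The one-pass procedure described above might be sketched in Python as follows. This is not the authors' implementation: it assumes the similarity is the sum of weighted supports of the tuple's values divided by the cluster size, that attribute weights are supplied up front, that a similarity equal to σ counts as accepted, and that a tuple accepted by several clusters joins the most similar one. The names `ClusterSummary` and `weisc` are hypothetical.

```python
from collections import Counter


class ClusterSummary:
    """Per-cluster summary: size plus per-attribute value counts."""

    def __init__(self, m):
        self.size = 0
        self.counts = [Counter() for _ in range(m)]
        self.members = []

    def add(self, tid, tup):
        self.size += 1
        self.members.append(tid)
        for i, a in enumerate(tup):
            self.counts[i][a] += 1

    def sim(self, tup, weights):
        # (1/|C|) * sum over attributes of weight * count of matching tuples
        total = sum(w * self.counts[i][a]
                    for i, (a, w) in enumerate(zip(tup, weights)))
        return total / self.size


def weisc(tuples, weights, sigma):
    """Single scan: join the best cluster meeting sigma, else open a new one."""
    clusters = []
    for tid, tup in enumerate(tuples):
        best, best_sim = None, -1.0
        for c in clusters:
            s = c.sim(tup, weights)
            if s >= sigma and s > best_sim:  # assumption: ties at sigma accepted
                best, best_sim = c, s
        if best is None:  # rejected by all existing clusters
            best = ClusterSummary(len(tup))
            clusters.append(best)
        best.add(tid, tup)
    return clusters


data = [("a", "x"), ("a", "x"), ("b", "y"), ("a", "y"), ("b", "y")]
result = [c.members for c in weisc(data, weights=[0.5, 0.5], sigma=0.6)]
print(result)
```

Only the summaries are consulted when placing a tuple, so each tuple is read exactly once and the work per tuple grows with the number of attributes and the number of clusters.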
Computational complexities
The time and space complexities of the WeiSC algorithm depend on
The size of the dataset (|D|)
The number of attributes (m)
The number of clusters (p), f(σ)
The size of each cluster, g(σ)
Time complexity: O(|D| * m * f(σ))
Space complexity: O(|D| + m * f(σ) * g(σ))
Experimental results
The experiments evaluate the performance of WeiSC.
The clustering results are compared with ROCK's on the same dataset.
Quality of clustering results with real-life datasets
Mushroom dataset (real-life), obtained from the UCI Machine Learning Repository
Corresponds to 23 species of gilled mushrooms
Each species is identified as either definitely edible or definitely poisonous
Has 21 attributes and 8124 tuples
The number of edible tuples is 4208
The number of poisonous tuples is 3916
The effect of σ
The parameter σ
Is the only parameter needed in the WeiSC algorithm
Affects both the clustering results and the speed of the algorithm
The percentage of misclassified tuples can be used as a measure of its effect, since each tuple has been labeled “edible” or “poisonous”
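One way to compute such a misclassification percentage is sketched below in Python. The slides do not say how a tuple is judged misclassified; this sketch assumes each cluster is assigned its majority label and every tuple carrying a different label counts as misclassified. The function name `misclassified_percentage` and the toy data are hypothetical.

```python
from collections import Counter


def misclassified_percentage(clusters, labels):
    """clusters: list of lists of tuple ids; labels: class label per tuple id.

    Assumption: a tuple is misclassified when its label differs from the
    majority label of the cluster it was placed in.
    """
    wrong = 0
    total = 0
    for members in clusters:
        if not members:
            continue  # ignore empty clusters
        counts = Counter(labels[tid] for tid in members)
        majority_count = counts.most_common(1)[0][1]
        wrong += len(members) - majority_count
        total += len(members)
    return 100.0 * wrong / total


clusters = [[0, 1, 2], [3, 4]]
labels = ["edible", "edible", "poisonous", "poisonous", "poisonous"]
pct = misclassified_percentage(clusters, labels)
print(pct)
```

Running the clustering for several values of σ and recording this percentage alongside the number of clusters would give a simple picture of how the threshold trades purity against cluster count.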
Conclusions
The WeiSC algorithm is robust and efficient, according to both the analysis and the experiments
Reads the dataset only once
Used in IDS, it is fast and efficient
Personal Opinion
We can compare WeiSC algorithm with our algorithm.