

Unsupervised data pruning for clustering of noisy data

Yi Hong a, Sam Kwong a,*, Yuchou Chang b, Qingsheng Ren c

a Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
b Department of Electrical and Computer Engineering, Brigham Young University, USA
c Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, PR China


Article history: Received 10 April 2007; Accepted 21 March 2008; Available online 8 April 2008

Keywords: Clustering analysis; Clustering ensembles; Data pruning

Knowledge-Based Systems 21 (2008) 612–616
doi:10.1016/j.knosys.2008.03.052

* Corresponding author. E-mail addresses: [email protected] (Y. Hong), [email protected] (S. Kwong), [email protected] (Y. Chang), [email protected] (Q. Ren).

Abstract

Data pruning aims to identify noisy instances of a data set and remove them in order to improve the generalization of a learning algorithm. It has been well studied in supervised classification, where the identification and removal of noisy instances are guided by the available labels of instances. However, to the best of the authors' knowledge, very little work has been done on data pruning for unsupervised clustering. This paper deals with the problem of data pruning for unsupervised clustering under the condition that labels of instances are unknown beforehand. We claim that unsupervised data pruning can benefit the clustering of noisy data. We propose a feasible approach, termed unsupervised Data Pruning using Ensembles of multiple Clusterers (DPEC), to identify noisy instances of a data set. DPEC checks all instances of a data set and identifies noisy instances by using ensembles of multiple clustering results provided by different clusterers on the same data set. We test the performance of DPEC on several real data sets with artificial noise. Experimental results demonstrate that DPEC is often able to improve the accuracy and robustness of the clustering algorithm.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Could a training instance be detrimental to a learning algorithm? Contrary to the common belief that more training instances lead to better generalization, several recent studies have shown that a learning algorithm may be better off if part of the training instances are discarded from the training set [1–4]. This process of pruning the data set for better training is known as data pruning, and sometimes as data cleaning. Data pruning identifies noisy instances of a data set automatically and then removes them from the data set in order to improve the generalization of the learning algorithm. It should be noted that in this paper noisy instances are defined as instances that might degrade the performance of the learning algorithm. Therefore, the classifier trained on the pruned data set can be expected to achieve higher accuracy and better generalization than the one trained on the whole data set.

Data pruning has been well studied in the supervised classification area, where labels of instances are known beforehand [1,2]. If labels of instances are available, the identification of noisy instances can be viewed as a process of checking whether instances are labeled correctly [5,6]. Commonly, each instance is assigned a deviation from the distribution of instances with the same label, and instances with high deviations are considered noisy. However, in many situations, such as data clustering, we do not have any label information about instances. Thus, most commonly used data pruning techniques are not applicable to unsupervised clustering due to the absence of accurate labels. At the same time, the performance of many unsupervised clustering algorithms, such as the K-means clustering algorithm, relies heavily on the quality of the training data. If there is much noise in the training data set, the accuracy of the clustering algorithm may degrade significantly. It is easy to imagine that, as in supervised classification, data pruning may also benefit unsupervised clustering. In addition, unsupervised data pruning techniques can help in selecting a limited number of instances to be labeled for training classifiers, so that much labeling time can be saved. The related work on this point in the supervised classification area is commonly known as Query by Committee [7,8].

However, data pruning for unsupervised clustering is a challenging task due to the absence of instance labels for identifying noisy instances of the data set. We refer to data pruning under the condition that labels of instances are unknown beforehand as unsupervised data pruning. The key problem associated with unsupervised data pruning is how to identify noisy instances of the data set when no prior knowledge about instance labels is available. This paper deals with data pruning within the unsupervised clustering area. We claim that unsupervised data pruning techniques can benefit the clustering of noisy data.


Additionally, we propose a feasible approach, termed unsupervised Data Pruning using Ensembles of multiple Clusterers (DPEC), to identify noisy instances of a data set.

The contributions of this work are twofold. First, we extend data pruning techniques to the unsupervised clustering area and claim that, as in supervised classification, data pruning techniques are also usable for unsupervised clustering; to the best of the authors' knowledge, very little work has been done on data pruning for unsupervised clustering. Second, we propose an unsupervised data pruning method based on ensembles of multiple clusterers, whose effectiveness is confirmed by experimental results on several real data sets. As far as we know, this is the first time that ensembles of multiple clusterers have been applied to identify noisy instances of a data set.

The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 describes unsupervised Data Pruning using Ensembles of multiple Clusterers (DPEC) in detail. Experimental results on several real data sets for testing the performance of DPEC are given in Section 4. Section 5 concludes this paper.


2. Related work

In this section, we briefly review the literature on noisy instance identification and ensembles of multiple clusterers.

2.1. Noisy instance identification

Real-world data sets may contain much noise. Such noise can mislead the learning algorithm and significantly degrade its accuracy and generalization. Therefore, a common step in the classification of noisy data is data pruning [9]. Data pruning has been well studied in the supervised classification area [1–4]. Its key step is how to distinguish noisy instances from noiseless ones. Various algorithms have been proposed, and most of them operate under the condition that labels of instances are known beforehand. For example, John proposed a robust C4.5 algorithm that removes the effect of noisy instances by iteratively pruning the data and training the decision tree on the reduced data [10]. Guyon et al. provided an online data pruning approach that uses an information criterion to identify noisy instances, which are then returned to the user to decide whether they should be removed from the data set [11]. Brodley and Friedl proposed a filter model for noisy instance elimination, where classifiers trained on noisy data are used to filter noisy instances and classifiers trained on the pruned data sets are used for classification [4]. Zhu et al. presented a feasible approach to handling noisy instances in large distributed data sets [12]. Angelova et al. explored a generic approach for identifying noisy instances of a data set by combining the opinions of multiple classifiers, and successfully applied it to classifying contaminated image data; very good results were reported in [1].

2.2. Ensembles of multiple clusterers

Clustering analysis is known to be an ill-posed combinatorial optimization problem. It aims to classify a set of unlabeled instances into groups under the guidance of a predefined clustering criterion. A large number of clustering algorithms exist, each with its own clustering criterion; however, each of them is valid only for some data sets and invalid for others. For example, the K-means clustering algorithm can achieve high clustering accuracy for data sets with compact distributions, but low clustering accuracy for data sets with elongated distributions. Many feasible approaches have been proposed to improve the robustness and stability of clustering algorithms. Among them are ensembles of multiple clusterers, usually known as clustering ensembles. Clustering ensembles combine multiple clustering results of a data set without accessing its original features. By leveraging the consensus across multiple clustering results, clustering ensembles provide a generic knowledge-reuse framework for improving the accuracy of clustering algorithms [13]. Several recent studies have concentrated on clustering ensembles [13–15]. For example, Fred and Jain obtained a number of partitions of a data set by executing the K-means clustering algorithm with random initializations and random numbers of clusters, and then obtained the final consensus partition by combining all clustering results with a voting mechanism [14]. Fern and Brodley employed the random projection method to build clustering ensembles for high-dimensional data sets, with the final clustering result obtained by executing a graph partitioning algorithm [15]. It should be noted that ensembles of multiple clusterers can not only improve the robustness and stability of clustering analysis, but can also be employed for unsupervised feature selection [18] and unsupervised feature ranking [19]. In this paper, ensembles of multiple clusterers are used to discover noisy instances of real data sets.

3. Unsupervised data pruning using ensembles of multiple clusterers

In this section, we describe unsupervised Data Pruning using Ensembles of multiple Clusterers (DPEC) in detail. Let $D = \{d_1, d_2, \ldots, d_n\}$ denote a data set that contains $n$ instances without labels. A clustering solution of the data set $D$ is represented as a label vector $I \in \mathbb{N}^n$, where $I_i$ is the label of the instance $d_i$. Let $C = \{I^{(1)}, I^{(2)}, \ldots, I^{(M)}\}$ be a set of $M$ clustering results of the same data set $D$, where each $I^{(i)} = \{I^{(i)}_1, I^{(i)}_2, \ldots, I^{(i)}_n\}$ is a label vector. DPEC identifies noisy instances of the data set $D$ and removes them from the data set before the clustering algorithm is trained on the pruned data.

The inspiration for DPEC comes from the supervised classification area, where ensembles of multiple classifiers have been used to detect noisy instances under the condition that labels of instances are available beforehand [1]. In this paper, DPEC employs ensembles of multiple clusterers to determine whether an unlabeled instance is noise or not. The ensemble of multiple clusterers, commonly known as clustering ensembles, obtains multiple clustering results of the same data set by executing different clusterers on that data set. There are many feasible approaches to generating diverse clustering results of the same data set. In this paper, we use the random subspaces method, because it is very easy to implement [20]. The random subspaces method obtains a population of clustering results of the same data set with the following steps: half of all features are randomly selected, and a clustering solution is obtained by executing the clustering algorithm on the selected features. These steps are iterated for $M$ rounds, so that $M$ clustering results are obtained.
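The paper gives no pseudocode for this step, so the following is only a minimal sketch, assuming NumPy and scikit-learn's KMeans as the base clusterer; the function name and parameters are illustrative choices of ours, not part of the original method description.

```python
import numpy as np
from sklearn.cluster import KMeans

def random_subspace_ensemble(X, n_clusters, M=100, seed=0):
    """Generate M label vectors by running K-means on random halves of the features.

    X          : (n_instances, n_features) data matrix
    n_clusters : number of groups to form in each run
    M          : ensemble size (fixed to 100 in the paper's experiments)
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    labels = []
    for _ in range(M):
        # randomly select half of all features (random subspaces method [20])
        subset = rng.choice(n_features, size=max(1, n_features // 2), replace=False)
        km = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=int(rng.integers(1 << 31)))
        labels.append(km.fit_predict(X[:, subset]))
    return np.array(labels)  # shape (M, n_instances)
```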

Provided that $\{I^{(1)}, I^{(2)}, \ldots, I^{(M)}\}$ are the $M$ clustering results of the data set $D$ obtained by the random subspaces method, it should be noted that these $M$ clustering results carry no explicit correspondences indicating which instances should be classified into the same group when their labels are compared. For example, the following two clustering results:

$$I^{(1)} = (1, 1, 2, 2, 2, 3, 3), \qquad I^{(2)} = (2, 2, 1, 1, 1, 3, 3)$$


are significantly different if only the labels of instances are considered. However, they are in fact logically identical. To make the representation of all $M$ clustering results consistent, the label vector $I^{(i)}$ is first transformed into a similarity matrix $S^{(i)}_{n \times n}$ as follows:

$$S^{(i)}(j, k) = \begin{cases} 1 & \text{if } I^{(i)}_j = I^{(i)}_k, \\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$

where $n$ is the number of data instances. Likewise, $M$ similarity matrices can be obtained from the $M$ clustering results $\{I^{(1)}, \ldots, I^{(M)}\}$. After all clustering results have been transformed into similarity matrices, DPEC calculates their average similarity matrix $S^{(H)}$ as follows:

$$S^{(H)} = \frac{\sum_{i=1}^{M} S^{(i)}}{M} \qquad (2)$$

The value of $S^{(H)}_{ij}$ reflects the probability that the $i$th instance and the $j$th instance are classified into the same group. If $S^{(H)}_{ij}$ is large, the two instances have a high probability of being classified into the same group; otherwise, they have a high probability of being classified into different groups. A special case is when $S^{(H)}_{ij}$ equals 1.0 or 0.0: if $S^{(H)}_{ij}$ equals 1.0, the $i$th and $j$th instances were classified into the same group by all $M$ clustering results, and if $S^{(H)}_{ij}$ equals 0.0, they were classified into different groups by all $M$ clustering results. It is worth mentioning that several existing clustering ensembles methods also obtain the final clustering result by executing a graph partitioning algorithm on the average similarity matrix $S^{(H)}$ [13–15]. In addition, the average similarity matrix $S^{(H)}$ was used as a criterion for guiding the search for a good feature subset for unsupervised clustering in [18].
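As a concrete rendering of Eqs. (1) and (2) (our own sketch, not code from the paper), the co-association matrices and their average can be computed as follows; the final assertion also shows in what sense the two example label vectors above are logically identical.

```python
import numpy as np

def similarity_matrix(labels):
    """Eq. (1): S(j, k) = 1 if instances j and k share a label, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

def average_similarity(label_vectors):
    """Eq. (2): S^(H), the element-wise mean of the M co-association matrices."""
    return np.mean([similarity_matrix(lab) for lab in label_vectors], axis=0)

# The two example clusterings induce identical co-association matrices,
# which is the sense in which they are "logically identical".
I1 = [1, 1, 2, 2, 2, 3, 3]
I2 = [2, 2, 1, 1, 1, 3, 3]
assert np.array_equal(similarity_matrix(I1), similarity_matrix(I2))
```

Entries of $S^{(H)}$ near 1.0 or 0.0 mean the clusterers agree about a pair of instances, while entries near 0.5 signal the disagreement that the uncertainty measure below exploits.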

Fig. 1. Clustering accuracies under different pruning rates. (1) Heart data set, (2) Thyroid data set, (3) Vehicle data set, (4) Segmentation data set.

In this paper, the average similarity matrix is employed to detect noisy instances of a data set. The main idea of the proposed noisy instance detection algorithm is inspired by the observation that different clusterers may provide different clustering results for noisy instances, while for noiseless instances their clustering results are much more consistent. Therefore, we use the concept of clustering uncertainty, first proposed in [21], to judge whether an instance is noise or not. The clustering uncertainty $U(d_i)$ of the instance $d_i$ can be calculated from the average similarity matrix $S^{(H)}$ as follows:

$$U(d_i) = -\frac{\sum_{j=1,\, j \neq i}^{n} \left[ S^{(H)}_{ij} \log S^{(H)}_{ij} + \left(1 - S^{(H)}_{ij}\right) \log\left(1 - S^{(H)}_{ij}\right) \right]}{n - 1} \qquad (3)$$

It should be noted that noisy instances usually have high values of the clustering uncertainty; more details about the clustering uncertainty can be found in [21]. After the uncertainties of all instances have been calculated, the instances are ranked according to their uncertainties, and a certain number of instances with the largest clustering uncertainty are removed from the data set before clustering. It is also worth mentioning that the number of instances to be pruned is closely related to the original data set, and different data sets may have different appropriate pruning rates. In this paper, we assume that the number of instances to be pruned is specified directly by the user.
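A minimal sketch of the pruning step, assuming the reconstruction of Eq. (3) above and NumPy; the function names are ours, and the clipping constant is added only to avoid log(0).

```python
import numpy as np

def clustering_uncertainty(S):
    """Eq. (3): average binary entropy of the co-association probabilities of each instance."""
    n = S.shape[0]
    P = np.clip(S, 1e-12, 1.0 - 1e-12)                   # avoid log(0)
    H = -(P * np.log(P) + (1.0 - P) * np.log(1.0 - P))   # entropy of each S^(H)_ij
    np.fill_diagonal(H, 0.0)                             # exclude the j = i term
    return H.sum(axis=1) / (n - 1)

def prune_noisiest(X, uncertainty, pruning_rate=0.2):
    """Drop the user-specified fraction of instances with the largest uncertainty."""
    n_keep = int(round(len(uncertainty) * (1.0 - pruning_rate)))
    keep = np.argsort(uncertainty)[:n_keep]              # lowest-uncertainty instances
    return X[keep], keep
```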


Fig. 2. Clustering accuracies under different noise levels. (1) Heart data set, (2) Thyroid data set, (3) Vehicle data set, (4) Segmentation data set.

4. Experimental results and analysis

Four real data sets are selected from the UCI machine learning repository to test the performance of DPEC [16]. These data sets are the Heart data set (270 instances, 13 features, 2 groups), the Thyroid data set (214 instances, 5 features, 3 groups), the Vehicle data set (846 instances, 18 features, 4 groups) and the Segmentation data set (2310 instances, 19 features, 7 groups). In our experiments, we use the Rand Index [17] to calculate the accuracy of a clustering solution $I$ as follows:

$$\|I, I^{(accurate)}\| = \frac{2 \, (n_{00} + n_{11})}{n \, (n - 1)}$$

where $n_{11}$ is the number of pairs of instances that are in the same group in both $I$ and $I^{(accurate)}$, $n_{00}$ is the number of pairs of instances that are in different groups in both $I$ and $I^{(accurate)}$, and $I^{(accurate)}$ is the true clustering solution, which is known for the UCI data sets.
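For completeness, a minimal sketch of this Rand Index computation (our own code; the paper does not provide an implementation):

```python
import numpy as np

def rand_index(labels, true_labels):
    """Rand Index between a clustering and the reference partition.

    Counts pairs grouped together in both partitions (n11) and pairs separated
    in both partitions (n00), out of all n(n-1)/2 unordered pairs.
    """
    labels = np.asarray(labels)
    true_labels = np.asarray(true_labels)
    n = len(labels)
    same_pred = labels[:, None] == labels[None, :]
    same_true = true_labels[:, None] == true_labels[None, :]
    iu = np.triu_indices(n, k=1)                 # each unordered pair counted once
    n11 = np.sum(same_pred[iu] & same_true[iu])
    n00 = np.sum(~same_pred[iu] & ~same_true[iu])
    return 2.0 * (n11 + n00) / (n * (n - 1))
```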

The K-means clustering algorithm is adopted as the base clustering algorithm. A population of clustering results of the same data set is obtained using the random subspaces approach [20], and the size of the ensemble committee is fixed to 100 in our experiments. To analyze the usefulness of DPEC for the unsupervised clustering of noisy and noiseless data, we divide all instances of a data set into two groups: the instances pruned by DPEC are considered noisy instances, while the remaining ones are considered noiseless instances. The following four algorithms are compared:

(1) DPEC on remaining data: the whole data set is first pruned by DPEC; the K-means clustering algorithm is then used to classify the remaining data into a certain number of groups, and its clustering accuracy on the remaining data is reported.
(2) DPEC on whole data: the data set is first pruned by DPEC; the K-means clustering algorithm is then used to classify the whole data into a certain number of groups, and its clustering accuracy on the whole data set is reported.
(3) K-means on remaining data: the whole data set is classified using the K-means clustering algorithm, and the clustering accuracy on the noiseless instances only is reported.
(4) K-means on whole data: the whole data set is classified using the K-means clustering algorithm, and its clustering accuracy on the whole data set is reported.

We test the clustering accuracies of the four compared algorithms on all data sets under different pruning rates. Fig. 1 shows the results. Three interesting phenomena can be observed from Fig. 1. First, clustering accuracies on the remaining data are much higher than clustering accuracies on the whole data for all compared algorithms. This tells us that noise can significantly degrade the performance of the clustering algorithm, and that pruning noisy instances before clustering can improve its accuracy. Second, DPEC on the remaining instances achieves better solutions than those obtained by K-means on the remaining instances. This tells us that noisy instances mislead the learning algorithm and can degrade its performance on the noiseless instances; therefore, pruning noisy instances can increase the accuracy of the learning algorithm on the noiseless instances. Third, DPEC not only improves the accuracy of the learning algorithm on the noiseless instances, but also increases its accuracy on the whole data. This tells us that pruning noisy instances from the data set before clustering can improve the generalization of the learning algorithm and increase its accuracy on the whole data.

In addition, we test the performance of DPEC on all data sets with artificial noise. To add noise to the four UCI data sets, we perturb the values of the features of instances and keep the labels of instances unchanged.¹ To realize this, the standard deviation $\sigma$ of a feature is calculated, and a number randomly generated from $N(0, \sigma)$ is added to the value of the feature. The noise level is defined as the probability that the value of a feature is perturbed. The pruning rate of DPEC in our experiments is fixed to 0.2. The clustering accuracies of all four compared algorithms under different noise levels are calculated, and the experimental results are given in Fig. 2. From Fig. 2, we can observe that the clustering accuracies of the K-means clustering algorithm with DPEC are much higher than those obtained by the traditional K-means clustering algorithm. This tells us that noise can degrade the accuracy of the learning algorithm and that DPEC can improve the robustness of the clustering algorithm for data sets with noise.
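The perturbation procedure is described in words only; one plausible NumPy rendering, with the entry-wise perturbation probability taken to be the noise level (our assumption about the exact granularity), is:

```python
import numpy as np

def add_feature_noise(X, noise_level, seed=0):
    """Perturb feature values with Gaussian noise, leaving the class labels untouched.

    Each entry X[i, j] is replaced by X[i, j] + N(0, sigma_j) with probability
    `noise_level`, where sigma_j is the standard deviation of feature j.
    """
    rng = np.random.default_rng(seed)
    sigma = X.std(axis=0)                          # per-feature standard deviation
    mask = rng.random(X.shape) < noise_level       # which entries get perturbed
    noise = rng.normal(0.0, sigma, size=X.shape)   # N(0, sigma_j) for each column
    return X + mask * noise
```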

5. Conclusions

In this paper, we extended data pruning techniques to the unsupervised clustering area, where labels of instances are unknown beforehand. We proposed a feasible unsupervised data pruning approach termed unsupervised Data Pruning using Ensembles of multiple Clusterers (DPEC). DPEC employs ensembles of multiple clusterers to determine whether an instance is noise or not. Experimental results on several real data sets with artificial noise have demonstrated that DPEC is able to identify noisy instances of a data set and can improve the accuracy and robustness of the clustering algorithm. It will be of great interest in our future work to study the effect of different ensemble approaches on the performance of DPEC.

Acknowledgements

This paper was supported by City University Strategic Grant 7002294. The authors thank the reviewers for their comments and suggestions.

¹ These labels are obtained directly from the UCI Machine Learning Repository.

References

[1] A. Angelova, Y. Abu-Mostafa, P. Perona, Pruning training sets for learning of object categories, in: IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 494–501.
[2] J. Kubica, A. Moore, Probabilistic noise identification and data cleaning, Technical Report CMU-RI-TR-02-26, CMU, 2002.
[3] X. Zhu, X. Wu, Y. Yang, Error detection and impact-sensitive instance ranking in noisy data, in: National Conference on Artificial Intelligence, 2004, pp. 378–384.
[4] C.E. Brodley, M.A. Friedl, Identifying mislabeled training instances, Journal of Artificial Intelligence Research 11 (1999) 131–167.
[5] C.E. Brodley, M.A. Friedl, Identifying and eliminating mislabeled training instances, in: National Conference on Artificial Intelligence, 1996, pp. 799–805.
[6] D. Gamberger, N. Lavrac, G. Groselj, Experiments with noisy filtering in a medical domain, in: International Conference on Machine Learning, 1999, pp. 143–151.
[7] I. Dagan, S.P. Engelson, Committee-based sampling for training probabilistic classifiers, in: International Conference on Machine Learning, 1995, pp. 150–157.
[8] Y. Freund, H.S. Seung, E. Shamir, N. Tishby, Selective sampling using the query by committee algorithm, Machine Learning 28 (1997) 133–168.
[9] A. Arning, R. Agrawal, P. Raghavan, A linear method for deviation detection in large databases, Knowledge Discovery and Data Mining (1996) 164–169.
[10] G.H. John, Robust decision trees: removing outliers from databases, Knowledge Discovery and Data Mining (1995) 174–179.
[11] I. Guyon, N. Matic, V. Vapnik, Discovering informative patterns and data cleaning, Knowledge Discovery and Data Mining (1996) 181–203.
[12] X. Zhu, X. Wu, S. Chen, Eliminating class noise in large datasets, in: International Conference on Machine Learning, 2003, pp. 920–927.
[13] A. Strehl, J. Ghosh, Clustering ensembles – a knowledge reuse framework for combining multiple partitions, The Journal of Machine Learning Research 3 (2002) 583–617.
[14] A. Fred, A.K. Jain, Combining multiple clusterings using evidence accumulation, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 835–850.
[15] X.Z. Fern, C.E. Brodley, Clustering ensembles for high dimensional data clustering, in: International Conference on Machine Learning, 2003, pp. 186–193.
[16] C. Blake, C. Merz, UCI repository of machine learning databases. Available from: <http://www.ics.uci.edu/mlearn/MLRepository.html>.
[17] W.M. Rand, Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association 66 (1971) 846–850.
[18] Y. Hong, S. Kwong, Y. Chang, Q. Ren, Clustering ensembles guided unsupervised feature selection using population based incremental learning algorithm, Pattern Recognition, in press.
[19] Y. Hong, S. Kwong, Y. Chang, Q. Ren, Consensus unsupervised feature ranking from multiple views, Pattern Recognition Letters 29 (5) (2008) 595–602.
[20] T.K. Ho, The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (8) (1998) 832–844.
[21] Y. Hong, S. Kwong, Learning assignment order of instances for constrained clustering algorithm, IEEE Transactions on Systems, Man and Cybernetics, Part B, submitted for publication.