a semi-supervised learning framework to cluster mixed data ... · •convert and run: all numerical...

A Semi-supervised Learning Framework to Cluster Mixed Data Types

Artur Abdullin and Olfa Nasraoui [email protected], [email protected]

Knowledge Discovery & Web Mining Lab, Department of Computer Engineering and Computer Science,

University of Louisville, Louisville, KY

4th International Conference on Knowledge Discovery and Information Retrieval, KDIR 2012,

Barcelona, Spain

October 4, 2012

mailto:[email protected]

mailto:[email protected]

Clustering Heterogeneous

Data Sets

Conversion Splitting

Algorithms for specific types of data

Algorithms for mixed data

Other approaches

Proposed Semi-supervised Learning Framework

Combining distance matrix

Artur Abdullin & Olfa Nasraoui - A Semi-supervised Learning Framework to Cluster Mixed Data Types - KDIR'12 2

Baselines for this paper: Conversion and Splitting

(for the case of categorical & numerical domains)

• Convert and run:

all numerical attributes to categorical attributes and run k-modes.

all categorical attributes to numerical attributes and run k-means.

• Split and run:

k-means on the numerical subsets of attributes

k-modes on the categorical subsets of attributes


Specifically Designed Clustering Algorithms

• Numerical data k-means, DBSCAN, etc.

• Categorical data

k-modes, ROCK, CACTUS, etc.

• Transactional or text data

spherical k-means, or any specialized algorithm.

• Graph data

KMETIS, spectral clustering, etc.


Algorithms for Mixed Data Attributes

• k-prototypes [Huang, 1998]: integrates the k-means and the k-modes. – Disadvantage: weight parameter

• INCONCO [Plant and Bohm, 2011] : – extends the Cholesky decomposition to model dependencies in

heterogeneous data – relying on the principle of Minimum Description Length,

integrates numerical and categorical information in clustering. – Disadvantage: assumes a known probability distribution model for

each domain and assumes an identical number of clusters in each domain

• k-mixed-means [Ahmad and Dey, 2007]: based on k-means paradigm that works with mixed numerical and categorical features.

• CAVE [Hsu and Chen, 2007]: based on variance and entropy.


Other Methods

• Multiview Clustering [Zhou and Burges, 2007; Cortes et al., 2009] – train two independent hypotheses with bootstrapping by

providing each other with labels for the unlabeled data.

• Ensemble-based Clustering [He et al., 2005; Al-Shaqsi and Wang, 2010] – Dedicating a different clustering process to each domain and

aggregating the final results within an ensemble.

• Collaborative Clustering [Pedrycz and Rai, 2008; Hammouda and Kamel, 2006] – Consider each site as dedicated to only one pure domain of the

data.


Where is our SSL Framework Located? Independent vs. Integrated approaches tend to tip

scale in either direction SSL framework can embrace continuum between the two!

Independent Fully

integrated

Combining distance matrix

CAVE

INCONCO

Specialized algorithms

Ensemble-based

Multiview

SSL Framework


Semi-supervised Clustering

• Supervision (some labels) in the form of: – Initial seeds [Basu et al., 2002]

– Constraints [Wagstaff et al., 2001]

– Feedback [Cohn et al., 2003]

• Examples of Semi-supervised Learning Algorithms: – Co-training [Blum and Mitchell, 1998]

– Transductive support vector machines [Joachims, 1999]

– Entropy minimization [Guerrero-Curieses and Cid-Sueiro, 2000]

– Semi-supervised EM [Nigam et al., 2000]

– Graph-based approaches [Blum and Chawla, 2001]

– Clustering-based approaches [Zeng et al., 2003]


Proposed: SSL Framework for Mixed Domain

Clustering • We propose a new approach to

– cluster diverse representations or types of data – Also, naturally, can cluster different sources of data

• Our approach is rooted in Semi-Supervised Learning (SSL) – However: no external labels are used! – Instead, knowledge that is discovered from each domain

serves as a guide to supervise the other domain – Note: some data may be missing from one domain or source – In this paper, we explore semi-supervision/guidance based

on seed exchange • We explore other supervision methods later


N Height Weight Eyes Color Education Shopping Cart 1 5 ft 9 in 170.4 lb Blue Master MP3 player, iPad 2 6 ft 5 in 181.7 lb Green High School Watch, car, T-shirt 3 5 ft 7 in 151.4 lb Black Bachelor Milk, Beer, Apples

Mixed-Type Data Set

Transactional Data

Algorithm 1

Result 2 Result 3 Result 1

Categorical Data Numerical Data

Seed-based Semi-supervised Framework

Algorithm 2 Algorithm 3 seeds seeds

10

Initialization

Algorithm 2 (domain 2)


Find the best seeds combination

Generate initial random seeds

Result 2 Result 1

best seeds

seeds Find the best seeds combination


seeds



best seeds

best seeds



seeds

…………

11

Complexity of Proposed SSL Approach

• Determined by complexity of the embedded base algorithms in each domain

+ Overhead complexity resulting

from mutual supervision via inter-domain coordination and seed exchange: O(k2N).

• Example: with k-means and k-modes as the base algorithms, total computational complexity is still O(N)

Algorithm 1

Algorithm 2

Algorithm 1

Algorithm 2

Algorithm 3


Clustering Evaluation

• Internal validity metrics (cluster quality based only on data distribution – compact & separated) – Davies-Bouldin index [Davies and Bouldin, 1979] – Silhouette index [Rousseeuw, 1987] – Dunn index [Dunn, 1974]

• External validity metrics (use external labels in validation step only – pure clusters) – Purity [Manning et al., 2008] – Entropy [Xiong et al., 2006] – Normalized Mutual Information [Strehl et al., 2000]


Real Data Sets (UCI ML Repository)

Data set No. of Records

No. of Numerical Attributes

No. of Categorical Attributes

Missing Values

No. of Classes

Adult 45179 6 8 Yes 2

Heart Disease 303 6 7 Yes 2

Credit Approval

690 6 9 Yes 2

• Number of clusters = 2 • Repeated each experiment 50 times

• (10 for larger adult data set) Artur Abdullin & Olfa Nasraoui - A Semi-supervised Learning Framework to Cluster Mixed Data Types - KDIR'12 14

Experimental Results: Adult Data Set

Algorithm Semi-Supervised Conversion Splitting

Data type Numerical Categorical Numerical Categorical Numerical Categorical

DB index 0.42 1.10 1.48 1.22 3.29 1.11

Silhouette Index

0.71 0.27 0.17 0.25 0.21 0.27

Purity 0.75 0.67 0.75 0.65 0.64 0.55

Entropy 0.72 0.71 0.73 0.70 0.71 0.71

NMI 0.10 0.11 0.13 0.12 0.10 0.11

• Significant improvements with Semi-supervised approach: • in DB and Silhouette indices:

• For both numerical and categorical domains • In almost all validity indices (but for Entropy & NMI, which are slightly

worse - but with low significance): • For categorical domain


Experimental Results: Heart Disease Data Set



DB index 1.54 0.65 0.21 0.98 1.65 0.75

Silhouette Index

0.41 0.31 0.75 0.19 0.36 0.31

Purity 0.76 0.81 0.82 0.83 0.75 0.81

Entropy 0.79 0.70 0.67 0.64 0.80 0.69

NMI 0.20 0.30 0.32 0.36 0.19 0.30

• Conversion algorithm yielded better clustering results for numerical domain

• However, Semi-supervised approach outperforms conversion algorithm in categorical domain for most metrics


Experimental Results: credit card data set



DB index 0.01 0.97 0.10 1.32 0.18 1.37

Silhouette Index

0.97 0.36 0.92 0.21 0.95 0.24

Purity 0.70 0.80 0.81 0.82 0.64 0.82

Entropy 0.84 0.70 0.68 0.67 0.93 0.65

NMI 0.18 0.30 0.31 0.31 0.08 0.36

• Based on internal indices: • SSL approach outperforms traditional algorithms for categorical and

numerical type attributes • Based on external indices:

• SSL concedes to • Conversion method for numerical domain (low significance for categorical) • splitting method for categorical domain only

• No agreement between Internal & External metrics


Different Number of Clusters Per Domain

• Two cases:

– Each data domain may naturally give rise to a different number of clusters.

– Even if same number of clusters in the different domains, their nature may be completely different

• The partition may be completely different!


Example: Different Number of Clusters

• In T1: 3 clusters • In T2: 2 clusters • In T1&T2: 4 clusters


Semi-supervised Merging Algorithm (SS-merging)

domain

domain

Data set

Algorithm 2 clusters

Algorithm 1 clusters

memberships

memberships

Optimal Matching using Hungarian

method with weights inv. prop.

to Jaccard distance:

}0),(|{, jiMxC TijT

create then ,||

|| if

11

2211

,

,,

jT

jTjT

C

CC

1T

2T

1Tk

2Tk

1TM

2TMJ/1W

for every cluster j1 in T1, find correspondent cluster j2 in T2

11 , jTC22 , jTC

a new cluster


Experimental Example: IRIS Data Set • 3 classes of 50 data instances each

• Set overlap threshold = 0.6

• Repeated experiments 10 times

– Reporting the best results

• Artificial split attributes to simulate different domains

– Still a valid simulation because no inter-domain distance was computed

Experiment No

1 {1} {3} 2 3 3

2 {2} {4} 2 3 3

3 {1,3} {2,4} 2 3 3

1T 2T1Tk

2Tk K

21

Different Number of Clusters: Experiment 1

Algorithm Class F-measure Accuracy Purity Entropy NMI

K-means

Setosa 0.9901 0.9933

0.8800 0.2967 0.7065 Versicolour 0.8333 0.8800

Virginica 0.8132 0.8868

SSL Framework

Setosa 1.0 1.0

0.9467 0.1642 0.8366 Versicolour 0.9231 0.9467

Virginica 0.9167 0.9467

• Semi-supervised Merging algorithm outperformed k-means in all indices.


Different Number of Clusters: Experiment 3

Algorithm Class F-measure Accuracy Purity Entropy NMI

K-means

Setosa 1.0 1.0

0.8933 0.2485 0.7582 Versicolour 0.8574 0.8933

Virginica 0.8182 0.8933

SSL Framework

Setosa 0.9899 0.9933

0.9267 0.2265 0.7738 Versicolour 0.8932 0.9267

Virginica 0.8980 0.9333

• Semi-supervised Merging algorithm outperformed k-means • For all but Setosa class: in all validity metrics • For entire data set: in purity, entropy and NMI metrics

• Note: slightly lower F-measure and accuracy – but difference is not significant


Conclusions

• In general, better clustering results in the categorical domain.

• Exchanged (supervising) seeds seem to provide additional helpful knowledge to guide the clustering in the categorical domain

– Possible explanation: Avoiding local minima and obtaining better clustering in the categorical domain.


Ongoing & Future Work

• Ongoing work – since this paper: – Continued work by exploiting constraint-based SSL

• HMRF (Hidden Markov Random Fields)

• Better results [Abdullin & Nasraoui, to be presented in SPIRE 2012]

– Bigger and more challenging Multimedia Web data sets • Visual + Text [Abdullin & Nasraoui, to be presented in LA-WEB 2012]

– Exploring Inter-Domain Compatibility [Abdullin & Nasraoui, to be presented in LA-WEB 2012]

• Future work: Further improve seed-based & constraint-based HMRF

Optimize parameters of the SSL framework

Improve scalability

Real applications (Web domain is fertile with heterogeneous data)


References

• [He et al., 2005] He, Z. and Xu, X. and Deng, S., "Clustering Mixed Numeric and Categorical Data: A Cluster Ensemble Approach", eprint arXiv:cs/0509011 (2005).

• [Al-Shaqsi and Wang, 2010] Al-Shaqsi, J. and Wenjia Wang, "A clustering ensemble method for clustering mixed data", in Neural Networks (IJCNN), The 2010 International Joint Conference on (, 2010), pp. 1 -8.

• [Huang, 1998] Huang, Zhexue, "Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values", Data Mining and Knowledge Discovery 2 (1998), pp. 283-304.

• [Guha et al., 2000] Sudipto Guha and Rajeev Rastogi and Kyuseok Shim, "Rock: A robust clustering algorithm for categorical attributes", Information Systems 25, 5 (2000), pp. 345 - 366.

• [Ganti et al., 1999] Ganti, Venkatesh and Gehrke, Johannes and Ramakrishnan, Raghu, "CACTUS - clustering categorical data using summaries", in Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining (New York, NY, USA: ACM, 1999), pp. 73--83.

• [MacQueen, 1967] MacQueen, J. B., "Some Methods for Classification and Analysis of MultiVariate Observations", in Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability vol. 1, (University of California Press, 1967), pp. 281-297.

• [Ester et al., 1996] Martin Ester and Hans-peter Kriegel and Jorg Sander and Xiaowei Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise", in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96) (AAAI Press, 1996), pp. 226--231.

• [Huang, 1998] Huang, Zhexue, "Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values", Data Mining and Knowledge Discovery 2 (1998), pp. 283-304.

• [Hsu and Chen, 2007] Chung-Chian Hsu and Yu-Cheng Chen, "Mining of mixed data with application to catalog marketing", Expert Systems with Applications 32, 1 (2007), pp. 12 - 23.

26

• [Ahmad and Dey, 2007] Amir Ahmad and Lipika Dey, "A k-mean clustering algorithm for mixed numeric and categorical data", Data and Knowledge Engineering 63, 2 (2007), pp. 503 - 527.

• [Plant and Böhm , 2011] Plant, Claudia and Böhm, Christian, "INCONCO: interpretable clustering of numerical and categorical objects", in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (New York, NY, USA: ACM, 2011), pp. 1127--1135.

• [Sheath and Sokal, 1973] Sheath, P. H. A. and Sokal, Robert C., Numerical taxonomy. The principles and practice of numerical classification (San Francisco : W.H. Freeman, 1973).

• [Kaufman and Rousseeuw , 1990] Kaufman, L. and Rousseeuw, P.J., Finding Groups in Data An Introduction to Cluster Analysis (New York: John Wiley & Sons, Inc, 1990).

• [Muller et al., 2001] Muller, K.-R. and Mika, S. and Ratsch, G. and Tsuda, K. and Scholkopf, B., "An introduction to kernel-based learning algorithms", Neural Networks, IEEE Transactions on 12, 2 (2001), pp. 181 -201.

• [Girolami, 2002] Girolami, M., "Mercer kernel-based clustering in feature space", Neural Networks, IEEE Transactions on 13, 3 (2002), pp. 780 -784.

• [Inokuchi and Miyamoto, 2004] Inokuchi, R. and Miyamoto, S., "LVQ clustering and SOM using a kernel function", in Fuzzy Systems, 2004. Proceedings. 2004 IEEE International Conference on vol. 3, (, 2004), pp. 1497 - 1500 vol.3.

• [MacDonald and Fyfe, 2000] MacDonald, D. and Fyfe, C., "The kernel self-organising map", in Knowledge-Based Intelligent Engineering Systems and Allied Technologies, 2000. Proceedings. Fourth International Conference on vol. 1, (, 2000), pp. 317 -320 vol.1.

• [Hur et al., 2001] Asa Ben-Hur and David Horn and Hava T. Siegelmann and Vladimir Vapnik, "A Support Vector Method for Clustering", in Advances in Neural Information Processing Systems 13 (MIT Press, 2001), pp. 367--373.

• [Hur et at., 2002] Ben-Hur, Asa and Horn, David and Siegelmann, Hava T. and Vapnik, Vladimir, "Support vector clustering", J. Mach. Learn. Res. 2 (2002), pp. 125--137. 27

• [Frank and Asuncion, 2010] A. Frank and A. Asuncion, "UCI Machine Learning Repository", University of California, Irvine, School of Information and Computer Sciences (2010).

• [Zhou and Burges, 2007] D. Zhou and C. Burges. Spectral clustering and transductive learning with multiple views. In ICML, pages 1159-1166, 2007.

• [Cortes et al., 2009] C. Cortes, M. Mohri, and A. Rostamizadeh. Learning non-linear combinations of kernels. In NIPS, pages 396-404, 2009.

• [A. Kumar and H. D. III., 2011] A co-training approach for multi-view spectral clustering. In ICML, pages 393-400, 2011.

• [Cheng and Zhao, 2009] Y. Cheng and R. Zhao. Multiview spectral clustering via ensemble. In GrC, pages 101-106, 2009.

• [de Sa, 2005] V. de Sa. Spectral clustering with two views. In ICML workshop on learning with multiple views, pages 20-27, 2005.

• [Tang et al., 2009] W. Tang, Z. Lu, and I. S. Dhillon. Clustering with multiple graphs. In ICDM, pages 1016-1021, 2009.

• [Pedrycz and Rai, 2008] Witold Pedrycz and Partab Rai, "Collaborative clustering with the use of Fuzzy C-Means and its quantification", Fuzzy Sets and Systems (2008), 2399 - 2427.

• [Hammuoda and Kamel, 2006]Hammuoda, Khaled and Kamel, Mohamed, "Collaborative document clustering" (2006), 453--463.

• [Cohl et al., 2003] Cohn, D., Carauana, R., and McCallum, A. (2003). Semi-supervised clustering with user feedback. Technical report.

• [Abdullin & Nasraoui, SPIRE 2012] A. Abdullin and O. Nasraoui, “Clustering Heterogeneous Data with Mutual Semi-Supervision,” in Proceedings of SPIRE 2012, October 2012.

• [Abdullin & Nasraoui, LA-WEB 2012] A. Abdullin and O. Nasraoui, “Clustering Heterogeneous Data Sets,” in Proceedings of LA-WEB 2012 (Invited keynote), October 2012.

28

Thank you!


Questions ?

a semi-supervised learning framework to cluster mixed data ... · •convert and run: all numerical...

Documents