Co-clustering of Multi-View Datasets: a Parallelizable Approach

Authors: Gilles Bisson and Clément Grimal
Affiliation: Université Joseph Fourier, France
Source: International Conference on Data Mining (ICDM) 2012
Presenter: Allen
Outline
• Introduction
• Multi-View Learning
• The χ-Sim algorithm
• The MVSIM architecture
• Experiments
• Conclusion
Introduction
• Co-clustering has been proposed to capture the intensity of the relations between two types of objects.
• However, datasets involving more than two types of interacting objects are also frequent.
  – For instance, in a social network, in addition to analyzing the relations among users, the relations between documents and users also need to be analyzed.
• A simple way is to split such datasets into several matrices and co-cluster each of them separately.
  – However, the interactions between objects in different matrices are then not taken into account.
Introduction (Cont.)
• The multi-view clustering task, which handles the views together, was proposed to solve this problem.
• χ-Sim is a co-clustering algorithm that builds similarity matrices rather than producing co-clusters directly.
  – It is flexible enough to combine different views together.
  – Prior knowledge can easily be injected into the initial similarity matrices.
  – It is possible to transfer the similarities from one view to the others.
Multi-view learning
• Multi-view learning became highly popular with the seminal work on co-training, which trains two algorithms on two different views.
• Several extensions of classical clustering methods have been proposed to deal with multi-view data:
  – Multi-view K-means (MVKM)
  – Multi-view EM
Multi-view learning
• Multi-view clustering aims at combining multiple results into one.
  – Occurrence
    • Fred et al. produce a meta-similarity matrix based on how many times objects appear in the same cluster.
  – Clustering ensemble selection
    • Li et al. build a weighted consensus clustering method to select the best clustering among the views.
    • Azimi et al. adapt their selection strategy according to the stability of the clustering.
  – Fusion
    • Combine multiple similarity matrices to perform a given learning task.
      – Linked Matrix Factorization, fuzzy clustering
Notations
• Types of objects
  – Let N be the number of types of objects in the dataset (e.g. users, documents, words, etc.).
    • Ti denotes a type of object, i ∈ 1…N.
    • For simplicity, the objects of type Ti have ni instances.
• Relation matrices
  – Let M be the number of relations between types of objects.
  – Rij ∈ ℝ^(ni × nj) is the relation matrix between the objects of types Ti and Tj.
• Similarity matrices
  – The similarity matrix Si ∈ ℝ^(ni × ni) of type Ti is square and symmetric, with values in [0,1].
The χ-Sim algorithm [SDM’10]
• Let R12 be a [documents × words] matrix, and assume that the task is to compute the similarity matrices S1 (documents) and S2 (words).
• The idea of χ-Sim is to capture the duality between documents and words.
• This is achieved by simultaneously calculating document-document similarities based on words, and word-word similarities based on documents.
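This alternating computation can be sketched as follows. The normalization used here is a cosine-style choice made for illustration, not necessarily the paper's exact update rule:

```python
import numpy as np

# Hedged sketch of the duality idea: document-document similarities are
# computed through word-word similarities (and vice versa) by propagating
# one similarity matrix through the relation matrix R12.
def dual_similarity_step(R, S_words):
    """One update of document similarities given word similarities."""
    S_docs = R @ S_words @ R.T          # propagate word similarities
    d = np.sqrt(np.diag(S_docs))
    return S_docs / np.outer(d, d)      # normalize self-similarity to 1

R12 = np.array([[2., 1., 0.],
                [1., 2., 3.],
                [0., 1., 2.]])
S1 = dual_similarity_step(R12, np.eye(3))   # start from identity word similarities
S2 = dual_similarity_step(R12.T, S1)        # word similarities via documents
```

Iterating these two updates a few times lets similarities flow back and forth between the two dimensions of the matrix.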
The χ-Sim algorithm (cont.)

• The similarity matrix S1 between documents is evaluated in two steps.
  – The k parameter plays a role similar to the order p used in the Minkowski distance.

The Minkowski distance of order p between two points x = (x1, …, xm) and y = (y1, …, ym) is defined as:

D_p(x, y) = ( Σ_i |x_i − y_i|^p )^(1/p)
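For reference, the Minkowski distance above can be computed directly:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p between two points x and y."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return np.sum(diff ** p) ** (1.0 / p)

# p = 2 recovers the Euclidean distance, p = 1 the Manhattan distance.
d2 = minkowski([0, 0], [3, 4], 2)  # 5.0
d1 = minkowski([0, 0], [3, 4], 1)  # 7.0
```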
The χ-Sim algorithm (cont.)

• Parameter p: the percentage of the smallest similarities to be pruned.
• If k=1, It=1 and p=0, χ-Sim is equivalent to the cosine similarity.
R12    word1  word2  word3
doc1     2      1      0
doc2     1      2      3
doc3     0      1      2

S1     doc1   doc2   doc3
doc1     5      4      1
doc2     4     14      8
doc3     1      8      5
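The S1 table can be checked numerically: with k=1, It=1, p=0 and the word similarities initialized to the identity, the unnormalized document similarities reduce to R12 times its transpose.

```python
import numpy as np

# Reproduce the slide's toy example: S1 = R12 @ R12.T
R12 = np.array([[2, 1, 0],
                [1, 2, 3],
                [0, 1, 2]])
S1 = R12 @ R12.T
print(S1)
# [[ 5  4  1]
#  [ 4 14  8]
#  [ 1  8  5]]
```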
The MVSIM architecture

• This architecture deals with datasets having multiple relation matrices (or views).
• The goal:
  – Compute a co-similarity matrix Si for each type of object Ti appearing in different views.
• The idea:
  – Create a learning network isomorphic to the relational structure of the dataset.

Input: Si, Sj, Rij, for i, j ∈ 1…N
Output: Si^(i,j), Sj^(i,j) for each relation matrix Rij, i, j ∈ 1…N
Aggregation functions: one per type of object Ti
Aggregation Function
• The aggregation functions (one per type of object Ti) have two important roles:
  – Aggregate the multiple similarity matrices produced by χ-Sim.
    • F(Si^(i,1), Si^(i,2), …): a merging function combining the matrices.
  – Ensure the convergence of the global iteration.
    • A damping factor in [0,1] is used to balance the aggregation function.
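A hypothetical sketch of such an aggregation function, assuming an element-wise mean as the merging function F and damping toward the previous estimate (the names `aggregate` and `lam` are illustrative, not from the paper):

```python
import numpy as np

def aggregate(S_prev, S_views, lam=0.5):
    """Merge the view-specific similarity matrices of one object type
    (element-wise mean), then damp toward the previous estimate with a
    factor lam in [0, 1] to stabilize the global iteration."""
    merged = np.mean(S_views, axis=0)         # F(S_i^(i,1), S_i^(i,2), ...)
    return lam * S_prev + (1 - lam) * merged  # damped update

S_prev = np.eye(2)
S_a = np.array([[1.0, 0.4], [0.4, 1.0]])   # similarities from view a
S_b = np.array([[1.0, 0.8], [0.8, 1.0]])   # similarities from view b
S_new = aggregate(S_prev, [S_a, S_b], lam=0.5)
```

With lam close to 1 the matrix changes slowly between global iterations, which is what makes the damping useful for convergence.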
The MVSIM algorithm
• IG: the number of global iterations of MVSIM.
• For simplicity, k, p and It are set to the same values for all χ-Sim calls.
Complexity and Parallelization
• Complexity
  – MVSIM's complexity is driven by χ-Sim.
    • Time complexity: O(nm² + n²m)
• Parallelization
  – A relation matrix R12 of size n × m (n: # documents; m: # words) can be split into h smaller matrices.
    • If m is huge, R12 can be divided into h smaller matrices R' of size n × (m/h).
    • Using a distributed version on h cores:
      – Time complexity decreases to O((1/h²)nm² + (1/h)n²m).
      – Memory storage per core decreases to 1/h of the original.
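The splitting idea can be illustrated in the simplest case (k=1, identity word similarities), where S1 reduces to R @ R.T: splitting R's columns into h blocks lets each core compute a partial n × n similarity, and the partial results simply sum up. The full MVSIM combination with non-trivial word similarities is more involved; this only shows the basic decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, h = 4, 8, 2                      # documents, words, number of splits
R = rng.random((n, m))

blocks = np.array_split(R, h, axis=1)  # h matrices of shape n x (m/h)
S_parallel = sum(B @ B.T for B in blocks)  # each term computable on its own core
S_direct = R @ R.T

print(np.allclose(S_parallel, S_direct))  # True
```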
Evaluation of multi-view approaches
• Evaluate the correlation between the learned clusters and the known classes via the confusion matrix.
  – Measurement: micro-averaged precision
• Datasets (ground truth: document class)
  – IMDb
  – CiteSeer
  – The 4 Universities dataset: Cornell, Texas, Washington and Wisconsin
  – Reuters RCV1/RCV2
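One common way to compute the micro-averaged precision from a clusters × classes confusion matrix is to label each cluster with its majority class and divide the correctly assigned objects by the total. This is a plausible reading of the measure, not necessarily the paper's exact code:

```python
import numpy as np

def micro_avg_precision(confusion):
    """Micro-averaged precision: assign each cluster (row) to its majority
    class (column), then count the fraction of correctly placed objects."""
    confusion = np.asarray(confusion)
    return confusion.max(axis=1).sum() / confusion.sum()

conf = [[40, 5],    # cluster 0: mostly class A
        [10, 45]]   # cluster 1: mostly class B
print(micro_avg_precision(conf))  # 0.85
```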
Benchmarks & Results
• Single view: Cosine, LSA, SNOS, CTK, χ-Sim, ITCC
• Multi view: MVSC, naïve MVSIM (IG=1), MVSIM (IG=6, damping factor=0.5, k=0.8, p=0.4)
• The clusters have been generated by an agglomerative hierarchical clustering method.
  – The clustering tree is cut at the level corresponding to the number of classes.
Evaluation of Splitting Approach
• Dataset
  – NG20: the 20 Newsgroups corpus (20,000 documents)
  – Ground truth: 10 categories
• How is the quality of the clustering influenced when the number of splits increases, with the total number of features kept constant?
Observation
• We tested MVSIM with 1 split containing 4,000 words, then 2 random splits of 2,000 words, and so on, up to 16 random splits of 250 words.
• The quality of the clustering tends to decrease as the number of splits increases.
• Although the performance is 2–3% lower, the computation time is reduced by a factor of roughly splits².
Evaluation of Splitting Approach
• Is it possible to improve the clustering by adding more features through separate matrices?
  – We evaluate this by assuming that the total number of words is not fixed.
• Adding more words improves the quality of the clustering.
Conclusion
• The MVSIM architecture deals with the problem of learning co-similarities from a collection of matrices describing interrelated types of objects.
• It provides interesting properties in terms of convergence and scalability, and allows a straightforward parallelization of the process.
• The experiments demonstrate that this method outperforms both single-view and multi-view approaches.