cse601 clustering ensemble - university at buffalo · 2012-10-17 · outline • basics –...
TRANSCRIPT
![Page 1: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/1.jpg)
Clustering Lecture 7: Clustering Ensemble
Jing Gao SUNY Buffalo
1
![Page 2: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/2.jpg)
Outline
• Basics – Motivation, definition, evaluation
• Methods – Partitional – Hierarchical – Density-based – Mixture model – Spectral methods
• Advanced topics – Clustering ensemble – Clustering in MapReduce – Semi-supervised clustering, subspace clustering, co-clustering,
etc. 2
![Page 3: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/3.jpg)
Clustering Ensemble
• Problem – Given an unlabeled data set D={x1,x2,…,xn} – An ensemble approach computes:
• A set of clustering solutions {C1,C2,…,Ck}, each of which maps data to a cluster: fj(x)=m
• A unified clustering solutions f* which combines base clustering solutions by their consensus
• Challenges – The correspondence between the clusters in different
clustering solutions is unknown
3
![Page 4: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/4.jpg)
Motivations • Goal
– Combine “weak” clusterings to a better one
[PTJ05] 4
![Page 5: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/5.jpg)
An Example
base clustering models
objects
The goal: get the consensus clustering they may not represent the same cluster!
[GMT07] 5
![Page 6: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/6.jpg)
Methods (1) • How to get base models?
– Bootstrap samples – Different subsets of features – Different clustering algorithms – Random number of clusters – Random initialization for K-means – Incorporating random noises into cluster labels – Varying the order of data in on-line methods
6
![Page 7: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/7.jpg)
Methods (2) • How to combine the models?
Correspondence
Implicit Explicit
Consensus Function Representation
Object-based Cluster-based Object-Cluster -based
Optimization Method
Generative Approaches
7
![Page 8: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/8.jpg)
8
Hard Correspondence (1) • Re-labeling+voting
– Find the correspondence between the labels in the partitions and fuse the clusters with the same labels by voting [DuFr03,DWH01]
C1 C2 C3 v1 1 3 2 v2 1 3 2 v3 2 1 2 v4 2 1 3 v5 3 2 1 v6 3 2 1
C1 C2 C3 v1 1 1 1 v2 1 1 1 v3 2 2 1 v4 2 2 2 v5 3 3 3 v6 3 3 3
Re-labeling C* 1 1 2 2 3 3
Voting
![Page 9: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/9.jpg)
Hard Correspondence (2)
• Details – Minimize match costs – Match to a reference clustering or match in a
pairwise manner
• Problems – In most cases, clusters do not have one-to-one
correspondence
9
![Page 10: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/10.jpg)
10
Soft Correspondence* (1) • Notations
– Membership matrix M1, M2, … ,Mk – Membership matrix of consensus clustering M – Correspondence matrix S1, S2, … ,Sk
– Mi Si =M
C1 C2 C3 v1 1 3 2 v2 1 3 2 v3 2 1 2 v4 2 1 3 v5 3 2 1 v6 3 2 1
úúúúúúúú
û
ù
êêêêêêêê
ë
é
010010001001100100
úúúúúúúú
û
ù
êêêêêêêê
ë
é
100100010010001001
úúú
û
ù
êêê
ë
é
001100010
= X
M2 S2 M
*[LZY05]
![Page 11: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/11.jpg)
11
Soft Correspondence (2)
• Consensus function – Minimize disagreement – Constraint 1: column-sparseness – Constraint 2: each row sums up to 1 – Variables: M, S1, S2, … ,Sk
• Optimization – EM-based approach – Iterate until convergence
• Update S using gradient descent • Update M as
å =-
k
j jjSMM1
2||||min
å ==
k
j jjSMk
M1
1
![Page 12: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/12.jpg)
• How to combine the models?
Correspondence
Implicit Explicit
Consensus Function Representation
Object-based Cluster-based Object-Cluster -based
Optimization Method
Generative Approaches
12
![Page 13: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/13.jpg)
13
Object-based Methods (1) • Clustering objects
– Define a similarity or distance measure: • Similarity between two objects can be defined as the
percentage of clusterings that assign the two objects into same clusters
• Distance between two objects can be defined as the percentage of clusterings that assign the two objects into different clusters
– Conduct clustering on the new similarity (distance) matrix
– Result clustering represents the consensus – Can view this approach as clustering in the new feature
space where clustering results are the categorical features
![Page 14: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/14.jpg)
Object-based Methods (2)
v2
v4
v3
v1
v5
v6
14
![Page 15: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/15.jpg)
Co-association matrix T
[StGh03] 15
![Page 16: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/16.jpg)
16
Consensus Function
• Minimizing disagreement – Information-theoretic [StGh03]
– Median partition [LDJ07]
– Correlation clustering [GMT07]
å =
k
j jTTNMIk 1
),(1max)()(
),(),(
j
jj THTH
TTITTNMI =
åå¹=
-+)()(
),()()(
),( )1(maxvCuC
vu uvvCuCvu uv TT
2min TT -å =
=k
j jTk
T1
1
![Page 17: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/17.jpg)
17
Optimization Method • Approximation
– Agglomerative clustering (bottom-up) [FrJa02,GMT07] • Single link, average link, complete link
– Divisive clustering (top-down) [GMT07] • Furthest
– LocalSearch [GMT07] • Place an object into a different cluster if objective function improved • Iterate the above until no improvements can be made
– BestClustering [GMT07] • Select the clustering that maximize (minimize) the objective function
– Graph partitioning [StGh03] – Nonnegative matrix factorization [LDJ07,LiDi08]
![Page 18: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/18.jpg)
[GMT07] 18
![Page 19: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/19.jpg)
19
28500
29000
29500
30000
30500
31000
31500
32000
32500
BestClustering Agglomerat ive Furthest Balls LocalSearch Rock Limbo
Overall Distance on Votes data set
[GMT07]
![Page 20: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/20.jpg)
• How to combine the models?
Correspondence
Implicit Explicit
Consensus Function Representation
Object-based Cluster-based Object-Cluster -based
Optimization Method
Generative Approaches
20
![Page 21: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/21.jpg)
21
Cluster-based Methods
• Clustering clusters – Regard each cluster from a base model as a record – Similarity is defined as the percentage of shared
common objects • eg. Jaccard measure
– Conduct clustering on these clusters – Assign an object to its most associated consensus
cluster
![Page 22: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/22.jpg)
22
Meta-Clustering Algorithm (MCLA)*
g2 g5
g3 g1
g7
g9
g4 g6
g8
g10
M1 M2
M3
M1 M2 M3
3 0 0 1 2 0
2 1 0 0 3 0 0 0 3
0 0 3
*[StGh03]
![Page 23: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/23.jpg)
• How to combine the models?
Correspondence
Implicit Explicit
Consensus Function Representation
Object-based Cluster-based Object-Cluster -based
Optimization Method
Generative Approaches
23
![Page 24: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/24.jpg)
24
HyperGraph-Partitioning Algorithm (HGPA)*
• Hypergraph representation and clustering – Each node denotes an object – A hyperedge is a generalization of an edge in that it can
connect any number of nodes – For objects that are put into the same cluster by a
clustering algorithm, draw a hyperedge connecting them – Partition the hypergraph by minimizing the number of
cut hyperedges – Each component forms a consensus cluster
*[StGh03]
![Page 25: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/25.jpg)
25
HyperGraph-Partitioning Algorithm (HGPA)
v2
v4
v3
v1
v5
v6
Hypergraph representation– a circle denotes a hyperedge
![Page 26: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/26.jpg)
Bipartite Graph Partitioning*
• Hybrid Bipartite Graph Formulation – Summarize base model output in a bipartite graph – Lossless summarization—base model output can be
reconstructed from the bipartite graph – Use spectral clustering algorithm to partition the
bipartite graph – Time complexity O(nkr)—due to the special structure
of the bipartite graph – Each component represents a consensus cluster
*[FeBr04] 26
![Page 27: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/27.jpg)
Bipartite Graph Partitioning
v2 v4 v3 v1 v5 v6
c2 c5 c3 c1
c7 c8
c4 c6
c9 c10
objects
clusters
clusters
27
![Page 28: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/28.jpg)
• How to combine the models?
Correspondence
Implicit Explicit
Consensus Function Representation
Object-based Cluster-based Object-Cluster -based
Optimization Method
Generative Approaches
28
![Page 29: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/29.jpg)
29
A Mixture Model of Consensus*
• Probability-based – Assume output comes from a mixture of models – Use EM algorithm to learn the model
• Generative model – The clustering solutions for each object are represented as
nominal features--vi
– vi is described by a mixture of k components, each component follows a multinomial distribution
– Each component is characterized by distribution parameters jq
*[PTJ05]
![Page 30: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/30.jpg)
30
EM Method
• Maximize log likelihood
• Hidden variables – zi denotes which consensus cluster the object belongs to
• EM procedure – E-step: compute expectation of zi – M-step: update model parameters to maximize likelihood
( )å å= =
n
i jk
j ij vP1 1
)|(log qa
![Page 31: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/31.jpg)
31
![Page 32: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/32.jpg)
32
![Page 33: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/33.jpg)
33 Base Model Accuracy
Ensemble
Accuracy
Number of Base Models k
1 4 7 10 15 20 50
[TLJ+04]
![Page 34: CSE601 Clustering Ensemble - University at Buffalo · 2012-10-17 · Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based](https://reader034.vdocument.in/reader034/viewer/2022042113/5e8f728c0fb8542a1a4da6ec/html5/thumbnails/34.jpg)
Take-away Message
• Clustering ensemble • Different approaches to combine multiple clustering
solutions to one solution
34