TRANSCRIPT
Vladyslav Kolbasin
Stable Clustering
Clustering data
Clustering is part of exploratory process
Standard definition:
Clustering - grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups
There is no “true” solution for clustering
We don't have any “true Y values”
Usually we want to do some data exploration or simplification or even find some data taxonomy
Usually we don't have a precise mathematical definition of clustering
Usually we iterate through different methods that have different mathematical target functions, then use the best one
2
Usual clustering methods
Methods:
K-means
Hierarchical clustering
Spectral clustering
DBScan
BIRCH
…
Issues:
Need to estimate the number of clusters
Non-determinism
Instability
3
Are standard methods stable? K-means
4
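The slide's figure is not reproduced in this transcript. A small experiment (my own illustration, not from the talk; the dataset and parameters are assumptions) shows the kind of instability meant here: running k-means on the same data with different seeds and a single initialization per run can give different partitions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# four overlapping Gaussian blobs; the overlap makes k-means runs diverge
X = np.vstack([rng.normal(loc=c, scale=1.5, size=(100, 2))
               for c in ([0, 0], [4, 0], [0, 4], [4, 4])])

# one k-means run per seed, single random initialization each (n_init=1)
runs = [KMeans(n_clusters=4, n_init=1, random_state=s).fit_predict(X)
        for s in range(5)]

# Adjusted Rand Index == 1.0 means two runs produced identical partitions;
# anything below 1.0 is run-to-run variation
scores = [adjusted_rand_score(runs[0], r) for r in runs[1:]]
```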
Are standard methods stable? Hierarchical clustering
5
Audience data
A lot of attributes: 9000-30000 or more
All attributes are binary
There are several data providers
There are no particularly important attributes
6
Stability importance
Data comes from different providers and it is very noisy
Results are not expected to change from run to run
Usually audience doesn't change a lot in short period
Many algorithms “explode” when we increase the data size:
Non-linear complexity of clustering
The best cluster count moves to higher values for bigger data
7
Let's add an additional requirement to clustering:
the clustering result should be a structure on the data set that is “stable”
So there should be similar results when we:
Change some small portion of the data
Apply clustering to several datasets from the same underlying model
Apply clustering to several subsets of the initial dataset
We don't want to process gigabytes and terabytes to get several stable clusters that are independent of the randomness in sampling
8
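The "similar results on subsets" requirement can be measured directly. A sketch (my own, not from the talk; data and parameters are assumptions): cluster pairs of random half-samples and compare their labels on the points the two samples share.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def subsample_stability(X, n_clusters, n_pairs=5):
    """Cluster pairs of random half-samples and compare labels on their
    overlap; a high mean ARI indicates a stable clustering."""
    n = len(X)
    scores = []
    for seed in range(n_pairs):
        r = np.random.default_rng(seed)
        a = r.choice(n, size=n // 2, replace=False)
        b = r.choice(n, size=n // 2, replace=False)
        la = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X[a])
        lb = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X[b])
        pos_a = {idx: i for i, idx in enumerate(a)}   # point -> row in sample a
        pos_b = {idx: i for i, idx in enumerate(b)}
        common = np.intersect1d(a, b)                 # points in both samples
        scores.append(adjusted_rand_score([la[pos_a[p]] for p in common],
                                          [lb[pos_b[p]] for p in common]))
    return float(np.mean(scores))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 2))
               for c in ([0, 0], [5, 0], [0, 5])])
score = subsample_stability(X, n_clusters=3)
```

On well-separated blobs like these the score is close to 1.0; on noisy real data it drops, which is exactly the instability the talk wants to control.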
Stable clustering. Requirements
Natural restrictions:
We don't want too many clusters
We don't want too small or too big clusters
Too small clusters are usually useless for further processing
Too big clusters do not bring significant new information
Some points can be noise points, so let's try to find only the significant tendencies
It will be a big benefit if we can easily scale the results:
To be able to look at the inner structure of a selected cluster without a full rerun
Any additional instruments for manual analysis of the clustering are welcome
9
Stable clustering ideas
Do not use the whole dataset; use many small subsamples
Use several samplings to mine as much information as possible from the data
Average all clusterings on the samples to get a stable result
10
Stable clustering algorithm
1) Select N samples from the whole dataset
2) Do clustering for each sample
So for each sample we have a set of clusters (possibly very different)
3) Do some clusters' preprocessing
4) Associate clusters from different samples with each other
Build a relationship structure - a clusters graph
Set a relationship measure - a distance measure
5) Do clustering on the relationship structure
I.e. do community search
11
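The five steps above can be sketched end to end. This is my own minimal simplification, not the speaker's code: k-means per subsample, cluster centers as cluster representatives, a distance-thresholded graph, and connected components as a crude stand-in for community search.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import KMeans

def stable_clustering(X, n_samples=6, sample_frac=0.5, k=3, threshold=1.0):
    rng = np.random.default_rng(0)
    centers = []
    for _ in range(n_samples):                                    # 1) subsample
        idx = rng.choice(len(X), int(sample_frac * len(X)), replace=False)
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[idx])  # 2)
        centers.append(km.cluster_centers_)       # 3) centers summarize clusters
    C = np.vstack(centers)
    D = cdist(C, C)                               # 4) relationship measure
    A = csr_matrix(D < threshold)                 #    thresholded clusters graph
    n_comm, comm = connected_components(A, directed=False)  # 5) "communities"
    return C, comm, n_comm

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(150, 2))
               for c in ([0, 0], [5, 0], [0, 5])])
centers, communities, n_comm = stable_clustering(X)
```

Each community groups near-duplicate clusters found in different subsamples; averaging within a community gives the stable cluster.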
2. Sample clustering
Any clustering method:
Kmeans
Hierarchical clustering
…
It is convenient to use hierarchical clustering:
It is a rather fast clustering method
We can estimate the cluster count using the natural restrictions, instead of special criteria like we usually need for k-means
We can dig into the internal structure without any additional calculations
12
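For this per-sample step, SciPy's hierarchical clustering is one concrete option (an illustrative sketch; the talk only names the method, so data and parameters here are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(60, 2))
               for c in ([0, 0], [6, 0], [0, 6])])

# the linkage matrix encodes the whole dendrogram, so the inner
# structure of any cluster is available later without re-clustering
Z = linkage(X, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")   # cut into 3 clusters
```

Cutting the same `Z` at different levels gives coarser or finer clusterings for free, which is the "dig into internal structure" benefit above.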
2.1. Dendrogram clustering
Recursive splitting of large clusters
With natural restrictions:
Set max possible cluster size (in %)
Set min cluster size (in %); any smaller cluster is noise
Max number of splits
…
13
2.1. Dendrogram clustering
14
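One way to implement the recursive splitting with these restrictions (my reading of the slide, not the speaker's code; the size thresholds are the assumptions named above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def split_by_dendrogram(X, max_frac=0.5, min_frac=0.05, max_splits=10):
    """Recursively split any cluster larger than max_frac of the data in
    two via its own dendrogram; clusters below min_frac become noise (-1)."""
    n = len(X)
    labels = np.zeros(n, dtype=int)        # everything starts in one cluster
    next_id = 1
    for _ in range(max_splits):
        sizes = np.bincount(labels)
        too_big = [c for c, s in enumerate(sizes) if s > max_frac * n]
        if not too_big:
            break
        c = too_big[0]
        idx = np.where(labels == c)[0]
        Z = linkage(X[idx], method="ward")
        sub = fcluster(Z, t=2, criterion="maxclust")   # top-level split in two
        labels[idx[sub == 2]] = next_id
        next_id += 1
    for c, s in enumerate(np.bincount(labels)):
        if s < min_frac * n:
            labels[labels == c] = -1                   # mark as noise
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(100, 2))
               for c in ([0, 0], [6, 0], [0, 6])])
labels = split_by_dendrogram(X, max_frac=0.4)
```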
3. Do clusters' preprocessing
Reduce noise points
Cluster smoothing
Make clusters more convenient for associating:
A cluster can be similar to several other clusters (1-to-many)
If we split it, it can transform into 1-to-1 & 1-to-1 clusters
And some other heuristics...
15
4. Associate clusters from different samples to each other
How similar are clusters to each other?
Set a relationship measure:
The simplest measure is the distance between cluster centers
But we can use any suitable measure
16
4. Associate clusters from different samples to each other
The clusters' relationship structure is a clusters graph
But we are not interested in edges between very different clusters
So we need some threshold:
It can be estimated manually, then hard-coded
It can be estimated automatically
17
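A sketch of building this thresholded clusters graph (my own illustration; the percentile-based automatic threshold is an assumption, the talk only says the threshold can be estimated automatically):

```python
import numpy as np
from scipy.spatial.distance import cdist

def cluster_graph(centers, threshold=None):
    """Adjacency matrix over per-sample clusters: an edge means the two
    cluster centers are within `threshold` of each other."""
    D = cdist(centers, centers)
    off_diag = ~np.eye(len(D), dtype=bool)
    if threshold is None:
        # automatic estimate: a low percentile of the off-diagonal
        # distances keeps only the closest cluster pairs as edges
        threshold = np.percentile(D[off_diag], 10)
    A = (D <= threshold) & off_diag
    return A, threshold

# toy centers: two groups of near-duplicate clusters from different samples
centers = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
A, t = cluster_graph(centers)
```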
5. Communities search in networks
Methods:
walktrap.community
edge.betweenness.community
fastgreedy.community
spinglass.community
…
It is possible that some clusters will not fall into any community. Then we mark these clusters as a special type of community
18
5.1 Community structure detection based on edge betweenness
edge.betweenness.community() implements the Girvan–Newman algorithm
Betweenness - the number of geodesics (shortest paths) going through an edge
Algorithm:
Calculate edge-betweenness for all edges
Remove the edge with highest betweenness
Recalculate betweenness
Repeat until all edges are removed, or the modularity function is optimized (depending on the variation)
19
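The steps above, using NetworkX's `girvan_newman` as an off-the-shelf implementation (the talk uses igraph's `edge.betweenness.community` in R; this Python equivalent is my substitution):

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

# two triangles joined by one bridge edge (2, 3); the bridge carries all
# shortest paths between the triangles, so its betweenness is highest
# and it is removed first
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])

# girvan_newman yields the partition after each successive split;
# the first yield is the two communities created by removing the bridge
communities = next(girvan_newman(G))
parts = sorted(sorted(c) for c in communities)
```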
5. Communities examples
20
Algorithm analysis
21
Algorithm analysis
22
Algorithm analysis
23
Summary
Issues in clustering algorithms
Why stability is important for business questions
A 2-stage clustering algorithm:
1st stage - apply simple clustering on samples
2nd stage - do clustering on the clusters graph
Real data clustering example
The algorithm can be easily parallelized:
Most time is spent on the 2nd step
24
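The per-sample clusterings share nothing, so the sampling stage parallelizes trivially. A stdlib sketch (my own; sample count, fraction and k are illustrative assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ([0, 0], [5, 0], [0, 5])])

def cluster_sample(seed, frac=0.5, k=3):
    """Cluster one random subsample; each call is fully independent."""
    r = np.random.default_rng(seed)
    idx = r.choice(len(X), size=int(frac * len(X)), replace=False)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[idx]).cluster_centers_

# independent per-sample clusterings run concurrently (threads here for
# portability; processes or a compute cluster would also work)
with ThreadPoolExecutor(max_workers=4) as pool:
    sample_centers = list(pool.map(cluster_sample, range(8)))
```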