Clustering Introduction
TRANSCRIPT
Clustering for New Discovery in Data
Houston Machine Learning Meetup
Roadmap: Method
• Tour of machine learning algorithms (1 session)
• Feature engineering (1 session)
  – Feature selection - Yan
• Supervised learning (4 sessions)
  – Regression models - Yan
  – SVM and kernel SVM - Yan
  – Tree-based models - Dario
  – Bayesian method - Xiaoyang
  – Ensemble models - Yan
• Unsupervised learning (3 sessions)
  – K-means clustering, DBSCAN - Cheng
  – Mean shift, Agglomerative clustering - Kunal
  – Dimension reduction for data visualization - Yan
• Deep learning (4 sessions)
  – Neural network
  – From neural network to deep learning
  – Convolutional neural network
  – Train deep nets with open-source tools
Roadmap: Application
• Business analytics
• Recommendation system
• Natural language processing
• Computer vision
• Energy industry
Agenda
• Introduction
• Application of clustering
• K-means
• DBSCAN
• Cluster validation
What is clustering?
Clustering: discovering the natural groupings of a set of objects/patterns in unlabeled data
Application: Recommendation
Application: Document Clustering
https://www.noggle.online/knowledgebase/document-clustering/
Application: Pizza Hut Center Placement
Clustering delivery locations to choose store sites
Application: Discovering Gene Functions
Important for discovering diseases and treatments
Clustering Algorithms
• K-Means (King of clustering, many variants)
• DBSCAN (group neighboring points)
• Mean shift (locating the maxima of density)
• Spectral clustering (cares about connectivity instead of proximity)
• Hierarchical clustering (a hierarchical structure, multiple levels)
• Expectation Maximization (k-means is a variant of EM)
• Latent Dirichlet Allocation (natural language processing)
……
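As a quick illustration of the two algorithms this session focuses on, here is a minimal scikit-learn sketch; the dataset and parameter values are illustrative assumptions, not from the slides:

```python
# Minimal sketch: K-Means vs. DBSCAN on the same toy data (scikit-learn).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

# Three well-separated Gaussian blobs as illustrative data.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# K-Means must be told the number of clusters up front.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# DBSCAN infers the number of clusters from density; points in sparse
# regions get the label -1 (noise).
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print("K-Means cluster ids:", np.unique(kmeans_labels))
print("DBSCAN cluster ids: ", np.unique(dbscan_labels))
```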
• K-Means
• DBSCAN
Cluster Validation
Cluster Validity
• For cluster analysis, the question is how to evaluate the "goodness" of the resulting clusters.
• Why do we want to evaluate them?
  – To avoid finding patterns in noise
  – To compare clustering algorithms
  – To determine the optimal number of clusters
Cluster Validity
• Numerical measures:
  – External: measure the extent to which cluster labels match externally supplied class labels
    • Example: entropy
  – Internal: measure the goodness of a clustering structure without respect to external information
    • Example: Sum of Squared Error (SSE)
  – Relative: compare two different clusterings
    • Often an external or internal measure is used for this, e.g., SSE or entropy
• Visualization
Internal Measures: WSS and BSS
• Cluster Cohesion: measures how closely related the objects in a cluster are
  – Example: SSE
• Cluster Separation: measures how distinct or well-separated a cluster is from other clusters
• Example: squared error
  – Cohesion is measured by the within-cluster sum of squares (WSS, i.e., the SSE):

    $$\mathrm{WSS} = \sum_i \sum_{x \in C_i} (x - m_i)^2$$

  – Separation is measured by the between-cluster sum of squares:

    $$\mathrm{BSS} = \sum_i |C_i| \, (m - m_i)^2$$

  – where |C_i| is the size of cluster i, m_i is its centroid, and m is the overall mean
Internal Measures: WSS and BSS
• Example: SSE for four 1-D points (1, 2, 4, 5), overall mean m = 3
  – BSS + WSS = constant

[Figure: points 1, 2, 4, 5 on a number line, with overall mean m and cluster centroids m1, m2]

K = 1 cluster:
  WSS = (1 - 3)² + (2 - 3)² + (4 - 3)² + (5 - 3)² = 10
  BSS = 4 × (3 - 3)² = 0
  Total = 10 + 0 = 10

K = 2 clusters ({1, 2} with m1 = 1.5 and {4, 5} with m2 = 4.5):
  WSS = (1 - 1.5)² + (2 - 1.5)² + (4 - 4.5)² + (5 - 4.5)² = 1
  BSS = 2 × (3 - 1.5)² + 2 × (4.5 - 3)² = 9
  Total = 1 + 9 = 10
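The arithmetic above is easy to check with a direct implementation; a sketch (our own helper, not library code):

```python
# Verify the 1-D example: WSS + BSS is the same for K = 1 and K = 2.
import numpy as np

def wss_bss(X, labels):
    m = X.mean(axis=0)                          # overall mean
    wss = bss = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)                    # cluster centroid
        wss += ((Xc - mc) ** 2).sum()           # cohesion term
        bss += len(Xc) * ((m - mc) ** 2).sum()  # separation term
    return wss, bss

X = np.array([[1.0], [2.0], [4.0], [5.0]])
print(wss_bss(X, np.array([0, 0, 0, 0])))  # K = 1 -> (10.0, 0.0)
print(wss_bss(X, np.array([0, 0, 1, 1])))  # K = 2 -> (1.0, 9.0); total stays 10
```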
Internal Measures: WSS and BSS
• WSS can be used to estimate the number of clusters

[Figure: SSE plotted against the number of clusters K; the knee ("elbow") of the curve suggests the natural number of clusters]
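In practice this is the familiar "elbow" plot; a minimal sketch assuming scikit-learn (whose KMeans exposes WSS as `inertia_`) and matplotlib, on illustrative data:

```python
# Elbow plot: WSS (inertia) against K; look for the knee in the curve.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)  # illustrative data

ks = range(1, 11)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wss, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("WSS")
plt.show()
```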
Internal Measures: Proximity graph measures
• Cluster cohesion is the sum of the weight of all links within a cluster.
• Cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster.
[Figure: within-cluster links illustrate cohesion; links crossing to other clusters illustrate separation]
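A minimal sketch of these graph measures, assuming a symmetric non-negative edge-weight matrix `W` (the helper is ours, not from the slides):

```python
# Graph-based cohesion and separation for one cluster, from a weight matrix W.
import numpy as np

def graph_cohesion_separation(W, labels, cluster):
    """W[i, j] is the link weight between points i and j (W symmetric)."""
    inside = labels == cluster
    cohesion = W[np.ix_(inside, inside)].sum() / 2.0  # each within-link once
    separation = W[np.ix_(inside, ~inside)].sum()     # links leaving the cluster
    return cohesion, separation
```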
Correlation between affinity matrix and incidence matrix
• Given the affinity (distance) matrix D = {d₁₁, d₁₂, …, dₙₙ} and the incidence matrix C = {c₁₁, c₁₂, …, cₙₙ} from the clustering, where c_ij = 1 if points i and j fall in the same cluster and 0 otherwise
• The correlation r between D and C is given by

$$r = \frac{\sum_{i,j=1}^{n} (d_{ij} - \bar{d})(c_{ij} - \bar{c})}{\sqrt{\sum_{i,j=1}^{n} (d_{ij} - \bar{d})^2 \; \sum_{i,j=1}^{n} (c_{ij} - \bar{c})^2}}$$
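A sketch of computing r (helper name ours, SciPy assumed for the pairwise distances). Because same-cluster pairs have c_ij = 1 and, for a good clustering, small d_ij, strong cluster structure shows up as a strongly negative r:

```python
# Correlation between the pairwise-distance matrix D and the incidence matrix C.
import numpy as np
from scipy.spatial.distance import cdist

def proximity_incidence_correlation(X, labels):
    D = cdist(X, X)                                         # distance matrix
    C = (labels[:, None] == labels[None, :]).astype(float)  # incidence matrix
    iu = np.triu_indices(len(X), k=1)  # each pair counted once, diagonal skipped
    return np.corrcoef(D[iu], C[iu])[0, 1]
```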
Correlation with Incidence matrix
(the same correlation formula as above, applied to two example datasets)

[Figure: two 2-D datasets in the unit square with their clusterings, one with clear cluster structure and one without]

r = -0.9235 (clear clusters)    r = -0.5810 (weaker structure)
Visualization of similarity matrix
• Order the similarity matrix with respect to cluster labels and inspect visually.
[Figure: left, three well-separated clusters in the unit square; right, the pairwise similarity matrix (scale 0 to 1) with points sorted by cluster label, showing three sharp diagonal blocks]
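A minimal sketch of producing such a plot, assuming scikit-learn and matplotlib; the distance-to-similarity rescaling used here is one simple choice among several:

```python
# Sort points by cluster label and display the reordered similarity matrix;
# a good clustering shows up as bright blocks along the diagonal.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

order = np.argsort(labels)   # group points cluster by cluster
D = cdist(X, X)
S = 1 - D / D.max()          # crude similarity in [0, 1] from distance
plt.imshow(S[order][:, order], cmap="viridis")
plt.colorbar(label="Similarity")
plt.show()
```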
Visualization of similarity matrix
• Clusters in random data are not so crisp

[Figure: random points in the unit square and their cluster-sorted similarity matrix; the diagonal block structure is much weaker than for well-separated clusters]
Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult and frustrating part of cluster analysis.
Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”
Algorithms for Clustering Data, Jain and Dubes
Roadmap: Method
• Tour of machine learning algorithms (1 session)
• Feature engineering (1 session)
  – Feature selection - Yan
• Supervised learning (4 sessions)
  – Regression models - Yan
  – SVM and kernel SVM - Yan
  – Tree-based models - Dario
  – Bayesian method - Xiaoyang
  – Ensemble models - Yan
• Unsupervised learning (3 sessions)
  – K-means clustering, DBSCAN - Cheng
  – Mean shift, Hierarchical clustering - Kunal
  – Dimension reduction for data visualization - Yan
• Deep learning (4 sessions)
  – Neural network
  – From neural network to deep learning - Yan
  – Convolutional neural network
  – Train deep nets with open-source tools
Thank you
Slides will be posted on SlideShare:
http://www.slideshare.net/xuyangela