Clustering Introduction
TRANSCRIPT
Clustering for New Discovery in Data
Houston Machine Learning Meetup
Roadmap: Method
• Tour of machine learning algorithms (1 session)
• Feature engineering (1 session)
  – Feature selection - Yan
• Supervised learning (4 sessions)
  – Regression models - Yan
  – SVM and kernel SVM - Yan
  – Tree-based models - Dario
  – Bayesian method - Xiaoyang
  – Ensemble models - Yan
• Unsupervised learning (3 sessions)
  – K-means clustering, DBSCAN - Cheng
  – Mean shift, Agglomerative clustering - Kunal
  – Dimension reduction for data visualization - Yan
• Deep learning (4 sessions)
  – Neural network
  – From neural network to deep learning
  – Convolutional neural network
  – Train deep nets with open-source tools
Roadmap: Application
• Business analytics
• Recommendation system
• Natural language processing
• Computer vision
• Energy industry
Agenda
• Introduction
• Application of clustering
• K-means
• DBSCAN
• Cluster validation
What is clustering?
Clustering: discovering the natural groupings of a set of objects/patterns in unlabeled data
Application: Recommendation
Application: Document Clustering
https://www.noggle.online/knowledgebase/document-clustering/
Application: Pizza Hut Center Placement
Clustering delivery locations to choose store sites
Application: Discovering Gene Functions
Important for discovering diseases and treatments
Clustering Algorithms
• K-Means (King of clustering, many variants)
• DBSCAN (group neighboring points)
• Mean shift (locating the maxima of density)
• Spectral clustering (cares about connectivity instead of proximity)
• Hierarchical clustering (a hierarchical structure, multiple levels)
• Expectation Maximization (k-means is a variant of EM)
• Latent Dirichlet Allocation (natural language processing)
……
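As a quick illustration of the two algorithms this session focuses on, here is a minimal scikit-learn sketch; the dataset and parameter values are illustrative assumptions, not from the slides:

```python
# Minimal sketch: K-Means vs. DBSCAN on the same toy data (scikit-learn).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

# Three well-separated Gaussian blobs as illustrative data.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# K-Means must be told the number of clusters up front.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# DBSCAN infers the number of clusters from density; points in sparse
# regions get the label -1 (noise).
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print("K-Means cluster ids:", np.unique(kmeans_labels))
print("DBSCAN cluster ids: ", np.unique(dbscan_labels))
```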
• K-Means
• DBSCAN
Cluster Validation
Cluster Validity
• For cluster analysis, the question is how to evaluate the "goodness" of the resulting clusters.
• Why do we want to evaluate them?
  – To avoid finding patterns in noise
  – To compare clustering algorithms
  – To determine the optimal number of clusters
Cluster Validity
• Numerical measures:
  – External: measure the extent to which cluster labels match externally supplied class labels
    • Example: entropy
  – Internal: measure the goodness of a clustering structure without respect to external information
    • Example: Sum of Squared Error (SSE)
  – Relative: compare two different clusterings
    • Often an external or internal measure is used for this, e.g., SSE or entropy
• Visualization
Internal Measures: WSS and BSS
• Cluster Cohesion: measures how closely related the objects in a cluster are
  – Example: SSE
• Cluster Separation: measures how distinct or well-separated a cluster is from other clusters
• Example: squared error
  – Cohesion is measured by the within-cluster sum of squares (WSS, i.e., the SSE):

    $$\mathrm{WSS} = \sum_i \sum_{x \in C_i} (x - m_i)^2$$

  – Separation is measured by the between-cluster sum of squares:

    $$\mathrm{BSS} = \sum_i |C_i| \, (m - m_i)^2$$

  – where |C_i| is the size of cluster i, m_i is its centroid, and m is the overall mean
Internal Measures: WSS and BSS
• Example: SSE for four 1-D points (1, 2, 4, 5), overall mean m = 3
  – BSS + WSS = constant

[Figure: points 1, 2, 4, 5 on a number line, with overall mean m and cluster centroids m1, m2]

K = 1 cluster:
  WSS = (1 - 3)² + (2 - 3)² + (4 - 3)² + (5 - 3)² = 10
  BSS = 4 × (3 - 3)² = 0
  Total = 10 + 0 = 10

K = 2 clusters ({1, 2} with m1 = 1.5 and {4, 5} with m2 = 4.5):
  WSS = (1 - 1.5)² + (2 - 1.5)² + (4 - 4.5)² + (5 - 4.5)² = 1
  BSS = 2 × (3 - 1.5)² + 2 × (4.5 - 3)² = 9
  Total = 1 + 9 = 10
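The arithmetic above is easy to check with a direct implementation; a sketch (our own helper, not library code):

```python
# Verify the 1-D example: WSS + BSS is the same for K = 1 and K = 2.
import numpy as np

def wss_bss(X, labels):
    m = X.mean(axis=0)                          # overall mean
    wss = bss = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)                    # cluster centroid
        wss += ((Xc - mc) ** 2).sum()           # cohesion term
        bss += len(Xc) * ((m - mc) ** 2).sum()  # separation term
    return wss, bss

X = np.array([[1.0], [2.0], [4.0], [5.0]])
print(wss_bss(X, np.array([0, 0, 0, 0])))  # K = 1 -> (10.0, 0.0)
print(wss_bss(X, np.array([0, 0, 1, 1])))  # K = 2 -> (1.0, 9.0); total stays 10
```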
Internal Measures: WSS and BSS
• WSS can be used to estimate the number of clusters

[Figure: SSE plotted against the number of clusters K; the knee ("elbow") of the curve suggests the natural number of clusters]
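In practice this is the familiar "elbow" plot; a minimal sketch assuming scikit-learn (whose KMeans exposes WSS as `inertia_`) and matplotlib, on illustrative data:

```python
# Elbow plot: WSS (inertia) against K; look for the knee in the curve.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)  # illustrative data

ks = range(1, 11)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wss, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("WSS")
plt.show()
```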
Internal Measures: Proximity graph measures
• Cluster cohesion is the sum of the weight of all links within a cluster.
• Cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster.
[Figure: within-cluster links illustrate cohesion; links crossing to other clusters illustrate separation]
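A minimal sketch of these graph measures, assuming a symmetric non-negative edge-weight matrix `W` (the helper is ours, not from the slides):

```python
# Graph-based cohesion and separation for one cluster, from a weight matrix W.
import numpy as np

def graph_cohesion_separation(W, labels, cluster):
    """W[i, j] is the link weight between points i and j (W symmetric)."""
    inside = labels == cluster
    cohesion = W[np.ix_(inside, inside)].sum() / 2.0  # each within-link once
    separation = W[np.ix_(inside, ~inside)].sum()     # links leaving the cluster
    return cohesion, separation
```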
Correlation between affinity matrix and incidence matrix
• Given the affinity (distance) matrix D = {d₁₁, d₁₂, …, dₙₙ} and the incidence matrix C = {c₁₁, c₁₂, …, cₙₙ} from the clustering, where c_ij = 1 if points i and j fall in the same cluster and 0 otherwise
• The correlation r between D and C is given by

$$r = \frac{\sum_{i,j=1}^{n} (d_{ij} - \bar{d})(c_{ij} - \bar{c})}{\sqrt{\sum_{i,j=1}^{n} (d_{ij} - \bar{d})^2 \; \sum_{i,j=1}^{n} (c_{ij} - \bar{c})^2}}$$
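A sketch of computing r (helper name ours, SciPy assumed for the pairwise distances). Because same-cluster pairs have c_ij = 1 and, for a good clustering, small d_ij, strong cluster structure shows up as a strongly negative r:

```python
# Correlation between the pairwise-distance matrix D and the incidence matrix C.
import numpy as np
from scipy.spatial.distance import cdist

def proximity_incidence_correlation(X, labels):
    D = cdist(X, X)                                         # distance matrix
    C = (labels[:, None] == labels[None, :]).astype(float)  # incidence matrix
    iu = np.triu_indices(len(X), k=1)  # each pair counted once, diagonal skipped
    return np.corrcoef(D[iu], C[iu])[0, 1]
```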
Correlation with Incidence matrix
(the same correlation formula as above, applied to two example datasets)

[Figure: two 2-D datasets in the unit square with their clusterings, one with clear cluster structure and one without]

r = -0.9235 (clear clusters)    r = -0.5810 (weaker structure)
Visualization of similarity matrix
• Order the similarity matrix with respect to cluster labels and inspect visually.
[Figure: left, three well-separated clusters in the unit square; right, the pairwise similarity matrix (scale 0 to 1) with points sorted by cluster label, showing three sharp diagonal blocks]
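A minimal sketch of producing such a plot, assuming scikit-learn and matplotlib; the distance-to-similarity rescaling used here is one simple choice among several:

```python
# Sort points by cluster label and display the reordered similarity matrix;
# a good clustering shows up as bright blocks along the diagonal.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

order = np.argsort(labels)   # group points cluster by cluster
D = cdist(X, X)
S = 1 - D / D.max()          # crude similarity in [0, 1] from distance
plt.imshow(S[order][:, order], cmap="viridis")
plt.colorbar(label="Similarity")
plt.show()
```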
Visualization of similarity matrix
• Clusters in random data are not so crisp

[Figure: random points in the unit square and their cluster-sorted similarity matrix; the diagonal block structure is much weaker than for well-separated clusters]
Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult and frustrating part of cluster analysis.
Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”
Algorithms for Clustering Data, Jain and Dubes
Roadmap: Method
• Tour of machine learning algorithms (1 session)
• Feature engineering (1 session)
  – Feature selection - Yan
• Supervised learning (4 sessions)
  – Regression models - Yan
  – SVM and kernel SVM - Yan
  – Tree-based models - Dario
  – Bayesian method - Xiaoyang
  – Ensemble models - Yan
• Unsupervised learning (3 sessions)
  – K-means clustering, DBSCAN - Cheng
  – Mean shift, Hierarchical clustering - Kunal
  – Dimension reduction for data visualization - Yan
• Deep learning (4 sessions)
  – Neural network
  – From neural network to deep learning - Yan
  – Convolutional neural network
  – Train deep nets with open-source tools
Thank you
Slides will be posted on SlideShare:
http://www.slideshare.net/xuyangela