Clustering for New Discovery in Data Houston Machine Learning Meetup

Upload: yan-xu

Posted on 15-Apr-2017


TRANSCRIPT

Page 1: Clustering introduction

Clustering for New Discovery in Data

Houston Machine Learning Meetup

Page 2: Clustering introduction


Roadmap: Method

• Tour of machine learning algorithms (1 session)

• Feature engineering (1 session)
  – Feature selection - Yan

• Supervised learning (4 sessions)
  – Regression models - Yan
  – SVM and kernel SVM - Yan
  – Tree-based models - Dario
  – Bayesian methods - Xiaoyang
  – Ensemble models - Yan

• Unsupervised learning (3 sessions)
  – K-means clustering, DBSCAN - Cheng
  – Mean shift, agglomerative clustering - Kunal
  – Dimension reduction for data visualization - Yan

• Deep learning (4 sessions)
  – From neural network to deep learning
  – Convolutional neural network
  – Train deep nets with open-source tools

Page 3: Clustering introduction


Roadmap: Application

• Business analytics

• Recommendation system

• Natural language processing

• Computer vision

• Energy industry

Page 4: Clustering introduction


Agenda

• Introduction

• Application of clustering

• K-means

• DBSCAN

• Cluster validation

Page 5: Clustering introduction


What is clustering

Clustering: discovering the natural groupings of a set of objects or patterns in unlabeled data

Page 6: Clustering introduction


Application: Recommendation

Page 7: Clustering introduction


Application: Document Clustering

https://www.noggle.online/knowledgebase/document-clustering/

Page 8: Clustering introduction


Application: Pizza Hut center placement

Delivery locations

Page 9: Clustering introduction


Application: Discovering gene functions

Important for discovering diseases and treatments

Page 10: Clustering introduction


Clustering Algorithm

• K-Means (King of clustering, many variants)

• DBSCAN (group neighboring points)

• Mean shift (locating the maxima of density)

• Spectral clustering (cares about connectivity instead of proximity)

• Hierarchical clustering (a hierarchical structure, multiple levels)

• Expectation Maximization (k-means is a variant of EM)

• Latent Dirichlet Allocation (natural language processing)

……
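Two of the algorithms above can be run side by side in a few lines. A minimal Python sketch; the toy dataset and all parameter values are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Toy data: three Gaussian blobs (illustrative assumption)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# K-Means: partitions the data around k centroids; k is chosen up front
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: groups density-connected neighbors; the number of clusters is
# discovered from the data, and outliers get the label -1
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print(len(set(km_labels)))           # number of k-means clusters
print(len(set(db_labels) - {-1}))    # number of DBSCAN clusters (noise excluded)
```

The contrast is the point: K-Means needs k as input, while DBSCAN infers the cluster count from density and marks stragglers as noise.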

Page 11: Clustering introduction


• K-Means

• DBSCAN

Page 12: Clustering introduction


Cluster Validation

Page 13: Clustering introduction


Cluster Validity

• For cluster analysis, the question is: how do we evaluate the "goodness" of the resulting clusters?

• Why do we want to evaluate them?
  – To avoid finding patterns in noise
  – To compare clustering algorithms
  – To determine the optimal number of clusters

Page 14: Clustering introduction


Cluster Validity

• Numerical measures:
  – External: measure the extent to which cluster labels match externally supplied class labels.
    • Example: entropy
  – Internal: measure the goodness of a clustering structure without respect to external information.
    • Example: Sum of Squared Error (SSE)
  – Relative: compare two different clusterings.
    • Often an external or internal measure is used for this, e.g., SSE or entropy

• Visualization
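As a sketch of the external measure mentioned above, entropy can be computed directly when ground-truth class labels are available. The helper name and the toy labels are illustrative assumptions:

```python
import numpy as np

def cluster_entropy(true_labels, cluster_labels):
    """Weighted average entropy of the class distribution inside each cluster."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    n = len(true_labels)
    total = 0.0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]          # true classes in cluster c
        _, counts = np.unique(members, return_counts=True)
        p = counts / counts.sum()                           # class proportions
        total += (len(members) / n) * -(p * np.log2(p)).sum()
    return total

# Pure clusters give entropy 0; fully mixed clusters give 1 bit
print(cluster_entropy([0, 0, 1, 1], [0, 0, 1, 1]))  # 0.0
print(cluster_entropy([0, 1, 0, 1], [0, 0, 1, 1]))  # 1.0
```

Lower entropy means purer clusters with respect to the supplied class labels.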

Page 15: Clustering introduction


Internal Measures: WSS and BSS

• Cluster Cohesion: measures how closely related the objects within a cluster are
  – Example: SSE

• Cluster Separation: measures how distinct or well-separated a cluster is from other clusters

• Example: Squared Error
  – Cohesion is measured by the within-cluster sum of squares (WSS)
  – Separation is measured by the between-cluster sum of squares (BSS)

$$\mathrm{WSS} = \sum_i \sum_{x \in C_i} (x - m_i)^2 \qquad \mathrm{BSS} = \sum_i |C_i|\,(m - m_i)^2$$

  – where $|C_i|$ is the size of cluster $i$, $m_i$ is its centroid, and $m$ is the overall mean

Page 16: Clustering introduction


Internal Measures: WSS and BSS

• Example: SSE (note that BSS + WSS = constant)

Points: 1, 2, 4, 5; overall mean m = 3.

K = 1 cluster:
  WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
  BSS = 4 × (3 − 3)² = 0
  Total = WSS + BSS = 10

K = 2 clusters ({1, 2} with m₁ = 1.5 and {4, 5} with m₂ = 4.5):
  WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
  BSS = 2 × (3 − 1.5)² + 2 × (3 − 4.5)² = 9
  Total = WSS + BSS = 10

Page 17: Clustering introduction


Internal Measures: WSS and BSS

• Can be used to estimate the number of clusters

[Figure: example data and the corresponding SSE/WSS curve plotted against the number of clusters K; the curve drops steeply and then flattens, and the bend (the "elbow") suggests the number of clusters]
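This elbow check can be sketched with k-means inertia (scikit-learn's name for the within-cluster SSE); the synthetic dataset and parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Four well-separated blobs (illustrative assumption)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=1)

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias.append(km.inertia_)  # inertia_ is the within-cluster SSE (WSS)

# WSS shrinks as K grows; the steep drop flattens near the true K,
# which is the "elbow" one looks for on the plot.
print([round(v, 1) for v in inertias])
```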

Page 18: Clustering introduction


Internal Measures: Proximity graph measures

• Cluster cohesion is the sum of the weight of all links within a cluster.

• Cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster.

[Figure: two small graphs illustrating cohesion (edges within a cluster) and separation (edges crossing between clusters)]
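These graph-based definitions can be sketched with a small adjacency matrix of edge weights; all the numbers here are made-up assumptions:

```python
import numpy as np

# Symmetric similarity weights between 4 points; points 0-1 and 2-3
# form two tight pairs with weak links across them (made-up values)
W = np.array([
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.2, 0.1],
    [0.1, 0.2, 0.0, 0.8],
    [0.0, 0.1, 0.8, 0.0],
])
labels = np.array([0, 0, 1, 1])

def cohesion(W, labels, c):
    """Sum of edge weights between pairs inside cluster c."""
    idx = np.where(labels == c)[0]
    return W[np.ix_(idx, idx)].sum() / 2  # the matrix counts each edge twice

def separation(W, labels, c):
    """Sum of edge weights crossing from cluster c to the rest."""
    inside = labels == c
    return W[np.ix_(np.where(inside)[0], np.where(~inside)[0])].sum()

print(cohesion(W, labels, 0))    # within-cluster weight of cluster 0
print(separation(W, labels, 0))  # cross-cluster weight of cluster 0
```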

Page 19: Clustering introduction


Correlation between affinity matrix and incidence matrix

• Given an affinity (distance) matrix $D = \{d_{11}, d_{12}, \dots, d_{nn}\}$ and an incidence matrix $C = \{c_{11}, c_{12}, \dots, c_{nn}\}$ from the clustering ($c_{ij} = 1$ when points $i$ and $j$ fall in the same cluster, else $0$)

• The correlation $r$ between $D$ and $C$ is given by

$$r = \frac{\sum_{i,j=1}^{n} (d_{ij} - \bar{d})(c_{ij} - \bar{c})}{\sqrt{\sum_{i,j=1}^{n} (d_{ij} - \bar{d})^2}\,\sqrt{\sum_{i,j=1}^{n} (c_{ij} - \bar{c})^2}}$$
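A sketch of this measure in Python: build C from the cluster labels, flatten both matrices, and take the Pearson correlation. The toy two-blob dataset is an illustrative assumption:

```python
import numpy as np

def incidence_distance_corr(X, labels):
    """Correlation between pairwise distances and cluster co-membership."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # distance matrix
    C = (labels[:, None] == labels[None, :]).astype(float)       # incidence matrix
    return np.corrcoef(D.ravel(), C.ravel())[0, 1]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),   # tight blob near (0, 0)
               rng.normal(1.0, 0.1, size=(20, 2))])  # tight blob near (1, 1)
labels = np.array([0] * 20 + [1] * 20)

# A good clustering gives a strong negative correlation: distances are
# small exactly where c_ij = 1 and large where c_ij = 0.
print(round(incidence_distance_corr(X, labels), 3))
```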

Page 20: Clustering introduction


Correlation with Incidence matrix

[Figure: the same correlation applied to two example datasets in the unit square — well-separated clusters give r = −0.9235, while random data gives r = −0.5810]

Page 21: Clustering introduction


Visualization of similarity matrix

• Order the similarity matrix with respect to cluster labels and inspect visually.

[Figure: scatter plot of well-separated clusters in the unit square and the corresponding similarity matrix, reordered by cluster label, showing crisp diagonal blocks]
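The reordering trick can be sketched as follows; the similarity transform 1/(1 + d) and the toy data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
# Two tight blobs, then shuffled so the cluster structure is hidden
X = np.vstack([rng.normal(0.0, 0.1, (15, 2)), rng.normal(1.0, 0.1, (15, 2))])
labels = np.array([0] * 15 + [1] * 15)
perm = rng.permutation(30)
X, labels = X[perm], labels[perm]

D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
S = 1.0 / (1.0 + D)                 # similarity: 1 when identical, -> 0 far apart

order = np.argsort(labels)          # sort rows/columns by cluster label
S_ordered = S[np.ix_(order, order)]

# After ordering, within-cluster blocks should be brighter (more similar)
within = S_ordered[:15, :15].mean()
between = S_ordered[:15, 15:].mean()
print(within > between)  # True
```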

Page 22: Clustering introduction


• Clusters in random data are not so crisp

[Figure: scatter plot of random points in the unit square and the corresponding cluster-ordered similarity matrix; the diagonal block structure is much weaker]

Page 23: Clustering introduction


Final Comment on Cluster Validity

“The validation of clustering structures is the most difficult and frustrating part of cluster analysis.

Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”

Algorithms for Clustering Data, Jain and Dubes

Page 24: Clustering introduction


Roadmap: Method

• Tour of machine learning algorithms (1 session)

• Feature engineering (1 session)
  – Feature selection - Yan

• Supervised learning (4 sessions)
  – Regression models - Yan
  – SVM and kernel SVM - Yan
  – Tree-based models - Dario
  – Bayesian methods - Xiaoyang
  – Ensemble models - Yan

• Unsupervised learning (3 sessions)
  – K-means clustering, DBSCAN - Cheng
  – Mean shift, hierarchical clustering - Kunal
  – Dimension reduction for data visualization - Yan

• Deep learning (4 sessions)
  – From neural network to deep learning - Yan
  – Convolutional neural network
  – Train deep nets with open-source tools

Page 25: Clustering introduction


Thank you

Slides will be posted on SlideShare:

http://www.slideshare.net/xuyangela