
12. Clustering

Foundations of Machine Learning
CentraleSupélec — Fall 2016

Benoît Playe, Chloé-Agathe Azencott
Centre for Computational Biology, Mines ParisTech
chloe-agathe.azencott@mines-paristech.fr

Learning objectives

● Explain what clustering algorithms can be used for.
● Explain and implement three different ways to evaluate clustering algorithms.
● Implement hierarchical clustering, discuss its various flavors.
● Implement k-means clustering, discuss its advantages and drawbacks.


Goals of clustering

Group objects that are similar into clusters: classes that are unknown beforehand.

E.g.
– group genes that are similarly affected by a disease
– group people who share the same interests (marketing purposes)
– group pixels in an image that belong to the same object (image segmentation).

Applications of clustering

● Understand general characteristics of the data
● Visualize the data
● Infer some properties of a data point based on how it relates to other data points

E.g.
– find subtypes of diseases
– visualize protein families
– find categories among images
– find patterns in financial transactions
– detect communities in social networks

Centroids and medoids

● Centroid: mean of the points in the cluster.
● Medoid: point in the cluster that is closest to the centroid.
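For illustration only (not from the slides), a minimal NumPy sketch computing the centroid and, with the definition above, the medoid of one cluster:

    import numpy as np

    # Illustrative sketch, not from the slides.
    def centroid(points):
        # Mean of the points in the cluster; points is an (n_points, n_features) array.
        return points.mean(axis=0)

    def medoid(points):
        # Point of the cluster that is closest to the centroid (definition used above).
        c = centroid(points)
        dists = np.linalg.norm(points - c, axis=1)
        return points[np.argmin(dists)]

    points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [4.0, 4.0]])
    print(centroid(points))  # [1.25 1.25]
    print(medoid(points))    # [1. 0.]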

Distances and similarities

Similarities

● Kernels define similarities: for a given mapping φ from the space of objects X to some Hilbert space H, the kernel between two objects x and x' is the inner product of their images in the feature space:
k(x, x') = ⟨φ(x), φ(x')⟩_H


Distances

● Assess how close / far
– data points are from each other
– a data point is from a cluster
– two clusters are from each other

● Distance metric: a proper metric d satisfies, in particular,
– symmetry: d(x, x') = d(x', x)
– triangle inequality: d(x, x'') ≤ d(x, x') + d(x', x'')


Distance & similarities

● Transform distances into similarities?
● Transform similarities into distances?
Generalization:
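As a reference point, standard conversions (an assumption: not necessarily the exact formulas used on the slide) are a decreasing transformation of the distance, e.g. a Gaussian one, and the kernel-induced distance:

    % Hedged sketch: standard conversions, not necessarily the slide's formulas.
    \[ s(x, x') = \exp\left( - \frac{d(x, x')^2}{2 \sigma^2} \right)
       \qquad \text{(distance to similarity)} \]
    \[ d(x, x')^2 = k(x, x) - 2\,k(x, x') + k(x', x')
       \qquad \text{(similarity/kernel to distance)} \]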

Distances

● L-q norm
● Pearson's correlation: measure of the linear correlation between two variables
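For reference, the standard definitions for two points x, z ∈ ℝ^p are:

    \[ d_q(x, z) = \left( \sum_{j=1}^{p} |x_j - z_j|^q \right)^{1/q}
       \qquad \text{(L-q norm)} \]
    \[ \rho(x, z) = \frac{\sum_{j=1}^{p} (x_j - \bar{x})(z_j - \bar{z})}
                         {\sqrt{\sum_{j=1}^{p} (x_j - \bar{x})^2}\,
                          \sqrt{\sum_{j=1}^{p} (z_j - \bar{z})^2}}
       \qquad \text{(Pearson's correlation)} \]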


Pearson's correlation

● What does it correspond to if the features are centered?
Normalized dot product = cosine of the angle between the two feature vectors.
[Figure: two feature vectors in the (feature 1, feature 2) plane.]

● Max value of ρ: 1
There is a positive linear relationship between the features of x and the features of z.
● Min value of ρ: -1
There is a negative linear relationship between the features of x and the features of z.

Pearson vs. Euclidean

● Pearson's correlation coefficient
Profiles of similar shapes will be close to each other, even if they differ in magnitude.
● Euclidean distance
Magnitude is taken into account.
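A minimal illustration of this difference on toy data (not from the slides):

    import numpy as np
    from scipy.stats import pearsonr

    # Toy "profiles" (not from the slides): two with the same shape but
    # different magnitudes, and one with a different shape.
    x = np.array([1.0, 2.0, 3.0, 4.0])
    z = np.array([10.0, 20.0, 30.0, 40.0])   # same shape as x, 10x the magnitude
    w = np.array([4.0, 1.0, 3.0, 2.0])       # different shape

    print(np.linalg.norm(x - z))   # Euclidean distance: large (~49.3)
    print(np.linalg.norm(x - w))   # Euclidean distance: small (~3.7)
    print(pearsonr(x, z)[0])       # Pearson's correlation: 1.0 (same shape)
    print(pearsonr(x, w)[0])       # Pearson's correlation: -0.4 (different shape)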

Evaluating clusters


Evaluating clusters

● Clustering is unsupervised. There is no ground truth.
● How do we evaluate the quality of a clustering algorithm?

1) Based on the shape of the clusters:
Points within the same cluster should be nearby/similar, and points far from each other should belong to different clusters.

2) Based on the stability of the clusters:
We should get the same results if we remove some data points, add noise, etc.

3) Based on domain knowledge:
The clusters should “make sense”.

Clusters shape

● Cluster tightness (homogeneity): T_k
● Cluster separation: S_kl
● Davies-Bouldin index
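With T_k measuring the tightness of cluster C_k and S_kl the separation between clusters C_k and C_l, the Davies-Bouldin index takes the standard form below; the exact choices of T_k and S_kl may differ from the slide, a common one being the mean distance to the centroid for T_k and the distance between centroids for S_kl. Lower values indicate tighter, better-separated clusters.

    \[ \mathrm{DB} = \frac{1}{K} \sum_{k=1}^{K} \max_{l \neq k} \frac{T_k + T_l}{S_{kl}} \]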


Clusters shape

● Silhouette coefficient
a(x) = average dissimilarity of x with the other points in the same cluster
b(x) = lowest average dissimilarity of x with the points of any other cluster
s(x) = (b(x) − a(x)) / max(a(x), b(x))
– If x is very close to the other points in the same cluster, and very different from points in other clusters, s(x) ≈ +1.
– If x is assigned to the wrong cluster, s(x) ≈ -1.
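In practice the silhouette can be computed directly, e.g. with scikit-learn (a usage sketch on toy data with k-means labels, not the lab's code):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score, silhouette_samples

    # Toy data and an arbitrary k-means clustering (illustration only).
    X = np.random.RandomState(0).randn(200, 2)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(silhouette_score(X, labels))        # mean s(x) over all points
    print(silhouette_samples(X, labels)[:5])  # per-point silhouette coefficients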

Cluster stability

● How many clusters?
● k=4: what clusters do you expect for k=2 or k=5?
[Figure: the same data set clustered with k=4, k=2, and k=5.]
● Several algorithms, or several runs of the same algorithm, might give you different answers.

Cluster stability

● Measuring cluster stability
– Generate perturbed versions of the original dataset (for example by sub-sampling or adding noise).
– Cluster each perturbed data set with the desired algorithm into k clusters.
– Instability measure: how much the resulting clusterings disagree with each other, on average.
● One can choose the number of clusters that minimizes the instability measure.
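One possible implementation sketch, using sub-sampling, k-means, and one minus the adjusted Rand index as the disagreement measure (these specific choices are assumptions, not necessarily those of the lecture):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    def instability(X, k, n_pairs=10, frac=0.8, seed=0):
        """Average disagreement (1 - adjusted Rand index, an assumed choice)
        between clusterings of random sub-samples, on the points they share."""
        rng = np.random.RandomState(seed)
        n = X.shape[0]
        scores = []
        for _ in range(n_pairs):
            a = rng.choice(n, int(frac * n), replace=False)
            b = rng.choice(n, int(frac * n), replace=False)
            common = np.intersect1d(a, b)
            la = KMeans(n_clusters=k, n_init=10, random_state=rng).fit_predict(X[a])
            lb = KMeans(n_clusters=k, n_init=10, random_state=rng).fit_predict(X[b])
            # Map the labels back to the shared points before comparing.
            pos_a = {i: j for j, i in enumerate(a)}
            pos_b = {i: j for j, i in enumerate(b)}
            la_common = [la[pos_a[i]] for i in common]
            lb_common = [lb[pos_b[i]] for i in common]
            scores.append(1.0 - adjusted_rand_score(la_common, lb_common))
        return np.mean(scores)

Choosing k as the minimizer of instability(X, k) over a range of candidate values then matches the rule on this slide.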

Domain knowledge

● Do the clusters match natural categories?
– Check with human expertise

Domain knowledge: enrichment analysis

● Example: ontology
Entities may be grouped, related within a hierarchy, and subdivided according to similarities and differences.
Built by human experts.
● E.g.: the Gene Ontology, http://geneontology.org/
– Describes genes with a common vocabulary, organized in categories
E.g. cellular process > cell death > programmed cell death > apoptotic process > execution phase of apoptosis


Ontology enrichment analysis

● Enrichment analysis:
Are there more data points from ontology category G in cluster C than expected by chance?
● TANGO [Tanay et al., 2003]
– Assume the data points are sampled from a hypergeometric distribution.
– The probability for the intersection of G and C to contain more than t points is the sum, over i, of the probabilities of getting i points from G when drawing |C| points from a total of n samples.
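A standard form of this hypergeometric tail probability (the slide's exact boundary convention may differ) is:

    \[ P\big( |G \cap C| \geq t \big)
       = \sum_{i \geq t} \frac{\binom{|G|}{i}\,\binom{n - |G|}{|C| - i}}{\binom{n}{|C|}} \]

where each term is the probability of getting i points from G when drawing |C| points from a total of n samples.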

Hierarchical clustering

Group data over a variety of possible scales, in a multi-level hierarchy.

Construction

● Agglomerative approach (bottom-up)
Start with each element in its own cluster.
Iteratively join neighboring clusters.
● Divisive approach (top-down)
Start with all elements in the same cluster.
Iteratively separate into smaller clusters.


Dendrogram

● The results of a hierarchical clustering algorithm are presented in a dendrogram.
● Branch length = cluster distance.
How many clusters do I have? It depends on where the tree is cut.
[Figure: example dendrogram over four points, with cuts yielding different numbers of clusters.]

Linkage

How do we decide how to connect/split two clusters?
● Single linkage: distance between the two closest points, one in each cluster.
● Complete linkage: distance between the two farthest points, one in each cluster.

Linkage

How do we decide how to connect/split two clusters?
● Average linkage, or UPGMA (Unweighted Pair Group Method with Arithmetic mean): average distance over all pairs of points, one in each cluster.
● Centroid linkage, or UPGMC (Unweighted Pair Group Method using Centroids): distance between the centroids of the two clusters.
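A minimal sketch of agglomerative clustering with these linkage criteria, using SciPy on toy data (an illustration, not the course's code):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy data (illustration only): two well-separated groups of 2D points.
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(10, 2), rng.randn(10, 2) + 5])

    for method in ["single", "complete", "average", "centroid", "ward"]:
        Z = linkage(X, method=method)                     # (n-1) x 4 merge history
        labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
        print(method, labels)

scipy.cluster.hierarchy.dendrogram(Z) would draw the corresponding dendrogram.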

Exercise: draw single linkage, complete linkage, UPGMC.

Example: gene expression clustering

Breast cancer survival signature [Bergamashi et al. 2011]
[Figure: gene × patient expression heatmap, with groups labelled 1 and 2 along both axes.]

Linkage

How do we decide how to connect/split two clusters?
● Ward's criterion
Join the two clusters whose merge minimizes the increase in within-cluster variance.


Hierarchical clustering

● Advantages
– No need to pre-define the number of clusters
– Interpretability
● Drawbacks
– Computational complexity. E.g. single/complete linkage (naive): at least O(pn²) to compute all pairwise distances.
– Must decide at which level of the hierarchy to split
– Lack of robustness (unstable)

K-means

K-means clustering

● Minimize the intra-cluster variance.
● What will this partition of the space look like? A Voronoi tessellation: each point of the space is assigned to the nearest centroid.
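The intra-cluster variance being minimized is the standard k-means objective:

    \[ \min_{C_1, \dots, C_K} \; \sum_{k=1}^{K} \sum_{x \in C_k} \| x - \mu_k \|^2,
       \qquad \mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x \]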

Lloyd's algorithm

● The k-means objective cannot be easily optimized (finding the exact optimum is NP-hard in general).
● We adopt a greedy strategy:
– Partition the data into K clusters at random.
– Compute the centroid of each cluster.
– Assign each point to the cluster whose centroid is closest to it.
– Repeat the last two steps until cluster membership converges.
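A minimal NumPy sketch of Lloyd's algorithm, assuming Euclidean distance (an illustration, not the reference implementation used in the course):

    import numpy as np

    def lloyd_kmeans(X, k, n_iter=100, seed=0):
        """Minimal sketch of Lloyd's algorithm: X is (n, p); returns (labels, centroids)."""
        rng = np.random.RandomState(seed)
        # Random initial partition of the data into k clusters.
        labels = rng.randint(0, k, size=X.shape[0])
        for _ in range(n_iter):
            # Centroid of each cluster (random point if a cluster becomes empty).
            centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j)
                else X[rng.randint(X.shape[0])]
                for j in range(k)
            ])
            # Re-assign each point to the nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            if np.array_equal(new_labels, labels):   # membership converged
                break
            labels = new_labels
        return labels, centroids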

demo



K-means

● Advantages
– Computational time is linear: O(t × k × n × p), i.e. at each of the t iterations, compute k × n distances in p dimensions. (t can be small if there's indeed a cluster structure in the data.)
– Easily implementable
● Drawbacks
– Need to set K ahead of time
– Sensitive to noise and outliers
– Stochastic (different solutions with each run)
– The clusters are forced to have spherical shapes

K-means variants

● K-means++
– Seeding algorithm to initialize clusters with centroids “spread-out” throughout the data.
– Deterministic
● K-medoids
● Kernel k-means
Find clusters in feature space.
[Figure: the same data set clustered by k-means and by kernel k-means.]
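For instance, scikit-learn's KMeans uses k-means++ seeding by default (a usage sketch on toy data, not the course's code):

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy data (illustration only).
    X = np.random.RandomState(0).randn(100, 2)
    km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
    print(km.labels_[:10])       # cluster assignment of the first 10 points
    print(km.cluster_centers_)   # the 3 centroids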

More clustering approaches

● Soft clustering
Each point gets a probability of belonging to each cluster.
● Disjunctive clustering
Each point can belong to multiple clusters.
● Density-based clustering
Look for dense regions of the space. E.g. DBSCAN.
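For example, DBSCAN is available in scikit-learn (a usage sketch; the eps and min_samples values below are arbitrary choices for this toy data):

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Toy data (illustration only): two dense groups of 2D points.
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 5])
    labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
    print(np.unique(labels))   # cluster ids; -1 marks points labelled as noise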

Summary

● Clustering: unsupervised approach to group similar data points together.
● Evaluate clustering algorithms based on
– the shape of the clusters
– the stability of the results
– the consistency with domain knowledge.
● Hierarchical clustering
– top-down / bottom-up
– various linkage functions.
● k-means clustering.

Exam: Fri, Dec 16, 8am–11am

● No documents, no calculators, no computers.
● Theoretical, technical, and practical questions.
● Exam from last year is online.
● Short answers!
● How to study
– Homework + last year's exam
– Labs
– Answer the questions on the slides.
● Formulas
– To know: Bayes' rule, how to compute derivatives.
– Everything else will be given. Interpretation is key.

Challenge project

● Detailed instructions on the course website: http://cazencott.info/dotclear/public/lectures/ma2823_2016/kaggle-project.pdf
● Deadline for submissions & for the report: Fri, Dec 16 at 23:59
● Report: practical instructions
– PDF document
    ● No more than 2 pages
    ● Lastname1_Lastname2.pdf
– Starts with
    ● Full names
    ● Kaggle user names
    ● Kaggle team names.
– By email to all three of us!