Clustering Instructor: Max Welling ICS 178 Machine Learning & Data Mining


TRANSCRIPT

Page 1: Clustering Instructor: Max Welling ICS 178 Machine Learning & Data Mining

Clustering

Instructor: Max Welling

ICS 178 Machine Learning & Data Mining

Page 2: Unsupervised Learning

Unsupervised Learning

• In supervised learning we were given attributes & targets (e.g. class labels). In unsupervised learning we are only given attributes.

• Our task is to discover structure in the data.

• Example: the data may be structured in clusters:

Is this a good clustering?

Page 3: Why Discover Structure?

Why Discover Structure?

• Often, the result of an unsupervised learning algorithm is a new representation for the same data. This new representation should be more meaningful and could be used for further processing (e.g. classification).

• Clustering: The new representation is now given by the label of a cluster to which the data-point belongs. This tells us which data-cases are similar to each other.

• The new representation is smaller and hence more convenient computationally.

• Clustering: Each data-case is now encoded by its cluster label. This is a lot cheaper than its attribute values.

• Collaborative filtering (CF): we can group the users into user-communities and/or the movies into movie genres. If we need to predict something, we simply pick the average rating in the group.

Page 4: Clustering: K-means

Clustering: K-means

• We iterate two operations:

1. Update the assignment of data-cases to clusters.
2. Update the locations of the clusters.

• Denote by $z_i \in \{1,2,3,\dots,K\}$ the assignment of data-case "i" to cluster "c".
• Denote by $\mu_c \in \mathbb{R}^d$ the position of cluster "c" in the d-dimensional space.
• Denote by $x_i \in \mathbb{R}^d$ the location of data-case "i".

• Then iterate until convergence:

1. For each data-case, compute the distances to each cluster and pick the closest one:
$z_i = \arg\min_c \| x_i - \mu_c \| \quad \forall i$

2. For each cluster location, compute the mean location of all data-cases assigned to it:
$\mu_c = \frac{1}{N_c} \sum_{i \in S_c} x_i$

where $S_c$ is the set of data-cases assigned to cluster "c" and $N_c$ is the number of data-cases in cluster "c".
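A minimal NumPy sketch of these two steps (the function name kmeans, the random seed, and the convergence check are illustrative choices, not from the slides):

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means on an (N, d) array X with K clusters; returns assignments z and means mu."""
    rng = np.random.default_rng(seed)
    # Place the cluster locations on K randomly chosen data-cases (the initialization the slides suggest).
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Step 1: assign each data-case to the closest cluster.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # shape (N, K)
        z = dists.argmin(axis=1)
        # Step 2: move each cluster to the mean of the data-cases assigned to it
        # (keep the old location if a cluster happens to be empty).
        new_mu = np.array([X[z == c].mean(axis=0) if np.any(z == c) else mu[c] for c in range(K)])
        if np.allclose(new_mu, mu):   # converged: locations stopped moving
            break
        mu = new_mu
    return z, mu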

Page 5: K-means

K-means

• Cost function: $C = \sum_{i=1}^{N} \| x_i - \mu_{z_i} \|^2$

• Each step in k-means decreases this cost function.

• Often initialization is very important since there are very many local minima in C. A relatively good initialization: place the cluster locations on K randomly chosen data-cases.

• How to choose K? Add a complexity term, $C' = C + \frac{1}{2}\,[\#\text{parameters}]\,\log(N)$, and minimize also over K.
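A hedged sketch of both quantities, reusing the kmeans function from the previous sketch; counting the parameters as K*d (one d-dimensional mean per cluster) is an assumption, since the slide does not spell the count out:

import numpy as np

def kmeans_cost(X, z, mu):
    """C = sum_i ||x_i - mu_{z_i}||^2, the quantity that every K-means step decreases."""
    return float(np.sum(np.linalg.norm(X - mu[z], axis=1) ** 2))

def penalized_cost(X, z, mu):
    """C' = C + 0.5 * (#parameters) * log(N), with #parameters taken to be K * d."""
    N, d = X.shape
    K = len(mu)
    return kmeans_cost(X, z, mu) + 0.5 * K * d * np.log(N)

# Choosing K: run K-means for several K and keep the one with the smallest penalized cost, e.g.
# best_K = min(range(1, 11), key=lambda K: penalized_cost(X, *kmeans(X, K)))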

Page 6: Vector Quantization

Vector Quantization

• K-means divides the space up into a Voronoi tessellation.

• Every point on a tile is summarized by the code-book vector "+". This clearly allows for data compression!
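A small illustrative sketch of the compression idea, assuming mu comes from a K-means run as above (the function names are made up for this example):

import numpy as np

def vq_encode(X, mu):
    """Replace each data-case by the index of its nearest code-book vector (its tile in the tessellation)."""
    return np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2).argmin(axis=1)

def vq_decode(codes, mu):
    """Reconstruct every data-case as the code-book vector of its tile (lossy decompression)."""
    return mu[codes]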

Page 7: Mixtures of Gaussians

Mixtures of Gaussians

• K-means assigns each data-case to exactly 1 cluster. But what if clusters are overlapping? Maybe we are uncertain about which cluster it really belongs to.

• The mixture-of-Gaussians algorithm assigns data-cases to clusters with a certain probability.

Page 8: MoG Clustering

MoG Clustering

$\mathcal{N}[x;\mu,\Sigma] = \frac{1}{(2\pi)^{d/2}\,\sqrt{\det(\Sigma)}}\,\exp\!\left[-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right]$

The covariance $\Sigma$ determines the shape of these contours.

• Idea: fit these Gaussian densities to the data, one per cluster.
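A direct transcription of this density into NumPy (the function name is illustrative; for numerical stability one would normally work with log-densities, which this sketch skips):

import numpy as np

def gaussian_density(x, mu, Sigma):
    """Evaluate N[x; mu, Sigma] for a single d-dimensional point x."""
    d = len(mu)
    diff = x - mu
    normalizer = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / normalizer)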

Page 9: EM Algorithm: E-step

EM Algorithm: E-step

$r_{ic} = \frac{\pi_c\,\mathcal{N}[x_i;\mu_c,\Sigma_c]}{\sum_{c'=1}^{K}\pi_{c'}\,\mathcal{N}[x_i;\mu_{c'},\Sigma_{c'}]}$

• $r_{ic}$ is the probability that data-case "i" belongs to cluster "c".

• $\pi_c$ is the a priori probability of being assigned to cluster "c".

• Note that if the Gaussian has high probability on data-case "i" (i.e. the bell-shape is on top of the data-case) then it claims high responsibility for this data-case.

• The denominator is just there to normalize all responsibilities to 1: $\sum_{c=1}^{K} r_{ic} = 1 \quad \forall i$
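A sketch of this E-step using SciPy's multivariate normal pdf in place of the density on the previous slide (the array shapes and the function name are assumptions for this example):

import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mu, Sigma):
    """Responsibilities r[i, c] = pi_c N[x_i; mu_c, Sigma_c] / sum_{c'} pi_{c'} N[x_i; mu_{c'}, Sigma_{c'}]."""
    N, K = len(X), len(pi)
    r = np.zeros((N, K))
    for c in range(K):
        r[:, c] = pi[c] * multivariate_normal.pdf(X, mean=mu[c], cov=Sigma[c])
    r /= r.sum(axis=1, keepdims=True)   # normalize so every row sums to 1
    return r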

Page 10: EM Algorithm: M-Step

EM Algorithm: M-Step

$N_c = \sum_i r_{ic}$ : the total responsibility claimed by cluster "c".

$\pi_c = \frac{N_c}{N}$ : the expected fraction of data-cases assigned to this cluster.

$\mu_c = \frac{1}{N_c}\sum_i r_{ic}\, x_i$ : the weighted sample mean, where every data-case is weighted according to the probability that it belongs to that cluster.

$\Sigma_c = \frac{1}{N_c}\sum_i r_{ic}\,(x_i-\mu_c)(x_i-\mu_c)^T$ : the weighted sample covariance.
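A matching M-step sketch (same array conventions as the E-step sketch above):

import numpy as np

def m_step(X, r):
    """Update pi, mu, Sigma from the responsibilities r (an N x K array)."""
    N, d = X.shape
    Nc = r.sum(axis=0)                        # total responsibility claimed by each cluster
    pi = Nc / N                               # expected fraction of data-cases per cluster
    mu = (r.T @ X) / Nc[:, None]              # weighted sample means
    Sigma = np.zeros((len(Nc), d, d))
    for c in range(len(Nc)):
        diff = X - mu[c]                      # deviations from this cluster's mean
        Sigma[c] = (r[:, c, None] * diff).T @ diff / Nc[c]   # weighted sample covariance
    return pi, mu, Sigma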

Page 11: EM-MoG

EM-MoG

• EM comes from “expectation maximization”. We won’t go through the derivation.

• If we are forced to decide, we should assign a data-case to the cluster which claims highest responsibility.

• For a new data-case, we should compute responsibilities as in the E-step and pick the cluster with the largest responsibility.

• E and M steps should be iterated until convergence (which is guaranteed).

• Every step increases the following objective function (which is the total log-probability of the data under the model we are learning):

$L = \sum_{i=1}^{N} \log \sum_{c=1}^{K} \pi_c\,\mathcal{N}[x_i;\mu_c,\Sigma_c]$
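A sketch of this objective, plus a commented outline of the full EM loop built from the e_step and m_step sketches above; the stopping tolerance is an arbitrary choice:

import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, pi, mu, Sigma):
    """L = sum_i log sum_c pi_c N[x_i; mu_c, Sigma_c]; every EM step should increase this."""
    p = sum(pi[c] * multivariate_normal.pdf(X, mean=mu[c], cov=Sigma[c]) for c in range(len(pi)))
    return float(np.sum(np.log(p)))

# A full run alternates the two steps until L stops improving, e.g.:
# old_L = -np.inf
# while True:
#     r = e_step(X, pi, mu, Sigma)
#     pi, mu, Sigma = m_step(X, r)
#     L = log_likelihood(X, pi, mu, Sigma)
#     if L - old_L < 1e-6:
#         break
#     old_L = L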

Page 12: Agglomerative Hierarchical Clustering

Agglomerative Hierarchical Clustering

• Define a “distance” between clusters (later).

• Initially, every data-case is its own cluster.

• At each iteration, compute the distances between all existing clusters (you can store distances and avoid their re-computation).

• Merge the closest clusters into a single cluster.

• Update your "dendrogram" (a minimal sketch of this procedure follows below).

Every data-case is a cluster
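A naive sketch of this procedure (it recomputes all cluster distances at every iteration, whereas the slide notes you can store them; the function names and the stopping parameter K are illustrative):

import numpy as np

def agglomerative(X, dist, K=1):
    """Bottom-up clustering: start with one cluster per data-case and
    repeatedly merge the two closest clusters until only K remain.
    `dist` maps two arrays of points to a cluster-to-cluster distance."""
    clusters = [[i] for i in range(len(X))]       # every data-case is its own cluster
    merges = []                                   # record of merges, for drawing a dendrogram
    while len(clusters) > K:
        # Find the pair of clusters with the smallest distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist(X[clusters[a]], X[clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]   # merge cluster b into cluster a
        del clusters[b]
    return clusters, merges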

Page 13: Iteration 1

Iteration 1

Page 14: Iteration 2

Iteration 2

Page 15: Iteration 3

Iteration 3

• This way you build a hierarchy.

• Complexity: order $N^2$ (why?)

Page 16: Dendrogram

Dendrogram

Page 17: Distances

Distances

$D_{\min}(C_i,C_j) = \min_{x \in C_i,\, x' \in C_j} \| x - x' \|$  (produces the minimal spanning tree)

$D_{\max}(C_i,C_j) = \max_{x \in C_i,\, x' \in C_j} \| x - x' \|$  (avoids elongated clusters)

$D_{\text{avg}}(C_i,C_j) = \frac{1}{N_i N_j} \sum_{x \in C_i} \sum_{x' \in C_j} \| x - x' \|$

$D_{\text{mean}}(C_i,C_j) = \| \mu_i - \mu_j \|$

Page 18: Gene Expression Data (Micro-array Data)

Gene Expression Data (Micro-array Data)

• The expression level of genes is tested under different experimental conditions.

• We would like to find the genes which co-express in a subset of conditions.

• Both genes and conditions are clustered and shown as dendrograms.

Page 19: Exercise I

Exercise I

Imagine I have run a clustering algorithm on some data describing 3 attributes of cars: height, weight, length. I have found two clusters. An expert comes by and tells you that class 1 is really Ferraris while class 2 is Hummers.

• A new data-case (car) is presented, i.e. you get to see its height, weight, and length. Describe how you can use the output of your clustering, including the information obtained from the expert, to classify the new car as a Ferrari or a Hummer. Be very precise: use an equation or pseudo-code to describe what to do.

• You add the new car to the dataset and run K-means starting at the converged assignments and cluster means obtained before. Is it possible that the assignments of the old data change due to the addition of the new data-case?

Page 20: Exercise II

Exercise II

• We classify data according to the 3-nearest-neighbors (3-NN) rule. Explain in detail how this works.

• Which decision surface do you think is smoother: the one for 1-NN or the one for 100-NN? Explain.

• Is k-NN a parametric or a non-parametric method? Give an important property of non-parametric classification methods.

• We will do linear regression on data of the form (Xn, Yn) where Xn and Yn are real values: $Y_n = A X_n + b + \nu_n$, where A, b are parameters and $\nu_n$ is the noise variable.

• Provide the equation for the total error over the data-items.

• We want to minimize the error. With respect to what?

• You are given a new attribute Xnew. What would you predict for Ynew?