2001 - K-Means and Hierarchical Clustering (Slides)


TRANSCRIPT


Nov 16th, 2001. Copyright © 2001, Andrew W. Moore

K-means and Hierarchical Clustering

    Andrew W. Moore Associate Professor

School of Computer Science, Carnegie Mellon University

www.cs.cmu.edu/~awm
[email protected]

    412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 2

Some Data

This could easily be modeled by a Gaussian Mixture (with 5 components)

But let's look at a satisfying, friendly and infinitely popular alternative...


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 5

Suppose you transmit the coordinates of points drawn randomly from this dataset. You can install decoding software at the receiver. You're only allowed to send two bits per point. It'll have to be a lossy transmission.

Loss = Sum Squared Error between decoded coords and original coords. What encoder/decoder will lose the least information?

    Idea Two

[Figure: the data overlaid with a 2×2 grid; cells labeled 00, 01, 10, 11]

Break into a grid, decode each bit-pair as the centroid of all data in that grid-cell
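To make the grid codebook concrete, here is a minimal Python sketch; the function name and the bounding-box-midpoint split are illustrative assumptions, not from the slides.

```python
import numpy as np

def grid_codebook(points):
    """Sketch of Idea Two: encode each 2-D point as a 2-bit cell index
    (00, 01, 10, 11) of a 2x2 grid, decode an index as the centroid of
    the points in that cell, and report the Sum Squared Error loss."""
    mid = (points.min(axis=0) + points.max(axis=0)) / 2.0  # assumed grid split
    codes = (points[:, 0] > mid[0]).astype(int) * 2 + (points[:, 1] > mid[1]).astype(int)
    centroids = np.array([points[codes == c].mean(axis=0) if np.any(codes == c) else mid
                          for c in range(4)])
    decoded = centroids[codes]              # receiver's reconstruction
    loss = ((points - decoded) ** 2).sum()  # Sum Squared Error
    return codes, decoded, loss
```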

Any Further Ideas?

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 6

K-means
1. Ask user how many clusters they'd like. (e.g. k=5)


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 7

K-means
1. Ask user how many clusters they'd like. (e.g. k=5)
2. Randomly guess k cluster Center locations

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 8

K-means
1. Ask user how many clusters they'd like. (e.g. k=5)
2. Randomly guess k cluster Center locations
3. Each datapoint finds out which Center it's closest to. (Thus each Center owns a set of datapoints)


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 9

K-means
1. Ask user how many clusters they'd like. (e.g. k=5)
2. Randomly guess k cluster Center locations
3. Each datapoint finds out which Center it's closest to.
4. Each Center finds the centroid of the points it owns

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 10

K-means
1. Ask user how many clusters they'd like. (e.g. k=5)
2. Randomly guess k cluster Center locations
3. Each datapoint finds out which Center it's closest to.
4. Each Center finds the centroid of the points it owns...
5. ...and jumps there
6. Repeat until terminated!
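Putting the six steps together, here is a minimal numpy sketch of the loop. The function name and the termination test (stop when assignments no longer change) are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0)):
    """K-means as on the slide: guess k Centers, then alternate
    'each point finds its nearest Center' with 'each Center jumps
    to the centroid of the points it owns' until nothing changes."""
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 2
    owner = None
    while True:
        # Step 3: squared distance from every point to every Center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_owner = d2.argmin(axis=1)
        if owner is not None and np.array_equal(new_owner, owner):
            return centers, owner          # step 6: terminated
        owner = new_owner
        # Steps 4-5: move each Center to the centroid of its points.
        for j in range(k):
            if np.any(owner == j):
                centers[j] = X[owner == j].mean(axis=0)
```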


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 11

    K-means

Start. Advance apologies: in Black and White this example will deteriorate.

Example generated by Dan Pelleg's super-duper fast K-means system:

Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999 (KDD99) (available on www.autonlab.org/pap.html)

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 12

K-means continues...


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 13

K-means continues...

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 14

K-means continues...


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 15

K-means continues...

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 16

K-means continues...


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 17

K-means continues...

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 18

K-means continues...


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 19

K-means continues...

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 20

K-means terminates


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 21

K-means Questions
• What is it trying to optimize?
• Are we sure it will terminate?
• Are we sure it will find an optimal clustering?
• How should we start it?
• How could we automatically choose the number of centers?

...we'll deal with these questions over the next few slides

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 22

Distortion

Given...
• an encoder function: $\text{ENCODE}: \Re^m \to [1..k]$
• a decoder function: $\text{DECODE}: [1..k] \to \Re^m$

Define...

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \text{DECODE}[\text{ENCODE}(\mathbf{x}_i)] \right)^2$$


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 23

Distortion

Given...
• an encoder function: $\text{ENCODE}: \Re^m \to [1..k]$
• a decoder function: $\text{DECODE}: [1..k] \to \Re^m$

Define...

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \text{DECODE}[\text{ENCODE}(\mathbf{x}_i)] \right)^2$$

We may as well write

$$\text{DECODE}[j] = \mathbf{c}_j$$

so

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2$$
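As a sanity check, the distortion of a candidate set of centers is a few lines of numpy. This hypothetical helper assumes the encoder is "nearest center", which the next slides show is the optimal choice.

```python
import numpy as np

def distortion(X, centers):
    """Sum over all R records of (x_i - c_ENCODE(x_i))^2, taking
    ENCODE(x_i) to be the index of x_i's nearest center."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()
```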

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 24

The Minimal Distortion

What properties must centers c_1, c_2, ..., c_k have when distortion is minimized?

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2$$


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 25

The Minimal Distortion (1)

What properties must centers c_1, c_2, ..., c_k have when distortion is minimized?

(1) x_i must be encoded by its nearest center

...why?

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2$$

$$\mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} = \arg\min_{\mathbf{c} \in \{\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k\}} \left( \mathbf{x}_i - \mathbf{c} \right)^2$$

...at the minimal distortion

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 26

The Minimal Distortion (1)

What properties must centers c_1, c_2, ..., c_k have when distortion is minimized?

(1) x_i must be encoded by its nearest center

...why?

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2$$

$$\mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} = \arg\min_{\mathbf{c} \in \{\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k\}} \left( \mathbf{x}_i - \mathbf{c} \right)^2$$

...at the minimal distortion

Otherwise distortion could be reduced by replacing ENCODE[x_i] by the nearest center.


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 27

The Minimal Distortion (2)

What properties must centers c_1, c_2, ..., c_k have when distortion is minimized?

(2) The partial derivative of Distortion with respect to each center location must be zero.

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2$$

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 28

(2) The partial derivative of Distortion with respect to each center location must be zero.

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2 = \sum_{j=1}^{k} \; \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \left( \mathbf{x}_i - \mathbf{c}_j \right)^2$$

$$\frac{\partial\,\text{Distortion}}{\partial \mathbf{c}_j} = \frac{\partial}{\partial \mathbf{c}_j} \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \left( \mathbf{x}_i - \mathbf{c}_j \right)^2 = -2 \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \left( \mathbf{x}_i - \mathbf{c}_j \right) = 0 \quad \text{(for a minimum)}$$

OwnedBy(c_j) = the set of records owned by Center c_j.


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 29

(2) The partial derivative of Distortion with respect to each center location must be zero.

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2 = \sum_{j=1}^{k} \; \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \left( \mathbf{x}_i - \mathbf{c}_j \right)^2$$

$$\frac{\partial\,\text{Distortion}}{\partial \mathbf{c}_j} = \frac{\partial}{\partial \mathbf{c}_j} \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \left( \mathbf{x}_i - \mathbf{c}_j \right)^2 = -2 \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \left( \mathbf{x}_i - \mathbf{c}_j \right) = 0 \quad \text{(for a minimum)}$$

Thus, at a minimum:

$$\mathbf{c}_j = \frac{1}{|\text{OwnedBy}(\mathbf{c}_j)|} \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \mathbf{x}_i$$

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 30

At the minimum distortion

What properties must centers c_1, c_2, ..., c_k have when distortion is minimized?

(1) x_i must be encoded by its nearest center
(2) Each Center must be at the centroid of points it owns.

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2$$


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 31

Improving a suboptimal configuration

What properties can be changed for centers c_1, c_2, ..., c_k when distortion is not minimized?

(1) Change encoding so that x_i is encoded by its nearest center.
(2) Set each Center to the centroid of points it owns.

There's no point applying either operation twice in succession. But it can be profitable to alternate.

...And that's K-means!

Easy to prove this procedure will terminate in a state at which neither (1) nor (2) change the configuration. Why?

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2$$

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 32

Improving a suboptimal configuration

What properties can be changed for centers c_1, c_2, ..., c_k when distortion is not minimized?

(1) Change encoding so that x_i is encoded by its nearest center.
(2) Set each Center to the centroid of points it owns.

There's no point applying either operation twice in succession. But it can be profitable to alternate.

...And that's K-means!

Easy to prove this procedure will terminate in a state at which neither (1) nor (2) change the configuration. Why?

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2$$

There are only a finite number of ways of partitioning R records into k groups. So there are only a finite number of possible configurations in which all Centers are the centroids of the points they own. If the configuration changes on an iteration, it must have improved the distortion. So each time the configuration changes it must go to a configuration it's never been to before. So if it tried to go on forever, it would eventually run out of configurations.


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 33

Will we find the optimal configuration?

Not necessarily. Can you invent a configuration that has converged, but does not have the minimum distortion?

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 34

Will we find the optimal configuration?

Not necessarily. Can you invent a configuration that has converged, but does not have the minimum distortion? (Hint: try a fiendish k=3 configuration here)


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 35

Will we find the optimal configuration?

Not necessarily. Can you invent a configuration that has converged, but does not have the minimum distortion? (Hint: try a fiendish k=3 configuration here)

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 36

Trying to find good optima
• Idea 1: Be careful about where you start
• Idea 2: Do many runs of k-means, each from a different random start configuration
• Many other ideas floating around.


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 37

Trying to find good optima
• Idea 1: Be careful about where you start
• Idea 2: Do many runs of k-means, each from a different random start configuration
• Many other ideas floating around.

Neat trick:
Place first center on top of randomly chosen datapoint.
Place second center on datapoint that's as far away as possible from first center.
:
Place jth center on datapoint that's as far away as possible from the closest of Centers 1 through j-1.
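A minimal sketch of this trick (sometimes called farthest-first traversal; the function name is illustrative):

```python
import numpy as np

def farthest_first_centers(X, k, rng=np.random.default_rng(0)):
    """The slide's neat trick: first center on a random datapoint, each
    later center on the datapoint farthest from its closest chosen center."""
    X = np.asarray(X, dtype=float)
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Distance from every point to the closest already-chosen center.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    return np.array(centers)
```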

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 38

Choosing the number of Centers
• A difficult problem
• Most common approach is to try to find the solution that minimizes the Schwarz Criterion (also related to the BIC)

$$\text{Distortion} + \lambda\,(\#\text{parameters}) \log R = \text{Distortion} + \lambda\, m k \log R$$

where m = #dimensions, k = #Centers, R = #Records
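A hedged sketch of this model-selection loop, reusing the kmeans and distortion sketches above; the weight lam stands in for the criterion's coefficient λ and its value here is an assumption.

```python
import numpy as np

def choose_k(X, k_max, lam=1.0):
    """Pick the k minimizing Distortion + lam * m * k * log R,
    as on the slide. `lam` is an assumed weight."""
    R, m = X.shape
    best_k, best_score = None, np.inf
    for k in range(1, k_max + 1):
        centers, _ = kmeans(X, k)   # sketch defined earlier
        score = distortion(X, centers) + lam * m * k * np.log(R)
        if score < best_score:
            best_k, best_score = k, score
    return best_k
```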


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 39

Common uses of K-means
• Often used as an exploratory data analysis tool
• In one dimension, a good way to quantize real-valued variables into k non-uniform buckets
• Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization)
• Also used for choosing color palettes on old-fashioned graphical display devices!
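For instance, the one-dimensional bucketing use is a couple of lines with the kmeans sketch above; the data here is made up purely for illustration.

```python
import numpy as np

# Illustrative only: quantize a skewed real-valued variable into
# k = 4 non-uniform buckets by running k-means in one dimension.
rng = np.random.default_rng(0)
values = rng.exponential(scale=2.0, size=1000).reshape(-1, 1)
centers, owner = kmeans(values, k=4)   # sketch defined earlier
print(np.sort(centers.ravel()))        # the k bucket representatives
```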

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 40

Single Linkage Hierarchical Clustering
1. Say "Every point is its own cluster"


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 41

Single Linkage Hierarchical Clustering
1. Say "Every point is its own cluster"
2. Find most similar pair of clusters

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 42

Single Linkage Hierarchical Clustering
1. Say "Every point is its own cluster"
2. Find most similar pair of clusters
3. Merge it into a parent cluster


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 43

Single Linkage Hierarchical Clustering
1. Say "Every point is its own cluster"
2. Find most similar pair of clusters
3. Merge it into a parent cluster
4. Repeat

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 44

Single Linkage Hierarchical Clustering
1. Say "Every point is its own cluster"
2. Find most similar pair of clusters
3. Merge it into a parent cluster
4. Repeat


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 45

Single Linkage Hierarchical Clustering
1. Say "Every point is its own cluster"
2. Find most similar pair of clusters
3. Merge it into a parent cluster
4. Repeat... until you've merged the whole dataset into one cluster

You're left with a nice dendrogram, or taxonomy, or hierarchy of datapoints (not shown here)

How do we define similarity between clusters?
• Minimum distance between points in clusters (in which case we're simply doing Euclidean Minimum Spanning Trees)
• Maximum distance between points in clusters
• Average distance between points in clusters
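A minimal sketch using scipy's standard agglomerative-clustering routines, assuming a small 2-D array X of toy data: method='single' is the minimum-distance rule above, and fcluster cuts the dendrogram into k groups, which is the "cut the (k-1) longest links" operation on the next slide.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).random((20, 2))     # toy 2-D data
Z = linkage(X, method='single')  # min-distance rule; 'complete' and
                                 # 'average' give the other two rules
labels = fcluster(Z, t=3, criterion='maxclust')  # cut into k = 3 groups
```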

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 46

Single Linkage Comments
• It's nice that you get a hierarchy instead of an amorphous collection of groups
• If you want k groups, just cut the (k-1) longest links
• There's no real statistical or information-theoretic foundation to this. Makes your lecturer feel a bit queasy.

Also known in the trade as Hierarchical Agglomerative Clustering (note the acronym)


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 47

What you should know
• All the details of K-means
• The theory behind K-means as an optimization algorithm
• How K-means can get stuck
• The outline of Hierarchical clustering
• Be able to contrast between which problems would be relatively well/poorly suited to K-means vs Gaussian Mixtures vs Hierarchical clustering