2001 - K-Means and Hierarchical Clustering (Slides)
TRANSCRIPT
8/8/2019 2001 - K-Means and Hierarchical Clustering (Slides)
Nov 16th, 2001. Copyright 2001, Andrew W. Moore

K-means and Hierarchical Clustering

Andrew W. Moore, Associate Professor
School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/ [email protected]
412-268-7599
Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 2
Some Data

This could easily be modeled by a Gaussian Mixture (with 5 components)

But let's look at a satisfying, friendly and infinitely popular alternative
Slide 5

Suppose you transmit the coordinates of points drawn randomly from this dataset. You can install decoding software at the receiver. You're only allowed to send two bits per point. It'll have to be a lossy transmission. Loss = Sum Squared Error between decoded coords and original coords. What encoder/decoder will lose the least information?

Idea Two: Break into a grid, decode each bit-pair (00, 01, 10, 11) as the centroid of all data in that grid-cell.

Any Further Ideas?
Slides 6-10

K-means
1. Ask user how many clusters they'd like. (e.g. k=5)
2. Randomly guess k cluster Center locations
3. Each datapoint finds out which Center it's closest to. (Thus each Center owns a set of datapoints)
4. Each Center finds the centroid of the points it owns...
5. ...and jumps there
6. Repeat until terminated!
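The six steps above translate almost line-for-line into code. Below is a minimal Python sketch (my own illustration, not Pelleg and Moore's accelerated system; the function name `kmeans` and all variable names are assumptions):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Steps 1-6 of the slide: guess k Centers, let points pick their
    nearest Center, move each Center to the centroid it owns, repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # step 2: random Center locations
    for _ in range(iters):
        owned = [[] for _ in range(k)]
        for p in points:                         # step 3: find the closest Center
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            owned[j].append(p)
        new_centers = []
        for j in range(k):                       # steps 4-5: jump to the centroid
            if owned[j]:
                new_centers.append(tuple(sum(c) / len(owned[j])
                                         for c in zip(*owned[j])))
            else:
                new_centers.append(centers[j])   # a Center owning nothing stays put
        if new_centers == centers:               # step 6: terminate when stable
            break
        centers = new_centers
    return centers
```

Each pass performs steps 3-5 once; the loop exits when no Center moves, which is exactly the termination condition examined later in the deck.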
Slides 11-20

K-means Start

Advance apologies: in Black and White this example will deteriorate.

Example generated by Dan Pelleg's super-duper fast K-means system: Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999 (KDD99) (available on www.autonlab.org/pap.html)

K-means continues... [Slides 12-19: a sequence of figures, not shown, in which the Centers and their ownership regions update at each iteration]

K-means terminates
Slide 21

K-means Questions
- What is it trying to optimize?
- Are we sure it will terminate?
- Are we sure it will find an optimal clustering?
- How should we start it?
- How could we automatically choose the number of centers?

...we'll deal with these questions over the next few slides
Slide 22

Distortion

Given...
an encoder function: $\text{ENCODE}: \Re^m \to [1..k]$
a decoder function: $\text{DECODE}: [1..k] \to \Re^m$

Define

$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \text{DECODE}[\text{ENCODE}(\mathbf{x}_i)] \right)^2$
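The definition above can be checked directly in code. A small sketch (names are my own) computing Distortion for a nearest-center encoder, which is the case the following slides analyze:

```python
def distortion(points, centers):
    """Sum over all points of the squared distance between x_i and
    DECODE[ENCODE(x_i)], where ENCODE maps x_i to its nearest center
    and DECODE returns that center's coordinates."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total
```

For example, `distortion([(0, 0), (2, 0)], [(0, 0), (2, 2)])` returns 4.0: the point (2, 0) is squared-distance 4 from either center, while (0, 0) coincides with a center.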
Slide 23

We may as well write $\mathbf{c}_j = \text{DECODE}[j]$

so $\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2$
Slide 24

The Minimal Distortion

What properties must centers $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k$ have when distortion is minimized?

$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2$
Slides 25-26

The Minimal Distortion (1)

What properties must centers $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k$ have when distortion is minimized?

(1) $\mathbf{x}_i$ must be encoded by its nearest center:

$\text{ENCODE}(\mathbf{x}_i) = \arg\min_{\mathbf{c}_j \in \{\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k\}} \left( \mathbf{x}_i - \mathbf{c}_j \right)^2$ ...at the minimal distortion

...why? Otherwise distortion could be reduced by replacing ENCODE[$\mathbf{x}_i$] by the nearest center.
Slides 27-29

The Minimal Distortion (2)

What properties must centers $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k$ have when distortion is minimized?

(2) The partial derivative of Distortion with respect to each center location must be zero.

OwnedBy($\mathbf{c}_j$) = the set of records owned by Center $\mathbf{c}_j$.

$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2 = \sum_{j=1}^{k} \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \left( \mathbf{x}_i - \mathbf{c}_j \right)^2$

$\frac{\partial \text{Distortion}}{\partial \mathbf{c}_j} = \frac{\partial}{\partial \mathbf{c}_j} \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \left( \mathbf{x}_i - \mathbf{c}_j \right)^2 = -2 \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \left( \mathbf{x}_i - \mathbf{c}_j \right) = 0$ (for a minimum)

Thus, at a minimum: $\mathbf{c}_j = \frac{1}{|\text{OwnedBy}(\mathbf{c}_j)|} \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \mathbf{x}_i$
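Property (2) can be verified numerically: nudging a Center away from the centroid of the points it owns strictly increases its contribution to the Distortion. A minimal sketch (all names are my own):

```python
def sse(points, c):
    # this Center's term of the Distortion: sum of squared distances to c
    return sum((x - c[0]) ** 2 + (y - c[1]) ** 2 for x, y in points)

owned = [(1.0, 2.0), (3.0, 5.0), (4.0, 1.0), (0.0, 0.0)]
centroid = (sum(x for x, _ in owned) / len(owned),
            sum(y for _, y in owned) / len(owned))
base = sse(owned, centroid)

# every perturbation of the centroid increases this cluster's distortion term
perturbed = [sse(owned, (centroid[0] + dx, centroid[1] + dy))
             for dx, dy in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.05, -0.05)]]
```

Here the centroid is (2.0, 2.0) with base term 24.0, and each perturbed value exceeds it, consistent with the zero-derivative condition.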
Slide 30

At the minimum distortion

What properties must centers $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k$ have when distortion is minimized?

(1) $\mathbf{x}_i$ must be encoded by its nearest center
(2) Each Center must be at the centroid of points it owns.
Slides 31-32

Improving a suboptimal configuration

What properties can be changed for centers $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k$ when distortion is not minimized?

(1) Change encoding so that $\mathbf{x}_i$ is encoded by its nearest center
(2) Set each Center to the centroid of points it owns.

There's no point applying either operation twice in succession. But it can be profitable to alternate. ...And that's K-means!

Easy to prove this procedure will terminate in a state at which neither (1) nor (2) change the configuration. Why?

There are only a finite number of ways of partitioning R records into k groups. So there are only a finite number of possible configurations in which all Centers are the centroids of the points they own. If the configuration changes on an iteration, it must have improved the distortion. So each time the configuration changes it must go to a configuration it's never been to before. So if it tried to go on forever, it would eventually run out of configurations.
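The monotonicity at the heart of this argument — each alternation of (1) and (2) can only lower the distortion — is easy to watch empirically. A sketch (function and variable names are my own):

```python
import random

def nearest(p, centers):
    # operation (1): index of the Center closest to p
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))

def distortion(points, centers):
    return sum(sum((a - b) ** 2 for a, b in zip(p, centers[nearest(p, centers)]))
               for p in points)

def kmeans_trace(points, centers, iters=20):
    """Run K-means, recording the distortion after every full alternation."""
    trace = [distortion(points, centers)]
    for _ in range(iters):
        owned = [[] for _ in centers]
        for p in points:
            owned[nearest(p, centers)].append(p)       # re-encode: operation (1)
        centers = [tuple(sum(coord) / len(pts) for coord in zip(*pts)) if pts else c
                   for pts, c in zip(owned, centers)]  # centroids: operation (2)
        trace.append(distortion(points, centers))
    return trace

rng = random.Random(1)
pts = [(rng.random(), rng.random()) for _ in range(60)]
trace = kmeans_trace(pts, [pts[0], pts[1], pts[2]])
```

The recorded sequence never increases, matching the finite-configurations argument above.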
Slides 33-35

Will we find the optimal configuration?
- Not necessarily.
- Can you invent a configuration that has converged, but does not have the minimum distortion? (Hint: try a fiendish k=3 configuration on the accompanying figure, not shown here)
Slides 36-37

Trying to find good optima
- Idea 1: Be careful about where you start
- Idea 2: Do many runs of k-means, each from a different random start configuration
- Many other ideas floating around.

Neat trick:
Place first center on top of randomly chosen datapoint.
Place second center on datapoint that's as far away as possible from first center.
:
Place j'th center on datapoint that's as far away as possible from the closest of Centers 1 through j-1.
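The neat trick above (a farthest-first traversal) is a few lines of code. In this sketch (names are my own) the first center is taken deterministically as `points[0]` rather than at random, to keep the example reproducible:

```python
def farthest_first_centers(points, k):
    """Place the first center on a datapoint, then each subsequent center on
    the datapoint farthest from the closest of the centers placed so far."""
    centers = [points[0]]        # deterministic first pick (the slide says random)
    while len(centers) < k:
        def dist_to_nearest(p):
            return min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        centers.append(max(points, key=dist_to_nearest))
    return centers
```

For example, on the 1-D points (0,0), (1,0), (10,0), (5,0) with k=3 it picks (0,0), then (10,0) (farthest from the first), then (5,0) (farthest from its closest existing Center).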
Slide 38

Choosing the number of Centers
- A difficult problem
- Most common approach is to try to find the solution that minimizes the Schwarz Criterion (also related to the BIC)

$\text{Distortion} + \lambda (\#\text{parameters}) \log R = \text{Distortion} + \lambda m k \log R$

where m = #dimensions, k = #Centers, R = #Records
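A sketch of how the criterion might be applied, given the distortion of the best K-means run for each candidate k. The weighting constant `lam` and the helper names are my assumptions; the slide's exact weighting symbol did not survive transcription:

```python
import math

def schwarz_score(distortion, m, k, R, lam=1.0):
    # Distortion + lam * (#parameters) * log R, with #parameters = m * k
    return distortion + lam * m * k * math.log(R)

def choose_k(distortion_by_k, m, R, lam=1.0):
    """Pick the k whose best K-means distortion minimizes the criterion."""
    return min(distortion_by_k,
               key=lambda k: schwarz_score(distortion_by_k[k], m, k, R, lam))

# toy numbers: distortion falls as k grows, but the penalty grows linearly in k
distortion_by_k = {1: 500.0, 2: 120.0, 3: 40.0, 4: 35.0, 5: 33.0}
best_k = choose_k(distortion_by_k, m=2, R=100)
```

With these toy numbers the k=3 solution wins: beyond that point the shrinking distortion no longer pays for the growing m·k·log R penalty.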
Slide 39

Common uses of K-means
- Often used as an exploratory data analysis tool
- In one-dimension, a good way to quantize real-valued variables into k non-uniform buckets
- Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization)
- Also used for choosing color palettes on old-fashioned graphical display devices!
Slides 40-45

Single Linkage Hierarchical Clustering
1. Say "Every point is its own cluster"
2. Find most similar pair of clusters
3. Merge it into a parent cluster
4. Repeat... until you've merged the whole dataset into one cluster

You're left with a nice dendrogram, or taxonomy, or hierarchy of datapoints (not shown here)

How do we define similarity between clusters?
- Minimum distance between points in clusters (in which case we're simply doing Euclidian Minimum Spanning Trees)
- Maximum distance between points in clusters
- Average distance between points in clusters
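The four steps above, with single (minimum-distance) linkage, can be sketched directly. This naive version recomputes all pairwise cluster distances before each merge, so it is far from efficient on large datasets; the names are my own:

```python
def single_linkage(points, target_clusters=1):
    """Agglomerative clustering: start with every point as its own cluster,
    repeatedly merge the pair with the smallest minimum inter-point distance."""
    clusters = [[p] for p in points]          # step 1: every point its own cluster
    while len(clusters) > target_clusters:
        best = None                           # step 2: find most similar pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(sum((a - b) ** 2 for a, b in zip(p, q))
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # step 3: merge into a parent
        del clusters[j]                           # step 4: repeat
    return clusters
```

Stopping early at `target_clusters=k` instead of 1 corresponds to cutting the hierarchy into k groups.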
Slide 46

Single Linkage Comments
- It's nice that you get a hierarchy instead of an amorphous collection of groups
- If you want k groups, just cut the (k-1) longest links
- There's no real statistical or information-theoretic foundation to this. Makes your lecturer feel a bit queasy.

Also known in the trade as Hierarchical Agglomerative Clustering (note the acronym)
Slide 47

What you should know
- All the details of K-means
- The theory behind K-means as an optimization algorithm
- How K-means can get stuck
- The outline of Hierarchical clustering
- Be able to contrast between which problems would be relatively well/poorly suited to K-means vs Gaussian Mixtures vs Hierarchical clustering