2001 - K-Means and Hierarchical Clustering (Slides)


TRANSCRIPT


Nov 16th, 2001. Copyright © 2001, Andrew W. Moore

K-means and Hierarchical Clustering

    Andrew W. Moore Associate Professor

School of Computer Science, Carnegie Mellon University

www.cs.cmu.edu/~awm
[email protected]

    412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 2

Some Data

This could easily be modeled by a Gaussian Mixture (with 5 components)

But let's look at a satisfying, friendly and infinitely popular alternative...


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 5

Suppose you transmit the coordinates of points drawn randomly from this dataset. You can install decoding software at the receiver. You're only allowed to send two bits per point. It'll have to be a lossy transmission.

Loss = Sum Squared Error between decoded coords and original coords. What encoder/decoder will lose the least information?

    Idea Two

[Figure: the data overlaid with a 2×2 grid; cells labeled 00, 01, 10, 11]

Break into a grid, decode each bit-pair as the centroid of all data in that grid-cell
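To make the grid codebook concrete, here is a minimal Python sketch; the function name and the bounding-box-midpoint split are illustrative assumptions, not from the slides.

```python
import numpy as np

def grid_codebook(points):
    """Sketch of Idea Two: encode each 2-D point as a 2-bit cell index
    (00, 01, 10, 11) of a 2x2 grid, decode an index as the centroid of
    the points in that cell, and report the Sum Squared Error loss."""
    mid = (points.min(axis=0) + points.max(axis=0)) / 2.0  # assumed grid split
    codes = (points[:, 0] > mid[0]).astype(int) * 2 + (points[:, 1] > mid[1]).astype(int)
    centroids = np.array([points[codes == c].mean(axis=0) if np.any(codes == c) else mid
                          for c in range(4)])
    decoded = centroids[codes]              # receiver's reconstruction
    loss = ((points - decoded) ** 2).sum()  # Sum Squared Error
    return codes, decoded, loss
```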

Any Further Ideas?

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 6

K-means
1. Ask user how many clusters they'd like. (e.g. k=5)


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 7

K-means
1. Ask user how many clusters they'd like. (e.g. k=5)
2. Randomly guess k cluster Center locations

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 8

K-means
1. Ask user how many clusters they'd like. (e.g. k=5)
2. Randomly guess k cluster Center locations
3. Each datapoint finds out which Center it's closest to. (Thus each Center owns a set of datapoints)


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 9

K-means
1. Ask user how many clusters they'd like. (e.g. k=5)
2. Randomly guess k cluster Center locations
3. Each datapoint finds out which Center it's closest to.
4. Each Center finds the centroid of the points it owns

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 10

K-means
1. Ask user how many clusters they'd like. (e.g. k=5)
2. Randomly guess k cluster Center locations
3. Each datapoint finds out which Center it's closest to.
4. Each Center finds the centroid of the points it owns...
5. ...and jumps there
6. Repeat until terminated!
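Putting the six steps together, here is a minimal numpy sketch of the loop. The function name and the termination test (stop when assignments no longer change) are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0)):
    """K-means as on the slide: guess k Centers, then alternate
    'each point finds its nearest Center' with 'each Center jumps
    to the centroid of the points it owns' until nothing changes."""
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 2
    owner = None
    while True:
        # Step 3: squared distance from every point to every Center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_owner = d2.argmin(axis=1)
        if owner is not None and np.array_equal(new_owner, owner):
            return centers, owner          # step 6: terminated
        owner = new_owner
        # Steps 4-5: move each Center to the centroid of its points.
        for j in range(k):
            if np.any(owner == j):
                centers[j] = X[owner == j].mean(axis=0)
```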


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 11

    K-means

Start. Advance apologies: in Black and White this example will deteriorate.

Example generated by Dan Pelleg's super-duper fast K-means system:

Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999 (KDD99) (available on www.autonlab.org/pap.html)

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 12

K-means continues...


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 13

K-means continues...

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 14

K-means continues...


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 15

K-means continues...

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 16

K-means continues...


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 17

K-means continues...

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 18

K-means continues...


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 19

K-means continues...

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 20

K-means terminates


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 21

K-means Questions
• What is it trying to optimize?
• Are we sure it will terminate?
• Are we sure it will find an optimal clustering?
• How should we start it?
• How could we automatically choose the number of centers?

...we'll deal with these questions over the next few slides

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 22

Distortion

Given...
• an encoder function: $\text{ENCODE}: \Re^m \to [1..k]$
• a decoder function: $\text{DECODE}: [1..k] \to \Re^m$

Define...

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \text{DECODE}[\text{ENCODE}(\mathbf{x}_i)] \right)^2$$


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 23

Distortion

Given...
• an encoder function: $\text{ENCODE}: \Re^m \to [1..k]$
• a decoder function: $\text{DECODE}: [1..k] \to \Re^m$

Define...

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \text{DECODE}[\text{ENCODE}(\mathbf{x}_i)] \right)^2$$

We may as well write

$$\text{DECODE}[j] = \mathbf{c}_j$$

so

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2$$
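As a sanity check, the distortion of a candidate set of centers is a few lines of numpy. This hypothetical helper assumes the encoder is "nearest center", which the next slides show is the optimal choice.

```python
import numpy as np

def distortion(X, centers):
    """Sum over all R records of (x_i - c_ENCODE(x_i))^2, taking
    ENCODE(x_i) to be the index of x_i's nearest center."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()
```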

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 24

The Minimal Distortion

What properties must centers c_1, c_2, ..., c_k have when distortion is minimized?

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2$$


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 25

The Minimal Distortion (1)

What properties must centers c_1, c_2, ..., c_k have when distortion is minimized?

(1) x_i must be encoded by its nearest center

...why?

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2$$

$$\mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} = \arg\min_{\mathbf{c} \in \{\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k\}} \left( \mathbf{x}_i - \mathbf{c} \right)^2$$

...at the minimal distortion

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 26

The Minimal Distortion (1)

What properties must centers c_1, c_2, ..., c_k have when distortion is minimized?

(1) x_i must be encoded by its nearest center

...why?

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2$$

$$\mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} = \arg\min_{\mathbf{c} \in \{\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k\}} \left( \mathbf{x}_i - \mathbf{c} \right)^2$$

...at the minimal distortion

Otherwise distortion could be reduced by replacing ENCODE[x_i] by the nearest center.


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 27

The Minimal Distortion (2)

What properties must centers c_1, c_2, ..., c_k have when distortion is minimized?

(2) The partial derivative of Distortion with respect to each center location must be zero.

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2$$

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 28

(2) The partial derivative of Distortion with respect to each center location must be zero.

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2 = \sum_{j=1}^{k} \; \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \left( \mathbf{x}_i - \mathbf{c}_j \right)^2$$

$$\frac{\partial\,\text{Distortion}}{\partial \mathbf{c}_j} = \frac{\partial}{\partial \mathbf{c}_j} \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \left( \mathbf{x}_i - \mathbf{c}_j \right)^2 = -2 \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \left( \mathbf{x}_i - \mathbf{c}_j \right) = 0 \quad \text{(for a minimum)}$$

OwnedBy(c_j) = the set of records owned by Center c_j.


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 29

(2) The partial derivative of Distortion with respect to each center location must be zero.

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2 = \sum_{j=1}^{k} \; \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \left( \mathbf{x}_i - \mathbf{c}_j \right)^2$$

$$\frac{\partial\,\text{Distortion}}{\partial \mathbf{c}_j} = \frac{\partial}{\partial \mathbf{c}_j} \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \left( \mathbf{x}_i - \mathbf{c}_j \right)^2 = -2 \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \left( \mathbf{x}_i - \mathbf{c}_j \right) = 0 \quad \text{(for a minimum)}$$

Thus, at a minimum:

$$\mathbf{c}_j = \frac{1}{|\text{OwnedBy}(\mathbf{c}_j)|} \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \mathbf{x}_i$$

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 30

At the minimum distortion

What properties must centers c_1, c_2, ..., c_k have when distortion is minimized?

(1) x_i must be encoded by its nearest center
(2) Each Center must be at the centroid of points it owns.

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2$$


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 31

Improving a suboptimal configuration

What properties can be changed for centers c_1, c_2, ..., c_k when distortion is not minimized?

(1) Change encoding so that x_i is encoded by its nearest center.
(2) Set each Center to the centroid of points it owns.

There's no point applying either operation twice in succession. But it can be profitable to alternate.

...And that's K-means!

Easy to prove this procedure will terminate in a state at which neither (1) nor (2) change the configuration. Why?

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2$$

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 32

Improving a suboptimal configuration

What properties can be changed for centers c_1, c_2, ..., c_k when distortion is not minimized?

(1) Change encoding so that x_i is encoded by its nearest center.
(2) Set each Center to the centroid of points it owns.

There's no point applying either operation twice in succession. But it can be profitable to alternate.

...And that's K-means!

Easy to prove this procedure will terminate in a state at which neither (1) nor (2) change the configuration. Why?

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2$$

There are only a finite number of ways of partitioning R records into k groups. So there are only a finite number of possible configurations in which all Centers are the centroids of the points they own. If the configuration changes on an iteration, it must have improved the distortion. So each time the configuration changes it must go to a configuration it's never been to before. So if it tried to go on forever, it would eventually run out of configurations.


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 33

Will we find the optimal configuration?

Not necessarily. Can you invent a configuration that has converged, but does not have the minimum distortion?

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 34

Will we find the optimal configuration?

Not necessarily. Can you invent a configuration that has converged, but does not have the minimum distortion? (Hint: try a fiendish k=3 configuration here)


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 35

Will we find the optimal configuration?

Not necessarily. Can you invent a configuration that has converged, but does not have the minimum distortion? (Hint: try a fiendish k=3 configuration here)

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 36

Trying to find good optima
• Idea 1: Be careful about where you start
• Idea 2: Do many runs of k-means, each from a different random start configuration
• Many other ideas floating around.


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 37

Trying to find good optima
• Idea 1: Be careful about where you start
• Idea 2: Do many runs of k-means, each from a different random start configuration
• Many other ideas floating around.

Neat trick:
Place first center on top of randomly chosen datapoint.
Place second center on datapoint that's as far away as possible from first center.
:
Place jth center on datapoint that's as far away as possible from the closest of Centers 1 through j-1.
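A minimal sketch of this trick (sometimes called farthest-first traversal; the function name is illustrative):

```python
import numpy as np

def farthest_first_centers(X, k, rng=np.random.default_rng(0)):
    """The slide's neat trick: first center on a random datapoint, each
    later center on the datapoint farthest from its closest chosen center."""
    X = np.asarray(X, dtype=float)
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Distance from every point to the closest already-chosen center.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    return np.array(centers)
```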

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 38

Choosing the number of Centers
• A difficult problem
• Most common approach is to try to find the solution that minimizes the Schwarz Criterion (also related to the BIC)

$$\text{Distortion} + \lambda\,(\#\text{parameters}) \log R = \text{Distortion} + \lambda\, m k \log R$$

where m = #dimensions, k = #Centers, R = #Records
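A hedged sketch of this model-selection loop, reusing the kmeans and distortion sketches above; the weight lam stands in for the criterion's coefficient λ and its value here is an assumption.

```python
import numpy as np

def choose_k(X, k_max, lam=1.0):
    """Pick the k minimizing Distortion + lam * m * k * log R,
    as on the slide. `lam` is an assumed weight."""
    R, m = X.shape
    best_k, best_score = None, np.inf
    for k in range(1, k_max + 1):
        centers, _ = kmeans(X, k)   # sketch defined earlier
        score = distortion(X, centers) + lam * m * k * np.log(R)
        if score < best_score:
            best_k, best_score = k, score
    return best_k
```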


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 39

Common uses of K-means
• Often used as an exploratory data analysis tool
• In one dimension, a good way to quantize real-valued variables into k non-uniform buckets
• Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization)
• Also used for choosing color palettes on old-fashioned graphical display devices!
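For instance, the one-dimensional bucketing use is a couple of lines with the kmeans sketch above; the data here is made up purely for illustration.

```python
import numpy as np

# Illustrative only: quantize a skewed real-valued variable into
# k = 4 non-uniform buckets by running k-means in one dimension.
rng = np.random.default_rng(0)
values = rng.exponential(scale=2.0, size=1000).reshape(-1, 1)
centers, owner = kmeans(values, k=4)   # sketch defined earlier
print(np.sort(centers.ravel()))        # the k bucket representatives
```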

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 40

Single Linkage Hierarchical Clustering
1. Say "Every point is its own cluster"


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 41

Single Linkage Hierarchical Clustering
1. Say "Every point is its own cluster"
2. Find most similar pair of clusters

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 42

Single Linkage Hierarchical Clustering
1. Say "Every point is its own cluster"
2. Find most similar pair of clusters
3. Merge it into a parent cluster


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 43

Single Linkage Hierarchical Clustering
1. Say "Every point is its own cluster"
2. Find most similar pair of clusters
3. Merge it into a parent cluster
4. Repeat

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 44

Single Linkage Hierarchical Clustering
1. Say "Every point is its own cluster"
2. Find most similar pair of clusters
3. Merge it into a parent cluster
4. Repeat


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 45

Single Linkage Hierarchical Clustering
1. Say "Every point is its own cluster"
2. Find most similar pair of clusters
3. Merge it into a parent cluster
4. Repeat... until you've merged the whole dataset into one cluster

You're left with a nice dendrogram, or taxonomy, or hierarchy of datapoints (not shown here)

How do we define similarity between clusters?
• Minimum distance between points in clusters (in which case we're simply doing Euclidean Minimum Spanning Trees)
• Maximum distance between points in clusters
• Average distance between points in clusters
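A minimal sketch using scipy's standard agglomerative-clustering routines, assuming a small 2-D array X of toy data: method='single' is the minimum-distance rule above, and fcluster cuts the dendrogram into k groups, which is the "cut the (k-1) longest links" operation on the next slide.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).random((20, 2))     # toy 2-D data
Z = linkage(X, method='single')  # min-distance rule; 'complete' and
                                 # 'average' give the other two rules
labels = fcluster(Z, t=3, criterion='maxclust')  # cut into k = 3 groups
```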

    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 46

Single Linkage Comments
• It's nice that you get a hierarchy instead of an amorphous collection of groups
• If you want k groups, just cut the (k-1) longest links
• There's no real statistical or information-theoretic foundation to this. Makes your lecturer feel a bit queasy.

Also known in the trade as Hierarchical Agglomerative Clustering (note the acronym)


    Copyright 2001, Andrew W. Moore K-means and Hierarchical Clustering: Slide 47

What you should know
• All the details of K-means
• The theory behind K-means as an optimization algorithm
• How K-means can get stuck
• The outline of Hierarchical clustering
• Be able to contrast between which problems would be relatively well/poorly suited to K-means vs Gaussian Mixtures vs Hierarchical clustering