David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Data Mining (and machine learning)

DM Lecture 5: Clustering

Page 1:

Data Mining(and machine learning)

DM Lecture 5: Clustering

Page 2:

Today

• (Unsupervised) clustering
  – What and why
  – A good first step towards understanding your data
  – Discover patterns and structure in data, which then guides further data mining
  – Helps to spot problems and outliers
  – Identifies 'market segments', e.g. specific types of customers or users; this is called segmentation or market segmentation
• How to do it:
  – Choose a distance measure or a similarity measure
  – Run a (usually) simple algorithm
  – We will cover the two main algorithms

Page 3:

What is Clustering? Why do it?

Subscriber   Calls per day   Monthly bill
 1           1               3
 2           4               7
 3           4               8
 4           3               5
 5           6               1
 6           9               3
 7           3               5
 8           7               2
 9           4               7
10           6               3
11           2               5
12           8               4
13           5               3
14           6               2
15           2               4
16           3               6
17           8               3

Consider these data:

Made up; maybe they are 17 subscribers to a mobile phone services company, and show the mean calls per day and the mean monthly bill for each customer.

Do you spot any patterns or structure?
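The structure is easier to see once the data is plotted. As a quick illustration (not part of the slides), the table above can be rendered as a crude text scatter plot in Python; the `data` list is copied from the table, and `text_plot` is a helper invented for this sketch:

```python
# The (calls per day, monthly bill) pairs from the table above.
data = [(1, 3), (4, 7), (4, 8), (3, 5), (6, 1), (9, 3), (3, 5), (7, 2),
        (4, 7), (6, 3), (2, 5), (8, 4), (5, 3), (6, 2), (2, 4), (3, 6), (8, 3)]

def text_plot(points, width=10, height=9):
    """Render points as a crude character grid, with y increasing upwards."""
    grid = [[' '] * (width + 1) for _ in range(height + 1)]
    for x, y in points:
        grid[y][x] = '*'
    return '\n'.join(''.join(row) for row in reversed(grid))

print(text_plot(data))
```

The two clumps discussed on the next slides stand out even in this rough rendering.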

Page 4:

What is Clustering? Why do it?

Here is a plot of the data, with calls as X and bills as Y.

Now do you spot any patterns or structure?

[Scatter plot: Calls per day (x axis, 0-10) vs Monthly bill (y axis, 0-9)]

Page 5:

What is Clustering? Why do it?

Clearly there are two clusters -- two distinct types of customer.

Top left: few calls but highish bills; bottom right: many calls, low bills.

[Scatter plot: Calls per day vs Monthly bill, with the two clusters indicated]

Page 6:

So, clustering is all about plotting/visualising and noting distinct groups by eye, right?

Not really, because:
• We can only spot patterns by eye (i.e. with our brains) if the data is 1D, 2D or 3D. Most data of interest is much higher dimensional -- e.g. 10D, 20D, 1000D.
• Sometimes the clusters are not as obvious as a bunch of data all in the same place -- we will see examples.
• So: we need automated algorithms which can do what you just did (find distinct groups in the data), but which can do this for any number of dimensions, and for perhaps more complex kinds of groups.

Page 7:

Yes.

OK, so when we apply an automated clustering algorithm to data, the result is a collection of groups, which tells us potentially useful things about our data -- e.g. if we are doing this with supermarket baskets, each group is a collection of typical baskets, which may relate to "general housekeeping", "late night dinner", "quick lunchtime shopper", and perhaps other types that we are not expecting.

Page 8:

Quality of a clustering?

Why is this:

[Scatter plot: Calls per day vs Monthly bill, one grouping of the data]

Better than this?

[Scatter plot: the same data with a different grouping]

Page 9:

Quality of a clustering?

A 'good clustering' has the following properties:
• Items in the same cluster tend to be close to each other
• Items in different clusters tend to be far from each other

It is not hard to come up with a metric -- an easily calculated value -- that can be used to give a score to any clustering. There are many such metrics. E.g.:

S = the mean distance between pairs of items in the same cluster
D = the mean distance between pairs of items in different clusters
Measure of cluster quality: D/S -- the higher the better.
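As a sketch, the D/S score can be computed directly from its definition. This is an illustrative implementation, not code from the lecture; it assumes points are coordinate tuples and uses Euclidean distance, which the slide does not mandate:

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)

def cluster_quality(clusters):
    """D/S score for a clustering, per the definition above.

    `clusters` is a list of clusters, each a list of points (tuples).
    S = mean distance between pairs of items in the same cluster;
    D = mean distance between pairs of items in different clusters.
    Higher D/S is better.
    """
    same = [dist(a, b) for c in clusters for a, b in combinations(c, 2)]
    diff = [dist(a, b)
            for c1, c2 in combinations(clusters, 2)
            for a in c1 for b in c2]
    return (sum(diff) / len(diff)) / (sum(same) / len(same))
```

For example, two tight, well-separated clusters such as `[[(0, 0), (0, 1)], [(5, 0), (5, 1)]]` score around 5, while sloppier groupings score lower.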

Page 10:

Let’s try that

S = [AB + AD + AF + AH + BD + BF + BH + DF + DH + FH + CE + CG + EG] / 13 = 44/13 ≈ 3.38
D = [AC + AE + AG + BC + BE + BG + DC + DE + DG + FC + FE + FG + HC + HE + HG] / 15 = 40/15 ≈ 2.67

Cluster quality = D/S ≈ 0.79

[Diagram: points A-H, clustered as {A, B, D, F, H} and {C, E, G}]

Page 11:

Let’s try that again

S = [AB + AC + AD + BC + BD + CD + EF + EG + EH + FG + FH + GH] / 12 = 20/12 ≈ 1.67
D = [AE + AF + AG + AH + BE + BF + BG + BH + CE + CF + CG + CH + DE + DF + DG + DH] / 16 = 68/16 = 4.25

Cluster quality = D/S ≈ 2.54

[Diagram: points A-H, clustered as {A, B, C, D} and {E, F, G, H}]

Page 12:

But what about this?

S = [AB + CD + EF + EG + EH + FG + FH + GH] / 8 = 12/8 = 1.5
D = [AC + AD + AE + AF + AG + AH + BC + BD + BE + BF + BG + BH + CE + CF + CG + CH + DE + DF + DG + DH] / 20 = 72/20 = 3.6

Cluster quality = D/S = 2.40

[Diagram: points A-H, clustered as {A, B}, {C, D} and {E, F, G, H}]

Page 13:

Some important notes

• There is usually no 'correct' clustering.

• Clustering algorithms (whether or not they work with cluster quality metrics) always use some kind of distance or similarity measure -- the result of the clustering process will depend on the chosen distance measure.

• Choice of algorithm, and/or distance measure, will depend on the kind of cluster shapes you might expect in the data.

• Our D/S measure for cluster quality will not work well in lots of cases.

Page 14:

Examples: sometimes groups are not simple to spot, even in 2D

Slide credit: Julia Handl

Page 15:

Examples: sometimes groups are not simple to spot, even in 2D

Slide credit: Julia Handl

Page 16:

Brain Training

Think about why D/S is not a useful cluster quality measure in the general case.

Try to design a cluster quality metric that will work well in the cases on the previous slides (not very difficult).

Page 17:

In many problems the clusters are more 'conventional' -- but maybe fuzzy and unclear

Slide credit: Julia Handl

Page 18:

And there is a different kind of clustering that can be done, which avoids the issue of deciding the number of clusters in advance

Slide credit: Elias Raftopoulos, Prof. Maria Papadopouli

Page 19:

How to do it

• The most commonly used methods:
  – K-Means
  – Hierarchical Agglomerative Clustering

Page 20:

K-Means

• If you want to see K clusters, then run K-means, i.e. you need to choose the number of clusters in advance. Say K = 3 -- run 3-means and the result is a 'good' grouping of the data into 3 clusters.

• It works by generating K points (in a way, these are made-up records in the data); each point is the centre (or centroid) of one cluster. As the algorithm iterates, the points adjust their positions until they stabilise.

• Very simple, fairly fast, very common; a few drawbacks.

Page 21:

Let’s see it

Page 22:

Here is the data; we choose k = 2 and run 2-means

Page 23:

We choose two cluster centres -- randomly

Page 24:

Step 1: decide which cluster each point is in – the one whose centre is closest

Page 25:

Step 2: We now have two clusters – recompute the centre of each cluster

Page 26:

These are the new centres

Page 27:

Step 1: decide which cluster each point is in – the one whose centre is closest

Page 28:

This one has to be reassigned:

Page 29:

Step 2: We now have two new clusters – recompute the centre of each cluster

Page 30:

Centres now slightly moved

Page 31:

Step 1: decide which cluster each point is in – the one whose centre is closest

Page 32:

In this case, nothing gets reassigned to a new cluster – so the algorithm is finished

Page 33:

The K-Means Algorithm

Choose k, then:
1. Randomly choose k points, labelled {1, 2, …, k}, to be the initial cluster centroids.
2. For each datum, let its cluster ID be the label of its closest centroid.
3. For each cluster, recalculate its actual centre.
4. Go back to step 2; stop when step 2 does not change the cluster ID of any point.
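The four steps above can be sketched in a few lines of Python. This is an educational sketch, not the lecture's code; the function name, the fixed random seed, and the empty-cluster guard are my own choices:

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def k_means(points, k, seed=0):
    """Minimal K-means over coordinate tuples; returns (centroids, cluster IDs)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: k random initial centroids
    while True:
        # step 2: each datum's cluster ID is the label of its closest centroid
        ids = [min(range(k), key=lambda i: dist(p, centroids[i])) for p in points]
        # step 3: recalculate each cluster's actual centre (mean of its members)
        new_centroids = []
        for i in range(k):
            members = [p for p, c in zip(points, ids) if c == i]
            if members:
                new_centroids.append(tuple(sum(xs) / len(members)
                                           for xs in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # guard: empty cluster keeps its centre
        # step 4: stop once the centroids (hence the assignments) stop changing
        if new_centroids == centroids:
            return centroids, ids
        centroids = new_centroids
```

Run on two well-separated clumps, e.g. `k_means([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2)`, it assigns the first three points to one cluster and the last three to the other.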

Page 34:

Simple but often not ideal

• Variable results with noisy data and outliers

• Very large or very small values can skew the centroid positions, and give poor clusterings

• Only suitable for the cases where we can expect clusters to be 'clumps' that are close together -- e.g. terrible in the two-spirals and similar cases

Page 35:

Hierarchical Agglomerative Clustering

Before we discuss this:
• We need to know how to work out the distance between two points:
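The slides leave the choice of point-to-point distance to a later lecture; the most common choice is straight-line (Euclidean) distance, which for illustration looks like:

```python
from math import sqrt

def euclidean(p, q):
    """Euclidean distance between two points of any (equal) dimension."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

Other choices (e.g. Manhattan distance, or similarity measures such as cosine similarity) give different clusterings, as the 'important notes' slide warns.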

Page 36:

Hierarchical Agglomerative Clustering

Before we discuss this:
• And the distance between a point and a cluster?

Page 37:

Hierarchical Agglomerative Clustering

Before we discuss this:
• And the distance between two clusters?

Page 38:

Hierarchical Agglomerative Clustering

Before we discuss this:
• There are many options for all of these things; we will discuss them in a later lecture.

Page 39:

Hierarchical Agglomerative Clustering

• Is very commonly used
• Very different from K-means
• Provides a much richer structuring of the data
• No need to choose k
• But quite sensitive to the various ways of working out distance (different results for different ways)

Page 40:

Let’s see it

[Diagram: nine data points, labelled 1-9]

Page 41:

Initially, each point is a cluster

[Diagram: nine data points, labelled 1-9]

Page 42:

Find closest pair of clusters, and merge them into one cluster

[Diagram: points 1-9; dendrogram leaves so far: 7, 9]

Page 43:

Find closest pair of clusters, and merge them into one cluster

[Diagram: points 1-9; dendrogram leaves so far: 7, 9, 3, 4]

Page 44:

Find closest pair of clusters, and merge them into one cluster

[Diagram: points 1-9; dendrogram leaves so far: 7, 9, 3, 4, 8]

Page 45:

Find closest pair of clusters, and merge them into one cluster

[Diagram: points 1-9; dendrogram leaves so far: 7, 9, 3, 4, 8, 2]

Page 46:

Find closest pair of clusters, and merge them into one cluster

[Diagram: points 1-9; dendrogram leaves so far: 7, 9, 3, 4, 8, 2, 1]

Page 47:

Find closest pair of clusters, and merge them into one cluster

[Diagram: points 1-9; dendrogram leaves so far: 7, 9, 3, 4, 8, 2, 1, 5]

Page 48:

Find closest pair of clusters, and merge them into one cluster

[Diagram: points 1-9; dendrogram leaves so far: 7, 9, 3, 4, 8, 2, 1, 5, 6]

Page 49:

Find closest pair of clusters, and merge them into one cluster

[Diagram: points 1-9; dendrogram with leaves 7, 9, 3, 4, 8, 2, 1, 5, 6]

Page 50:

Now all one cluster, so stop

[Diagram: points 1-9; completed dendrogram with leaves 7, 9, 3, 4, 8, 2, 1, 5, 6]

Page 51:

The thing on the right is a dendrogram – it contains the information for us to group the data into clusters in various ways

[Diagram: points 1-9; completed dendrogram with leaves 7, 9, 3, 4, 8, 2, 1, 5, 6]

Page 52:

E.g. 2 clusters

[Diagram: the dendrogram cut at a level that gives 2 clusters]

Page 53:

E.g. 3 clusters

[Diagram: the dendrogram cut at a level that gives 3 clusters]

Page 54:

In a proper dendrogram …

• The height of a bar indicates how different the items are

• A dendrogram is a binary tree

• The data points are the leaves of the tree

• Each node represents a cluster – all the leaves of its subtree

Page 55:

The Agglomerative Hierarchical Clustering Algorithm

• Decide on how to work out the distance between two clusters
• Initialise: each of the N data items is a cluster
• Repeat N-1 times:
  – Find the closest pair of clusters; merge them into a single cluster (and update your tree representation)
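The loop above can be sketched as follows. The slides leave the cluster-to-cluster distance open, so this sketch assumes single linkage (the distance between a pair of clusters is the distance between their closest members), one common option:

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)

def agglomerative(points):
    """Agglomerative clustering sketch over coordinate tuples.

    Returns the merge history: a list of (cluster_a, cluster_b) pairs of
    frozensets of point indices, from which a dendrogram can be drawn or cut.
    """
    clusters = [frozenset([i]) for i in range(len(points))]  # each point is a cluster
    history = []
    while len(clusters) > 1:  # N-1 merges in total
        # single linkage: cluster distance = distance between their closest members
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: min(dist(points[i], points[j])
                                        for i in pair[0] for j in pair[1]))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)  # merge the closest pair into one cluster
        history.append((a, b))
    return history
```

Stopping after N-k merges instead of N-1 leaves k clusters, which corresponds to cutting the dendrogram as in the '2 clusters' and '3 clusters' slides.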

Page 56:

Next time

• Correlation – what are the important fields in your dataset?