Chapter 11 Automatic Cluster Detection


Page 1: Ma Mru Dm Chapter11

Chapter 11: Automatic Cluster Detection

Page 2: Ma Mru Dm Chapter11

2

Data Mining Techniques So Far…

• Chapter 5 – Statistics

• Chapter 6 – Decision Trees

• Chapter 7 – Neural Networks

• Chapter 8 – Nearest Neighbor Approaches: Memory-Based Reasoning and Collaborative Filtering

• Chapter 9 – Market Basket Analysis & Association Rules

• Chapter 10 – Link Analysis

Page 3: Ma Mru Dm Chapter11

3

Automatic Cluster Detection

• DM techniques used to find patterns in data

  – Not always easy to identify
    • No observable pattern
    • Too many patterns
    • Decomposition (break down into smaller pieces) [example: Olympics]

• Automatic Cluster Detection is useful to find “better behaved” clusters of data within a larger dataset; seeing the forest without getting lost in the trees

Page 4: Ma Mru Dm Chapter11

4

Automatic Cluster Detection

• K-means clustering algorithm – similar to nearest neighbor techniques (memory-based reasoning and collaborative filtering) in that it depends on a geometric interpretation of the data

• Other automatic cluster detection (ACD) algorithms include:
  – Gaussian mixture models
  – Agglomerative clustering
  – Divisive clustering
  – Self-organizing maps (SOM) – see Ch. 7, Neural Networks

• ACD is a tool used primarily for undirected data mining
  – No preclassified training data set
  – No distinction between independent and dependent variables

• When used for directed data mining:
  – Marketing clusters are referred to as “segments”
  – Customer segmentation is a popular application of clustering

• ACD is rarely used in isolation – other methods usually follow up

Page 5: Ma Mru Dm Chapter11

5

Clustering Examples

• “Star Power” – the Hertzsprung–Russell diagram (~1910), which plots stars by temperature and brightness

• Group of Teens

• 1990s US Army – women’s uniforms:
  • 100 measurements for each of 3,000 women
  • Using the K-means algorithm, these were reduced to a handful of clusters

Page 6: Ma Mru Dm Chapter11

6

K-means Clustering

• “K” – circa 1967 – this algorithm looks for a fixed number (K) of clusters, which are defined in terms of the proximity of data points to each other

• How K-means works (see the figures on the next slides and the sketch below):

  – The algorithm selects K (e.g., 3) data points at random as the initial cluster seeds

  – It assigns each of the remaining data points to the nearest seed (the boundaries between clusters are the perpendicular bisectors of the lines between seeds)

  – It then calculates the centroid of each cluster (using the averages of the points in each cluster)

  – At each subsequent iteration, all cluster assignments are reevaluated against the new centroids
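A minimal sketch of this procedure in Python (the names euclidean and kmeans are illustrative, not the book's implementation):

    import math
    import random

    def euclidean(p, q):
        # straight-line distance between two equal-length numeric records
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def kmeans(points, k, seeds=None, max_iter=100):
        # 1. choose K data points as the initial cluster seeds (randomly if none are given)
        centroids = list(seeds) if seeds else random.sample(points, k)
        assignment = None
        for _ in range(max_iter):
            # 2. assign every point to the cluster whose centroid is nearest
            new_assignment = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                              for p in points]
            if new_assignment == assignment:   # assignments stable -> converged
                break
            assignment = new_assignment
            # 3. recompute each centroid as the average of its current members
            for c in range(k):
                members = [p for p, a in zip(points, assignment) if a == c]
                if members:
                    centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
        return centroids, assignment

The worked example on the following slides can be reproduced with this sketch; see the note after slide 13.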

Page 7: Ma Mru Dm Chapter11

7

K-means algorithm example

Page 8: Ma Mru Dm Chapter11

8

K-means algorithm example

[Figure: four scatter plots (both axes 0–10) illustrating the successive K-means iterations described on slide 6.]

Page 9: Ma Mru Dm Chapter11

9

K-means algorithm example

Customer   Age    Income (K)
John       0.55   0.175
Rachel     0.34   0.25
Hannah     1.00   1.00
Tom        0.93   0.85
Nellie     0.39   0.20
David      0.58   0.25

[Figure: the six customers plotted by Age and Income.]

Note: both Age and Income are normalized.

Page 10: Ma Mru Dm Chapter11

10

K-means algorithm example

Nellie and David are selected as cluster centers A and B, respectively.

Customer   Distance from David   Distance from Nellie
John       0.08                  0.16
Rachel     0.24                  0.07
Hannah     0.86                  1.01
Tom        0.69                  0.85
Nellie     –                     –
David      –                     –

[Figure: the customers plotted by Age and Income, with the seeds labeled A (Nellie) and B (David).]

Page 11: Ma Mru Dm Chapter11

11

K-means algorithm example

Calculate the cluster centers:
  – Cluster A center: Age = 0.37, Income = 0.23
  – Cluster B center: Age = 0.77, Income = 0.57

Assign customers to clusters based on the new cluster centers.

[Figure: the customers plotted by Age and Income, with the recomputed centers A and B.]

Page 12: Ma Mru Dm Chapter11

12

K-means algorithm example

Customer   Distance from A   Distance from B
John       0.19              0.45
Rachel     0.04              0.54
Hannah     0.99              0.49
Tom        0.84              0.32
Nellie     0.04              0.53
David      0.21              0.37

[Figure: the customers plotted by Age and Income, reassigned to the clusters around centers A and B.]

Page 13: Ma Mru Dm Chapter11

13

K-means algorithm example

Calculate the cluster centers again:
  – Cluster A center: Age = 0.47, Income = 0.22
  – Cluster B center: Age = 0.97, Income = 0.93

• The cluster assignments no longer change, so the algorithm has converged.

[Figure: the customers plotted by Age and Income, showing the final clusters A and B.]
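As a cross-check (not part of the original slides), running the kmeans sketch from slide 6 on the six normalized customers, seeded with Nellie and David as on slide 10, converges to the same clusters and centers:

    customers = {
        "John":   [0.55, 0.175],
        "Rachel": [0.34, 0.25],
        "Hannah": [1.00, 1.00],
        "Tom":    [0.93, 0.85],
        "Nellie": [0.39, 0.20],
        "David":  [0.58, 0.25],
    }
    points = list(customers.values())

    # seed cluster A with Nellie and cluster B with David, as on slide 10
    centers, labels = kmeans(points, k=2, seeds=[customers["Nellie"], customers["David"]])

    # centers ~ [0.465, 0.219] and [0.965, 0.925] -> the slide's (0.47, 0.22) and (0.97, 0.93)
    # cluster A: John, Rachel, Nellie, David   cluster B: Hannah, Tom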

Page 14: Ma Mru Dm Chapter11

14

K-means Clustering: Right or Wrong?

• The resulting clusters describe underlying structure in the data; however, there is no one right description of that structure (example: Figure 11.6 – playing cards clustered with K=2 vs. K=4)

Page 15: Ma Mru Dm Chapter11

15

K-means Clustering Demo

• Clustering demo:

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

Page 16: Ma Mru Dm Chapter11

16

Similarity & Difference

• Automatic Cluster Detection is quite simple for a software program to accomplish – data points and clusters are simply mapped in space

• However, business data are not points in space; they describe purchases, phone calls, airplane trips, car registrations, etc., which have no obvious connection to the dots in a cluster diagram

Page 17: Ma Mru Dm Chapter11

17

Similarity & Difference

• Clustering business data requires some notion of natural association – records in a given cluster are more similar to each other than to records in another cluster

• For DM software, this concept of association must be translated into some numeric measure of the degree of similarity

• The most common translation is to convert data values (e.g., gender, age, product) into numeric values so the records can be treated as points in space

• If two points are close in the geometric sense, they represent similar records in the database

Page 18: Ma Mru Dm Chapter11

18

Similarity & Difference

• Business variable (field) types:
  – Categorical (e.g., mint, cherry, chocolate)
  – Ranks (e.g., freshman, sophomore, … or valedictorian, salutatorian)
  – Intervals (e.g., 56 degrees, 72 degrees)
  – True measures – interval variables measured from a meaningful zero point
    • Fahrenheit and Celsius temperatures are not good examples
    • Age, weight, height, length, and tenure are good examples

• From a geometric standpoint, the variable types above go from least effective to most effective (top to bottom)

• Finally, there are dozens (if not hundreds) of published techniques for measuring the similarity of two data records

Page 19: Ma Mru Dm Chapter11

19

Similarity & Difference

• Two main problems with this approach:

  – Many variable types, including all categorical variables and many numeric variables such as rankings, do not have the right behavior to be properly treated as components of a position vector.

  – In geometry, the contributions of each dimension are of equal importance, but in databases, a small change in one field may be much more important than a large change in another field.

• Hence, other approaches are needed!

Page 20: Ma Mru Dm Chapter11

20

Formal Measures of Similarity

• Euclidean distance

  D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

• Manhattan distance

  D(X, Y) = \sum_{i=1}^{n} |x_i - y_i|

  – Sometimes preferred to the Euclidean distance because the distances along each axis are not squared, so it is less likely that a large difference in one dimension will dominate the total distance
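A small sketch of both measures in Python (the function names are illustrative); the John/Rachel records from the next slide are used as the example:

    import math

    def euclidean_distance(x, y):
        # D(X, Y) = sqrt( sum_i (x_i - y_i)^2 )
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def manhattan_distance(x, y):
        # D(X, Y) = sum_i |x_i - y_i|
        return sum(abs(a - b) for a, b in zip(x, y))

    # John and Rachel from the next slide: (age, income in $K, number of credit cards)
    john, rachel = (35, 95, 3), (41, 215, 2)
    print(euclidean_distance(john, rachel))   # ~120.15 -- dominated by the income difference
    print(manhattan_distance(john, rachel))   # 127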

Page 21: Ma Mru Dm Chapter11

21

Formal Measures of Similarity

Euclidean distance:

  D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

John:   Age = 35, Income = 95K, no. of credit cards = 3
Rachel: Age = 41, Income = 215K, no. of credit cards = 2

Distance(John, Rachel) = sqrt[(35 − 41)² + (95 − 215)² + (3 − 2)²] ≈ 120.2

Page 22: Ma Mru Dm Chapter11

22

Formal Measures of Similarity

• Angle between Two Vectors

  – A vector has both magnitude (the distance from the origin to the point) and direction.

  – For this similarity measure, it is the direction that matters.

  – The angle between two vectors provides a measure of association that is not influenced by differences in magnitude between the two things being compared.

• The cosine of the angle measures correlation:
  – it is 1 when the vectors are parallel (perfectly correlated)
  – it is 0 when they are orthogonal
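A minimal sketch of this measure (cosine similarity) in Python, assuming plain numeric lists as vectors (the function name is illustrative):

    import math

    def cosine_similarity(x, y):
        # cos(angle) = (x . y) / (|x| * |y|): 1 = parallel vectors, 0 = orthogonal vectors
        dot = sum(a * b for a, b in zip(x, y))
        norm_x = math.sqrt(sum(a * a for a in x))
        norm_y = math.sqrt(sum(b * b for b in y))
        return dot / (norm_x * norm_y)

    print(cosine_similarity([1, 2, 3], [2, 4, 6]))   # ~1.0 (same direction, different magnitude)
    print(cosine_similarity([1, 0], [0, 1]))         # 0.0 (orthogonal)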

Page 23: Ma Mru Dm Chapter11

23

Formal Measures of Similarity

• Number of Features in Common

  – For categorical variables, metric measures are not the best choice.

  – A better measure is based on the degree of overlap between records.

  – As with the geometric measures, there are many variations on this idea.

  – In all variations, the two records are compared field by field to determine the number of fields that match and the number that do not.

  – The simplest measure is the ratio of matches to the total number of fields (a sketch follows below).
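One possible sketch of that simplest measure in Python (the field values are illustrative, not from the book):

    def overlap_similarity(record_a, record_b):
        # ratio of fields with identical values to the total number of fields
        matches = sum(1 for a, b in zip(record_a, record_b) if a == b)
        return matches / len(record_a)

    a = ("female", "chocolate", "student", "urban")
    b = ("female", "cherry",    "student", "rural")
    print(overlap_similarity(a, b))   # 0.5 -- two of the four categorical fields match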

Page 24: Ma Mru Dm Chapter11

24

Strengths and Weaknesses of K-Means

• Strengths
  – Relatively efficient
  – Simple implementation

• Weaknesses
  – Need to specify k, the number of clusters, in advance
    • Can use hierarchical clustering to find a starting point
    • Or try several different values of k
  – Unable to handle noisy data and outliers well
  – Euclidean distance may not make a lot of sense for categorical variable types:
    • Categorical (e.g., gender, racial background)
    • Ranks (e.g., income levels (H, M, L))
    • Continuous (e.g., age, number of orders)

Page 25: Ma Mru Dm Chapter11

25

Other Approaches to Cluster Detection

• Gaussian Mixture Models

• Agglomerative Clustering

• Divisive Clustering

• Self-Organizing Maps (SOM) [Chapter 7]

Page 26: Ma Mru Dm Chapter11

26

Gaussian Mixture Models

• The K-means method as described has some drawbacks:
  – It does not do well with overlapping clusters.
  – The clusters are easily pulled off-center by outliers.
  – Each record is either inside or outside of a given cluster.

• Differences from K-means:
  – The seeds are considered to be the means of Gaussian distributions.
  – The algorithm proceeds by iterating over two steps, called the estimation step and the maximization step.
  – At the end of the process, each point is tied to the various clusters with higher or lower probability.

• This is sometimes called soft clustering, because points are not uniquely identified with a single cluster (see the sketch below).
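As an illustration only (the slides do not prescribe a library), scikit-learn's GaussianMixture can be fit to the six customers from the K-means example to obtain the per-cluster membership probabilities that make this a soft clustering:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # the six normalized customers (age, income) from the K-means example
    X = np.array([[0.55, 0.175], [0.34, 0.25], [1.00, 1.00],
                  [0.93, 0.85],  [0.39, 0.20], [0.58, 0.25]])

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # estimation/maximization iterations
    probs = gmm.predict_proba(X)   # one row per customer: probability of belonging to each cluster
    print(np.round(probs, 2))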

Page 27: Ma Mru Dm Chapter11

27

Agglomerative Clustering

• Differences from K-means:

  – Start out with each data point forming its own cluster and gradually merge clusters into larger and larger ones until all points have been gathered together into one big cluster.

  – Toward the beginning of the process, the clusters are very small and very pure – the members of each cluster are few and closely related.

  – Toward the end of the process, the clusters are large and not as well defined.

• Steps:
  – Create a similarity (distance) matrix
  – Find the smallest value in the similarity matrix
  – Merge these two clusters into a new one and update the similarity matrix
  – Repeat the merge step N – 1 times

• How do we measure the distance between clusters? (a single-linkage sketch follows below)
  – Single linkage
  – Complete linkage
  – Centroid distance
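A minimal single-linkage sketch of these steps in Python (function names are illustrative; it stops at target_k clusters instead of merging all the way down to one):

    import math

    def single_linkage(c1, c2, points):
        # cluster-to-cluster distance = smallest pairwise distance between their members
        return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

    def agglomerative(points, target_k=1):
        # start with each data point forming its own cluster
        clusters = [[i] for i in range(len(points))]
        while len(clusters) > target_k:        # with target_k=1 this performs N - 1 merges
            # find the smallest entry of the (implicit) distance matrix
            i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                       key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]], points))
            clusters[i] = clusters[i] + clusters[j]    # merge the two closest clusters
            del clusters[j]
        return clusters

    pts = [(0.55, 0.175), (0.34, 0.25), (1.00, 1.00), (0.93, 0.85), (0.39, 0.20), (0.58, 0.25)]
    print(agglomerative(pts, target_k=2))
    # -> [[0, 5, 1, 4], [2, 3]]: {John, David, Rachel, Nellie} and {Hannah, Tom}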

Page 28: Ma Mru Dm Chapter11

28

Agglomerative Clustering – How do we measure the distance?

Page 29: Ma Mru Dm Chapter11

29

Agglomerative Clustering

Page 30: Ma Mru Dm Chapter11

30

Divisive Clustering

• A decision tree algorithm starts with the entire collection of records and looks for a way to split it into partitions that are purer, in some sense defined by a purity function.

• In the standard decision tree algorithms, the purity function uses a separate variable - the target variable - to make this decision.

• All that is required to turn decision trees into a clustering algorithm is to supply a purity function chosen to either minimize the average intra-cluster distance or maximize the inter-cluster distances.

Page 31: Ma Mru Dm Chapter11

31

Data Preparation for Clustering

Units of measurement are generally different!

• Scaling adjusts the values of variables to take into account the fact that different variables are measured in different units or over different ranges.

– For instance, household income is measured in tens of thousands of dollars and number of children in single digits.

– It is very important to scale different variables so their values fall roughly into the same range, by normalizing, indexing, or standardizing the values.

– Ways of scaling (a sketch follows after this list):

• Divide each variable by the range after subtracting the lowest value (0-1)

• Divide each variable by the mean of all the values it takes on (“indexing”)

• Subtract the mean value from each variable and then divide it by the standard deviation (standardization or “converting to z-scores”)

– A z-score tells you how many standard deviations away from the mean a value is.
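A minimal sketch of the three scaling options in Python (function names are illustrative; the income figures are made up):

    import statistics

    def min_max(values):
        # subtract the lowest value, then divide by the range -> every value falls in [0, 1]
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]

    def index_to_mean(values):
        # divide by the mean ("indexing"): 1.0 is average, 2.0 is twice the average
        mean = statistics.mean(values)
        return [v / mean for v in values]

    def z_score(values):
        # standardization: how many standard deviations each value lies from the mean
        mean, sd = statistics.mean(values), statistics.stdev(values)
        return [(v - mean) / sd for v in values]

    incomes = [38000, 52000, 47000, 120000, 61000]
    print(min_max(incomes))       # 0.0 for the lowest income, 1.0 for the highest
    print(index_to_mean(incomes))
    print(z_score(incomes))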

Page 32: Ma Mru Dm Chapter11

32

Data Preparation for Clustering

• What if we think that two families with the same income have more in common than two families on the same size plot, and we want that to be taken into consideration during clustering?

• Weighting provides a relative adjustment for a variable, because some variables are more important than others.

• The purpose of weighting is to encode the information that one variable is more (or less) important than others.

• The whole point of automatic cluster detection is to find clusters that make sense to you.

  – If, for your purposes, whether people have children is much more important than the number of credit cards they carry, there is no reason not to bias the outcome of the clustering by multiplying the number-of-children field by a higher weight than the number-of-credit-cards field (see the sketch below).
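For instance, a tiny hedged sketch of that weighting idea (the weight values are arbitrary, purely for illustration):

    # weight the number-of-children field more heavily than the number-of-credit-cards field
    # before computing distances, so differences in children dominate the clustering
    CHILDREN_WEIGHT, CARDS_WEIGHT = 5.0, 1.0    # arbitrary illustrative weights

    def weighted_record(num_children, num_credit_cards):
        return (num_children * CHILDREN_WEIGHT, num_credit_cards * CARDS_WEIGHT)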

Page 33: Ma Mru Dm Chapter11

33

Evaluating Clusters

• What does it mean to say that a cluster is “good”?

  – Clusters should have members with a high degree of similarity

  – The standard way to measure within-cluster similarity is variance*; the cluster with the lowest variance is considered best

  – Cluster size is also important, so an alternative approach is to use the average variance**

* The sum of the squared differences of each element from the mean
** The total variance divided by the size of the cluster
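A minimal sketch of the two footnoted measures in Python, assuming a cluster is a list of numeric records (function names are illustrative):

    def cluster_variance(cluster):
        # footnote *: sum of the squared differences of each member from the cluster mean
        mean = [sum(dim) / len(cluster) for dim in zip(*cluster)]
        return sum(sum((a - m) ** 2 for a, m in zip(point, mean)) for point in cluster)

    def average_variance(cluster):
        # footnote **: the total variance divided by the size of the cluster
        return cluster_variance(cluster) / len(cluster)

    cluster_a = [(0.55, 0.175), (0.34, 0.25), (0.39, 0.20), (0.58, 0.25)]
    print(cluster_variance(cluster_a), average_variance(cluster_a))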

Page 34: Ma Mru Dm Chapter11

34

Evaluating Clusters

• Finally, if cluster detection identifies good clusters along with weak ones, it can be useful to set the good ones aside (for further study) and run the analysis again, to see whether improved clusters emerge from only the weaker ones

Page 35: Ma Mru Dm Chapter11

35

Hard vs. Soft Clustering

Hard clustering puts each example in only one cluster.

Clusters are therefore disjoint (non-overlapping).

Soft clustering allows an example to be in multiple clusters with different probabilities.

Clusters may therefore overlap.

Page 36: Ma Mru Dm Chapter11

36

Applications of Clustering

• Marketing:
  – Customer segmentation (discovery of distinct groups of customers) for target marketing
  – Creating product differentiation: different offers for different segments (it is not always possible to offer personalization)

• Car insurance: identify customer groups with a high average claim cost

• Property: identify houses in the same city with similar characteristics

• Image recognition

• Creating document collections, or grouping web pages

Page 37: Ma Mru Dm Chapter11

37

Case Study: Clustering Towns (pp 374-379)

[Figure: diagram of the town clusters from the case study – Cluster 2, Cluster 1B, and Cluster 1AB – with the “best” and “2nd best” groups identified by delivery penetration.]

Page 38: Ma Mru Dm Chapter11

38

RapidMiner Practice

• To read:

– RapidMiner Tutorial 4 (see the Help menu)

• To practice:

– Inspect the example

– Run the process

– Interpret the results

– Try different options and interpret the results

Page 39: Ma Mru Dm Chapter11

39

RapidMiner Practice

Take the “Exercise.xls” file, and perform clustering with RapidMiner (# of clusters=2)

(Note: you should get the same results as those on slide #13)

Take the “Bank.arff” file, and perform clustering with RapidMiner

– Try different number of clusters and compare the results

– Try different cluster approaches and compare the results

(If necessary, change attributes type, select and transform / normalize attributes)

Page 40: Ma Mru Dm Chapter11

40

RapidMiner Practice

• Practice on previous examples and datasets

• Project discussions