Chapter 11 Automatic Cluster Detection


Page 1: Ma Mru Dm Chapter11

Chapter 11: Automatic Cluster Detection

Page 2: Ma Mru Dm Chapter11

2

Data Mining Techniques So Far…

• Chapter 5 – Statistics

• Chapter 6 – Decision Trees

• Chapter 7 – Neural Networks

• Chapter 8 – Nearest Neighbor Approaches: Memory-Based Reasoning and Collaborative Filtering

• Chapter 9 – Market Basket Analysis & Association Rules

• Chapter 10 – Link Analysis

Page 3: Ma Mru Dm Chapter11

3

Automatic Cluster Detection

• DM techniques used to find patterns in data

  – Not always easy to identify
    • No observable pattern
    • Too many patterns
    • Decomposition (break down into smaller pieces) [example: Olympics]

• Automatic Cluster Detection is useful to find “better behaved” clusters of data within a larger dataset; seeing the forest without getting lost in the trees

Page 4: Ma Mru Dm Chapter11

4

Automatic Cluster Detection

• K-means clustering algorithm – similar to nearest neighbor techniques (memory-based reasoning and collaborative filtering) in that it depends on a geometric interpretation of the data

• Other automatic cluster detection (ACD) algorithms include:
  – Gaussian mixture models
  – Agglomerative clustering
  – Divisive clustering
  – Self-organizing maps (SOM) – see Ch. 7, Neural Networks

• ACD is a tool used primarily for undirected data mining
  – No preclassified training data set
  – No distinction between independent and dependent variables

• When used for directed data mining:
  – Marketing clusters are referred to as “segments”
  – Customer segmentation is a popular application of clustering

• ACD is rarely used in isolation – other methods usually follow up

Page 5: Ma Mru Dm Chapter11

5

Clustering Examples

• “Star Power” – the Hertzsprung–Russell diagram (~1910), which plots stars by temperature and brightness

• Group of Teens

• 1990s US Army – women’s uniforms:
  • 100 measurements for each of 3,000 women
  • Using the K-means algorithm, these were reduced to a handful of clusters

Page 6: Ma Mru Dm Chapter11

6

K-means Clustering

• “K” – circa 1967 – this algorithm looks for a fixed number (K) of clusters, which are defined in terms of the proximity of data points to each other

• How K-means works (see the figures on the next slides and the sketch below):

  – The algorithm selects K (e.g., 3) data points at random as the initial cluster seeds

  – It assigns each of the remaining data points to the nearest seed (the boundaries between clusters are the perpendicular bisectors of the lines between seeds)

  – It then calculates the centroid of each cluster (using the averages of the points in each cluster)

  – At each subsequent iteration, all cluster assignments are reevaluated against the new centroids
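A minimal sketch of this procedure in Python (the names euclidean and kmeans are illustrative, not the book's implementation):

    import math
    import random

    def euclidean(p, q):
        # straight-line distance between two equal-length numeric records
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def kmeans(points, k, seeds=None, max_iter=100):
        # 1. choose K data points as the initial cluster seeds (randomly if none are given)
        centroids = list(seeds) if seeds else random.sample(points, k)
        assignment = None
        for _ in range(max_iter):
            # 2. assign every point to the cluster whose centroid is nearest
            new_assignment = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                              for p in points]
            if new_assignment == assignment:   # assignments stable -> converged
                break
            assignment = new_assignment
            # 3. recompute each centroid as the average of its current members
            for c in range(k):
                members = [p for p, a in zip(points, assignment) if a == c]
                if members:
                    centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
        return centroids, assignment

The worked example on the following slides can be reproduced with this sketch; see the note after slide 13.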

Page 7: Ma Mru Dm Chapter11

7

K-means algorithm example

Page 8: Ma Mru Dm Chapter11

8

K-means algorithm example

[Figure: four scatter plots (both axes 0–10) illustrating the successive K-means iterations described on slide 6.]

Page 9: Ma Mru Dm Chapter11

9

K-means algorithm example

Customer   Age    Income (K)
John       0.55   0.175
Rachel     0.34   0.25
Hannah     1.00   1.00
Tom        0.93   0.85
Nellie     0.39   0.20
David      0.58   0.25

[Figure: the six customers plotted by Age and Income.]

Note: both Age and Income are normalized.

Page 10: Ma Mru Dm Chapter11

10

K-means algorithm example

Nellie and David are selected as cluster centers A and B, respectively.

Customer   Distance from David   Distance from Nellie
John       0.08                  0.16
Rachel     0.24                  0.07
Hannah     0.86                  1.01
Tom        0.69                  0.85
Nellie     –                     –
David      –                     –

[Figure: the customers plotted by Age and Income, with the seeds labeled A (Nellie) and B (David).]

Page 11: Ma Mru Dm Chapter11

11

K-means algorithm example

Calculate the cluster centers:
  – Cluster A center: Age = 0.37, Income = 0.23
  – Cluster B center: Age = 0.77, Income = 0.57

Assign customers to clusters based on the new cluster centers.

[Figure: the customers plotted by Age and Income, with the recomputed centers A and B.]

Page 12: Ma Mru Dm Chapter11

12

K-means algorithm example

Customer   Distance from A   Distance from B
John       0.19              0.45
Rachel     0.04              0.54
Hannah     0.99              0.49
Tom        0.84              0.32
Nellie     0.04              0.53
David      0.21              0.37

[Figure: the customers plotted by Age and Income, reassigned to the clusters around centers A and B.]

Page 13: Ma Mru Dm Chapter11

13

K-means algorithm example

Calculate the cluster centers again:
  – Cluster A center: Age = 0.47, Income = 0.22
  – Cluster B center: Age = 0.97, Income = 0.93

• The cluster assignments no longer change, so the algorithm has converged.

[Figure: the customers plotted by Age and Income, showing the final clusters A and B.]
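As a cross-check (not part of the original slides), running the kmeans sketch from slide 6 on the six normalized customers, seeded with Nellie and David as on slide 10, converges to the same clusters and centers:

    customers = {
        "John":   [0.55, 0.175],
        "Rachel": [0.34, 0.25],
        "Hannah": [1.00, 1.00],
        "Tom":    [0.93, 0.85],
        "Nellie": [0.39, 0.20],
        "David":  [0.58, 0.25],
    }
    points = list(customers.values())

    # seed cluster A with Nellie and cluster B with David, as on slide 10
    centers, labels = kmeans(points, k=2, seeds=[customers["Nellie"], customers["David"]])

    # centers ~ [0.465, 0.219] and [0.965, 0.925] -> the slide's (0.47, 0.22) and (0.97, 0.93)
    # cluster A: John, Rachel, Nellie, David   cluster B: Hannah, Tom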

Page 14: Ma Mru Dm Chapter11

14

K-means Clustering: Right or Wrong?

• The resulting clusters describe underlying structure in the data; however, there is no one right description of that structure (example: Figure 11.6 – playing cards clustered with K=2 vs. K=4)

Page 15: Ma Mru Dm Chapter11

15

K-means Clustering Demo

• Clustering demo:

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

Page 16: Ma Mru Dm Chapter11

16

Similarity & Difference

• Automatic Cluster Detection is quite simple for a software program to accomplish – data points and clusters are simply mapped in space

• However, business data are not points in space; they describe purchases, phone calls, airplane trips, car registrations, etc., which have no obvious connection to the dots in a cluster diagram

Page 17: Ma Mru Dm Chapter11

17

Similarity & Difference

• Clustering business data requires some notion of natural association – records in a given cluster are more similar to each other than to records in another cluster

• For DM software, this concept of association must be translated into some numeric measure of the degree of similarity

• The most common translation is to convert data values (e.g., gender, age, product) into numeric values so the records can be treated as points in space

• If two points are close in the geometric sense, they represent similar records in the database

Page 18: Ma Mru Dm Chapter11

18

Similarity & Difference

• Business variable (field) types:
  – Categorical (e.g., mint, cherry, chocolate)
  – Ranks (e.g., freshman, sophomore, … or valedictorian, salutatorian)
  – Intervals (e.g., 56 degrees, 72 degrees)
  – True measures – interval variables measured from a meaningful zero point
    • Fahrenheit and Celsius temperatures are not good examples
    • Age, weight, height, length, and tenure are good examples

• From a geometric standpoint, the variable types above go from least effective to most effective (top to bottom)

• Finally, there are dozens (if not hundreds) of published techniques for measuring the similarity of two data records

Page 19: Ma Mru Dm Chapter11

19

Similarity & Difference

• Two main problems with this approach:

  – Many variable types, including all categorical variables and many numeric variables such as rankings, do not have the right behavior to be properly treated as components of a position vector.

  – In geometry, the contributions of each dimension are of equal importance, but in databases, a small change in one field may be much more important than a large change in another field.

• Hence, other approaches are needed!

Page 20: Ma Mru Dm Chapter11

20

Formal Measures of Similarity

• Euclidean distance

  D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

• Manhattan distance

  D(X, Y) = \sum_{i=1}^{n} |x_i - y_i|

  – Sometimes preferred to the Euclidean distance because the distances along each axis are not squared, so it is less likely that a large difference in one dimension will dominate the total distance
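A small sketch of both measures in Python (the function names are illustrative); the John/Rachel records from the next slide are used as the example:

    import math

    def euclidean_distance(x, y):
        # D(X, Y) = sqrt( sum_i (x_i - y_i)^2 )
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def manhattan_distance(x, y):
        # D(X, Y) = sum_i |x_i - y_i|
        return sum(abs(a - b) for a, b in zip(x, y))

    # John and Rachel from the next slide: (age, income in $K, number of credit cards)
    john, rachel = (35, 95, 3), (41, 215, 2)
    print(euclidean_distance(john, rachel))   # ~120.15 -- dominated by the income difference
    print(manhattan_distance(john, rachel))   # 127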

Page 21: Ma Mru Dm Chapter11

21

Formal Measures of Similarity

Euclidean distance:

  D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

John:   Age = 35, Income = 95K, no. of credit cards = 3
Rachel: Age = 41, Income = 215K, no. of credit cards = 2

Distance(John, Rachel) = sqrt[(35 − 41)² + (95 − 215)² + (3 − 2)²] ≈ 120.2

Page 22: Ma Mru Dm Chapter11

22

Formal Measures of Similarity

• Angle between Two Vectors

  – A vector has both magnitude (the distance from the origin to the point) and direction.

  – For this similarity measure, it is the direction that matters.

  – The angle between two vectors provides a measure of association that is not influenced by differences in magnitude between the two things being compared.

• The cosine of the angle measures correlation:
  – it is 1 when the vectors are parallel (perfectly correlated)
  – it is 0 when they are orthogonal
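A minimal sketch of this measure (cosine similarity) in Python, assuming plain numeric lists as vectors (the function name is illustrative):

    import math

    def cosine_similarity(x, y):
        # cos(angle) = (x . y) / (|x| * |y|): 1 = parallel vectors, 0 = orthogonal vectors
        dot = sum(a * b for a, b in zip(x, y))
        norm_x = math.sqrt(sum(a * a for a in x))
        norm_y = math.sqrt(sum(b * b for b in y))
        return dot / (norm_x * norm_y)

    print(cosine_similarity([1, 2, 3], [2, 4, 6]))   # ~1.0 (same direction, different magnitude)
    print(cosine_similarity([1, 0], [0, 1]))         # 0.0 (orthogonal)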

Page 23: Ma Mru Dm Chapter11

23

Formal Measures of Similarity

• Number of Features in Common

  – For categorical variables, metric measures are not the best choice.

  – A better measure is based on the degree of overlap between records.

  – As with the geometric measures, there are many variations on this idea.

  – In all variations, the two records are compared field by field to determine the number of fields that match and the number that do not.

  – The simplest measure is the ratio of matches to the total number of fields (a sketch follows below).
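One possible sketch of that simplest measure in Python (the field values are illustrative, not from the book):

    def overlap_similarity(record_a, record_b):
        # ratio of fields with identical values to the total number of fields
        matches = sum(1 for a, b in zip(record_a, record_b) if a == b)
        return matches / len(record_a)

    a = ("female", "chocolate", "student", "urban")
    b = ("female", "cherry",    "student", "rural")
    print(overlap_similarity(a, b))   # 0.5 -- two of the four categorical fields match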

Page 24: Ma Mru Dm Chapter11

24

Strengths and Weaknesses of K-Means

• Strengths
  – Relatively efficient
  – Simple implementation

• Weaknesses
  – Need to specify k, the number of clusters, in advance
    • Can use hierarchical clustering to find a starting point
    • Or try several different values of k
  – Unable to handle noisy data and outliers well
  – Euclidean distance may not make a lot of sense for categorical variable types:
    • Categorical (e.g., gender, racial background)
    • Ranks (e.g., income levels (H, M, L))
    • Continuous (e.g., age, number of orders)

Page 25: Ma Mru Dm Chapter11

25

Other Approaches to Cluster Detection

• Gaussian Mixture Models

• Agglomerative Clustering

• Divisive Clustering

• Self-Organizing Maps (SOM) [Chapter 7]

Page 26: Ma Mru Dm Chapter11

26

Gaussian Mixture Models

• The K-means method as described has some drawbacks:
  – It does not do well with overlapping clusters.
  – The clusters are easily pulled off-center by outliers.
  – Each record is either inside or outside of a given cluster.

• Differences from K-means:
  – The seeds are considered to be the means of Gaussian distributions.
  – The algorithm proceeds by iterating over two steps, called the estimation step and the maximization step.
  – At the end of the process, each point is tied to the various clusters with higher or lower probability.

• This is sometimes called soft clustering, because points are not uniquely identified with a single cluster (see the sketch below).
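As an illustration only (the slides do not prescribe a library), scikit-learn's GaussianMixture can be fit to the six customers from the K-means example to obtain the per-cluster membership probabilities that make this a soft clustering:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # the six normalized customers (age, income) from the K-means example
    X = np.array([[0.55, 0.175], [0.34, 0.25], [1.00, 1.00],
                  [0.93, 0.85],  [0.39, 0.20], [0.58, 0.25]])

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # estimation/maximization iterations
    probs = gmm.predict_proba(X)   # one row per customer: probability of belonging to each cluster
    print(np.round(probs, 2))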

Page 27: Ma Mru Dm Chapter11

27

Agglomerative Clustering

• Differences from K-means:

  – Start out with each data point forming its own cluster and gradually merge clusters into larger and larger ones until all points have been gathered together into one big cluster.

  – Toward the beginning of the process, the clusters are very small and very pure – the members of each cluster are few and closely related.

  – Toward the end of the process, the clusters are large and not as well defined.

• Steps:
  – Create a similarity (distance) matrix
  – Find the smallest value in the similarity matrix
  – Merge these two clusters into a new one and update the similarity matrix
  – Repeat the merge step N – 1 times

• How do we measure the distance between clusters? (a single-linkage sketch follows below)
  – Single linkage
  – Complete linkage
  – Centroid distance
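A minimal single-linkage sketch of these steps in Python (function names are illustrative; it stops at target_k clusters instead of merging all the way down to one):

    import math

    def single_linkage(c1, c2, points):
        # cluster-to-cluster distance = smallest pairwise distance between their members
        return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

    def agglomerative(points, target_k=1):
        # start with each data point forming its own cluster
        clusters = [[i] for i in range(len(points))]
        while len(clusters) > target_k:        # with target_k=1 this performs N - 1 merges
            # find the smallest entry of the (implicit) distance matrix
            i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                       key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]], points))
            clusters[i] = clusters[i] + clusters[j]    # merge the two closest clusters
            del clusters[j]
        return clusters

    pts = [(0.55, 0.175), (0.34, 0.25), (1.00, 1.00), (0.93, 0.85), (0.39, 0.20), (0.58, 0.25)]
    print(agglomerative(pts, target_k=2))
    # -> [[0, 5, 1, 4], [2, 3]]: {John, David, Rachel, Nellie} and {Hannah, Tom}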

Page 28: Ma Mru Dm Chapter11

28

Agglomerative Clustering – How do we measure the distance?

Page 29: Ma Mru Dm Chapter11

29

Agglomerative Clustering

Page 30: Ma Mru Dm Chapter11

30

Divisive Clustering

• A decision tree algorithm starts with the entire collection of records and looks for a way to split it into partitions that are purer, in some sense defined by a purity function.

• In the standard decision tree algorithms, the purity function uses a separate variable - the target variable - to make this decision.

• All that is required to turn decision trees into a clustering algorithm is to supply a purity function chosen to either minimize the average intra-cluster distance or maximize the inter-cluster distances.

Page 31: Ma Mru Dm Chapter11

31

Data Preparation for Clustering

Units of measurement are generally different!

• Scaling adjusts the values of variables to take into account the fact that different variables are measured in different units or over different ranges.

– For instance, household income is measured in tens of thousands of dollars and number of children in single digits.

– It is very important to scale different variables so their values fall roughly into the same range, by normalizing, indexing, or standardizing the values.

– Ways of scaling (a sketch follows after this list):

• Divide each variable by the range after subtracting the lowest value (0-1)

• Divide each variable by the mean of all the values it takes on (“indexing”)

• Subtract the mean value from each variable and then divide it by the standard deviation (standardization or “converting to z-scores”)

– A z-score tells you how many standard deviations away from the mean a value is.
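A minimal sketch of the three scaling options in Python (function names are illustrative; the income figures are made up):

    import statistics

    def min_max(values):
        # subtract the lowest value, then divide by the range -> every value falls in [0, 1]
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]

    def index_to_mean(values):
        # divide by the mean ("indexing"): 1.0 is average, 2.0 is twice the average
        mean = statistics.mean(values)
        return [v / mean for v in values]

    def z_score(values):
        # standardization: how many standard deviations each value lies from the mean
        mean, sd = statistics.mean(values), statistics.stdev(values)
        return [(v - mean) / sd for v in values]

    incomes = [38000, 52000, 47000, 120000, 61000]
    print(min_max(incomes))       # 0.0 for the lowest income, 1.0 for the highest
    print(index_to_mean(incomes))
    print(z_score(incomes))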

Page 32: Ma Mru Dm Chapter11

32

Data Preparation for Clustering

• What if we think that two families with the same income have more in common than two families on the same size plot, and we want that to be taken into consideration during clustering?

• Weighting provides a relative adjustment for a variable, because some variables are more important than others.

• The purpose of weighting is to encode the information that one variable is more (or less) important than others.

• The whole point of automatic cluster detection is to find clusters that make sense to you.

  – If, for your purposes, whether people have children is much more important than the number of credit cards they carry, there is no reason not to bias the outcome of the clustering by multiplying the number-of-children field by a higher weight than the number-of-credit-cards field (see the sketch below).
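For instance, a tiny hedged sketch of that weighting idea (the weight values are arbitrary, purely for illustration):

    # weight the number-of-children field more heavily than the number-of-credit-cards field
    # before computing distances, so differences in children dominate the clustering
    CHILDREN_WEIGHT, CARDS_WEIGHT = 5.0, 1.0    # arbitrary illustrative weights

    def weighted_record(num_children, num_credit_cards):
        return (num_children * CHILDREN_WEIGHT, num_credit_cards * CARDS_WEIGHT)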

Page 33: Ma Mru Dm Chapter11

33

Evaluating Clusters

• What does it mean to say that a cluster is “good”?

  – Clusters should have members with a high degree of similarity

  – The standard way to measure within-cluster similarity is variance*; the cluster with the lowest variance is considered best

  – Cluster size is also important, so an alternative approach is to use the average variance**

* The sum of the squared differences of each element from the mean
** The total variance divided by the size of the cluster
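A minimal sketch of the two footnoted measures in Python, assuming a cluster is a list of numeric records (function names are illustrative):

    def cluster_variance(cluster):
        # footnote *: sum of the squared differences of each member from the cluster mean
        mean = [sum(dim) / len(cluster) for dim in zip(*cluster)]
        return sum(sum((a - m) ** 2 for a, m in zip(point, mean)) for point in cluster)

    def average_variance(cluster):
        # footnote **: the total variance divided by the size of the cluster
        return cluster_variance(cluster) / len(cluster)

    cluster_a = [(0.55, 0.175), (0.34, 0.25), (0.39, 0.20), (0.58, 0.25)]
    print(cluster_variance(cluster_a), average_variance(cluster_a))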

Page 34: Ma Mru Dm Chapter11

34

Evaluating Clusters

• Finally, if cluster detection identifies good clusters along with weak ones, it can be useful to set the good ones aside (for further study) and run the analysis again, to see whether improved clusters emerge from only the weaker ones

Page 35: Ma Mru Dm Chapter11

35

Hard vs. Soft Clustering

Hard clustering puts each example in only one cluster.

Clusters are therefore disjoint (non-overlapping).

Soft clustering allows an example to be in multiple clusters with different probabilities.

Clusters may therefore overlap.

Page 36: Ma Mru Dm Chapter11

36

Applications of Clustering

• Marketing:
  – Customer segmentation (discovery of distinct groups of customers) for target marketing
  – Creating product differentiation: different offers for different segments (it is not always possible to offer personalization)

• Car insurance: identify customer groups with a high average claim cost

• Property: identify houses in the same city with similar characteristics

• Image recognition

• Creating document collections, or grouping web pages

Page 37: Ma Mru Dm Chapter11

37

Case Study: Clustering Towns (pp 374-379)

[Figure: diagram of the town clusters from the case study – Cluster 2, Cluster 1B, and Cluster 1AB – with the “best” and “2nd best” groups identified by delivery penetration.]

Page 38: Ma Mru Dm Chapter11

38

RapidMiner Practice

• To read:

– RapidMiner Tutorial 4 (see the Help menu)

• To practice:

– Inspect the example

– Run the process

– Interpret the results

– Try different options and interpret the results

Page 39: Ma Mru Dm Chapter11

39

RapidMiner Practice

Take the “Exercise.xls” file, and perform clustering with RapidMiner (# of clusters=2)

(Note: you should get the same results as those on slide #13)

Take the “Bank.arff” file, and perform clustering with RapidMiner

– Try different number of clusters and compare the results

– Try different cluster approaches and compare the results

(If necessary, change attributes type, select and transform / normalize attributes)

Page 40: Ma Mru Dm Chapter11

40

RapidMiner Practice

• Practice on previous examples and datasets

• Project discussions