dmtm 2015 - 06 introduction to clustering

Prof. Pier Luca Lanzi Clustering: Introduction Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)

Upload: pier-luca-lanzi

Post on 11-Aug-2015


Clustering: Introduction
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)

Readings

•  Mining of Massive Datasets (Chapter 7, Section 3.5)


Clustering algorithms group a collection of data points into “clusters” according to some distance measure

Data points in the same cluster should have a small distance from one another

Data points in different clusters should be at a large distance from one another.


Clustering finds “natural” grouping/structure in unlabeled data (Unsupervised Learning)


What is Cluster Analysis?

•  A cluster is a collection of data objects
   §  Similar to one another within the same cluster
   §  Dissimilar to the objects in other clusters

•  Cluster analysis
   §  Given a set of data points, try to understand their structure
   §  Finds similarities between data according to the characteristics found in the data
   §  Groups similar data objects into clusters
   §  It is unsupervised learning since there are no predefined classes

•  Typical applications
   §  Stand-alone tool to get insight into data
   §  Preprocessing step for other algorithms


Clustering Methods

•  Hierarchical vs. point assignment

•  Numeric and/or symbolic data

•  Deterministic vs. probabilistic

•  Exclusive vs. overlapping

•  Hierarchical vs. flat

•  Top-down vs. bottom-up


Clustering Applications

•  Marketing
   §  Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

•  Land use
   §  Identification of areas of similar land use in an earth observation database

•  Insurance
   §  Identifying groups of motor insurance policy holders with a high average claim cost

•  City planning
   §  Identifying groups of houses according to their house type, value, and geographical location

•  Earthquake studies
   §  Observed earthquake epicenters should be clustered along continental faults


What Is Good Clustering?

•  A good clustering consists of high-quality clusters with
   §  High intra-class similarity
   §  Low inter-class similarity

•  The quality of a clustering result depends on both the similarity measure used by the method and its implementation

•  The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

•  Evaluation
   §  Various measures of intra/inter-cluster similarity
   §  Manual inspection
   §  Benchmarking on existing labels


Measure the Quality of Clustering

•  Dissimilarity/Similarity metric: similarity is expressed in terms of a distance function d(i, j), which is typically a metric

•  There is a separate “quality” function that measures the “goodness” of a cluster

•  The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables

•  Weights should be associated with different variables based on applications and data semantics

•  It is hard to define “similar enough” or “good enough” as the answer is typically highly subjective


Data Structures

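Data Matrix

Outlook    Temp   Humidity   Windy   Play
Sunny      Hot    High       False   No
Sunny      Hot    High       True    No
Overcast   Hot    High       False   Yes
…          …      …          …       …

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

Dis/Similarity Matrix

$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots &        &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$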
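The relation between the two structures can be sketched in a few lines of Python. This is only an illustration: the helper names `dissimilarity_matrix` and `euclidean` and the three sample points are assumptions made here, not part of the slides. Only the lower triangle is stored, since d(i, j) = d(j, i) and d(i, i) = 0.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dissimilarity_matrix(data, dist):
    """Build the lower-triangular dissimilarity matrix from a data matrix.

    `data` is a list of n rows (objects); row i of the result holds
    d(i, 0), ..., d(i, i - 1) followed by the zero on the diagonal.
    """
    matrix = []
    for i in range(len(data)):
        row = [dist(data[i], data[j]) for j in range(i)] + [0.0]
        matrix.append(row)
    return matrix


# Hypothetical data matrix: 3 objects described by 2 numeric variables.
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = dissimilarity_matrix(data, euclidean)
for row in D:
    print(row)
```

Any other distance function with the same two-argument signature could be passed in place of `euclidean`.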


Type of Data in Clustering Analysis

•  Interval-scaled variables
•  Binary variables
•  Nominal, ordinal, and ratio variables
•  Variables of mixed types


Distance Measures


•  Given a space and a set of points in this space, a distance measure d(x,y) maps two points x and y to a real number and satisfies four axioms

•  d(x,y) ≥ 0

•  d(x,y) = 0 if and only if x = y

•  d(x,y) = d(y,x)

•  d(x,y) ≤ d(x,z) + d(z,y)


Euclidean Distances

2. d(x, y) = 0 if and only if x = y (distances are positive, except for the distance from a point to itself).

3. d(x, y) = d(y, x) (distance is symmetric).

4. d(x, y) ≤ d(x, z) + d(z, y) (the triangle inequality).

The triangle inequality is the most complex condition. It says, intuitively, that to travel from x to y, we cannot obtain any benefit if we are forced to travel via some particular third point z. The triangle-inequality axiom is what makes all distance measures behave as if distance describes the length of a shortest path from one point to another.

3.5.2 Euclidean Distances

The most familiar distance measure is the one we normally think of as “distance.” An n-dimensional Euclidean space is one where points are vectors of n real numbers. The conventional distance measure in this space, which we shall refer to as the L2-norm, is defined:

$$
d([x_1, x_2, \ldots, x_n], [y_1, y_2, \ldots, y_n]) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
$$

That is, we square the distance in each dimension, sum the squares, and take the positive square root.

It is easy to verify the first three requirements for a distance measure are satisfied. The Euclidean distance between two points cannot be negative, because the positive square root is intended. Since all squares of real numbers are nonnegative, any i such that xi ≠ yi forces the distance to be strictly positive. On the other hand, if xi = yi for all i, then the distance is clearly 0. Symmetry follows because (xi − yi)² = (yi − xi)². The triangle inequality requires a good deal of algebra to verify. However, it is well understood to be a property of Euclidean space: the sum of the lengths of any two sides of a triangle is no less than the length of the third side.

There are other distance measures that have been used for Euclidean spaces. For any constant r, we can define the Lr-norm to be the distance measure d defined by:

$$
d([x_1, x_2, \ldots, x_n], [y_1, y_2, \ldots, y_n]) = \left( \sum_{i=1}^{n} |x_i - y_i|^r \right)^{1/r}
$$

The case r = 2 is the usual L2-norm just mentioned. Another common distance measure is the L1-norm, or Manhattan distance. There, the distance between two points is the sum of the magnitudes of the differences in each dimension. It is called “Manhattan distance” because it is the distance one would have to

•  Lr-norm

•  Euclidean distance (r=2)

•  Manhattan distance (r=1)

•  L∞-norm
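The norms listed above can be written as a short Python sketch. The function names `lr_distance` and `linf_distance` and the sample points are assumptions chosen for illustration. The L∞-norm is taken here as the limit of the Lr-norm as r grows, which is the largest per-dimension difference.

```python
def lr_distance(x, y, r):
    """L_r-norm distance: (sum_i |x_i - y_i|^r)^(1/r)."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


def linf_distance(x, y):
    """L_infinity-norm distance: the maximum difference over all dimensions."""
    return max(abs(a - b) for a, b in zip(x, y))


# Hypothetical points; the per-dimension differences are 3 and 4.
x, y = (1, 2), (4, 6)
manhattan = lr_distance(x, y, 1)   # 3 + 4 = 7
euclid = lr_distance(x, y, 2)      # sqrt(9 + 16) = 5
chebyshev = linf_distance(x, y)    # max(3, 4) = 4
print(manhattan, euclid, chebyshev)
```

As r increases, the largest difference dominates the sum, so `lr_distance` approaches `linf_distance`.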


Jaccard Distance

•  Jaccard distance is defined as d(x,y) = 1 - SIM(x,y), where SIM is the Jaccard similarity, SIM(x,y) = |x ∩ y| / |x ∪ y|

•  The Jaccard similarity can also be interpreted as the percentage of identical attributes
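A minimal Python sketch of the Jaccard distance on sets follows. The function name and the sample sets are illustrative assumptions, and so is the convention that two empty sets are at distance 0.

```python
def jaccard_distance(x, y):
    """Jaccard distance 1 - |x ∩ y| / |x ∪ y| between two collections."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(x & y) / len(union)


# Hypothetical sets: 2 shared elements out of 4 distinct ones.
d = jaccard_distance({"a", "b", "c"}, {"b", "c", "d"})
print(d)
```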


Cosine Distance

•  The cosine distance between x, y is the angle that the vectors to those points make

•  This angle will be in the range 0 to 180 degrees, regardless of how many dimensions the space has.

•  Example: given x = (1,2,-1) and y = (2,1,1), the dot product is 3 and both vectors have length √6, so the cosine is 3/6 = 0.5 and the angle between the two vectors is 60 degrees
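The example can be checked with a short Python sketch. The function name `cosine_distance_degrees` is an assumption made for illustration, and the code assumes neither vector is the zero vector.

```python
import math


def cosine_distance_degrees(x, y):
    """Angle in degrees between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cos_angle = dot / (norm_x * norm_y)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


angle = cosine_distance_degrees((1, 2, -1), (2, 1, 1))
print(round(angle, 6))
```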


Edit Distance

•  Used when the data points are strings

•  The distance between a string x=x1x2…xn and y=y1y2…ym is the smallest number of insertions and deletions of single characters that will transform x into y

•  Alternatively, the edit distance d(x, y) can be computed from the longest common subsequence (LCS) of x and y as d(x,y) = |x| + |y| - 2|LCS|

•  Example: the edit distance between x=abcde and y=acfdeg is 3 (delete b, insert f, insert g); the LCS is acde, which is consistent with this result since 5 + 6 - 2·4 = 3
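The LCS formulation can be sketched in Python with a standard dynamic-programming table. The function names are illustrative assumptions. Note that this is the insertion/deletion-only edit distance described in the slide, not the variant that also allows substitutions.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of strings x and y."""
    prev = [0] * (len(y) + 1)  # LCS lengths for the previous prefix of x
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(y)]


def edit_distance(x, y):
    """Insert/delete edit distance: |x| + |y| - 2|LCS|."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


print(lcs_length("abcde", "acfdeg"), edit_distance("abcde", "acfdeg"))
```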


Hamming Distance

•  Hamming distance between two vectors is the number of components in which they differ

•  Or equivalently, given the number of variables n and the number m of matching components, the number of differing components is d(x,y) = n - m

•  Example: the Hamming distance between the vectors 10101 and 11110 is 3.
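A minimal Python sketch that counts the differing components is shown here. The function name is an assumption made for illustration, and the vectors are represented as strings of bits for convenience.

```python
def hamming_distance(x, y):
    """Number of positions in which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires equal-length vectors")
    return sum(1 for a, b in zip(x, y) if a != b)


d = hamming_distance("10101", "11110")
print(d)
```

For these vectors, the components agree in positions 1 and 3 and differ in positions 2, 4, and 5.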


Ordinal Variables

•  An ordinal variable can be discrete or continuous
•  Order is important, e.g., rank
•  It can be treated as an interval-scaled variable
   §  Replace x_if with its rank r_if ∈ {1, …, M_f}
   §  Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by z_if = (r_if - 1) / (M_f - 1)
   §  Compute the dissimilarity using methods for interval-scaled variables
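The rank-and-rescale procedure can be sketched in Python, assuming the common normalization z = (r - 1) / (M - 1), where r is the rank of a value and M is the number of ordered levels of the variable. The function name and the three-level example are hypothetical.

```python
def ordinal_to_interval(values, ordered_levels):
    """Map ordinal values onto [0, 1] via their ranks.

    `ordered_levels` lists the possible levels from lowest to highest.
    """
    m = len(ordered_levels)
    if m < 2:
        raise ValueError("need at least two ordered levels")
    # Ranks start at 1 for the lowest level.
    rank = {level: r for r, level in enumerate(ordered_levels, start=1)}
    return [(rank[v] - 1) / (m - 1) for v in values]


# Hypothetical ordinal variable with three levels.
z = ordinal_to_interval(["low", "high", "medium"], ["low", "medium", "high"])
print(z)
```

The rescaled values can then be fed to any distance for interval-scaled variables, such as the Euclidean or Manhattan distance.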


Requirements of Clustering in Data Mining

•  Scalability
•  Ability to deal with different types of attributes
•  Ability to handle dynamic data
•  Discovery of clusters with arbitrary shape
•  Minimal requirements for domain knowledge to determine input parameters
•  Ability to deal with noise and outliers
•  Insensitivity to the order of input records
•  High dimensionality
•  Incorporation of user-specified constraints
•  Interpretability and usability


Curse of Dimensionality

In high dimensions, almost all pairs of points are equally far away from one another

Almost any two vectors are almost orthogonal
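Both effects can be observed with a small simulation. The Python sketch below is an illustration under assumptions made here, not part of the slides: points are drawn uniformly at random, the helper names are hypothetical, and the sample sizes and seed are arbitrary. It compares the relative spread of pairwise distances (standard deviation divided by mean) and the average absolute cosine between random vectors in low and high dimensions.

```python
import math
import random


def relative_spread(dim, n_points=40, seed=0):
    """Std/mean of pairwise Euclidean distances of random points in [0, 1]^dim."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = []
    for i, p in enumerate(pts):
        for q in pts[i + 1:]:
            dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))))
    mean = sum(dists) / len(dists)
    var = sum((d - mean) ** 2 for d in dists) / len(dists)
    return math.sqrt(var) / mean


def mean_abs_cosine(dim, n_pairs=100, seed=0):
    """Average |cos(angle)| between random vectors with entries in [-1, 1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        u = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        v = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        total += abs(dot) / (norm_u * norm_v)
    return total / n_pairs


for dim in (2, 500):
    print(dim, relative_spread(dim), mean_abs_cosine(dim))
```

A relative spread close to 0 means that all pairs are at nearly the same distance, and a mean absolute cosine close to 0 means that the vectors are nearly orthogonal.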