CSE 634 Data Mining Techniques
CLUSTERING Part 2 (Group no: 1)
By: Anushree Shibani Shivaprakash & Fatima Zarinni
Spring 2006
Professor Anita Wasilewska
SUNY Stony Brook
References
Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 8). Morgan Kaufmann, 2002.
M. Ester, H.P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96. http://ifsc.ualr.edu/xwxu/publications/kdd-96.pdf
How to explain hierarchical clustering. http://www.analytictech.com/networks/hiclus.htm
Tian Zhang, Raghu Ramakrishnan, Miron Livny. BIRCH: An efficient data clustering method for very large databases.
Margaret H. Dunham. Data Mining.
http://cs.sunysb.edu/~cse634/ Presentation 9 – Cluster Analysis
Introduction
Major clustering methods
1. Partitioning methods
2. Hierarchical methods
3. Density-based methods
4. Grid-based methods
Hierarchical methods
Here we group data objects into a tree of clusters.
There are two types of hierarchical clustering
1. Agglomerative hierarchical clustering
2. Divisive hierarchical clustering
Agglomerative hierarchical clustering
Groups data objects in a bottom-up fashion.
Initially each data object is in its own cluster.
Then we merge these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied.
A user can specify the desired number of clusters as a termination condition.
Divisive hierarchical clustering
Groups data objects in a top-down fashion.
Initially all data objects are in one cluster.
We then subdivide the cluster into smaller and smaller clusters, until each object forms a cluster on its own or until certain termination conditions are satisfied, such as a desired number of clusters being obtained.
AGNES & DIANA
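Application of AGNES (AGglomerative NESting) and DIANA (DIvisive ANAlysis) to a data set of five objects, {a, b, c, d, e}.
[Figure: agglomerative (AGNES) runs from Step 0 to Step 4, merging the single objects a, b, c, d, e into {a, b}, {d, e}, {c, d, e} and finally {a, b, c, d, e}. Divisive (DIANA) runs over the same diagram in the opposite direction, from its Step 0 (all objects in one cluster) to its Step 4 (single objects).]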
AGNES-Explored
Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of Johnson's (1967) hierarchical clustering is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.
AGNES (Cont.)
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
Step 3 can be done in different ways, which is what distinguishes single-link from complete-link and average-link clustering.
Similarity/Distance metrics
In each case the distance between two clusters is measured from any member of one cluster to any member of the other cluster:
single-link clustering: distance = shortest such distance
complete-link clustering: distance = longest such distance
average-link clustering: distance = average of all such distances
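The merge loop and the three linkage rules can be sketched in a few lines of Python. This is only a minimal illustration under assumptions of our own: the function names, the `num_clusters` stopping rule and the toy 1-D data are hypothetical, and the brute-force search over all cluster pairs is chosen for readability, not speed.

```python
from itertools import combinations

def cluster_distance(c1, c2, dist, linkage="single"):
    """Distance between two clusters under the chosen linkage rule."""
    pair_dists = [dist[i][j] for i in c1 for j in c2]
    if linkage == "single":
        return min(pair_dists)                    # shortest distance
    if linkage == "complete":
        return max(pair_dists)                    # longest distance
    if linkage == "average":
        return sum(pair_dists) / len(pair_dists)  # average distance
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(dist, linkage="single", num_clusters=1):
    """Merge the closest pair of clusters until num_clusters remain.

    dist is a symmetric N x N distance matrix (list of lists).
    Returns the final clusters and the list of merges performed.
    """
    clusters = [frozenset([i]) for i in range(len(dist))]
    history = []
    while len(clusters) > num_clusters:
        a, b = min(combinations(clusters, 2),
                   key=lambda p: cluster_distance(p[0], p[1], dist, linkage))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append((sorted(a), sorted(b)))
    return clusters, history

# Hypothetical toy data: five items on a line
points = [0, 1, 5, 6, 20]
dist = [[abs(p - q) for q in points] for p in points]
clusters, history = agglomerate(dist, linkage="single", num_clusters=2)
```

Changing `linkage` to `"complete"` or `"average"` changes only how the distance to a merged cluster is computed, which is the one step the three variants differ in.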
Single Linkage Hierarchical Clustering
1. Say “Every point is its own cluster”
2. Find “most similar” pair of clusters
3. Merge them into a parent cluster
4. Repeat
DIANA (Divisive Analysis)
Introduced in Kaufman and Rousseeuw (1990)
Inverse order of AGNES
Eventually each node forms a cluster on its own
[Figure: three scatter plots, each with both axes running from 0 to 10.]
Overview
Divisive clustering starts by placing all objects into a single group. Before we start the procedure, we need to decide on a threshold distance. The procedure is as follows:
1. The distance between all pairs of objects within the same group is determined and the pair with the largest distance is selected.
Overview (Cont.)
2. This maximum distance is compared to the threshold distance. If it is larger than the threshold, this group is divided in two. This is done by placing the selected pair into different groups and using them as seed points. All other objects in this group are examined, and are placed into the new group with the closest seed point. The procedure then returns to Step 1.
3. If the distance between the selected objects is less than the threshold, the divisive clustering stops.
To run a divisive clustering, you simply need to decide upon a method of measuring the distance between two objects.
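The threshold-based procedure above can be sketched as follows. This is a hedged illustration, not a reference implementation: the function name `divisive` is hypothetical, the distance function is passed in by the caller, and we assume the stopping test is applied to each group separately (the slides do not say how several groups are handled).

```python
from itertools import combinations

def divisive(points, threshold, dist):
    """Split groups at their farthest pair until no group contains
    a pair of objects farther apart than the threshold."""
    finished, pending = [], [list(points)]
    while pending:
        group = pending.pop()
        if len(group) < 2:
            finished.append(group)
            continue
        # Select the pair with the largest distance in this group
        s1, s2 = max(combinations(group, 2), key=lambda pr: dist(*pr))
        if dist(s1, s2) <= threshold:
            finished.append(group)      # compact enough: stop splitting
            continue
        # The selected pair become seeds; every object joins the
        # group of its closest seed
        g1, g2 = [], []
        for p in group:
            (g1 if dist(p, s1) <= dist(p, s2) else g2).append(p)
        pending += [g1, g2]
    return finished

# Hypothetical 1-D data with absolute difference as the distance
groups = divisive([0, 1, 2, 10, 11, 12], threshold=3,
                  dist=lambda a, b: abs(a - b))
```

Each seed always lands in its own group, so every split produces two non-empty groups and the loop terminates.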
DIANA- Explored
In DIANA, a divisive hierarchical clustering method, all of the objects initially form one cluster.
The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster.
The cluster splitting process repeats until, eventually, each new cluster contains a single object or a termination condition is met.
Difficulties with Hierarchical clustering
It encounters difficulties regarding the selection of merge and split points.
Such a decision is critical because once a group of objects is merged or split, the process at the next step will operate on the newly generated clusters.
It will not undo what was done previously. Thus, split or merge decisions, if not well
chosen at some step, may lead to low-quality clusters.
Solution to improve Hierarchical clustering
One promising direction for improving the clustering quality of hierarchical methods is to integrate hierarchical clustering with other clustering techniques. A few such methods are:
1. BIRCH
2. CURE
3. Chameleon
BIRCH: An Efficient Data Clustering Method for Very Large Databases
Paper by:
Tian Zhang, Raghu Ramakrishnan and Miron Livny
Computer Sciences Dept.
University of Wisconsin-Madison
In Proceedings of the International Conference on Management of Data (ACM SIGMOD), pages 103-114, Montreal, Canada, June 1996.
Reference For Paper
www2.informatik.huberlin.de/wm/mldm2004/zhang96birch.pdf
Birch (Balanced Iterative Reducing and Clustering Using Hierarchies)
A hierarchical clustering method. It introduces two concepts:
1. Clustering feature
2. Clustering feature tree (CF tree)
These structures help the clustering method achieve good speed and scalability in large databases.
Clustering Feature Definition
Given N d-dimensional data points in a cluster, {Xi} where i = 1, 2, …, N:
CF = (N, LS, SS)
N is the number of data points in the cluster,
LS is the linear sum of the N data points,
SS is the square sum of the N data points.
Clustering feature concepts
Each record (data object) in the database is a tuple of attribute values, here called a vector. We define
$O_i = (V_{i1}, \ldots, V_{id})$
Linear sum definition:
$LS = \sum_{i=1}^{N} O_i = \left( \sum_{i=1}^{N} V_{i1},\ \sum_{i=1}^{N} V_{i2},\ \ldots,\ \sum_{i=1}^{N} V_{id} \right)$
Square sum definition:
$SS = \sum_{i=1}^{N} O_i^2 = \left( \sum_{i=1}^{N} V_{i1}^2,\ \sum_{i=1}^{N} V_{i2}^2,\ \ldots,\ \sum_{i=1}^{N} V_{id}^2 \right)$
Example of a case
Assume N = 5 and d = 2.
Linear sum:
$LS = \sum_{i=1}^{5} O_i = \left( \sum_{i=1}^{5} V_{i1},\ \sum_{i=1}^{5} V_{i2} \right)$
Square sum:
$SS = \left( \sum_{i=1}^{5} V_{i1}^2,\ \sum_{i=1}^{5} V_{i2}^2 \right)$
Example 2
[Figure: scatter plot on 0 to 10 axes, annotated with CF = (5, (16, 30), (54, 190)).]
Object  Attribute1  Attribute2
O1      3           4
O2      2           6
O3      4           5
O4      4           7
O5      3           8
Clustering feature = CF = (N, LS, SS)
N = 5
LS = (16, 30)
SS = (54, 190)
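The numbers in this example can be checked with a short Python sketch. The function names are our own, and SS is kept per attribute, as in these slides. The second function illustrates the additivity property of clustering features from the BIRCH paper: the CF of a merged cluster is the component-wise sum of the two CFs, so clusters can be merged without revisiting the data points.

```python
def clustering_feature(points):
    """Return CF = (N, LS, SS) with per-attribute linear and square sums."""
    n = len(points)
    d = len(points[0])
    ls = tuple(sum(p[k] for p in points) for k in range(d))
    ss = tuple(sum(p[k] ** 2 for p in points) for k in range(d))
    return n, ls, ss

def merge_cf(cf_a, cf_b):
    """CFs are additive: merging two clusters adds their components."""
    (na, lsa, ssa), (nb, lsb, ssb) = cf_a, cf_b
    return (na + nb,
            tuple(x + y for x, y in zip(lsa, lsb)),
            tuple(x + y for x, y in zip(ssa, ssb)))

# The five objects O1..O5 from the table above
objects = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = clustering_feature(objects)

# Summarizing two halves separately and merging gives the same CF
merged = merge_cf(clustering_feature(objects[:2]),
                  clustering_feature(objects[2:]))
```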
CF-Tree
A CF-tree is a height-balanced tree with two parameters: branching factor (B for nonleaf node and L for leaf node) and threshold T.
The entry in each nonleaf node has the form [CFi, childi]
The entry in each leaf node is a CF; each leaf node has two pointers: 'prev' and 'next'.
The CF tree is basically a tree used to store all the clustering features.
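CF Tree
[Figure: a CF tree with a root, a non-leaf node and two leaf nodes. Entries in the root and the non-leaf node pair a CF with a child pointer (CF1/child1, CF2/child2, CF3/child3, ...); each leaf node holds CF entries and is linked to its neighbors through prev and next pointers.]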
BIRCH Clustering
Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
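To give a feel for Phase 1, here is a deliberately simplified Python sketch. It is not BIRCH itself: it keeps a flat list of leaf entries instead of a height-balanced tree, so there is no branching factor and no node splitting. It shows only the core test: a point is absorbed into the closest CF entry if the entry's radius stays within the threshold T, otherwise it starts a new entry. All names are hypothetical, and SS is kept per attribute as in these slides.

```python
import math

def absorb(cf, point):
    """Return the CF obtained by adding one point to an entry."""
    n, ls, ss = cf
    return (n + 1,
            tuple(l + x for l, x in zip(ls, point)),
            tuple(s + x * x for s, x in zip(ss, point)))

def centroid(cf):
    n, ls, _ = cf
    return tuple(l / n for l in ls)

def radius(cf):
    """Root-mean-square distance of the members from the centroid,
    computed from the CF alone."""
    n, ls, ss = cf
    var = sum(s / n - (l / n) ** 2 for l, s in zip(ls, ss))
    return math.sqrt(max(var, 0.0))

def build_leaf_entries(points, threshold):
    """Single scan over the data, summarizing it into CF entries."""
    entries = []
    for p in points:
        if entries:
            i = min(range(len(entries)),
                    key=lambda j: math.dist(centroid(entries[j]), p))
            candidate = absorb(entries[i], p)
            if radius(candidate) <= threshold:
                entries[i] = candidate      # point fits the closest entry
                continue
        entries.append((1, tuple(p), tuple(x * x for x in p)))
    return entries

# Hypothetical data: two tight groups far apart
entries = build_leaf_entries([(0, 0), (0, 1), (10, 10), (10, 11)],
                             threshold=1.0)
```

Phase 2 would then run any clustering algorithm on the entries (for example on their centroids) rather than on the raw points.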
BIRCH Algorithm Overview
Summary of Birch
Scales linearly: a single scan yields a good clustering, and the quality of the clustering improves with a few additional scans.
It handles noise (data points that are not part of the underlying pattern) effectively.
Density-Based Clustering Methods
Clustering based on density, such as density-connected points, instead of a distance metric.
Cluster = set of “density connected” points.
Major features:
1. Discover clusters of arbitrary shape
2. Handle noise
3. Need “density parameters” as termination condition (when no new objects can be added to the cluster)
Examples:
DBSCAN (Ester, et al. 1996)
OPTICS (Ankerst, et al. 1999)
DENCLUE (Hinneburg & Keim 1998)
Density-Based Clustering: Background
Eps neighborhood: the neighborhood within a radius Eps of a given object.
MinPts: minimum number of points in an Eps-neighborhood of that object.
Core object: if the Eps neighborhood of an object contains at least a minimum number of points, MinPts, then the object is a core object.
Directly density-reachable: a point p is directly density-reachable from a point q wrt. Eps, MinPts if
1) p is within the Eps neighborhood of q
2) q is a core object
[Figure: points p and q, with MinPts = 5 and Eps = 1.]
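These definitions translate almost word for word into Python. The sketch below assumes Euclidean distance and counts an object as a member of its own Eps neighborhood; the function names and the sample points are hypothetical.

```python
import math

def eps_neighborhood(points, q, eps):
    """All points within distance eps of q (q itself included)."""
    return [p for p in points if math.dist(p, q) <= eps]

def is_core(points, q, eps, min_pts):
    """q is a core object if its Eps neighborhood has >= min_pts points."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p is directly density-reachable from q:
    p lies in the Eps neighborhood of q and q is a core object."""
    return math.dist(p, q) <= eps and is_core(points, q, eps, min_pts)

# Hypothetical data: a unit square of points plus one point on its edge
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1)]
forward = directly_density_reachable(pts, (2, 1), (1, 1), eps=1.0, min_pts=3)
backward = directly_density_reachable(pts, (1, 1), (2, 1), eps=1.0, min_pts=3)
```

Here `forward` is true and `backward` is false, which shows that direct density-reachability is not symmetric: (2, 1) has too few neighbors to be a core object.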
Figure showing the density reachability and density connectivity in density-based clustering
MinPts = 3
Eps = radius of the circles
M, P, O, R and S are core objects since the Eps neighborhood of each contains at least 3 points.
Directly density reachable
Q is directly density reachable from M. M is directly density reachable from P and vice versa.
Indirectly density reachable
Q is indirectly density reachable from P since Q is directly density reachable from M and M is directly density reachable from P. But P is not density reachable from Q since Q is not a core object.
Core, border, and noise points
DBSCAN is a density-based algorithm.
Density = number of points within a specified radius (Eps).
A point is a core point if it has at least a specified number of points (MinPts) within Eps. These are points that are in the interior of a cluster.
A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
A noise point is any point that is neither a core point nor a border point.
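The three kinds of points can be told apart with a short Python sketch. As assumptions of our own, it uses Euclidean distance, counts a point as part of its own neighborhood, and treats a point as core when it has at least MinPts points within Eps (the same convention as the core-object definition earlier). The function name and data are hypothetical.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border' or 'noise'."""
    def neighbors(q):
        return [p for p in points if math.dist(p, q) <= eps]

    core = {p for p in points if len(neighbors(p)) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(n in core for n in neighbors(p)):
            labels[p] = "border"   # not dense itself, but next to a core point
        else:
            labels[p] = "noise"
    return labels

# Hypothetical data: a dense square, one point on its edge, one far away
labels = classify_points(
    [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (9, 9)], eps=1.0, min_pts=3)
```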
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): The Algorithm
Arbitrarily select a point p.
Retrieve all points density-reachable from p wrt Eps and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed.
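A compact Python sketch of these steps follows. It is an illustration under our own assumptions rather than the published implementation: neighborhoods are found by a linear scan instead of a spatial index, points are visited in input order rather than arbitrarily, and the function and label names are hypothetical.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one cluster id per point; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def region(i):
        # Indices of all points in the Eps neighborhood of point i
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE           # may later turn out to be a border point
            continue
        labels[i] = cluster_id          # i is a core point: start a cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            nbrs = region(j)
            if len(nbrs) >= min_pts:    # j is also core: keep expanding
                queue.extend(nbrs)
        cluster_id += 1
    return labels

cluster_labels = dbscan(
    [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (9, 9)], eps=1.0, min_pts=3)
```

In this hypothetical data set the first five points end up in one cluster and the isolated point is labeled noise.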
Conclusions
We discussed two hierarchical clustering methods – Agglomerative and Divisive.
We also discussed BIRCH, a hierarchical clustering method that produces a good clustering in a single scan and a better one with a few additional scans.
DBSCAN is a density-based clustering algorithm that discovers clusters of arbitrary shape. Unlike the hierarchical methods, it forms clusters from density-connected points rather than from a distance metric between clusters.
GRID-BASED CLUSTERING METHODS
This is the approach in which we quantize space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed.
So, for example assume that we have a set of records and we want to cluster with respect to two attributes, then, we divide the related space (plane), into a grid structure and then we find the clusters.
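Quantizing the plane into cells takes only a few lines of Python. In this sketch the grid origin, the cell widths and the sample (age, salary) records are all hypothetical choices made for illustration.

```python
from collections import Counter

def cell_of(point, origin, widths):
    """Index of the grid cell that contains the point."""
    return tuple(int((x - o) // w) for x, o, w in zip(point, origin, widths))

def grid_counts(points, origin, widths):
    """Number of points that fall into each non-empty cell."""
    return Counter(cell_of(p, origin, widths) for p in points)

# Hypothetical records: (age, salary in units of 10,000)
records = [(23, 1.5), (27, 1.8), (34, 4.2), (36, 4.9), (38, 4.1), (55, 7.0)]

# Cells are 10 years wide and 1 salary unit high, starting at (20, 0)
counts = grid_counts(records, origin=(20, 0), widths=(10, 1))
```

After this step the clustering operations work on the cell counts rather than on the individual records.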
[Figure: the Age (20 to 60) versus Salary (in units of 10,000; 0 to 8) plane, divided into a grid. Our “space” is this plane.]
Techniques for Grid-Based Clustering
The following are some techniques that are used to perform Grid-Based Clustering:
1. CLIQUE (CLustering In QUEst)
2. STING (STatistical Information Grid)
3. WaveCluster
Looking at CLIQUE as an Example
CLIQUE is used for the clustering of high-dimensional data present in large tables. By high-dimensional data we mean records that have many attributes.
CLIQUE identifies the dense units in the subspaces of high dimensional data space, and uses these subspaces to provide more efficient clustering.
Definitions That Need to Be Known
Unit : After forming a grid structure on the space, each rectangular cell is called a Unit.
Dense: A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter.
Cluster: A cluster is defined as a maximal set of connected dense units.
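The three definitions can be sketched in Python as follows. As assumptions of our own, the density threshold is called `tau`, units are identified by their integer grid indices, and two units count as connected when they share a face (their indices differ by one in exactly one dimension), possibly through a chain of such neighbors.

```python
from collections import Counter

def dense_units(cells, total, tau):
    """Units whose fraction of all data points exceeds tau.

    cells holds one grid index tuple per data point."""
    counts = Counter(cells)
    return {u for u, c in counts.items() if c / total > tau}

def connected_clusters(units):
    """Group dense units into maximal sets of connected units."""
    remaining, clusters = set(units), []
    while remaining:
        stack = [remaining.pop()]
        cluster = set(stack)
        while stack:
            u = stack.pop()
            for axis in range(len(u)):
                for step in (-1, 1):
                    v = u[:axis] + (u[axis] + step,) + u[axis + 1:]
                    if v in remaining:
                        remaining.remove(v)
                        cluster.add(v)
                        stack.append(v)
        clusters.append(cluster)
    return clusters

# Hypothetical cell assignments for 10 data points
cells = [(0, 0)] * 3 + [(0, 1)] * 3 + [(5, 5)] * 3 + [(9, 9)]
units = dense_units(cells, total=len(cells), tau=0.2)
clusters = connected_clusters(units)
```

Units (0, 0) and (0, 1) are adjacent and form one cluster, (5, 5) forms a cluster on its own, and (9, 9) is not dense.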
How Does CLIQUE Work?
Let us say that we have a set of records that we would like to cluster in terms of n attributes.
So, we are dealing with an n-dimensional space.
MAJOR STEPS:
CLIQUE partitions each subspace that has dimension 1 into the same number of equal-length intervals.
Using this as a basis, it partitions the n-dimensional data space into non-overlapping rectangular units.
CLIQUE: Major Steps (Cont.)
Now CLIQUE's goal is to identify the dense n-dimensional units.
It does this in the following way:
CLIQUE finds dense units of higher dimensionality by finding the dense units in the subspaces.
So, for example if we are dealing with a 3-dimensional space, CLIQUE finds the dense units in the 3 related PLANES (2-dimensional subspaces.)
It then intersects the extension of the subspaces representing the dense units to form a candidate search space in which dense units of higher dimensionality would exist.
CLIQUE: Major Steps. (Cont.)
Each maximal set of connected dense units is considered a cluster.
Using this definition, the dense units in the subspaces are examined in order to find clusters in the subspaces.
The information of the subspaces is then used to find clusters in the n-dimensional space.
It must be noted that all cluster boundaries are either horizontal or vertical. This is due to the nature of the rectangular grid cells.
Example for CLIQUE
Let us say that we want to cluster a set of records that have three attributes, namely, salary, vacation and age.
The data space for this data would be 3-dimensional, with the axes age, salary and vacation.
Example (Cont.)
After plotting the data objects, each dimension, (i.e., salary, vacation and age) is split into intervals of equal length.
Then we form a 3-dimensional grid on the space, each unit of which would be a 3-D rectangle.
Now, our goal is to find the dense 3-D rectangular units.
Example (Cont.)
To do this, we find the dense units of the subspaces of this 3-d space.
So, we find the dense units with respect to salary and age. This means that we look at the salary-age plane and find all the 2-D rectangular units that are dense.
We also find the dense 2-D rectangular units for the vacation-age plane.
Example 1
[Figure: two grids over age (20 to 60), one plotting salary (in units of 10,000) against age and one plotting vacation (in weeks) against age.]
Example (Cont.)
Now let us try to visualize the dense units of the two planes on the following 3-d figure :
[Figure: 3-d plot with the axes age, salary and vacation; the ages 30 and 50 are marked on the age axis.]
Example (Cont.)
We can extend the dense areas in the vacation-age plane inwards.
We can extend the dense areas in the salary-age plane upwards.
The intersection of these two spaces would give us a candidate search space in which 3-dimensional dense units may exist.
We then find the dense units in the salary-vacation plane and we form an extension of the subspace that represents these dense units.
Example (Cont.)
Now, we perform an intersection of the candidate search space with the extension of the dense units of the salary-vacation plane, in order to get all the 3-d dense units.
So, what was the main idea?
We used the dense units in subspaces in order to find the dense units in the 3-dimensional space.
After finding the dense units, it is very easy to find clusters.
Reflecting upon CLIQUE
Why does CLIQUE confine its search for dense units in high dimensions to the intersection of dense units in subspaces?
Because of the Apriori property, which employs prior knowledge of the items in the search space so that portions of the space can be pruned.
The property for CLIQUE says that if a k-dimensional unit is dense then so are its projections in the (k-1) dimensional space.
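For the 3-dimensional example, this pruning rule can be sketched in Python. The function name and the sample interval indices are hypothetical. A 3-d unit is kept as a candidate only if its projections onto all three planes are dense; the candidates still have to be checked against the data, because dense projections are necessary but not sufficient.

```python
from itertools import product

def candidate_units_3d(dense_age_salary, dense_age_vacation,
                       dense_salary_vacation):
    """3-d units (age, salary, vacation) whose three 2-d projections
    are all dense."""
    candidates = set()
    for (a, s), (a2, v) in product(dense_age_salary, dense_age_vacation):
        # Same age interval in both planes, and the remaining
        # (salary, vacation) projection must be dense as well
        if a == a2 and (s, v) in dense_salary_vacation:
            candidates.add((a, s, v))
    return candidates

# Hypothetical dense 2-d units, given as interval indices per dimension
cands = candidate_units_3d(
    dense_age_salary={(1, 4), (2, 4)},
    dense_age_vacation={(1, 2), (2, 3)},
    dense_salary_vacation={(4, 2)},
)
```

The unit (2, 4, 3) is pruned without looking at the data, because its (salary, vacation) projection (4, 3) is not dense.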
Strength and Weakness of CLIQUE
Strength
It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.
It is quite efficient.
It is insensitive to the order of records in input and does not presume some canonical data distribution.
It scales linearly with the size of input and has good scalability as the number of dimensions in the data increases.
Weakness
The accuracy of the clustering result may be degraded at the expense of the simplicity of the method.
STING: A Statistical Information Grid Approach to Spatial Data Mining
Paper by:
Wei Wang, Jiong Yang and Richard Muntz
Department of Computer Science
University of California, Los Angeles
CA 90095, U.S.A.
VLDB Conference, Athens, Greece, 1997
Reference For Paper
http://georges.gardarin.free.fr/Cours_XMLDM_Master2/Sting.PDF
Definitions That Need to Be Known
Spatial Data:
Data that have a spatial or location component.
These are objects that themselves are located in physical space.
Examples: my house, Lake Geneva, New York City, etc.
Spatial Area:
The area that encompasses the locations of all the spatial data is called the spatial area.
STING (Introduction)
STING is used for performing clustering on spatial data.
STING uses a hierarchical multi-resolution grid data structure to partition the spatial area.
STING's big benefit is that it efficiently processes many common “region oriented” queries on a set of points.
We want to cluster the records that are in a spatial table in terms of location.
Placement of a record in a grid cell is completely determined by its physical location.
Hierarchical Structure of Each Grid Cell
The spatial area is divided into rectangular cells. (Using latitude and longitude.)
Each cell forms a hierarchical structure.
This means that each cell at a higher level is further partitioned into 4 smaller cells in the lower level.
In other words each cell at the ith level (except the leaves) has 4 children in the i+1 level.
The union of the 4 children cells would give back the parent cell in the level above them.
Hierarchical Structure of Cells (Cont.)
The size of the leaf level cells and the number of layers depends upon how much granularity the user wants.
So, Why do we have a hierarchical structure for cells?
We have them in order to provide a better granularity, or higher resolution.
A Hierarchical Structure for Sting Clustering
Statistical Parameters Stored in each Cell
For each cell in each layer we have attribute-dependent and attribute-independent parameters.
Attribute-independent parameter:
Count: number of records in this cell.
Attribute-dependent parameters:
(We are assuming that our attribute values are real numbers.)
Statistical Parameters (Cont.)
For each attribute of each cell we store the following parameters:
M mean of all values of each attribute in this cell.
S Standard Deviation of all values of each attribute in this cell.
Min The minimum value for each attribute in this cell.
Max The maximum value for each attribute in this cell.
Distribution The type of distribution that the attribute value in this cell follows. (e.g. normal, exponential, etc.) None is assigned to “Distribution” if the distribution is unknown.
Storing of Statistical Parameters
Statistical information regarding the attributes in each grid cell, for each layer, is pre-computed and stored beforehand.
The statistical parameters for the cells in the lowest layer are computed directly from the values that are present in the table.
The statistical parameters for the cells in all the other levels are computed from their respective children cells in the lower level.
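The bottom-up computation can be sketched in Python for a single attribute. The class and function names are our own, the "Distribution" parameter is left out, and every cell is assumed to hold at least one record. A parent's count, mean, standard deviation, minimum and maximum follow from its children's parameters alone, so the table is read only once, for the lowest layer.

```python
import math
from dataclasses import dataclass

@dataclass
class CellStats:
    n: int          # count
    mean: float     # M
    std: float      # S (population standard deviation)
    minimum: float  # Min
    maximum: float  # Max

def leaf_stats(values):
    """Parameters of a lowest-layer cell, computed from the raw values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return CellStats(n, mean, math.sqrt(var), min(values), max(values))

def parent_stats(children):
    """Parameters of a higher-level cell, computed from its children only."""
    n = sum(c.n for c in children)
    mean = sum(c.mean * c.n for c in children) / n
    # Weighted mean of the squared values, recovered from each child's s and m
    second_moment = sum((c.std ** 2 + c.mean ** 2) * c.n for c in children) / n
    std = math.sqrt(max(second_moment - mean ** 2, 0.0))
    return CellStats(n, mean, std,
                     min(c.minimum for c in children),
                     max(c.maximum for c in children))

# Hypothetical attribute values in the 4 children of one parent cell
groups = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0], [7.0]]
parent = parent_stats([leaf_stats(g) for g in groups])
direct = leaf_stats([v for g in groups for v in g])
```

Here `parent` agrees with `direct`, the parameters computed from all raw values at once, up to floating-point rounding.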
How are Queries Processed?
STING can answer many queries (especially region queries) efficiently, because we don't have to access the full database.
How are spatial data queries processed?
We use a top-down approach to answer spatial data queries.
Start from a pre-selected layer, typically with a small number of cells.
The pre-selected layer does not have to be the topmost layer.
For each cell in the current layer, compute the confidence interval (or estimated range of probability) reflecting the cell's relevance to the given query.
Query Processing (Cont.)
The confidence interval is calculated by using the statistical parameters of each cell.
Remove irrelevant cells from further consideration.
When finished with the current layer, proceed to the next lower level.
Processing of the next lower level examines only the remaining relevant cells.
Repeat this process until the bottom layer is reached.
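The top-down traversal can be sketched in Python. Note the simplification: instead of computing a confidence interval from the stored parameters, this sketch uses a plain overlap test between the query range and each cell's stored minimum and maximum, which stands in for the relevance test. The class, function and cell names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Cell:
    name: str
    low: float       # stored Min of the attribute in this cell
    high: float      # stored Max of the attribute in this cell
    children: list = field(default_factory=list)

def relevant_leaves(start_layer, lo, hi):
    """Descend layer by layer, keeping only cells that may hold values
    in the query range [lo, hi]; return the relevant bottom-layer cells."""
    layer = start_layer
    while True:
        # Remove irrelevant cells from further consideration
        layer = [c for c in layer if c.high >= lo and c.low <= hi]
        if all(not c.children for c in layer):
            return layer
        # Proceed to the next lower level, examining only children
        # of the remaining relevant cells
        layer = [child for c in layer for child in (c.children or [c])]

# Hypothetical two-layer grid storing apartment rents
west = Cell("west", 700, 950, [Cell("a", 700, 780), Cell("b", 820, 950)])
east = Cell("east", 1050, 1300, [Cell("c", 1100, 1300), Cell("d", 1050, 1200)])
hits = relevant_leaves([west, east], lo=800, hi=1000)
```

The cell "east" is discarded at the upper layer, so its children c and d are never examined; only leaf cell "b" remains relevant to the query.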
Different Grid Levels during Query Processing.
Sample Query Examples
Assume that the spatial area is the map of the regions of Long Island, Brooklyn and Queens.
Our records represent apartments that are present throughout the above region.
Query: “Find all the apartments that are for rent near Stony Brook University that have a rent range of $800 to $1000.”
The above query depends upon the parameter “near.” For our example, near means within 15 miles of Stony Brook University.
Advantages and Disadvantages of STING
ADVANTAGES:
Very efficient. The computational complexity is O(k), where k is the number of grid cells at the lowest level. Usually k << N, where N is the number of records.
STING is a query-independent approach, since statistical information exists independently of queries.
Incremental update.
DISADVANTAGES:
All cluster boundaries are either horizontal or vertical, and no diagonal boundary is selected.
Thank you !