TRANSCRIPT
Saurav Kumar Singh
Department of Computer Science & Engineering Dual degree 4th year
Outline
Motivation
Basics
Hierarchical Structure
Parameter Generation
Query Types
Algorithm
Motivation All previous clustering algorithms are query-dependent
They are built for one query and are generally of no use for other queries.
Each query needs a separate scan of the data.
So each query costs at least O(n) computation.
So we need a structure built from the database so that various queries can be answered without rescanning.
Basics Grid-based method: quantizes the object space into a
finite number of cells that form a grid structure, on which all of the clustering operations are performed
Develop a hierarchical structure out of the given data and answer various queries efficiently.
Every level of the hierarchy consists of cells
Answering a query is not O(n), where n is the number of elements in the database
A hierarchical structure for STING clustering
continue …..
The root of the hierarchy is at level 1
A cell at level i corresponds to the union of the areas of its children at level i + 1
A cell at a higher level is partitioned to form a number of cells at the next lower level
Statistical information for each cell is calculated and stored beforehand and is used to answer queries
Cell parameters
Attribute-independent parameter:
n - number of objects (points) in this cell
Attribute-dependent parameters:
m - mean of all attribute values in this cell
s - standard deviation of all values of the attribute in this cell
min - the minimum value of the attribute in this cell
max - the maximum value of the attribute in this cell
distribution - the type of distribution that the attribute values in this cell follow
Parameter Generation n, m, s, min, and max of bottom-level cells are
calculated directly from the data
The distribution type can either be assigned by the user or be obtained by a hypothesis test, e.g. the χ2 test
Parameters of higher-level cells are calculated from the parameters of their lower-level cells.
continue…..
Let n, m, s, min, max, dist be the parameters of the current cell,
and ni, mi, si, mini, maxi and disti be the parameters of the corresponding lower-level cells
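The aggregation of n, m, s, min, max from child cells can be sketched as follows (a minimal sketch; the pooled mean/variance formulas are the standard ones implied by the definitions, and the tuple layout of a child cell is an illustrative assumption):

```python
import math

def merge_cells(children):
    """Aggregate (n, m, s, min, max) of child cells into the parent cell.

    Each child is a tuple (n_i, m_i, s_i, min_i, max_i).
    """
    n = sum(c[0] for c in children)
    # Pooled mean: each child's mean weighted by its point count.
    m = sum(c[0] * c[1] for c in children) / n
    # Pooled std: E[X^2] - (E[X])^2 accumulated over all children.
    m2 = sum(c[0] * (c[2] ** 2 + c[1] ** 2) for c in children) / n
    s = math.sqrt(m2 - m * m)
    lo = min(c[3] for c in children)
    hi = max(c[4] for c in children)
    return n, m, s, lo, hi
```

For example, merging a child holding points {0, 2} with a child holding points {4, 6} gives the same n, m, s, min, max as computing them directly over {0, 2, 4, 6}.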
dist for the Parent Cell Set dist to the distribution type followed by most points in
this cell
Then count the conflicting points in the child cells; call this count confl:
1. If disti ≠ dist but mi ≈ m and si ≈ s, then confl is increased by an amount of ni;
2. If disti ≠ dist and either mi ≈ m or si ≈ s is not satisfied, then confl is set to n;
3. If disti = dist, mi ≈ m and si ≈ s, then confl is increased by 0;
4. If disti = dist but either mi ≈ m or si ≈ s is not satisfied, then confl is set to n.
continue….. If confl/n is greater than a threshold t, set dist to NONE.
Otherwise keep the original type.
Example:
continue….. Parameters for the parent cell would be
n = 220 m = 20.27 s = 2.37
min = 3.8 max = 40 dist = NORMAL
210 points follow distribution type NORMAL
Set dist of the parent to NORMAL
confl = 10
confl/n = 10/220 ≈ 0.045 < 0.05, so keep the original type.
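The rules for setting the parent's dist can be sketched as below (a minimal sketch; the approximate-equality tolerance `tol` and the threshold `t` are illustrative assumptions, and each child is a hypothetical tuple (ni, disti, mi, si)):

```python
def parent_dist(dist, m, s, children, t=0.05, tol=0.5):
    """Decide the parent cell's distribution type from its children.

    `dist`, `m`, `s` are the parent's majority distribution, mean and std.
    Each child is (n_i, dist_i, m_i, s_i).
    """
    n = sum(c[0] for c in children)
    confl = 0
    for ni, di, mi, si in children:
        stats_ok = abs(mi - m) <= tol and abs(si - s) <= tol
        if di != dist and stats_ok:
            confl += ni          # rule 1: type differs, stats agree
        elif di != dist:
            confl = n            # rule 2: type and stats both differ
            break
        elif not stats_ok:
            confl = n            # rule 4: same type, stats differ
            break
        # rule 3: same type and matching stats contributes nothing
    return "NONE" if confl / n > t else dist
```

With the slide's numbers (210 NORMAL points plus 10 conflicting points out of 220), confl/n ≈ 0.045 < 0.05, so the parent keeps dist = NORMAL.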
Query Types The STING structure is capable of answering various queries
But if it cannot, we always have the underlying database
Even if the statistical information is not sufficient to answer a query exactly, we can still generate a possible set of answers.
Common queries
Select regions that satisfy certain conditions
Select the maximal regions that have at least 100 houses per unit
area, at least 70% of the house prices above $400K, and total area at least 100 units, with 90% confidence
SELECT REGION
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (400000, ∞) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND WITH CONFIDENCE 0.9
continue…. Select regions and return some function of the region
Select the range of ages of houses in those maximal regions where there
are at least 100 houses per unit area and at least 70% of the houses have a price between $150K and $300K, with area at least 100 units, in California.
SELECT RANGE(age)
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (150000, 300000) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND LOCATION California
Algorithm With the hierarchical structure of grid cells on hand,
we can use a top-down approach to answer spatial data mining queries
For any query, we begin by examining cells on a high-level layer
Using the parameters of each cell, we calculate the likelihood that the cell is relevant to the query at some confidence level
If the distribution type is NONE, we estimate the likelihood using distribution-free techniques instead
continue…. After we obtain the confidence interval, we label the
cell as relevant or not relevant at the specified confidence level
We then proceed to the next layer, but consider only the children of the relevant cells of the upper layer
We repeat this until we reach the bottom layer
The relevant cells of the bottom layer have enough statistical information to give a satisfactory answer to the query.
However, for more accurate mining we may retrieve the data corresponding to the relevant cells and process it further.
Finding Regions After we have all the relevant cells at the bottom level,
we need to output the regions that satisfy the query
We can do this using breadth-first search
Breadth-First Search We examine the cells within a certain distance of
the center of the current cell
If the average density within this small area is greater than the specified density, mark this area
Put the relevant cells just examined into a queue.
Take an element from the queue and repeat the same procedure, except that only relevant cells not examined before are enqueued. When the queue is empty, we have identified one region.
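The region-growing step can be sketched as a BFS over the relevant bottom-layer cells (a minimal sketch assuming cells are identified by (row, col) grid coordinates; it uses simple 4-connectivity, whereas the full algorithm also examines cells within a small distance and checks the average density of the surrounding area):

```python
from collections import deque

def find_regions(relevant):
    """Group relevant bottom-layer cells into connected regions via BFS.

    `relevant` is a set of (row, col) coordinates of relevant cells.
    Returns a list of regions, each a set of cell coordinates.
    """
    unseen = set(relevant)
    regions = []
    while unseen:
        start = unseen.pop()
        region, queue = {start}, deque([start])
        while queue:
            r, c = queue.popleft()
            # Enqueue only relevant neighbors not examined before.
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in unseen:
                    unseen.remove(nb)
                    region.add(nb)
                    queue.append(nb)
        regions.append(region)   # queue empty: one region identified
    return regions
```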
Statistical Information Grid-based Algorithm 1. Determine a layer to begin with.
2. For each cell of this layer, we calculate the confidence interval (or estimated range) of probability that this cell is relevant to the query.
3. From the interval calculated above, we label the cell as relevant or not relevant.
4. If this layer is the bottom layer, go to Step 6; otherwise, go to Step 5.
5. We go down the hierarchy structure by one level. Go to Step 2 for those cells that are children of the relevant cells of the higher-level layer.
6. If the specification of the query is met, go to Step 8; otherwise, go to Step 7.
7. Retrieve the data that fall into the relevant cells and do further processing. Return the results that meet the requirements of the query. Go to Step 9.
8. Find the regions of relevant cells. Return those regions that meet the requirement of the query. Go to Step 9.
9. Stop.
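Steps 2-5 above amount to a top-down traversal that prunes irrelevant subtrees. A minimal sketch (the predicate `is_relevant` stands in for the confidence-interval test on a cell's stored parameters, and `children` returns a cell's next-layer cells; both are assumed to be supplied by the caller):

```python
def sting_query(root, is_relevant, children):
    """Top-down STING traversal returning the relevant bottom-layer cells.

    `is_relevant(cell)` abstracts the per-cell confidence-interval test;
    `children(cell)` returns the cell's children at the next layer
    (an empty list at the bottom layer).
    """
    layer = [root]
    while True:
        # Steps 2-3: keep only the cells labeled relevant.
        layer = [cell for cell in layer if is_relevant(cell)]
        # Step 5: descend only into children of relevant cells.
        next_layer = [child for cell in layer for child in children(cell)]
        if not next_layer:       # Step 4: bottom layer reached
            return layer
        layer = next_layer
```

Note that children of cells labeled not relevant are never examined, which is what keeps the traversal cost proportional to the number of cells actually visited.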
Time Analysis: Step 1 takes constant time. Steps 2 and 3 take
constant time per cell.
So the total time is at most proportional to the total number of cells in our hierarchical structure.
Notice that, with four children per cell, the total number of cells is approximately 1.33K (= K(1 + 1/4 + 1/16 + …)), where K is the number of cells at the bottom layer.
So the overall computational complexity on the grid hierarchy structure is O(K)
Time Analysis: STING goes through the database once to compute the
statistical parameters of the cells
The time complexity of generating clusters is therefore O(n), where n is the total number of objects.
After the hierarchical structure is generated, the query processing time is O(g), where g is the total number of grid cells at the lowest level, which is usually much smaller than n.
Comparison
CLIQUE: A Dimension-Growth Subspace Clustering Method The first dimension-growth subspace clustering algorithm
Clustering starts in single-dimension subspaces and moves upward to higher-dimensional subspaces
This algorithm can be viewed as an integration of density-based and grid-based algorithms
Informal problem statement Given a large set of multidimensional data points, the
data space is usually not uniformly occupied by the data points.
CLIQUE’s clustering identifies the sparse and the “crowded” areas in space (or units), thereby discovering the overall distribution patterns of the data set.
A unit is dense if the fraction of total data points contained in it exceeds an input model parameter.
In CLIQUE, a cluster is defined as a maximal set of connected dense units.
Formal Problem Statement Let A = {A1, A2, . . . , Ad} be a set of bounded, totally
ordered domains and S = A1 × A2 × · · · × Ad a d-dimensional numerical space.
We will refer to A1, . . . , Ad as the dimensions (attributes) of S.
The input consists of a set of d-dimensional points V = {v1, v2, . . . , vm},
where vi = (vi1, vi2, . . . , vid). The jth component of vi is drawn from domain Aj.
CLIQUE Working A 2-step process
1st step - partition the d-dimensional data space
2nd step - generate a minimal description of each cluster.
1st step - Partitioning Partitioning is done separately for each dimension.
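The per-dimension partitioning can be sketched as dividing each dimension into equal-width intervals and keeping the dense ones (a minimal sketch; the interval count and the density threshold `tau` are the input model parameters, and the function signature is an illustrative assumption):

```python
def dense_units_1d(values, low, high, n_intervals, tau):
    """Partition one dimension into equal-width intervals and return
    the indices of the dense units.

    A unit is dense if the fraction of all points falling into it
    exceeds the input model parameter `tau`.
    """
    width = (high - low) / n_intervals
    counts = [0] * n_intervals
    for v in values:
        # Clamp so that v == high lands in the last interval.
        idx = min(int((v - low) / width), n_intervals - 1)
        counts[idx] += 1
    total = len(values)
    return [i for i, c in enumerate(counts) if c / total > tau]
```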
continue…. The subspaces representing these dense units are
intersected to form a candidate search space in which dense units of higher dimensionality may exist.
This approach to selecting candidates is quite similar to the Apriori-Gen process of generating candidates.
It relies on the observation that if a unit is dense in a higher-dimensional space, it cannot be sparse in its lower-dimensional projections.
More formally If a k-dimensional unit is dense, then so are its projections
in (k-1)-dimensional space.
Given a k-dimensional candidate dense unit, if any of its (k-1)-dimensional projection units is not dense, then the k-dimensional unit cannot be dense
So we can generate candidate dense units in k-dimensional space from the dense units found in (k-1)-dimensional space
The resulting space searched is much smaller than the original space.
The dense units are then examined in order to determine the clusters.
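The candidate generation step can be sketched Apriori style (a minimal sketch; a unit is represented as a frozenset of hypothetical (dimension, interval_index) pairs, and the join here pairs any two compatible units rather than using the paper's ordered-prefix join):

```python
from itertools import combinations

def candidates(dense_prev, k):
    """Generate candidate k-dimensional dense units from the
    (k-1)-dimensional dense units.

    Each unit is a frozenset of (dimension, interval_index) pairs.
    A candidate is kept only if every (k-1)-dimensional projection
    of it is dense (monotonicity of density).
    """
    dense_prev = set(dense_prev)
    out = set()
    for a, b in combinations(dense_prev, 2):
        cand = a | b
        dims = {d for d, _ in cand}
        if len(cand) != k or len(dims) != k:
            continue  # units overlap or share a dimension
        # Prune: every (k-1)-dimensional projection must be dense.
        if all(frozenset(sub) in dense_prev
               for sub in combinations(cand, k - 1)):
            out.add(cand)
    return out
```

For instance, joining the 1-dimensional dense units {(salary, i)} and {(vacation, j)} yields the 2-dimensional candidate {(salary, i), (vacation, j)}, while two units on the same dimension are never joined.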
Intersection
Dense units found with respect to age for the dimensions salary and vacation are intersected in order to provide a candidate
search space for dense units of higher dimensionality.
2nd step - Minimal Description For each cluster, CLIQUE determines the maximal
region that covers the cluster of connected dense units.
It then determines a minimal cover (a logic description) for each cluster.
Effectiveness of CLIQUE CLIQUE automatically finds subspaces of the highest
dimensionality such that high-density clusters exist in those subspaces.
It is insensitive to the order of input objects
It scales linearly with the size of the input
It scales easily with the number of dimensions in the data
Thank You
References:
Wei Wang, Jiong Yang, and Richard Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. Department of Computer Science, University of California, Los Angeles, February 20, 1997.
Jiawei Han (University of Illinois at Urbana-Champaign) and Micheline Kamber. Data Mining: Concepts and Techniques, Second Edition.