TRANSCRIPT
Saurav Kumar Singh
Department of Computer Science & Engineering Dual degree 4th year
Outline
Motivation
Basics
Hierarchical Structure
Parameter Generation
Query Types
Algorithm
Motivation All previous clustering algorithms are query-dependent
They are built for one query and are generally of no use for other queries.
Each query needs a separate scan of the data.
So each query costs at least O(n) computation.
So we need a structure built from the database so that various queries can be answered without rescanning.
Basics Grid-based method: quantizes the object space into a
finite number of cells that form a grid structure, on which all of the clustering operations are performed
Develop a hierarchical structure out of the given data and answer various queries efficiently.
Every level of the hierarchy consists of cells
Answering a query is not O(n), where n is the number of elements in the database
A hierarchical structure for STING clustering
continue …..
The root of the hierarchy is at level 1
A cell at level i corresponds to the union of the areas of its children at level i + 1
A cell at a higher level is partitioned to form a number of cells at the next lower level
Statistical information for each cell is calculated and stored beforehand and is used to answer queries
Cell parameters
Attribute-independent parameter:
n - number of objects (points) in this cell
Attribute-dependent parameters:
m - mean of all attribute values in this cell
s - standard deviation of all values of the attribute in this cell
min - the minimum value of the attribute in this cell
max - the maximum value of the attribute in this cell
distribution - the type of distribution that the attribute values in this cell follow
Parameter Generation n, m, s, min, and max of bottom-level cells are
calculated directly from the data
The distribution type can either be assigned by the user or be obtained by a hypothesis test, e.g. the χ2 test
Parameters of higher-level cells are calculated from the parameters of their lower-level cells.
continue…..
Let n, m, s, min, max, dist be the parameters of the current cell,
and ni, mi, si, mini, maxi and disti be the parameters of the corresponding lower-level cells
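The aggregation of n, m, s, min, max from child cells can be sketched as follows (a minimal sketch; the pooled mean/variance formulas are the standard ones implied by the definitions, and the tuple layout of a child cell is an illustrative assumption):

```python
import math

def merge_cells(children):
    """Aggregate (n, m, s, min, max) of child cells into the parent cell.

    Each child is a tuple (n_i, m_i, s_i, min_i, max_i).
    """
    n = sum(c[0] for c in children)
    # Pooled mean: each child's mean weighted by its point count.
    m = sum(c[0] * c[1] for c in children) / n
    # Pooled std: E[X^2] - (E[X])^2 accumulated over all children.
    m2 = sum(c[0] * (c[2] ** 2 + c[1] ** 2) for c in children) / n
    s = math.sqrt(m2 - m * m)
    lo = min(c[3] for c in children)
    hi = max(c[4] for c in children)
    return n, m, s, lo, hi
```

For example, merging a child holding points {0, 2} with a child holding points {4, 6} gives the same n, m, s, min, max as computing them directly over {0, 2, 4, 6}.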
dist for the Parent Cell Set dist to the distribution type followed by most points in
this cell
Then count the conflicting points in the child cells; call this count confl:
1. If disti ≠ dist but mi ≈ m and si ≈ s, then confl is increased by an amount of ni;
2. If disti ≠ dist and either mi ≈ m or si ≈ s is not satisfied, then confl is set to n;
3. If disti = dist, mi ≈ m and si ≈ s, then confl is increased by 0;
4. If disti = dist but either mi ≈ m or si ≈ s is not satisfied, then confl is set to n.
continue….. If confl/n is greater than a threshold t, set dist to NONE.
Otherwise keep the original type.
Example:
continue….. Parameters for the parent cell would be
n = 220 m = 20.27 s = 2.37
min = 3.8 max = 40 dist = NORMAL
210 points follow distribution type NORMAL
Set dist of the parent to NORMAL
confl = 10
confl/n = 10/220 ≈ 0.045 < 0.05, so keep the original type.
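The rules for setting the parent's dist can be sketched as below (a minimal sketch; the approximate-equality tolerance `tol` and the threshold `t` are illustrative assumptions, and each child is a hypothetical tuple (ni, disti, mi, si)):

```python
def parent_dist(dist, m, s, children, t=0.05, tol=0.5):
    """Decide the parent cell's distribution type from its children.

    `dist`, `m`, `s` are the parent's majority distribution, mean and std.
    Each child is (n_i, dist_i, m_i, s_i).
    """
    n = sum(c[0] for c in children)
    confl = 0
    for ni, di, mi, si in children:
        stats_ok = abs(mi - m) <= tol and abs(si - s) <= tol
        if di != dist and stats_ok:
            confl += ni          # rule 1: type differs, stats agree
        elif di != dist:
            confl = n            # rule 2: type and stats both differ
            break
        elif not stats_ok:
            confl = n            # rule 4: same type, stats differ
            break
        # rule 3: same type and matching stats contributes nothing
    return "NONE" if confl / n > t else dist
```

With the slide's numbers (210 NORMAL points plus 10 conflicting points out of 220), confl/n ≈ 0.045 < 0.05, so the parent keeps dist = NORMAL.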
Query Types The STING structure is capable of answering various queries
But if it cannot, we always have the underlying database
Even if the statistical information is not sufficient to answer a query exactly, we can still generate a possible set of answers.
Common queries
Select regions that satisfy certain conditions
Select the maximal regions that have at least 100 houses per unit
area, at least 70% of the house prices above $400K, and total area at least 100 units, with 90% confidence
SELECT REGION
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (400000, ∞) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND WITH CONFIDENCE 0.9
continue…. Select regions and return some function of the region
Select the range of ages of houses in those maximal regions where there
are at least 100 houses per unit area and at least 70% of the houses have a price between $150K and $300K, with area at least 100 units, in California.
SELECT RANGE(age)
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (150000, 300000) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND LOCATION California
Algorithm With the hierarchical structure of grid cells on hand,
we can use a top-down approach to answer spatial data mining queries
For any query, we begin by examining cells on a high-level layer
Using the parameters of each cell, we calculate the likelihood that the cell is relevant to the query at some confidence level
If the distribution type is NONE, we estimate the likelihood using distribution-free techniques instead
continue…. After we obtain the confidence interval, we label the
cell as relevant or not relevant at the specified confidence level
We then proceed to the next layer, but consider only the children of the relevant cells of the upper layer
We repeat this until we reach the bottom layer
The relevant cells of the bottom layer have enough statistical information to give a satisfactory answer to the query.
However, for more accurate mining we may retrieve the data corresponding to the relevant cells and process it further.
Finding Regions After we have all the relevant cells at the bottom level,
we need to output the regions that satisfy the query
We can do this using breadth-first search
Breadth-First Search We examine the cells within a certain distance of
the center of the current cell
If the average density within this small area is greater than the specified density, mark this area
Put the relevant cells just examined into a queue.
Take an element from the queue and repeat the same procedure, except that only relevant cells not examined before are enqueued. When the queue is empty, we have identified one region.
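The region-growing step can be sketched as a BFS over the relevant bottom-layer cells (a minimal sketch assuming cells are identified by (row, col) grid coordinates; it uses simple 4-connectivity, whereas the full algorithm also examines cells within a small distance and checks the average density of the surrounding area):

```python
from collections import deque

def find_regions(relevant):
    """Group relevant bottom-layer cells into connected regions via BFS.

    `relevant` is a set of (row, col) coordinates of relevant cells.
    Returns a list of regions, each a set of cell coordinates.
    """
    unseen = set(relevant)
    regions = []
    while unseen:
        start = unseen.pop()
        region, queue = {start}, deque([start])
        while queue:
            r, c = queue.popleft()
            # Enqueue only relevant neighbors not examined before.
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in unseen:
                    unseen.remove(nb)
                    region.add(nb)
                    queue.append(nb)
        regions.append(region)   # queue empty: one region identified
    return regions
```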
Statistical Information Grid-based Algorithm 1. Determine a layer to begin with.
2. For each cell of this layer, we calculate the confidence interval (or estimated range) of probability that this cell is relevant to the query.
3. From the interval calculated above, we label the cell as relevant or not relevant.
4. If this layer is the bottom layer, go to Step 6; otherwise, go to Step 5.
5. We go down the hierarchy structure by one level. Go to Step 2 for those cells that are children of the relevant cells of the higher-level layer.
6. If the specification of the query is met, go to Step 8; otherwise, go to Step 7.
7. Retrieve the data that fall into the relevant cells and do further processing. Return the results that meet the requirements of the query. Go to Step 9.
8. Find the regions of relevant cells. Return those regions that meet the requirement of the query. Go to Step 9.
9. Stop.
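Steps 2-5 above amount to a top-down traversal that prunes irrelevant subtrees. A minimal sketch (the predicate `is_relevant` stands in for the confidence-interval test on a cell's stored parameters, and `children` returns a cell's next-layer cells; both are assumed to be supplied by the caller):

```python
def sting_query(root, is_relevant, children):
    """Top-down STING traversal returning the relevant bottom-layer cells.

    `is_relevant(cell)` abstracts the per-cell confidence-interval test;
    `children(cell)` returns the cell's children at the next layer
    (an empty list at the bottom layer).
    """
    layer = [root]
    while True:
        # Steps 2-3: keep only the cells labeled relevant.
        layer = [cell for cell in layer if is_relevant(cell)]
        # Step 5: descend only into children of relevant cells.
        next_layer = [child for cell in layer for child in children(cell)]
        if not next_layer:       # Step 4: bottom layer reached
            return layer
        layer = next_layer
```

Note that children of cells labeled not relevant are never examined, which is what keeps the traversal cost proportional to the number of cells actually visited.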
Time Analysis: Step 1 takes constant time. Steps 2 and 3 take
constant time per cell.
So the total time is at most proportional to the total number of cells in our hierarchical structure.
Notice that, with four children per cell, the total number of cells is approximately 1.33K (= K(1 + 1/4 + 1/16 + …)), where K is the number of cells at the bottom layer.
So the overall computational complexity on the grid hierarchy structure is O(K)
Time Analysis: STING goes through the database once to compute the
statistical parameters of the cells
The time complexity of generating clusters is therefore O(n), where n is the total number of objects.
After the hierarchical structure is generated, the query processing time is O(g), where g is the total number of grid cells at the lowest level, which is usually much smaller than n.
Comparison
CLIQUE: A Dimension-Growth Subspace Clustering Method The first dimension-growth subspace clustering algorithm
Clustering starts in single-dimension subspaces and moves upward to higher-dimensional subspaces
This algorithm can be viewed as an integration of density-based and grid-based algorithms
Informal problem statement Given a large set of multidimensional data points, the
data space is usually not uniformly occupied by the data points.
CLIQUE’s clustering identifies the sparse and the “crowded” areas in space (or units), thereby discovering the overall distribution patterns of the data set.
A unit is dense if the fraction of total data points contained in it exceeds an input model parameter.
In CLIQUE, a cluster is defined as a maximal set of connected dense units.
Formal Problem Statement Let A = {A1, A2, . . . , Ad} be a set of bounded, totally
ordered domains and S = A1 × A2 × · · · × Ad a d-dimensional numerical space.
We will refer to A1, . . . , Ad as the dimensions (attributes) of S.
The input consists of a set of d-dimensional points V = {v1, v2, . . . , vm},
where vi = (vi1, vi2, . . . , vid). The jth component of vi is drawn from domain Aj.
CLIQUE Working A 2-step process
1st step - partition the d-dimensional data space
2nd step - generate a minimal description of each cluster.
1st step - Partitioning Partitioning is done separately for each dimension.
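The per-dimension partitioning can be sketched as dividing each dimension into equal-width intervals and keeping the dense ones (a minimal sketch; the interval count and the density threshold `tau` are the input model parameters, and the function signature is an illustrative assumption):

```python
def dense_units_1d(values, low, high, n_intervals, tau):
    """Partition one dimension into equal-width intervals and return
    the indices of the dense units.

    A unit is dense if the fraction of all points falling into it
    exceeds the input model parameter `tau`.
    """
    width = (high - low) / n_intervals
    counts = [0] * n_intervals
    for v in values:
        # Clamp so that v == high lands in the last interval.
        idx = min(int((v - low) / width), n_intervals - 1)
        counts[idx] += 1
    total = len(values)
    return [i for i, c in enumerate(counts) if c / total > tau]
```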
continue…. The subspaces representing these dense units are
intersected to form a candidate search space in which dense units of higher dimensionality may exist.
This approach to selecting candidates is quite similar to the Apriori-Gen process of generating candidates.
It relies on the observation that if a unit is dense in a higher-dimensional space, it cannot be sparse in its lower-dimensional projections.
More formally If a k-dimensional unit is dense, then so are its projections
in (k-1)-dimensional space.
Given a k-dimensional candidate dense unit, if any of its (k-1)-dimensional projection units is not dense, then the k-dimensional unit cannot be dense
So we can generate candidate dense units in k-dimensional space from the dense units found in (k-1)-dimensional space
The resulting space searched is much smaller than the original space.
The dense units are then examined in order to determine the clusters.
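The candidate generation step can be sketched Apriori style (a minimal sketch; a unit is represented as a frozenset of hypothetical (dimension, interval_index) pairs, and the join here pairs any two compatible units rather than using the paper's ordered-prefix join):

```python
from itertools import combinations

def candidates(dense_prev, k):
    """Generate candidate k-dimensional dense units from the
    (k-1)-dimensional dense units.

    Each unit is a frozenset of (dimension, interval_index) pairs.
    A candidate is kept only if every (k-1)-dimensional projection
    of it is dense (monotonicity of density).
    """
    dense_prev = set(dense_prev)
    out = set()
    for a, b in combinations(dense_prev, 2):
        cand = a | b
        dims = {d for d, _ in cand}
        if len(cand) != k or len(dims) != k:
            continue  # units overlap or share a dimension
        # Prune: every (k-1)-dimensional projection must be dense.
        if all(frozenset(sub) in dense_prev
               for sub in combinations(cand, k - 1)):
            out.add(cand)
    return out
```

For instance, joining the 1-dimensional dense units {(salary, i)} and {(vacation, j)} yields the 2-dimensional candidate {(salary, i), (vacation, j)}, while two units on the same dimension are never joined.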
Intersection
Dense units found with respect to age for the dimensions salary and vacation are intersected in order to provide a candidate
search space for dense units of higher dimensionality.
2nd step - Minimal Description For each cluster, CLIQUE determines the maximal
region that covers the cluster of connected dense units.
It then determines a minimal cover (a logic description) for each cluster.
Effectiveness of CLIQUE CLIQUE automatically finds subspaces of the highest
dimensionality such that high-density clusters exist in those subspaces.
It is insensitive to the order of input objects
It scales linearly with the size of the input
It scales easily with the number of dimensions in the data
Thank You
References:
Wei Wang, Jiong Yang, and Richard Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. Department of Computer Science, University of California, Los Angeles, February 20, 1997.
Jiawei Han (University of Illinois at Urbana-Champaign) and Micheline Kamber. Data Mining: Concepts and Techniques, Second Edition.