parsymony: scalable parallel data mining

Alok Choudhary [email protected] 1 Northwestern University

PARSYMONY: Scalable Parallel Data Mining

Alok N. ChoudharyNorthwestern University

(ACK: Harsha Nagesh (Bell Labs) and Sanjay Goil (Sun)


Outline

• Overview of PARSIMONY• MAFIA (Subspace clustering)• pMAFIA (Parallel Subspace Clustering)• Performance Results• PARSIMONY

– multidimensional data analysis– parallel classification

• Summary


Overview of Knowledge Discovery Process


PARSIMONY Overview


MAFIA: Subspace Clustering for High-Dimensional Data Sets

•Clustering and subspace clustering•Base MAFIA Algorithm•pMAFIA (parallelization)•Performance Results


Clustering Multi-dimensional Dimensional Data Sets

Determine range of attributes in each dimension of the cluster(s)


Issues to be addressed• Basic algorithm computational optimizations• Scalability with database size (out of core data

sets)• Scalability with the dimensionality of data • Efficient Parallelization• Recognition of arbitrary shaped clusters


Related Work• Partition based Clustering:

– User specified k representative points taken as cluster centers and points assigned to cluster centers

• k-means, k-mediods, CLARANS (VLDB 94), BIRCH (SIGMOD 96),..• Consider clustering partitioning of points

• Hierarchical Clustering:– Each point is a cluster. Merge similar points together gradually.

• CURE (SIGMOD 98) (use sampling)• Categorical Clustering:

– Clustering of categorical data e.g automobile sales data: color, year, model, price, etc

– Best suited for non-numerical data• CACTUS (KDD 99), STIRR (VLDB 98)


Related Work • Density and Grid Based Clustering :

– Clusters are high density regions than its surroundings• WaveCluster (VLDB 98), DBSCAN, CLIQUE (SIGMOD 98)

– Number of subspaces is exponential in the data dimensionality– Multidimensional space divided into grids. The histogram in each

hyper-rectangle is found. Grid regions with a significant histogram value are cluster regions.

– Post-processing done to grow the connected cluster regions.– Fine grid size results in explosion in the number hyper-rectangles,

coarser grids fail to detect clusters.– Correct Grid Size is very critical !


Subspace Clustering • Observation: “If a collection of points S is a cluster in

a k-dimensional space, then S is also a part of a cluster in any (k-1) dimensional projection of the space”

• Algorithm : Growing clusters…candidate dense units in any k dimensions are obtained by merging dense units in (k-1) dimensions which share any (k-2) dimensions.– Ex: ( {a1,b7,d9}, {b7,c8,d9} ) --> {a1,b7,c8,d9}– Candidate dense units are populated by a pass on the data set

and the dense units are found out in each dimension.– Dense units found are combined to form candidate dense units.– Algorithm terminates when no more candidate dense units

found.


Adaptive Grids (reducing computation in practice)

• Automatic Grid fitting based on data distribution– MAFIA : Merging of Adaptive Finite Intervals !

• Optimal Bins in each dimension leads to very few units in the grid (candidate dense units)

– (a) : CLIQUE– (b) : MAFIA


Base MAFIA Algorithm• Divide each dimension into very fine regions.• Compute histogram in these regions along every

dimension.• Set the value of a sliding window to the

maximum histogram value in the window.• Adjacent units which have nearly same histogram

values are merged together to form larger bins.• Threshold of each bin formed is computed

automatically. • A bin having a histogram value much greater (by a factor 2~3)

than that of equi-distribution of data is DENSE.


MAFIA Algorithm (merging dense units) Algorithm : Candidate dense units in any k

dimensions are obtained by merging dense units in (k-1) dimensions which share any (k-2) dimensions.

Ex: ( {a1,c7,b8}, {c7,b8,d9} ) --> {a1,c7,b8,d9}


• CLIQUE : CDUs in any dimension k is formed by combining dense units of dimension (k-1) which share first (k-2) dimensions.

• MAFIA: CDUs in any dimension k is formed by combining dense units of dimension (k-1) which share any (k-2) dimensions.

Data Set

CLIQUE

MAFIA

Fixed Grids

Adaptive Grids

Fixed Grids first (k-2) algo

any (k-2) algo

Huge CDU Set

Non Cluster Dims reported

Huge Search Space

much reduced CDU Set

Correct Cluster Dims reported

Reduced Search Space

Dimensions aware of data distribution


Parallel MAFIA

• pMAFIA: Parallel Subspace Clustering– Scalable in data size and number of dimensions

Grid and Density based clustering algorithm– Parallelization provides speedup for the subspace

clustering algorithm.


pMAFIA Algorithm:

Each processor reads part of the data in a Data Parallel fashion and constructs histogram in every dimension. // Data read in chunks (out of core) of data

Reduce communication to obtain global histogram. All processors build Adaptive grids using the histogram. // Each bin formed is a Candidate Dense Unit.

Current Dimension k = 1 while (no more dense units found)

if( k > 1) Build Candidate Dense units();

Populate the candidate dense units in a data parallel fashion and in chunks (out-of-core) of B records.

Reduce communication to obtain global CDU population. Identify the dense units (); Build dense unit data structures(); // for the next higher dimension


Build-Candidate-Dense-Units Current Dimension(k) = 3.


Build Candidate Dense Units• Each dense unit is compared with every other dense unit to form CDUs,

resulting in an O(Ndu2) algorithm.(Ndu-number of dense units)• For large values of Ndu CDUs built in parallel. Processors 0,..,(p-1) work in parallel on parts of total Ndu dense units.

Processor k compares dense units between Ni and Ni+1 with all the other dense units; for optimal task partitioning we have

• Identical CDUs generated during the process need to be discarded. Each generated CDU compared with every other CDU to identify similar ones

resulting in O(Ncdu2) algorithm.


Parallelization (Building CDUs)

Total work done by Pi is shown


pMAFIA : ANALYSIS• Data parallelism in populating the CDUs in every dimension : effective for massive data sets.• Gains of task parallelism realized when data contains large number of dense units (clusters).

– A ‘k’ dimensional dense unit allocated just 2k bytes of memory, k bytes for dimensions and k for bin indices.

– Data structures in form of linear arrays of bytes: communicate very small message buffers, space optimization.

• Although bottom-up algorithm is exponential in the data dimension, for low subspace coverage, with use of Adaptive grids and parallel formulation very promising results.– k dimension of the highest dimension dense unit, we explore all possible subspaces of these k

dimensions ==> O(ck)– For k passes over the data set O( (N/pB) * k*Tio),

• N-total number of records, p-processors, • B- records per chunk, Tio-I/O access time for a block of B records.

– Communication overhead results in O(Tcomm * S * p * k), • Tcomm - constant for communication, S- size of message exchanged,.

O(ck + (N/pB) * k*Ti + Tcomm * S * p * k )


pMAFIA Outline

• Clustering and subspace clustering• Base MAFIA Algorithm• pMAFIA (parallelization)• Performance Results

– Adaptivity performance – Scalability with data set size.– Scalability with data dimensionality.– Scalability with cluster dimensionality.– Some Real world data sets.


Advantage of Adaptive Grids

• 300,000 records in a 15 Dimension space with 1 cluster of 5 dimensions.– A speedup of 80 obtained

over CLIQUE. – CLIQUE failed to produce

results with our modified CDU generation algorithm even in 2 hours on 16 processors.

– This relatively small data set mined in 32 seconds on 1 processor


Scalability with Data Set Size

• 20 Dimension data with 5 clusters in 5 different subspaces

• data sets :up to 11.8 million records

• Clusters detected in just about 3 minutes on 16 processors !

• Almost Linear with the increase in the data set size (because most time in scanning)


Parallelization (on IBM SP2)

• 30 Dimension data with 8.3 million records, 5 clusters each in a 6 dimension subspace.

• Near linear speedups• Negligible Communication overheads (<1%)


Data Dimensionality

• 250,000 records, 3 clusters in different 5 dimensional subspaces.

• Near linear behavior with data dimensionality : Algorithm depends on the maximum number of dimensions in a clusters and not on the data dimensionality.


Cluster Dimensionality

• 50 dimension data with 1 cluster, 650,000 records; cluster dimension from 3 to 10.

• Behavior in line with the order of the algorithm : increases with subspace coverage of the cluster.


Scalability on Movie Data

72,916 users rated 1628 movies in 18 months: 2.8 Million ratings

4D data: {user-id, movie-id, weight, score}

Discovered seven interesting 2d clusters !


Other Data Sets

• One day Ahead Prediction of DAX (German Stock Exchange)– DAX prediction data set was based on a 12 input time series like stock indices, bond

indices, gold prices, etc– 22 dimensions with 2757 records: Major gains from task parallelism– Mined clusters in 8.16 seconds on 8 processors.– Unique clusters discovered in 3,4,5 and 6 dimensional subspaces.

• Ionosphere data: (UCI repository)– Radar data collected in Goose Bay, Labrador

• 34 dimension data, 351 records.• Discovered unique clusters only in 3 and 4 dimensional sub spaces.


Summary and Conclusions for pMAFIA

• MAFIA : unsupervised subspace clustering algorithm• Introduced the adaptive grids formulation• First parallel subspace clustering algorithm for

massive data sets– Incorporates both task and data parallelism. .

• pMAFIA : Scalable and Parallel Implementation in data size, no of dimensions


PARSIMONY Overview


Sparse Data Structures and Representations


Using OLAP Framework for

parsymony: scalable parallel data mining

Documents

clustering algorithm

grid regions

connected cluster regions

cluster centersk

core data setsscalability

average cluster dimensionality

user specified grid

relevant cluster dimensions