TRANSCRIPT
CIS664 KD&DM
SPARTAN: A Model-Based Semantic Compression System for
Massive Data Tables
S. Babu, M. Garofalakis, R. Rastogi
Presented by Uroš Midić
Apr 11, 2007
Introduction
Definitions
Fascicles compression method
Model-based semantic compression
SPARTAN
Experimental results
Results
Introduction
• Massive amounts of data are being produced daily in various environments:
• Corporate: communications traffic, banking
• Scientific: high throughput experiments, sensor networks, satellites
• Governmental: surveys
• Compression can reduce the amount of space for storage, and bandwidth for transfer of data.
Introduction
• Many statistical and dictionary-based lossless compression methods have been developed and proven (asymptotically) optimal with respect to various criteria.
• Some very effective lossy compression methods have been developed for specific applications, such as multimedia: pictures, sound, movies
Introduction
• When dealing with large databases, traditional syntactic compression approaches are usually not very effective in the general case.
• However, the data often contains semantic patterns that syntactic compression approaches do not exploit.
• Due to the nature of many data-analysis applications, we can afford lossy compression, as long as it guarantees an upper bound on the compression error.
Definitions
• n-attribute table T
• X = {X1, …, Xn} is the set of n attributes of T and dom(Xi) is the domain of Xi
• Tc is the compressed version of T
• |Tc| and |T| denote the storage-space requirements (in bytes) for Tc and T
• e = [e1, …, en] defines the per-attribute acceptable degree of information loss
Definitions
• e = [e1, …, en] defines the per-attribute acceptable degree of information loss
• For a numeric attribute Xi, ei is an upper bound on the absolute difference between the actual value of Xi in T and its approximate value in Tc.
• For a categorical attribute Xi, ei is an upper bound on the probability that the approximate value of Xi in Tc differs from the actual value in T.
Fascicles compression method
• The “Fascicles” method searches for groups of rows in the table (fascicles) that have similar values for some attributes. Within a fascicle, the values of these attributes can be replaced by a single representative value.
• Example: e = [2, 5000, 25000, 0]
Original:
30 |  90,000 | 200,000 | Good
50 | 110,000 | 250,000 | Good
70 |  35,000 | 125,000 | Poor
75 |  15,000 | 100,000 | Poor
Compressed (the third attribute is replaced by one fascicle representative per group of rows):
30 |  90,000 | 225,000 | Good
50 | 110,000 | 225,000 | Good
70 |  35,000 | 112,500 | Poor
75 |  15,000 | 112,500 | Poor
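Under the tolerances above (e = [2, 5000, 25000, 0]), a group of values can share one representative only when its range fits within twice the per-attribute tolerance, and the midpoint of the range then serves as the representative. A minimal sketch (the helper name is illustrative, not from the paper):

```python
# Sketch of the fascicle merge rule for one attribute of one candidate group.
# A shared value v may replace actual values x when |x - v| <= e, i.e. a
# group's value range [lo, hi] qualifies when hi - lo <= 2*e.

def compress_column(values, e):
    """Replace values by their range midpoint if the range fits within 2*e."""
    lo, hi = min(values), max(values)
    if hi - lo <= 2 * e:
        return (lo + hi) / 2           # one representative for the fascicle
    return None                        # cannot compress within tolerance

# Assets of rows 1-2 and rows 3-4 from the example (e = 25000):
print(compress_column([200000, 250000], 25000))  # -> 225000.0
print(compress_column([125000, 100000], 25000))  # -> 112500.0
# Salaries of rows 1-2 (e = 5000) are too far apart to share a representative:
print(compress_column([90000, 110000], 5000))    # -> None
```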
Model-Based Semantic Compression
Given a table T and a vector of tolerances e,
find a subset {X1, …, Xp} of X and a set of models {M1, …, Mp}, such that each model Mi predicts the values of attribute Xi from the attributes in X \ {X1, …, Xp} without exceeding the error bound ei.
Models Mi come from a predefined family of models (e.g. CaRTs).
Model-Based Semantic Compression
Then Tc = <T’, {M1, …, Mp}>, where T’ is the projection of T onto X \ {X1, …, Xp}, is a compressed version of T.
The problem is to find {X1, …, Xp} and {M1, …, Mp} such that the storage requirement |Tc| of the compressed table is minimized.
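The compressed representation stores only the materialized attributes plus one model per predicted attribute; the full table is rebuilt by evaluating the models. A minimal sketch, in which the attribute names and the stand-in predictor function are illustrative, not from the paper:

```python
# Sketch of the MBSC representation Tc = <T', {M1, ..., Mp}>.
# T' keeps only the materialized attributes; each predicted attribute
# is reconstructed by applying its model to a stored row.

t_prime = [{"age": 30, "salary": 90000},   # materialized attributes only
           {"age": 70, "salary": 35000}]

# Hypothetical predictor standing in for a CaRT: one split on salary.
models = {"credit": lambda row: "Good" if row["salary"] >= 50000 else "Poor"}

def decompress(t_prime, models):
    """Rebuild full rows by evaluating each model on the stored attributes."""
    return [dict(row, **{a: m(row) for a, m in models.items()}) for row in t_prime]

for row in decompress(t_prime, models):
    print(row)   # each row regains its predicted 'credit' attribute
```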
SPARTAN
Overview
SPARTAN
DependencyFinder
Finds the structure of a Bayesian network from the data.
● Constraint-based methods
● Scoring-based methods
SPARTAN uses a scoring-based algorithm that relies on pairwise, mutual-information-based conditional independence (CI) tests. It requires at most O(n^4) CI tests. Each CI test requires one pass over the data, therefore only a sample of table T is used.
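The CI tests score attribute pairs by their mutual information estimated from the sample. A minimal sketch of empirical mutual information between two attributes (not the paper's full structure-search algorithm):

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits from paired samples."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    # I(X;Y) = sum over (x,y) of p(x,y) * log2(p(x,y) / (p(x) * p(y)))
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Perfectly dependent pair carries 1 bit; an independent-looking pair ~0 bits.
print(mutual_information([0, 1, 0, 1], [0, 1, 0, 1]))  # -> 1.0
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # -> 0.0
```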
SPARTAN
CaRTSelector
● The problem of choosing a storage-optimal subset Xpred of attributes to be predicted from the attributes in X \ Xpred is NP-hard.
● SPARTAN uses two heuristic approaches to select Xpred. Both rely on the Bayesian network provided by the DependencyFinder.
SPARTAN
CaRTSelector – Greedy approach
● Greedy selection algorithm visits the attributes in the order imposed by the Bayesian network:
1. If Xi has no parent nodes in the BN, it is placed in Xmat (the set of attributes to be materialized).
2. If Xi has parent nodes, CaRTBuilder is invoked to construct a CaRT-based predictor for Xi (within the specified error tolerance ei) using the attributes currently in Xmat. Xi is chosen for prediction if the estimated storage benefit is at least θ (an input parameter), and materialized otherwise.
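The two steps above can be sketched as follows, assuming a topological order over the BN and a hypothetical build_cart() callback that returns a predictor and its estimated storage benefit (both names are illustrative):

```python
# Sketch of the greedy CaRT selection over attributes in BN order.

def greedy_select(attrs_in_bn_order, parents, build_cart, theta):
    """Materialize root attributes; predict the rest when benefit >= theta."""
    x_mat, x_pred, models = [], [], {}
    for xi in attrs_in_bn_order:
        if not parents[xi]:                    # no parents in the BN: materialize
            x_mat.append(xi)
            continue
        model, benefit = build_cart(xi, x_mat)  # predictor from materialized attrs
        if benefit >= theta:                   # storage benefit clears threshold
            x_pred.append(xi)
            models[xi] = model
        else:
            x_mat.append(xi)
    return x_mat, x_pred, models

# Toy usage: 'a' has no parents; 'b' depends on 'a' with benefit 10 >= theta.
parents = {"a": [], "b": ["a"]}
fake_build = lambda xi, mat: ("cart_for_" + xi, 10)
print(greedy_select(["a", "b"], parents, fake_build, theta=5))
# -> (['a'], ['b'], {'b': 'cart_for_b'})
```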
SPARTAN
CaRTSelector – Greedy approach
+ Greedy constructs at most (n-1) CaRT predictors.
- The decision whether Xi is predicted or materialized is based solely on its “localized” prediction benefit.
- Another problem is the parameter θ:
• If θ is too high, too few attributes may be chosen for prediction.
• If θ is too low, low-benefit predictors chosen early may exclude higher-benefit predictors later.
SPARTAN
CaRTSelector – MaxIndependentSet selector
1. Xmat = X
2. For each Xi in Xmat, build a CaRT for Xi based on its “predictive neighborhood” and estimate the storage-cost benefit.
3. Build a node-weighted undirected graph Gtemp on X with:
● an edge (Xi, Y) for every Y used in the predictor for Xi
● the weight of each node Xi set to the storage-cost benefit of predicting Xi
4. Run a heuristic for the Weighted Maximum Independent Set (WMIS) problem on Gtemp.
5. If the sum of storage-cost benefits of the attributes in the resulting set S is negative, stop.
6. Otherwise, move these nodes from Xmat to Xpred and go to 2.
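One round of steps 3-4 can be sketched with a simple greedy stand-in for the WMIS heuristic (the paper's actual WMIS heuristic is not specified here; node names and weights are illustrative):

```python
# Sketch of one WMIS round on G_temp: nodes are attributes, node weight is
# the estimated storage benefit of predicting that attribute, and an edge
# (Xi, Y) means Y is used in the predictor for Xi.

def greedy_wmis(weights, edges):
    """Greedy heuristic: repeatedly take the heaviest positive-benefit node
    not adjacent to anything already selected."""
    adj = {v: set() for v in weights}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    selected = set()
    for v in sorted(weights, key=weights.get, reverse=True):
        if weights[v] > 0 and adj[v].isdisjoint(selected):
            selected.add(v)
    return selected

# 'age' is used to predict 'salary' and 'assets'; 'salary' predicts 'credit'.
w = {"age": -5, "salary": 8, "assets": 6, "credit": 4}
print(greedy_wmis(w, [("age", "salary"), ("age", "assets"), ("salary", "credit")]))
# selects 'salary' and 'assets'; 'credit' is excluded since 'salary' predicts it
```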
SPARTAN
CaRTSelector – MaxIndependentSet selector
We look for a Weighted Maximum Independent Set in Gtemp because:
1. We want to maximize the storage-cost benefits (sum of weights in the nodes).
2. We do not want to select a pair of nodes that are connected with an edge (because one would be needed to predict another).
Note that in step 2 we need to take into account the possible negative effect of moving Xi to Xpred, because Xi might previously have been assigned to predict some Xj. Therefore, the benefit may be negative.
SPARTAN
CaRTSelector – MaxIndependentSet selector
● This heuristic builds at most O(rn^2) predictors and runs the WMIS heuristic at most O(n) times (r is an upper bound on the number of predictor attributes used in any CaRT).
● Under fairly loose assumptions, these bounds become O(rn log n) and O(log n).
● There are no parameters to this heuristic.
SPARTAN
CaRTBuilder
● For categorical attributes, an algorithm previously published by the authors is used that builds a low-storage-cost CaRT and also explicitly stores a sufficient number of outliers so that the fraction of misclassified values does not exceed the tolerance ei.
● For numeric attributes, a novel algorithm (also previously published by the authors) efficiently builds a compact regression tree with a guaranteed error bound.
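The outlier idea for categorical attributes can be illustrated for a single leaf (this is only a sketch of the principle, not the authors' tree-building algorithm): the leaf predicts its majority value, and just enough misclassified rows are stored explicitly for the remaining error fraction to satisfy the bound.

```python
from collections import Counter

def leaf_with_outliers(values, e):
    """Predict the majority value in a leaf; store enough misclassified rows
    as explicit outliers so the remaining error fraction is <= e."""
    majority, _ = Counter(values).most_common(1)[0]
    wrong = [i for i, v in enumerate(values) if v != majority]
    allowed = int(e * len(values))           # misclassifications we may leave
    outliers = {i: values[i] for i in wrong[: max(0, len(wrong) - allowed)]}
    return majority, outliers

vals = ["Good", "Good", "Good", "Poor"]
print(leaf_with_outliers(vals, 0.0))   # -> ('Good', {3: 'Poor'})
print(leaf_with_outliers(vals, 0.25))  # -> ('Good', {})
```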
SPARTAN
RowAggregator
A fascicle-based algorithm is used to further compress the table T’ of predictor attributes. Because the compression error in T’ can compound the prediction error of the models Mi, the following restriction is added to the fascicle algorithm:
A range of values [x’, x’’] for attribute Xi can be represented in a fascicle by (x’ + x’’)/2 only if x’’ - x’ < 2ei and none of the split values for Xi belongs to [x’, x’’].
(vi is a split value for Xi if there is a split condition Xi > vi in some CaRT Mj.)
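The restriction amounts to a simple check before merging a value range (the function name is illustrative; the list of split values is assumed to be collected from the CaRTs):

```python
# Sketch of the RowAggregator's extra restriction: a range [x_lo, x_hi] for
# attribute Xi may be collapsed to its midpoint only if it fits the tolerance
# AND contains no CaRT split value for Xi.

def can_merge(x_lo, x_hi, e_i, split_values):
    """True iff x_hi - x_lo < 2*e_i and no split value v_i lies in the range."""
    if x_hi - x_lo >= 2 * e_i:
        return False
    return not any(x_lo <= v <= x_hi for v in split_values)

# Range 100-108 fits e_i = 5 but straddles a split at 105, so it is rejected:
# collapsing it could flip the outcome of the split condition Xi > 105.
print(can_merge(100, 108, 5, [105]))  # -> False
print(can_merge(100, 108, 5, [120]))  # -> True
```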
Experimental results
● The paper reports a number of positive experimental results on various real-life datasets, compared both to gzip and to the Fascicles method.
● For all examples, SPARTAN's compression ratio decreased as the error tolerance increased.
● Fascicles was more efficient on only one dataset (census data), and only at a very high error tolerance.
Experimental results
● The Greedy algorithm had a better (compression ratio / running time) trade-off.
● It did not pay off to use a very large sample for building the BN, or to increase the error tolerance there.
References
● S. Babu, M. Garofalakis, R. Rastogi. “SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables”. ACM SIGMOD, May 2001, Santa Barbara, CA, pp. 283-295.
● J. Cheng, D. A. Bell, W. Liu. “Learning Belief Networks from Data: An Information Theory Based Approach”. Proc. of the 6th Intl. Conf. on Information and Knowledge Management, 1997.
● R. Rastogi, K. Shim. “PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning”. Proc. of the 24th Intl. Conf. on Very Large Data Bases, 1998.
● S. Babu, M. Garofalakis, R. Rastogi. “SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables”. Bell Labs Tech. Report, 2001.