

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.

International Journal of Data Warehousing & Mining, 1(2), 49-69, April-June 2005

Hybrid Query and Data Ordering for Fast and Progressive Range-Aggregate Query Answering

Cyrus Shahabi, University of Southern California, USA

Mehrdad Jahangiri, University of Southern California, USA

Dimitris Sacharidis, University of Southern California, USA

ABSTRACT

Data analysis systems require range-aggregate query answering of large multidimensional datasets. We provide the necessary framework to build a retrieval system capable of providing fast answers with progressively increasing accuracy in support of range-aggregate queries. In addition, with error forecasting, we provide estimations on the accuracy of the generated approximate results. Our framework utilizes the wavelet transformation of query and data hypercubes. While prior work focused on the ordering of either the query or the data coefficients, we propose a class of hybrid ordering techniques that exploits both query and data wavelets in answering queries progressively. This work effectively subsumes and extends most of the current work where wavelets are used as a tool for approximate or progressive query evaluation. The results of our experimental studies show that independent of the characteristics of the dataset, the data coefficient ordering, contrary to the common belief, is the inferior approach. Hybrid ordering, on the other hand, performs best for scientific datasets that are inter-correlated. For an entirely random dataset with no inter-correlation, query ordering is the superior approach.

Keywords: approximate query; multidimensional dataset; OLAP; progressive query; range-aggregate query; wavelet transformation

INTRODUCTION

Modern data analysis systems need to perform complex statistical queries on very large multidimensional datasets; thus, a number of multivariate statistical methods (e.g., calculation of covariance or kurtosis) must be supported. On top of that, the desired accuracy varies per application, user, and/or dataset, and it can well be traded off for faster response time. Furthermore, with progressive queries, also known as anytime algorithms (Grass & Zilberstein, 1995; Bradley, Fayyad, & Reina, 1998) or online algorithms (Hellerstein, Haas, & Wang, 1997), a measure of the quality of the running answer, such as error forecasting, is required during the query run time.

This paper appears in the journal International Journal of Data Warehousing and Mining edited by David Taniar. Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. 701 E. Chocolate Avenue, Suite 200, Hershey PA 17033-1240, USA. Tel: 717/533-8845; Fax: 717/533-8661; URL: http://www.idea-group.com



We believe that our methodologies described in this article can contribute towards this end.

We propose a general framework that utilizes the wavelet decomposition of multidimensional data, and we explore progressiveness by selecting data values to retrieve based on some ordering function. The use of the wavelet decomposition is justified by the well-known fact that the query cost is reduced from the query size to the logarithm of the data size, which is a major benefit especially for large-range queries. The main motivation for choosing ordering functions to produce different evaluation plans lies in the observation that the order in which data is retrieved from the database has an impact on the accuracy of the intermediate results (see Experimental Results). Our work, as described in this article, effectively extends and generalizes most of the related work where wavelets are used as a tool for approximate or progressive query evaluation (Vitter, Wang, & Iyer, 1998; Vitter & Wang, 1999; Lemire, 2002; Wu, Agrawal, & Abbadi, 2000; Schmidt & Shahabi, 2002; Garofalakis & Gibbons, 2002).

Our general query formulation can support any high-order polynomial range-aggregate query (e.g., variance, covariance, kurtosis) using the joint data frequency distribution, similar to Schmidt and Shahabi (2002). However, we fix the query type at range-sum queries to simplify our discussion. Furthermore, the main distinction between this article and the work in Schmidt and Shahabi (2002) is that they only studied the ordering of query wavelets, while in this article we discuss a general framework that includes the ordering of not only query and data wavelets, but also the hybrid of the two.

Let us note that our work is not a simple application of wavelets to scientific datasets. Traditionally, data transformation techniques such as wavelets have been used to compress data. The idea is to transform the raw data set to an alternative form, in which many data points (termed coefficients) become zero or small enough to be negligible, exploiting the inherent correlation in the raw data set. Consequently, the negligible coefficients can be dropped, and the rest are sufficient to reconstruct the data later with minimal error; hence the compression of data.

However, there is a major difference between the main objective of compression applications using wavelets and that of database applications. With compression applications, the main objective is to compress data in such a way that one can reconstruct the data set in its entirety with as little error as possible. Consequently, at data-generation time, one can decide which wavelets to keep and which to drop. Instead, with database queries, each range-sum query is interested in some bounded area (i.e., subset) of the data. The reconstruction of the entire signal is only one of the many possible queries.

Hence, for database applications, at data generation or population time, one cannot optimally sort the coefficients and specify which coefficients to keep or drop. Even at query time, we need to retrieve all required data coefficients to plan the optimal order. Thus, we study alternative ways of ordering both the query and data coefficients to achieve optimal progressive (or approximate) answers to polynomial queries. The major observation from our experiments is that no matter whether the data is compressible or not, ordering data coefficients alone is the inferior approach.


CONTRIBUTIONS

The main contributions of this article are as follows:

• Introduction of Ordering Functions that measure the significance of wavelet coefficients and form the basis of our framework. Ordering functions formalize and generalize the methodologies commonly used for wavelet approximation; however, following our definition, they can be used to provide exact, approximate, or progressive views of a dataset. In addition, depending on the function used, ordering functions can provide deterministic bounds on the accuracy of the dataset view.

• Incorporation of existing wavelet approximation techniques (first-B, highest-B) into our framework. To our knowledge, the only other work that takes into consideration the significance of a wavelet coefficient in answering arbitrary range queries is that of Garofalakis and Gibbons (2002). We have modified their Low-Bias Probabilistic Wavelet Synopses technique to produce the MEOW ordering function, which measures significance with respect to a predefined set of queries (workload).

• Definition of the query vector, the data vector, and the answer vector (hypercubes in the multidimensional case). By ordering any of these vectors, we construct different progressive evaluation plans for query answering. One can either apply an ordering function to the query or answer vector online as a new query arrives, or apply an ordering function to the data vector offline as a preprocessing step. We prove that online ordering of the answer vector results in the ideal, yet not feasible, progressive evaluation plan.

• Proposition of Hybrid Ordering, which uses a highly compact representation of the dataset to produce an evaluation plan that is very close to the ideal one. Careful utilization of this compact representation leads to our Hybrid* Ordering Algorithm.

• Conducting several experiments with large real-world and synthetic datasets (of over 100 million values) to compare the effectiveness of our various proposed ordering techniques under different conditions.

The remainder of the article is organized as follows. In the next section we present previous work on selected OLAP topics that are closely related to our work. Next, we introduce the ordering functions that our framework is built upon, along with the ability to forecast some error metrics. Later, we present our framework for providing progressive answers to range-sum queries, and we prescribe a number of different techniques. We then examine all the different approaches of our framework on very large and real multidimensional datasets, and draw useful conclusions about the applicability of each approach in the experimental results section. Finally, in the last section, we conclude and sketch our future work on this topic.

RELATED WORK

Gray, Bosworth, Layman, and Pirahesh (1996) demonstrated that analysis of multidimensional data was inadequately supported by traditional relational databases. They proposed a new relational aggregation operator, the data cube, that accommodates aggregation of multidimensional data. The relational model, however, is inadequate to describe such data, and an inherently multidimensional approach using sparse arrays was suggested in Zhao, Deshpande, and Naughton (1997) to compute the data cube. Since the main use of a data cube is to support aggregate queries over ranges on the domains of the dimensions, a large amount of work has focused on providing faster answers to such queries at the expense of higher update and maintenance cost. To this end, a number of pre-aggregation techniques were proposed. Ho, Agrawal, Megiddo, and Srikant (1997) proposed a data cube (Prefix Sum) in which each cell stores the summation of the values in all previous cells, so that it can answer a range-sum query in constant time (more precisely, in time O(2^d)). The update cost, however, can be as large as the size of the cube. Various techniques (Geffner, Agrawal, Abbadi, & Smith, 1999; Chan & Ioannidis, 1999) have been proposed that balance the cost of queries and updates.

For applications where quick approximate answers are needed, a number of different approaches have been taken. Histograms (Poosala & Ganti, 1999; Gilbert, Kotidis, Muthukrishnan, & Strauss, 2001; Gunopulos, Kollios, Tsotras, & Domeniconi, 2000) have been widely used to approximate the joint data distribution and therefore provide approximate answers to aggregate queries. Random sampling (Haas & Swami, 1995; Gibbons & Matias, 1998; Garofalakis & Gibbons, 2002) has also been used to calculate synopses of the data cube. Vitter and colleagues have used the wavelet transformation to compress the prefix sum data cube (Vitter et al., 1998) or the original data cube (Vitter & Wang, 1999), constructing compact data cubes. Such approximations share the disadvantage of being highly data dependent and as a result can lead to bad performance in some cases.

The notion of progressiveness in query answering with feedback, using running confidence intervals, was introduced in Hellerstein et al. (1997) and further examined in Lazaridis and Mehrotra (2001) and Riedewald, Agrawal, and Abbadi (2000). Wavelets and their inherent multi-resolution property have been exploited to provide answers that progressively get better. In Lemire (2002) the relative prefix sum cube is transformed to support progressive answering, whereas in Wu et al. (2000) the data cube is directly transformed. Let us note that the type of queries supported in a typical OLAP system is quite limited. One exception is the work of Schmidt and Shahabi (2002), where general polynomial range-sum queries are supported using the transformation of the joint data distribution into the wavelet domain.

Ordering Wavelet Coefficients

The purpose of this section is to define ordering functions for wavelet datasets, which form the foundation of the proposed framework.

Ordering Functions and Significance

In this section, we order wavelet coefficients based on the notion of “significance.” Such an ordering can be utilized for either approximating or getting progressively better views of a dataset.

A Data Vector of size N will be denoted as d = (d_1, d_2, ..., d_N), whereas a Wavelet Vector of size N is w = (w_1, w_2, ..., w_N). These two vectors are associated with each other through the Discrete Wavelet Transform: DWT(d) = w ⇔ DWT⁻¹(w) = d.
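To make the relationship DWT(d) = w concrete, here is a minimal sketch of the orthonormal Haar transform, the simplest DWT; the function names are ours, and the article does not commit to a particular wavelet basis at this point.

```python
import math

def haar_dwt(d):
    """Orthonormal Haar DWT of a vector whose length is a power of two."""
    w, n = list(d), len(d)
    while n > 1:
        half, level = n // 2, w[:n]
        for i in range(half):
            w[i] = (level[2 * i] + level[2 * i + 1]) / math.sqrt(2)         # averages (low frequency)
            w[half + i] = (level[2 * i] - level[2 * i + 1]) / math.sqrt(2)  # details (high frequency)
        n = half
    return w

def haar_idwt(w):
    """Inverse transform: DWT^{-1}(w) = d."""
    d, n = list(w), 1
    while n < len(w):
        level = d[:2 * n]
        for i in range(n):
            d[2 * i] = (level[i] + level[n + i]) / math.sqrt(2)
            d[2 * i + 1] = (level[i] - level[n + i]) / math.sqrt(2)
        n *= 2
    return d
```

For d = (1, 2, 3, 4), haar_dwt yields w = (5, −2, −1/√2, −1/√2), and haar_idwt recovers d exactly; because the transform is orthonormal, it also preserves the Euclidean norm, a property used later in this article.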


Given a wavelet vector w = (w_1, w_2, ..., w_N) of size N, we can define a total order ≥ on the set of wavelet coefficients W = {w_1, w_2, ..., w_N}. The totally ordered set {w_σ1, w_σ2, ..., w_σN} has the property ∀σ_i ≤ σ_j, w_σi ≥ w_σj. The indexes σ_i define a permutation σ of the wavelet coefficients that can be expressed through the Mapping Vector m = (σ_1, σ_2, ..., σ_N).

A function f : W → R that obeys the property w_i ≥ w_j ⇔ f(w_i) ≥ f(w_j) is called an Ordering Function. This definition suggests that the sets (W, ≥) and (f(W), ≥) are order isomorphic; that is, they result in the same permutation of wavelet coefficients. With the use of an ordering function, the semantics of the ≥ operator and its effect are more clearly portrayed. Ordering functions assign a real value to each wavelet coefficient, which is called its Significance.

At this point we should note that for each total order ≥, there is a corresponding class of ordering functions. However, any representative of this class is sufficient to characterize the ordering of the wavelet coefficients.

Common Ordering Functions

In the database community, wavelets have been commonly used as a tool for approximating a dataset (Vitter et al., 1998; Vitter & Wang, 1999). Wavelets indeed possess some nice properties that serve the purpose of compressing the stored data. This process of compression involves dropping some “less” significant coefficients. Such coefficients carry little information, and thus setting them to zero should have little effect on the reconstruction of the dataset. This is the intuition used for approximating a dataset.

The common question that a database administrator may face is: Given a storage capacity that allows storing only B << N coefficients of the wavelet vector w, what is the best possible B-length subset of W? We shall answer this question in different ways, with respect to the semantics of the phrase best possible subset.

We can now restate two widely used methods using the notion of ordering functions.

FIRST-B: The most significant coefficients are the lowest frequency ones. The intuition of this approach is that high-frequency coefficients can be seen as noise and thus they should be dropped. One simple ordering function for this method could be f_FB(w_i) = −i, which leads to a linear compression rule. The First-B ordering function has the useful property that it depends only on the indices of the coefficients.

HIGHEST-B: The most significant coefficients are those that have the most energy. The energy of a coefficient w_i is defined as E(w_i) = w_i². The intuition comes from signal processing: the highest-powered coefficients are the most important ones for the reconstruction of the transformed signal, which leads to the common non-linear “hard thresholding” compression rule. One ordering function therefore is f_HB(w_i) = w_i².
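Restated in code, the two ordering functions and a top-B selector (the helper names are ours, not the article's) look as follows:

```python
def f_first_b(w, i):
    # First-B: significance depends only on the index; lower frequencies come first.
    return -i

def f_highest_b(w, i):
    # Highest-B: significance is the energy E(w_i) = w_i^2.
    return w[i] ** 2

def top_b(w, f, B):
    """Indices of the B most significant coefficients under ordering function f."""
    return sorted(range(len(w)), key=lambda i: f(w, i), reverse=True)[:B]
```

For w = (5, −2, 0.1, 3), Highest-B keeps indices {0, 3} when B = 2, while First-B keeps {0, 1}, illustrating how the two notions of significance diverge.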

Extension to Multidimensional Hypercubes

Up to now, we were constraining ourselves to the one-dimensional case, where the data is stored in a vector. In the general case the data is of a multidimensional nature and can be viewed as a multidimensional hypercube (Gray et al., 1996). Each cell in such an n-dimensional (hyper-)cube contains a single data value and is indexed by n numbers, each number acting as an index in a single dimension. Without loss of generality, each of the n dimensions of the cube d has S distinct values and is indexed by the integers 0, 1, ..., S − 1, so that the size of the cube is |d| = Sⁿ. The decomposition of the data cube into the wavelet domain is done by applying the one-dimensional DWT on the cube for one dimension, then applying DWT on the resulting cube for the next dimension, and continuing for all dimensions. We will use d̂ to denote the transformed data cube:

d̂ = DWT_n(DWT_{n−1}(··· DWT_1(d)))

Applying an ordering to a multidimensional cube is no longer a permutation on the domain of the indices; rather, it is a multidimensional index on the cube. The result of ordering a hypercube is essentially a one-dimensional structure, a vector. Each element of the mapping vector m is a set of n indices, instead of a single index as in the one-dimensional case. To summarize, the ordered vector together with the mapping vector is just a different way to describe the elements in the cube, with the advantage that a measure of significance is taken into consideration. One last issue that we would like to discuss is the definition of the First-B ordering of the previous section. It is not exactly clear what the First-B ordering in higher dimensions should be, but we clarify this while staying in accordance with our definition. We would like the lower frequency coefficients to be the most significant ones. Due to the way we have applied DWT on the cube, the lower frequency coefficients are placed along the lower sides of the cube. So, for an n-dimensional cube of size S at each dimension, we define a First-B ordering function for the element at cell (i_0, i_1, ..., i_{n−1}) as:

f_FB1(d_{i_0, i_1, ..., i_{n−1}}) = −[S · min(i_0, i_1, ..., i_{n−1}) + (i_0 + i_1 + ··· + i_{n−1})]

However, one could choose a slightly different ordering function that starts at the lowest-indexed corner and visits all cells in a diagonal fashion until it reaches the highest-indexed one:

f_FB2(d_{i_0, i_1, ..., i_{n−1}}) = −(i_0 + i_1 + ··· + i_{n−1})
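Under our reading of these two formulas, the cell-level ordering functions can be sketched as follows (the helper names are ours):

```python
from itertools import product

def f_fb1(cell, S):
    """First-B variant 1: hug the low-indexed sides of the cube, where the
    low-frequency coefficients lie after dimension-by-dimension DWT."""
    return -(S * min(cell) + sum(cell))

def f_fb2(cell, S):
    """First-B variant 2: sweep diagonally from the lowest-indexed corner."""
    return -sum(cell)

def fb_order(shape, f, S):
    """All cells of a hypercube of the given shape, most significant first."""
    cells = list(product(*(range(s) for s in shape)))
    return sorted(cells, key=lambda c: f(c, S), reverse=True)
```

In a 4×4 cube, f_fb1 ranks (0, 3) above (1, 1) because the former touches a low side of the cube, whereas f_fb2 ranks them the other way around, since (1, 1) lies on an earlier diagonal.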

Minimum Error Ordering in the Wavelet Domain (MEOW)

Formerly, we redefined the two most widely used ordering functions for wavelet vectors. These methods, however, can produce approximations that, in some cases, are very inaccurate, as illustrated in Garofalakis and Gibbons (2002). The authors show that for some queries the approximate answer can vary significantly compared to the actual answer, and they propose a probabilistic approach to dynamically select the most “significant” coefficients. In this section, we construct an ordering function termed MEOW based on these observations such that it fits into our general framework. We assume that a query workload has been provided in the form of a set of queries, and we try to find the ordering that minimizes some error metric when queries similar to the workload are submitted. We refer to the workload, that is, a set of queries, using the notation QS. This set contains an arbitrary number of point-queries and range-queries.

In real-world scenarios, such a workload can be captured by having the system run for a while, so that queries can be profiled, similar to a typical database system profiling for future optimization decisions. A query on the data cube d will be denoted by q. The answer to this query is D{q}. Answering this query with a lossless wavelet transformation of the data cube yields the result W_N{q} = D{q}, where N = |d| is the size of the cube. Answering the query with a B-approximation (B coefficients retained) of the wavelet transformation yields the result W_B{q}. We use the notation err(org, rec) to refer to the error introduced by approximating the original (org) value by the reconstructed approximate (rec) value. For example, the relative error metric would be:

err_rel(org, rec) = |org − rec| / |org|

MEOW assigns significance to a coefficient with respect to how badly a set of queries would be answered if this coefficient were dropped; in other words, the most significant coefficient is the one for which the highest aggregated error occurs across all queries. Towards this end, a list of candidate coefficients is initialized with all the wavelet coefficients. At each iteration the most significant coefficient is selected to be maintained and hence removed from the list of candidates, so that the next-best coefficient is always selected in a greedy manner. Note that removing the most significant coefficient from the candidate list is equivalent to maintaining that coefficient as part of the B-approximation. More formally:

f_MEOW(w_i) = agg_{∀q ∈ QS} ( err( D{q}, W_{N−M}∖{w_i} {q} ) )

where N = |d| is the size of the cube; W_{N−M} is the (N−M)-coefficient wavelet approximation of the data vector (the running candidate list), which is the result of selecting the M most significant coefficients seen so far; W_{N−M}∖{w_i} is the (N−M−1)-coefficient wavelet approximation, which is the result of additionally selecting the coefficient under consideration. The function agg is used for aggregating the error values occurring for the set of queries; any L_p norm can be used if the error values are considered as a vector.

Ordering Functions and Error Forecasting

We have seen that obtaining an ordering on the wavelet coefficients can be useful when we have to decide the best-B approximation of a dataset. Having an ordering also helps us answer a query in a progressive manner, from the most significant to the least significant coefficient. The reason behind this observation is the fact that the ordering function assigns a value to each coefficient, namely its significance. If the ordering function is related to some error metric that we wish to minimize, the significance can be further used for error forecasting. That is, we can foretell the value of an error metric by providing a strict upper bound to it. In this section, we shall see why such a claim is plausible.

HIGHEST-B: The ordering obtained for Highest-B minimizes the L₂ error between the original vector and its B-length approximation. Let w be a vector in the wavelet domain and w_B its highest-B approximation. Then the L₂ error is given by the summation of the power of the coefficients not included in the B-approximation:

‖w − w_B‖² = Σ_{i=B+1}^{|w|} |w_i|²


This implies that including the highest B coefficients in the approximation minimizes the L₂ error for the reconstruction of the original wavelet vector. In progressive query evaluation we can forecast an error metric as follows. Suppose we have just retrieved the k-th coefficient w_k, whose significance is |w_k|². The L₂ error is strictly less than the significance of this coefficient multiplied by the number of the remaining coefficients:

‖w − w_B‖² ≤ (|w| − k) · f_HB(w_k)

Notice that in this case the forecasted error is calculated on the fly. We simply need to keep track of the number of coefficients we have retrieved so far.
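The bound is easy to check numerically; the sketch below (our helper names) assumes coefficients are retrieved in decreasing order of energy, as Highest-B prescribes:

```python
def forecast_l2_bound(energies, k):
    """Upper bound on the remaining squared L2 error after retrieving the k-th
    coefficient: (remaining count) * (energy of the k-th coefficient).
    `energies` must be sorted in decreasing order (Highest-B retrieval)."""
    return (len(energies) - k) * energies[k - 1]

def actual_l2_error(energies, k):
    """True remaining squared L2 error: total energy of the coefficients
    not yet retrieved."""
    return sum(energies[k:])
```

With energies (25, 9, 4, 0.01) and k = 2, the forecast is 2 · 9 = 18 while the true residual is 4.01; the bound holds because every remaining energy is at most the k-th one.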

MEOW: The ordering function used in MEOW is directly associated with an error metric that provides guarantees on the error incurred in answering a set of queries. Thus we can immediately report the significance of the last coefficient retrieved as the error value that we forecast. There are several issues that arise from this method, the most important being the space issue of storing the significance of coefficients. For the time being, we shall ignore these issues and assume that for any coefficient we can also look up its significance. Consequently, we can say that if the submitted query q is included in the query set (q ∈ QS) and the L∞ norm is used for aggregation, then the error in answering this query is not more than the significance of the current coefficient w_k retrieved:

err(D{q}, W_k{q}) ≤ f_MEOW(w_k)

ANSWERING RANGE-SUM QUERIES WITH WAVELETS

In this section, we provide a definition for the range-sum query, as well as a general framework for answering queries in a progressive manner, where results get better as more data is retrieved from the database.

Range-Sum Queries and Evaluation Plans

Let us assume that the data we will deal with is stored in an n-dimensional hypercube d. We are interested in answering queries that are defined on an n-dimensional hyper-rectangle R. We define the general type of queries that we would like to be capable of answering.

Definition 1: A range-sum query Q(R, d) of range R on the cube d is the summation of the values of the cube that are contained in the range. Such a query can be expressed by a cube of the same size as the data cube that has the value of one in all cells within the range R and zero in all cells outside that range. We will call this cube the query cube q.

The answer to such a query is given by the summation of the values of the data cube d for each cell ξ contained in the range:

a = Σ_{ξ ∈ R} d(ξ)

We can rewrite this summation as the multiplication of a cell in the query cube q with a corresponding cell in the data cube d.


a = Σ_{ξ ∈ d} q(ξ) · d(ξ)    (1)

As shown in Schmidt and Shahabi (2002), Equation 1 is a powerful formalization that can realize not only SUM, COUNT, and AVG, but also any polynomial up to a specific degree on a combination of attributes (e.g., VARIANCE and COVARIANCE).
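In the two-dimensional case, Equation 1 amounts to building the 0/1 query cube and summing the elementwise product; a minimal sketch with our own helper name:

```python
def range_sum(data, lo, hi):
    """Answer a 2-D range-sum query Q(R, d): build the query cube q with ones
    inside the hyper-rectangle R = [lo, hi] and sum q(xi) * d(xi) over all cells."""
    rows, cols = len(data), len(data[0])
    q = [[1 if lo[0] <= i <= hi[0] and lo[1] <= j <= hi[1] else 0
          for j in range(cols)]
         for i in range(rows)]
    return sum(q[i][j] * data[i][j]
               for i in range(rows) for j in range(cols))
```

For example, range_sum([[1, 2], [3, 4]], (0, 0), (0, 1)) returns 3, the sum of the first row; replacing d with a cube of squared values would realize the second moment needed for VARIANCE.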

Let us provide a very useful lemmathat applies not only for the Discrete Wave-let Transform, but also for any transforma-tion that preserves the Euclidean norm.

Lemma 1: If a� is the DWT of a vector a

and �b is the DWT of a vector b then

� �[ ] [ ] [ ] [ ]i j

i i j j< , >=< , >⇔ ⋅ = ⋅∑ ∑a b a b a b a b� �

We borrowed the following theoremfrom Schmidt and Shahabi (2002) to justifythe use of wavelets in answering range-sum queries.

Theorem 1: The n-dimensional query cube q of size S in each dimension defined for a range R can be transformed into the wavelet domain in time O(2^n \log^n S), and the number of non-zero coefficients in \hat{q} is less than O(2^n \log^n S).

This theorem results in the following useful corollary.

Corollary 1: The answer to a range-sum query Q(R, d) is given by the following summation:

a = \sum_{\xi \in \hat{d}} \hat{q}(\xi) \cdot \hat{d}(\xi)

and can be computed in O(2^n \log^n S) retrievals from the database.

This corollary suggests that an exact answer can be given in O(2^n \log^n S) steps. We can exploit this observation to answer a query in a progressive manner, so that each step produces a more precise evaluation of the actual answer.
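The sparsity behind Theorem 1 can be seen in one dimension: the Haar transform of a range indicator has at most about 2 log2 S non-zero coefficients, because detail coefficients vanish wherever the indicator is locally constant. A hypothetical sketch (orthonormal Haar as the DWT):

```python
def haar(v):  # orthonormal 1-D Haar DWT (power-of-two length)
    v, out = list(v), []
    while len(v) > 1:
        avg = [(v[2 * i] + v[2 * i + 1]) / 2 ** 0.5 for i in range(len(v) // 2)]
        dif = [(v[2 * i] - v[2 * i + 1]) / 2 ** 0.5 for i in range(len(v) // 2)]
        out, v = dif + out, avg
    return v + out

S = 1024  # so log2(S) = 10 resolution levels
q = [1.0 if 100 <= i < 700 else 0.0 for i in range(S)]  # query cube for one range
nonzero = sum(1 for c in haar(q) if abs(c) > 1e-9)
# At most two detail coefficients per level (one per range boundary) plus the
# scaling coefficient can be non-zero: 2*log2(S) + 1 = 21 out of 1024 cells.
assert nonzero <= 2 * 10 + 1
```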

Definition 2: A progressive evaluation plan for a range-sum query

Q(R, d) = \sum_{\xi \in \hat{d}} \hat{q}(\xi) \cdot \hat{d}(\xi)

is an n-dimensional index σ of the space defined by the cube \hat{d}. The sum at the j-th progressive step

\tilde{a}_\sigma(j) \equiv \tilde{Q}_\sigma(R, d) = \sum_{\xi = 0}^{j} \hat{q}(\sigma(\xi)) \cdot \hat{d}(\sigma(\xi))

is called the approximate answer at iteration j.

It should be clear that if the multidimensional index visits only the non-zero query coefficients, then the approximate answer converges to the actual answer as j reaches O(2^n \log^n S). We would like to find an evaluation plan that causes the approximate answer to converge quickly to the actual value. Towards this end, we will use an error metric to measure the quality of the approximate answer.

Definition 3: Given a range-sum query, an optimal progressive evaluation plan σ_o is a plan that at each iteration produces an approximate answer ã that is closest to the actual answer a, based on some error metric, when compared to any other evaluation plan σ_i at the same iteration:

err(\tilde{a}_{\sigma_o}(j), a) \le err(\tilde{a}_{\sigma_i}(j), a), \quad \forall j, \forall i \neq o

One important observation is that the n-dimensional index σ of a progressive evaluation plan defines three single-dimensional vectors: the query vector, the data vector, and another vector that we call the answer vector. The answer vector is formed by the multiplication of a query coefficient with a corresponding data coefficient.

Definition 4: The multiplication of a query coefficient with a data coefficient yields an answer coefficient. These answer coefficients form the answer vector a:

a[i] = \hat{q}[i] \cdot \hat{d}[i]

Note that the answer vector is not a wavelet vector. In order to answer a query, we must sum across the coefficients of the answer vector.

Definition 5: Given the permutation σ of a progressive evaluation plan, we can redefine the approximate answer ã as the summation of a number of answer coefficients:

\tilde{a}(j) = \sum_{i=0}^{j} a[\sigma(i)]

Similarly, the actual answer is given by the following equation:

a = \sum_{i=0}^{|a|} a[i]

To summarize up to this point, we have defined the data cube, the query cube, and their wavelet transformations, and have seen that by choosing a multidimensional index, these cubes can be viewed as single-dimensional vectors. Finally, we have defined the answer vector for a given range-sum query. In the next section we will explore different progressive evaluation plans to answer a query. By choosing an ordering for one of the three vectors described above (the query vector, the data vector, and the answer vector) we can come up with different evaluation plans for answering a query. Furthermore, we have a choice of different ordering functions for the selected vector.
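The definitions above can be tied together in a small one-dimensional sketch (a hypothetical illustration with an orthonormal Haar DWT and made-up data): the answer vector is the element-wise product of the transformed query and data vectors, and any permutation σ of its indices yields a progressive plan whose partial sums converge to the exact range-sum.

```python
import random

def haar(v):  # orthonormal 1-D Haar DWT (power-of-two length)
    v, out = list(v), []
    while len(v) > 1:
        avg = [(v[2 * i] + v[2 * i + 1]) / 2 ** 0.5 for i in range(len(v) // 2)]
        dif = [(v[2 * i] - v[2 * i + 1]) / 2 ** 0.5 for i in range(len(v) // 2)]
        out, v = dif + out, avg
    return v + out

random.seed(1)
S = 16
d = [float(random.randint(0, 9)) for _ in range(S)]
q = [1.0 if 3 <= i <= 11 else 0.0 for i in range(S)]      # range R = [3, 11]
qh, dh = haar(q), haar(d)
ans_vec = [qh[i] * dh[i] for i in range(S)]               # answer vector (Definition 4)

sigma = sorted(range(S), key=lambda i: -abs(ans_vec[i]))  # one possible plan
running, exact = 0.0, sum(d[3:12])
for idx in sigma:             # running is the approximate answer at each iteration
    running += ans_vec[idx]
assert abs(running - exact) < 1e-9  # converges to the actual answer (Lemma 1)
```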

Ordering of the Answer Vector

In this section, we will prove that a progressive evaluation plan that is produced by an ordering of the answer vector is the optimal one, yet not practically applicable. Recall that we have defined optimality in terms of obtaining an approximate answer as close to the actual answer as possible. Thus, the objective of this section can be restated as follows: given an answer vector, what is the permutation of its indices that yields the closest-to-actual approximate answer?

The careful reader should immediately realize that we have already answered this question. If we use the L2 error norm to measure the quality of an approximate answer, then the ordering of choice has to be Highest-B. Please note that although we have defined Highest-B for a vector whose coefficients correspond to the wavelet basis, the property of this ordering applies to any vector defined over an orthonormal basis; the standard basis is certainly orthonormal.

The L2 error norm of an approximate answer is:


|a - \tilde{a}(j)|^2 = \sum_{i=j+1}^{|a|} (a[\sigma(i)])^2

Therefore we have proven that the Highest-B ordering of the answer vector results in an optimal progressive evaluation plan. However, there is one serious problem with this plan. We must first have the answer vector in order to sort it, which means we have to retrieve all data coefficients from the database. This defeats our main purpose of finding a good evaluation plan so that we can progressively retrieve data coefficients. Therefore, we can never achieve the optimal progressive evaluation plan, but we can try to get close to it, and henceforth we use the Highest-B ordering of the answer vector as our lower bound.
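The Highest-B optimality argument for the L2 metric can be checked by brute force on a made-up answer vector: keeping the B largest-magnitude answer coefficients minimizes the sum of squares of the dropped coefficients over all possible B-subsets.

```python
from itertools import combinations

ans_vec = [9.0, -0.5, 3.0, 0.25, -7.0, 0.1, 2.0, -0.05]  # made-up answer vector
n, B = len(ans_vec), 4

# Highest-B: keep the B coefficients with the largest magnitude.
keep = set(sorted(range(n), key=lambda i: -abs(ans_vec[i]))[:B])
err_hb = sum(ans_vec[i] ** 2 for i in range(n) if i not in keep)

# Exhaustively compare against every other choice of B coefficients.
best = min(sum(ans_vec[i] ** 2 for i in range(n) if i not in set(c))
           for c in combinations(range(n), B))
assert abs(err_hb - best) < 1e-12  # Highest-B attains the minimum L2 error
```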

Since the answer vector will not be available at query time, we can try to apply an ordering on either the query or the data cube. This is the topic of the following sections.

Ordering of the Query Cube

We will answer a range-sum query with an evaluation plan that is produced by ordering the query cube. Since we are not interested in the data cube, the evaluation plan will be generated at query time and will depend only on the submitted query. Below are the steps required to answer any range-sum query.

Preprocessing Steps (Off-line)

1. Transform the data cube d into the wavelet domain and store it in that form, \hat{d}.

Answering a Range-Sum Query (Online)

1. Construct the query cube and transform it into the wavelet domain:

\hat{q} \leftarrow DWT(q)

2. Apply an ordering function to the transformed query cube \hat{q} to obtain the mapping vector (m-vector):

m \leftarrow order(\hat{q})

3. Iterate over the m-vector, retrieve the data coefficient corresponding to the index stored in the m-vector, multiply it with the query coefficient, and add the result to the approximate answer:

a = \sum_{i \in m} \hat{q}[i] \cdot \hat{d}[i]
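The online steps can be sketched as follows, under illustrative assumptions (orthonormal Haar as the DWT, a Python list standing in for the stored transformed cube, and Highest-B as the ordering function):

```python
def haar(v):  # orthonormal 1-D Haar DWT (power-of-two length)
    v, out = list(v), []
    while len(v) > 1:
        avg = [(v[2 * i] + v[2 * i + 1]) / 2 ** 0.5 for i in range(len(v) // 2)]
        dif = [(v[2 * i] - v[2 * i + 1]) / 2 ** 0.5 for i in range(len(v) // 2)]
        out, v = dif + out, avg
    return v + out

S = 16
database = haar([float(i % 5) for i in range(S)])  # stored transformed data cube

q = [1.0 if 2 <= i <= 13 else 0.0 for i in range(S)]  # range R = [2, 13]
qh = haar(q)                                          # step 1: transform the query
# Step 2: m-vector = non-zero query coefficients, largest magnitude first.
m = sorted((i for i in range(S) if abs(qh[i]) > 1e-12), key=lambda i: -abs(qh[i]))
# Step 3: one database retrieval per iteration; ans is progressively refined.
ans = 0.0
for i in m:
    ans += qh[i] * database[i]
assert abs(ans - sum(i % 5 for i in range(2, 14))) < 1e-9  # exact after |m| steps
```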

It is important to note that the number of non-zero coefficients is O(2^n \log^n S) for an n-dimensional cube of size S in each dimension, which dramatically decreases the number of retrievals from the database and thus the cost of answering a range-sum query. This is the main reason why we use wavelets.

The ordering functions that we can use are First-B (Wu et al., 2000) and Highest-B. We cannot conclude which of the two ordering functions would produce a progressive evaluation plan that is closer to the optimal one for an arbitrary data vector. The Highest-B ordering technique suggested in Schmidt and Shahabi (2002) yields overall better results only when we average across all possible data vectors, or equivalently when the dataset is completely random. In Experimental Results we will see the results of using both methods on different datasets.

Ordering of the Data Cube

In this section, we will answer a range-sum query using an evaluation plan that is produced by applying an ordering function on the data cube. We will provide the steps required and discuss some difficulties that this technique has. First, let us present the algorithm to answer a query in a similar way to query ordering.

Preprocessing Steps (Off-line)

1. Transform the data cube d into the wavelet domain and store it in that form, \hat{d}.
2. Apply an ordering function on the transformed data cube to obtain the mapping vector:

m \leftarrow order(\hat{d})

Answering a Range-Sum Query (Online)

1. Construct the query cube and transform it into the wavelet domain:

\hat{q} \leftarrow DWT(q)

2. Retrieve from the database a value from the m-vector and retrieve the data coefficient corresponding to that value. Multiply it with the query coefficient and add the result to the approximate answer. Repeat until all values in the m-vector have been retrieved:

a = \sum_{i \in m} \hat{q}[i] \cdot \hat{d}[i]

There is a serious difference between this algorithm and the query ordering algorithm: the m-vector is stored in the database and cannot be kept in memory, since it is as big as the data cube. This means that we can only get one value of the m-vector at a time, together with the corresponding data coefficient. Besides the fact that we are actually retrieving two values at each iteration, the most important drawback is that the approximate answer converges to the actual answer much more slowly, as it needs all data coefficients to be retrieved, thus time of O(S^n).

This problem has been addressed in Vitter et al. (1998) and Vitter and Wang (1999) with the Compact Data Cube approach. The main idea is to compress the data by keeping only a sufficient number of coefficients based on the ordering method used. The resulting cube is so small that it can fit in main memory, and queries can be answered quickly. This, however, has the major drawback that the database can support only approximate answers.

Last but not least, if the significance of a coefficient is stored in the database, then it can be used to provide an error estimation as discussed earlier. The data cube coefficients are partitioned into disjoint sets (levels) of decreasing significance, with one significance value assigned to each level (the smallest significance across all coefficients of the same level). When the last coefficient of a level is retrieved, the significance of that level is also retrieved and used to estimate the associated error metric. When our framework is used for progressive answering, the error metric provides an estimation of how good the answer is. Furthermore, the level error metrics can be used to calculate an estimation of the time our methodology needs to reach a certain level of accepted inaccuracy for a submitted query, before actually retrieving data from the database.
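A sketch of such a level-based error forecast, under illustrative assumptions (made-up coefficients and three hand-chosen levels): since the levels are ordered by decreasing significance, once level k is exhausted every remaining data coefficient is bounded in magnitude by that level's significance, which in turn bounds the outstanding error.

```python
# Made-up transformed data and query coefficients.
d_hat = [12.0, -7.5, 3.1, 2.9, -0.8, 0.6, 0.2, -0.1]
q_hat = [1.0, -0.5, 0.7, 0.3, 0.2, -0.4, 0.1, 0.6]

order = sorted(range(len(d_hat)), key=lambda i: -abs(d_hat[i]))
levels = [order[:2], order[2:4], order[4:]]  # disjoint sets, decreasing significance
sig = [min(abs(d_hat[i]) for i in lvl) for lvl in levels]  # one value per level

exact = sum(qc * dc for qc, dc in zip(q_hat, d_hat))
partial, bound = 0.0, 0.0
for k, lvl in enumerate(levels[:-1]):
    partial += sum(q_hat[i] * d_hat[i] for i in lvl)  # finish level k
    remaining = [i for l in levels[k + 1:] for i in l]
    # Every remaining |d_hat[i]| <= sig[k], so the outstanding error is bounded:
    bound = sig[k] * sum(abs(q_hat[i]) for i in remaining)
    assert abs(exact - partial) <= bound + 1e-9       # the forecast holds
```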

Hybrid Ordering

We have seen that the best progressive evaluation plan results from ordering the answer vector with the Highest-B ordering function. This, however, is not feasible, since it requires that all the relevant data coefficients are retrieved, which comes in direct contrast with the notion of progressiveness. We have also seen how we can answer a query by applying an ordering to either the query or the data cube. The performance of these methods depends largely on the dataset and its compressibility. In this section, we will try to take this dependence out of the loop and provide a method that performs well in all cases.

With this observation in mind, it should be clear that we wish to emulate the way the best evaluation plan is created, which is ordering the answer vector. To create the answer vector, we need the query cube, which is available at query time, and the data cube, which is stored in the database. As noted earlier, we cannot have both of these cubes in memory to do the coefficient-by-coefficient multiplication and then order the resulting vector. However, we can use a very compact representation of the data cube, construct an approximate answer vector, and come up with an ordering that can be used to retrieve coefficients from the database in a near-optimal order. This is the intuition behind our hybrid ordering method.

Hybrid Ordering Algorithms

For our hybrid method we need a very compact, yet close, representation of the data cube transformed into the wavelet domain. We will call this cube the approximate transformed data cube \tilde{d}. This cube must be small enough so that it can be fetched from the database in only a few retrievals. We will come back shortly to discuss the construction of such a cube; for the time being, this is all we need to know for describing the algorithm of our hybrid method.

Preprocessing Steps (Off-line)

1. Transform the data cube d into the wavelet domain and store it in that form, \hat{d}.
2. Create a compact representation \tilde{d} of the data cube: \tilde{d} \cong \hat{d}.

Answering a Range-Sum Query (Online)

1. Construct the query cube and transform it into the wavelet domain:

\hat{q} \leftarrow DWT(q)

2. Retrieve the necessary data from the database to construct the approximate transformed data cube \tilde{d}.
3. Create the approximate answer cube \tilde{a} by multiplying, element by element, the query cube and the approximate data cube:

\tilde{a}[i] = \hat{q}[i] \cdot \tilde{d}[i]

4. Apply the Highest-B ordering function to the approximate answer cube \tilde{a} to obtain the mapping vector m.
5. Iterate over the m-vector, retrieve the data coefficient corresponding to the index stored in the m-vector, multiply it with the query coefficient, and add the result to the approximate answer:


a = \sum_{i \in m} \hat{q}[i] \cdot \hat{d}[i]
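A minimal end-to-end sketch of these hybrid steps, with hypothetical choices throughout (orthonormal Haar as the DWT, and a crude keep-the-largest-B compression standing in for the approximate cube \tilde{d}):

```python
def haar(v):  # orthonormal 1-D Haar DWT (power-of-two length)
    v, out = list(v), []
    while len(v) > 1:
        avg = [(v[2 * i] + v[2 * i + 1]) / 2 ** 0.5 for i in range(len(v) // 2)]
        dif = [(v[2 * i] - v[2 * i + 1]) / 2 ** 0.5 for i in range(len(v) // 2)]
        out, v = dif + out, avg
    return v + out

S, B = 16, 4
d_hat = haar([float((i % 4) * 3) for i in range(S)])  # stored transformed data cube
keep = set(sorted(range(S), key=lambda i: -abs(d_hat[i]))[:B])
d_tilde = [d_hat[i] if i in keep else 0.0 for i in range(S)]  # compact approximation

q_hat = haar([1.0 if 5 <= i <= 12 else 0.0 for i in range(S)])  # step 1
approx_ans_vec = [q_hat[i] * d_tilde[i] for i in range(S)]      # step 3
# Step 4: Highest-B order of the approximate answer cube over non-zero q_hat.
m = sorted((i for i in range(S) if abs(q_hat[i]) > 1e-12),
           key=lambda i: -abs(approx_ans_vec[i]))
# Step 5: retrieve the true coefficients in that order.
ans = 0.0
for i in m:
    ans += q_hat[i] * d_hat[i]
assert abs(ans - sum((i % 4) * 3 for i in range(5, 13))) < 1e-9
```

The plan still converges exactly, because every non-zero query coefficient is eventually visited; the approximate cube only influences the order of the retrievals.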

There is an online overhead associated with the additional retrieval cost for constructing the approximate transformed data cube, since data has to be retrieved from the database. This cost, however, can be amortized over a batch of queries. In addition, this approximate data cube can be used to provide an approximation of the actual answer. The Hybrid* Ordering algorithm takes advantage of this observation and consequently converges faster to the actual answer.

The initial value of ans is no longer zero, but is equal to an approximation of the answer constructed from the approximate data cube, \sum_i \tilde{d}[i] \cdot \hat{q}[i]. Later, each time we retrieve an actual data coefficient \hat{d}[i] from the database, the value in ans is corrected by adding the error of the approximation, \hat{d}[i] \cdot \hat{q}[i] - \tilde{d}[i] \cdot \hat{q}[i].

Assuming that the approximate data cube is memory resident (which is the case for batched queries, for example), the Hybrid* Algorithm can give an approximate answer without any retrievals from the database, while correcting the initial approximation with each retrieval. In fact, Hybrid and Hybrid* Orderings, as we demonstrate in the Experimental Results section, outperform other orderings for scientific datasets, since they approximate the optimal ordering by exploiting both query and data information.
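Hybrid*'s correction step can be sketched as follows (again with a hypothetical Haar DWT, a thresholded approximation, and made-up data): ans starts from the in-memory approximate cube, and each retrieval adds the correction (\hat{d}[i] - \tilde{d}[i]) \cdot \hat{q}[i].

```python
def haar(v):  # orthonormal 1-D Haar DWT (power-of-two length)
    v, out = list(v), []
    while len(v) > 1:
        avg = [(v[2 * i] + v[2 * i + 1]) / 2 ** 0.5 for i in range(len(v) // 2)]
        dif = [(v[2 * i] - v[2 * i + 1]) / 2 ** 0.5 for i in range(len(v) // 2)]
        out, v = dif + out, avg
    return v + out

d = [5.0, 5.0, 5.0, 9.0, 1.0, 1.0, 1.0, 1.0]
d_hat = haar(d)
d_tilde = [c if abs(c) >= 2.0 else 0.0 for c in d_hat]  # crude compact approximation
q_hat = haar([1.0 if 1 <= i <= 6 else 0.0 for i in range(8)])

# Zero-retrieval initial answer from the in-memory approximate cube.
ans = sum(dt * qh for dt, qh in zip(d_tilde, q_hat))
for i in range(8):
    if abs(q_hat[i]) > 1e-12:                      # one retrieval of d_hat[i]
        ans += (d_hat[i] - d_tilde[i]) * q_hat[i]  # correct the approximation
assert abs(ans - sum(d[1:7])) < 1e-9
```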

Approximating the Transformed Data Cube

In this section, we will construct a different approximate transformed data cube to be used by our Hybrid* Algorithm; we defer the performance comparison to the next section. The purpose of such a cube is not to provide approximate answers, but rather to accurately capture the trend of the transformed data cube, so that we can use its approximation to produce an evaluation plan close to optimal. The method of constructing an approximate data cube must also be able to easily adapt to changes in the desired accuracy.

Equi-Sized Partitioning of the Untransformed Data Cube (EPU): The untransformed data cube is partitioned into k^n equi-sized partitions, and the average of each partition is selected as a representative. Each partition in EPU consists of the same value, which is the average, so the total number of distinct values stored is only k^n. The transformation of this cube into the wavelet domain is also compact, since there are at most k^n \log^n S non-zero coefficients in the approximate transformed data cube resulting from this partitioning scheme.

Equi-Sized Partitioning of the Transformed Data Cube (EPT): With EPT, the transformed data cube is partitioned into k^n equi-sized partitions, and the average of each partition is selected as a representative. Again, each partition in EPT consists of the same value, which is the average, so the total number of distinct values stored is only k^n.

Resolution Representatives (RR): This method differs from EPT in that the representatives are selected per resolution level, rather than per partition. For a one-dimensional vector, there are \log S resolution levels, and in the n-dimensional case, there are \log^n S resolution levels. Thus, picking one representative per resolution level yields \log^n S distinct values. In the case where a less compact approximate data cube is desired, some representatives can be dropped; in the case where more accuracy is required, more representatives per resolution level can be kept.
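A one-dimensional sketch of picking representatives per resolution level, under illustrative assumptions (an orthonormal Haar layout with the scaling coefficient at index 0 and level l occupying indices [2^l, 2^(l+1)), and the level average as the hypothetical representative):

```python
def haar(v):  # orthonormal 1-D Haar DWT (power-of-two length)
    v, out = list(v), []
    while len(v) > 1:
        avg = [(v[2 * i] + v[2 * i + 1]) / 2 ** 0.5 for i in range(len(v) // 2)]
        dif = [(v[2 * i] - v[2 * i + 1]) / 2 ** 0.5 for i in range(len(v) // 2)]
        out, v = dif + out, avg
    return v + out

S = 16  # log2(S) = 4 resolution levels, plus the scaling coefficient
d_hat = haar([float(i % 7) for i in range(S)])
levels = [[0]] + [list(range(2 ** l, 2 ** (l + 1))) for l in range(4)]
reps = [sum(d_hat[i] for i in lvl) / len(lvl) for lvl in levels]  # one per level
d_tilde = [0.0] * S
for lvl, r in zip(levels, reps):
    for i in lvl:
        d_tilde[i] = r      # every coefficient replaced by its level representative
assert len(set(d_tilde)) <= len(levels)  # only ~log2(S) distinct stored values
```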

Wavelet Approximation of the Transformed Data Cube: The Wavelet Decomposition is used to approximate the data cube. Given a desired storage of B coefficients, either Highest-B (C-HB), any of the First-B ordering functions f_{FB1}, f_{FB2} (C-FB), or MEOW can be used to compress the data cube. The result consists of B distinct values.

EXPERIMENTAL RESULTS

First, we describe our experimental setup. Subsequently, we compare different evaluation plans and draw our main conclusions. Later, we compare and select the best approximation of the data cube. Finally, we investigate how larger approximate data cubes affect the performance of our Hybrid* algorithm.

Experimental Setup

We report results from experiments on three datasets. TEMPERATURE and PRECIPITATION (Widmann & Bretherton, 1949) are real-world datasets, and RANDOM is a synthetic dataset. RANDOM, PRECIPITATION, and TEMPERATURE are incompressible, semi-compressible, and fully compressible datasets, respectively. Here, we use the term compressible to describe a dataset that has a compact, yet accurate, wavelet approximation.

TEMPERATURE is a real-world dataset that measures the temperatures at points all over the globe at different altitudes for 18 months, sampled twice every day. We construct a four-dimensional cube with latitude, longitude, altitude, and time as dimension attributes, and temperature as the measure attribute. The corresponding sizes of the domains of these dimensions are 64, 128, 16, and 1024, respectively. This results in a dense data cube of more than 134 million cells.

RANDOM is a synthetic dataset of the same size as TEMPERATURE, where each cell was assigned a random number between 0 and 100. The result is a completely random dataset, which cannot be approximated.

PRECIPITATION is a real-life dataset that measures the daily precipitation for the Pacific Northwest for 45 years. We build a three-dimensional cube with latitude, longitude, and time as dimension attributes, and precipitation as the measure attribute. The corresponding sizes of these dimensions are 8, 8, and 16,384, respectively. This makes a data cube of one million cells.

The data cubes are decomposed into the wavelet domain using the multidimensional wavelet transformation. We generated random range-sum queries with a uniform distribution. We measure the Mean Relative Error across all queries to compare different techniques in the following sections.

Comparison of Evaluation Plans

Figure 1 compares the performance of five evaluation plans. Please note that all graphs display the progressive accuracy of various techniques for queries on the TEMPERATURE, RANDOM, and PRECIPITATION datasets. The horizontal axis always displays the number of retrieved values, while the vertical axis of each graph displays the mean relative error for the set of generated queries. Let us note that the online overhead of Hybrid* Ordering is not shown in the figures, as it applies only to the first query submitted. Because of its small size, the approximate data cube can be maintained in memory for all subsequent queries. The main observations are as follows.

Answer Ordering presents the lower bound for all evaluation plans, as it is an optimal, yet impractical, evaluation plan. Answer ordering is used with the Highest-B ordering function, which leads to an optimal plan, as discussed previously. (Hybrid* Ordering can outperform Answer Ordering at the beginning because it has precomputed an approximate answer.)

Data Ordering with the Highest-B ordering function performs slightly better than Query Ordering at the beginning, but only for a compressible dataset like TEMPERATURE, and it can never answer a query with 100% accuracy using the same number of retrievals as the other methods. (Recall that the total number of required retrievals to answer any query is O(S^n) for Data Ordering and O(2^n \log^n S) for all other techniques.) The use of the Highest-B ordering function is justified by the well-known fact that without query knowledge, Highest-B yields the lowest Lp error norm. When dealing with the RANDOM or PRECIPITATION datasets, Data Ordering can never provide a good evaluation plan. Although we believe MEOW data ordering introduces improvement for progressive answering, we have not included MEOW ordering in the experiments with data ordering due to its large preprocessing cost. However, in the next section we use MEOW as a data cube approximation technique.

[Figure 1. Comparison of evaluation plans: (a) TEMPERATURE dataset; (b) RANDOM dataset; (c) PRECIPITATION dataset]

Query Ordering with the Highest-B ordering function performs well with all datasets. Such an evaluation plan is the ideal when dealing with datasets that are badly compressible, but it is inferior to Hybrid* for compressible datasets. We also see that Query Ordering with First-B, although trivial, can still perform well, especially with the PRECIPITATION dataset.

Hybrid* Ordering is the closest to the optimal evaluation plan for a compressible dataset. In addition, this method has the advantage that it can immediately provide an answer with a reasonable error for a compressible dataset. However, Hybrid* does not introduce significant improvements over Query Ordering for an incompressible dataset. At this point, let us note that the approximate data cube was created using the Resolution Representatives (RR) method; the reason behind this is explained in the following section. The size of the approximate data cube is as small as 0.002% of the data cube size, to emphasize both the ability to maintain such a structure in memory and the negligible retrieval cost. Later we will allow larger approximate data cubes. Let us indicate that Hybrid* Ordering subsumes and surpasses Hybrid Ordering; therefore, we do not report Hybrid Ordering in our experiments.

To conclude, Hybrid* Ordering is recommended for any real-life dataset that is compressible. Query Ordering, on the other hand, can be the smartest choice for a completely incompressible dataset.

Choosing an Approximate Data Cube

In this section, we measure the performance of our Hybrid* Algorithm using our proposed techniques for creating approximate data cubes. Our purpose is to find the superior approximation of the dataset to use with Hybrid*.

[Figure 2. Hybrid* ordering: (a) TEMPERATURE dataset; (b) RANDOM dataset; (c) PRECIPITATION dataset]

Figures 2a and 2c suggest that Resolution Representatives performs better overall. However, when retrieving only a very small percentage of the data, the Wavelet Approximation techniques outperform Resolution Representatives. In particular, Highest-B ordering outperforms First-B ordering. As a rule of thumb, if the user-level application mostly needs extremely fast but approximate results, Wavelet Approximation should be used for creating an approximate data cube. On the other hand, if the user-level application can tolerate longer answering periods, Resolution Representatives can provide more accurate answers.

We used MEOW with different Lp norms as the aggregation function, among which the L1 norm (sum of errors) performs best. As illustrated in Figure 2a, MEOW with the L1 norm performs badly at the beginning, but soon catches up with the other Wavelet Approximation techniques. Figure 2c clearly shows the poor performance of EPT and MEOW for a semi-compressible dataset.

When dealing with a completely random dataset, Figure 2b implies that the previous observations still hold. Resolution Representatives still produces overall better results. Among the Wavelet Approximation techniques, the First-B ordering methods consistently outperform Highest-B. The reason behind this lies in the fact that RANDOM is not compressible (all wavelet coefficients are equally "significant"). The previous observations also hold for the PRECIPITATION dataset (see Figure 2c).

The most important conclusion one can draw from these figures is that Resolution Representatives is the best-performing technique for constructing approximate data cubes. Hybrid* ordering evaluation plans that utilize Resolution Representatives produce good overall results, even in the worst case of a very badly compressible dataset.

Performance of Larger Approximate Data Cubes

In this section, we investigate the performance improvement one can expect by using Hybrid* Ordering with a larger approximate data cube. Recall that in our experiments we used an approximate data cube that is 50,000 times smaller than the actual cube (0.002% of the cube). We created the approximate data cubes using the Resolution Representatives method, as previously recommended.

Figure 3a illustrates that for a compressible dataset, the more coefficients we keep, the better the performance of the system will be. In contrast, Figure 3b shows that keeping more coefficients for an incompressible dataset does not result in any improvement. This is because such a highly random dataset cannot have a good approximation, even when a lot of space is utilized. Figure 3c also shows a slight improvement with larger approximate cubes.

In sum, we suggest that the size of the approximate data cube is a factor that should be selected with respect to the type of the dataset. For compressible datasets, Hybrid* Ordering is more efficient with larger approximations.

CONCLUSION AND FUTURE WORK

We introduced a general framework for providing fast and progressive answers to range-aggregate queries by utilizing the wavelet transformation. We conducted extensive comparisons among different ordering schemes and prescribed solutions based on factors such as the compressibility of the dataset, the desired accuracy, and the response time. In sum, our proposed Hybrid* algorithm results in the minimal retrieval cost for real-world datasets with (even slight) inter-correlation. In addition, its utilization of a small approximate data cube as the initial step generates a good and quick approximate answer with minimum overhead. Another main contribution of this article is the observation that data-only ordering of wavelets results in the worst retrieval performance, unlike the case of the traditional usage of wavelets in signal compression applications. Our current and future work is oriented towards investigating other factors, such as the type and frequency of queries, and the amount of available additional storage.

REFERENCES

Bradley, P.S., Fayyad, U.M., & Reina, C. (1998). Scaling clustering algorithms to large databases. Knowledge Discovery and Data Mining, 9-15.

Chan, C.-Y. & Ioannidis, Y.E. (1999). Hierarchical cubes for range-sum queries. Proceedings of the 25th International Conference on Very Large Data Bases (pp. 675-686).

Garofalakis, M. & Gibbons, P.B. (2002). Wavelet synopses with error guarantees. Proceedings of SIGMOD 2002. ACM Press.

Geffner, S., Agrawal, D., Abbadi, A.E., & Smith, T. (1999). Relative prefix sums: An efficient approach for querying dynamic OLAP data cubes. Proceedings of the 15th International Conference on Data Engineering (pp. 328-335). IEEE Computer Society.

[Figure 3. Performance of larger approximate data cubes: (a) TEMPERATURE dataset; (b) RANDOM dataset; (c) PRECIPITATION dataset]


Gibbons, P.B. & Matias, Y. (1998, June 2-4). New sampling-based summary sta-tistics for improving approximate queryanswers. Proceedings of the ACMSIGMOD International Conferenceon Management of Data (pp. 331-342),Seattle, Washington, USA. ACM Press.

Gilbert, A.C., Kotidis, Y., Muthukrishnan,S., & Strauss, M.J. (2001). Optimal andapproximate computation of summarystatistics for range aggregates. Pro-ceedings of the 20th ACM SIGACT-SIGMOD-SIGART Symposium onPrinciples of Database Systems (pp.228-237).

Grass, J. & Zilberstein, S. (1995). Anytime algorithm development tools. Technical Report No. UM-CS-1995-094.

Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1996). Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. Proceedings of the 12th International Conference on Data Engineering (pp. 152-159).

Gunopulos, D., Kollios, G., Tsotras, V.J., & Domeniconi, C. (2000). Approximating multi-dimensional aggregate range queries over real attributes. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 463-474).

Haas, P.J. & Swami, A.N. (1995, March 6-10). Sampling-based selectivity estimation for joins using augmented frequent value statistics. Proceedings of the 11th International Conference on Data Engineering (pp. 522-531), Taipei, Taiwan. IEEE Computer Society.

Hellerstein, J.M., Haas, P.J., & Wang, H. (1997). Online aggregation. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 171-182). ACM Press.

Ho, C., Agrawal, R., Megiddo, N., & Srikant, R. (1997). Range queries in OLAP data cubes. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 73-88). ACM Press.

Lazaridis, I. & Mehrotra, S. (2001). Progressive approximate aggregate queries with a multi-resolution tree structure. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 401-412).

Lemire, D. (2002, October). Wavelet-based relative prefix sum methods for range sum queries in data cubes. Proceedings of CASCON 2002.

Poosala, V. & Ganti, V. (1999). Fast approximate answers to aggregate queries on a data cube. Proceedings of the 11th International Conference on Scientific and Statistical Database Management (pp. 24-33). IEEE Computer Society.

Riedewald, M., Agrawal, D., & Abbadi, A.E. (2000). pCube: Update-efficient online aggregation with progressive feedback. Proceedings of the 12th International Conference on Scientific and Statistical Database Management (pp. 95-108).

Schmidt, R. & Shahabi, C. (2002). ProPolyne: A fast wavelet-based technique for progressive evaluation of polynomial range-sum queries. Proceedings of the Conference on Extending Database Technology (EDBT'02). Berlin: Springer-Verlag (LNCS).

Vitter, J.S. & Wang, M. (1999). Approximate computation of multidimensional aggregates of sparse data using wavelets. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 193-204). ACM Press.




Vitter, J.S., Wang, M., & Iyer, B.R. (1998). Data cube approximation and histograms via wavelets. Proceedings of the 7th International Conference on Information and Knowledge Management (pp. 96-104). ACM.

Widmann, M. & Bretherton, C. 50 km resolution daily precipitation for the Pacific Northwest, 1949-1994.

Wu, Y.-L., Agrawal, D., & Abbadi, A.E. (2000). Using wavelet decomposition to support progressive and approximate range-sum queries over data cubes. Proceedings of the 9th International Conference on Information and Knowledge Management (pp. 414-421). ACM.

Zhao, Y., Deshpande, P.M., & Naughton, J.F. (1997). An array-based algorithm for simultaneous multidimensional aggregates. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 159-170).

Cyrus Shahabi is currently an associate professor and director of the Information Laboratory with the Computer Science Department and is also a research area director at the NSF's Integrated Media Systems Center at the University of Southern California. He earned his PhD in computer science from the University of Southern California (August 1996). He has two books and more than 100 articles, book chapters, and conference papers in the areas of databases and multimedia. Dr. Shahabi is the recipient of the 2002 National Science Foundation CAREER Award and the 2003 Presidential Early Career Award for Scientists and Engineers (PECASE).

Dimitris Sacharidis was born in Athens, Greece. He earned a BS in electrical and computer engineering from the National Technical University of Athens (2001). In 2004, he earned an MS in computer science from the University of Southern California. Currently, he is a PhD student at the National Technical University of Athens, working on data stream analysis.

Mehrdad Jahangiri was born in Tehran, Iran. He earned a BS in civil engineering from the University of Tehran (1999). He is currently working toward a PhD in computer science at the University of Southern California. He is also a research assistant working on multidimensional data analysis at the Integrated Media Systems Center (IMSC) - Information Laboratory (InfoLAB) in the Computer Science Department of the University of Southern California.