
Approximate Answers to OLAP Queries on Streaming Data Warehouses

Michel De Rougemont
Univ. Paris II & LIAFA-CNRS
Paris, France
[email protected]

Phuong Thao Cao
LRI, Univ. Paris-Sud
Gif-sur-Yvette, France
[email protected]

ABSTRACT

We study streaming data for a data warehouse, which combines different sources. We consider the relative answers to OLAP queries on a schema, as distributions with the L1 distance, and approximate the answers without storing the entire data warehouse. We first study how to sample each source and combine the samples to approximate any OLAP query. We then consider a streaming context, where a data warehouse is built by streams of different sources. We first show a lower bound on the size of the memory necessary to approximate queries, and then consider a statistical hypothesis where some attributes determine fixed distributions of the measure. We use the sampling methods to learn the statistical model and approximate OLAP queries. In this case, we approximate OLAP queries with a finite memory. We apply the method to a dataset which simulates the data of sensors, which provide weather parameters over time and locations from different sources.

Categories and Subject Descriptors

H [Information Systems]; H.2.4 [Database Management]: Systems—Query Processing

Keywords

Approximation, sampling algorithm, OLAP, approximate query answering, data exchange, streaming data

1. INTRODUCTION

OLAP (OnLine Analytical Processing) is a fundamental tool for the analysis of large data warehouses. An OLAP schema captures the functional dependencies between groups of attributes of a relational schema and defines the possible dimensions of analysis and specific measures. An OLAP query fixes some of the dimensions, and the result of a query is the aggregation of a measure for the different values of the dimensions. In this paper we view an answer to an OLAP query as a distribution, which is best understood by a pie-chart which gives relative weights to each value.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DOLAP'12, November 2, 2012, Maui, Hawaii, USA. Copyright 2012 ACM 978-1-4503-1721-4/12/11 ...$15.00.

An important subject concerns the approximation of such queries [1, 6, 7, 12, 13, 17], in particular with sampling techniques. This subject has become important in the context of streaming data [4, 15, 19]. In our approach, we consider the L1 distance between distributions: two answers are ε-close if the L1 distance between the two distributions is less than ε. We then study sampling strategies which associate with any data warehouse I a random subset of tuples Î such that, with high probability, an answer of any query on Î will be close to the answer on I. The number of tuples of Î is independent of the size of I and only depends on the approximation parameters. We consider the uniform distribution and a measure-based distribution also used in [3] in a different setting. Under our hypothesis, OLAP queries can be approximated for both distributions by replacing the large data warehouse with the small Î. We then study a typical data exchange situation where the data warehouse is built from several sources, and show how to sample each source independently. This problem has been analyzed in [5] in the general case. We study a specific natural case when some attributes determine fixed distributions of the measure: this statistical model generalizes the functional dependencies and makes the analysis simpler.

The central part of the paper considers a data warehouse built from streaming data, coming from several sources. We first show an Ω(N) lower bound on the space needed to approximate some simple OLAP queries on a stream of length N. We then decompose a stream in large intervals, and use the previous results on sampling on each interval, i.e. we can sample each source independently and only keep a random subset Î. In this case, the size of Î will grow, but at a much smaller rate than I. In the important case where the data warehouse follows a simple statistical model, we can learn the model from the samples using a finite memory and approximate OLAP queries. The main results are:

• If I is a target data warehouse built from sources S1, S2, ..., Sk, identified by a unique attribute value a ∈ A which follows a fixed distribution δ, we can sample each source independently and the union of the samples will guarantee the same approximation to OLAP queries, Theorem 1.

• The approximation of OLAP queries requires a memory Ω(N) in the worst case, i.e. proportional to the length of the stream. If the stream of length N is decomposed into blocks of size n, then we can approximate queries within a factor ε·N/n with space N·m/n, Theorem 3.

• If the data warehouse follows a specific statistical model, where some attributes determine the distribution of the measure, we can learn the model from samples, and answer OLAP queries with a finite memory, Theorem 4, which can be extended in a data exchange setting, Theorem 5.

In section 2, we review the basic notions on OLAP queries, approximation, data exchange and streaming data. In section 3, we show that approximate answers can be obtained on samples, in a data exchange context. In section 4, we consider a streaming data warehouse, prove a lower bound and adapt the sampling algorithm to large blocks of the stream. We then study the special case when the data warehouse follows a specific statistical model. In section 5, we describe our implementation on sensor data.

2. PRELIMINARIES

We first describe the basic notations for OLAP queries and the distance between answers. We then introduce the notions of approximation used in this paper, and apply them in the data exchange context and for streaming data.

2.1 OLAP Schemas and Queries

We follow the functional model associated with an OLAP schema [18], i.e. the OLAP or star schema is a tree where each node is a set of attributes, the root is the set of all the attributes of the data warehouse relation, and an edge exists if there is a functional dependency between the attributes of the origin node and the attributes of the extremity node. The measure is a specific node at depth 1 from the root. An OLAP query for a schema S is determined by: a filter condition, a measure, the selection of dimensions or classifiers C1, ..., Cp, where each Ci is a node of the schema S, and an aggregation operator (COUNT, SUM, AVG, ...).

A filter is a condition which selects a subset of the tuples of the data warehouse, and we assume for simplicity that SUM is the aggregation operator. The answer to an OLAP query is a multidimensional array, along the dimensions C1, ..., Cp and the measure M. Each tuple (c1, ..., cp, mi) of the answer is such that

mi = (Σ_{t : t.C1=c1, ..., t.Cp=cp} t.M) / (Σ_{t∈I} t.M)

We consider relative measures as answers to OLAP queries and write Q^I_C for the density vector of the answer to Q on dimension C and on data warehouse I. Each component of Q^I_C is written Q^I_{C=c} or Q^I_C[c] and is the relative density for the dimension C = c.

We suppose that the number N of tuples of the data warehouse is always large, and that the number of tuples after a selection is always large. Sampling will not give good approximations if this is not the case.

Example 1. Consider the simple OLAP schema of Figure 1, where the functional dependencies follow the edges up. We suppose two relations: the data warehouse DW(recordID, sensorID, date, month, year, sunlight, rain), which stores every day the measures sunlight and rain, in hours in the interval [1, 10], for all sensors, and an auxiliary table C(sensorID, manuf, city, country).

A typical OLAP query may select country as a dimension, asking for the relative sunlight of the countries, and

Figure 1: An OLAP schema (nodes: RecordID, SensorID, Date, Month, Year, City, Country, Manufacturer, Rain, Sunlight).

an absolute answer would be as in Figure 2(a). A relative answer would be: (France, 0.65), (Germany, 0.2), (U.K., 0.15), as 0.65 = 450022/689960. We consider the answer to any OLAP query as the vector of relative values. In our example, Q^I_country = (0.65, 0.2, 0.15), as in Figure 2(b).

The distance between two relative answers to an OLAP query is the L1 distance between relative densities. Therefore, the values of the distance are in [0, 2]. For example, the distance between the (0.65, 0.2, 0.15) distribution over (France, Germany, U.K.) and the (0.6, 0.4) distribution over (France, Germany) is (0.05 + 0.2 + 0.15) = 0.4.
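As an illustration (not part of the paper's implementation), the L1 distance between two relative answers can be sketched as follows, assuming answers are represented as dicts mapping dimension values to relative densities; the function name is ours:

```python
def l1_distance(p, q):
    """L1 distance between two relative answers, given as dicts
    mapping dimension values to relative densities; a value missing
    from one answer counts as density 0."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# The example from the text:
a = {"France": 0.65, "Germany": 0.20, "U.K.": 0.15}
b = {"France": 0.60, "Germany": 0.40}
print(round(l1_distance(a, b), 6))  # 0.4
```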

2.2 Approximation

In our context, we approximate density values less than 1, and use randomized algorithms with an additive approximation. Strictly speaking, the L1 distances are less than 2, and we usually have a factor 1/2 to normalize, which we omit for simplicity. There are usually two parameters 0 ≤ ε, δ ≤ 1, where ε is the error and 1−δ is the confidence. In the case of a function F : Σ* → R, let A be a randomized algorithm with input x and output y = A(x). The algorithm A (ε, δ)-approximates the function F if for all x,

Prob[F(x) − ε ≤ A(x) ≤ F(x) + ε] ≥ 1 − δ

The randomized algorithms take samples t ∈ I with different distributions, introduced in the next subsection. The probability is over these probabilistic spaces.

(a) Absolute answer to Q1: analysis for each country.

COUNTRY       SUM OF SUNLIGHT
All country   689,960
France        450,022
U.K.          104,980
Germany       134,958

(b) Pie-chart answer of Q1: analysis for each country (France, U.K., Germany).

Figure 2: Representations of answers.


2.2.1 Sampling a Data Warehouse

There are many different methods to sample a data warehouse I [1, 12, 13, 17, 19] and we consider two specific techniques:

• Uniform sampling: we select Îu, made of m distinct samples of I, with a uniform distribution on the N tuples.

• Measure-based sampling: we select ÎM, made of m distinct samples, in two phases: we first select a tuple t with a uniform distribution, but then keep it with probability proportional to its measure, t.M/max, where max is the maximum value of the measure. We then replace the value of the measure by 1 if we keep the sample.

These two methods produce distinct random data warehouses Î. We approximate a query Q^I_C by Q^Î_C, i.e. replacing the large source I by a small Î. If max is small in relation to N, both techniques allow to approximate OLAP queries if the number of samples is large enough, but independent of N. It is easy to show that if m ≥ 12·(|C|/ε)²·log(1/δ), then the answer Q^Î_{C1,...,Cp} to any OLAP query Q on dimensions C1, ..., Cp without selection is ε-close to Q^I_{C1,...,Cp} with probability larger than 1−δ, where |C| = |C1|·|C2|·...·|Cp|. Figure 3 provides the approximate answer to Q1 for both distributions: the error is 1.78% for the uniform distribution (Figure 3a), and 1.72% for the measure-based distribution (Figure 3b), over a data warehouse with 10^6 tuples.

This is a simple application of a Chernoff bound to estimate the error on each density, and a union bound on the global vectors. It is important that the number of tuples is large, and selections may alter the result. In case of a selection, the number of tuples after the selection must be large for the result to apply.

Notice that if max is large (unbounded), these two distributions differ, and only the second one can approximate OLAP queries in our sense. Another important difference appears when we have several measures (2 in our example): one set of samples is enough for the uniform distribution, but we need two sets for the measure-based distribution, one for each measure.

The specific L1 distance between answers is important. As noticed in [1], uniform sampling may miss small groups of data, and the relative error for such a group could be 100%. As the measures are bounded, the total value would be small and our relative error would then be less than ε. In case of a selection, if the number of selected tuples is too small, the error could be large for the same reason, but if the number of selected tuples is large, the same approximation holds.
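The two sampling techniques above can be sketched as follows. This is a minimal illustration, not the paper's implementation: tuples are assumed to be dicts with a numeric measure field "M", sampling is done with replacement for simplicity, and the function names are ours:

```python
import random

def uniform_sample(warehouse, m):
    """Iu-style sample: m tuples drawn uniformly (with replacement,
    for simplicity) from the tuples of the warehouse."""
    return [random.choice(warehouse) for _ in range(m)]

def measure_based_sample(warehouse, m, max_measure):
    """IM-style sample: draw a tuple uniformly, keep it with
    probability t.M / max, and replace the kept tuple's measure by 1."""
    samples = []
    while len(samples) < m:
        t = random.choice(warehouse)
        if random.random() <= t["M"] / max_measure:
            kept = dict(t)
            kept["M"] = 1
            samples.append(kept)
    return samples
```

In the measure-based variant, tuples with a large measure are kept more often, which is why the kept measure is reset to 1: the bias is moved from the values into the sampling probabilities.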

2.2.2 Statistical Models

Assume each set of attributes ranges over a finite domain D and let ∆ be the set of distributions over D. We consider the L1 distance between distributions, i.e. two distributions δ, µ are ε-close if |δ − µ|₁ = Σ_v |δ(v) − µ(v)| ≤ ε. We first define the notion of a distribution for a set of attributes A1, ..., Ak, and then the notion of a statistical dependency, which generalizes the classical functional dependency. For simplicity, we assume only one attribute A which follows a distribution δ,

COUNTRY       SUM OF SUNLIGHT
All country   5,627
France        3,620
U.K.          873
Germany       1,134

(a) Approximate answer to Q1 with the uniform sampling. The error is 1.78%.

COUNTRY       SUM OF SUNLIGHT
All country   1,000
France        653
U.K.          160
Germany       187

(b) Approximate answer to Q1 with the measure-based sampling. The error is 1.72%.

Figure 3: Q1: analysis of sunlight for each country on the two random Î.

and we denote by δ(a) the probability that A = a. We define A ◁ B if each value a ∈ A implies a fixed distribution µa over the values of B.

Definition 1. The attribute A follows a distribution δ if for all sources I of an OLAP schema S, for all a ∈ A,

lim_{N→∞} Prob_{t∈r I}[t.A = a] = δ(a)

The notation t ∈r I means that the tuple t is taken with the uniform distribution on I.

Definition 2. The attribute A over values a1, ..., ap statistically implies B, written A ◁ B, if there are fixed distributions µ_{a1}, µ_{a2}, ..., µ_{ap} over the domain of B, such that for all sources I of the OLAP schema, for all i = 1, ..., p, for all b ∈ B,

lim_{N→∞} Prob_{t∈r I}[t.A = ai ∧ t.B = b] = µ_{ai}(b)

The specific statistical model we later consider is when some attributes of the OLAP schema statistically imply the measure M. It is very natural for an OLAP schema, as it captures the minimum set of attributes which imply a fixed distribution on the measure, and generalizes the classical decision trees. In a decision tree, a set of attributes determines the value of the measure with high probability. In our statistical model, a set of attributes determines fixed distributions of the measure, with high probability.

2.3 Data Exchange

A data exchange context [10] captures the situation when a source I exchanges data with a target. A setting is a triple (S, Σ, T) where S is a source schema, T a target schema and Σ a set of dependencies between the source and the target. A tuple dependency states that if a tuple t is in a relation R of S, it is also in a relation R′ of T, maybe slightly modified. In [8], an approximate version of the data exchange problems is introduced in order to cope with errors in the source and in the target. In [9], the data exchange setting is generalized to probabilistic sources and targets.

Standard problems are: Consistency, i.e. whether there is a solution J which satisfies the setting (S, Σ, T) for a given source I, Typechecking and Query-answering. In this paper we study approximate query answering in the context of an OLAP target data warehouse with several sources. We consider the simplest situation where the tuples of each source are copied in the target: in this case the tuple dependency Σ states this constraint. In a more general situation, different sources may follow different statistical models and the main difficulty is to combine them for query answering.

2.4 Streaming Data

We consider tuples of a data warehouse arriving as a stream and study whether it is possible, with a small memory, to approximately answer OLAP queries. If there are N tuples, an O(log N) memory would be small. In the classical literature [16], there are many examples of upper and lower bounds. A standard example [2] is the stream of values a1, a2, ..., an ∈ [1, ..., m] where n is unknown and n, m are arbitrarily large. The frequencies are fj = |{i ∈ [1, n] : ai = j}|, for j ∈ [1, m], and the moments are Fk = Σ_{j=1}^{m} (fj)^k. The moments F0 = |{j ∈ [1, m] : fj ≠ 0}|, F1 = n and F2 are approximable with a small memory, whereas Fp is not approximable for p > 2 and in particular F∞ = max_j fj is not approximable.

In the OLAP setting, the tuples t ∈ I are the ai, and we only want to approximate OLAP queries. We will show a lower bound based on the non-approximability of F∞.

2.5 Comparison with Other Approaches

The use of sampling to approximate OLAP queries is a classical technique [1, 6, 7, 12, 13, 17] and its application to streaming data [4, 15, 19] is natural.

We concentrate on the data exchange setting for data warehouses, and assume a natural statistical hypothesis, the statistical implication of the measure by some attributes. We differ from [5], as we don't study the worst-case scenario but a natural situation which generalizes the decision trees. We show in this context that random samples can be used to learn the statistical model and that approximate answers to OLAP queries can be obtained by merging the estimates of each source.

3. OLAP DATA EXCHANGE

In the context of OLAP data exchange, we consider the situation where k different sources feed a data warehouse I. For example, the relation I1 of source S1 feeds the data from England, the relation I2 of source S2 feeds the data from France, etc. We want to select Îi, made of mi samples from each source, and define Îe = Î1 ∪ Î2 ∪ ... ∪ Îk, where each Îi follows the uniform distribution. We ask which mi guarantee that any OLAP query Q on I will be well approximated by Îe.

We first consider the uniform distribution, and then the measure-based distribution. If each source corresponds to a unique attribute value a ∈ A (country, for example), A may follow a fixed distribution δA. In this case, if we select mi = m·δA(ai) with the uniform distribution, we will guarantee the approximation of any OLAP query on Îe. We say that I is the union on A of I1, I2, ..., Ik.

Theorem 1. If I is the union on A of I1, I2, ..., Ik, A follows the distribution δ, Îe = Î1 ∪ Î2 ∪ ... ∪ Îk, m ≥ 12·(|C|/ε)²·log(1/δ), and mi = m·δA(ai) uniform samples are taken from each source, then the answer Q^Îe_{C1,...,Cp} to any query Q on dimensions C1, ..., Cp is ε-close to Q^I_{C1,...,Cp} with probability 1−δ.

Figure 4: An OLAP data exchange context (sources S1, S2, ..., Si, ..., Sk feeding the data warehouse).

Proof. We suppose that mi tuples are taken uniformly from the set Ii, whose union is I = I1 ∪ I2 ∪ ... ∪ Ik. Therefore Îe is close to Îu. In the case of a single dimension C, we use the Chernoff-Hoeffding bound [11] and a union bound to conclude that:

Pr[|Q^I_C − Q^Îe_C| ≤ ε] ≥ 1 − δ.
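The per-source sampling of Theorem 1 can be sketched as follows, a minimal illustration under our assumptions: each source is a list of dict tuples keyed by its attribute value a_i, δ is a dict of probabilities, sampling is with replacement, and the rounding of m·δ(a_i) to an integer is our choice:

```python
import random

def sample_union(sources, delta, m):
    """Per-source sampling of Theorem 1: from each source S_i,
    identified by its attribute value a_i, draw m_i = m * delta(a_i)
    uniform samples and return the union of the samples."""
    union = []
    for a_i, tuples in sources.items():
        m_i = round(m * delta[a_i])
        union.extend(random.choice(tuples) for _ in range(m_i))
    return union
```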

In the case of the measure-based distribution ÎM, we can't take mi samples on each source Ii with the distribution of ÎM. How could we combine the sources in this case?

If A follows the distribution δ and A ◁ M, let us define m′i by:

m′i = m · (δ(ai) · Avgµ(ai)) / (Σ_i δ(ai) · Avgµ(ai))

where Avgµ(ai) is the average value of the distribution µ_{ai}. A theorem similar to Theorem 1 could then be stated, where we replace the mi uniform samples on each source Ii by m′i samples with the measure-based distribution on each source.
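The reweighted sample sizes m′i can be computed directly from δ and the averages Avgµ(ai); a small sketch with hypothetical dict inputs (the function name and representation are ours):

```python
def measure_based_sizes(m, delta, avg_mu):
    """Sample sizes m'_i = m * delta(a_i) * Avg_mu(a_i) /
    sum_j delta(a_j) * Avg_mu(a_j) for the measure-based case."""
    z = sum(delta[a] * avg_mu[a] for a in delta)
    return {a: m * delta[a] * avg_mu[a] / z for a in delta}
```

Sources whose measure is larger on average receive proportionally more of the m samples, which compensates for the per-tuple rejection of the measure-based sampling.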

4. STREAMING DATA

We consider tuples of the data warehouse as a stream, and study how to reduce the space needed to answer OLAP queries, i.e. without storing all the tuples. We first consider a lower bound, directly obtained from Communication Complexity, and then proceed with approximate solutions, first with blocks of the stream and then with a learning method, when the data follow a statistical model.

The stream is the sequence of tuples t1, ..., tn of the data warehouse I. In Example 1, for the schema of Figure 1, ti = (i, s1, 3, 12, 2010, 7, 2) states that sensor s1 measures 7 hours of sunlight and 2 hours of rain on December 3rd, 2010. The auxiliary tables such as C(CITY, COUNTRY) are fixed and independent of the stream.

4.1 Lower Bounds

Communication Complexity [14] studies the number of bits Alice and Bob must exchange to compute a function f(x, y) when Alice holds x and Bob holds y. In a protocol P(x, y) which combines the decisions of both Alice and Bob, the complexity of P(x, y) is the number of bits |P(x, y)| sent between Alice and Bob, i.e. C(P) = Max_{x,y} |P(x, y)|. Let D(f) be the minimum C(P) over all deterministic protocols to compute the function, i.e. D(f) = Min_P C(P), and let Rε(f) be the minimum C(P) over randomized protocols with public coins and error ε, i.e. Prob[P(x, y) ≠ f(x, y)] ≤ ε, Rε(f) = Min_P C(P). In the one-way model, only Alice sends bits to Bob. In this case, we define the one-way Communication Complexities D→(f) and Rε→(f) as before.

The memory M used by a deterministic (resp. randomized) streaming algorithm is always at least D→(f) (resp. Rε→(f)). Suppose a streaming algorithm computes f(X), and let us write the stream X = x|y as the concatenation of the input x with the input y. We can conceive the protocol where Alice transmits the memory content M to Bob, who can then compute f(X). Hence D→(f) ≤ M.

Therefore a lower bound on the Communication Complexity provides a lower bound on the space of a streaming algorithm. If x, y ⊆ {1, 2, ..., n}, let DISJ(x, y) = 1 if x ∩ y = ∅ and 0 otherwise. A classical result shows that D→(DISJ) = Ω(n) and Rε→(DISJ) = Ω(n).

Consider the special case of an OLAP schema S1 where the data warehouse contains the tuples (a1, 1), (a2, 1), (a3, 1), ..., (aN, 1), ..., i.e. a1, a2, ..., aN ∈ [1, ..., m] and all the measures are 1. The frequency fj = |{i ∈ [1, N] : ai = j}| for j ∈ [1, m], and F∞ = max_j fj. In this case, we have one dimension (the first attribute) and all tuples have the same measure.

Theorem 2. The approximation of the OLAP query on S1 requires a memory Ω(N), i.e. proportional to the length of the stream.

Proof. We use the classical reduction from DISJ(x, y) to F∞. Let x, y ∈ {0, 1}^n be the inputs to DISJ and let Xx = {i1, i2, ..., ik}, with ij ∈ {1, 2, ..., n}, ij < ij+1 and x_{ij} = 1. For example, if x = 011101 then Xx = {2, 3, 4, 6}. If x ∩ y = ∅ then F∞ = 1, and if x ∩ y ≠ ∅ then F∞ = 2. Therefore if we could approximate F∞, we could decide DISJ. In our context, the approximation of the heaviest component is precisely F∞. As DISJ(x, y) requires Ω(N) space for any randomized protocol, so does the approximation of the OLAP query on S1.

As the answer to OLAP queries requires Ω(N) space, we can only hope for gains of constant factors in the general case.

4.2 Block Algorithm

A natural approach is to decompose the interval [1...N] into large subintervals of size n << N and take m samples uniformly in each interval [i·n ... (i+1)·n] of the stream. For the uniform distribution, we keep the selected tj for m = 12·(|C|/ε)²·log(1/δ). For the measure-based distribution, we preselect m·avg random positions in each interval in advance, where avg = AVG(t.M) is the average value of the measure. We preselect a tuple of the stream if its position is one of the selected positions, and keep it with probability proportional to its weight. We now describe the definition of Î_{M,s}, the measure-based distribution for the streaming model.

Algorithm 1: Updating Î_{M,s} for a new block
  Data: stream I, Î_{M,s}, n
  Result: updated Î_{M,s}
  1 Select A = {i1, ..., i_{m·avg}} random positions in [1...n];
  2 for i := 1 to n do
  3   Read the new tuple ti of the stream;
  4   if i ∈ A then
  5     Select k in [1, max] with the uniform distribution;
  6     /* ∀t: t.M ≤ max */
  7     if k ≤ ti.M then
  8       Add ti to Î_{M,s};
  9       ti.M := 1;

The total space needed is N·m/n, i.e. better than N by the constant factor m/n. If n is large, this could provide a substantial gain. Another advantage is when we only want the data warehouse in the last block. In this case, we approximate OLAP queries as before.
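Algorithm 1 can be rendered in Python as follows. This is a sketch under our assumptions: a block is an iterable of dict tuples with an integer measure field "M" bounded by max_measure, m·avg is assumed to be an integer, and the function name is ours:

```python
import random

def update_block(block, n, m, avg, max_measure):
    """One pass of Algorithm 1 over a block of n tuples: preselect
    m*avg random positions, keep a preselected tuple t with
    probability t.M / max, and replace the kept tuple's measure by 1."""
    positions = set(random.sample(range(1, n + 1), min(m * avg, n)))
    kept = []
    for i, t in enumerate(block, start=1):
        if i in positions:
            k = random.randint(1, max_measure)  # uniform in [1, max]
            if k <= t["M"]:
                s = dict(t)
                s["M"] = 1
                kept.append(s)
    return kept
```

Since Pr[k ≤ t.M] = t.M/max, a preselected tuple is kept with probability proportional to its measure, which matches the measure-based sampling of Section 2.2.1.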

For the error analysis, let us decompose Î_{M,s} as the union of Î^i_{M,s} for each block i.

4.3 Approximate Answers on Î_{M,s}

Let m0 = 12·(|C|/ε)²·log(1/δ), as in the previous section. For each block, we have the same analysis, and in the worst case the error is added over the N/n blocks.

Theorem 3. For large N and any query Q on dimension C,

Pr[|Q^I_{C=c1} − Q^{Î_{M,s}}_{C=c1}| ≥ ε·N/(n·|C|)] ≤ δ.

Proof. From the analysis on each block i:

Pr[|Q^{I^i}_{C=c1} − Q^{Î^i_{M,s}}_{C=c1}| ≥ ε/|C|] ≤ δ

By definition, Q^{Î_{M,s}}_{C=c1} = Σ_i Q^{Î^i_{M,s}}_{C=c1} and Q^I_{C=c1} = Σ_i Q^{I^i}_{C=c1}. Hence, by a union bound on i:

Pr[|Q^I_{C=c1} − Q^{Î_{M,s}}_{C=c1}| ≥ ε·N/(n·|C|)] ≤ δ.

Corollary 1. The answer Q^I_{C1,...,Cp} to any query Q on dimensions C1, ..., Cp is ε·N/n-close to Q^{Î_{M,s}}_{C1,...,Cp}.

Proof. In the case of a single dimension C, we apply the previous theorem:

Pr[|Q^I_{C=c1} − Q^{Î_{M,s}}_{C=c1}| ≥ ε·N/(n·|C|)] ≤ δ

and with a union bound for c1, ..., ck ∈ C, we can conclude that

Pr[|Q^I_C − Q^{Î_{M,s}}_C| ≥ ε·N/n] ≤ δ.

Notice that the error goes to ∞ as N → ∞, but it is consistent with the lower bound.

There are many special cases where we can beat the linear lower bound. In particular, if the data warehouse is close to a decision tree, we can use a finite memory, with counters for each edge of the decision tree. An approximate answer to an OLAP query on dimensions which appear in the decision tree can be obtained by some linear combination of the counters. We study in the next subsection the important case when some attributes of the OLAP schema determine the distributions on the measure.

4.4 Learning a Statistical Model

If we assume that some attributes follow some distributions and some statistical dependencies, we may learn these distributions from some uniform samples and only store some fixed distributions. To approximate OLAP queries, it is fundamental that some attributes statistically imply the measure. We show that for a measure M, if there is an attribute A such that A ◁ M, then the distribution over C.A is enough to approximate an OLAP query over the dimension C.

Lemma 1. If A ◁ M, we can ε-approximate each distribution µ_{ai} with probability 1−δ if m > 12·(|M|/ε)²·log(1/δ) samples are taken and N is large enough.

Proof. We approximate the distribution µai by the den-sity of tuples for various values of the measure M . Considerm uniform samples such that t.A = ai and let d(i, v) =|t : t.A=ai∧t.M=v|

m. As

∑v d(i, v) = 1, we interpret d over

the values v ∈ M as µai . Let us show that d and µai areε-close.

For each v ∈ M , let Xj = 1 if the j-th tuple of I is suchthat t.M = v, andXj = 0 otherwise. Then IE(Xj) = µai(v),

d(i, v) =∑

j Xj

mand IE(d(i, v)) = IE(

∑j Xj

m) = µai(v). As

the tuples are taken independently, can apply a Chernoff-Hoeffding bound [11]:

Pr[ | d(i, v) − E(d(i, v)) | ≥ t ] ≤ e^{−2t²·m}.

In this form, t is the error and δ = e^{−2t²·m} is the confidence. We set t = ε/|M| and δ = e^{−2t²·m}, and conclude that if m > 12·(|M|/ε)²·log(1/δ) then:

Pr[ | d(i, v) − µ_{a_i}(v) | ≤ ε/|M| ] ≥ 1 − δ.

We then apply a union bound on v: as this inequality holds for all v,

Pr[ | d(i, ·) − µ_{a_i} | ≤ ε ] ≥ 1 − δ.
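The sample-size bound of Lemma 1 can be checked empirically. The sketch below (with an illustrative measure distribution over |M| = 10 values, matching the biased distribution used for Marseille in Section 5) computes m from the bound, draws m samples, and compares the empirical density with µ in L1 distance; all names in the code are assumptions for the illustration.

```python
import math
import random

def required_samples(card_m: int, eps: float, delta: float) -> int:
    """Sample size from Lemma 1: m > 12 * (|M| / eps)^2 * log(1/delta)."""
    return math.ceil(12 * (card_m / eps) ** 2 * math.log(1 / delta))

def empirical_l1_error(mu: dict, m: int, rng: random.Random) -> float:
    """Draw m samples from mu and return the L1 distance between
    the empirical density d(i, .) and mu."""
    values = list(mu)
    counts = {v: 0 for v in values}
    for v in rng.choices(values, weights=[mu[x] for x in values], k=m):
        counts[v] += 1
    return sum(abs(counts[v] / m - mu[v]) for v in values)

# Illustrative mu_{a_i} over |M| = 10 measure values: 1..8 with
# probability 1/40 each, 9 and 10 with probability 2/5 each.
mu = {v: 1 / 40 for v in range(1, 9)}
mu.update({9: 2 / 5, 10: 2 / 5})

rng = random.Random(0)
m = required_samples(card_m=10, eps=0.5, delta=0.05)
print(empirical_l1_error(mu, m, rng))  # small, well below eps = 0.5
```

In practice the bound is loose: with m ≈ 14,000 samples the observed L1 distance is far below ε.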

We now describe how to approximate OLAP queries, assuming A / M for some attribute A. To answer the OLAP query on dimension C, it is enough to keep the distribution over C.A. For the same reason as in the previous lemma, we can approximate the distribution C.A.

Lemma 2. We can ε-approximate the distribution over C.A with probability 1 − δ if m > 12·(|C|·|A|/ε)²·log(1/δ) samples and N large enough.

Let M be the measure in the model, where we approximate all the distributions C_i.A for all the dimensions C_i of the OLAP schema, and all the distributions µ_{a_i} of the dependency A / M. These distributions require O(Σ_i |C_i|·|A| + |M|·|A|) space, i.e. a space independent of N. For each distribution µ_{a_i}, let Avg_µ(a_i) be the average of the distribution. For the distribution C.A, let π_{C=c}(C.A) be the projection on the value C = c, and π_{C=c}(C.A)(a_i) the probability that A = a_i in this distribution. We will estimate Q^I_{C=c} by

Q^M_{C=c} = Σ_{a_i} π_{C=c}(C.A)(a_i) · Avg_µ(a_i)

and define the approximate answer Q^M_C.
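As a concrete illustration of this estimator, the following sketch evaluates Q^M_{C=c} from a stored model; all distribution values and averages below are hypothetical, not taken from the experiment.

```python
# Stored statistical model (all numbers hypothetical): for each value c
# of the dimension C, the distribution pi_{C=c}(C.A) over A, and for
# each a_i the average Avg_mu(a_i) of the measure distribution mu_{a_i}.
pi = {
    "France": {"Paris": 0.5, "Marseille": 0.5},
    "U.K.": {"London": 1.0},
}
avg_mu = {"Paris": 5.0, "Marseille": 8.5, "London": 4.0}

def q_model(c: str) -> float:
    """Q^M_{C=c} = sum over a_i of pi_{C=c}(C.A)(a_i) * Avg_mu(a_i)."""
    return sum(p * avg_mu[a] for a, p in pi[c].items())

print(q_model("France"))  # 0.5 * 5.0 + 0.5 * 8.5 = 6.75
```

Only the distributions π and the averages Avg_µ need to be stored, which is what makes the space independent of N.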

Theorem 4. If A / M, we can ε-approximate Q^I_C by Q^M_C with probability 1 − δ if m > (|C|/ε)²·log(1/δ) samples and N large enough.

Proof. As in Lemma 1, each Q^I_C is ε/2-approximated by Q̂^I_C for m > (|C|/ε)²·log(1/δ). But E(Q̂^I_C) = Q^M_C, hence we can also apply the Hoeffding bound and conclude that Q^M_C is ε/2-approximated by Q̂^I_C for m > (|C|/ε)²·log(1/δ). By the triangle inequality, we get the theorem.

4.5 Streaming Data Exchange

In a data exchange setting, we can apply the same analysis as in the previous section if we assume the same type of statistical model for each source. We need to combine both hypotheses:

• For each source, there is the same attribute A such that A / M,

• Each source corresponds to a distinct attribute value b ∈ B and B follows a fixed distribution δ_B.

We can then combine the results of the preceding sections and determine:

• The approximate model of each source, i.e. the distributions µ_j(a_i) on the measure and the distributions C.A, for each source j = 1, ..., k.

• The weight m_i of each source, such that Σ_i m_i = m, computed as in Theorem 1 from the distribution δ_B.

We can then approximate any OLAP query Q by linearity. Let Q^M_{C,j} be the approximation of Q by source j, as in the theorem above.

Theorem 5. If A / M on each source, we can ε-approximate Q^I_C by Q^M_C = Σ_j m_j · Q^M_{C,j} with probability 1 − δ if m > (|C|/ε)²·log(1/δ) samples and N large enough.

In this case, it is simpler to assume the uniform distribution on each source.
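The combination in Theorem 5 is a plain weighted sum of the per-source model answers. A minimal sketch, with hypothetical weights and per-source answers:

```python
def combine_sources(weights, answers):
    """Q^M_C = sum_j m_j * Q^M_{C,j}; the weights m_j must sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return {key: sum(w * a[key] for w, a in zip(weights, answers))
            for key in answers[0]}

# Hypothetical per-source answers on a dimension with two values.
q1 = {"Siemens": 0.40, "Thomson": 0.60}
q2 = {"Siemens": 0.35, "Thomson": 0.65}
q3 = {"Siemens": 0.30, "Thomson": 0.70}
print(combine_sources([0.5, 0.25, 0.25], [q1, q2, q3]))
```

Since each per-source answer is a distribution and the weights sum to 1, the combined answer is again a distribution.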

5. ANALYSIS OF SENSORS DATA

We use MySQL for the relational data, Mondrian for the OLAP engine, and an improved version of Jpivot where answers are graphically represented by multi-dimensional pie-charts. The different OLAP schemas can be selected with one click. The following results are achieved for N ≈ 10^6 tuples in DW, for the OLAP schema Sensor.

5.1 Data

We consider sensors which provide weather data, simulated by Algorithm 2, such as hours of sunlight and hours of rain each day, in the different cities. We assume an auxiliary table which provides, for each sensor, its location (city, country) and manufacturer's name (manuf), and a data warehouse DW which lists the two measures (sunlight, rain) every day for each sensor, as in Example 1. For simplicity, there are only 2 manufacturers, Siemens and Thomson.

In our experiment, we have 12 sensors: 6 in France, 3 in Germany and 3 in the U.K. We simulate (sunlight, rain) values in the interval [1, 10] with the following random generator. We associate a value τ_j ∈ [1, 10] with each city j ∈ [1, 9] among the 9 cities, so that the distribution of sunlight is biased towards high values if τ_j ≥ 5 and low values if τ_j < 5. Sensor 1 in London has τ_1 = 3 and sensor 9 in Marseille has τ_9 = 8. We write k ∈_r [1, 10] for selecting k in [1, 10] with the uniform distribution.

Real data usually introduce noise, for example undefined or incorrect values. It is important that these sampling methods are robust to noise.

Algorithm 2: Generating data for DW

Data: τ_1, ..., τ_12, N
Result: Table DW

/* generate 12 · N tuples */
for i := 1 to N do
    for j := 1 to 12 do
        k ∈_r [1, 10]  /* select k in [1, 10] with the uniform distribution */
        if τ_j < k then sunlight ∈_r [1, τ_j]
        else sunlight ∈_r [τ_j + 1, 10]
        rain := (10 − sunlight)/2
    date := date − 1

Notice that we satisfy the specific statistical constraint city / sunlight, i.e. each city determines a fixed distribution over the values 1, 2, ..., 10. As an example, the distribution µ(Marseille) gives the values 1, 2, ..., 8 with probability 1/40 each and the values 9, 10 with probability 2/5 each. The average of this distribution is 8.5. Similarly for all 9 cities.
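A runnable sketch of Algorithm 2 in Python; tuples are returned as dicts instead of being inserted into the DW table, the field names are our own, and the sketch assumes 1 ≤ τ_j ≤ 9 so that both branches of the draw are well defined.

```python
import random

def generate_dw(tau, n_days, seed=0):
    """Sketch of Algorithm 2: for each day and each sensor j, draw k
    uniformly in [1, 10]; if tau[j] < k the sunlight is uniform in
    [1, tau[j]], otherwise uniform in [tau[j] + 1, 10].
    Assumes 1 <= tau[j] <= 9."""
    rng = random.Random(seed)
    rows = []
    for day in range(n_days):
        for j, t in enumerate(tau, start=1):
            k = rng.randint(1, 10)
            if t < k:
                sunlight = rng.randint(1, t)
            else:
                sunlight = rng.randint(t + 1, 10)
            rows.append({"sensor": j, "day": day,
                         "sunlight": sunlight,
                         "rain": (10 - sunlight) / 2})
    return rows

# Sensor with tau = 8 (Marseille): sunlight should be 9 or 10 about
# 80% of the time, matching mu(Marseille).
rows = generate_dw([8], 5000)
high = sum(r["sunlight"] >= 9 for r in rows) / len(rows)
```

For τ = 8, the branch drawing sunlight in [9, 10] is taken with probability 8/10, which reproduces the weights 2/5 on the values 9 and 10 described above.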

5.2 Results

We compared the same query Q1, the analysis on the dimension country (1-dimension) for the measure sunlight, on the three schemas: Sensor, which defines the real data; Uniform, which defines m samples with a uniform distribution; and Measure-based, which defines m samples with the biased distribution. We then consider Q2, the analysis on manufacturer (1-dimension). The results are on http://www.up2.fr/ .

• Approximate answers for the different distributions

Figure 5 shows that the distance between the two answers is very small, as predicted by the theory. Notice that this holds for any query of the schema, provided the number of tuples is larger than N_0. If we apply a selection, the number of tuples after the selection may be less than N_0, and in this case the error is not guaranteed.

• Approximate answer from the Model

We can compare the distribution for the analysis on manufacturer on the real data in Figure 6 with the Q^M_C estimate. In our case, the distribution manufacturer.city is made of 18 pairs: the probability of (Siemens, Paris), for example, is 1/12. If we apply the formula of Section 4.4, we find the density for Siemens to be 0.39 and for Thomson 0.61. This is close to the exact distribution of Figure 6.

Figure 5: Analysis of sunlight for each manufacturer. (a) Approximate answer to Q2 on schema Uniform; the error is 1.2%. (b) Approximate answer to Q2 on schema Measure-based; the error is 3.7%.

A complete analysis would have considered three streaming sources, S1, S2 and S3, one for each of the countries. We would then have built a model for each source; since 6 of the 12 sensors are in France and 3 in each of the two other countries, the coefficients for the uniform distribution are m_1 = 1/2 and m_2 = m_3 = 1/4. We would then approximate Q^M_{C,j} for each source j, and the global approximation would be given by the formula:

Q^M_C = (1/2)·Q^M_{C,1} + (1/4)·Q^M_{C,2} + (1/4)·Q^M_{C,3}

• Multi-dimensional pie-charts

We present other natural queries on the schema Sensor. Figure 7 is an example of the multi-dimensional pie-charts available in the Jpivot tool. The number of dimensions is arbitrary.

The error analysis for the query Q2 is given in Table 1, for the three methods: uniform distribution, measure-based distribution, and linear estimation by the data-exchange technique presented in this paper. This last estimation has the smallest error, 0.4%.

Figure 6: Analysis of sunlight for each manufacturer on the schema Sensor.

Figure 7: Pie-chart answer of Q3: analysis of sunlight over country and manufacturer.

                 Figure 5a    Figure 5b        Q^M_C (linear     Figure 6
Manufacturer     (uniform     (measure-based   estimation by     (exact
                 sampling)    sampling)        data exchange)    answer)
Siemens          0.3851       0.4100           0.3890            0.3911
Thomson          0.6149       0.5900           0.6110            0.6089
Total error      0.0120       0.0378           0.0042            --

Table 1: Error analysis for the three approximations on Q2
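The total errors in Table 1 are L1 distances to the exact answer of Figure 6; for instance, for uniform sampling, |0.3851 − 0.3911| + |0.6149 − 0.6089| = 0.0120. A sketch of the computation:

```python
def l1_error(approx, exact):
    """Total error of Table 1: L1 distance to the exact answer."""
    return sum(abs(approx[k] - exact[k]) for k in exact)

exact   = {"Siemens": 0.3911, "Thomson": 0.6089}  # Figure 6
uniform = {"Siemens": 0.3851, "Thomson": 0.6149}  # Figure 5a
linear  = {"Siemens": 0.3890, "Thomson": 0.6110}  # Q^M_C
print(round(l1_error(uniform, exact), 4))  # -> 0.012
print(round(l1_error(linear, exact), 4))   # -> 0.0042
```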

6. CONCLUSION

We use sampling algorithms on a data warehouse I with N tuples and construct Î with m samples, where m is independent of N, for two distinct distributions. We showed that answers to OLAP queries on Î are close to the answers on I, and that in the case of a data exchange, we can sample each source independently and only combine the samples. For a streaming data warehouse, we first prove an Ω(N) lower bound on the memory used and then give a block algorithm which uses O(N·m/n) space when we decompose the stream in blocks of size n, i.e. better than O(N). We then study data warehouses which follow a statistical dependency, i.e. some attributes determine the distributions on the measure; this generalizes classical decision trees.

In this case, we can learn the statistical model from the samples, and approximate OLAP queries with a constant memory. The method generalizes to sources which follow the same type of statistical dependency. The case when sources follow other dependencies remains open.

7. REFERENCES

[1] S. Acharya, P. B. Gibbons, and V. Poosala. Congressional Samples for Approximate Answering of Group-By Queries. In Proc. of SIGMOD Conference, pages 487–498, 2000.

[2] N. Alon, Y. Matias, and M. Szegedy. The Space Complexity of Approximating the Frequency Moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.

[3] F. Banaei Kashani and C. Shahabi. Fixed-Precision Approximate Continuous Aggregate Queries in Peer-to-Peer Databases. In Proc. of ICDE, pages 1427–1429, 2008.

[4] G. Cormode, S. Muthukrishnan, and D. Srivastava. Finding hierarchical heavy hitters in data streams. In Proc. of VLDB, pages 464–475, 2003.

[5] G. Cormode, S. Muthukrishnan, K. Yi, and Q. Zhang. Continuous sampling from distributed streams. J. ACM, 59(2):10:1–10:25, 2012.

[6] A. Cuzzocrea. CAMS: OLAPing Multidimensional Data Streams Efficiently. In Proc. of DaWaK, pages 48–62, 2009.

[7] A. Cuzzocrea and D. Gunopulos. Efficiently Computing and Querying Multidimensional OLAP Data Cubes over Probabilistic Relational Data. In Proc. of ADBIS, pages 132–148, 2010.

[8] M. de Rougemont and A. Vieilleribiere. Approximate Data Exchange. In Proc. of ICDT, pages 44–58, 2007.

[9] R. Fagin, B. Kimelfeld, and P. G. Kolaitis. Probabilistic data exchange. In Proc. of ICDT, pages 76–88, 2010.

[10] R. Fagin, P. Kolaitis, R. Miller, and L. Popa. Data Exchange: Semantics and Query Answering. In Proc. of ICDT, pages 207–224, 2003.

[11] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

[12] C. Jermaine, S. Arumugam, A. Pol, and A. Dobra. Scalable approximate query processing with the DBO engine. In Proc. of SIGMOD Conference, pages 725–736, 2007.

[13] S. Joshi and C. Jermaine. Materialized Sample Views for Database Approximation. IEEE Trans. on Knowl. and Data Eng., 20(3):337–351, Mar. 2008.

[14] E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, New York, 1997.

[15] G. S. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. In Proc. of VLDB, pages 346–357, 2002.

[16] S. Muthukrishnan. Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science, 1(2), 2005.

[17] T. Palpanas, N. Koudas, and A. Mendelzon. Using Datacube Aggregates for Approximate Querying and Deviation Detection. IEEE Trans. on Knowl. and Data Eng., 17(11):1465–1477, Nov. 2005.

[18] N. Spyratos. A Functional Model for Data Analysis. In Proc. of FQAS, pages 51–64, 2006.

[19] S. Wu, B. C. Ooi, and K.-L. Tan. Continuous sampling for online aggregation over multiple queries. In Proc. of SIGMOD Conference, pages 651–662, 2010.
