Range Selectivity Estimation for Continuous Attributes

Flip Korn
AT&T Labs - Research
Florham Park, NJ 07932
[email protected]

Theodore Johnson
AT&T Labs - Research
Florham Park, NJ 07932
[email protected]

H.V. Jagadish*
UIUC
Urbana, IL 61801
[email protected]

* Work performed while with AT&T Labs.

Abstract

Many commercial database systems maintain histograms to efficiently estimate query selectivities as part of query optimization. Most work on histogram design is implicitly geared towards discrete or categorical attribute value domains. In this paper, we consider approaches that are better suited for the continuous valued attributes commonly found in scientific and statistical databases. We propose two methods based on spline functions for estimating the selectivity of range queries over univariate and multivariate data.

These methods are more accurate than histograms. As the results from our experiments on both real and synthetic data sets demonstrate, the proposed methods achieve substantially lower estimation error (by a factor of up to 5.5) than the state-of-the-art histograms, at exactly the same storage space and with comparable CPU runtime overhead; moreover, the superiority of the proposed spline methods is amplified when applied to multivariate data.

1 Introduction

Selectivity estimation is an important part of query optimization. Estimates can be used to select the best from among many competing access plans. There are two general classes of methods for selectivity estimation: sampling methods and statistical methods. We consider nonparametric statistical methods in this paper. For an instance of work on parametric methods, see [2]; for an instance of work on sampling methods, see [5].

Nonparametric methods determine the shape of the distribution from the available data, without necessarily conforming to a formal process model. In this sense, the "data are allowed to speak for themselves." Of the nonparametric methods we consider two approaches: histograms [8, 7, 9, 16, 14, 15] and curve-fitting [18, 1]. We briefly review some of the histogram methods (e.g., equiwidth, equidepth, end-biased, maxdiff) in Sec. 2.1; an excellent taxonomy of histograms can be found in [16]. Curve-fitting approaches alternative to the proposed approach are considered in Sec. 6.

Many data sets, such as scientific and statistical data sets, have continuous valued attributes. The state-of-the-art histograms [9, 16, 15] are implicitly geared towards discrete or categorical attribute value domains, where there are relatively few distinct values in the attribute. As such, these methods can be, and are, also used for estimating join selectivities [9]. In the absence of many duplicate values, as is the case in many scientific and statistical data sets, an equi-join will effectively result in the empty set, rendering these methods ineffective.

Let us examine some of the formal definitions upon which the recent work on histogram design is based. In [16], the authors define the data distribution of an attribute $X$ in a relation $R$ as follows:

Definition 1 The data distribution of $X$ (in $R$) is the set of tuples

$$T = \{(v_1, f_1), (v_2, f_2), \ldots, (v_D, f_D)\} \quad (1)$$

where $v_i$ are the unique values present in $X$, and $f_i$ are the frequencies of (number of tuples with) value $v_i$, for $i = 1, \ldots, D$, $D \le |X|$.

For the continuous data sets we have in mind, where there are relatively few duplicates, $f_i = 1$ for most $i$. For these data sets, it makes more sense to talk about the density of a value:

Definition 2 The function $f$ is called the probability density function (p.d.f.) of a continuous random variable $X$, defined for all real $x \in (-\infty, \infty)$, if

$$P\{X \in B\} = \int_B f(x)\,dx \quad (2)$$

for any set $B$ of real numbers.

Of course, one can always quantize continuous data, and in some sense, finite precision storage in digital computers already does this for us. However, if this quantization is done with fine enough granularity, most of the discrete cells will still have no data items in them, and a few will have one. On the other hand, a coarse quantization will result in a reasonable discrete data set while introducing significant quantization error, which may be unacceptable.

In this paper, we focus on the task of range selectivity estimation over univariate and multivariate data. We extend the best discrete methods and propose two new methods based on splines for continuous domains. These methods are implicitly built upon the continuous model of Eq. 2. The first method, called KernelSplines, involves estimating the data density (p.d.f.) via smooth kernels and then storing a compact approximation of the density as cubic splines. The second method, called OptimalSplines, involves estimating the density via maximum likelihood estimation.

The proposed methods are more accurate than histograms. As the results from our experiments on both real and synthetic data sets show, the proposed methods achieve substantially lower estimation error (by a factor of up to 5.5) than the state-of-the-art histograms (equiwidth, end-biased, maxdiff), at exactly the same storage space and with comparable CPU runtime overhead; moreover, the superiority of the proposed spline methods becomes even more dramatic for multivariate data.

The bulk of the database literature on histograms is focused towards query optimizers, as is our own work. However, approximate query answering is becoming increasingly important to provide data analysts with interactive responses from large data warehouses [6, 11]. To the extent that many data values (e.g., money amounts) stored in a data warehouse are drawn from a naturally continuous domain, the techniques we present in this paper are applicable to data warehousing contexts as well as to query optimizers.

The paper is organized as follows: Section 2 gives some intuition behind the statistical methods in this paper. Section 3 introduces the proposed univariate methods. Section 4 gives the experimental results. Section 5 presents the proposed multivariate methods and some results. Section 6 mentions some related work. Section 7 lists the conclusions and directions for future research.

Symbol                   Definition
$N$                      number of data points
$\beta$                  number of univariate bins or knot points
$n, m$                   numbers of bivariate bins in each dimension
$[a, b]$                 1-d interval range
$[a, b] \times [c, d]$   2-d rectangular range
$V, v_i$                 attribute value
$F, f_i$                 attribute frequency
$S, s_i$                 attribute spread, $s_i \equiv v_{i+1} - v_i$
$A, a_i$                 attribute area, $a_i \equiv f_i \cdot s_i$
$h$                      kernel bandwidth

Table 1. Symbol table.

2 Background

In this section we review the literature on histograms and briefly discuss the intuition behind the concepts used for the proposed methods: specifically, cubic splines and kernel density estimation.

2.1 Histograms

The time-honored histogram gives a (lossy) summarization of an attribute. It is constructed according to a partitioning rule for dividing the data into $\beta$ mutually disjoint bins. Each bin represents a range of the data. Associated with each bin is a single number denoting the frequency (count) of items occurring in the given range. Individual values within a bin are approximately reconstructed by making the uniform spread assumption, whereby values are assumed to be placed at equispaced intervals. Frequencies within a bin are approximately reconstructed by making the uniform frequency assumption, whereby all individual frequencies are assumed to be the same. Some traditional examples of histograms are equiwidth, where the bin boundaries are equispaced, and equidepth, where the bin boundaries are placed at quantiles.
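For concreteness, the following C sketch (our illustration, not code from the paper) estimates the number of tuples falling in a range [a, b] from an equiwidth histogram: completely intersected bins contribute their full counts, while partially intersected boundary bins contribute in proportion to the covered fraction of the bin, which is exactly the uniform frequency assumption.

#include <stdio.h>

/* Estimate the number of tuples in [a,b] from an equiwidth histogram.
 * freq[i] counts the values in [min + i*width, min + (i+1)*width).
 * Partially intersected bins contribute proportionally to the covered
 * fraction (uniform frequency assumption). Illustrative sketch only. */
double hist_range_estimate(const double *freq, int nbins,
                           double min, double width,
                           double a, double b)
{
    double est = 0.0;
    for (int i = 0; i < nbins; i++) {
        double lo = min + i * width;        /* bin boundaries */
        double hi = lo + width;
        double l = a > lo ? a : lo;         /* overlap of [a,b] with bin */
        double r = b < hi ? b : hi;
        if (r > l)
            est += freq[i] * (r - l) / width;
    }
    return est;
}

int main(void)
{
    double freq[5] = {10, 40, 80, 40, 10};  /* toy equiwidth histogram */
    /* bins cover [0,5); query [1.5,3.5] -> 20 + 80 + 20 = 120 */
    printf("%.1f\n", hist_range_estimate(freq, 5, 0.0, 1.0, 1.5, 3.5));
    return 0;
}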

Following [16], histograms can be classified along three orthogonal axes: partition constraint, sort parameter, and source parameter. The partition constraint is the rule for assigning the data to mutually disjoint bins; the sort parameter, most typically attribute value (V) or frequency (F), is the parameter along which bins represent contiguous ranges; the source parameter, most typically the attribute spread (S), frequency (F), or area (A), is the parameter on which the partition constraint is based. Table 1 lists these parameters along with their definitions.

Let $p(s, u)$ denote a histogram with partition constraint $p$, sort parameter $s$, and source parameter $u$. Then equiwidth histograms can be written as equisum(V,S) and equidepth histograms as equisum(V,F). Of the histograms introduced in [16], we consider V-optimal-end-biased(F,F) and maxdiff(V,A). The V-optimal-end-biased(F,F) histogram stores singleton bins and approximates the frequency of the remaining bins with their average frequency. The maxdiff(V,A) histogram puts bin boundaries in between the $(\beta - 1)$ largest consecutive (in sort parameter order) area differentials. Other histogram classes were introduced in [16] (e.g., compressed(V,F), compressed(V,A)) but are not considered in this paper for the reasons put forth in Sec. 4.

2.2 Cubic Splines

Splines are widely used for curve fitting in graphics and statistics. The most basic kind of spline is the cubic interpolation spline. Given a set of anchor points, called knots, a piecewise polynomial is constructed by joining each successive pair of knots with a separate cubic polynomial function beginning at one knot and extending to the other. See Fig. 3(c) for an example of a cubic spline.

The cubic polynomials satisfy continuity conditions to make the spline continuous and twice differentiable, giving cubic splines a smooth appearance:

Definition 3 Given knots $a = x_0 < x_1 < \cdots < x_\beta = b$ with values $u_j$ at each knot, a cubic spline $S$ is a function on $[a, b]$ satisfying the following conditions:

1. $S$ is a cubic polynomial, denoted $S_j$, on the interval $[x_j, x_{j+1}]$ for $0 \le j < \beta$, i.e., $S_j(x) = a_j + b_j(x - x_j) + c_j(x - x_j)^2 + d_j(x - x_j)^3$;

2. $S(x_j) = u_j$ for each $j$;

3. $S$ is twice continuously differentiable on $[a, b]$.

There are many different, more sophisticated, types of splines, including regression splines and smoothing splines. Of these, perhaps the most well known is the regression B-spline [3]. In this paper, we only use cubic interpolation splines and their generalization in higher dimensions.
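Evaluating a spline in the form of Definition 3 amounts to locating the interval containing the query point and evaluating the corresponding cubic. Below is a minimal C sketch (ours, with hard-coded coefficients chosen purely for illustration).

#include <stdio.h>

/* Evaluate a cubic spline at t, given knots x[0..nknots-1] and
 * per-interval coefficients as in Definition 3:
 *   S_j(t) = a[j] + b[j](t-x[j]) + c[j](t-x[j])^2 + d[j](t-x[j])^3 */
double spline_eval(const double *x, const double *a, const double *b,
                   const double *c, const double *d, int nknots, double t)
{
    int j = 0;
    while (j < nknots - 2 && t >= x[j + 1])  /* find interval [x_j, x_{j+1}] */
        j++;
    double dt = t - x[j];
    return a[j] + dt * (b[j] + dt * (c[j] + dt * d[j]));  /* Horner form */
}

int main(void)
{
    /* S(t) = t^3 on [0,2], written piecewise over knots {0, 1, 2} */
    double x[] = {0, 1, 2};
    double a[] = {0, 1}, b[] = {0, 3}, c[] = {0, 3}, d[] = {1, 1};
    printf("%.3f\n", spline_eval(x, a, b, c, d, 3, 1.5));  /* 3.375 */
    return 0;
}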

2.3 Kernel Density Estimation

Kernel estimation is one of the most popular methods in statistics for density estimation [20]. The basic idea is simple: for each data point $X_i$, a kernel (e.g., a Gaussian) centered about $X_i$ is summed. The kernel is a smooth, symmetric, weighted function that smears out the probability in the neighborhood of the data point $X_i$. Figure 1 shows what a kernel estimate would look like (a) after 2 data points, and (b) after 25 data points have been summed. The data point values are indicated by tick marks on the x-axis.

Figure 1. Illustration of a kernel density estimate, with data points indicated by tick marks, (a) for 2 data points, and (b) for 25 data points.

Definition 4 Given data points $X_1, X_2, \ldots, X_N$, a kernel estimate $\hat{f}$ is constructed as follows:

$$\hat{f}(x) = \frac{1}{N} \sum_{i=1}^{N} K_h(x - X_i) \quad (3)$$

where $K_h(x)$ is usually a unimodal, symmetric and bounded density function depending on the bandwidth $h$. The most typical kernel, and the one used in our experiments, is a Gaussian, that is,

$$K_h(x - X_i) = \frac{1}{\sqrt{2\pi}\,h}\, e^{-\frac{(x - X_i)^2}{2h^2}}. \quad (4)$$

There is an inherent trade-off in choosing the bandwidth (standard deviation, in the case of the Gaussian) $h$: wide bandwidths will produce smooth estimates that may hide local features of the density; narrow bandwidths will create artifacts. Some kernel estimation algorithms involve finding the "best" bandwidth, requiring several iterations [19]. Instead, we use a heuristic common in statistics for choosing a good bandwidth, specifically,

$$h = \frac{\log_2 N + 1}{4} \quad (5)$$

which allows the construction of a kernel estimate in one pass.
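The paper computes kernel estimates with the Splus density command; as a self-contained illustration, the C sketch below (ours) transcribes Eqs. 3-5 directly. Evaluating it at each knot location is precisely the build step of the KernelSpline of Sec. 3.1.

#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Gaussian kernel density estimate (Eq. 3) at a point x, using the
 * one-pass bandwidth heuristic of Eq. 5. Illustrative transcription. */
double kernel_density(const double *X, int N, double x)
{
    double h = (log2((double)N) + 1.0) / 4.0;       /* Eq. 5 */
    double sum = 0.0;
    for (int i = 0; i < N; i++) {
        double u = (x - X[i]) / h;
        sum += exp(-0.5 * u * u) / (sqrt(2.0 * M_PI) * h);  /* Eq. 4 */
    }
    return sum / N;                                  /* Eq. 3 */
}

int main(void)
{
    double X[] = {1.0, 2.0, 2.5, 4.0};               /* toy data */
    printf("%f\n", kernel_density(X, 4, 2.0));
    return 0;
}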

2.4 Splinegrams

A splinegram is constructed by fitting a spline through the midpoints of an equiwidth histogram. The frequencies of each histogram bin, at the bin's abscissa midpoint, serve as spline knots. Splinegrams are similar to what is known in the statistics literature as the frequency polygon, both of which suffer from bias [17]. Figure 3(a) illustrates an example of a splinegram constructed from data that is normally distributed.

3 Proposed Methods

Recall that a histogram is a collection of bins and frequencies, where each bin represents a data range and its associated frequency summarizes the number of values that lie in the range. Two disadvantages come from the equivalence of a histogram to a piecewise constant function (when bin ranges are continuous). First, the discontinuities imposed by bin boundaries result in bin dividers having a very "local" influence, i.e., incrementing the frequency of one bucket will not affect the frequencies of the other buckets. Second, range query estimation requires the uniform frequency assumption, that all attribute values in a bucket are assumed to have the same frequency [16]. It is the combination of these disadvantages that leads to estimation error accumulated at the boundary bins of a range query.

Figure 2 illustrates an example of a histogram summarization. Note the abrupt discontinuities between bins. Figure 2(a) shows a range interval $[a, b]$ in which three histogram bins are completely intersected and two bins are partially intersected. The frequency in the ranges of the completely intersected bins can be determined exactly; however, the frequency must be approximated at the boundary bins. As mentioned in Sec. 2.1, histograms make the uniform frequency assumption to approximate a partial interval as a fraction of a complete interval. Figure 2(b) shows a range interval $[a, b]$ properly contained inside a bin. For highly skewed distributions, which are common in real data sets, the uniformity assumption will suffer from bias due to the fact that an unweighted average is being used where a weighted average is called for.

Figure 2. Two common cases where histograms make the uniform frequency assumption: (a) in two partially intersected bins, and (b) in the same bin.

Splines, on the other hand, are smooth because, by definition, they are polynomials in between knots and because they satisfy continuity conditions at the boundaries. The selectivity of partially intersected ranges can be estimated analytically via integration, without having to assume uniformity. In the next section, we propose two spline representations in which the influence of the knot locations is inherently non-local due to their continuous underpinnings.

We propose two novel methods involving splines for selectivity estimation: KernelSplines and OptimalSplines. In contrast to histograms and splinegrams, these methods are based on a completely different paradigm that is tailored to continuous attributes. The main idea behind these methods is to assume that the data is generated by a continuous process, and to estimate the underlying p.d.f. using a smooth nonparametric technique. These nonparametric methods have a faster asymptotic rate of convergence of the mean square error than histograms: $O(N^{-4/5})$, compared to $O(N^{-2/3})$, where $N$ is the number of data points [20]. We hope to exploit this in our methods, both of which are explained in detail below.

3.1 KernelSplines

Given a set of $\beta$ knot locations at $x_1 < x_2 < \cdots < x_\beta$, a KernelSpline runs a kernel estimator (see Sec. 2.3) to estimate the density at each knot location; these densities are the y-values of the knots. A cubic spline is then constructed through these knots to approximate the underlying kernel estimate. Thus, KernelSplines give a compact approximation of the p.d.f. Because a kernel density estimate can be obtained in a single pass, the KernelSpline requires one pass to build. Furthermore, a KernelSpline can be maintained incrementally, with no need for periodic re-builds, if the knot locations are fixed (for example, if the knots are equispaced). Note that this takes the same time that it takes to build and maintain an equiwidth histogram. Figure 3(b) illustrates a KernelSpline along with the density estimate on which it is based.

3.2 OptimalSplines

Given a set of $\beta$ knot locations at $x_1 < x_2 < \cdots < x_\beta$, an OptimalSpline converges towards optimal coefficients for approximating the density with B-spline basis functions via maximum likelihood estimation. The MLE iteration is performed in main memory on a sample of the data set (see [10] for details). Figure 3(c) illustrates an OptimalSpline.

3.3 Estimating Result Sizes

Our methods require on-the-fly construction of a cubic spline when estimating a range selectivity from the knots that are stored. As the experiment mentioned in Sec. 4.2 indicates, this extra CPU overhead compared to histograms is practically negligible. The reason is that computing cubic spline coefficients is cheap. In fact, it involves solving a tridiagonal system in time linear in the number of buckets $\beta$.

Once the spline is constructed, range selectivity estimation involves calculating an analytic integral in each interval. This requires roughly the same CPU time as summing up histogram bucket frequencies.
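The paper does not spell out the spline construction beyond the tridiagonal solve, so the C sketch below is our illustration under two stated assumptions: equispaced knots and natural boundary conditions (zero second derivative at the endpoints). It computes the knot second derivatives with a linear-time Thomas solve, then evaluates the range selectivity as an exact analytic integral of the piecewise cubic, with no uniformity assumption at the boundary intervals.

#include <stdio.h>

#define NK 6   /* number of knots (beta) */

/* Natural cubic spline through equispaced knots y[0..NK-1] with
 * spacing s: solve the tridiagonal system
 *   M[i-1] + 4 M[i] + M[i+1] = 6 (y[i-1] - 2 y[i] + y[i+1]) / s^2
 * for the second derivatives M[i] (Thomas algorithm, linear time). */
static void spline_build(const double *y, double s, double *M)
{
    double c[NK], r[NK];                     /* forward-sweep scratch */
    M[0] = M[NK - 1] = 0.0;                  /* natural boundary conditions */
    c[1] = 1.0 / 4.0;
    r[1] = 6.0 * (y[0] - 2 * y[1] + y[2]) / (s * s) / 4.0;
    for (int i = 2; i <= NK - 2; i++) {
        double rhs = 6.0 * (y[i - 1] - 2 * y[i] + y[i + 1]) / (s * s);
        double m = 4.0 - c[i - 1];           /* eliminate subdiagonal */
        c[i] = 1.0 / m;
        r[i] = (rhs - r[i - 1]) / m;
    }
    for (int i = NK - 2; i >= 1; i--)        /* back substitution */
        M[i] = r[i] - c[i] * M[i + 1];
}

/* Exact integral of the spline over interval i, from x_i to x_i + u*s,
 * u in [0,1]; obtained by integrating the interpolation formula. */
static double seg_integral(const double *y, const double *M,
                           double s, int i, double u)
{
    double v = 1.0 - u;
    return s * (y[i] * (u - 0.5 * u * u) + y[i + 1] * 0.5 * u * u
        + (s * s / 6.0) * (M[i] * (0.5 * v * v - 0.25 * v * v * v * v - 0.25)
                         + M[i + 1] * (0.25 * u * u * u * u - 0.5 * u * u)));
}

/* Selectivity of [a,b]: analytic integral summed interval by interval;
 * partially intersected intervals are handled exactly. */
double spline_range(const double *y, const double *M,
                    double x0, double s, double a, double b)
{
    double est = 0.0;
    for (int i = 0; i < NK - 1; i++) {
        double lo = (a - (x0 + i * s)) / s;  /* clip query to [0,1] */
        double hi = (b - (x0 + i * s)) / s;
        if (lo < 0) lo = 0;
        if (hi > 1) hi = 1;
        if (hi > lo)
            est += seg_integral(y, M, s, i, hi) - seg_integral(y, M, s, i, lo);
    }
    return est;
}

int main(void)
{
    /* toy p.d.f. knot heights at x = 0, 1, ..., 5 */
    double y[NK] = {0.05, 0.20, 0.30, 0.25, 0.15, 0.05}, M[NK];
    spline_build(y, 1.0, M);
    printf("P(1.5 <= X <= 3.5) ~ %f\n", spline_range(y, M, 0.0, 1.0, 1.5, 3.5));
    return 0;
}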

4 Experiments

Here we compare histograms with the methods proposed in Section 3, namely, the KernelSpline and the OptimalSpline. Section 4.1 discusses the results from estimation accuracy experiments. Section 4.2 discusses the runtime CPU costs of the methods.

method           storage description                                          bytes
histograms
  equiwidth      $\beta$ freqs + min + interval                               $\beta + 2$
  eb-continuous  $(\beta - 1)$ singleton-coords + min + interval + avgfreq    $2\beta + 1$
splines
  equispaced     $\beta$ knot heights + min + interval                        $\beta + 2$
  end-biased     $(\beta - 1)$ singleton-coords + min + interval + avgfreq    $2\beta + 1$

Table 2. Required space for the competing methods, where min indicates the minimum value, interval the distance between abscissas, and avgfreq the average frequency of the remaining bins.

Competing Methods: Following [16], fair comparisons between the methods are ensured by constructing them so that they occupy the same amount of space: approximately 160 bytes. Table 2 summarizes the space requirements of the methods. Equiwidth histograms group contiguous ranges of attribute values into $\beta$ buckets at equally-spaced abscissas; equiknot splines are computed piecewise at equally-spaced knot locations. Both equispaced methods are incrementally maintainable. We do not consider equidepth histograms because they require multiple passes to determine quantiles (e.g., by sorting) and are not incrementally maintainable, which is inefficient for large data sets. We do, however, consider histogram strategies proposed in [16] because, though not incrementally maintainable, they can be built in a single pass.

In [16], the authors suggest that maxdiff(V,A) is the best overall histogram method for discrete or categorical data. However, maxdiff(V,A) turns out not to be the method of choice for continuous attributes, since $f_i = 1$ for practically all $i$. To test this intuition, we implemented maxdiff(V,A) and found that it performed worse than equiwidth histograms in every case we tried. For the same reason, many of the other methods in [16] are not well suited to our problem, in particular, those which use V as the sort parameter and either F or A as the source parameter.

Figure 3. Three spline-based methods: (a) splinegram (with its associated histogram bins), (b) KernelSpline (with its associated kernel estimate), and (c) OptimalSpline.

Of the recent work on histogram design, we chose to compare against (a variation on) V-optimal-end-biased(F,F) (henceforth called "end-biased") because (a) it is the most readily extended to continuous data by way of an initial grouping of the data, and (b) it was reported in previous work to be the best [9]. We do not use the end-biased histogram directly; instead, we first group the data into equiwidth bins to produce frequency counts and then, using these frequencies as the sort parameter, we bucket similar frequencies together to minimize the variance of frequencies in each bin, just as with end-biased. We call this technique eb-continuous, because continuous data requires the initial grouping step. In the parlance of [16], this method would be called V-optimal-end-biased(Equisum(V,S),F). We compared the eb-continuous histogram to spline methods where knots are placed at the highest/lowest y-values, with an average y-value given for the remaining knots, such that the variance is minimized, à la end-biased histograms. In summary, we compared (4 methods) $\times$ (2 binnings).

Software: The histogram methods were implemented in C. The density command from Splus was used for computing kernel estimates. Code for B-spline basis generation by MLE, which is the basis for OptimalSpline build-up, can be found in Statlib under the name logspline. Knot/bin locations given by the methods were sent to a basic cubic spline algorithm implemented in C during online size estimation.

Queries: Queries were carefully chosen in the 1-d interval $[a, b]$ so that they resulted in uniform selectivities (between 0-100%), low selectivities (0-20%), and high selectivities (80-100%).

Data sets: Three real data sets were used: worldnet, usage data from a random sample of 100,000 AT&T WorldNet users; thyroid, thyroid medical data from 7,200 patients; and cloud, cloud data over 100,000 recorded intervals.¹

¹ thyroid is from http://www.ics.uci.edu/~mlearn/MLSummary.html; cloud is from http://cdiac.esd.ornl.gov/cdiac/ndps/ndp026b.html.

Error Measure: Following [16], the error $E$ of selectivity estimates for a set $Q$ of $N$ queries is computed as

$$E = \frac{100}{N} \sum_{q \in Q} \frac{|S_q - S'_q|}{S_q} \quad (6)$$

where $S_q$ and $S'_q$ are, respectively, the actual and estimated size of the query result. Since we are interested in range queries, the selectivity estimate is the integral of the estimated p.d.f. over the specified range.
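Read concretely, Eq. 6 is just the mean relative error over the query workload, scaled to a percentage; a small C sketch of ours:

#include <stdio.h>

/* Average relative estimation error E of Eq. 6, in percent.
 * actual[q] and est[q] are the true and estimated result sizes. */
double avg_rel_error(const double *actual, const double *est, int N)
{
    double sum = 0.0;
    for (int q = 0; q < N; q++) {
        double d = est[q] - actual[q];
        sum += (d < 0 ? -d : d) / actual[q];
    }
    return 100.0 * sum / N;
}

int main(void)
{
    double actual[] = {100, 200, 50};
    double est[]    = {110, 180, 60};   /* toy estimates: E = 13.33% */
    printf("E = %.2f%%\n", avg_rel_error(actual, est, 3));
    return 0;
}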

4.1 Estimation Accuracy

We averaged the estimation errors of the histogram and spline methods over 1000 queries. Table 3 summarizes the results for the worldnet, thyroid, and cloud data sets. For brevity, we present only the results of range queries where the selectivity is uniformly chosen between 0-100% and omit high (80-100%) and low (0-20%) selectivity queries, as their relative results were similar.

As Table 3(a) shows, the eb-continuous (end-biased) knot (bin) placement did not increase the accuracy much on the worldnet data set, and even performed significantly worse in the case of the OptimalSpline. This suggests that the goal of minimizing within-bucket frequency variances is implicitly better suited for discrete data than for continuous data.

method         equispaced   eb-continuous
histogram      55.44%       51.46%
splinegram     128.90%      130.67%
KernelSpline   39.75%       37.52%
OptimalSpline  16.83%       36.63%
(a) worldnet

method         equispaced   eb-continuous
histogram      80.95%       62.93%
splinegram     100.35%      92.99%
KernelSpline   63.98%       48.98%
OptimalSpline  14.53%       26.21%
(b) thyroid

method         equispaced   eb-continuous
histogram      41.92%       32.11%
splinegram     93.94%       76.31%
KernelSpline   36.69%       37.51%
OptimalSpline  31.02%       35.69%
(c) cloud

Table 3. Estimation errors for histograms, splinegrams, KernelSplines, and OptimalSplines, for (a) worldnet, (b) thyroid, and (c) cloud data sets.

As expected, the OptimalSpline consistently achieved the best results, equispaced and eb-continuous, for all of the data sets. If incremental maintenance is not needed, then the OptimalSpline is the technique of choice. The KernelSpline gave result sizes that were slightly (5-11%) more accurate than the equiwidth histogram. As reported in the following section, there is no performance price to pay for using the KernelSpline in place of the equiwidth histogram.

Recall that splinegrams are constructed from histogram bin midpoints and, therefore, contain knots that are only locally influential. As such, this method suffers from bias [17]. Because the construction of splinegrams does not properly make use of the underlying continuity that the proposed methods do, we were not surprised to discover how poorly they performed relative to the proposed methods, generating more than 22 times the error of OptimalSpline in one experiment. The lesson here is that one cannot just use splines arbitrarily to achieve good accuracy; rather, the nonparametric method underlying spline construction is critical.

Finally, we considered the effect of storage space on accuracy for the histogram and spline methods at different values of $\beta$. We found that the accuracy improved proportionally with increasing storage space for all the methods.

4.2 Estimation Speed

We compared the overhead of estimating the selectivity at runtime for both histograms and the proposed spline methods, with $\beta = 20$. We ran these experiments on each of the data sets listed above and measured the user time on an SGI workstation running IRIX, finding that it took approximately 18 seconds for all methods to compute estimates for 1000 queries. Furthermore, we observed approximately linear scale-up of runtime overhead with increasing storage space.

5 Multivariate Range Query Estimation

In multivariate histograms, the effect of bin discontinuities on estimation error is amplified. Figure 4(a) compares a bivariate histogram to a bivariate KernelSpline built from the same multivariate normal data set. Note how much discontinuity is imposed by the histogram bin dividers.

The uniform frequency assumption is also more of an issue for multivariate data because, unlike in a univariate histogram, it is possible for more than two bins (in fact, arbitrarily many) to be partially intersected. In fact, high dimensional range queries will exhibit yet another manifestation of the dimensionality curse, whereby (on average) exponentially more buckets will be partially intersected. Figure 5(a) presents a bird's eye view of equiwidth bins and a square range query $[a, b] \times [c, d]$. In this case, the uniformity assumption is made in all 12 bins that are partially intersected.

In the following subsections, we consider how to extend the proposed methods of Sec. 3 to multivariate attributes. For brevity, we focus on the bivariate KernelSpline. Section 5.1 describes how to extend the univariate cubic spline to bivariate attributes; Section 5.2 shows how to do so for the KernelSpline; Section 5.3 presents some experimental results of this method compared to bivariate histograms.

5.1 Bicubic Splines

The bicubic interpolation spline is a generalization of the cubic interpolation spline to surfaces. See Fig. 4(b) for an example of a bicubic spline. Here we give a more formal definition:

Figure 4. Nonparametric estimation of a bivariate normal density (a) using a bivariate histogram, and (b) using a bicubic spline.

Definition 5 Given a rectangular grid $G = \{(x_i, y_j) \mid a = x_0 < x_1 < \cdots < x_n = b,\; c = y_0 < y_1 < \cdots < y_m = d\}$ with heights $u_{ij}$ at each knot, a bicubic spline $S$ is a function on $R = [a, b] \times [c, d]$ satisfying the following conditions:

1. $S$ is a bicubic polynomial, denoted $S_{ij}$, on the interval $[x_i, x_{i+1}] \times [y_j, y_{j+1}]$, i.e., $S_{ij}(x, y) = \sum_{k=0}^{3} \sum_{s=0}^{3} a_{ijks} (x - x_i)^k (y - y_j)^s$;

2. $S(x_i, y_j) = u_{ij}$ for each $i, j$;

3. $S$ is twice continuously differentiable on $R$.

5.2 Bivariate KernelSplines

Given a rectangular grid of knots, a bivariate KernelSpline estimates the bivariate density at each knot location via bivariate kernel density estimation. Bivariate kernel estimation is very similar to univariate density estimation but with the sum of bivariate rather than univariate Gaussian kernels gauged at the grid points. A bicubic spline is then constructed through these grid points. Figure 4(b) illustrates an example. As in the univariate case, the bivariate KernelSpline can be built in one pass over the data set and is incrementally maintainable. Its runtime overhead involves computing a bicubic spline on the fly, which can be computed in $O(nm)$ by solving $m$ tridiagonal systems of diagonal length $n$.

Figure 5. (a) Bird's eye view of equiwidth bins and a square range query in the range $[a, b] \times [c, d]$.

5.3 Experiments

Competing Methods: As a representative of the histogram methods, we selected the bivariate equiwidth histogram; as a representative of the proposed spline methods, we selected the bivariate KernelSpline at evenly-spaced knots on the rectangular grid. We compared these two methods at the same storage space: approximately 400 bytes.

method         equispaced
histogram      21.55%
KernelSpline   9.28%
(a) binormal

method         equispaced
histogram      16.86%
KernelSpline   5.08%
(b) LBcounty

Table 4. Estimation errors of the equiwidth histogram and the KernelSpline, for (a) binormal, and (b) LBcounty.

Software: The bivariate histogram methods were implemented in C. Freely available Splus code from Statlib was used for computing bivariate kernel estimates. Knot locations given by bivariate histograms and kernel estimates were sent to a basic bicubic spline algorithm implemented in C for KernelSplines.

Queries: Queries for the bivariate methods were in the form of 2-d rectangular ranges $[a, b] \times [c, d]$. These ranges were chosen uniformly.

Data sets: Two data sets were used: binormal, 100,000 bivariate normally distributed points, and LBcounty, Cartesian coordinates of 63,830 road crossings in Long Beach County, CA.

Error Measure: The error measure from Sec. 4 was used.

Results: We ran our method on the above data sets for 1000 queries each. Our results show an even more favorable ratio of performance for KernelSplines than what was observed with univariate data: KernelSplines have less than half the error of equiwidth histograms for the binormal data set and less than one third for LBcounty. Table 4 summarizes the results. One would expect bivariate OptimalSplines to outperform bivariate KernelSplines substantially, based on the relative performance of univariate OptimalSplines to KernelSplines, but this remains to be tested.

6 Related Work

As mentioned, the focus of the past work is on discrete data, where at least some of the attribute values have high multiplicity. Most methods in the literature successfully exploit this assumption, such as the end-biased histogram [9], the maxdiff histogram [16, 15], the compressed histogram [16, 15], the polynomial-based method in [18], and wavelet-based histograms [13]. As shown in [16], the prevailing method is maxdiff(V,A), which we used in our experiments.

Remotely related to our work are the query feedback approach of [1], the use of linear regularization to obtain better estimates from histograms [4], and the CF-kernel method of [12] for obtaining a fast kernel estimate of the density in very large data sets.

7 Conclusions

The main contribution of this paper is the recognition of the need to distinguish between different attribute value domain types, such as discrete and continuous, for obtaining range query selectivity estimates and for approximate query answering over continuous valued attributes. Our analysis and experiments reveal that continuous valued attributes call for different statistical profiling methods.

We have presented two methods based on splines, namely, KernelSplines and OptimalSplines, which have the following benefits:

- They exploit the continuity of the data by fitting analytic continuous functions to a smooth estimate of the data density (p.d.f.);

- They can estimate range selectivities by evaluating an analytic integral, eliminating the need for the uniform frequency assumption, which is known to deteriorate the performance of histograms;

- They can be implemented quickly using public domain code for Splus.

Due to these advantages, our experiments on both real and synthetic data have demonstrated the following:

- They achieve substantially lower estimation error (as low as one-sixth) than the state-of-the-art histograms, at exactly the same amount of storage space and at comparable CPU time;

- The relative estimation error of the proposed methods compared to histograms is amplified for multivariate range estimations.

For these reasons, the proposed spline methods are more attractive than histograms for evaluating range selectivities over univariate and multivariate continuous attributes. The OptimalSpline is the most accurate for univariate attributes, consistently achieving the lowest estimation error of all the methods examined, and obtaining excellent results on a variety of data sets (e.g., 14.5% error on a 100K-attribute highly-skewed data set). In the case where periodic off-line build-up of a statistical profile is required, the KernelSpline shares some of the nice properties of the OptimalSpline, generating better results than histograms, but requiring only one pass when the data range is known in advance.

For multivariate attributes, we have observed that the KernelSpline is even more clearly a better choice than the histogram, as the differential in estimation error compared to histograms is magnified in higher dimensions. One could infer that the ratio should be even better for the multivariate OptimalSpline; nonetheless, we would recommend the KernelSpline as the method of choice for multivariate attributes, where build-up time is more expensive, because it clearly beats histograms while only requiring a single pass over the data.

Future work includes investigating the relative degradation and maintenance work requirements of the proposed methods compared to histograms in the presence of insertions and deletions.

Acknowledgments

We would like to thank Christos Faloutsos for his useful comments. We would also like to thank statisticians Andreas Buja, William DuMouchel, and Charles Kooperberg for their constructive discussions.

References

[1] Chungmin M. Chen and Nick Roussopoulos. Adaptive selectivity estimation using query feedback. In Proc. of the ACM SIGMOD, pages 161-172, Minneapolis, MN, May 1994.

[2] S. Christodoulakis. Estimating block selectivities. Information Systems, 9(1), March 1984.

[3] C. deBoor. Spline functions. In S. Kotz and N. Johnson, editors, Encyclopedia of Statistical Sciences, volume 2. Wiley, 1983.

[4] Christos Faloutsos, H.V. Jagadish, and Nikolaos D. Sidiropoulos. Recovering information from summary data. In Proc. of VLDB, pages 36-45, Athens, Greece, August 1997.

[5] Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, and Lynne Stokes. Sampling-based estimation of the number of distinct values of an attribute. In Proc. of VLDB, pages 311-322, September 1995.

[6] Joseph M. Hellerstein, Peter J. Haas, and Helen J. Wang. Online aggregation. In Proc. ACM SIGMOD, pages 171-182, Tucson, AZ, May 1997.

[7] Y. Ioannidis. Universality of serial histograms. In Proc. of VLDB, pages 256-277, Dublin, Ireland, August 1993.

[8] Y. Ioannidis and S. Christodoulakis. Optimal histograms for limiting worst-case error propagation in the size of join results. ACM Transactions on Database Systems, 18(4):709-748, December 1993.

[9] Yannis E. Ioannidis and Viswanath Poosala. Balancing histogram optimality and practicality for query result size estimation. In ACM SIGMOD, pages 233-244, San Jose, CA, June 1995.

[10] C. Kooperberg and C. Stone. Logspline density estimation for censored data. Journal of Computational and Graphical Statistics, December 1992.

[11] Flip Korn, H.V. Jagadish, and Christos Faloutsos. Efficiently supporting ad hoc queries in large datasets of time sequences. In Proc. ACM SIGMOD, pages 289-300, Tucson, AZ, May 1997.

[12] M. Livny, R. Ramakrishnan, and T. Zhang. Fast density and probability estimation using the CF-kernel method for very large databases. Technical report, University of Wisconsin, Madison, WI, July 1996.

[13] Yossi Matias, Jeff Vitter, and Min Wang. Wavelet-based histograms for selectivity estimation. In Proc. ACM SIGMOD, pages 448-459, Seattle, WA, June 1998.

[14] M. Muralikrishna and David J. DeWitt. Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In Proc. ACM SIGMOD, pages 28-36, Chicago, IL, June 1988.

[15] Viswanath Poosala and Yannis E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Proc. of VLDB, pages 486-495, Athens, Greece, August 1997.

[16] Viswanath Poosala, Yannis E. Ioannidis, Peter J. Haas, and Eugene J. Shekita. Improved histograms for selectivity estimation of range predicates. In ACM SIGMOD, pages 294-305, Montreal, Canada, June 1996.

[17] David Scott. Multivariate Density Estimation. Wiley, New York, 1992.

[18] Wei Sun, Yibei Ling, Naphtali Rishe, and Yi Deng. An instant and accurate size estimation method for joins and selection in a retrieval-intensive environment. In Proc. ACM SIGMOD, pages 79-88, May 1993.

[19] E. J. Wegman. Nonparametric probability density estimation: A summary of available methods. Technometrics, 14(3):533-545, August 1972.

[20] E. J. Wegman. Density estimation. In S. Kotz and N. Johnson, editors, Encyclopedia of Statistical Sciences, volume 2. Wiley, 1983.
