Parallel Distributed Processing of Constrained Skyline Queries by Filtering
Bin Cui 1, Hua Lu 2, Quanqing Xu 1, Lijiang Chen 1, Yafei Dai 1, Yongluan Zhou 3
1 Department of Computer Science, Peking University, China
{bin.cui, xqq, clj, dyf}@pku.edu.cn
2 Department of Computer Science, Aalborg University, [email protected]
3 Distributed Information Systems Laboratory, EPFL, [email protected]
Abstract- Skyline queries are capable of retrieving interesting points from a large data set according to multiple criteria. Most work on skyline queries so far has assumed centralized storage, whereas in practice relevant data are often distributed among geographically scattered sites. In this work, we tackle constrained skyline queries in large-scale distributed environments without the assumption of any overlay structures, and propose a novel algorithm named PaDSkyline (Parallel Distributed Skyline query processing). PaDSkyline significantly shortens the response time by performing parallel processing over site groups produced by a partition algorithm. Within each group, it locally optimizes the query processing over distributed sites. It also drastically enhances the network transmission efficiency by performing early reduction of skyline candidates with deliberately selected multiple filtering points. Results of extensive experiments demonstrate the efficiency and robustness of our proposals.
I. INTRODUCTION
A skyline query [1] returns interesting points from a set of multi-dimensional data points. A point is said to be interesting if it is not dominated by any other point. A point pt1 is said to dominate pt2 if pt1 is not worse than pt2 in every single dimension and better than pt2 in at least one dimension. The meaning of "better" varies in different situations, for example "smaller" or "larger" in value comparison, and "earlier" or "later" in date comparison. Because of their powerful capability of retrieving interesting points from a large n-dimensional data set, skyline queries are well suited for applications like decision making and multiple criteria optimization. For instance, a tourist can issue a skyline query on a hotel relation to get those hotels with good recommendations and cheap prices.

Most work on skyline queries [1], [2], [3], [4], [5], [6] so far has assumed centralized data storage and has focused on providing efficient skyline computation algorithms on a single database. This assumption, however, fails to reflect distributed computing environments consisting of different computers, which are located at geographically scattered sites and connected via the Internet. For example, a stock trader needs to know which stocks worldwide are worth investing in, based on the trading records of the previous day. For this purpose, the trader has to access multiple stock information databases available at different places like the New York Stock Exchange, London Stock Exchange, Tokyo Stock Exchange, etc. For each single stock, the trader needs to take into consideration multiple attributes like last trade price, change, last close price, estimated price, volume, etc. Therefore, a skyline query against those distributed databases will help the trader find the interesting stocks.

Another real-life example is online comparative shopping, in which a search engine needs to get good bargains from many different shopping sites according to multiple criteria like price, quality, guarantee, etc. Clearly, such multiple criteria are best captured by a skyline query.

In such cases as mentioned above, however, directly applying existing techniques to process skyline queries would incur large overhead. In this work, we intend to efficiently process constrained skyline queries [5] in such widely distributed environments. We propose a novel algorithm named PaDSkyline, which stands for Parallel Distributed Skyline query processing. A constrained skyline query is attached with constraints on specific dimensions. A constraint on a dimension is a range specifying the user's interest. Refer again to the stock selection example. The trader may only be interested in those stocks whose last trade prices are between $15 and $20. Similar constraints are also applicable to other dimensions. Two points are noteworthy about possible constraints. One is that the range denoted by a constraint can be open-ended, like an estimated price higher than $20. The other is that constraints may be specified on only some of the n dimensions.

Given a distributed environment as described above, our objective is to devise efficient query processing strategies that shorten the overall query response time. We first speed up the overall query processing by achieving parallelism of distributed query execution. Given a skyline query with constraints, all relevant sites are partitioned into incomparable groups among which the query can be executed in parallel. The parallel execution also makes it possible to report skyline points progressively, which is usually desirable to users. Within each group, specific plans are proposed to further improve the query processing
978-1-4244-1837-4/08/$25.00 (© 2008 IEEE 546 ICDE 2008
Authorized licensed use limited to: Peking University. Downloaded on December 15, 2009 at 20:31 from IEEE Xplore. Restrictions apply.
involving all intra-group sites. On a processing site, multiple filtering points are deliberately picked from the local skyline based on their overall dominating potential. They are then sent to other sites with the query request, where they help identify more unqualified points that would otherwise be reported as false positives, thus reducing the communication cost between data sites.

We make the following contributions on the problem of distributed constrained skyline queries in a network environment without assuming any overlay structures. First, we propose a specific partition algorithm that divides all relevant sites into groups, such that a given query can be executed in parallel among all those site groups. Second, we elaborate the query execution within any single group of sites, which includes deciding a query forwarding plan between sites and designating a group head responsible for merging query results. Third, we give a parallel distributed skyline algorithm based on the first two contributions, together with a cost model to estimate query costs. Fourth, we detail how to use multiple filtering points in distributed query processing such that the network data transmission is more efficient. Two simple yet efficient heuristics are developed for selecting multiple filtering points on a processing site. Finally, we conduct extensive experiments on both synthetic and real datasets, and the results demonstrate the efficiency and robustness of our proposals.

The remainder of this paper is organized as follows. Section II provides a brief review of related work on skyline queries. Section III gives the problem definition. In Sections IV and V we detail the parallel distributed query execution and the selection of multiple filtering points, respectively. Section VI presents the experimental results. Finally, we conclude the paper in Section VII.
II. RELATED WORK
Börzsönyi et al. [1] introduced the skyline operator into database systems with the algorithms Block Nested Loop (BNL) and Divide-and-Conquer (D&C). Chomicki et al. [2] proposed a Sort-Filter-Skyline (SFS) algorithm as a variant of BNL. Tan et al. [6] proposed two progressive algorithms: Bitmap and Index. The former represents points in bit vectors and employs bit-wise operations, while the latter utilizes data transformation and B+-tree indexing. Kossmann et al. [4] proposed a Nearest Neighbor (NN) method. It identifies skyline points by recursively invoking R*-tree based depth-first NN search over different data portions. Papadias et al. [5] proposed a Branch-and-Bound Skyline (BBS) method based on the best-first nearest neighbor algorithm [7]. Godfrey et al. [3] provided a comprehensive analysis of previous skyline algorithms without indexing support, and proposed a new hybrid method with improvements. All these works assume centralized data storage.

Deviating from skyline queries in the centralized setting, Balke et al. [8] addressed skyline operation over web databases where different dimensions are stored in different data sites. Their algorithm first retrieves values in every dimension from remote data sites using sorted access in round-robin on all dimensions. This continues until all dimension values of an object, called the terminating object, have been retrieved. Then all non-skyline objects are filtered from all those objects with at least one dimension value retrieved. In contrast, our work in this paper deals with horizontally partitioned data.

Wu et al. [9] proposed a parallel execution of constrained skyline queries in a CAN [10] based distributed environment. By using the query range to recursively partition the data region on every data site involved, and encoding each involved (sub-)region dynamically, their method avoids accessing sites not containing potential skyline points and progressively reports correct skyline points. Wang et al. [11] developed the Skyline Space Partitioning (SSP) approach to compute skylines on the tree-structured P2P platform BATON. SSP partitions the skyline space into regions and maps them into a single-dimensional order, which allows regions to be distributed to different peer nodes according to BATON protocols. Our proposal in this paper differs from these two pieces of work in that we do not assume the availability of any overlay on top of the original network.

Huang et al. [12] proposed techniques for skyline query processing in MANETs. Lightweight devices in MANETs are able to issue spatially constrained skyline queries that involve data stored on many mobile devices. Queries are forwarded through the whole MANET without routing information. They proposed a filtering-based data reduction technique that reduces the data transferred among devices. Our work, which assumes a wired large-scale distributed environment, also differs from this work in a MANET setting.
III. PROBLEM DEFINITION
Given a set of m sites S = {S1, S2, ..., Sm} distributed at different geographic locations, each Si has a local relation Ri. Every tuple in any Ri is an n-dimensional point, represented as (p1, p2, ..., pn). Different Ris may overlap, i.e., it is possible that Ri ∩ Rj ≠ ∅ for i ≠ j.

Without loss of generality, we assume smaller values are preferred in the skyline operator. We use pt1 ≺ pt2 to represent that point pt1 dominates point pt2. Besides, we suppose any site Sorg, able to directly communicate with any other site Si ∈ S through wired end-to-end connections, may initiate against all Ris a skyline query with a set of n constraints C = {C1, C2, ..., Cn}. Each Ci is either a range [li, ui] or ∅, indicating no constraint in that dimension. Our goal is to get the result for the constrained skyline query efficiently, i.e., with short response time. We define the query response time as the time period from the moment a query is issued by a site Sorg to the moment Sorg receives the complete and correct answers after contacting other sites.
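To make the definitions above concrete, the following minimal sketch encodes dominance and a constrained skyline over a single relation, assuming smaller values are preferred. The function names, the tuple-based point encoding, and the use of None for an unconstrained dimension are illustrative assumptions, not notation from this paper; the quadratic scan is only a reference baseline, not one of the cited skyline algorithms.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]
# Assumed encoding: one entry per dimension, (low, high) or None for "no constraint".
Constraint = Optional[Tuple[float, float]]


def dominates(a: Point, b: Point) -> bool:
    """a dominates b: no worse in every dimension, strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def satisfies(p: Point, constraints: Sequence[Constraint]) -> bool:
    """True if p lies inside every constrained range."""
    return all(c is None or c[0] <= v <= c[1] for v, c in zip(p, constraints))


def constrained_skyline(points: Sequence[Point],
                        constraints: Sequence[Constraint]) -> List[Point]:
    """Quadratic reference version: apply constraints, then keep undominated points."""
    cand = [p for p in points if satisfies(p, constraints)]
    return [p for p in cand if not any(dominates(q, p) for q in cand)]
```

For example, with points (50, 3), (60, 1), (70, 2), (40, 5), (10, 9) and a range [30, 100] on the first dimension only, the point (10, 9) is excluded by the constraint and (70, 2) is dominated by (60, 1).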
To shorten the response time of a query, we focus on two aspects. On one hand, we propose effective ways to decide the query forwarding and execution order between different sites, so as to obtain parallel query execution and more progressive result reporting. On the other hand, we generalize the single filtering point idea [12] to multiple filtering points, thus enhancing the filtering power and reducing the amount of data transmitted between remote sites. We propose benefit measurements and
heuristics to guide the selection of powerful multiple filtering points.

In contrast to previous work [11], [9], our problem definition does not assume the availability of an overlay network where different nodes hold disjoint data partitions. Instead, different relations on different sites may overlap. Therefore, these previous methods are not applicable to our problem.
IV. PARALLEL DISTRIBUTED SKYLINE PROCESSING
A. Motivation
In the previous work [12], a skyline query is forwarded among mobile peers via multiple hops, in accordance with the wireless nature of MANETs [13]. In a wired environment, by contrast, connections are end-to-end. Because of such wired connections between each pair of sites, a constrained skyline query can be sent out to all peer sites, and each site can then execute the query on its own data set simultaneously. This naive execution plan might benefit from the parallelism among all peer sites. However, it is very sensitive to large local skyline results, which typically, though not exclusively, arise from anti-correlated datasets [1] and lead to heavy communication cost. For this reason, it is favorable to identify unqualified candidates in local skylines early.

Following the data reduction principle of semijoins in distributed query processing [14], [15], Huang et al. [12] proposed to transfer a single local skyline point with the query request among mobile peers, which acts as a filter to identify unqualified points in peers' local skylines. In this work, we extend this single filtering point to multiple ones, since wired connections are much faster and more reliable than wireless channels. How to choose multiple filtering points will be detailed in Section V. In this section, we focus on finding a good distributed skyline query execution plan that minimizes the total execution time. We compare the naive approach with the parallel distributed approach with multiple filtering points that we propose, using a cost model that takes into account both local processing time and network transmission time.

We argue that the order in which a distributed skyline query is executed among multiple sites really matters, because an appropriate execution order can filter more data points earlier, thus reducing not only the communication cost but also the local processing cost in the subsequent execution. We intend to balance parallelism and filtering when processing a distributed skyline query (DSQ for short) so as to get a short overall response time.
B. Parallel Distributed Query Execution
To help determine the execution order among different sites, the query originator first asks each site Si for its MBRi, the n-dimensional minimum bounding box of the local relation Ri. If a site's MBR is disjoint from the constraint set C specified in the query, the site is not considered in the subsequent query processing. For each site whose MBRi overlaps with C, we only need to consider the intersection MBRi ∩ C. We call this intersection the reduced minimum bounding box and use rMBRi to represent it.

We proceed to partition all the remaining sites into several groups according to their rMBRs, such that the skyline computation in any one group does not depend on or affect the computation in any other group. Therefore, the given skyline query can be executed in parallel among those site groups, while within any individual site group the query is executed according to a local plan, which will be discussed in Section IV-B.2.
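The pruning step described above can be sketched as follows. This is a minimal illustration under assumed encodings: an MBR is a list of (low, high) pairs, a constraint is a (low, high) pair or None, and the helper names reduced_mbr and prune_sites are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Box = List[Tuple[float, float]]
Constraint = Optional[Tuple[float, float]]


def reduced_mbr(mbr: Box, constraints: Sequence[Constraint]) -> Optional[Box]:
    """Intersect an MBR with the constraint set; return None if they are disjoint."""
    out = []
    for (lo, hi), c in zip(mbr, constraints):
        if c is not None:
            lo, hi = max(lo, c[0]), min(hi, c[1])
        if lo > hi:          # empty range in this dimension: site can be pruned
            return None
        out.append((lo, hi))
    return out


def prune_sites(site_mbrs: Dict[str, Box],
                constraints: Sequence[Constraint]) -> Dict[str, Box]:
    """Keep only sites whose MBR overlaps the constraints, mapped to their rMBRs."""
    result = {}
    for name, mbr in site_mbrs.items():
        r = reduced_mbr(mbr, constraints)
        if r is not None:
            result[name] = r
    return result
```

Only the sites returned by prune_sites need to take part in the partitioning that follows.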
1) Incomparable Partition of Sites and Parallelism: Given two data sites Si and Sj (with reduced minimum bounding boxes rMBRi and rMBRj, respectively), we need to determine whether the skyline query can be executed against them in parallel such that the two results do not affect each other. Intuitively, we cannot do this if the boxes overlap, since points from different boxes in the overlapping region may dominate one another. Conversely, the absence of overlap does not necessarily permit parallelism. To deal with this problem, we first define what it means for two data sites to be "incomparable".
Definition 1: Two data sites Si and Sj are incomparable with respect to the constrained skyline query, iff rMBRi.min.DR ∩ rMBRj = ∅ and rMBRj.min.DR ∩ rMBRi = ∅.

In Definition 1 above, rMBRi.min.DR stands for the dominating region of rMBRi's minimum corner with respect to the constraints specified in the skyline query. It is obtained as the intersection of the original dominating region [12] and the set of constraints C, which will be formalized in Section V-A.1. When there is no confusion or ambiguity in the context, we also say that two reduced minimum bounding boxes rMBRi and rMBRj are incomparable, which actually means that their corresponding data sites are incomparable. With this definition, we have the following lemma.

Lemma 1: If two data sites Si and Sj are incomparable with respect to the constrained skyline query, it holds that ∀pti ∈ rMBRi, ∄ptj ∈ rMBRj s.t. pti ≺ ptj, and ∀ptj ∈ rMBRj, ∄pti ∈ rMBRi s.t. ptj ≺ pti.

The correctness of this lemma is guaranteed by the property that the minimum corner of an rMBR has the strongest dominating capability among all possible points in that rMBR. This lemma shows that a given skyline query can be executed against two incomparable sites in parallel without conflict between the two results. It is easy to see that this parallelism can be generalized to multiple sites, any pair of which are incomparable. A by-product of this parallelism is the progressiveness of skyline reporting, because the result of any single site among those pair-wise incomparable sites definitely appears in the final skyline.
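Definition 1 can be checked with a few comparisons per dimension. The sketch below assumes smaller values are preferred, closed boxes, and rMBRs that have already been clipped to the constraints, so the dominating region of a minimum corner only needs to be compared against the other box's upper bounds. Touching boundaries are conservatively treated as intersecting. The names dr_hits and incomparable are illustrative.

```python
from typing import List, Tuple

Box = List[Tuple[float, float]]


def dr_hits(box_i: Box, box_j: Box) -> bool:
    """True if the dominating region of box_i's minimum corner intersects box_j.

    The region extends upward from the minimum corner, so it reaches box_j
    exactly when the corner is not above box_j's upper bound in any dimension.
    """
    return all(lo_i <= hi_j for (lo_i, _), (_, hi_j) in zip(box_i, box_j))


def incomparable(box_i: Box, box_j: Box) -> bool:
    """Definition 1: neither minimum corner's dominating region meets the other box."""
    return not dr_hits(box_i, box_j) and not dr_hits(box_j, box_i)
```

In two dimensions this holds when one box lies strictly to the upper left of the other, which matches the intuition that neither site can contain a point dominating a point of the other.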
It is quite possible, however, that not all sites are pair-wise incomparable, as we do not assume any partition available among all sites. Thus, we partition the set of sites S into non-empty groups that satisfy: (1) Any pair of sites from different groups are incomparable; (2) For any pair of sites Si and Sj within the same non-singleton group, there exists a path Si, Sk, ..., Sj such that any adjacent pair along the path are not incomparable. Property 1 ensures that parallel execution
can be carried out across groups of sites, while Property 2 clusters together those sites whose local skyline results potentially conflict and provides opportunities for alternative optimizations. We call the partition an incomparable partition of sites.

The partition algorithm, icmpPartition for short, is shown in Figure 1. Before the real partitioning, intersections of site MBRs and the constraint set C are obtained, with unpromising sites being pruned (lines 1-3). The partitioning procedure starts by assigning the first candidate site to a singleton group (line 4). Then, each remaining site Si is compared with every current group, and any group containing sites not incomparable with Si is removed from the partition and merged into a temporary group 𝒮i (lines 7-9). At the end of each loop, Si is assigned to the adjusted partition, either in a new singleton group or in the temporary group with the other relevant sites found in the loop (line 10).
Algorithm icmpPartition(S, C)
Input:  S is the set of data sites
        C is the set of constraints in the skyline query
Output: an incomparable partition of S
// Adjust MBRs and prune unqualified sites
1.  for each Si ∈ S
2.      rMBRi = MBRi ∩ C;
3.      if (rMBRi == ∅) S = S - {Si};
// Compute the independent partition of all relevant sites
4.  ΠS = {{S1'}}; // S1' is the current 1st element in S
5.  for each Si ∈ S - {S1'}
6.      𝒮i = ∅;
7.      for each 𝒮j ∈ ΠS
8.          if (∃Sj ∈ 𝒮j s.t. Sj and Si are not incomparable)
9.              ΠS = ΠS - {𝒮j}; 𝒮i = 𝒮i ∪ 𝒮j;
10.     ΠS = ΠS ∪ {{Si} ∪ 𝒮i}
Fig. 1. Incomparable Partition Algorithm
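The partitioning loop of Figure 1 (lines 4-10) can be sketched in Python as below. This is an illustrative rendering rather than the authors' implementation: it assumes the rMBRs have already been computed and pruned (lines 1-3), starts from an empty partition instead of a singleton (which yields the same groups), and reuses the box-based incomparability test under the assumptions of smaller-is-better and closed boxes. The coordinates in the usage note are invented to mimic the layout of Figure 2.

```python
from typing import Dict, List, Tuple

Box = List[Tuple[float, float]]


def incomparable(box_i: Box, box_j: Box) -> bool:
    """Definition 1 on boxes, assuming smaller is better and closed boxes."""
    def dr_hits(a: Box, b: Box) -> bool:
        return all(lo_a <= hi_b for (lo_a, _), (_, hi_b) in zip(a, b))
    return not dr_hits(box_i, box_j) and not dr_hits(box_j, box_i)


def icmp_partition(rmbrs: Dict[str, Box]) -> List[List[str]]:
    """Group sites so that sites in different groups are pair-wise incomparable."""
    partition: List[List[str]] = []
    for site, box in rmbrs.items():
        merged = [site]                     # temporary group for this site
        kept = []
        for group in partition:
            # A group is absorbed if any member is not incomparable with the site.
            if any(not incomparable(rmbrs[s], box) for s in group):
                merged.extend(group)
            else:
                kept.append(group)
        kept.append(merged)
        partition = kept
    return partition
```

With hypothetical boxes A = [(0,1),(20,22)], B = [(3,4),(13,15)], C = [(3,8),(9,14)], D = [(7,8),(9,10)], E = [(4,9),(8,13)], F = [(12,14),(2,4)], G = [(13,15),(1,3)], the sketch produces the three groups {A}, {B, C, D, E} and {F, G}, even though B and D are incomparable with each other.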
Refer to the example in 2-dimensional space shown in Figure 2, where S = {A, B, C, D, E, F, G}. Each site's MBR is shown as a rectangle. Based on the incomparability properties above, S is partitioned into {{A}, {B, C, D, E}, {F, G}}. A forms a singleton group because it is incomparable with every other site. Though B and D are incomparable, they are assigned to the same group with C and E, because each of them is not incomparable with C (and E). F and G are incomparable with every other site but not with each other; therefore they constitute another group.
2) Intra-Group Query Execution: Within each group of the site partition obtained above, we need to consider the execution order among the relevant sites. Effective filtering is now our concern; in other words, we use filtering points when forwarding a DSQ between sites in the same group. Given a group 𝒮i = {Si1, ..., Sik}, we have two alternatives for the intra-group execution order.

Fig. 2. Incomparable Partition of Sites

A simple way is to decide a sequence of these sites and forward the DSQ with filtering points along the sequence. The sequence is obtained by sorting all sites Sij in non-descending order of the Euclidean distance between the minimum corners of rMBRij and the constraint set C. The reason for this lies in the intuition that a point nearer to the minimum corner of C (or the origin when a skyline query without constraints is concerned) is more powerful in terms of dominating capability. The sorting can be integrated into the partition algorithm in Figure 1, on line 9 (line 10), where 𝒮i and 𝒮j ({Si} and 𝒮i) are united.
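The linear ordering can be sketched as a plain sort on corner distance. The helper name linear_plan and the argument c_min (the minimum corner of the constraint region, or the origin when there are no constraints) are assumptions for illustration; math.dist requires Python 3.8 or later.

```python
import math
from typing import Dict, List, Sequence, Tuple

Box = List[Tuple[float, float]]


def linear_plan(group: Sequence[str], rmbrs: Dict[str, Box],
                c_min: Sequence[float]) -> List[str]:
    """Order sites by Euclidean distance from their rMBR minimum corner to c_min."""
    def corner_distance(site: str) -> float:
        corner = [lo for lo, _ in rmbrs[site]]
        return math.dist(corner, c_min)
    return sorted(group, key=corner_distance)
```

The first site in the returned list would act as the group head in the linear approach.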
The linear approach ignores any internal possibilities for parallelism. In Figure 2, for example, the query can be parallelized against sites B and D either before or after the other sites. To fully capture this potential for intra-group parallelism, a directed graph would be needed to store all sites in a group, which is quite complicated due to the nature of a graph. As a trade-off between complexity and parallelism, we use a tree to organize each group of sites. For this purpose, we adapt the partition algorithm in Figure 1 as follows. In each loop (lines 5-10) on site Si, instead of using a temporary group 𝒮i to contain the union of all groups removed from ΠS, Si is used as a root and all such groups are attached to it as children; finally, the tree is added into the partition as an updated group. After the partitioning, within every group the query is executed and forwarded with filtering points in a top-down manner starting from the root.

In either case, every group needs an assembly procedure to merge results from sites and remove false positives during the query processing. We install this procedure in the group head, which is responsible for returning the result to the query originator and is the first site in the linear approach or the root in the tree-based approach.
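A possible form of the assembly step at the group head is sketched below: each incoming local result is merged into the current group result, skipping duplicates (relations may overlap) and discarding false positives in either direction. The function name merge_skylines and the list-of-tuples encoding are assumptions; smaller values are taken as better.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """a dominates b: no worse everywhere, strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def merge_skylines(head_result: Sequence[Point],
                   incoming: Sequence[Point]) -> List[Point]:
    """Merge one site's reply into the group head's current skyline."""
    merged = list(head_result)
    for p in incoming:
        if p in merged:
            continue                                   # duplicate tuple
        if any(dominates(q, p) for q in merged):
            continue                                   # incoming false positive
        merged = [q for q in merged if not dominates(p, q)]  # evict dominated points
        merged.append(p)
    return merged
```

Calling merge_skylines once per reply keeps the head's result a valid skyline of everything received so far.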
C. Parallel Distributed Skyline Algorithm
Based on the discussion so far, our overall parallel distributed skyline algorithm, called PaDSkyline, is presented in Figure 3. When a site Sorg issues a distributed skyline query with constraints C, it first gets the incomparable partition of all sites by calling icmpPartition (line 1). After that, the query request is sent out to each group head in parallel (lines 2-3). The query request includes the constraints C, the network-addressable query originator identifier Sorg, and the intra-group query execution plan gi.plan. Next, Sorg enters a wait, during which it will be triggered by an incoming result reply
from a group head, until all of them have replied (lines 4-7). Upon receiving the reply from the head of group gi, Sorg directly reports the received result gi.result (lines 5-6), as it is incomparable with the results from other groups.
Algorithm PaDSkyline(S, C)
Input:  S is the set of data sites
        C is the set of constraints in the skyline query
Output: the constrained skyline
// get the incomparable partition of all sites
1.  ΠS = icmpPartition(S, C)
// parallel execution
2.  for each group gi ∈ ΠS in parallel
3.      send (C, Sorg, gi.plan) to gi's group head
// result merge, triggered by incoming result reply
4.  repeat
5.      receive result reply from a group gi's head
6.      report gi.result
7.  until all group heads have replied
Fig. 3. Parallel Distributed Algorithm
For the sake of simplicity, we designate Sorg as the head of its own site group. In that case, the network communication between Sorg and a group head degrades to inter-process or inter-thread communication on a single host.
After a group gi's head receives the query request, it carries out the intra-group skyline computation as shown in Figure 4. First, it computes its local constrained skyline Rg and captures the initial filtering points set Fc during the local computation (line 1). How to select multiple filtering points will be detailed in Section V. Next, it sends the query request on to the downstream site(s) in the group query plan (line 2). Note that here the query request includes the constraints C, the network-addressable group head's identifier Sg, the reduced query plan plan', and the initial filtering points set Fc. Each site Si receiving the query request computes its local result Si.R, returns Si.R to site Sg, and sends the query request with updated filtering points and a further reduced plan to its own downstream site(s). Meanwhile, Sg keeps receiving Si.R and merging it with Rg until all group members have replied (lines 3-6).
During query execution within a site group, the filtering points set Fc is dynamically changed, as will be covered in Section V-C. The query plan is also dynamically reduced as follows. Each site, including the group head, removes itself from the plan. If the plan is a single sequence, the reduced plan is the remaining sub-sequence, and its head is exactly the target downstream site. If the plan is a tree, the removal may result in more than one subtree. In that case, each subtree is sent to its corresponding target downstream site, which is exactly the root of that subtree.
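The plan reduction can be illustrated with a small sketch. The plan encoding used here, a pair (site, list of child plans), is a hypothetical choice that covers both cases: a linear sequence is simply a tree whose nodes have at most one child. The helper names chain and reduce_plan are likewise illustrative.

```python
from typing import List, Tuple

# Hypothetical plan encoding: (site, [child plans]).
Plan = Tuple[str, list]


def chain(sites: List[str]) -> Plan:
    """Build a linear plan from an ordered, non-empty list of sites."""
    plan: Plan = (sites[-1], [])
    for site in reversed(sites[:-1]):
        plan = (site, [plan])
    return plan


def reduce_plan(plan: Plan) -> List[Tuple[str, Plan]]:
    """The current site removes itself; each remaining sub-plan goes to its root.

    Returns (target downstream site, reduced plan) pairs. An empty list means
    the current site is the last one on its branch.
    """
    _current_site, children = plan
    return [(child[0], child) for child in children]
```

For a linear plan the result has at most one entry, while for a tree rooted at C with children B and D the head would forward one reduced plan to each of B and D.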
Algorithm groupSkyline(C, Sorg, plan)
Input:  C is the set of constraints in the skyline query
        Sorg is the query originator site identifier
        plan is the query execution plan in the group
Output: the constrained skyline within the group
1.  compute local skyline Rg,
    and get the initial filtering points set Fc
2.  send (C, Sg, plan', Fc) to next site(s) in plan
// result merge, triggered by incoming result reply
3.  repeat
4.      receive result reply from a group member Si
5.      merge Si.R with Rg,
        removing duplicates and false positives
6.  until all group members have replied
7.  return Rg to Sorg

Fig. 4. Intra-Group Algorithm

D. Cost Model
Suppose that for a given local relation Ri, the time to compute its local skyline SKi is T1(Ri). We use |SKi| to represent the size in bytes of a local skyline. The time to transfer SKi between two sites is determined by its size, together with network conditions such as bandwidth.
In the naive approach without filtering (see Section IV-A), the time for the query originator Sorg to get the result from a peer site Si is the sum of the local computation time and the network transmission time, as shown in Formula 1.

$$T_i = T_1(R_i) + T_{tran_{i,org}}(|SK_i|) \tag{1}$$

Due to the parallelism of the naive query execution, its overall response time Tnav is determined by the peer site whose result reaches Sorg latest and the local assembly time on Sorg, as shown in Formula 2.

$$T_{nav} = \max\{T_i \mid 1 \le i \le m\} + T_{asm} \tag{2}$$

Suppose an incomparable partition ΠS = {𝒮1, ..., 𝒮k} is produced for a given distributed skyline query with constraints. Then its overall response time Tpdq is determined by the group whose result reaches Sorg latest, as shown in Formula 3. Note that in the parallel distributed query processing, we do not need a global assembly on Sorg.

$$T_{pdq} = \max\{T_{\mathcal{S}_i} \mid 1 \le i \le k\} \tag{3}$$

In the formula above, $T_{\mathcal{S}_i}$ is the response time of group 𝒮i, i.e., the time lapse before Sorg receives the result from 𝒮i's group head hi. It is detailed in Formula 4. $T_{h_i}$ is the local response time, i.e., the time lapse before hi gets all results from the sites in the group. $T_{asm_{h_i}}$ is the local assembly time, and $SK_{\mathcal{S}_i}$ is the result sent to Sorg.

$$T_{\mathcal{S}_i} = T_{h_i} + T_{asm_{h_i}} + T_{tran_{h_i,org}}(|SK_{\mathcal{S}_i}|) \tag{4}$$
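Formulas 1 to 4 translate directly into two small functions. The sketch below is only a numerical illustration of the cost model: the function names are hypothetical, the transmission times are passed in as already-evaluated numbers rather than derived from |SKi| and bandwidth, and the figures in the usage note are invented.

```python
from typing import Iterable, Sequence, Tuple


def naive_response_time(local_times: Sequence[float],
                        transfer_times: Sequence[float],
                        assembly_time: float) -> float:
    """Formulas 1 and 2: slowest site (compute + transfer) plus assembly at Sorg."""
    per_site = [t_local + t_tran for t_local, t_tran in zip(local_times, transfer_times)]
    return max(per_site) + assembly_time


def parallel_response_time(groups: Iterable[Tuple[float, float, float]]) -> float:
    """Formulas 3 and 4: slowest group, with no global assembly at Sorg.

    Each group is (T_h, T_asm_h, T_tran): local response time at the head,
    local assembly time, and transfer time of the group result to Sorg.
    """
    return max(t_h + t_asm + t_tran for t_h, t_asm, t_tran in groups)
```

For instance, with illustrative local times 2.0, 3.0, 1.0, transfer times 4.0, 1.0, 2.0 and an assembly time of 1.5, the naive estimate is 7.5 time units; two groups with (3.0, 0.5, 1.0) and (2.0, 0.25, 0.5) give a parallel estimate of 4.5.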
E. Pre-Computation of Local Skylines
If a skyline query has no constraints attached, the local skyline result SKi on any single site Si can be directly used to compare with filtering points. Similarly, if a site Si's data set is fully contained within the constraints {C1, C2, ..., Cn}
specified in a skyline query, the local skyline SKi is still directly useful. With this observation, we pre-compute for each site Si its local skyline SKi without any constraint, and store SKi in the expectation that it can be reused by future queries. With this local optimization we hope to reduce the local skyline query processing time, thus shortening the overall query response time.
V. ENHANCEMENT OF FILTERING POINTS
In [12], one single filtering point is transferred and changed from peer to peer. During the query forwarding and processing, the single filtering point is dynamically changed once another point is found to be more powerful in filtering out unqualified candidates.

By contrast, as the network connections in this work are wired, with considerably steadier and higher bandwidth than wireless MANETs, we turn to using multiple filtering points among sites, expecting to get higher filtering power. We then need to decide which skyline points should be used as filtering points to achieve the best data reduction rate in the distributed skyline query processing.

In this section, we first formalize the dominating region of multiple skyline points with respect to constraints; next we address how to select multiple filtering points initially; and then we proceed to discuss how to dynamically change filtering points during the query processing.
A. Dominating Region of K Filtering Points

Filtering points are selected from the local skyline result initially obtained. Suppose the initial skyline result is SKinit = {s1, s2, ..., sl}; we need to select K (≤ l) points from it as the multiple filtering points. The concrete value of K is constrained by l and affects both the network communication cost and the pruning capacity obtained. It is not possible to give a generic optimal value of K for all situations. We instead study this issue experimentally in Section VI-B.
1) Dominating Region with Constraints: When K equals 1, we select the point with the maximum potential of dominating others, which is measured by the volume of the point's dominating region, i.e., the hyper-rectangle determined by the point and the maximum corner of the data space [12]. Figure 5 shows a 2-dimensional example.
Fig. 5. Dominating Region without Constraints
In the presence of constraints specified for relevant dimensions, the definition of the dominating region needs to be amended accordingly. As shown in Figure 6, suppose a constraint range [l1, u1] is specified on dimension P1. Consequently, for a point tpj = <pj1, ..., pjn>, its dominating region shrinks to the smaller one determined by its own value, the constraint upper bound u1, and dimension P2's upper bound b2.
Fig. 6. Dominating Region with Constraints
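Suppose the value range on dimension Pk is [sk, bk] in terms of all Ris. Then the volume of tuple tpj's dominating region with respect to the constraints is

$$VDR_j = \prod_{k=1}^{n} (\bar{b}_k - p_{jk}) \tag{5}$$

where

$$\bar{b}_k = \begin{cases} b_k, & \text{if } C_k = \emptyset \\ u_k, & \text{if } l_k \le p_{jk} \le u_k \\ p_{jk}, & \text{otherwise} \end{cases}$$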
Note that it does not make sense to define the dominating region for points outside the region specified by all constraints, because such points do not appear in the skyline result and will not be considered when we select filtering points. Formula 5 ensures that the VDR value is zero for any point not covered by the given constraints, which prevents it from being chosen as a filtering point.
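Formula 5 can be evaluated with a single pass over the dimensions. The following sketch assumes smaller values are preferred, represents an unconstrained dimension by None, and uses the hypothetical names upper_bounds and vdr; space_max holds the data-space upper bound bk of each dimension.

```python
from typing import List, Optional, Sequence, Tuple

Constraint = Optional[Tuple[float, float]]


def upper_bounds(point: Sequence[float], space_max: Sequence[float],
                 constraints: Sequence[Constraint]) -> List[float]:
    """Per-dimension upper bound of the dominating region, following Formula 5."""
    bounds = []
    for p, b, c in zip(point, space_max, constraints):
        if c is None:
            bounds.append(b)        # no constraint: data-space bound b_k
        elif c[0] <= p <= c[1]:
            bounds.append(c[1])     # inside the range: constraint bound u_k
        else:
            bounds.append(p)        # outside the range: zero extent
    return bounds


def vdr(point: Sequence[float], space_max: Sequence[float],
        constraints: Sequence[Constraint]) -> float:
    """Volume of the dominating region of a point with respect to the constraints."""
    volume = 1.0
    for p, ub in zip(point, upper_bounds(point, space_max, constraints)):
        volume *= ub - p
    return volume
```

As a check, with data-space bounds (10, 10) and a range [2, 8] on the first dimension only, the point (3, 4) has VDR (8 - 3) * (10 - 4) = 30, whereas (9, 4) violates the constraint and gets VDR 0.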
Given a collection of skyline points {s1, s2, ..., sK}, their dominating regions differ only in their own positions in the data space. Thus, for any si in that collection, its dominating region is represented by a hyper-rectangle HRi = ([pi1, b̄1], ..., [pin, b̄n]), where each upper bound b̄k is determined as Formula 5 describes.
2) Fused Dominating Region: For multiple filtering points,we need to consider their overall filtering capability whichis measured by the volume of their fused dominating region.Given any two distinct skyline points si and sj, the volumeof their fused dominating region is
\[ VDR_{i,j} = VDR_i + VDR_j - \prod_{k=1}^{n} (\bar{b}_k - \max(p_{ik}, p_{jk})) \tag{6} \]
Figure 7 shows an example of the fused dominating region of two skyline points $tp_i$ and $tp_j$, where the constraint $[l_1, u_1]$ is specified on dimension $P_1$.
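As a small worked instance of Formula 6, consider two hypothetical points $s_i = (2, 5)$ and $s_j = (4, 3)$ in a 2-dimensional space with $b_1 = b_2 = 10$ and the constraint $[1, 6]$ on $P_1$. These numbers are chosen only for illustration. Both points satisfy the constraint, so the effective upper bounds are 6 on $P_1$ and 10 on $P_2$:

\[ VDR_i = (6 - 2)(10 - 5) = 20, \qquad VDR_j = (6 - 4)(10 - 3) = 14 \]
\[ VDR_{i,j} = 20 + 14 - (6 - \max(2, 4))(10 - \max(5, 3)) = 34 - 10 = 24 \]

The subtracted term is the volume of the region dominated by both points, which would otherwise be counted twice.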
Authorized licensed use limited to: Peking University. Downloaded on December 15, 2009 at 20:31 from IEEE Xplore. Restrictions apply.
Fig. 7. Fused Dominating Region
By applying the Inclusion-Exclusion principle further, we can compute the volume of the fused dominating region of an arbitrary number (K) of skyline points.
\[ VDR_{1..K} = \sum_{i=1}^{K} VDR_i + \sum_{j=2}^{K} (-1)^{j-1} \sum_{1 \le i_1 < \cdots < i_j \le K} \prod_{k=1}^{n} (\bar{b}_k - \max(p_{i_1 k}, \ldots, p_{i_j k})) \tag{7} \]
We call it the fused VDR value for K skyline points. The computational complexity of Formula 7 is $O(2^K)$, which is quite high. What makes it even more complex is that the optimal selection of multiple filtering points is made by choosing the points with the maximum volume of fused dominating region. For that purpose, we need to enumerate all $\binom{l}{K}$ K-combinations of $SK_{init}$, whose computational complexity is $O((\frac{el}{K})^K)$. Hence the total complexity of computing the fused VDR values for all K-combinations is $O((\frac{2el}{K})^K)$. This high complexity makes the optimal selection of K filtering points with the maximum fused VDR value undesirable. Therefore, we turn to alternatives that make more computationally efficient selections at the cost of the quality of the results.
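The inclusion-exclusion computation of Formula 7 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: all points are assumed to satisfy the constraints, `bounds` holds the effective upper bounds from Formula 5, and the function name `fused_vdr` is hypothetical.

```python
from itertools import combinations
from math import prod

def fused_vdr(points, bounds):
    """Volume of the union of the points' dominating regions (Formula 7).

    points -- list of skyline points that all satisfy the constraints
    bounds -- effective upper bound per dimension, as given by Formula 5
    """
    total = 0
    for j in range(1, len(points) + 1):
        sign = (-1) ** (j - 1)
        for subset in combinations(points, j):
            # The regions of a subset intersect in the hyper-rectangle whose
            # lower corner is the coordinate-wise maximum of the subset.
            corner = [max(coords) for coords in zip(*subset)]
            total += sign * prod(b - c for b, c in zip(bounds, corner))
    return total
```

The loop visits every non-empty subset of the K points, which is where the exponential cost in K comes from.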
B. Selection of K Filtering Points
We proceed to present two heuristics to guide the selection of multiple filtering points.
1) Heuristic I: Maximal Sum of VDRs: The first heuristic for selecting multiple filtering points is straightforward. It maximizes the sum of the K VDR values over all possible choices. To accomplish this, we sort the points in $SK_{init}$ in non-ascending order of their VDR values and then pick the top-K ones. Alternatively, the sorting can be integrated into the skyline computation, which produces a sorted $SK_{init}$ for easy picking of the top-K points. Either way, the time complexity depends on the sorting method used, which usually does not go beyond $O(l^2)$.
This heuristic, MaxSum for short, actually simplifies the computation by ignoring the overlapping between different skyline points' dominating regions. In other words, the overlapping between dominating regions determines the accuracy of this approximation. The smaller the overlapping regions are, the more accurate the method will be.
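A minimal Python sketch of the MaxSum idea is given below. It assumes that every point in the initial skyline satisfies the constraints and that `bounds` holds the effective upper bounds of Formula 5; the function name `max_sum` is hypothetical.

```python
from math import prod

def max_sum(skyline, bounds, k):
    """Pick the k skyline points with the largest individual VDR values."""
    def vdr(p):
        # Volume of the hyper-rectangle between p and the effective bounds.
        return prod(b - x for b, x in zip(bounds, p))
    # Sort in non-ascending order of VDR and keep the top k.
    return sorted(skyline, key=vdr, reverse=True)[:k]
```

Because each point is scored independently, two chosen points may dominate largely the same region, which is the overlap that this heuristic ignores.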
2) Heuristic II: Maximal Distance Between Points: In the second heuristic we intend to take into account the topology between filtering points, to reduce the overlapping faced by the first heuristic. Distance is a simple metric to help consider this. Intuitively, the farther two skyline points are apart, the less their dominating regions overlap. Hence we propose a greedy heuristic, MaxDist for short, that maximizes the distance between filtering points.
The algorithm of this heuristic is shown in Figure 8. Initially, it picks from $SK_{init}$ the two points between which the distance is the largest among all pairs (line 2). Then, it incrementally selects points from $SK_{init}$ and adds them to the filtering set, until K filtering points are obtained. In every incremental step, the point with the maximal sum of distances to all current filtering points is selected (line 5). The time complexity of this algorithm is $O(l^2)$, much less than that of the exact approach in Section V-A.2.
Algorithm maxDist($SK_{init}$, K)
Input: $SK_{init}$ is the initial skyline
       K is the number of filtering points needed
Output: a set of multiple filtering points
1. $F_{flt} = \emptyset$;
2. Pick $s_i$ and $s_j$ from $SK_{init}$ satisfying
       $|s_i, s_j| \ge |s_{i'}, s_{j'}|$, $\forall\, s_{i'}, s_{j'} \in SK_{init}$;
3. $F_{flt} = \{s_i, s_j\}$; $SK' = SK_{init} - \{s_i, s_j\}$;
4. while ($|F_{flt}| < K$)
5.     Pick $s_i$ from $SK'$ satisfying
           $\sum_{s_j \in F_{flt}} |s_i, s_j| \ge \sum_{s_j \in F_{flt}} |s_{i'}, s_j|$, $\forall\, s_{i'} \in SK'$;
6.     $F_{flt} = F_{flt} \cup \{s_i\}$; $SK' = SK' - \{s_i\}$;
7. return $F_{flt}$
Fig. 8. Algorithm of Heuristic MaxDist
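The greedy procedure of Figure 8 can be sketched in Python as follows. The paper does not fix a distance metric in this passage, so Euclidean distance is an assumption here, as is the requirement that K is at least 2 and that the skyline points are distinct. The function name `max_dist` is illustrative.

```python
from itertools import combinations
from math import dist

def max_dist(skyline, k):
    """Greedy MaxDist selection (Fig. 8), assuming Euclidean distance and k >= 2."""
    points = list(skyline)
    if len(points) <= k:
        return points
    # Line 2: start from the pair of points that are farthest apart.
    first, second = max(combinations(points, 2), key=lambda pair: dist(*pair))
    chosen = [first, second]
    remaining = [p for p in points if p not in chosen]
    # Lines 4-6: repeatedly add the point whose summed distance to the
    # already chosen filtering points is the largest.
    while len(chosen) < k:
        best = max(remaining, key=lambda p: sum(dist(p, q) for q in chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Finding the initial pair inspects all pairs of skyline points, which accounts for the quadratic term in the running time.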
C. Dynamic Update of Multiple Filtering Points
After a site $S_i$ receives a query request with a set of filtering points $F_{flt}$, it will execute a local query processing. To take advantage of the filtering points, the local processing can be implemented in two ways. In an integrated way, $F_{flt}$ is checked against every candidate point pt met during the skyline computation, any dominated pt is ignored, and any dominated $s_i$ in $F_{flt}$ is removed from the set. In a separate way, a local skyline is computed first, and then it is compared with $F_{flt}$ to filter out the unqualified candidates and the dominated $s_i$s. The integrated way is seamlessly applicable to those centralized skyline algorithms that are not based on data transformation [1], [2], [3], [4], [5], whereas the separate way is applicable to all existing skyline algorithms.
For either way, we get a local skyline result $SK_i$ and the reduced set of filtering points $F'_{flt}$. Now we need to update $F_{flt}$ with points in $SK_i$, so that the dominating capability of the multiple filtering points is maintained or increased. A simple yet efficient way is to treat all points in these two sets equally, and to pick K points from $SK_i \cup F'_{flt}$ using the heuristics proposed in Section V-B. To update $F_{flt}$ based on Heuristic I (MaxSum), $SK_i$ is sorted in non-ascending order of fused VDR values and merged with $F'_{flt}$ to get the top-K points as the new filtering points. To update $F_{flt}$ based on Heuristic II (MaxDist), the algorithm in Figure 8 is adapted to pick K points from the union of $F'_{flt}$ and $SK_i$.
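The separate way of local processing, followed by the update of the filtering points, might look like the sketch below. It assumes that smaller values are better in every dimension and takes the selection heuristic as a parameter; the names `dominates`, `filter_and_update`, and `select` are hypothetical.

```python
def dominates(p, q):
    """True if p is no worse than q everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def filter_and_update(local_skyline, filters, k, select):
    """Prune a local skyline with the filtering points, then refresh them.

    select(candidates, k) is any selection heuristic, e.g. MaxSum or MaxDist.
    Returns the reduced local skyline and the new set of filtering points.
    """
    # Drop local skyline points dominated by some filtering point.
    reduced = [p for p in local_skyline
               if not any(dominates(f, p) for f in filters)]
    # Drop filtering points dominated by some local skyline point.
    surviving = [f for f in filters
                 if not any(dominates(p, f) for p in local_skyline)]
    # Treat both sets equally and pick k points from their union.
    candidates = list(dict.fromkeys(reduced + surviving))
    return reduced, select(candidates, k)
```

Passing the heuristic as a function keeps the pruning step independent of whether MaxSum or MaxDist is used for the update.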
VI. EXPERIMENTAL STUDY
In this section, we evaluate our skyline query mechanism with extensive experiments. We study the performance of our skyline computing approach PaDSkyline in comparison with the Naive and Random approaches. In the Naive approach, a query originator directly sends its query request to each site, receives the results, and merges them with its local result, until all sites have replied. The Naive approach does not execute queries in any specific order of sites, nor does it employ any filtering points. The Random approach is almost the same as our proposed PaDSkyline method (with the two heuristics MaxSum and MaxDist), except that it selects filtering points randomly.
In the experiments we use two kinds of synthetic datasets: independent and anti-correlated datasets. The parameters used in the experiments are listed in Table I. Unless stated otherwise, the default parameter values, given in bold, are used. In addition, we use a real-life dataset of NBA players' season statistics from 1949 to 2003 (see footnote 1), which contains 16,644 records of 17 attributes and approximates a correlated data distribution.
Parameter                  Setting
Dimensionality             2, 3, 4, 5
Number of sites            1,000, 2,000, ..., 10,000
Filter points percentage   10%, 20%, ..., 50%, ..., 90%
Cardinality of each site   1,000
Number of queries          50
TABLE I
PARAMETERS USED IN EXPERIMENTS
All simulation experiments are conducted on a Linux server with two Intel(R) Xeon(TM) 2.80GHz processors and 1.0GB of RAM. We consider three performance metrics. First, we study Data Reduction Efficiency, the efficiency of multiple filtering points for local skyline computation, in terms of the data reduction rate (DRR), which was proposed in [12]. The DRR is the proportion of data points removed by the filtering points to the number of points in the unreduced skylines. It is defined as
\[ DRR = \frac{\sum_{i=1,\, i \ne org}^{m} (|unredSK_i| - |redSK_i| - K_i)}{\sum_{i=1,\, i \ne org}^{m} |unredSK_i|} \tag{8} \]
where $K_i$ is the number of filtering points sent to a processing site, and m is the network size. Second, we consider Response Time, which records the overall query processing time. Third, we investigate Precision, which indicates the quality of the skyline points returned to the query site for a certain query. Except in the Naive approach, each group head is regarded as a query site for its group members. Precision is formally defined as $P = |A| / |B|$, where A is the set of exact skyline points a query site finally obtains, and B is the set of resulting points returned to the query site in the distributed environment. P captures the fraction of relevant points among the points that a query processing approach identifies and returns to the user.
1 http://databasebasketball.com
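The two ratio metrics can be sketched in Python as follows. The DRR function assumes the reading of Formula 8 in which the $K_i$ filtering points sent to a site are subtracted from the points saved at that site, and the precision function assumes P is the ratio of $|A|$ to $|B|$. The function names and the input layout are hypothetical.

```python
def data_reduction_rate(sites):
    """DRR over all processing sites other than the query originator.

    sites -- iterable of (unreduced_size, reduced_size, num_filters) triples,
             one per processing site
    """
    sites = list(sites)
    # Points saved at each site, minus the filtering points shipped to it.
    saved = sum(unred - red - k for unred, red, k in sites)
    total = sum(unred for unred, _, _ in sites)
    return saved / total

def precision(exact_skyline, returned_points):
    """Ratio of exact skyline points to all points returned to the query site."""
    return len(exact_skyline) / len(returned_points)
```

Under this reading, sending more filtering points lowers DRR unless they remove correspondingly more local skyline points.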
A. Effect of Network Size
In this experiment, we consider the effect of network size on the performance of the different methods. Note that the Naive approach returns all skyline candidates to the query site, and hence its DRR is zero. Figures 9(a), 10(a) and 11(a) show that MaxSum yields the best DRR performance. Carefully selected filtering points can significantly reduce data communication when the skyline query is forwarded between sites; for example, the MaxSum approach obtains 80% higher DRR than the Random approach. Regarding MaxSum and MaxDist, there is a tradeoff between the sizes of dominating regions and the overlap of dominating regions when we select filtering points. From the results, we can see that MaxSum has better filtering capability overall.
Figures 9(b), 10(b) and 11(b) show the results for response time. Clearly, the Naive method is the worst of the four, because it requests the local skyline points from all data sites in the network, which introduces higher communication cost and a heavier computational workload at the query originator site to merge all the unfiltered answers. MaxSum and MaxDist perform better than Random because of their stronger filtering capabilities, which result in a smaller amount of transferred data and shorter response time. However, the gap is narrowed because random selection of filtering points is computationally faster than the two heuristics.
The experimental results for precision, the indicator of the quality of skyline points returned to query sites, are reported in Figures 9(c), 10(c) and 11(c). On average, MaxSum is about 200% better than Naive and 80% better than Random. This significant performance gain is again attributed to our carefully selected multiple filtering points, which increase the ratio of final skyline points to all points transferred through the network.
As our approaches outperform the competitors in all experimental cases, in the sequel we show the results on the independent dataset only, due to space constraints.
Fig. 9. Performance on Independent Datasets with Different Network Sizes: (a) Data Reduction Rate, (b) Response time, (c) Precision, each plotted against the Number of Sites.
Fig. 10. Performance on AntiCorrelated Datasets with Different Network Sizes: (a) Data Reduction Rate, (b) Response time, (c) Precision, each plotted against the Number of Sites.
Fig. 11. Performance on NBA Dataset with Different Network Sizes: (a) Data Reduction Rate, (b) Response time, (c) Precision, each plotted against the Number of Sites.
B. Effect of Filtering Points
How the performance varies with the number of filtering points is reported in Figure 12. Referring to Figure 12(a), a larger number of filtering points does not necessarily ensure a higher DRR. This is because the filtering points themselves also count in the DRR calculation in Formula 8, and the collective dominating capability of multiple points does not increase enough to offset that effect, especially for the Random approach, which gives no special consideration to point selection.
More filtering points also prolong the response time, as shown in Figure 12(b), because they need more time to be transferred from site to site through the network. On the other hand, more filtering points increase the average precision, as shown in Figure 12(c), because they do rule out more unqualified intermediate answers (though not enough to maintain or increase DRR) which otherwise would be returned to the query sites, i.e., the group heads.
C. Effect of Dimensionality
The effect of dataset dimensionality is reported in Figure 13. As the dimensionality increases, DRR decreases for all non-naive approaches, as shown in Figure 13(a), and average precision increases for all approaches including Naive, as shown in Figure 13(c). The cause of this discrepancy is the cardinality of the final skyline, which grows markedly as dimensionality increases [16], [17]. A larger final skyline results in relatively fewer unqualified intermediate answers, which explains the decrease of DRR. A larger final skyline also produces more valid points returned to the query site, and thus increases the average precision. Finally, a larger final skyline causes more data points to be transferred through the network, which leads to the longer response time seen in Figure 13(b).
Fig. 12. Performance with Different Numbers of Filtering Points: (a) Data Reduction Rate, (b) Response time, (c) Average Precision, each plotted against the Percentage of Filter Points.
Fig. 13. Performance with Varying Dimensionality: (a) Data Reduction Rate, (b) Response time, (c) Average Precision, each plotted against Dimensions.
VII. CONCLUSION
This paper addresses constrained skyline queries in a distributed network environment without overlay structures. First, a specific partition algorithm is proposed which divides all relevant sites into groups, such that a given query can be executed in parallel among all site groups. For the query execution within a single group of sites, we propose query forwarding plans between sites and designate a group head responsible for merging query results. Based on the inter-group and intra-group query execution mechanism, a parallel distributed skyline algorithm is given, together with a cost model for cost estimation. As a substantial enhancement, we discuss how to select local skyline points as filtering points, which are used in distributed query processing to prevent more data from being transmitted through the network. Extensive experiments are conducted, and the results demonstrate that our proposals are efficient in distributed query processing and robust to the scales of both data and network size.
ACKNOWLEDGMENT
This research was supported by the National Natural Science Foundation of China under Grant No. 60603045 and the National Grand Fundamental Research 973 Program of China under Grant No. 2004CB318204.
REFERENCES
[1] S. Börzsönyi, D. Kossmann, and K. Stocker, "The skyline operator," in Proc. ICDE, 2001, pp. 421-430.
[2] J. Chomicki, P. Godfrey, J. Gryz, and D. Liang, "Skyline with presorting," in Proc. ICDE, 2003, pp. 717-816.
[3] P. Godfrey, R. Shipley, and J. Gryz, "Maximal vector computation in large data sets," in Proc. VLDB, 2005, pp. 229-240.
[4] D. Kossmann, F. Ramsak, and S. Rost, "Shooting stars in the sky: An online algorithm for skyline queries," in Proc. VLDB, 2002, pp. 275-286.
[5] D. Papadias, Y. Tao, G. Fu, and B. Seeger, "An optimal and progressive algorithm for skyline queries," in Proc. ACM SIGMOD, 2003, pp. 467-478.
[6] K.-L. Tan, P.-K. Eng, and B. C. Ooi, "Efficient progressive skyline computation," in Proc. VLDB, 2001, pp. 301-310.
[7] G. Hjaltason and H. Samet, "Distance browsing in spatial databases," ACM TODS, vol. 24, no. 2, pp. 265-318, 1999.
[8] W.-T. Balke, U. Güntzer, and J. X. Zheng, "Efficient distributed skylining for web information systems," in Proc. EDBT, 2004, pp. 256-273.
[9] P. Wu, C. Zhang, Y. Feng, B. Y. Zhao, D. Agrawal, and A. E. Abbadi, "Parallelizing skyline queries for scalable distribution," in Proc. EDBT, 2006, pp. 112-130.
[10] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, "A scalable content-addressable network," in Proc. ACM SIGCOMM, 2001, pp. 161-172.
[11] S. Wang, B. C. Ooi, A. K. H. Tung, and L. Xu, "Efficient skyline query processing on peer-to-peer networks," 2007, pp. 1126-1135.
[12] Z. Huang, C. S. Jensen, H. Lu, and B. C. Ooi, "Skyline queries against mobile lightweight devices in MANETs," in Proc. ICDE, 2006, p. 66.
[13] S. Basagni, M. Conti, S. Giordano, and I. Stojmenovic, Eds., Mobile Ad Hoc Networking. New Jersey: Wiley-IEEE Press, 2004.
[14] P. A. Bernstein, N. Goodman, E. Wong, C. L. Reeve, and J. B. Rothnie Jr., "Query processing in a system for distributed databases (SDD-1)," ACM TODS, vol. 6, no. 4, pp. 602-625, 1981.
[15] C. T. Yu and C. C. Chang, "Distributed query processing," ACM Computing Surveys, vol. 16, no. 4, pp. 399-433, 1984.
[16] S. Chaudhuri, N. N. Dalvi, and R. Kaushik, "Robust cardinality and cost estimation for skyline operator," in Proc. ICDE, 2006, p. 64.
[17] P. Godfrey, "Skyline cardinality for relational processing," in Proc. FoIKS, 2004, pp. 78-97.