[ieee 2010 2nd international workshop on database technology and applications (dbta) - wuhan, china...

978-1-4244-6977-2/10/$26.00 ©2010 IEEE


[IEEE 2010 2nd International Workshop on Database Technology and Applications (DBTA) - Wuhan, China (2010.11.27-2010.11.28)]

A Two-Phased Refinement Algorithm to Process Reverse Skylines without Pre-Processing

Zhonghe Li, Ah Han, Dongseop Kwon, Youngbae Park

Dept. of Computer Engineering, Myongji University, Yongin-si, Korea

Abstract: Reverse skyline queries are difficult to process because checking candidates requires a massive amount of computation. For this reason, existing algorithms for reverse skylines are generally based on pre-processing. Although pre-processing reduces the number of computations at query time, the pre-processed results must be re-computed every time the data change. To overcome this limitation, we propose an efficient algorithm that reduces the number of computations in processing reverse skyline queries with a two-phased refinement step. Before refining the final result from the candidates, the proposed algorithm performs an additional refinement step that decreases the number of candidates, so that it can handle reverse skyline queries more effectively without any pre-processing. Because it is not based on pre-processing, our algorithm is more suitable for frequently updated data. Experimental results show that the proposed algorithm performs better than the existing pre-processing-based ones.

Keywords: Skyline; Reverse skyline; Query processing

I. INTRODUCTION

Because a skyline query finds a set of interesting objects in a large dataset, it is useful in many applications, including decision support systems and data warehouse systems. Given a set P of d-dimensional points, the skyline operator returns all points in P that are not dominated by any other point [1-2]. A point pi dominates a point pj if the coordinate of pi is not greater than that of pj in every dimension and is strictly smaller in at least one dimension. For example, suppose that a used real-estate database has records of houses with the attributes price, age, distance to station, and size. Fig 1(a) plots each house by its price and distance to station. If a customer wants to find a cheap house near a station, the result is (H1, H6, H7, and H9), which is the original skyline [3-4] shown in Fig 1(a). If a customer instead prefers a house with a price of $600 and a distance of 600 m, represented by Q in Fig 1(b), the interesting houses may differ from the original skyline. The global skyline (GSL), shown in Fig 1(b), is the answer for this situation.
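The dominance test and the skyline operator defined above can be sketched as follows. This is a minimal quadratic-time illustration, not the method of any cited paper; the function names and the sample (price, distance) pairs are assumptions made for the example and do not correspond to the houses in Fig 1.

```python
from typing import List, Tuple

Point = Tuple[float, ...]

def dominates(a: Point, b: Point) -> bool:
    """a dominates b: a is not greater in any dimension and strictly smaller in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def skyline(points: List[Point]) -> List[Point]:
    """Return the points that are not dominated by any other point."""
    return [p for p in points
            if not any(dominates(o, p) for o in points if o is not p)]

# Hypothetical (price, distance) pairs; smaller is better in both dimensions.
houses = [(3, 9), (5, 4), (9, 2), (6, 6), (8, 8)]
cheap_and_near = skyline(houses)
print(cheap_and_near)
```

Here (6, 6) and (8, 8) are dominated by (5, 4), so only the first three houses remain in the skyline.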

The reverse skyline of a query point is the set of objects whose dynamic skylines contain the query point, that is, the set of objects that may be interested in the query point [1]. Suppose that a real-estate agent has information on customers' favorite houses and wants to sell a house, represented by Q in Fig 1(c). A customer whose favorite house is H9 in Fig 1(c) may be interested in the house Q because the dynamic skyline of H9 contains Q. Customers whose favorite houses are in (H2, H5, H7, H8, H9, H10, and H12) may also be interested in Q for the same reason. This set is the reverse skyline of Q, shown in Fig 1(d).

Figure 1. Examples of skylines: (a) Original skyline of Q; (b) Global skyline of Q; (c) Dynamic skyline of H9; (d) Reverse skyline of Q. Each panel plots the houses H1 to H15 and the query point Q by Price and Distance to Station.

A naïve solution for computing a reverse skyline is to compute the dynamic skylines [2, 5] of all objects and check whether each one contains the query point. However, this is impractical because of the huge computing cost. An alternative is a pre-processing approach, which computes the dynamic skylines in advance and stores them on disk for later query processing. However, whenever data change through insertions or deletions, the system must re-calculate all the pre-stored information. Therefore, pre-processing-based approaches are not feasible for frequently changing data. In addition, if a reverse skyline must be computed over data restricted by some predicates, for example houses with garages or houses built within the last 5 years, the pre-computed data may be useless.
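To make the cost of the naïve solution concrete, the following sketch tests every object directly against the definition: an object p is in the reverse skyline of q if no other object dynamically dominates q with respect to p. It assumes that dynamic dominance compares absolute coordinate differences from p, as is usual in the reverse skyline literature; the function names and sample points are hypothetical.

```python
def dyn_dominates(a, b, ref):
    """a dynamically dominates b w.r.t. ref, comparing absolute offsets from ref."""
    da = [abs(x - r) for x, r in zip(a, ref)]
    db = [abs(x - r) for x, r in zip(b, ref)]
    return (all(x <= y for x, y in zip(da, db))
            and any(x < y for x, y in zip(da, db)))

def naive_reverse_skyline(points, q):
    """Keep p if q belongs to the dynamic skyline of p (no other object beats q)."""
    result = []
    for p in points:
        others = (o for o in points if o is not p)
        if not any(dyn_dominates(o, q, p) for o in others):
            result.append(p)
    return result

# Hypothetical 2-D data and query point.
points = [(2, 2), (4, 4), (8, 1)]
q = (5, 5)
rsl = naive_reverse_skyline(points, q)
print(rsl)
```

Every object is compared against every other object, so the sketch needs a quadratic number of dominance tests for each query, which is the cost the pre-processing and refinement approaches try to avoid.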

To overcome this problem, we propose an efficient algorithm that reduces the number of computations in processing reverse skyline queries with a two-phased refinement approach. Before the refinement step of the traditional approach, which selects the final answer from the candidates, the proposed approach performs an additional refinement step that decreases the number of candidates, so that it can handle reverse skyline queries more efficiently without any pre-processing. Because it is not based on pre-processing, the proposed algorithm is more suitable for the online processing of reverse skyline queries over frequently updated data. To evaluate the efficiency of the proposed algorithm, we performed extensive experiments in various settings, and the results show that the proposed algorithm outperforms the existing algorithms.

The rest of the paper is organized as follows. Section II presents the related work on skyline queries and reverse skyline queries. Section III proposes the two-phased refinement algorithm. Section IV evaluates the efficiency of the proposed algorithm experimentally. Finally, Section V concludes the paper with directions for future work.

II. RELATED WORK

The reverse skyline (RSL) of a point Q is the set of objects that are interested in the query point Q [2]. Processing an RSL over a large dataset requires an enormous number of calculations.

Figure 2. Rectangle of H10

Figure 3. Approximated skyline of H9 (showing DSL(H9), its DDR and DADR, and the query points Q1, Q2, and Q3)

A. Branch and Bound Reverse Skyline

The Branch and Bound Reverse Skyline (BBRS) algorithm is similar to the BBS algorithm for the original skyline [6-8]. To compute the RSL, the BBRS first computes the GSL of the point Q. Then, for each point in the GSL, the BBRS executes a rectangle-shaped window query over the dataset. The center of the rectangle is located at the GSL point, and one of its vertices lies at the query point [2]. Fig 2 shows an example of the rectangle for the object H10. If the rectangle contains any object, the corresponding GSL point cannot be in RSL(Q); otherwise, it is in RSL(Q). Since the BBRS has to execute as many window queries as there are points in the GSL, it needs considerable time to compute an RSL. Therefore, it is not suitable for the online processing of RSLs over a large dataset.

B. Reverse Skyline using Skyline Approximations

The Reverse Skyline using Skyline Approximations (RSSA) algorithm modifies the BBRS to optimize query performance with a pre-processing approach. As pre-processing, it computes the dynamic skyline (DSL) of each object in the dataset [5, 9]. Then, the RSSA computes the Dynamic Dominance Regions (DDRs) and Dynamic Anti-Dominance Regions (DADRs) of the DSL [2]. Fig 3 shows the DDR and DADR of the object H9. The DDRs and DADRs of each object are stored on disk. If a query point is in the DADR of an object, the object is in the RSL of the query point. For example, in Fig 3, the object H9 is in the RSL of Q1 because Q1 is in the DADR of H9. On the contrary, if a query point is in the DDR of an object, the object cannot be in the RSL of the query point. In Fig 3, the object H9 is not in the RSL of Q2 because Q2 is in the DDR of H9. However, if a query point is in neither the DADR nor the DDR of an object, the system cannot determine whether the object is in the RSL and must test the object with a window query. For example, the object H9 must be tested to compute the RSL of the point Q3 in Fig 3. The RSSA has good query performance, but it requires a large amount of storage to maintain the pre-computed data. In addition, when any part of the dataset changes, it must re-compute all the DSLs. Therefore, it is not suitable for computing RSLs over frequently updated data.

III. TWO-PHASED REFINEMENT ALGORITHM

In this section, we present the Two-Phased Refinement Algorithm (TPRA) for computing reverse skyline queries. The TPRA is based on the BBRS. However, after computing the GSL of a query point, the TPRA performs two refinement steps, called the Refining for Reverse Skyline Candidates (RRSC) and the Refinement of Comparison Objects (RCO). With these two refinement steps, the TPRA efficiently reduces the number of window queries required by the original BBRS, which makes the TPRA suitable for the online processing of reverse skyline queries.

A. Refinement of Reverse Skyline Candidates

After finding the GSL, the BBRS must test all the points in the GSL with window queries, as described in Section II-A. In the BBRS, all the points in the GSL become candidates for the answer. However, we can reduce the number of these candidates with a simple process. For example, Fig 4 depicts the first quadrant of a GSL, with the query point Q as the origin. In this example, the object H13 cannot be in RSL(Q) because the objects H3 and H12 lie in the window-query rectangle of H13. The TPRA eliminates unnecessary objects such as H13 from the candidates. This step is the Refining for Reverse Skyline Candidates (RRSC).

Algorithm 1 describes the specific process of the RRSC. The query point divides the space into quadrants, and the algorithm is performed on each quadrant separately. Although we explain the algorithm only for the first quadrant in this paper, it can be applied to the other quadrants in a straightforward way. First, the system sorts all the points in the GSL along each axis and chooses the object with the minimum value on that axis. With the chosen object, the system tests whether each of the other objects in the GSL can be eliminated. If the values of an object on all axes other than the chosen one are bigger than half of the corresponding values of the chosen object, the object is eliminated because its window query always contains the chosen object. For example, in Fig 5, the GSL consists of 5 objects (H3, H5, H12, H13, and H14). For the x-axis, after sorting the objects, we choose H12 and eliminate all the objects whose y-value is bigger than half of the y-value of H12. In this example, we can eliminate H13 because the window query of H13 always includes H12. We then continue this process with the remaining objects. The object H3 is chosen for the next step and, by the same process, H14 is eliminated. After finishing the x-axis, the same refinement process is applied to the y-axis. First, we choose H5 as the object with the minimum y-value. Then we can eliminate H3 in the same manner. Finally, after the RRSC, only the objects H5 and H12 remain as candidates for the RSL. Consequently, we save three window queries, namely those for the eliminated objects H3, H13, and H14.

Figure 4. Rectangle for checking H13

Figure 5. Refinement of x-axis
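The elimination rule of the RRSC can be sketched as follows for a single quadrant. This is one possible reading of the prose description, not a transcription of Algorithm 1: points are given as strictly positive offsets from Q, each axis is processed in turn, and a chosen object only tests objects that come after it in the sort order. The function name and the coordinates assigned to H3, H5, H12, H13, and H14 are hypothetical and were picked only so that the run reproduces the elimination order described in the text.

```python
def rrsc(candidates):
    """candidates: dict name -> tuple of positive offsets from Q (first quadrant).

    Returns the names that survive the refinement along every axis.
    """
    remaining = dict(candidates)
    dims = len(next(iter(remaining.values()))) if remaining else 0
    for axis in range(dims):
        order = sorted(remaining, key=lambda name: remaining[name][axis])
        eliminated = set()
        for i, chosen in enumerate(order):
            if chosen in eliminated:
                continue  # only remaining objects act as the chosen object
            o = remaining[chosen]
            for other in order[i + 1:]:
                if other in eliminated:
                    continue
                p = remaining[other]
                # The window of `other` spans [0, 2 * p[j]] on every axis j, so it
                # contains `chosen` when p[j] > o[j] / 2 on all non-chosen axes.
                if all(p[j] > o[j] / 2 for j in range(dims) if j != axis):
                    eliminated.add(other)
        for name in eliminated:
            del remaining[name]
    return remaining

# Hypothetical offsets from Q arranged like the GSL staircase in the example.
gsl = {"H12": (1, 10), "H13": (2, 7), "H3": (4, 5), "H14": (5, 3), "H5": (7, 2)}
survivors = rrsc(gsl)
print(sorted(survivors))
```

With these coordinates, the x-axis pass removes H13 (using H12) and H14 (using H3), and the y-axis pass removes H3 (using H5), leaving H5 and H12 as the only candidates that still need window queries.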

B. Refinement of Compared Target Objects

Even though the TPRA reduces the number of candidates, it must still test all the remaining candidates. For these tests, each candidate is compared with all other objects in the same quadrant by a window query. Therefore, reducing the number of objects that must be compared with the candidates improves performance. To reduce these comparison objects, we introduce the Dummy Global Skyline (DGSL), which is a second-level GSL retrieved from the objects that are not in the GSL. Equation (1) is the formal definition of the DGSL.

$$\mathrm{DGSL} = \{\, p_i \mid p_i \in P - \mathrm{GSL},\ \forall p_j \in (P - \mathrm{GSL} - \{p_i\}),\ p_j \nprec p_i \,\} \tag{1}$$

Figure 6. Refinement of y-axis

Figure 7. Dummy GSL

If an object that is not in the DGSL is located in a window query, the DGSL object that dominates it must be in the same window query. Therefore, when the TPRA tests the candidates with window queries, it only needs to compare them with the DGSL instead of with all the objects. For example, in Fig 7, the DGSL contains the objects H15, H21, and H22. To test the object H13, we have to execute a window query with the shaded rectangle in Fig 7. The rectangle contains the objects H15 and H24. However, we do not have to check H24 because, if H24 is located in the shaded rectangle, H15 must be in the same rectangle. The big advantage of using the DGSL is that the system can compute it during the computation of the GSL, so no additional computation is needed for the DGSL. Algorithm 2 shows the process for computing the GSL and the DGSL.

Algorithm 1. Refining for Reverse Skyline Candidates

    procedure RRSC
        RSCL = {}  // Reverse Skyline Candidate List
        GSL, DGSL = RCO(Dataset, Q)
        sort GSL
        O = pop object from GSL
        insert O into RSCL
        while the next min value object N is not null do
            for (each object P in GSL) do
                for (each axis except the chosen one) do
                    if (O.value/2 > P.value) then
                        N = P; break for loop
                    end if
                end for
            end for
            if (N is Null) then
                break while loop
            else
                insert N into RSCL; set O = N
            end if
        end while
        return RSCL
    end procedure

Algorithm 2. Refining for Target Objects Comparison

    procedure RCO
        GSL = {}   // Global Skyline List
        DGSL = {}  // Dummy Global Skyline List
        for (each pi in Dataset) do
            for (each pj in GSL) do
                if (pi dominates pj) then
                    insert pi into GSL; insert pj into DGSL
                else if (pj dominates pi) then
                    if (pi incomparable with DGSL) then
                        insert pi into DGSL
                    else if (pi dominated by DGSL) then
                        delete pi from Dataset
                    end if
                else
                    insert pi into GSL
                end if
            end for
        end for
        return GSL, DGSL
    end procedure
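The role of the DGSL can be illustrated with the sketch below. It is a simplified two-pass layering for one quadrant (points given as offsets from Q), not the single-pass procedure of Algorithm 2, and it takes the comparison pool to be the GSL together with the DGSL, which is an assumption about how the reduced set is used. All names and coordinates are hypothetical.

```python
def dominates(a, b):
    """Dominance on offsets from Q within one quadrant."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def layer(points):
    """Points not dominated by any other point in the given list."""
    return [p for p in points
            if not any(dominates(o, p) for o in points if o is not p)]

def gsl_and_dgsl(points):
    """Two-pass sketch: the GSL first, then the skyline of what is left (the DGSL)."""
    gsl = layer(points)
    rest = [p for p in points if p not in gsl]
    return gsl, layer(rest)

def window_hit(p, pool):
    """True if some object other than p lies in the window of p (origin is Q)."""
    for o in pool:
        if o == p:
            continue
        d = [abs(x - y) for x, y in zip(o, p)]
        if all(x <= e for x, e in zip(d, p)) and any(x < e for x, e in zip(d, p)):
            return True
    return False

# Hypothetical offsets from Q; (2, 9.5) is dominated by the DGSL point (2, 9).
data = [(1, 5), (4, 2), (3, 6), (5, 5), (6, 7), (2, 9), (2, 9.5)]
gsl, dgsl = gsl_and_dgsl(data)
reduced_pool = gsl + dgsl
for candidate in gsl:
    print(candidate, window_hit(candidate, data), window_hit(candidate, reduced_pool))
```

For the candidate (1, 5), both (2, 9) and (2, 9.5) lie in its window, but testing against the reduced pool already finds (2, 9), so the dominated object never has to be examined. In this example the reduced pool holds five of the seven objects and gives the same answer for every candidate.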

IV. EXPERIMENTAL RESULT

In this section, we present the results of experiments that evaluate the efficiency of the proposed algorithm. For the experiments, we generated synthetic datasets with 10K to 100K objects. We compared three algorithms for reverse skylines: the BBRS, the RSSA, and the TPRA (the proposed one). All experiments were conducted on a Windows PC with a 32-bit 3.2 GHz single-core CPU and 2 GB of main memory.

Figure 8. Processing time (ms) versus the number of data objects (n) for the BBRS, RSSA, and TPRA: (a) Size of dataset (uniform, 3d); (b) Size of dataset (Gaussian, 3d)

Figure 9. Processing time (ms) versus dimension (d) for the BBRS, RSSA, and TPRA (uniform, 10k)

Figure 10. Number of GSL, RSLC, and RSL objects versus dimension (d) (uniform, 10k)

Fig 8 shows the processing time of the three algorithms as the data size varies from 10K to 100K under the uniform and Gaussian distributions, respectively. With 100K uniformly distributed objects, the RSSA requires 5,927 ms while the TPRA requires only 3,003 ms. For all data sizes, the TPRA performs better than its competitors. Fig 9 shows the processing time of the three algorithms as the dimension varies from 2 to 5 on a 10K uniformly distributed dataset. The experiment shows that the TPRA outperforms the other algorithms in all settings. For the dataset with 5 dimensions, the RSSA takes 10,883 ms while the TPRA takes only 1,694 ms, a reduction in processing time of 84.4%. Fig 10 shows the effectiveness of the two-phased refinement. As the number of dimensions increases, the number of objects in the GSL increases rapidly, which degrades the performance of the BBRS and the RSSA. However, the number of reverse skyline candidates, which is the refined result of the proposed algorithm, does not increase much because the proposed algorithm effectively eliminates unnecessary objects from the GSL and thereby reduces the number of window queries. By reducing the number of candidates, the TPRA can handle reverse skyline queries efficiently.
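The reported reduction follows from the two timings by simple arithmetic, as the short check below shows. The helper name is hypothetical; the timings are the ones quoted above.

```python
def reduction_pct(baseline_ms, proposed_ms):
    """Relative reduction in processing time, as a percentage of the baseline."""
    return 100.0 * (baseline_ms - proposed_ms) / baseline_ms

five_dim = round(reduction_pct(10883, 1694), 1)      # RSSA vs. TPRA, 5 dimensions
uniform_100k = round(reduction_pct(5927, 3003), 1)   # RSSA vs. TPRA, 100K uniform
print(five_dim, uniform_100k)
```

The same formula applied to the 100K uniform dataset gives a reduction of about 49.3%.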

V. CONCLUSIONS

The reverse skyline is an important and useful operation for decision support systems and data analysis applications. Although several algorithms for reverse skyline queries have been proposed, they are not suitable for processing reverse skyline queries online. This paper proposes a new algorithm for reverse skyline queries. Instead of using pre-processing, the proposed algorithm eliminates unnecessary objects from the candidates with a two-phased refinement step. The existing pre-processing-based approaches have difficulty with frequently updated data because they have to re-calculate the pre-computed data whenever any object changes. In contrast, the proposed algorithm can process reverse skyline queries efficiently even if the data change frequently, because it is not based on pre-processing. The experimental results show that the proposed algorithm outperforms the existing approaches in various experimental settings. In future research, we will study an approximation technique for processing reverse skyline queries, which retrieves a result quickly at the cost of allowing some errors in the result. We will also consider parallel computation of reverse skyline queries for the analysis of large datasets.

REFERENCES

[1] E. Dellis and B. Seeger, “Efficient computation of reverse skyline queries,” in VLDB ’07: Proceedings of the 33rd international conference on Very large data bases. VLDB Endowment, 2007, pp. 291–302.

[2] E. Dellis, A. Vlachou, I. Vladimirskiy, B. Seeger, and Y. Theodoridis, “Constrained subspace skyline computation,” in CIKM ’06: Proceedings of the 15th ACM international conference on Information and knowledge management. New York, NY, USA: ACM, 2006, pp. 415–424.

[3] S. Börzsönyi, D. Kossmann, and K. Stocker, “The skyline operator,” in Proceedings of the 17th International Conference on Data Engineering. Washington, DC, USA: IEEE Computer Society, 2001, pp. 421–430.

[4] K.-L. Tan, P.-K. Eng, and B. C. Ooi, “Efficient progressive skyline computation,” in VLDB ’01: Proceedings of the 27th International Conference on Very Large Data Bases. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2001, pp. 301–310.

[5] X. Lian and L. Chen, “Monochromatic and bichromatic reverse skyline search over uncertain databases,” in SIGMOD ’08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data. New York, NY, USA: ACM, 2008, pp. 213–226.

[6] D. Papadias, Y. Tao, G. Fu, and B. Seeger, “Progressive skyline computation in database systems,” ACM Trans. Database Syst., vol. 30, no. 1, pp. 41–82, 2005.

[7] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, “The R*-tree: An efficient and robust access method for points and rectangles,” in Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, Atlantic City, NJ, May 23-25, 1990, H. Garcia-Molina and H. V. Jagadish, Eds. ACM Press, 1990, pp. 322–331.

[8] D. Kossmann, F. Ramsak, and S. Rost, “Shooting stars in the sky: An online algorithm for skyline queries,” in VLDB ’02: Proceedings of the 28th international conference on Very large data bases. VLDB Endowment, 2002, pp. 275–286.

[9] J. Pei, W. Jin, M. Ester, and Y. Tao, “Catching the best views of skyline: a semantic approach based on decisive subspaces,” in VLDB ’05: Proceedings of the 31st international conference on Very large data bases. VLDB Endowment, 2005, pp. 253–264.