stabbing the sky: efficient skyline computation over sliding windows comp9314 lecture notes

28
Stabbing the Sky: Efficient Skyline Computation over Sliding Windows COMP9314 Lecture Notes

Post on 19-Dec-2015

217 views

Category:

Documents


1 download

TRANSCRIPT

Stabbing the Sky: Efficient Skyline Computation over Sliding Windows

COMP9314 Lecture Notes

COMP9314 Xuemin [email protected] 2

Outline

• Introduction

• n-of-N Queries

• (n1, n2)-of-N Queries

• Performance Evaluation

• Conclusions

COMP9314 Xuemin [email protected] 3

Skyline

Skyline Query:• Input: a set of points in d-

dimensional space.• Output: points not dominated

by another point.

• (x1, x2, …, xd) dominates (y1, y2, …, yd) iff xi<=yi (1<=i<=d) & ∃k, xk<yk.

COMP9314 Xuemin [email protected] 4

Applications

Multi-criteria decision making…

Stock Trading Example:

• What are the top deals?

COMP9314 Xuemin [email protected] 5

Skyline Query Over Sliding WindowStock Trading Example• Top deals of a stock in the last 5 mins? last 4 mins, …• Top deals of a stock in the last 10K deals? …

Queries:• n-of-N model (n <= N): the most recent n elements

• (n1, n2)-of-N model

• One-time queries• Continuous queries

COMP9314 Xuemin [email protected] 6

Challenges

Insertions & deletions (possibly high speed).

On-line information– memory requirement– processing speed

Existing techniques do not support n-of-N:[Borzsonyi et al (ICDE01), Tan et al (VLDB01), Kossman et al (VLDB 02), Papadias et al (SIGMOD03), Kapoor (SIAM J. comp00)]

– support the computation of whole dataset– O (n logd-2 n) for d >= 4 & O (n log n ) otherwise

COMP9314 Xuemin [email protected] 7

Results

n-of-N:• keep N’ (N’ N) elements where N’ = O (logd N) if data

distribution on each dimension is independent.• a novel encoding scheme, with O (N’) space, leads to n-

of-N query time O ( log N’ + s ) instead of O (n logd-2 n).• a new trigger based technique for continuously

processing an n-of-N query. – trigger update time: O ( log s).– result update time: O (logδ) where δ is a result change.

(n1, n2)-of-N: similar results.

COMP9314 Xuemin [email protected] 8

n-of-N Queries

• e is redundant Point e in PN (the most recent N elements) iff– e expires w.r.t PN, or

– ∃e’ s.t. e’ e, and e’ is younger than e

N = 6

COMP9314 Xuemin [email protected] 9

Optimality

Theorem: Non-redundant Points (RN) vs. n-of-N Skyline Query Result (Qn,N)

– (PN – RN) does not appear in any Qn,N

– Qn,N must be a subset of RN

xRN n, xQn,N

– RN = O(logd-1N) for “independent” distributions

• Only need to keep RN – the minimum number of elements to be kept.

COMP9314 Xuemin [email protected] 10

Querying RN

• critical dominance: e e’ where e is the youngest.

• dominance graph GRN: RN and the critical dominance

relationships.

Y

X

M = 7, N = 7

1 2

3

4

6

5

7

Y

X

M = 7, N = 7

1 2

3

4

6

5

7

COMP9314 Xuemin [email protected] 11

Querying RN

e Qn,N iff

• e is a root in GRN or

• e’ e in GRN & e’ has expired t(e’) < M – n + 1 <= t(e)Y

X

M = 7, N = 7

1 2

3

4

6

5

7

n RN M-n+1 Qn,N

n=5 {3,4,5,6,7 } 3 {3,4}

n=4 {4,5,6,7 } 4 {4, 7}

n=3 {5,6,7} 5 {5,6,7}

COMP9314 Xuemin [email protected] 12

Querying RN: Optimal Algorithm

To answer an n-of-N Query, encode the GRN using intervals:

• Stab the intervals by (M-n+1).• For all returned intervals (x,y), return point whose timestamp is y

• Technique: Use an interval tree index to achieve optimal O(log|RN|+s) query timeY

X

M = 7, N = 7

1 2

3

4

6

5

7

Y

X

M = 7, N = 7

1 2

3

4

6

5

7

(3,7](4,6]

(4,5](0,3]

(0,4]

e.g., n=6

(0,3](0,4](3,7](4,5](4,6]

COMP9314 Xuemin [email protected] 13

Maintaining RN

new element enew arrives:

• If the oldest eold RN expires, remove eold and update RN and GRN (interval

tree).

• find D RN dominated by enew, update RN and GRN

– Depth-first search on a R-tree of RN

• find e c enew, update GRN

– Best-first search on the R-tree of RNY

X

M = 8, N = 5

3

4

6

5

7

8

Y

X

M = 8, N = 5

3

4

6

5

7

8

enew = 8eold = 3D = {6}e = 4

COMP9314 Xuemin [email protected] 14

Continuous n-of-N QueryTrigger-based algorithm: • Deletion: Qn,N – {eold}, and Qn,N – {D}

• Insertion: Qn,N {enew} if (e’ c enew and t(e’ ) >= M-n+1)

• Maintain a min-heap of Qn,N for efficiency

Y

X

M = 8, N = 5

3

4

6

5

7

8

enew = 8M = 8eold = 3D = {6}e’ = 4

M = 7n = 4, N=5Q4,5 = {4,7}

M = 8n = 4, N=5Q4,5 = {5,7,8}

Q5,5 = {3,4} Q5,5 = {4,7}

COMP9314 Xuemin [email protected] 15

(n1,n2)-of-N Query

More complicated than n-of-N Query• PN needs to be kept!

• (Old) critical dominance: t (ae) = max { t (e’): e’ e & t (e’) < t (e) }

• backward critical dominance: t (be) = min { t (e’): e’ e & t (e’) > t (e)}

• e Q(n1,n2),N iff ae < M-n2+1 e M-n1+1 < be

• CBC dominance graph: PN & the two kinds of dominance relationships

M = 7, N = 7Y

X1

2

3

4

5

6

7

(2,4)-of-7:{4, 6}

M = 7, N = 7n1 = 4, n2 = 6

Y

X1

2

3

4

5

6

7

a5 = 3, b5 = 6(M-n2+1, M-n1+1] = (4,6]

COMP9314 Xuemin [email protected] 16

Processing (n1,n2)-of-N QueryEncode the CBC dominance graph:

– e ((ae, e], be)

– build an interval tree on (ae, e] only

Stab using M-n2+1 against the interval tree and check e <= M-n1+1 < be) on-the-fly:

– O(logN+s*), sub-optimal

M = 7, N = 7Y

X1

2

3

4

5

6

7

(2,4)-of-7: ???(M-n2+1, M-n1+1] = (4,6]

(1,6](3,4](3,5]...

(2,4)-of-7: {4, 6} Candidates: {4, 5, 6}

COMP9314 Xuemin [email protected] 17

More on (n1,n2)-of-N QueryMaintenance: Similar to that of n-of-N query, but

– Always expires the oldest element in PN, and maintain the interval tree and the R-tree on RN.

– Implementation-wise: Use two interval trees to index RN and PN-RN, respectively.

Continuous queries– More complicated

• A new skyline point might not be a skyline in the previous result,

• nor critically dominated by a skyline point in the previous result

• nor a newly arrived point

– Basic idea• Maintain additional Candidate Solutions (minimization) & triggers

– Details in the full paper

COMP9314 Xuemin [email protected] 18

Experiment Setup

• Hardware– P4 2.8G CPU, 1G Memory

• Datasets– Correlated, independent, and anti-

correlated– d = 2 to 5, N = 106

• Algorithms– KLP, nN, mnN, cnN, n12N, mn12N

• Metrics– Processing time Streaming rate

COMP9314 Xuemin [email protected] 19

n-of-N Query

• Varying dimensionality– M up to 2M, N = 1M, n uniformly from [1K,

1M], #queries = 1000

COMP9314 Xuemin [email protected] 20

n-of-N Query (cont’d)

• Varying n• for correlated, independent, and anti-correlated datasets

COMP9314 Xuemin [email protected] 21

Maintenance Costs

2d and 5d datasets, measure average and max time, N = i * 105

COMP9314 Xuemin [email protected] 22

Scalability

M (total number) = 2M, N = 1M, #queries = 2M

anti-correlatedindependent

COMP9314 Xuemin [email protected] 23

Continuous n-of-N Queries

• 2d & 5d datasets • N = 10K and 1M • 10 queries with n = i*(N/10)• measures cnN avg, cnN max, nN avg, nN max

COMP9314 Xuemin [email protected] 24

(n1,n2)-of-N Queries

Varying dimensionality– M up to 2M, N = 1M, #queries = 1000

– restricting n2 – n1 >= 500

Scalability– M = 2M, N = 1M, #queries = 2M

COMP9314 Xuemin [email protected] 25

Maintenance

• 2d and 5d datasets• measure average and max time• N = i * 105

COMP9314 Xuemin [email protected] 26

Conclusions

• Efficient algorithms for various sliding windows skyline queries– Keep only minimum number of points– Encode and index those points– Maintain all the data structures

• The proposed solutions – have theoretical guarantee on the performance, and– have demonstrated efficiency and scalability in the experiments

• Future work– Improve the current solution for (n1,n2)-of-N queries

– Approximate skyline queries

COMP9314 Xuemin [email protected] 27

Q&A

Thank You!

COMP9314 Xuemin [email protected] 28

Reference• [ICDE01] S. Borzsonyi, D. Kossmann, and K. Stocker. The skyline

operator. ICDE, 2001.

• [VLDB01] K. Tan, P. Eng, and B. Ooi. Efficient progressive skyline computation. VLDB, 2001.

• [VLDB 02] D. Kossmann, F. Ramsak, and S. Rost. Shooting stars in the sky: An online algorithm for skyline queries. VLDB, 2002.

• [SIGMOD03] D. Papadias, Y. Tao, G. Fu, and B. Seeger. An optimal progressive alogrithm for skyline queries. SIGMOD, 2003.

• [SIAM J. comp00] S. Kapoor. Dynamic maintenance of maxima of 2-d point sets. SIAM J. Comput., 2000.