
SUBSKY: Efficient Computation of Skylines in Subspaces

Authors: Yufei Tao, Xiaokui Xiao, and Jian Pei

Conference: ICDE 2006

Presenter: Kamiru

Supervisor: Dr. Nikos Mamoulis

Skyline Queries

Given a set of d-dimensional points, a point p dominates another point p’ if

p[i] <= p’[i] for every dimension i, and p[j] < p’[j] for at least one dimension j

Skyline queries aim to find the points that are not dominated by any other point
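The definition above can be sketched as a brute-force check. This is only a minimal illustration, assuming smaller values are better on every dimension; the function names and the sample (turnover rate, foul rate) pairs are hypothetical and not taken from the slides.

```python
from typing import List, Sequence

Point = Sequence[float]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q: p is no larger on every dimension and smaller on at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points: List[Point]) -> List[Point]:
    """Brute-force skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (turnover rate, foul rate) pairs, smaller is better.
players = [(0.2, 0.6), (0.5, 0.5), (0.4, 0.3), (0.7, 0.1), (0.8, 0.8)]
result = skyline(players)
```

The quadratic scan is only meant to make the definition concrete; the algorithms discussed later avoid comparing every pair.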

[Figure: NBA players plotted by turnover rate and foul rate, both on a 0 to 1 scale; the best point is the origin.]

For the NBA database, a low turnover rate and a low foul rate are two important factors for a defensive player

Applications of Skyline Queries

Find a good hotel for me according to distance and price

[Figure: hotels A, B, C, and D plotted by distance, labeled with prices of 500, 1000, 1500, and 2000.]

Hotel D cannot be a good hotel for this user, since it is both more expensive and farther away than other hotels

Alternative applications of Skyline Queries - i

Some top-k queries can be answered with skyline queries

A top-k query retrieves the k tuples in P with the highest scores according to a scoring function g, where g must be monotonic, e.g.:

g(p) = p.x + p.y

Alternative applications of Skyline Queries - i

Find the top-2 NBA players according to the sum of their points and assists in the 2007-2008 season

[Figure: players plotted by points and assists; each player's values are given by the top-right corner of the player's photo.]

The results of this query (as of Jan 23, 2008) are

Allen Iverson, 27+6.9

LeBron James, 29.7+7.4

The top-2 results must be in the top-2 skyband; all other players are pruned
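The relationship between top-k results and the skyband can be checked with a small sketch. It assumes larger values are better and takes the k-skyband to be the set of points dominated by fewer than k other points. Only the first two value pairs come from the slide; the remaining three, and all function names, are hypothetical.

```python
def dominates_max(p, q):
    """p dominates q when larger is better on every dimension."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def k_skyband(points, k):
    """Points dominated by fewer than k other points."""
    return [p for p in points if sum(dominates_max(q, p) for q in points) < k]


def top_k_by_sum(points, k):
    """Top-k under the monotonic score g(p) = p.x + p.y."""
    return sorted(points, key=sum, reverse=True)[:k]


# (points, assists); the last three entries are made up for illustration.
stats = [(27.0, 6.9), (29.7, 7.4), (20.0, 10.0), (15.0, 5.0), (25.0, 6.0)]
band = k_skyband(stats, 2)
top2 = top_k_by_sum(stats, 2)
```

Because g is monotonic, a point dominated by two or more others always has at least two points scoring no lower, so restricting the top-2 search to `band` loses nothing.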

Alternative applications of Skyline Queries - ii

Another interesting measure is the dominating count (DC)

The DC of a point is the number of points that it dominates

Ex: find the top-2 dominating players in the NBA database according to turnover rate and foul rate

[Figure: players plotted by turnover rate and foul rate, labeled with their dominating counts; the best point is the origin.]
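A minimal sketch of the dominating count and a top-k dominating query follows, assuming smaller values are better. The sample (turnover rate, foul rate) pairs and the function names are hypothetical.

```python
def dominates(p, q):
    """p dominates q: no larger anywhere, strictly smaller somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def dominating_count(p, points):
    """Number of points in the data set that p dominates."""
    return sum(dominates(p, q) for q in points)


def top_k_dominating(points, k):
    """The k points with the largest dominating counts (ties keep input order)."""
    return sorted(points, key=lambda p: dominating_count(p, points), reverse=True)[:k]


# Hypothetical (turnover rate, foul rate) pairs.
players = [(0.1, 0.2), (0.3, 0.1), (0.5, 0.5), (0.6, 0.4), (0.9, 0.9)]
counts = [dominating_count(p, players) for p in players]
best_two = top_k_dominating(players, 2)
```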

Skyline Computations

Two categories of skyline computation:
1. Computing from scratch (no index)
2. Relying on an index

1. Computing from scratch (no index)

Advantages
• No pre-computation is needed
• No index has to be updated when the data change

Drawback
• Every query must be computed from scratch
– It must scan the entire data set at least once

Skyline Computations

2. Relying on an index
• Once the index is built, it can be used many times
• Query cost is lower because the search is performed on an appropriate structure
– B-tree
– R-tree
– …

Since all of us are database people (I hope…), we prefer method 2

Related works

1. Computing from scratch (no index)
• Block nested loop
• Sort filter skyline
• Divide and conquer
• Bitmap
• Linear elimination sort for skyline
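As one illustration of the from-scratch family, the idea behind sort filter skyline can be sketched in memory: sort the points by a monotonic score, then keep each point that no earlier survivor dominates. This is a simplified sketch, not the published disk-based algorithm; the coordinate sum as the sort key and the sample data are assumptions.

```python
def dominates(p, q):
    """p dominates q: no larger anywhere, strictly smaller somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def sfs_skyline(points):
    """Sort by coordinate sum, then filter against the skyline found so far.

    If p dominates q then sum(p) < sum(q), so a point can only be dominated
    by points that appear before it in the sorted order. Accepted points are
    therefore final and never need to be removed.
    """
    sky = []
    for p in sorted(points, key=sum):
        if not any(dominates(s, p) for s in sky):
            sky.append(p)
    return sky


# Hypothetical 2D data, smaller is better.
data = [(0.2, 0.6), (0.5, 0.5), (0.4, 0.3), (0.7, 0.1), (0.8, 0.8)]
sfs_result = sfs_skyline(data)
```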

Related works

2. Relying on an index

B-tree approach
• Index

R-tree approach
• Nearest neighbor (NN)
• Branch and bound skyline (BBS)
– BBS has been proven to be I/O optimal: no algorithm based on R-trees accesses fewer disk pages

Related works

Index

[Figure: points p1 to p8 in the x-y plane; the best point is the origin.]

A point p is added to list i if its smallest coordinate is on dimension i

List x: p5:0.1, p6:0.25, p2:0.3, p7:0.6

List y: p4:0.1, p1:0.2, p3:0.3, p8:0.6

1) Ssky = {p5}

2) Ssky = {p5,p4}

3) Ssky = {p5,p4,p1}

• All remaining elements in List x are pruned by p1, since both coordinates of p6 are larger than those of p1

• For the same reason, all remaining elements in List y are pruned by p1 as well
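The list-based walkthrough above can be approximated with the following sketch. It is a simplification under stated assumptions: the two lists are merged into one ascending scan rather than processed in batches, and the full coordinates are hypothetical, chosen only so that each point's smallest coordinate matches the list entries on the slide.

```python
import heapq


def dominates(p, q):
    """p dominates q: no larger anywhere, strictly smaller somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def index_skyline(points):
    """points maps a name to its coordinates; returns the names on the skyline."""
    d = len(next(iter(points.values())))
    lists = [[] for _ in range(d)]
    for name, p in points.items():
        i = min(range(d), key=lambda k: p[k])   # dimension of the smallest coordinate
        lists[i].append((p[i], name))
    for lst in lists:
        lst.sort()

    sky = []
    for min_val, name in heapq.merge(*lists):   # ascending smallest coordinate
        # If some skyline point lies entirely below min_val, it dominates
        # this entry and every later one, so the scan can stop.
        if any(max(points[s]) < min_val for s in sky):
            break
        p = points[name]
        if any(dominates(points[s], p) for s in sky):
            continue
        sky = [s for s in sky if not dominates(p, points[s])] + [name]
    return sky


# Hypothetical coordinates consistent with List x and List y.
pts = {
    "p1": (0.22, 0.2), "p2": (0.3, 0.7), "p3": (0.5, 0.3), "p4": (0.9, 0.1),
    "p5": (0.1, 0.8), "p6": (0.25, 0.5), "p7": (0.6, 0.9), "p8": (0.8, 0.6),
}
index_result = index_skyline(pts)
```

With these coordinates the scan stops as soon as p1 is added, because the next entry (p6 at 0.25) already exceeds both coordinates of p1.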

Related works

BBS

[Figure: points p1 to p8 indexed by an R-tree with root entries M1 and M2, intermediate nodes N1, N2, N3, and N4, and leaf entries p1 to p8; the best point is the origin.]

HNN={p1,p2,N2,M2}

1) p1 is the first NN object from the best point

The dominance region of p1 is shown in grey

2) p2 is pruned because it falls in the dominance region of p1

3) Expand N2

4) …

SUBSKY

In the NBA database, each player has more than 10 different attributes

A skyline query may be interested in only some of these attributes

SUBSKY

Build one R-tree and run BBS
• BBS is an I/O-optimal algorithm based on an R-tree index, but it is optimized for a fixed set of dimensions

Build R-trees for all elements in the power set of dimensions
• Huge storage space

SUBSKY for uniform data

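Anchor point Ac: the maximal corner of the data space, which has the maximum coordinate on all dimensions

[Figure: unit square with axes x and y; the best point is the origin and Ac is the opposite corner. A skyline point psky is shown with its pruning region, together with points p1 and p2 and the values f(p1), f(p2), and fsky(psky).]

f(p)=max(1-p[i]),

where i is from 1 to d and 1 is the maximum value of the coordinate

fsky(psky)=min(1-psky[i]),

where i is from 1 to d

No point p satisfying

f(p)<fsky(psky)

can belong to the skyline: in that case every coordinate of p is larger than every coordinate of psky, so psky dominates p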

A similar result exists for the skyline of any subspace

The skyline query applies only to the relevant dimensions SUB

f’sky(psky)=min(1-psky[i]),

where i is in SUB

Since SUB contains only some of the d dimensions, fsky(psky) <= f’sky(psky)

No point p satisfying f(p) < f’sky(psky) can belong to the skyline of SUB; this covers every point with f(p) < fsky(psky)


Assume that our skyline query is interested in dimensions x and y only

        p1    p2    p3    p4    p5
x       0.2   0.4   0.5   0.9   0.6
y       0.2   0.4   0.3   0.1   0.8
z       0.5   0.9   0.1   0.6   0.7
f(pi)   0.8   0.6   0.9   0.9   0.4

First, we sort the data by f(pi) in descending order: p3, p4, p1, p2, p5

1) Ssky={p3}, f’sky(p3)=min(1-0.5,1-0.3)=0.5, U=0.5 (U is the largest f’sky value in Ssky)

2) Ssky={p3,p4}, f’sky(p4)=min(1-0.9,1-0.1)=0.1, U=0.5

3) Ssky={p1,p4}, f’sky(p1)=min(1-0.2,1-0.2)=0.8; p3 is removed because it is dominated by p1; U=0.8

4) The scan stops: f(p2)=0.6 and f(p5)=0.4 are both smaller than U=0.8
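The scan over the f-sorted data can be sketched as follows for the single anchor Ac=(1,1,1). This is an in-memory sketch under assumptions: the paper's disk-based index on the f values is replaced by a plain sort, and the function names are hypothetical. The data are the five points from the example.

```python
def f(p):
    """L-infinity distance from p to the anchor Ac = (1, ..., 1)."""
    return max(1 - v for v in p)


def dominates_in(p, q, sub):
    """p dominates q when only the dimensions in sub are considered."""
    return all(p[i] <= q[i] for i in sub) and any(p[i] < q[i] for i in sub)


def subsky_scan(points, sub):
    """Scan points by descending f; stop once f(p) drops below U."""
    sky, U = [], 0.0
    for name in sorted(points, key=lambda n: f(points[n]), reverse=True):
        p = points[name]
        if f(p) < U:
            break                       # p and all later points are dominated
        if any(dominates_in(points[s], p, sub) for s in sky):
            continue
        sky = [s for s in sky if not dominates_in(p, points[s], sub)] + [name]
        # U is the largest f'sky value among the current skyline points.
        U = max(min(1 - points[s][i] for i in sub) for s in sky)
    return sky


data = {
    "p1": (0.2, 0.2, 0.5), "p2": (0.4, 0.4, 0.9), "p3": (0.5, 0.3, 0.1),
    "p4": (0.9, 0.1, 0.6), "p5": (0.6, 0.8, 0.7),
}
xy_skyline = subsky_scan(data, sub=(0, 1))   # dimensions x and y
```

The stop test uses a strict inequality so that a point tying with U, which might coincide with a skyline point on SUB, is still examined.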

General SUBSKY

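In practice, data are usually clustered

If the data are clustered, we should expect that a single anchor point cannot give us enough pruning power

[Figure: clustered data in the unit square with the anchor Ac, a skyline point psky, and a second anchor A1; the best point is the origin.]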

General SUBSKY

[Figure: clustered data in the unit square (labels s1 to s4) with a skyline point psky and the anchors Ac, A1, and A2; the best point is the origin.]

Anchors for different clusters

Two questions:

1) How to find the anchors?

2) How to assign points to anchors?

Finding the Anchors

First, let us see what a perfect anchor A of a point p is: if p is assigned to A, then p can be pruned by any skyline point dominating p

[Figure: point p with its anti-dominant region, the major perpendicular plane, candidate anchors A1, A2, and A3, and a line through p.]

Any point on this line is a perfect anchor of point p

Finding the Anchors

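[Figure: points p1 and p2 and their projections p’1 and p’2 onto the major perpendicular plane, with a good anchor marked; the best point is the origin and Ac is the opposite corner.]

For each point, find its projection onto the plane, e.g. p’1, p’2, …

Partition the projected points into m clusters using the k-means algorithm, and formulate an anchor for each cluster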
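The projection step can be sketched as below. The slides do not define the major perpendicular plane precisely, so this sketch assumes it is the plane through Ac=(1,...,1) that is perpendicular to the diagonal from the origin to Ac; the function name is hypothetical and the k-means step is omitted.

```python
def project_to_major_plane(p):
    """Orthogonal projection of p onto the plane {x : sum(x) = d}.

    ASSUMPTION: the plane passes through Ac = (1, ..., 1) and its normal is
    the diagonal direction (1, ..., 1). Moving p along the diagonal by
    `shift` in every coordinate lands exactly on that plane.
    """
    d = len(p)
    shift = sum(1 - v for v in p) / d
    return tuple(v + shift for v in p)


p1 = (0.2, 0.2, 0.5)
p1_shifted = (0.3, 0.3, 0.6)          # same diagonal line as p1
proj_a = project_to_major_plane(p1)
proj_b = project_to_major_plane(p1_shifted)
```

Points on the same diagonal line share a projection, which fits the earlier observation that a whole line of points can serve as perfect anchors for p.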

Finding the Anchors

How do we decide the anchor for a cluster? The blue points are assigned to cluster S. How can we decide the anchor for S?

1) Obtain point B, whose coordinate on each dimension equals the lowest coordinate, on that axis, of the points in S in their original space

2) Then compute the smallest square with B as a corner that covers all points in S; the anchor A is the corner of this square opposite to B

[Figure: cluster S enclosed in a square with corners B and A.]
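One reading of the two steps above can be sketched as follows. It assumes the anchor is the corner opposite to B of the smallest axis-aligned square (hypercube) that has B as a corner and covers the cluster; whether the anchor is clipped to the data space is not stated in the slides and is not handled here. The function name and sample cluster are hypothetical.

```python
def cluster_anchor(cluster):
    """Return (B, A) for a cluster of points in their original space."""
    d = len(cluster[0])
    # Step 1: B takes the lowest coordinate of the cluster on every axis.
    B = tuple(min(p[i] for p in cluster) for i in range(d))
    # Step 2: the square's side is the largest extent of the cluster on any axis.
    side = max(max(p[i] for p in cluster) - B[i] for i in range(d))
    A = tuple(b + side for b in B)      # corner opposite to B
    return B, A


S = [(0.2, 0.5), (0.4, 0.6), (0.3, 0.9)]
B, A = cluster_anchor(S)
```

Every point of the cluster then has coordinates no larger than those of A, which is what lets A play the same role for the cluster that Ac plays for the whole data space.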

Assigning Points to Anchors

A naïve way is to assign points to their closest anchor point in the major perpendicular plane (projected space)

This does not directly quantify the benefit of an assignment

[Figure: pruning regions of p1 and p2.]

ER of p


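In order to assign a point to a good anchor, the paper introduces a new measure called the effective region (ER)

[Figure: point p in the unit square with its ER shown in yellow, together with points p1 and p2; the best point is the origin and Ac is the opposite corner.]

Any point in the yellow region (the ER of p) has a pruning region with respect to Ac that covers p

The larger the ER volume of p, the more likely p is to be pruned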

ER of p


[Figure: the ER of p with respect to the anchor Ac; the best point is the origin.]

ER of p

[Figure: the ER of p with respect to the anchors Ac and A’, with points p1 and p2; the best point is the origin.]

The pruning volume of a point p with respect to an anchor point Aj is

∏max(0,Aj[i]-L∞(p,Aj)),

where i is from 1 to d and L∞(p,Aj) is the largest coordinate difference between p and Aj

Therefore, assign each point p to the anchor Aj that produces the largest pruning volume

Query example

We use the same example as in the previous slides

Assume that we have two anchors: one is Ac, and the other, A’, is found by k-means (m=1)

Ac=(1,1,1) and A’=(1,1,0.8)

First, we calculate the ER volume of each data point with respect to Ac and A’

        p1    p2    p3    p4    p5
x       0.2   0.4   0.5   0.9   0.6
y       0.2   0.4   0.3   0.1   0.8
z       0.5   0.9   0.1   0.6   0.7
f(pi)   0.8   0.6   0.9   0.9   0.4

ER volumes (unit: 10^-3)

        p1    p2    p3    p4    p5
Ac      8     64    1     1     216
A’      0     -     9     -     144
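The volume formula and the assignment rule can be checked against the example with a short sketch. The function names are hypothetical, and the sketch simply evaluates the formula for every pair, so it does not reproduce the entries shown as "-" in the slide's table.

```python
from math import prod


def linf(p, anchor):
    """L-infinity distance between a point and an anchor."""
    return max(abs(a - v) for v, a in zip(p, anchor))


def er_volume(p, anchor):
    """Product over all dimensions of max(0, anchor[i] - Linf(p, anchor))."""
    r = linf(p, anchor)
    return prod(max(0.0, a - r) for a in anchor)


def assign(points, anchors):
    """Assign each point to the anchor that gives the largest volume."""
    return {
        name: max(anchors, key=lambda a: er_volume(p, anchors[a]))
        for name, p in points.items()
    }


data = {
    "p1": (0.2, 0.2, 0.5), "p2": (0.4, 0.4, 0.9), "p3": (0.5, 0.3, 0.1),
    "p4": (0.9, 0.1, 0.6), "p5": (0.6, 0.8, 0.7),
}
anchors = {"Ac": (1.0, 1.0, 1.0), "A'": (1.0, 1.0, 0.8)}
assignment = assign(data, anchors)
```

Only p3 gains from the second anchor (9 versus 1 in units of 10^-3), so it is the only point assigned to A’.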

Query example

Lists sorted by f (descending):

Ac: p4, p1, p2, p5

A’: p3

1) Ssky={p4}, f’sky(p4)=min(1-0.9,1-0.1)=0.1, U=0.1

2) Ssky={p4, p1}, f’sky(p1)=min(1-0.2,1-0.2)=0.8, U=0.8

Experiments

3 real datasets: NBA, Household, and Color

              NBA    Household   Color
Dimension     13     6           9
Cardinality   17k    127k        68k

2 synthetic datasets (10D): Uniform and Clustered

• The clustered dataset has 10 cluster centroids

• For each centroid, N/10 points are generated, whose coordinate on each axis follows a Gaussian distribution with variance 0.05 and a mean equal to the corresponding coordinate of the centroid
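The clustered synthetic dataset described for the experiments could be generated along these lines. This is a sketch under assumptions: the slides do not say how the centroids are chosen or whether coordinates are clipped to the data space, so the centroids here are drawn uniformly from the unit cube and no clipping is applied. The function name, the seed, and the sample size are hypothetical.

```python
import random
from math import sqrt


def clustered_dataset(n, d=10, clusters=10, variance=0.05, seed=0):
    """Generate n/clusters Gaussian points around each of `clusters` centroids."""
    rng = random.Random(seed)
    sigma = sqrt(variance)              # random.gauss expects a standard deviation
    # ASSUMPTION: centroids are uniform in the unit cube.
    centroids = [[rng.random() for _ in range(d)] for _ in range(clusters)]
    return [
        tuple(rng.gauss(c[i], sigma) for i in range(d))
        for c in centroids
        for _ in range(n // clusters)
    ]


sample = clustered_dataset(1000)
```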

Experiments

Experiments

[Figure: results for 3D subspaces, 1 million cardinality]

[Figure: results for 3D subspaces, full-space dimensionality 10]

Conclusion

The core of SUBSKY is a transformation that converts multi-dimensional data into 1D values

SUBSKY shows better performance than an I/O-optimal algorithm on the subspace skyline problem

Some continuous monitoring cases are worth investigating
• How to adapt the set of anchor points if the data are updated rapidly
• The f values could be stored in another index structure to support fast updates
