SUBSKY: Efficient Computation of Skylines in Subspaces
Authors: Yufei Tao, Xiaokui Xiao, and Jian Pei
Conference: ICDE 2006
Presenter: Kamiru
Supervisor: Dr. Nikos Mamoulis
Skyline Queries
Given a set of d-dimensional points, a point p dominates another point p’ if
p[i]<=p’[i] for all dimensions i, and p[j]<p’[j] for at least one dimension j
Skyline queries aim to find the points that are not dominated by any other point
For the NBA database,
a low turnover rate and a low foul rate are two important factors for a defensive player
[Figure: players plotted by turnover rate and foul rate, both on a 0 to 1 scale, with the best point marked]
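The dominance test and a brute-force skyline can be sketched in Python as follows. This is a minimal illustration, assuming smaller values are better on every dimension; the function names `dominates` and `skyline` and the sample points are hypothetical and not taken from the paper.

```python
def dominates(p, q):
    """True if p dominates q: no worse on every dimension, strictly better on one."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def skyline(points):
    """Brute-force skyline: keep every point that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]

# Hypothetical (turnover rate, foul rate) pairs; smaller is better.
players = [(0.2, 0.7), (0.4, 0.3), (0.5, 0.5), (0.8, 0.1), (0.9, 0.9)]
print(skyline(players))
```

The quadratic scan is only meant to make the definition concrete; the algorithms discussed later avoid comparing every pair.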
Applications of Skyline Queries
Find a good hotel for me according to distance and price
[Figure: map of hotels A, B, C, and D with price labels 500, 1000, 1500, and 2000]
Hotel D cannot be a good hotel for this user, since its price is higher and its distance is farther than those of other hotels
Alternative applications of Skyline Queries - i
Some top-k queries can be answered using skyline queries
A top-k query retrieves the k tuples in P with the highest scores according to g, where g must be a monotonic function, e.g.:
g(p) = p.x + p.y
Alternative applications of Skyline Queries - i
Find the top-2 NBA players according to the sum of their points and assists in the 2007-2008 season
[Figure: player photos plotted by points and assists, with a region marked PRUNED]
Each player's values are given by the top-right corner of the player's photo
The results of this query (up to Jan 23, 2008) are
Allen Iverson, 27+6.9
LeBron James, 29.7+7.4
The top-2 results must be in the top-2 skyband
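The link between top-k queries and the skyband can be checked with a small Python sketch. It assumes larger values are better (as with points and assists) and uses the usual definition of the k-skyband as the points dominated by fewer than k other points. The names `k_skyband` and `top_k` and the sample data are hypothetical.

```python
def dominates_max(p, q):
    """True if p dominates q when larger values are better."""
    return (all(a >= b for a, b in zip(p, q))
            and any(a > b for a, b in zip(p, q)))

def k_skyband(points, k):
    """Names of the points dominated by fewer than k other points."""
    return {name for name, p in points.items()
            if sum(dominates_max(q, p) for q in points.values()) < k}

def top_k(points, k, g):
    """Names of the k points with the highest score under g."""
    return sorted(points, key=lambda name: g(points[name]), reverse=True)[:k]

# Hypothetical (points, assists) pairs, not real player statistics.
stats = {"a": (9, 2), "b": (7, 6), "c": (6, 4), "d": (3, 3), "e": (2, 8)}
best = top_k(stats, 2, lambda p: p[0] + p[1])
print(best, k_skyband(stats, 2))
```

Because g is monotonic, any point outside the 2-skyband has at least two points that score no lower, so it can be pruned before scoring.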
Alternative applications of Skyline Queries - ii
Another interesting measure is the dominating count (DC)
The DC of a point is the number of points that it dominates
Ex: find the top-2 dominating players in the NBA database according to turnover rate and foul rate
[Figure: players plotted by turnover rate and foul rate, both on a 0 to 1 scale, with the best point marked]
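A top-k dominating query can be sketched in Python as below. This assumes that the DC of a point is the number of points it dominates and that smaller values are better; the names `dominating_count` and `top_k_dominating` and the sample data are hypothetical.

```python
def dominates(p, q):
    """True if p dominates q when smaller values are better."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def dominating_count(name, points):
    """Number of points dominated by the named point."""
    p = points[name]
    return sum(dominates(p, q) for q in points.values())

def top_k_dominating(points, k):
    """Names of the k points with the largest dominating counts."""
    return sorted(points, key=lambda n: dominating_count(n, points),
                  reverse=True)[:k]

# Hypothetical (turnover rate, foul rate) pairs.
rates = {"a": (0.1, 0.1), "b": (0.2, 0.5), "c": (0.5, 0.7),
         "d": (0.6, 0.6), "e": (0.9, 0.9)}
print(top_k_dominating(rates, 2))
```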
Skyline Computations
Two categories of skyline computation:
1. Computing from scratch (no index)
2. Relying on an index
1. Computing from scratch (no index)
Advantages
• No pre-computation
• No index to update when the data change
Drawback
• Must compute from scratch
– It must scan the entire dataset at least once
Skyline Computations
2. Relying on an index
Once built, the index can be used many times
Lower query cost is achieved by performing the search on an appropriate structure
• B-tree
• R-tree
• …
Since all of us are database people (I hope…), we prefer method 2
Related works
1. Computing from scratch (no index)
• Block nested loop
• Sort filter skyline
• Divide and conquer
• Bitmap
• Linear elimination sort for skyline
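As an illustration of the index-free family, the idea behind sort filter skyline can be sketched in a few lines of Python. The sketch presorts by the coordinate sum, which is one possible monotone score (an assumption, not necessarily the score used in the original work), so a point can only be dominated by points that come earlier in the order. Smaller values are assumed to be better, and the sample points are hypothetical.

```python
def dominates(p, q):
    """True if p dominates q when smaller values are better."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def sfs_skyline(points):
    """Presort by a monotone score, then filter against the skyline found so far."""
    window = []  # skyline points found so far
    for p in sorted(points, key=sum):
        if not any(dominates(s, p) for s in window):
            window.append(p)
    return window

# Hypothetical 2D points.
data = [(0.2, 0.7), (0.4, 0.3), (0.5, 0.5), (0.8, 0.1), (0.9, 0.9)]
print(sfs_skyline(data))
```

Every point that enters the window is final, which is why a single pass after sorting is enough.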
Related works
2. Relying on an index
B-tree approach
• Index
R-tree approach
• Nearest neighbor (NN)
• Branch and bound skyline (BBS)
– BBS has been proven to be I/O optimal: it accesses no more disk pages than any other algorithm based on R-trees
Related works
Index
[Figure: points p1 to p8 in the x-y plane, with the best point marked]
A point p is added to list i if p has its smallest value in dimension i
List x: p5:0.1, p6:0.25, p2:0.3, p7:0.6
List y: p4:0.1, p1:0.2, p3:0.3, p8:0.6
1) Ssky = {p5}
2) Ssky = {p5,p4}
3) Ssky = {p5,p4,p1}
• All remaining elements in List x are pruned by p1, since both coordinates of p6 are larger than those of p1
• For the same reason, all remaining elements in List y are pruned by p1 too
Related works
BBS
[Figure: points p1 to p8 grouped into R-tree nodes N1 to N4 under M1 and M2, with the best point marked and the dominant region of p1 shaded]
R-tree levels: M1 M2 / N1 N2 N3 N4 / p1 p2 p3 p4 p5 p6 p7 p8
HNN={p1,p2,N2,M2}
1) p1 is the first NN object from the best point
The dominant region of p1 is shown in grey
2) p2 is pruned because it lies in the dominant region
3) Expand N2
4) …
SUBSKY
In the NBA database, each player has more than 10 different attributes
A skyline query may be interested in only some of the attributes
SUBSKY
Build one R-tree and run BBS
• BBS is an I/O optimal algorithm based on an R-tree index, but it is optimized for a fixed set of dimensions
Build R-trees for all elements in the power set of dimensions
• Huge storage space
SUBSKY for uniform data
Anchor point Ac: the maximal corner of the data space, having the maximum coordinate on all dimensions
f(p)=max(1-p[i]), where i is from 1 to d
fsky(psky)=min(1-psky[i]), where i is from 1 to d
No point p satisfying f(p)<fsky(psky) can belong to the skyline
[Figure: unit data space in x and y with Ac, the best point, points p1 and p2, skyline point psky, the values f(p1), f(p2), and fsky(psky), and the pruning region of psky]
A similar result exists for the skyline of any subspace
SUBSKY for uniform data
A skyline query applies only to the relevant dimensions SUB
f’sky(psky)=min(1-psky[i]), where i is in SUB
Then, f(p) < fsky(psky) <= f’sky(psky)
No point p satisfying f(p) < f’sky(psky), and hence no point satisfying the chain above, can belong to the skyline of SUB
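The reason the pruning condition works can be written out as a short derivation. This is a sketch that assumes, as on the slides, a unit data space with anchor Ac = (1, …, 1) and smaller values being better.

```latex
\begin{align*}
f(p) < f'_{sky}(p_{sky})
  &\iff \max_{1 \le i \le d} \bigl(1 - p[i]\bigr) < \min_{j \in SUB} \bigl(1 - p_{sky}[j]\bigr) \\
  &\iff \min_{1 \le i \le d} p[i] > \max_{j \in SUB} p_{sky}[j] \\
  &\implies p[j] > p_{sky}[j] \quad \text{for every } j \in SUB .
\end{align*}
```

So p_sky is strictly smaller than p on every dimension of SUB, which means p_sky dominates p in the subspace and p cannot be in its skyline.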
SUBSKY for uniform data
Assume that our skyline query is interested in dimensions x and y only
       p1   p2   p3   p4   p5
x      0.2  0.4  0.5  0.9  0.6
y      0.2  0.4  0.3  0.1  0.8
z      0.5  0.9  0.1  0.6  0.7
f(pi)  0.8  0.6  0.9  0.9  0.4
First, we sort the data in descending order of f(pi): p3, p4, p1, p2, p5
1) Ssky={p3}, f’sky(p3)=min(1-0.5,1-0.3)=0.5, U=0.5 (U is the largest f’ value in Ssky)
2) Ssky={p3,p4}, f’sky(p4)=0.1, U=0.5
3) Ssky={p1,p4}, f’sky(p1)=0.8, U=0.8; p3 is removed when p1 is added, since it is dominated by p1
4) The scan stops at p2, since f(p2)=0.6 < U=0.8
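The walk-through on this slide can be reproduced with a small Python sketch. It assumes the anchor Ac = (1, 1, 1), smaller values being better, and an in-memory sorted list in place of whatever index holds the f values; the names `f`, `f_sky`, and `subsky_uniform` are hypothetical.

```python
def f(p):
    """1D key of a point: largest distance to the anchor (1, ..., 1) on any axis."""
    return max(1 - v for v in p)

def f_sky(p, sub):
    """Pruning threshold of a skyline point, restricted to the dimensions in sub."""
    return min(1 - p[i] for i in sub)

def dominates(a, b, sub):
    """True if a dominates b on the dimensions in sub (smaller is better)."""
    return (all(a[i] <= b[i] for i in sub)
            and any(a[i] < b[i] for i in sub))

def subsky_uniform(points, sub):
    """Scan in descending f order and stop once f(p) drops below U."""
    order = sorted(points, key=lambda n: f(points[n]), reverse=True)
    sky, u = [], float("-inf")
    for name in order:
        p = points[name]
        if f(p) < u:
            break  # p and everything after it is dominated by a skyline point
        if any(dominates(points[s], p, sub) for s in sky):
            continue
        sky = [s for s in sky if not dominates(p, points[s], sub)]
        sky.append(name)
        u = max(f_sky(points[s], sub) for s in sky)
    return sky

# Data from the slide; dimensions are (x, y, z).
pts = {"p1": (0.2, 0.2, 0.5), "p2": (0.4, 0.4, 0.9), "p3": (0.5, 0.3, 0.1),
       "p4": (0.9, 0.1, 0.6), "p5": (0.6, 0.8, 0.7)}
print(subsky_uniform(pts, sub=(0, 1)))  # query on x and y
```

Only p3, p4, and p1 are examined; p2 and p5 are skipped without any dominance test.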
General SUBSKY
In practice, data are usually clustered
If the data are clustered, we should expect that one anchor point cannot give us enough pruning power
[Figure: clustered data in the unit space with anchors Ac and A1, skyline point psky, and the best point]
General SUBSKY
[Figure: unit space with clusters s1 to s4, anchors Ac, A1, and A2, skyline point psky, and the best point]
Anchors for different clusters
Two questions:
1) How to find the anchors?
2) How to assign points to anchors?
Finding the Anchors
First, let us see what a perfect anchor A of a point p is
If p is assigned to A, then p can be pruned by any skyline point dominating p
Any point on this line is a perfect anchor of point p
[Figure: point p with its anti-dominant region, the major perpendicular plane, candidate anchors A1, A2, and A3, and the line of perfect anchors of p]
Finding the Anchors
[Figure: unit space with the best point, Ac, points p1 and p2, their projections p’1 and p’2 on the major perpendicular plane, and a good anchor]
For each point, find its projection onto the plane, e.g., p’1, p’2, …
Partition the projected points into m clusters using the k-means algorithm, and formulate an anchor for each cluster
Finding the Anchors
How to decide an anchor for a cluster?
The blue points are assigned to cluster S. How can we decide the anchor for S?
1) Obtain point B, whose coordinate on each dimension equals the lowest coordinate of the points in S (in their original space) on that axis
2) Then, the algorithm computes the smallest square that has B as a corner and covers all points in S; the anchor A is the corner opposite to B
[Figure: cluster S enclosed by a square with corners B and A]
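One reading of the two steps above can be sketched in Python. It assumes that the anchor A is the corner opposite to B of the smallest axis-parallel square (hypercube) with corner B that covers S; this interpretation, the name `cluster_anchor`, and the sample cluster are assumptions, not details confirmed by the slide.

```python
def cluster_anchor(cluster):
    """Return (B, A): the minimum corner of the cluster and the opposite
    corner of the smallest hypercube anchored at B that covers the cluster."""
    d = len(cluster[0])
    # Step 1: B takes the lowest coordinate of the cluster on every axis.
    b = [min(p[i] for p in cluster) for i in range(d)]
    # Step 2: the side length is the largest extent from B on any axis.
    side = max(p[i] - b[i] for p in cluster for i in range(d))
    a = [b[i] + side for i in range(d)]
    return b, a

# Hypothetical 2D cluster.
S = [(0.2, 0.5), (0.4, 0.6), (0.3, 0.9)]
B, A = cluster_anchor(S)
print(B, A)
```

With this choice every point of the cluster has all coordinates no larger than those of A, so A can serve as the anchor from which the f values of the cluster are measured.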
Assigning Points to Anchors
A naïve way is to assign each point to its closest anchor in the major perpendicular plane (projected space)
However, this does not directly quantify the benefit of an assignment
[Figure: pruning regions of p1 and p2, and the ER of p]
Assigning Points to Anchors
In order to assign a point to a good anchor, this paper introduces a new measure called the effective region (ER)
Any point in the yellow region (ER) has a pruning region with respect to Ac that covers p
The larger the ER volume of p, the more likely p is to be pruned
[Figure: unit space with the best point, Ac, point p, its ER shaded in yellow, and points p1 and p2]
Assigning Points to Anchors
[Figure: two plots of the unit space showing the ER of p with respect to Ac and with respect to another anchor A’, with points p1 and p2 and the best point]
Assigning Points to Anchors
The pruning volume (ER volume) of a point p with respect to an anchor point Aj is
∏max(0,Aj[i]-L∞(p,Aj)), where i is from 1 to d
Therefore, assign each point p to the anchor Aj that produces the largest pruning volume
Query example
We use the same example as in the previous slides
Assume that we have two anchors: one is Ac, and the other, A’, is found by k-means (m=1)
Ac=(1,1,1) and A’=(1,1,0.8)
First, we calculate the ER volume of each data point with respect to Ac and A’
       p1   p2   p3   p4   p5
x      0.2  0.4  0.5  0.9  0.6
y      0.2  0.4  0.3  0.1  0.8
z      0.5  0.9  0.1  0.6  0.7
f(pi)  0.8  0.6  0.9  0.9  0.4
ER volume (unit: 10^-3)
       p1   p2   p3   p4   p5
Ac     8    64   1    1    216
A’     0    -    9    -    144
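The numeric entries of the ER volume table can be reproduced with the product formula ∏max(0,Aj[i]-L∞(p,Aj)). The Python sketch below does so; treating a point with a coordinate above the anchor as having volume 0 is an assumption of this sketch (the table shows such entries as "-"), and the names `linf`, `er_volume`, and `assign` are hypothetical.

```python
import math

def linf(p, anchor):
    """L-infinity distance between a point and an anchor."""
    return max(abs(a - v) for v, a in zip(p, anchor))

def er_volume(p, anchor):
    """Product over all axes of max(0, anchor[i] - Linf(p, anchor))."""
    # Assumption: a point that exceeds the anchor on some axis gets volume 0.
    if any(v > a for v, a in zip(p, anchor)):
        return 0.0
    r = linf(p, anchor)
    return math.prod(max(0.0, a - r) for a in anchor)

def assign(points, anchors):
    """Assign every point to the anchor that gives the largest ER volume."""
    return {name: max(anchors, key=lambda a: er_volume(p, anchors[a]))
            for name, p in points.items()}

pts = {"p1": (0.2, 0.2, 0.5), "p2": (0.4, 0.4, 0.9), "p3": (0.5, 0.3, 0.1),
       "p4": (0.9, 0.1, 0.6), "p5": (0.6, 0.8, 0.7)}
anchors = {"Ac": (1, 1, 1), "A'": (1, 1, 0.8)}
for name, p in pts.items():
    print(name, [round(er_volume(p, a) * 1000) for a in anchors.values()])
print(assign(pts, anchors))
```

Under these assumptions only p3 prefers A’, because its ER volume grows from 1 to 9 (in units of 10^-3) when measured from A’.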
Query example
Sorted lists by f:
Ac: p4, p1, p2, p5
A’: p3
       p1   p2   p3   p4   p5
x      0.2  0.4  0.5  0.9  0.6
y      0.2  0.4  0.3  0.1  0.8
z      0.5  0.9  0.1  0.6  0.7
f(pi)  0.8  0.6  0.9  0.9  0.4
ER volume (unit: 10^-3)
       p1   p2   p3   p4   p5
Ac     8    64   1    1    216
A’     0    -    9    -    144
1) Ssky={p4}, f’sky(p4)=min(1-0.9,1-0.1)=0.1, U=0.1
2) Ssky={p4, p1}, f’sky(p1)=0.8, U=0.8
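A query over several anchor lists can be sketched in Python as follows. The sketch keeps one list per anchor sorted by descending f, always takes the list head with the largest f, and prunes a whole list once its head falls below that list's threshold U. The order in which the lists are visited, the in-memory lists, and the names `subsky_general`, `f`, and `f_sky` are assumptions of this sketch, not details confirmed by the slides.

```python
def f(p, anchor):
    """1D key of p in the list of its anchor: largest gap to the anchor."""
    return max(a - v for v, a in zip(p, anchor))

def f_sky(s, anchor, sub):
    """Threshold that skyline point s imposes on the list of this anchor."""
    return min(anchor[i] - s[i] for i in sub)

def dominates(a, b, sub):
    return (all(a[i] <= b[i] for i in sub)
            and any(a[i] < b[i] for i in sub))

def subsky_general(points, anchors, assignment, sub):
    lists = {}
    for name, a in assignment.items():
        lists.setdefault(a, []).append(name)
    for a, lst in lists.items():
        lst.sort(key=lambda n: f(points[n], anchors[a]), reverse=True)
    sky = []
    while True:
        heads = []
        for a, lst in lists.items():
            if not lst:
                continue
            u = max((f_sky(points[s], anchors[a], sub) for s in sky),
                    default=float("-inf"))
            key = f(points[lst[0]], anchors[a])
            if key < u:
                lst.clear()  # the rest of this list is pruned
            else:
                heads.append((key, a))
        if not heads:
            return sky
        _, a = max(heads)
        name = lists[a].pop(0)
        p = points[name]
        if any(dominates(points[s], p, sub) for s in sky):
            continue
        sky = [s for s in sky if not dominates(p, points[s], sub)]
        sky.append(name)

pts = {"p1": (0.2, 0.2, 0.5), "p2": (0.4, 0.4, 0.9), "p3": (0.5, 0.3, 0.1),
       "p4": (0.9, 0.1, 0.6), "p5": (0.6, 0.8, 0.7)}
anchors = {"Ac": (1, 1, 1), "A'": (1, 1, 0.8)}
assignment = {"p1": "Ac", "p2": "Ac", "p3": "A'", "p4": "Ac", "p5": "Ac"}
print(subsky_general(pts, anchors, assignment, sub=(0, 1)))
```

On the slide data the sketch examines p4 and then p1; after that, both lists are pruned because U=0.8 exceeds the f values of p2 (0.6, measured from Ac) and p3 (0.7, measured from A’).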
Experiments
3 real datasets: NBA, Household, and Color
             NBA   Household  Color
Dimension    13    6          9
Cardinality  17k   127k       68k
2 synthetic datasets (10D): Uniform and Clustered
• Clustered: 10 cluster centroids
• For each centroid, N/10 points are generated whose coordinate on each axis follows a Gaussian distribution with variance 0.05 and a mean equal to the corresponding coordinate of the centroid
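The clustered synthetic dataset described above could be generated along the following lines. The slide does not say how the centroids are chosen or whether values are clipped to the data space, so this sketch draws centroids uniformly from the unit cube and does not clip; both choices, the fixed seed, and the name `clustered_dataset` are assumptions.

```python
import random

def clustered_dataset(n, d=10, clusters=10, variance=0.05, seed=0):
    """Generate n points in d dimensions around `clusters` centroids."""
    rng = random.Random(seed)
    sigma = variance ** 0.5  # random.gauss expects a standard deviation
    # Assumption: centroids are drawn uniformly from the unit cube.
    centroids = [[rng.random() for _ in range(d)] for _ in range(clusters)]
    data = []
    for c in centroids:
        for _ in range(n // clusters):
            data.append([rng.gauss(mu, sigma) for mu in c])
    return data

sample = clustered_dataset(1000)
print(len(sample), len(sample[0]))
```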
Experiments
[Figures: experimental result charts; surviving captions: "3D subspaces, 1 million cardinality" and "3D subspaces, full-space dimensionality is 10"]
Conclusion
The core of SUBSKY is a transformation that converts multi-dimensional data into 1D values
It shows better performance than an I/O optimal algorithm on the subspace skyline problem
Some continuous monitoring cases are worth investigating
• How to adapt the set of anchor points if the data are updated rapidly
• The f values could be stored in another index structure to support fast updates