Maximal Vector Computation in Large Data Sets
The 31st International Conference on Very Large Data Bases, VLDB 2005 / VLDB Journal 2006, August
Parke Godfrey, Jarek Gryz (York University)
Ryan Shipley (The College of William and Mary)
Speaker: ZHANG Shiming (Simon)
Supervisors: Prof. David Cheung, Dr. Nikos Mamoulis
Department of Computer Sciences, The University of Hong Kong
The 31st International Conference on Very Large Data Bases (VLDB05/VLDBJ06 August), 23/4/21
Outline
  Introduction
  Skyline vs. Maximal Vector Problem
  Goals & Accomplishments
  Design & Analysis Considerations
  Generic Algorithms & Analyses
  LESS Algorithm & Performance
  Conclusions
This presentation is based on the paper but is not limited to it.
What is skyline? Skyline Query
Given a set of d-dimensional data points, a skyline query finds the set of data points that are not dominated by any other point.
Adversarial skyline query: finds the set of data points that do not dominate any other point (not covered in any paper).
Dominance Relationship
A data point p dominates another data point q if and only if p is better than or as good as q (with respect to the preference) on all dimensions and p is strictly better than q on at least one dimension.
Monotone Preference Function
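The dominance relationship defined on this slide translates directly into a small predicate. The sketch below is illustrative only: it assumes that larger values are preferred on every dimension, and the function name `dominates` and the sample points are hypothetical.

```python
from typing import Sequence


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """Return True if p dominates q, assuming larger values are preferred."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # p must be at least as good as q on every dimension ...
    at_least_as_good = all(a >= b for a, b in zip(p, q))
    # ... and strictly better on at least one dimension.
    strictly_better = any(a > b for a, b in zip(p, q))
    return at_least_as_good and strictly_better


# Hypothetical 2-dimensional points, chosen only for illustration.
print(dominates((3, 5), (3, 4)))  # equal on dimension 1, better on dimension 2
print(dominates((3, 5), (4, 1)))  # worse on dimension 1, so no dominance
print(dominates((3, 5), (3, 5)))  # identical points do not dominate each other
```

A skyline point is then any point for which no other point makes this predicate true.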
What is skyline? SQL Extensions
Find the maximals over tuples in the database context w.r.t. the skyline criteria.
SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ...
SKYLINE OF [DISTINCT] d1 [MIN|MAX|DIFF],
...,
dm [MIN|MAX|DIFF]
ORDER BY ...
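To make the MIN, MAX, and DIFF directives concrete, here is a brute-force sketch of how a SKYLINE OF clause could be evaluated over in-memory rows. It rests on assumptions not stated on the slide: MIN and MAX give the preferred direction per column, and two rows that differ on a DIFF column are treated as incomparable. The function name `skyline_of`, the column names, and the rows are hypothetical, and DISTINCT is not modeled.

```python
from typing import Dict, List, Mapping


def skyline_of(rows: List[Mapping], spec: Dict[str, str]) -> List[Mapping]:
    """Naive O(n^2) evaluation of a SKYLINE OF clause over dict rows.

    spec maps a column name to "MIN", "MAX", or "DIFF".
    """
    def dominates(a: Mapping, b: Mapping) -> bool:
        strictly_better = False
        for col, directive in spec.items():
            if directive == "DIFF":
                # Assumed semantics: rows differing on a DIFF column are incomparable.
                if a[col] != b[col]:
                    return False
            elif directive in ("MIN", "MAX"):
                # Swap operands for MAX so that "smaller x" always means "a is better".
                x, y = (a[col], b[col]) if directive == "MIN" else (b[col], a[col])
                if x > y:
                    return False
                if x < y:
                    strictly_better = True
            else:
                raise ValueError(f"unknown directive: {directive}")
        return strictly_better

    return [r for r in rows if not any(dominates(o, r) for o in rows)]


# Hypothetical rows, for illustration only.
rows = [
    {"name": "A", "city": "X", "price": 50, "stars": 3},
    {"name": "B", "city": "X", "price": 60, "stars": 3},  # same city, pricier than A
    {"name": "C", "city": "Y", "price": 90, "stars": 2},  # other city: incomparable
]
spec = {"price": "MIN", "stars": "MAX", "city": "DIFF"}
print([r["name"] for r in skyline_of(rows, spec)])
```

Row B is removed because A is cheaper with the same rating in the same city, while C survives because the DIFF column keeps it from being compared with A or B.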
What is skyline? Skyline Examples
Interesting hotel: a not-too-crowded, cheap hotel
Hotel Information (price, # of rooms)
[Figure: skyline of hotels; axes: # of rooms, price]

| Name | # of rooms | Price |
|---|---|---|
| Hotel 1 | 20 | 70 |
| Hotel 2 | 40 | 40 |
| Hotel 3 | 40 | 100 |
| Hotel 4 | 50 | 70 |
| Hotel 5 | 60 | 100 |
| Hotel 6 | 70 | 10 |
| Hotel 7 | 80 | 40 |
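The skyline of the hotel table above can be checked with a short brute-force sketch. It assumes that "not too crowded" means fewer rooms is better and that a lower price is better, so both attributes are minimized; that reading is an interpretation of the slide, not something it states outright.

```python
# (rooms, price) per hotel, copied from the table on this slide.
hotels = {
    "Hotel 1": (20, 70),
    "Hotel 2": (40, 40),
    "Hotel 3": (40, 100),
    "Hotel 4": (50, 70),
    "Hotel 5": (60, 100),
    "Hotel 6": (70, 10),
    "Hotel 7": (80, 40),
}


def dominates(p, q):
    """p dominates q when smaller is better on both attributes (assumed)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


# A hotel is in the skyline if no other hotel dominates it.
skyline = sorted(
    name
    for name, p in hotels.items()
    if not any(dominates(q, p) for q in hotels.values())
)
print(skyline)  # ['Hotel 1', 'Hotel 2', 'Hotel 6']
```

Hotel 2, for example, dominates Hotels 3, 4, 5, and 7, since it has no more rooms and no higher price than any of them and is strictly better on at least one attribute.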
What is skyline? Skyline Examples
Consider a Hotel table with columns name, address, dist (distance to the beach), stars (quality ranking), and price.
Maximal Vector Problem
A classical, interesting problem since the 1960s
To identify the maximals over a collection of vectors
Tuples ≈ vectors (or points) in k-dim. space
Related to nearest neighbors and convex hull
Challenges of skyline query processing (not in this paper)
Search efficiency
Update efficiency
Scalability to skyline query variants and various types of data
High dimensionality and large data sets
Related Work (not in this paper)
General Skyline Algorithms
  BNL and D&C, Börzsönyi et al., ICDE’01
  Bitmap and Index, Tan et al., VLDB’01
  NN, Kossmann et al., VLDB’02
  SFS, Chomicki et al., ICDE’03
  BBS, Papadias et al., SIGMOD’03
  LESS, Godfrey et al., VLDB’05
Static attributes vs. dynamic spatial attributes in SSQ
  SSQ is a dynamic skyline query, M. Sharifzadeh et al., VLDB’06
  Z Order Skyline, Lee et al., VLDB’07
  BBRS (Reverse Skyline), Dellis et al., VLDB’07
  ……
Nearest Neighbor Search
  K-NN
  …
Computational Geometry
  Voronoi Diagram
  Delaunay Graph
  Convex Hull
  High-dimensional computational geometry
Maximal Vector Problem
  FLET (Fast Linear Expected-Time), J.L. Bentley et al., SODA 1990
Index on Skyline
  Bitmap, B-tree, R-tree, aR-tree
  ….
Spatial Skyline Query (SSQ): find the data points pi that are not spatially dominated by any other point pj with respect to the given query points {q}.
Variations of Skyline Queries (not in this paper)
Constrained skyline (spatial skyline)
Ranked Skyline
Group-by Skyline
Dynamic Skyline or Multi-source Skyline
Enumerating Skyline/Top-K/K-Dominating Skyline
K-Skyband Skyline
Approximate Skyline
Reverse Skyline
Subspace Skyline
SkyCube in subspace
Probabilistic Skylines on Uncertain Data
Privacy Skyline
……
Goals & Accomplishments
Design & Analysis Considerations
Relational Performance Criteria
External
  I/O conscious (too much data for main memory)
well behaved
  compatible with a query optimizer
CPU
  computational load (asymptotic runtime analyses)
generic (focus on a generic maximal-vector algorithm)
  no indexes, no pre-computed information
good properties
  progressive, pipe-lineable, universality, etc.
  at worst, linear run-time (O(n))
Design Choices
divide-and-conquer (D&C) or scan-based
  Can D&C be I/O conscious?
  Can scan-based be efficient?
to sort or not to sort
  Is sorting useful?
  Is sorting too inefficient? (Not linear...)
comparison policy
  Which vectors to compare next?
  How to reduce the number of comparisons?
…
A Model for Average-Case Analysis
Component Independence (CI)
Uniform Independence (UI)
Expected Number of Maximals
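The slide body did not survive extraction, so the following is background rather than a reconstruction of it. A standard result for independent dimensions with no duplicate values (which is how the CI assumption is read here) is the recurrence m(k, n) = m(k, n-1) + m(k-1, n)/n with m(1, n) = 1; for k = 2 it yields the harmonic number H_n. The sketch below evaluates the recurrence and compares it with a small simulation; the function names are hypothetical.

```python
import random


def expected_maximals(k: int, n: int) -> float:
    """Expected number of maximals among n points in k dimensions.

    Uses m(k, n) = m(k, n-1) + m(k-1, n) / n with m(1, n) = 1, which assumes
    independent dimensions and no duplicate values on any dimension.
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    row = [0.0] + [1.0] * n          # row[i] = m(1, i); index 0 is a sentinel
    for _ in range(2, k + 1):
        nxt = [0.0] * (n + 1)
        for i in range(1, n + 1):
            nxt[i] = nxt[i - 1] + row[i] / i
        row = nxt
    return row[n]


def simulated_maximals(k: int, n: int, trials: int, seed: int = 0) -> float:
    """Average number of maximals over random uniform point sets (larger is better)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        pts = [tuple(rng.random() for _ in range(k)) for _ in range(n)]
        total += sum(
            1 for p in pts
            if not any(q != p and all(a >= b for a, b in zip(q, p)) for q in pts)
        )
    return total / trials


print(expected_maximals(2, 4))             # harmonic number 1 + 1/2 + 1/3 + 1/4
print(expected_maximals(5, 10_000))        # grows slowly compared with n
print(simulated_maximals(3, 100, trials=20), expected_maximals(3, 100))
```

The practical point is that, under these assumptions, the number of maximals is tiny relative to n, which is what makes window-based algorithms attractive.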
Algorithms & Analyses Generic Algorithms
Algorithms & Analyses Generic Algorithms’ Performance
Algorithms & Analyses
Divide-and-Conquer algorithms
  No evidence that an efficient external version can be made, although they have good asymptotic complexity in n; the curse of dimensionality is a problem for k.
Scan-based algorithms
  Find global maximals early and eliminate non-maximals more quickly.
DD&C: D&C | +Sort
LD&C: D&C | -Sort
Block Nested Loops (BNL) Algorithm
O(kn) average case under CI
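For readers unfamiliar with BNL, here is a simplified in-memory sketch of the idea: stream the points past a bounded window of candidate maximals, spill points that do not fit, and repeat on the spilled points. Python lists stand in for the input and temporary files, larger values are assumed to be preferred, and the names `bnl_skyline` and `window_size` are hypothetical. It is not the published implementation.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q, assuming larger values are preferred on every dimension."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def bnl_skyline(points: Sequence[Point], window_size: int) -> List[Point]:
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    skyline: List[Point] = []
    pending = list(points)                      # stands in for the input file
    while pending:
        window: List[Tuple[int, Point]] = []    # (arrival position, point)
        overflow: List[Point] = []              # stands in for the temporary file
        first_overflow_pos = len(pending)       # position of the first spilled point
        for pos, p in enumerate(pending):
            if any(dominates(w, p) for _, w in window):
                continue                        # p is dominated: discard it
            # p survives, so evict any window points that p dominates.
            window = [(t, w) for t, w in window if not dominates(p, w)]
            if len(window) < window_size:
                window.append((pos, p))
            else:
                overflow.append(p)
                first_overflow_pos = min(first_overflow_pos, pos)
        # Window points that arrived before the first spill have been compared
        # with every other point of this pass, so they are maximal.
        skyline.extend(w for t, w in window if t < first_overflow_pos)
        # Later window points still have to meet the spilled points.
        pending = [w for t, w in window if t >= first_overflow_pos] + overflow
    return skyline


# Hypothetical data, for illustration only.
points = [(1, 9), (2, 8), (3, 3), (9, 1), (5, 5), (4, 4), (2, 2)]
print(bnl_skyline(points, window_size=2))
```

With a window large enough to hold every maximal, a single pass suffices; smaller windows trade memory for extra passes.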
Sort Filter Skyline (SFS) Algorithm
Have a window (W) and stream (S), as with BNL.
Sort S first (via an external sort routine): e.g.,
Then, call improved BNL.
Any w in the window is guaranteed to be maximal (skyline).
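The guarantee stated on this slide follows from the sort order: if the stream is sorted by a monotone score, no later point can dominate an earlier one, so the window never needs eviction. The sketch below illustrates this in memory. The sort key on the slide did not survive extraction, so the coordinate sum is used here purely as a stand-in monotone score; larger values are assumed to be preferred, integer inputs keep the sum exact, and `sfs_skyline` is a hypothetical name.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, ...]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q, assuming larger values are preferred on every dimension."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def sfs_skyline(points: Sequence[Point]) -> List[Point]:
    # Sort by a monotone score (assumed: coordinate sum), best first.
    # If q dominates p, then sum(q) > sum(p), so q is seen before p.
    ordered = sorted(points, key=sum, reverse=True)
    window: List[Point] = []
    for p in ordered:
        # Only dominance by an earlier (window) point is possible.
        if not any(dominates(w, p) for w in window):
            window.append(p)        # p is maximal and can be output immediately
    return window


# Hypothetical data, for illustration only.
points = [(1, 9), (2, 8), (3, 3), (9, 1), (5, 5), (4, 4), (2, 2)]
print(sfs_skyline(points))
```

Because every point added to the window is already known to be maximal, results can be emitted progressively, which BNL cannot promise.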
BNL vs SFS
BNL & SFS
The LESS Algorithm
Combine the best aspects of the algorithms, mainly BNL & SFS.
EF window (Elimination-Filter): keeps records with the best entropy scores
SF window (Skyline-Filter): keeps the current skyline for further filtering
block-sort pass
last merge pass
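One way to read the components listed on this slide is that the elimination filter runs during the block-sort pass and the skyline filter runs during the last merge pass. The in-memory sketch below follows that reading under several assumptions: larger, non-negative values are preferred, the entropy score is taken to be the sum of ln(x + 1) over the coordinates, sorted lists stand in for on-disk runs, and the SF window is unbounded. The names `less_skyline`, `block_size`, and `ef_size` are hypothetical.

```python
import heapq
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q, assuming larger values are preferred on every dimension."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def sort_key(p: Point):
    # Assumed entropy score; the point itself breaks ties so that a dominating
    # point always sorts ahead of the points it dominates.
    return (sum(math.log(x + 1) for x in p), p)


def less_skyline(points: Sequence[Point], block_size: int, ef_size: int) -> List[Point]:
    ef: List[Point] = []            # elimination-filter window
    runs: List[List[Point]] = []
    # Block-sort pass: drop points dominated by the EF window, then sort each block.
    for start in range(0, len(points), block_size):
        block: List[Point] = []
        for p in points[start:start + block_size]:
            if any(dominates(e, p) for e in ef):
                continue            # eliminated before it is ever sorted
            block.append(p)
            if len(ef) < ef_size:
                ef.append(p)
            elif ef:
                worst = min(range(len(ef)), key=lambda i: sort_key(ef[i]))
                if sort_key(p) > sort_key(ef[worst]):
                    ef[worst] = p   # keep the best-scoring records in the EF window
        runs.append(sorted(block, key=sort_key, reverse=True))
    # Last merge pass: merge the runs and apply the skyline filter as in SFS.
    sf: List[Point] = []
    for p in heapq.merge(*runs, key=sort_key, reverse=True):
        if not any(dominates(w, p) for w in sf):
            sf.append(p)
    return sf


# Hypothetical data, for illustration only.
points = [(1, 9), (2, 8), (3, 3), (9, 1), (5, 5), (4, 4), (2, 2)]
print(less_skyline(points, block_size=3, ef_size=2))
```

Points removed by the EF window never reach the sort, which is where the saving over plain SFS would come from in this reading.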
LESS: Linear Average-Case Issues & Improvement
LESS: Performance
n = 500,000
EF window: 200 vectors
SF window: 76 pages, 3,000 vectors
Pentium III, 733 MHz
RedHat Linux 7.3
Conclusions
Future Work for Optimization of LESS