stholes: a multidimensional workload-aware histogram nicolas bruno* columbia university luis...
TRANSCRIPT
STHoles: A Multidimensional Workload-Aware Histogram
Nicolas Bruno*Columbia University
Luis Gravano*Columbia University
Surajit ChaudhuriMicrosoft Research
SIGMOD 2001
* Work done in part while the authors were visiting Microsoft Research.
2
Histograms as Succinct Data Set Summaries
Used for selectivity estimation and approximate query processing.
Data set partitioned into buckets, each approximated by aggregate statistics.
3
Histograms
Each bucket consists of a bounding box and a tuple frequency value.
Uniformity is assumed inside buckets.– Histograms should partition data set in
buckets with uniform tuple density. Multi-dimensional data makes
partitioning even more challenging.
4
Outline
Overview of existing multidimensional histogram techniques.
Introduction to STHoles histograms. System architecture and STHoles
construction algorithm. Experimental evaluation.
5
Gaussian Data Set
Histograms Techniques: EquiDepth
EquiDepth Histogram[Muralikrishna and DeWitt 1988]
Correctly identifies core of densest clusters. Partitioning uses “equi-count” instead of “equi-density”
6
Gaussian Data Set MHist Histogram[Poosala and Ioannidis 1997]
Histogram Techniques: MHist
Works well for highly skewed data distributions. Devotes too many buckets to the densest clusters. Bad initial “choices” are amplified in later steps.
7
Gaussian Data Set GenHist Histogram[Gunopulos et al. 2000]
Histogram Techniques: GenHist
More robust than previous techniques (based on multidimensional information).
Difficult to choose right values of various parameters. Requires at least 5-10 passes over the data.
8
Gaussian Data Set STGrid Histogram[Aboulnaga and Chaudhuri 1999]
Histogram Techniques: STGrid
Incorporates feedback from query execution. Grid partitioning strategy is sometimes too rigid. Focuses on efficiency rather than accuracy.
9
Our New Histogram Technique: STHoles
Flexible bucket partitioning. Exploits workload information to allocate
buckets. Query feedback captures uniformly
dense regions. Does not examine actual data set.
10
STHoles Histograms Tree structure among buckets. Buckets with holes: relaxes rectangular
regions while using rectangular bucket structures.
Non rectangular
region
11
System Architecture for STHolesRange Query
12
STHoles Construction Algorithm
Initialize histogram H as an empty histogram.
For each query q in workload:1- Gather simple statistics from query
results.2- Identify candidate holes and drill (add)
them as new buckets in H.3- Merge superfluous buckets in H.
13
?
Drilling New Candidate Buckets
Count how many tuples in result stream lie inside qb.
Drill qb as a new bucket (child of b).
q
For each query q in workload and bucket b in histogram:
14
Shrinking Candidate Buckets
Partition constraint: Bounding boxes must be rectangular.
Apply greedy technique to shrink a candidate hole to a rectangle.
15
Merging Buckets To avoid exceeding available space. Merge most “similar” buckets in terms
of tuple density.
16
Parent-Child Merges
Eliminate buckets too similar to their parents.Example: The interesting region in bc is
covered by its child b1.
17
Sibling-Sibling Merges
Consolidate buckets with similar densities that cover close regions.
Extrapolate frequency distributions to yet unseen regions.
18
Gaussian Data Set STHoles Histogram
An Example STHoles Histogram
19
Experimental Setting
Data Sets: – Real: (UCI Repository)
• Sample of Census data set (200K tuples)• Cover data set (500K tuples)
– Synthetic: Variations of Gaussian and Zipfian(Array) distributions.
200K to 500K tuples, 2 to 4 dimensions.
Histograms:– 1024 available bytes per histogram.
– EquiDept, MHist, GenHist, STGrid, STHoles.
20
Experimental Setting (cont.) Workloads [Pagel et al. 1993]:
– 1,000 queries.– Query centers follow different distributions:
Uniform, Biased, Gaussian.– Query boundaries follow different constraints:
area covered, tuples covered.
Census data set Biased (tuples) workload Gaussian (area) workload
Accuracy Metric: Absolute Error.
(with some normalization; details in paper)
21
Comparison with Other Approaches: Biased Workload
Biased workload, query boundaries cover around 1% of the data domain
22
Comparison with Other Approaches: Uniform Workload
Uniform workload, query boundaries cover around 1% of the data set tuples.
23
Convergence with Workload
Biased workload
24
Handling Data Set Updates
From Gaussian to Zipfian data distributions.
25
Other Experiments Varying:
– data skew.– data dimensionality.– histogram size.– workload generation parameters.– number of attributes in queries.
Overhead for intercepting query results in Microsoft SQL Server 2000 is less than 8%.
STHoles lead to robust selectivity estimates across data distributions and workloads.
See full paper for details!
26
Summary: STHoles, a Multidimensional Workload-Aware Histogram
Exploits query feedback. Built without examining data set. Allows bucket nesting to capture complex
shapes using only rectangular bucket structures.
Results in robust and accurate selectivity estimations.
In many cases, outperforms the best techniques that access full data sets.
27
Related Work (Histograms)
Unidimensional:– EquiDepth [Piatetsky-Shapiro and Connell 1984]– MaxDiff [Poosala et al. 1996]– V-Optimal [Jagadish et al. 1998]– Many more!
Multidimensional:– EquiDepth [Muralikrishna and DeWitt 1988]– MHist [Poosala and Ioannidis 1997]– GenHist [Gunopulos et al. 2000]– STGrid [Aboulnaga and Chaudhuri 1999]
28
Related Work (Other Techniques)
Sampling [Olken and Rotem 1990] Wavelets [Matias et al. 1997] Discrete transformations [Lee et al.
1999] Parametric Curve Fitting [Chen and
Roussopoulos 1994]
29
Evaluation Metric
Absolute Error:
Normalized Absolute Error:
30
Overhead Evaluation over Microsoft SQL Server 2000
31
Varying Histogram Size
Gaussian Data SetZipfian Data SetCensus Data Set
32
Varying Spatial Selectivity
Gaussian Data SetZipfian Data SetCensus Data Set