CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP-BY QUERIES
Swarup Acharya, Phillip Gibbons, Viswanath Poosala
(Information Sciences Research Center, Bell Labs, New Jersey)
Divya Rao
Outline
Introduction
Background
Aqua System
Problem Formulation
Solutions
Query Rewriting Strategies
Experiments
Conclusions
Introduction
Group-by queries: among the most important classes of queries in decision support systems.
Congressional samples: a hybrid union of uniform and biased samples.
Goal: propose techniques for obtaining fast, highly accurate answers to group-by queries.
Background
Uniform random sampling is not effective for group-by queries.
Ex: A group-by query on the US Census database to determine the per-capita income of every state.
Group sizes differ hugely: California has about 70 times the population of Wyoming.
Accuracy for a group depends heavily on the number of sample tuples that belong to it, so groups with few tuples get much less accurate answers than the larger groups.
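The effect can be seen with simple arithmetic. The following is a minimal sketch; the group sizes and the 1% sampling percentage are made-up illustration values (only the 70:1 ratio comes from the slide), and `expected_sample_counts` is a hypothetical helper.

```python
# Hypothetical group sizes: one group has 70 times the tuples of the other.
group_sizes = {"California": 70_000, "Wyoming": 1_000}
sample_percent = 1  # assumed uniform sampling percentage


def expected_sample_counts(sizes, percent):
    """Expected number of sample tuples per group under uniform sampling."""
    return {group: n * percent / 100 for group, n in sizes.items()}


expected = expected_sample_counts(group_sizes, sample_percent)
for group, count in expected.items():
    print(f"{group}: about {count:.0f} sample tuples expected")
```

Under these assumed numbers the small group is estimated from only a handful of tuples, which is why its answer is far less reliable than the large group's.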
Background
Uniform random sampling is appropriate only when the utility of the data to the user mirrors the data distribution.
Multi-table queries: different data may be equally represented while their utility to the user is skewed.
Ex: Data warehouses where the usefulness of data degrades with time.
The sample should then contain more tuples from recent data, which a uniform random sample over the entire warehouse cannot provide.
Biased Sampling
Use precomputed biased samples to address the limitations of uniform sampling
Advantages of using precomputed biased samples:
Queries can be answered without accessing the original data at query run time
Storing the samples in disk blocks avoids the overhead of random scanning
Disadvantage: A precomputed sample must be committed to before the query is seen. Hence it is not suitable for user-controlled progressive refinement.
Aqua System
Aqua is an efficient decision support system providing approximate answers to queries
Aqua System(Contd.)
Aqua is a middleware tool that can sit atop any DBMS managing a data warehouse
Aqua maintains statistical summaries of the data, called synopses, and uses them to answer queries
Aqua provides probabilistic error/confidence bounds on the answer
Problem Formulation
The main aim is to provide accurate answers to group-by queries in an approximate query answering system.
Let ci and ci' be the exact and approximate aggregate values for group gi. The error εi for the group is the percentage relative error in the estimate of ci:
εi = |ci – ci'| / ci * 100
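As a small illustration of this error measure, the sketch below computes the percentage relative error per group. The function name and the per-group values are hypothetical, and the absolute difference is used so that the error is non-negative (an assumption if a signed error is intended).

```python
def percentage_relative_error(exact, approx):
    """Percentage relative error of an approximate aggregate for one group."""
    if exact == 0:
        raise ValueError("relative error is undefined when the exact value is 0")
    return abs(exact - approx) / abs(exact) * 100


# Hypothetical exact and approximate aggregate values per group.
exact_by_group = {"g1": 200.0, "g2": 50.0}
approx_by_group = {"g1": 190.0, "g2": 56.0}

errors = {
    group: percentage_relative_error(exact_by_group[group], approx_by_group[group])
    for group in exact_by_group
}
for group, err in errors.items():
    print(f"{group}: {err:.1f}% relative error")
```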
Solutions
Theorem: Divide the sample space X equally among the groups and take a uniform random sample within each group.
Map this theorem to various classes of group-by queries with arbitrary mixes of groupings.
Ex: US Congress
HOUSE
SENATE
House
The House has representatives from each state proportional to the state's population
Applying theorem T to the House we have,
For a given aggregate operation, the quality of approximate answers increases with the query selectivity
Answers to the queries with the same aggregate and equal selectivities will typically have similar quality guarantees.
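A minimal sketch of a House-style allocation, assuming it means splitting the sample space X among the groups in proportion to their sizes (which is what a uniform sample gives in expectation). The function name and group sizes are hypothetical.

```python
def house_allocation(group_sizes, sample_space):
    """Split the sample space among groups in proportion to group size."""
    total = sum(group_sizes.values())
    return {g: sample_space * n / total for g, n in group_sizes.items()}


# Hypothetical group sizes (number of tuples per group).
group_sizes = {"A": 7000, "B": 2000, "C": 1000}
house = house_allocation(group_sizes, sample_space=100)
for group, share in house.items():
    print(f"{group}: {share:.1f} sample tuples")
```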
Senate
The Senate has an equal number of representatives from each state
Applying the theorem to the Senate we have,
Each group in the sample will have at least as many sample points as any other group in the entire sample
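A minimal sketch of a Senate-style allocation, assuming it means giving every group the same share of the sample space X regardless of its size. The function name and group sizes are hypothetical, and the sketch assumes every group has at least X/m tuples to draw from.

```python
def senate_allocation(group_sizes, sample_space):
    """Give every group an equal share of the sample space."""
    num_groups = len(group_sizes)
    return {g: sample_space / num_groups for g in group_sizes}


# Hypothetical group sizes (number of tuples per group).
group_sizes = {"A": 7000, "B": 2000, "C": 1000}
senate = senate_allocation(group_sizes, sample_space=90)
for group, share in senate.items():
    print(f"{group}: {share:.1f} sample tuples")
```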
Problems with House and Senate
Using samples from the House would result in very few sample points for smaller groups
The Senate allocates fewer tuples to the larger groups than the House does
Hence we have another technique called “Congress”: collect both the House and the Senate samples
Basic Congress
Apply the theorem to two classes of aggregate queries: those with group-bys on a set of attributes and those with no group-bys at all.
Collect both the House and the Senate samples. Together they can take up to twice the sample space X, so reduce the combined sample by a factor of up to 2.
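One way to read Basic Congress is sketched below. It assumes, which the slide does not spell out, that each group keeps the larger of its House and Senate shares and that the result is then scaled down so the total fits the sample space X. The function name and group sizes are hypothetical.

```python
def basic_congress_allocation(group_sizes, sample_space):
    """Per group, take max(House share, Senate share), then rescale to fit."""
    total = sum(group_sizes.values())
    num_groups = len(group_sizes)
    best = {
        g: max(sample_space * n / total,       # House: proportional to size
               sample_space / num_groups)      # Senate: equal per group
        for g, n in group_sizes.items()
    }
    # The combined shares can exceed the sample space (by at most 2x), so rescale.
    scale = sample_space / sum(best.values())
    return {g: share * scale for g, share in best.items()}


# Hypothetical group sizes (number of tuples per group).
group_sizes = {"A": 7000, "B": 2000, "C": 1000}
allocation = basic_congress_allocation(group_sizes, sample_space=100)
for group, share in allocation.items():
    print(f"{group}: {share:.1f} sample tuples")
```

With these assumed sizes the small groups receive more than their House share and the large group more than its Senate share, which is the compromise the slide describes.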
Congress
For the sample space X, the final sample size allocated to each group g is given by
[formula not preserved in the transcript]
where the expected sample space allocated to g is
[formula not preserved in the transcript]
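The allocation formulas are missing from this transcript, so the sketch below is not a reconstruction of them. It assumes one plausible generalization of Basic Congress: for every subset T of the grouping attributes, split X equally among the groups under T, subdivide each share among the finer groups in proportion to their sizes, keep the per-group maximum over all T, and rescale to X. The function name and group sizes are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def congress_allocation(group_sizes, num_attrs, sample_space):
    """group_sizes maps a tuple of attribute values (finest grouping) to a tuple count."""
    best = defaultdict(float)
    for k in range(num_attrs + 1):
        for attrs in combinations(range(num_attrs), k):
            # Sizes of the coarser groups obtained by grouping on `attrs` only.
            coarse = defaultdict(int)
            for key, n in group_sizes.items():
                coarse[tuple(key[i] for i in attrs)] += n
            share = sample_space / len(coarse)  # equal split among coarse groups
            for key, n in group_sizes.items():
                parent = tuple(key[i] for i in attrs)
                # Subdivide the parent's share in proportion to group size.
                best[key] = max(best[key], share * n / coarse[parent])
    total = sum(best.values())
    return {key: sample_space * v / total for key, v in best.items()}


# Hypothetical sizes for groups over two grouping attributes.
sizes = {("a1", "b1"): 3000, ("a1", "b2"): 3000,
         ("a1", "b3"): 1500, ("a2", "b3"): 2500}
congress = congress_allocation(sizes, num_attrs=2, sample_space=100)
for key, share in congress.items():
    print(key, round(share, 1))
```

In this sketch the empty subset reproduces the House allocation and the full attribute set reproduces the Senate allocation, so Basic Congress is the special case that considers only those two.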
Query rewriting
Scaling up the aggregate expressions
Deriving error bounds on the estimate
Generating unbiased answers using tuples in the biased sample:
The scale factor is the inverse of the sampling rate
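A minimal sketch of scaling, assuming each group was sampled at its own rate and each sample tuple is weighted by the inverse of that rate. The sampling rates, sample tuples, and function name are hypothetical.

```python
# Hypothetical per-group sampling rates of a biased sample.
sampling_rate = {"CA": 0.01, "WY": 0.5}

# Hypothetical sample tuples: (group, value).
sample = [("CA", 100.0), ("CA", 300.0), ("WY", 40.0), ("WY", 60.0), ("WY", 50.0)]


def estimate_sum_and_count(sample, sampling_rate):
    """Estimate SUM and COUNT over the full data by weighting each tuple."""
    est_sum = 0.0
    est_count = 0.0
    for group, value in sample:
        scale_factor = 1.0 / sampling_rate[group]  # inverse of the sampling rate
        est_sum += value * scale_factor
        est_count += scale_factor
    return est_sum, est_count


est_sum, est_count = estimate_sum_and_count(sample, sampling_rate)
est_avg = est_sum / est_count
print(est_sum, est_count, est_avg)
```

Without the weights, the heavily sampled small group would dominate the estimate, which is why the scale factor is needed when the sample is biased.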
Rewriting Strategies
The key step in scaling is to efficiently associate each tuple with its corresponding scale factor
a) Store the scale factor with each tuple
i) Integrated Rewriting
ii) Nested-integrated Rewriting
b) Use a separate table to store the scale factor
iii) Normalized Rewriting
iv) Key-normalized Rewriting
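The difference between options (a) and (b) can be sketched with an in-memory SQLite database. Everything here is illustrative: the table and column names (`samp_int`, `samp_norm`, `aux_sf`, `grp`, `val`, `sf`) and the data are hypothetical, and the nested form assumes that Nested-integrated means aggregating per group and scale factor before multiplying. Key-normalized is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# (a) Integrated: the sample table stores the scale factor with each tuple.
cur.execute("CREATE TABLE samp_int (grp TEXT, val REAL, sf REAL)")
cur.executemany(
    "INSERT INTO samp_int VALUES (?, ?, ?)",
    [("CA", 100.0, 100.0), ("CA", 300.0, 100.0),
     ("WY", 40.0, 2.0), ("WY", 60.0, 2.0)],
)
integrated = cur.execute(
    "SELECT grp, SUM(val * sf) FROM samp_int GROUP BY grp ORDER BY grp"
).fetchall()

# Nested-integrated (assumed form): sum first, then one multiplication per inner row.
nested = cur.execute(
    "SELECT grp, SUM(s * sf) FROM "
    "(SELECT grp, sf, SUM(val) AS s FROM samp_int GROUP BY grp, sf) "
    "GROUP BY grp ORDER BY grp"
).fetchall()

# (b) Normalized: scale factors live in a separate table and are joined in.
cur.execute("CREATE TABLE samp_norm (grp TEXT, val REAL)")
cur.execute("CREATE TABLE aux_sf (grp TEXT PRIMARY KEY, sf REAL)")
cur.execute("INSERT INTO samp_norm SELECT grp, val FROM samp_int")
cur.execute("INSERT INTO aux_sf SELECT DISTINCT grp, sf FROM samp_int")
normalized = cur.execute(
    "SELECT s.grp, SUM(s.val * a.sf) FROM samp_norm s "
    "JOIN aux_sf a ON s.grp = a.grp GROUP BY s.grp ORDER BY s.grp"
).fetchall()

print(integrated, nested, normalized)
conn.close()
```

All three queries return the same scaled sums; they differ in where the scale factor is stored and in how many joins and multiplications the query needs.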
Experiments
Experimental Testbed: Aqua system with Oracle v7 as the back-end DBMS
Experimental Parameters
Parameter               Range of values   Default value
Table size (T)          100K-6M tuples    1M
Sample percentage (SP)  1%-75%            7%
Num. groups             10-200K           1000
Group-size skew (z)     0-1.5             0.86
Experiment(Contd.)
Study to identify a scheme that can provide consistently good performance
Performance of various allocation strategies
Performance of Different query sets:
Queries with no group-bys: House performs well
Queries with three group-bys: Senate has low errors
Queries with two group-bys: Both Senate and House perform poorly in this case
Congress performs consistently at or close to the best for queries of all types. The other techniques perform well only in a limited part of the spectrum
Performance of different sample sizes:
The errors in Congress drop as the sample size increases
Performance of group count:
Integrated and Nested-integrated perform better than Normalized and Key-normalized due to the absence of a join operation
Nested-integrated performs better than Integrated due to significantly fewer multiplications.
Conclusions
Demonstrated that uniform samples are not enough to accurately answer all group-by queries
Proposed new techniques based on biased sampling
Introduced congressional sampling and validated the sampling strategies experimentally, both for the accuracy of their estimates for group-by queries and for their execution efficiency
All the techniques have been incorporated into the Aqua System.
Questions??