DAQ: A New Paradigm for Approximate Query Processing Navneet Potti Jignesh Patel VLDB 2015


DAQ: A New Paradigm for Approximate Query Processing

Navneet Potti, Jignesh Patel

VLDB 2015


Outline

• Approximate Query Processing
• SAQ
• DAQ
• Bitwise DAQ Scheme
• Evaluation
• Conclusion


Approximate Query Processing

Data volume is growing exponentially

Queries are interactive to support real-time decisions

Decisions are resilient to small errors

Quick Approximate Answer is better than Slow Exact Answer

Exploratory analysis demands responsiveness

e.g., Average Revenue estimate of $12M is about as good as $12,345,678


SAQ: Sampling-based Approximate Querying

• Run query on a small random subset of data

• Error in estimate presented as confidence interval
  Avg revenue = $12.3 ± 0.1 million with 95% confidence

• Can be “online”
  – error eventually shrinks to zero => exact estimate
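The sampling idea above can be sketched in a few lines. This is only an illustration, not the authors' implementation: the function name `saq_average`, the normal approximation (z = 1.96 for roughly 95% confidence) and the finite-population correction are assumptions made for this example. With that correction, the half-width reaches zero once the sample covers all the data, which mirrors the "online" behaviour.

```python
import math
import random
import statistics


def saq_average(values, sample_size, z=1.96, seed=0):
    """Estimate AVG(values) from a uniform random sample.

    Returns (mean, half_width) for an approximate confidence interval.
    Assumes 2 <= sample_size <= len(values).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)  # sampling without replacement
    mean = statistics.fmean(sample)
    # Standard error of the sample mean (normal approximation).
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    # Finite-population correction: shrinks to 0 when the sample is the whole table.
    n_total = len(values)
    fpc = math.sqrt((n_total - sample_size) / (n_total - 1))
    return mean, z * std_err * fpc


revenues = list(range(1000))  # hypothetical data
for size in (10, 100, 1000):
    print(size, saq_average(revenues, size))
```

Note that the interval is still probabilistic for every sample size below the full table: the true average may fall outside it.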


Confidence Intervals

• Avg revenue = $12.3 ± 0.1 million with 95% confidence

What does this mean?

[Figure: Probability Distribution of Average Revenue, with 95% of the probability between 12.2 and 12.4 (centered at 12.3) and 5% in the tails]

With 95% probability, true average revenue lies in 12.3 ± 0.1 million

With 5% probability, true average revenue lies outside 12.3 ± 0.1 million


Confidence Intervals

• e.g., Avg revenue = $12.3 ± 0.1 million with 95% confidence

How should we interpret the tails?

[Figure: Probability Distribution of Average Revenue, with 95% of the probability between 12.2 and 12.4 (centered at 12.3) and 5% in the tails]

Tails occur 5% of the time. In Avg Regional Revenue for 100 states, about 5 estimates are expected to be in the tails.

There is no bound on error in the tail. The Avg Revenue could be as small as $1M or as large as $100M.


SAQ: Shortcomings

[Figure: confidence interval from 12.2 to 12.4 around 12.3, with 95% inside and 5% in the tails]

• Semantics of the tails of confidence intervals are hard to interpret.
  Confidence interval bounds are unintuitive.

• Intervals are very broad for outlier aggregates like MAX or Top 100.
  Need to see more of the data to find outliers. Slow convergence.

• Intervals are hard to manipulate. No closed algebra.
  Is 100 ± 10 “greater than” 90 ± 20? How do we add these intervals?


DAQ: Deterministic Approximate Querying

Pop quiz! Estimate the sum.

    5,321,656
  + 3,151,709
  + 1,362,296
  ___________
  = ? ? ? ? ?

a. Approximately 2.4 million?

b. Approximately 9.7 million?

c. Approximately 13.8 million?

d. Approximately 17.0 million?

= 9,???,???
= 9,7??,???
= 9,83?,???
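The quiz can be reproduced with a small sketch that reads only the leading decimal digits of each addend and still returns guaranteed bounds on the sum. The helper name `digit_prefix_bounds` and its parameters are hypothetical, chosen for this illustration.

```python
def digit_prefix_bounds(values, total_digits, k):
    """Bound sum(values) using only the k most significant decimal digits.

    Assumes every value is a non-negative integer with at most total_digits digits.
    """
    unit = 10 ** (total_digits - k)  # weight of the last digit that is read
    # Lower bound: treat every unread digit as 0.
    lower = sum((v // unit) * unit for v in values)
    # Upper bound: treat every unread digit as 9.
    upper = lower + len(values) * (unit - 1)
    return lower, upper


addends = [5_321_656, 3_151_709, 1_362_296]
for k in (1, 2, 3, 7):
    print(k, digit_prefix_bounds(addends, total_digits=7, k=k))
# 1 (9000000, 11999997)
# 2 (9700000, 9999997)
# 3 (9830000, 9859997)
# 7 (9835661, 9835661)
```

The lower bounds match the progressive answers on the slide (9, 9.7, 9.83 million), and the interval collapses to the exact sum once all digits are read.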


DAQ: Deterministic Approximate Querying

• Use deterministic intervals instead of probabilistic (confidence) intervals

• Guaranteed upper and lower bounds
  Avg revenue = $12.3 ± 0.2 million

• Can be “online”
  – Error interval eventually becomes degenerate => exact estimate

[Figure: DAQ UI Mockup]


SAQ vs DAQ (at a glance)

SAQ: Complex semantics using confidence intervals due to the “tail”.
DAQ: Simple semantics using deterministic intervals as there is no “tail”.

SAQ: Slow for outlier aggregates like MAX or Top 100 and heavy-tailed data.
DAQ: Fast for outlier aggregates like MAX or Top 100 and heavy-tailed data.

SAQ: No closed algebra. No clear semantics for predicates and arithmetic operations on estimates.
DAQ: Closed relational algebra. Clear semantics for predicates and arithmetic ops using interval algebra.




Conceptual DAQ Scheme

• Hierarchically partition the attribute’s domain
• Estimates are represented as intervals [a,b]
• e.g., Count B-Tree
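One way to picture the conceptual scheme is a binary partitioning of the domain where each level stores only a count per partition. The sketch below is a simplification of that idea, not a real count B-tree: the function name `avg_bounds_from_counts`, the power-of-two domain and the equal-width partitions are all assumptions made for this example.

```python
from collections import Counter


def avg_bounds_from_counts(values, domain_size, depth):
    """Bound AVG(values) from per-partition counts at one level of the hierarchy.

    Assumes values are integers in [0, domain_size), domain_size is a power
    of two, and 0 <= depth <= log2(domain_size).
    """
    width = domain_size >> depth  # each level halves the partition width
    counts = Counter(v // width for v in values)  # what a count index would store
    n = len(values)
    # Every value in partition p lies in [p * width, p * width + width - 1].
    lower = sum(c * (p * width) for p, c in counts.items()) / n
    upper = sum(c * (p * width + width - 1) for p, c in counts.items()) / n
    return lower, upper


data = [3, 10, 12, 7]  # hypothetical attribute values, domain [0, 16)
for depth in range(5):
    print(depth, avg_bounds_from_counts(data, domain_size=16, depth=depth))
# 0 (0.0, 15.0)
# 1 (4.0, 11.0)
# 2 (6.0, 9.0)
# 3 (7.5, 8.5)
# 4 (8.0, 8.0)
```

Each additional level halves the width of the interval [a,b], and the estimate becomes exact at the leaves.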


Interval Algebra

• Predicate evaluation
• Interval representation for relations

City        Est. Population
Shire       [110,120]
Rivendell   [ 70, 90]
Gondor      [ 80,120]

Which cities have population > 100?

Legend: Certainly > 100, Potentially > 100
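The predicate on the city table, and the earlier question of how to add two interval estimates, can be sketched with a small interval type. The class and method names are hypothetical, and treating the "potentially" answer as a superset of the "certainly" answer is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        # The sum of two bounded quantities is bounded by the sums of the bounds.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def certainly_gt(self, x):
        # Even the smallest possible value exceeds x.
        return self.lo > x

    def potentially_gt(self, x):
        # At least the largest possible value exceeds x.
        return self.hi > x


cities = {
    "Shire": Interval(110, 120),
    "Rivendell": Interval(70, 90),
    "Gondor": Interval(80, 120),
}
certain = [c for c, pop in cities.items() if pop.certainly_gt(100)]
potential = [c for c, pop in cities.items() if pop.potentially_gt(100)]
print(certain)    # ['Shire']
print(potential)  # ['Shire', 'Gondor']

# 100 ± 10 plus 90 ± 20, written as deterministic intervals
print(Interval(90, 110) + Interval(70, 110))  # Interval(lo=160, hi=220)
```

Because every operator consumes intervals and produces intervals (or a certain/potential pair of result sets), results can be fed into further operators.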


Formal Definition

• Interval estimates for attributes and relations

• Operators consume and produce intervals

• Closed relational algebra

• Online DAQ scheme
  – monotonically shrinking intervals
  – converges to exact estimate


Bitwise DAQ Scheme

• Similar to the decimal digit-wise sum example
• Uses Bitsliced Index representation


Bitwise DAQ Scheme

• Use most significant m bits for evaluation

• Remaining n-m bits set to all-0 and all-1 for bounds

• Error bound decreases exponentially: 2^(n-m)
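For a single n-bit value, the three bullets above amount to the following sketch. The function name `value_bounds` is hypothetical, and unsigned integers are assumed.

```python
def value_bounds(v, n_bits, m):
    """Bound an unsigned n_bits-wide value using only its m most significant bits."""
    shift = n_bits - m
    prefix = v >> shift                   # the m bits that are actually read
    lower = prefix << shift               # remaining n - m bits set to all-0
    upper = lower | ((1 << shift) - 1)    # remaining n - m bits set to all-1
    return lower, upper


v = 0b10110110  # 182, a hypothetical 8-bit value
for m in (0, 3, 8):
    print(m, value_bounds(v, n_bits=8, m=m))
# 0 (0, 255)
# 3 (160, 191)
# 8 (182, 182)
```

The interval width is 2^(n-m) - 1, so each extra bit read halves the remaining uncertainty.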


Bitwise DAQ Scheme: Algorithms

Average aggregation up to m bits
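The slide's algorithm listing is not preserved in this transcript. The following is a minimal sketch of how an average could be bounded from the m most significant slices of a bit-sliced index, and it is not necessarily the algorithm shown on the slide. Bitmaps are modelled as Python integers, and the helper names are hypothetical.

```python
def build_bit_slices(values, n_bits):
    """Return slices where bit i of slices[j] is set iff bit j of values[i] is set."""
    slices = [0] * n_bits
    for i, v in enumerate(values):
        for j in range(n_bits):
            if (v >> j) & 1:
                slices[j] |= 1 << i
    return slices


def avg_bounds_bitwise(slices, n_rows, m):
    """Bound AVG over all rows using only the m most significant bit slices."""
    n_bits = len(slices)
    # Each set bit in slice j contributes 2**j to the sum.
    seen = sum(bin(slices[j]).count("1") << j for j in range(n_bits - m, n_bits))
    # Unread low bits: all-0 gives the lower bound, all-1 the upper bound.
    slack = (1 << (n_bits - m)) - 1
    lower = seen / n_rows
    return lower, lower + slack


values = [3, 10, 12, 7]  # hypothetical 4-bit column
slices = build_bit_slices(values, n_bits=4)
for m in range(5):
    print(m, avg_bounds_bitwise(slices, len(values), m))
# 0 (0.0, 15.0)
# 1 (4.0, 11.0)
# 2 (6.0, 9.0)
# 3 (7.5, 8.5)
# 4 (8.0, 8.0)
```

Only m population counts are needed, one per slice read, regardless of how many bits the column has.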


Bitwise DAQ vs. Baseline

Exponentially decreasing error bounds in estimating Avg


Bitwise DAQ Scheme: Algorithms

Less Than predicate up to m bits
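The slide's listing for this algorithm is likewise missing. The sketch below follows the general shape of a bit-sliced comparison, scanning slices from the most significant bit downwards and stopping after m of them. It is an assumption about how such a predicate could work, not a transcription of the slide. Bitmaps are Python integers and all names are hypothetical.

```python
def build_bit_slices(values, n_bits):
    """Return slices where bit i of slices[j] is set iff bit j of values[i] is set."""
    slices = [0] * n_bits
    for i, v in enumerate(values):
        for j in range(n_bits):
            if (v >> j) & 1:
                slices[j] |= 1 << i
    return slices


def less_than(slices, n_rows, c, m):
    """Evaluate `value < c` using only the m most significant bit slices.

    Returns (certain, potential) row bitmaps: rows that are certainly < c,
    and rows that may still turn out to be < c once more bits are read.
    """
    n_bits = len(slices)
    lt = 0                        # rows already known to be < c
    eq = (1 << n_rows) - 1        # rows whose bits read so far equal those of c
    for j in range(n_bits - 1, n_bits - m - 1, -1):
        if (c >> j) & 1:
            lt |= eq & ~slices[j]   # row has 0 where c has 1: certainly smaller
            eq &= slices[j]
        else:
            eq &= ~slices[j]        # row has 1 where c has 0: certainly not smaller
    return lt, lt | eq


values = [3, 10, 12, 7]  # hypothetical 4-bit column
slices = build_bit_slices(values, n_bits=4)
for m in (1, 2, 4):
    certain, potential = less_than(slices, len(values), c=11, m=m)
    print(m, format(certain, "04b"), format(potential, "04b"))
```

In the printed bitmaps the rightmost bit is row 0. After m bits the two bitmaps may differ; once all n bits are read the `eq` rows equal c exactly, so they should be dropped from the potential set to obtain the exact answer.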


Bitwise DAQ vs. Baseline

Predicate evaluation: 6x speedup using 8 bits for < 1% error


Bitwise DAQ vs. Baseline

Top 100: 3.5x speedup for < 1% error on Uniform, Zipf data


Bitwise DAQ vs. SAQ

Top 100: DAQ performs better for heavy-tailed data (Zipf)


Bitwise DAQ: Summary

Excels at
• Predicates
• Outlier aggregates
• Heavy-tailed data

Suffers for
• Simpler aggregates (sum, avg)
• Uniform data

• Compression improves performance further
• Can operate directly on compressed bitvectors
• > 8x speedup for Top 100 on Zipf data

• Embarrassingly parallelizable
• NUMA-aware parallelization


Future Work

• B-tree like indices

• Extension to other operators (Group By, Join)

• Extension to other data types

• Query optimization

• Combining SAQ and DAQ