Scalable Approximate Query Processing through Scalable Error Estimation Kai Zeng UCLA Advisor: Carlo Zaniolo 1


Scalable Approximate Query Processing through Scalable Error Estimation

Kai Zeng, UCLA

Advisor: Carlo Zaniolo


Why Approximate Query Processing?

• AQP is critical for massive data
– Ever-growing size of big data
– Need for timely and cost-effective analysis
– Widely applied
  • RDBMSs (e.g., online aggregation)
  • MapReduce systems (e.g., BlinkDB)
  • Data stream systems (load shedding)


Sampling & Quality Assessment

• Sampling: widely used in AQP
• Error estimation: fundamental in AQP
– Analytic error estimation
– Bootstrap

[Figure: Massive Data → sample (6, 2, 7, 8, 5, 1, 3, 4, 9, 10) → AVG → approximate mean 5.5]

Need to assess the quality! What is the error of this approximate mean?


Analytic Error Estimation

• Use closed-form formulas
• Pro: very fast
• Con: restricted to simple aggregates

[Figure: Massive Data → sample (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) → query: AVG → approximate mean 5.5; collect the number of tuples and the variance, then apply the Central Limit Theorem]

What if I want to estimate?
1. Complex SQL queries
2. Data mining tasks
3. …
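For the simple AVG case on this slide, the closed-form estimate can be sketched as follows. This is a minimal illustration rather than code from the talk: the function name `clt_confidence_interval` and the choice of z = 1.96 (a 95% normal interval) are assumptions, and the finite-population correction is ignored.

```python
import math
import statistics


def clt_confidence_interval(sample, z=1.96):
    """Approximate confidence interval for the mean via the Central Limit Theorem.

    Only two statistics are collected from the sample: the number of tuples
    and the variance. z=1.96 (about 95% coverage) is an assumed default.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    variance = statistics.variance(sample)  # unbiased sample variance (n - 1)
    half_width = z * math.sqrt(variance / n)
    return mean, mean - half_width, mean + half_width


sample = [6, 2, 7, 8, 5, 1, 3, 4, 9, 10]
mean, low, high = clt_confidence_interval(sample)
print(f"approx. mean = {mean:.2f}, interval = [{low:.2f}, {high:.2f}]")
```

The formula is cheap because it needs a single pass over the sample, but it only applies when such a closed form exists for the query, which is the limitation the slide points out.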


Bootstrap [Efron 1979]

• Resample with replacement from the sample
• Run the query on the resample
• Repeat many times, typically 100s or even 1000s of times

[Figure: the sample (6, 2, 7, 8, 5, 1, 3, 4, 9, 10), with sample mean 5.5, is resampled into resamples of the same size, e.g. (2, 10, 10, 5, 9, 2, 5, 10, 8, 10) and (9, 10, 2, 10, 7, 1, 3, 6, 10, 10), ……; query: AVG is run on each resample and the results (7.1, 6.8, ……) are collected]


• Compute the error from the empirical distribution of all the query results

[Figure: empirical distribution of the query results, with a 95% interval marked]
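The bootstrap procedure described on the last two slides can be sketched as below. This is an illustrative sketch, not the author's implementation: `bootstrap_interval`, its parameters, and the use of the percentile method to read off the 95% interval are assumptions.

```python
import random
import statistics


def bootstrap_interval(sample, query, trials=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for an arbitrary black-box query.

    Each trial resamples with replacement to the same size as the sample,
    runs the query, and the interval is read from the sorted results.
    """
    rng = random.Random(seed)
    n = len(sample)
    results = sorted(query(rng.choices(sample, k=n)) for _ in range(trials))
    alpha = (1.0 - level) / 2.0
    low = results[int(alpha * trials)]
    high = results[min(trials - 1, int((1.0 - alpha) * trials))]
    return low, high


sample = [6, 2, 7, 8, 5, 1, 3, 4, 9, 10]
low, high = bootstrap_interval(sample, statistics.fmean)
print(f"95% bootstrap interval for AVG: [{low:.2f}, {high:.2f}]")
```

Because `query` is just a callable, the same function works for a median, a UDF, or any other statistic, at the cost of running the query `trials` times.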


Notes on Bootstrap

• Bootstrap treats Q as a black box
• Can handle (almost) arbitrarily complex queries, including UDFs!

• Embarrassingly parallel

• Computationally demanding

• Uses too many resources


Error Estimation

• Analytic error estimation
– Fast but limited to simple aggregates

• Bootstrap (Monte Carlo simulation)
– Expensive but general

Fast and General?


How To Make Bootstrap Faster

• Optimize the Monte Carlo simulation process
– EARL system [VLDB12][ICDE13]

• Bypass the Monte Carlo simulation process
– Analytical Bootstrap Method (ABM) [SIGMOD14]


EARLY ACCURATE RESULT LIBRARY (EARL PROJECT)


Motivation

• Existing systems (e.g., Hadoop) use batch processing
– High latency
– Waste of resources

• Goal: a general driver that can
– Return approximate results
– With accuracy guarantees
– For a wide range of tasks


Incremental Computation

• A small sample → a larger sample → ……
• Use bootstrap to test accuracy
• Time efficient: enables early returns
• Resource efficient: does not waste resources

[Figure: Massive Data → sample → bootstrap → "Accurate enough?"; if not, enlarge the sample and repeat ……]
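The enlarge-and-test loop on this slide can be sketched as follows. This is a naive illustration under stated assumptions, not EARL itself: the names `bootstrap_rel_error` and `incremental_estimate`, the doubling growth factor, the relative-standard-error stopping rule, and the synthetic data are all hypothetical, and none of EARL's intra- or inter-iteration optimizations are applied.

```python
import random
import statistics


def bootstrap_rel_error(sample, query, rng, trials=200):
    """Relative standard error of query(sample), estimated by bootstrap.

    Assumes the query result on the sample is non-zero.
    """
    n = len(sample)
    estimates = [query(rng.choices(sample, k=n)) for _ in range(trials)]
    return statistics.stdev(estimates) / abs(query(sample))


def incremental_estimate(data, query, target=0.01, start=100, growth=2, seed=0):
    """Grow the sample until the bootstrap error estimate meets the target."""
    rng = random.Random(seed)
    order = list(data)
    rng.shuffle(order)  # any prefix of a shuffled list is a uniform sample
    size = min(start, len(order))
    while True:
        sample = order[:size]
        err = bootstrap_rel_error(sample, query, rng)
        if err <= target or size == len(order):
            return query(sample), err, size  # early return
        size = min(size * growth, len(order))  # enlarge the sample


# Synthetic stand-in for the massive data set (an assumption for the demo).
data_rng = random.Random(1)
data = [data_rng.gauss(100, 20) for _ in range(20000)]
estimate, rel_err, used = incremental_estimate(data, statistics.fmean)
print(f"estimate={estimate:.2f}, rel. error={rel_err:.4f}, tuples used={used}")
```

The loop recomputes every bootstrap from scratch at each sample size, which is exactly the redundant work the optimizations on the following slides aim to avoid.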


Basic Ideas: Optimization

• Intra-iteration optimization
– We have to repeat the same computation on all resamples
– Much of the data is shared!
– Compute the shared part once

[Figure: one iteration applies f to S, S1, S2, ……; the inputs are divided into a shared part and a non-shared part]


Basic Ideas: Optimization

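• Inter-iteration optimization
– Reuse the old computation
– Cannot simply merge, because of randomness
– Keep a small sample in memory for adjustment

[Figure: one iteration on S yields resamples S1, S2, ……; the next iteration, on S and ∆S, yields S1′, S2′, ……; S1′ is built from S1 and ∆S1, and the adjustment is small]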


ANALYTICAL BOOTSTRAP


Analytical Bootstrap

• Scope: relational algebra
  σ (selection), π (projection), ⋈ (join), γ (aggregate)

• Basic idea
– Annotate tuples with random variables (the number of times a tuple will be drawn in a bootstrap trial)
– Extend relational algebra to manage these random variables

A single-round evaluation = 100s/1000s of bootstrap trials!


Bootstrap Resamples As Multiset DB

• Bootstrap generates multiset relations
– Tuples annotated with multiplicities
– Query processing manipulates these multiplicities

Sample:

ID  Product  Qty
1   A        2
2   B        3
3   A        2
4   A        4

Resamples (……):

ID  Product  Qty
1   A        2
2   B        3
2   B        3
4   A        4

ID  Product  Qty
1   A        2
2   B        3
4   A        4
4   A        4

ID  Product  Qty
1   A        2
3   A        2
3   A        2
4   A        4

The last resample as a multiset relation (# = multiplicity):

ID  Product  Qty  #
1   A        2    1
2   B        3    0
3   A        2    2
4   A        4    1
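The correspondence between a resample and a multiplicity column can be sketched as below. The helper names `resample_multiplicities` and `expand` are hypothetical, introduced only for illustration; the `orders` rows are the four tuples from the slide.

```python
import random
from collections import Counter

# (ID, Product, Qty): the four sample tuples from the slide.
orders = [(1, "A", 2), (2, "B", 3), (3, "A", 2), (4, "A", 4)]


def resample_multiplicities(rows, rng):
    """Draw len(rows) tuples with replacement; return how often each was drawn."""
    drawn = rng.choices(range(len(rows)), k=len(rows))
    counts = Counter(drawn)
    return [counts[i] for i in range(len(rows))]


def expand(rows, multiplicities):
    """Turn a multiplicity column back into an explicit resample."""
    return [row for row, m in zip(rows, multiplicities) for _ in range(m)]


rng = random.Random(0)
mult = resample_multiplicities(orders, rng)
print(mult, expand(orders, mult))

# The multiplicities (1, 0, 2, 1) from the slide give the last resample shown.
print(expand(orders, [1, 0, 2, 1]))
```

Storing only the multiplicity column avoids materializing each resample; the multiplicities always sum to the sample size.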


Querying Multiset DB: Projection

• Projection takes sum of multiplicities

ID  Product  Qty  #
1   A        2    1
2   B        3    0
3   A        2    2
4   A        4    1

Projected onto (Product, Qty):

Product  Qty  #
A        2    3    (1+2=3)
B        3    0
A        4    1

Example query: How many products are ordered by small quantity orders?

    SELECT Product, SUM(Qty)
    FROM Orders
    WHERE Qty < (SELECT SUM(Qty) / 4
                 FROM Orders)
    GROUP BY Product


Querying Multiset DB: Aggregate

• Aggregate takes weighted sum of multiplicities

ID  Product  Qty  #
1   A        2    1
2   B        3    0
3   A        2    2
4   A        4    1

SUM(Qty)  #
10        1

2×1 + 3×0 + 2×2 + 4×1 = 10
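The projection and aggregate rules on these two slides can be sketched on the example relation as follows. The function names `project` and `sum_qty` are hypothetical; the data is the multiset relation from the slides.

```python
from collections import defaultdict

# (ID, Product, Qty, multiplicity): the multiset relation from the slides.
orders = [(1, "A", 2, 1), (2, "B", 3, 0), (3, "A", 2, 2), (4, "A", 4, 1)]


def project(rows):
    """Project onto (Product, Qty): duplicates merge, multiplicities add up."""
    merged = defaultdict(int)
    for _id, product, qty, mult in rows:
        merged[(product, qty)] += mult
    return [(product, qty, mult) for (product, qty), mult in merged.items()]


def sum_qty(rows):
    """SUM(Qty) over the multiset: each Qty is weighted by its multiplicity."""
    return sum(qty * mult for _id, _product, qty, mult in rows)


print(project(orders))  # tuples 1 and 3 merge into ("A", 2) with 1 + 2 = 3
print(sum_qty(orders))  # 2*1 + 3*0 + 2*2 + 4*1
```

Both operators touch only the multiplicity column, so one pass over the sample stands in for one pass over an explicit resample.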


Querying Multiset DB: Join

• Join takes product of multiplicities

Product  Qty  #
A        2    3
B        3    0
A        4    1

SUM(Qty)  #
10        1

Joined:

Product  Qty  SUM(Qty)  #
A        2    10        3    (3×1=3)
B        3    10        0
A        4    10        1


Querying Multiset DB: Selection

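• Selection takes product of multiplicities (each multiplicity is multiplied by 1 if the tuple satisfies the predicate, 0 otherwise)

Before selection:

Product  Qty  SUM(Qty)  #
A        2    10        3
B        3    10        0
A        4    10        1

After selection (Qty < SUM(Qty) / 4):

Product  Qty  SUM(Qty)  #
A        2    10        3    (3×1=3)
B        3    10        0
A        4    10        0    (1×0=0)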
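The join and selection steps can be sketched on the intermediate relations from the slides. The names `cross_join` and `select_small` are hypothetical, and treating the selection predicate as a 0/1 factor is a reading of the slide's "3×1=3, 1×0=0" annotations.

```python
# (Product, Qty, multiplicity) after projection, and (SUM(Qty), multiplicity).
projected = [("A", 2, 3), ("B", 3, 0), ("A", 4, 1)]
total = [(10, 1)]


def cross_join(left, right):
    """Join every left tuple with every right tuple; multiplicities multiply."""
    return [(p, q, s, m1 * m2) for p, q, m1 in left for s, m2 in right]


def select_small(rows, divisor=4):
    """Keep Qty < SUM(Qty) / divisor by multiplying with a 0/1 indicator."""
    return [(p, q, s, m * int(q < s / divisor)) for p, q, s, m in rows]


joined = cross_join(projected, total)
selected = select_small(joined)
print(joined)
print(selected)
```

Keeping rejected tuples with multiplicity 0, rather than dropping them, mirrors the tables on the slide and keeps every operator a pure function of the multiplicity column.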


Bootstrap Resamples As Multiset DB

• Bootstrap generates multiset relations
– Tuples annotated with multiplicities
– Query processing manipulates these multiplicities


Random Variables on Probabilistic Multiset DB (PMDB)

• Multiset DB
– Tuples are annotated with multiplicities, one column per resample:

ID  Product  Qty  # (resample 1)  # (resample 2)  # (resample 3)
1   A        2    2               1               1
2   B        3    1               1               0
3   A        2    1               0               2
4   A        4    0               2               1

• Probabilistic Multiset DB
– Tuples are annotated with random variables: $(m_1, m_2, m_3, m_4) \sim \mathrm{Multinomial}(4, (0.25, 0.25, 0.25, 0.25))$
– Each of the 4 draws picks each tuple with probability 0.25

ID  Product  Qty  #
1   A        2    $m_1$
2   B        3    $m_2$
3   A        2    $m_3$
4   A        4    $m_4$

Similar to tossing coins


Querying PMDB

• Whenever we apply
– a sum to the multiplicity column: sum the annotated random variables
– a product to the multiplicity column: multiply the annotated random variables


Querying PMDB: Projection

• Projection takes convolution sum of multiplicities

ID  Product  Qty  #
1   A        2    $m_1$
2   B        3    $m_2$
3   A        2    $m_3$
4   A        4    $m_4$

Projected onto (Product, Qty):

Product  Qty  #
A        2    $m_1 + m_3$
B        3    $m_2$
A        4    $m_4$


From Theory To Practice

• Annotated random variables
– Marginal distribution

Joint: $(m_1, m_2, m_3, m_4) \sim \mathrm{Multinomial}(4, (0.25, 0.25, 0.25, 0.25))$
Marginal: $m_i \sim \mathrm{Multinomial}(4, (0.25, 0.75))$

Numeric form!

ID  Product  Qty  #: n  0     1
1   A        2    4     0.75  0.25
2   B        3    4     0.75  0.25
3   A        2    4     0.75  0.25
4   A        4    4     0.75  0.25
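The marginal used in the numeric form follows from a standard property of the multinomial distribution: each component on its own is binomial. The worked values below assume the slide's setting of 4 draws with probability 0.25 per tuple; reading the two-outcome multinomial as a binomial is standard, while the symbol names follow the joint distribution stated earlier.

```latex
\begin{align}
  m_i &\sim \mathrm{Binomial}(4,\ 0.25), \\
  \mathbb{E}[m_i] &= 4 \cdot 0.25 = 1, \\
  \mathrm{Var}(m_i) &= 4 \cdot 0.25 \cdot 0.75 = 0.75, \\
  \mathrm{Cov}(m_i, m_j) &= -4 \cdot 0.25 \cdot 0.25 = -0.25 \quad (i \neq j).
\end{align}
```

The negative covariance reflects the constraint $m_1 + m_2 + m_3 + m_4 = 4$: the multiplicities are not independent, which is why it matters whether two tuples depend on the same base tuple.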


Querying PMDB: an Example

Symbolic form:

ID  Product  Qty  #
1   A        2    $m_1$
2   B        3    $m_2$
3   A        2    $m_3$
4   A        4    $m_4$

Product  Qty  #
A        2    $m_1 + m_3$
B        3    $m_2$
A        4    $m_4$

Numeric form:

ID  Product  Qty  #: n  0     1
1   A        2    4     0.75  0.25
2   B        3    4     0.75  0.25
3   A        2    4     0.75  0.25
4   A        4    4     0.75  0.25

Product  Qty  #: n  0     1
A        2    4     0.5   0.5
B        3    4     0.75  0.25
A        4    4     0.75  0.25

Product  Qty  #: n  0     1
A        2    4     0.5   0.5

• : works with the numeric forms
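The merged row (A, 2) with numeric form n = 4, 0.5, 0.5 can be checked by brute force. The sketch below enumerates all 4^4 equally likely bootstrap resamples and compares the distribution of m1 + m3 with Binomial(4, 0.5); interpreting the numeric form as a binomial with success probability 0.25 + 0.25 is an assumption about the slide's notation.

```python
from collections import Counter
from itertools import product
from math import comb

n = 4  # sample size, and number of draws per bootstrap trial

# Each resample is a sequence of n draws of a tuple index 0..3 (IDs 1..4).
dist = Counter()
for draws in product(range(n), repeat=n):
    m1_plus_m3 = draws.count(0) + draws.count(2)  # tuples with ID 1 and ID 3
    dist[m1_plus_m3] += 1

total = n ** n
exact = {k: dist[k] / total for k in range(n + 1)}
binom = {k: comb(n, k) * 0.5 ** n for k in range(n + 1)}

for k in range(n + 1):
    print(k, exact[k], binom[k])
```

The two columns agree because tuples 1 and 3 are distinct base tuples, so their probabilities simply add; this is the disjointness condition that makes the numeric shortcut valid.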


Querying PMDB: an Example

• Correctness of
– Tuples projected are disjoint: they do not depend on the same base tuple
– Can be detected by functional dependency:

ID  Product  Qty  #
1   A        2    $m_1$
2   B        3    $m_2$
3   A        2    $m_3$
4   A        4    $m_4$

Product  Qty  #
A        2    $m_1 + m_3$
B        3    $m_2$
A        4    $m_4$


Querying PMDB in Numeric Form

• ABM is correct for queries with eligible plans
• A large subset of queries can be evaluated by ABM in DBPTIME
• Eligible plans can be tested at compile time

Functional Dependency Rules


Coverage of Various Techniques

Technique                   TPC-H   Conviva Log
Analytic error estimation   9/22    36.9%
ABM DBPTIME eligible        15/22   81.0%
ABM eligible                19/22   98.6%
ABM                         19/22   99.1%
Bootstrap                   19/22   99.1%

Over 6660 queries


EXPERIMENTAL EVALUATION


Experimental Setting

• Synthetic and real-life datasets and queries:
– TPC-H: 100 GB
– Skewed TPC-H: 1 GB
– Customer: 52 GB

• Compare relative error
– Of: mean, standard deviation, quantile, KS distance, confidence interval, existence probability
– Between: Analytical Bootstrap Method (ABM), bootstrap (BS), ground truth (GT)


Accuracy of ABM

Comparing the distributions given by ABM & bootstrap on quantiles & existence probability (1% sample)


ABM models Bootstrap accurately


Accuracy of ABM

Comparing user-defined measures given by ABM & bootstrap to ground truth (1% sample)

ABM is consistent with Bootstrap


Accuracy of ABM

Comparing predictions given by ABM & bootstrap when varying number of bootstrap trials (TPC-H 1%)

Bootstrap converges to ABM


Time Performance of ABM

Bootstrap: Original bootstrap
BLB-10: Bag of Little Bootstraps using 10 machines
ODM: On-Demand Materialization

Comparing time performance of ABM & bootstrap variants (TPC-H 10%)

ABM is 3-4 orders of magnitude faster than sequential/parallel bootstrap variants


Time Performance of ABM

Exact: Run the query on the original data
Sample: Run the query on the sample
CLT: Analytic error estimation using the Central Limit Theorem

Comparing time performance of ABM & various techniques (TPC-H 10%)

ABM introduces little overhead


Conclusion & Future Work

• Bootstrap is critical for scalable AQP
• ABM provides an analytical model for bootstrap, and achieves significant speed-up
• ABM+EARL: a bootstrap-based system that can automatically choose/combine error estimation methods

• Integrating ABM into Hive/Shark