
1

Scalable Approximate Query Processing through Scalable Error Estimation

Kai Zeng, UCLA

Advisor: Carlo Zaniolo

2

Why Approximate Query Processing?

• AQP is critical for massive data
– Ever-growing size of big data
– Need for timely and cost-effective analysis
– Widely applied
  • RDBMSs (e.g., online aggregation)
  • MapReduce systems (e.g., BlinkDB)
  • Data stream systems (load shedding)

3

• Sampling: widely used in AQP
• Error estimation: fundamental in AQP
– Analytic error estimation
– Bootstrap

[Figure: Massive data → sample (6, 2, 7, 8, 5, 1, 3, 4, 9, 10) → AVG → approximate mean 5.5]

Sampling & Quality assessment

Need to assess the quality! What is the error of this approximate mean?

4

[Figure: Massive data → sample (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) → query AVG → approximate mean 5.5; collect the number of tuples and the variance, then apply the Central Limit Theorem]

Analytic Error Estimation

• Use closed-form formulas
• Pro: very fast
• Con: restricted to simple aggregates
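A minimal Python sketch of the closed-form approach, assuming a uniform random sample and an AVG query. The function name `clt_confidence_interval` and the z-value 1.96 (for roughly 95% confidence) are illustrative choices, not taken from the slides, and no finite-population correction is applied.

```python
import math
import statistics

def clt_confidence_interval(sample, z=1.96):
    """Approximate confidence interval for the mean via the Central Limit Theorem.

    Only two statistics are collected from the sample: the number of
    tuples and the variance. z=1.96 is an assumed 95% setting.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    variance = statistics.variance(sample)      # unbiased sample variance
    half_width = z * math.sqrt(variance / n)    # z * standard error of the mean
    return mean, mean - half_width, mean + half_width

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mean, low, high = clt_confidence_interval(sample)
print(f"mean={mean:.2f}, interval=({low:.2f}, {high:.2f})")
```

The formula is specific to the mean; a different aggregate needs its own derivation, which is the restriction the slide points out.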

What if I want to estimate?
1. Complex SQL queries
2. Data mining tasks
3. ….

5

Bootstrap [Efron 1979]

• Resample with replacement from the sample
• Run the query on the resample
• Repeat many times, typically 100s or even 1000s of times

[Figure: the sample (6, 2, 7, 8, 5, 1, 3, 4, 9, 10), with sample mean 5.5, is resampled into resamples of the same size, e.g. (2, 10, 10, 5, 9, 2, 5, 10, 8, 10), (8, 1, 2, 1, 1, 9, 7, 4, 10, 1), (9, 10, 2, 10, 7, 1, 3, 6, 10, 10), ……; the query AVG is run on each resample and the results (6.8, 7.1, 4.5, ……) are collected]

6

• Compute the error from the empirical distribution of all the query results (e.g., a 95% interval)
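The bootstrap procedure can be sketched in a few lines of Python. This is an illustration only: the function name `bootstrap_interval`, the percentile method for the interval, and the defaults (1000 trials, 95%) are assumptions rather than details from the slides.

```python
import random
import statistics

def bootstrap_interval(sample, query, trials=1000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for an arbitrary query (treated as a black box)."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement (same size as the sample) and rerun the query.
    results = sorted(query(rng.choices(sample, k=n)) for _ in range(trials))
    alpha = (1.0 - confidence) / 2.0
    low = results[int(alpha * trials)]
    high = results[min(trials - 1, int((1.0 - alpha) * trials))]
    return low, high

sample = [6, 2, 7, 8, 5, 1, 3, 4, 9, 10]
low, high = bootstrap_interval(sample, statistics.fmean)
print(f"95% interval for AVG: ({low:.2f}, {high:.2f})")

# The same driver works unchanged for a query without a simple closed form.
med_low, med_high = bootstrap_interval(sample, statistics.median)
```

Every trial reruns the whole query, so the cost grows linearly with the number of trials.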

7

Notes on Bootstrap

• Bootstrap treats Q as a black box
• Can handle (almost) arbitrarily complex queries, including UDFs!
• Embarrassingly parallel
• Computationally demanding
• Uses too many resources

Error Estimation

• Analytic error estimation
– Fast but limited to simple aggregates
• Bootstrap (Monte Carlo simulation)
– Expensive but general

Fast and General?

9

How To Make Bootstrap Faster

• Optimize the Monte Carlo simulation process
– EARL system [VLDB12][ICDE13]
• Bypass the Monte Carlo simulation process
– Analytical Bootstrap Method (ABM) [SIGMOD14]

10

EARLY ACCURATE RESULT LIBRARY (EARL PROJECT)

11

Motivation

• Existing systems (e.g., Hadoop) use batch processing
– High latency
– Waste of resources
• Goals: a general driver that can
– Return approximate results
– With accuracy guarantees
– For a wide range of tasks

12

Incremental Computation

• A small sample → a larger sample → ……
• Use bootstrap to test accuracy
• Time efficient: enables early returns
• Resource efficient: does not waste resources

[Figure: Massive data → sample → bootstrap → accurate enough? If not, enlarge the sample and repeat ……]
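The sample-enlarge-test loop might look like the following Python sketch. It is not EARL's implementation: the function names, the doubling growth policy, the relative-error stopping rule, and all default values are assumptions made for illustration, and the error measure assumes the query result is non-zero.

```python
import random
import statistics

def bootstrap_relative_error(sample, query, trials, rng):
    """Bootstrap standard error of the query result, relative to the point estimate."""
    estimates = [query(rng.choices(sample, k=len(sample))) for _ in range(trials)]
    return statistics.stdev(estimates) / abs(query(sample))

def early_result(data, query, target_error=0.05, initial_size=100,
                 growth=2, trials=200, seed=0):
    """Enlarge the sample until the bootstrap says the result is accurate enough."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)          # any prefix of a random permutation is a uniform sample
    size = min(initial_size, len(shuffled))
    while True:
        sample = shuffled[:size]
        error = bootstrap_relative_error(sample, query, trials, rng)
        if error <= target_error or size == len(shuffled):
            return query(sample), error, size      # early return
        size = min(size * growth, len(shuffled))   # enlarge and test again

# Hypothetical data: 20,000 values drawn around 100.
data_rng = random.Random(1)
data = [data_rng.gauss(100, 20) for _ in range(20000)]
estimate, error, used = early_result(data, statistics.fmean, target_error=0.01)
print(f"estimate={estimate:.2f}, relative error={error:.4f}, tuples used={used}")
```

Note that this naive loop reruns the full bootstrap at every sample size, so it redoes work on each enlargement.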

13

Basic Ideas: Optimization

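• Intra-iteration optimization
– We have to repeat the same computation on all resamples
– Much of the data is shared!
– Compute the shared part once

[Figure: within one iteration, the function 𝒇 is applied to each of the resamples 𝑆1, 𝑆2, …… drawn from 𝑆; the data is divided into a shared and a non-shared part]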

14

Basic Ideas: Optimization

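• Inter-iteration optimization
– Reuse the old computation
– Cannot simply merge, because that would break the randomness of the resamples
– Keep a small sample in memory for adjustment

[Figure: from one iteration to the next, the sample 𝑆 grows by ∆𝑆, and the resamples 𝑆1, 𝑆2, …… become 𝑆1′, 𝑆2′, ……; for 𝑆1 and Δ𝑆1 the adjustment is small]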

15

ANALYTICAL BOOTSTRAP

16

Analytical Bootstrap

• Scope: relational algebra (selection, projection, join, aggregate)
• Basic idea
– Annotate tuples with random variables (the number of times a tuple will be drawn in a bootstrap trial)
– Extend relational algebra to manage these random variables

A single-round evaluation = 100s/1000s of bootstrap trials!

17

Bootstrap Resamples As Multiset DB

• Bootstrap generates multiset relations
– Tuples annotated with multiplicities
– Query processing manipulates these multiplicities

Sample:

ID  Product  Qty
1   A        2
2   B        3
3   A        2
4   A        4

Resamples (each listed as its ID, Product, Qty tuples):

(1 A 2), (2 B 3), (2 B 3), (4 A 4)
(1 A 2), (2 B 3), (4 A 4), (4 A 4)
(1 A 2), (3 A 2), (3 A 2), (4 A 4)
……

The last resample as a multiset relation (# = multiplicity):

ID  Product  Qty  #
1   A        2    1
2   B        3    0
3   A        2    2
4   A        4    1

18

Querying Multiset DB: Projection

• Projection takes sum of multiplicities

Product  Qty  #
A        2    3    (1+2=3)
B        3    0
A        4    1

Example query: How many products are ordered by small quantity orders?

SELECT Product, SUM(Qty)
FROM Orders
WHERE Qty < (SELECT SUM(Qty) / 4
             FROM Orders)
GROUP BY Product

19

Querying Multiset DB: Aggregate

• Aggregate takes weighted sum of multiplicities

SUM(Qty)  #
10        1

2×1 + 3×0 + 2×2 + 4×1 = 10

20

Querying Multiset DB: Join

• Join takes product of multiplicities

Product  Qty  #
A        2    3
B        3    0
A        4    1

joined with

SUM(Qty)  #
10        1

gives

Product  Qty  SUM(Qty)  #
A        2    10        3    (3×1=3)
B        3    10        0
A        4    10        1

21

Querying Multiset DB: Selection

• Selection takes product of multiplicities (the tuple's multiplicity times the 0/1 outcome of the predicate)

3×1=3, 1×0=0
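The four multiplicity rules can be replayed on the example resample with a short Python sketch. The representation (plain tuples with a trailing multiplicity) and the function names are illustrative assumptions, not the data structures of the actual system; the join is specialised to the single-row aggregate used in the example query.

```python
from collections import defaultdict

# One resample from the slides, stored once with a multiplicity column (#).
orders = [  # (ID, Product, Qty, #)
    (1, "A", 2, 1),
    (2, "B", 3, 0),
    (3, "A", 2, 2),
    (4, "A", 4, 1),
]

def project(rows):
    """Drop ID: tuples that become identical merge, and their multiplicities are summed."""
    merged = defaultdict(int)
    for _id, product, qty, mult in rows:
        merged[(product, qty)] += mult
    return [(product, qty, mult) for (product, qty), mult in merged.items()]

def aggregate_sum(rows):
    """SUM(Qty): every value is weighted by its multiplicity; the result row has multiplicity 1."""
    return (sum(qty * mult for _id, _product, qty, mult in rows), 1)

def join(left, right):
    """Join with a single-row relation: multiplicities are multiplied."""
    total, total_mult = right
    return [(product, qty, total, mult * total_mult) for product, qty, mult in left]

def select(rows, predicate):
    """Selection: multiply the multiplicity by the 0/1 outcome of the predicate."""
    return [(p, q, t, mult * int(predicate(q, t))) for p, q, t, mult in rows]

projected = project(orders)       # A 2 -> 3, B 3 -> 0, A 4 -> 1
total = aggregate_sum(orders)     # (10, 1)
joined = join(projected, total)
selected = select(joined, lambda qty, total_qty: qty < total_qty / 4)
print(selected)
```

Rows that fail the predicate are kept with multiplicity 0 rather than dropped, which mirrors the 1×0=0 step on the slide.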

22

Bootstrap Resamples As Multiset DB

• Bootstrap generates multiset relations
– Tuples annotated with multiplicities
– Query processing manipulates these multiplicities

23

Random Variables on Probabilistic Multiset DB (PMDB)

• Multiset DB
– Tuples are annotated with random multiplicities $(m_1, m_2, m_3, m_4) \sim \mathrm{Multinomial}(4, (0.25, 0.25, 0.25, 0.25))$
– Similar to tossing coins: each of the 4 draws picks each tuple with probability 0.25

Probabilistic Multiset DB:

ID  Product  Qty  #
1   A        2    $m_1$
2   B        3    $m_2$
3   A        2    $m_3$
4   A        4    $m_4$
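A quick way to see why bootstrap multiplicities are multinomial is to draw resamples and count how often each tuple appears. The Python sketch below does this for the four-tuple example; the function name, the seed, and the trial count of 20,000 are arbitrary assumptions for the illustration.

```python
import random
from collections import Counter

def bootstrap_multiplicities(ids, rng):
    """Draw one same-size resample with replacement and count each tuple's occurrences."""
    counts = Counter(rng.choices(ids, k=len(ids)))
    return [counts[i] for i in ids]   # Counter returns 0 for tuples never drawn

rng = random.Random(0)
ids = [1, 2, 3, 4]
trials = [bootstrap_multiplicities(ids, rng) for _ in range(20000)]

# Each vector is one draw from Multinomial(4, (0.25, 0.25, 0.25, 0.25)):
# the entries always sum to 4, and each entry averages 4 * 0.25 = 1.
mean_m1 = sum(t[0] for t in trials) / len(trials)
print(trials[0], round(mean_m1, 3))
```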

24

Querying PMDB

• Whenever we apply
– a sum to the multiplicity column: sum the annotated random variables
– a product to the multiplicity column: multiply the annotated random variables

25


Querying PMDB: Projection

• Projection takes convolution sum of multiplicities

Product  Qty  #
A        2    $m_1 + m_3$
B        3    $m_2$
A        4    $m_4$

26

From Theory To Practice

• Annotated random variables
– Marginal distribution: $(m_1, m_2, m_3, m_4) \sim \mathrm{Multinomial}(4, (0.25, 0.25, 0.25, 0.25))$ gives each $m_i \sim \mathrm{Multinomial}(4, (0.25, 0.75))$

Numeric form! (# is stored as n trials plus the per-trial probabilities of the outcomes 0 and 1)

ID  Product  Qty  n  0     1
1   A        2    4  0.75  0.25
2   B        3    4  0.75  0.25
3   A        2    4  0.75  0.25
4   A        4    4  0.75  0.25

27



Querying PMDB: an Example

• Evaluation works with the numeric forms

Base relation:

ID  Product  Qty  n  0     1
1   A        2    4  0.75  0.25
2   B        3    4  0.75  0.25
3   A        2    4  0.75  0.25
4   A        4    4  0.75  0.25

After projection:

Product  Qty  n  0     1
A        2    4  0.5   0.5
B        3    4  0.75  0.25
A        4    4  0.75  0.25

After selection:

Product  Qty  n  0    1
A        2    4  0.5  0.5
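The projection step on numeric forms can be sketched in Python as follows. This is an interpretation of the tables on the slide, not ABM's actual implementation: it assumes each row stores `(n, p1)`, where `p1` is the per-trial probability of outcome 1, and that the merged tuples come from distinct base tuples of the same relation, so their per-trial probabilities simply add. The function names are hypothetical.

```python
import math

# Numeric form of the base relation: (Product, Qty, n, p1).
base = [
    ("A", 2, 4, 0.25),
    ("B", 3, 4, 0.25),
    ("A", 2, 4, 0.25),
    ("A", 4, 4, 0.25),
]

def project_numeric(rows):
    """Merge identical (Product, Qty) tuples.

    Assumption: merged tuples are disjoint (distinct base tuples), so a single
    draw can hit at most one of them and the per-trial probabilities add.
    """
    merged = {}
    for product, qty, n, p1 in rows:
        key = (product, qty)
        if key in merged:
            merged[key] = (n, merged[key][1] + p1)
        else:
            merged[key] = (n, p1)
    return merged

def multiplicity_pmf(n, p1):
    """Distribution of the multiplicity implied by a numeric form: Binomial(n, p1)."""
    return [math.comb(n, k) * p1**k * (1 - p1)**(n - k) for k in range(n + 1)]

projected = project_numeric(base)
pmf_a2 = multiplicity_pmf(*projected[("A", 2)])
print(projected)
print(pmf_a2)
```

The merged row for (A, 2) carries n = 4 with probabilities 0.5 and 0.5, matching the projected table, and its full multiplicity distribution is available without running any bootstrap trial.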

28

Querying PMDB: an Example

• Correctness of projection
– Tuples projected are disjoint: they do not depend on the same base tuple
– Can be detected by functional dependency



29

Querying PMDB in Numeric Form

• ABM is correct for queries with eligible plans
• A large subset of queries can be evaluated by ABM in DBPTIME
• Eligible plans can be tested at compile time (functional dependency rules)

30

Coverage of Various Techniques

Technique                   TPC-H   Conviva Log
Analytic error estimation   9/22    36.9%
ABM, DBPTIME eligible       15/22   81.0%
ABM eligible                19/22   98.6%
ABM                         19/22   99.1%
Bootstrap                   19/22   99.1%

Over 6660 queries

31

EXPERIMENTAL EVALUATION

32

Experimental Setting

• Synthetic and real-life datasets and queries
– TPC-H: 100 GB
– Skewed TPC-H: 1 GB
– Customer: 52 GB
• Compare relative error
– Of: mean, standard deviation, quantile, KS-distance, confidence interval, existence probability
– Between: Analytical Bootstrap Method (ABM), bootstrap (BS), ground truth (GT)

33

Accuracy of ABM

Comparing the distributions given by ABM & bootstrap on quantiles & existence probability (1% sample)


ABM models Bootstrap accurately

34

Accuracy of ABM

Comparing user-defined measures given by ABM & bootstrap to ground truth (1% sample)

ABM is consistent with Bootstrap

35

Accuracy of ABM

Comparing predictions given by ABM & bootstrap when varying number of bootstrap trials (TPC-H 1%)

Bootstrap converges to ABM

36

Time Performance of ABM

Bootstrap: Original bootstrap
BLB-10: Bag of Little Bootstraps using 10 machines
ODM: On-Demand Materialization

Comparing time performance of ABM & bootstrap variants (TPC-H 10%)

ABM is 3-4 orders of magnitude faster than sequential/parallel bootstrap variants

37

Time Performance of ABM

Exact: Run the query on the original data
Sample: Run the query on the sample
CLT: Analytic error estimation using the Central Limit Theorem

Comparing time performance of ABM & various techniques (TPC-H 10%)

ABM introduces little overhead

38

Conclusion & Future Work

• Bootstrap is critical for scalable AQP
• ABM provides an analytical model for bootstrap, and achieves significant speed-up
• ABM+EARL: a bootstrap-based system that can automatically choose/combine error estimation methods
• Integrating ABM into Hive/Shark
