

Page 1: BlinkDB and G-OLA: Supporting Continuous Answers with Error Bars in SparkSQL-(Sameer Agarwal and Kai Zeng, Databricks and AMPLab UC Berkeley

BlinkDB and G-OLA: Supporting Approximate Answers in SparkSQL

Sameer Agarwal and Kai Zeng

Spark Summit | San Francisco, CA | June 15th 2015

Page 2:

About Us

1. Sameer Agarwal
   - Software Engineer at Databricks
   - PhD in Databases (UC Berkeley)
   - Research on Approximate Query Processing (BlinkDB)

2. Kai Zeng
   - Post-doc in AMP Lab / Intern at Databricks
   - PhD in Databases (UCLA)
   - Research on Approximate Query Processing (ABM)

Page 3:

100 TB on 1000 machines

Hard Disks: ½ - 1 Hour
Memory: 1 - 5 Minutes
?: 1 second

Continuous Query Execution on Samples of Data

Page 4:

Continuous Query Execution on Samples

ID  City  Latency
1   NYC   30
2   NYC   38
3   SLC   34
4   LA    36
5   SLC   37
6   SF    28
7   NYC   32
8   NYC   38
9   LA    36
10  SF    35
11  NYC   38
12  LA    34

What is the average latency in the table?

34.6667

Page 5:

Continuous Query Execution on Samples

What is the average latency in the table?

35

Page 6:

What is the average latency in the table?

35 ± 2.1

Page 7:

What is the average latency in the table?

35 ± 2.1
33.83 ± 1.3

Page 8:

What is the average latency in the table?

35 ± 2.1
33.83 ± 1.3
34.6667 ± 0.0
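The running answers on these slides can be imitated with a small sketch. It assumes that the rows arrive in random order, that the first k rows act as the sample, and that the error bar is 1.96 standard errors with a finite-population correction; none of these choices is stated on the slides, and the names `RunningAverage` and `estimate` are made up for illustration.

```scala
object RunningAverage {
  // Latency column of the 12-row example table, in row order.
  val latencies: Vector[Double] =
    Vector(30.0, 38.0, 34.0, 36.0, 37.0, 28.0, 32.0, 38.0, 36.0, 35.0, 38.0, 34.0)

  /** Mean of the first k rows and a rough 95% half-width (needs k >= 2).
    * The factor (1 - k/N) is the finite-population correction for sampling
    * without replacement, so the half-width reaches 0 once every row is read. */
  def estimate(k: Int): (Double, Double) = {
    val prefix = latencies.take(k)
    val mean = prefix.sum / k
    val sampleVariance = prefix.map(x => (x - mean) * (x - mean)).sum / (k - 1)
    val fpc = 1.0 - k.toDouble / latencies.length
    (mean, 1.96 * math.sqrt(sampleVariance / k * fpc))
  }

  def main(args: Array[String]): Unit =
    for (k <- Seq(5, 6, 12)) {
      val (mean, halfWidth) = estimate(k)
      println(f"after $k%2d rows: $mean%.4f ± $halfWidth%.2f")
    }
}
```

The means after 5, 6 and 12 rows are 35, 33.83 and 34.6667, the same values that appear on the slides. The half-widths printed by this sketch are not claimed to reproduce the slides' error bars, which may come from a different estimator.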

Page 9:

Demo

Page 10:

Continuous Query Execution on Samples

Query: SELECT foo(*) FROM TABLE
Answer: A ± ε

[Figure: system stack with layers Error Estimation, Query Execution, Data Storage]

Page 11:

Continuous Query Execution on Samples

[Figure: same stack as on Page 10, with the label G-OLA added]

Page 12:

Interface

val dataFrame = sqlCtx.sql("select avg(latency) from log")

// batch processing
val result = dataFrame.collect() // 34.6667

Page 13:

Interface

val dataFrame = sqlCtx.sql("select avg(latency) from log")

// online processing
val onlineDataFrame = dataFrame.online
onlineDataFrame.collectNext() // 35 ± 2.1
onlineDataFrame.collectNext() // 33.83 ± 1.3

Page 14:

Interface

val dataFrame = sqlCtx.sql("select avg(latency) from log")

// online processing
val onlineDataFrame = dataFrame.online
while (onlineDataFrame.hasNext()) {
  onlineDataFrame.collectNext()
}

Page 15:

Interface

val dataFrame = sqlCtx.sql("select avg(latency) from log")

// online processing: stop once the response-time budget is used up
val onlineDataFrame = dataFrame.online
while (onlineDataFrame.hasNext() && responseTime <= 10.seconds) {
  onlineDataFrame.collectNext()
}

Page 16:

Interface

val dataFrame = sqlCtx.sql("select avg(latency) from log")

// online processing: stop once the error bound is small enough
val onlineDataFrame = dataFrame.online
while (onlineDataFrame.hasNext() && errorBound >= 0.01) {
  onlineDataFrame.collectNext()
}

Page 17:

Interface

val dataFrame = sqlCtx.sql("select avg(latency) from log")

// online processing: stop once the user cancels
val onlineDataFrame = dataFrame.online
while (onlineDataFrame.hasNext() && !userEvent.cancelled()) {
  onlineDataFrame.collectNext()
}

Page 18:

Interface

AGGREGATES / UDAFs
JOINS / GROUP BYs
NESTED QUERIES
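The loop-based interface can be mimicked without Spark. The sketch below is a hypothetical stand-in for the online DataFrame on the slides, not the SparkSQL API: `OnlineAverage`, `Estimate`, `OnlineDemo` and `runUntil` are invented names, the input is assumed to be non-empty, and the error bar uses the same rough 1.96 standard-error rule with a finite-population correction, which is an assumption.

```scala
// Hypothetical stand-in for the online DataFrame on the slides; not the Spark API.
final case class Estimate(value: Double, error: Double)

final class OnlineAverage(rows: Vector[Double], batchSize: Int) {
  private var seen = 0
  private var sum = 0.0
  private var sumSq = 0.0

  def hasNext: Boolean = seen < rows.length

  /** Consume one more mini-batch and return the refined estimate. */
  def collectNext(): Estimate = {
    val batch = rows.slice(seen, seen + batchSize)
    seen += batch.length
    sum += batch.sum
    sumSq += batch.map(x => x * x).sum
    val mean = sum / seen
    val variance =
      if (seen > 1) math.max(0.0, (sumSq - seen * mean * mean) / (seen - 1)) else 0.0
    // Finite-population correction: no sampling error is left once all rows are seen.
    val fpc = 1.0 - seen.toDouble / rows.length
    Estimate(mean, 1.96 * math.sqrt(variance / seen * fpc))
  }
}

object OnlineDemo {
  /** Keep refining until the data is exhausted or the error bar is tight enough. */
  def runUntil(online: OnlineAverage, maxError: Double): Estimate = {
    var latest = online.collectNext()
    while (online.hasNext && latest.error > maxError) latest = online.collectNext()
    latest
  }
}
```

Calling `OnlineDemo.runUntil(new OnlineAverage(latencies, 4), 0.0)` on the 12 example latencies consumes all three mini-batches and returns the exact average. A time budget or a cancellation flag could replace the error test in the same loop, as the slides suggest.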

Page 19:

Continuous Query Execution on Samples

Query: SELECT foo(*) FROM TABLE
Answer: A ± ε

[Figure: system stack with layers Error Estimation, Query Execution, Data Storage]

Page 20:

Continuous Query Execution on Samples

Query: SELECT foo(*) FROM TABLE
Answer: A ± ε

[Figure: system stack with layers Query Interface, Error Estimation, Query Execution, Data Storage]

Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In ACM EuroSys 2013.

Ariel Kleiner, Ameet Talwalkar, Sameer Agarwal, Ion Stoica, Michael Jordan. A General Bootstrap Performance Diagnostic. In ACM KDD 2013.

Sameer Agarwal, Henry Milner, Ariel Kleiner, Ameet Talwalkar, Michael Jordan, Samuel Madden, Barzan Mozafari, Ion Stoica. Knowing When You're Wrong: Building Fast and Reliable Approximate Query Processing Systems. In ACM SIGMOD 2014.

Page 21:

Error Estimation on a Sample of Data

Focused on estimating aggregate errors given representative samples

Central Limit Theorem (CLT): HOE:ASTAT63, BIL:WILEY86, CGL:ASTAT83, PH:IBM96

Error Estimation using Bootstrap: EFRON:JAS82, EFRON:JAS87, VP:TPMS80, FGK:IJCAI99, ET:CH93

Closed form approximations to variance of sample estimators for BlinkDB

Henry Milner, 04/06/13

Notation:

1. $\mu = E[X]$
2. $\mu_k$ is the $k$th central moment of the underlying distribution, $E[(X - E[X])^k]$ (note that $\mu_1 = 0$, not $\mu$)
3. $\sigma^2 = \mu_2$ is the variance of the underlying distribution
4. $p$ is the frequency of rows (the probability that a row matches the filter predicate for the query)

The following results are (asymptotically in sample size) true, but not directly useful, since they depend on unknown properties of the underlying distribution. In all cases we just plug in the sample values. For example, instead of $\mu$ we use $\frac{1}{n} \sum_{i=1}^{n} X_i$, where $X_i$ is the $i$th sample value.

Note that for estimators other than sum and count, I assume no filtering ($p = 1$). Filtering will increase variance a bit, or potentially a lot for extremely selective queries ($p = 0$). I can compute the filtering-adjusted values if you like.

1. Count: $N(np,\ n(1 - p)p)$
2. Sum: $N(np\mu,\ np(\sigma^2 + (1 - p)\mu^2))$
3. Mean: $N(\mu,\ \sigma^2/n)$
4. Variance: $N(\sigma^2,\ (\mu_4 - \sigma^4)/n)$
5. Stddev: $N(\sigma,\ (\mu_4 - \sigma^4)/(4\sigma^2 n))$
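The plug-in idea in the note can be sketched in a few lines: compute the sample's central moments and substitute them into the closed forms for mean, variance and stddev. This is an illustrative sketch, not BlinkDB code; `ClosedFormErrors`, `centralMoment` and `standardErrors` are invented names, it covers only the unfiltered case (p = 1), and it assumes a non-constant sample so that the plug-in variance is not zero.

```scala
object ClosedFormErrors {
  /** k-th central moment of the sample (plug-in estimate of mu_k). */
  def centralMoment(xs: Seq[Double], k: Int): Double = {
    val mean = xs.sum / xs.length
    xs.map(x => math.pow(x - mean, k)).sum / xs.length
  }

  /** Plug-in standard errors for (mean, variance, stddev), assuming p = 1. */
  def standardErrors(xs: Seq[Double]): (Double, Double, Double) = {
    val n = xs.length.toDouble
    val sigma2 = centralMoment(xs, 2)            // plug-in for sigma^2 = mu_2
    val mu4 = centralMoment(xs, 4)               // plug-in for mu_4
    val seMean = math.sqrt(sigma2 / n)           // sqrt(sigma^2 / n)
    val seVariance = math.sqrt((mu4 - sigma2 * sigma2) / n)
    val seStddev = math.sqrt((mu4 - sigma2 * sigma2) / (4 * sigma2 * n))
    (seMean, seVariance, seStddev)
  }
}
```

Each returned value is the square root of the variance term in the corresponding normal approximation, so a rough 95% interval is the estimate plus or minus 1.96 times that value.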

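[Figure: bootstrap error estimation. Sampling: a sample S is drawn from the data D. Resampling: resamples S1 … S100 are drawn from S. The estimates θ(S1) … θ(S100), together with θ(S), give a 95% confidence interval.]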

Page 22:

Error Estimation using Bootstrap

What is the average latency in the table?

Page 23:

Error Estimation using Bootstrap

Resample 1: NYC 30, NYC 38, SLC 34, SLC 34; θ1 = 34
Resample 2: NYC 30, NYC 30, SLC 34, LA 36; θ2 = 32.5
...
Resample 100: SLC 34, LA 36, SLC 34, LA 36; θ100 = 35

34.5 ± 2

Page 25:

Error Estimation using Bootstrap

Resample 1: NYC 30, NYC 38, SLC 34, SLC 34, SLC 37; θ1 = 34.6
Resample 2: SLC 37, NYC 30, SLC 34, LA 36, NYC 30; θ2 = 33.4
...
Resample 100: SLC 34, SLC 37, SLC 34, LA 36, LA 36; θ100 = 35.4

35 ± 1.6
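The resampling procedure on these slides can be sketched as follows. The slides do not say how the 100 estimates are turned into an error bar, so the percentile interval used here is an assumption, and `Bootstrap`, `resampleMeans` and `percentileInterval` are illustrative names.

```scala
import scala.util.Random

object Bootstrap {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.length

  /** Draw b resamples of the same size with replacement; return each resample's mean. */
  def resampleMeans(sample: IndexedSeq[Double], b: Int, rng: Random): Seq[Double] =
    Seq.fill(b) {
      mean(Seq.fill(sample.length)(sample(rng.nextInt(sample.length))))
    }

  /** Percentile interval: the central (1 - alpha) share of the resampled estimates. */
  def percentileInterval(estimates: Seq[Double], alpha: Double = 0.05): (Double, Double) = {
    val sorted = estimates.sorted
    val lo = sorted(((alpha / 2) * (sorted.length - 1)).toInt)
    val hi = sorted(((1 - alpha / 2) * (sorted.length - 1)).toInt)
    (lo, hi)
  }
}
```

For the five-row sample, `Bootstrap.resampleMeans(Vector(30.0, 38.0, 34.0, 36.0, 37.0), 100, new Random(42))` yields 100 values playing the role of θ1 … θ100, and `percentileInterval` turns them into an interval around the sample mean of 35.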

Page 26:

Error Estimation in BlinkDB

What is the average latency in the table?

Leverage Poissonized Resampling to generate samples with replacement

ID  City  Latency
1   NYC   30
2   NYC   38
3   SLC   34
4   SLC   34
5   SLC   37

Page 27:

Error Estimation in BlinkDB

Sample from a Poisson(1) Distribution

ID  City  Latency  #1
1   NYC   30       2
2   NYC   38       1
3   SLC   34       0
4   SLC   34       1
5   SLC   37       1

θ1 = 33.8

Page 28:

Error Estimation in BlinkDB

Incremental Error Estimation

ID  City  Latency  #1
1   NYC   30       2
2   NYC   38       1
3   SLC   34       0
4   SLC   34       1
5   SLC   37       1
6   SF    28       2

Page 29:

Error Estimation in BlinkDB

Construct all Resamples in a Single Pass

ID  City  Latency  #1  #2
1   NYC   30       2   1
2   NYC   38       1   0
3   SLC   34       0   2
4   SLC   34       1   2
5   SLC   37       1   0
6   SF    28       2   1

0.2-0.5% additional overhead
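A minimal sketch of Poissonized resampling, under stated assumptions: each row gets an independent Poisson(1) weight per resample, and every resample's estimate is a weighted mean maintained from running sums, so all resamples are built in one pass and new rows can be folded in later. The object and method names are invented for illustration, and Knuth's Poisson sampler is a swapped-in textbook method, not necessarily what BlinkDB uses.

```scala
import scala.util.Random

object PoissonizedBootstrap {
  /** Knuth's method for drawing from Poisson(lambda); adequate for small lambda. */
  def poisson(lambda: Double, rng: Random): Int = {
    val limit = math.exp(-lambda)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) { k += 1; p *= rng.nextDouble() }
    k
  }

  /** Weighted mean for one resample: weights(i) is how often row i appears in it. */
  def weightedMean(values: Seq[Double], weights: Seq[Int]): Double =
    values.zip(weights).map { case (x, w) => x * w }.sum / weights.sum

  /** One pass over the rows, updating the running sums of all b resamples at once. */
  def resampleMeans(values: Seq[Double], b: Int, rng: Random): Seq[Double] = {
    val weightedSums = Array.fill(b)(0.0)
    val weightTotals = Array.fill(b)(0)
    for (x <- values; j <- 0 until b) {
      val w = poisson(1.0, rng)
      weightedSums(j) += w * x
      weightTotals(j) += w
    }
    // Resamples whose weights all came out as 0 carry no information and are skipped.
    (0 until b).collect { case j if weightTotals(j) > 0 => weightedSums(j) / weightTotals(j) }
  }
}
```

With the weights from the slide, `weightedMean(Seq(30.0, 38.0, 34.0, 34.0, 37.0), Seq(2, 1, 0, 1, 1))` gives 169 / 5 = 33.8, matching θ1.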

Page 30:

High Level Take-away: Bootstrap and Poissonized Resampling techniques are the key to achieving quick and continuous error bars for a general set of queries

Page 31:

Continuous Query Execution on Samples

Query: SELECT foo(*) FROM TABLE
Answer: A ± ε

[Figure: system stack with layers Query Interface, Error Estimation, Query Execution, Data Storage, with the label G-OLA]

Kai Zeng, Sameer Agarwal, Ankur Dave, Michael Armbrust and Ion Stoica. G-OLA: Generalized Online Aggregation for Interactive Analysis on Big Data. In SIGMOD 2015.

Page 32:

Query Execution: Under The Hood

[Figure: Data, Query, Answer ± ε]

A: 10 sec

Page 33:

Query Execution: Under The Hood

A: 10 sec
A': 10+10 sec

Page 34:

Query Execution: Under The Hood

A: 10 sec
A': 10+10 sec
A'': 10+10+10 sec

Page 35:

Query Execution: Under The Hood

A: 10 sec
A': 10+10 sec
A'': 10+10+10 sec

Overall Quadratic Cost!
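The quadratic cost can be made explicit with a short calculation. It assumes, following the slide's timings, that every mini-batch takes the same time $c$ to process and that each refinement re-runs the query from scratch over all data seen so far. The symbols $c$, $k$, $T_{\text{rerun}}$ and $T_{\text{delta}}$ are introduced here for illustration, and $T_{\text{delta}}$ is an idealized comparison in which each refinement only touches the new batch.

```latex
T_{\text{rerun}}(k) = \sum_{i=1}^{k} i\,c = \frac{c\,k(k+1)}{2},
\qquad
T_{\text{delta}}(k) = \sum_{i=1}^{k} c = c\,k .
% With c = 10 s and k = 3: T_rerun = 10 + 20 + 30 = 60 s, while T_delta = 30 s.
```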

Page 36:

Query Execution: Under The Hood

[Figure: Data, Query, Answer ± ε]

A: 10 sec
A': 10+10 sec

Page 40:

Query Execution: Under The Hood

[Figure: as on Page 36, with the answer A shown again and the label "Delta Update Query"]

Page 41:

Delta Update Queries

[Figure: Data, Query, Answer ± ε]

Page 47:

Delta Update: Simple Queries

SELECT avg(latency)
FROM log

[Figure: query plan with AVG over SCAN, producing A]

Page 49:

Delta Update: Simple Queries

SELECT avg(latency)
FROM log

[Figure: two copies of the plan with AVG over SCAN, each producing A]
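For a simple aggregate such as AVG, a delta update can be sketched by carrying a small state between mini-batches instead of rescanning earlier data. This is a minimal illustration under that assumption, not G-OLA's implementation; `DeltaAvg` and `AvgState` are made-up names.

```scala
object DeltaAvg {
  /** State carried between mini-batches: enough to rebuild AVG without rescanning. */
  final case class AvgState(sum: Double, count: Long) {
    def avg: Double = sum / count

    /** Delta update: fold only the new rows into the existing state. */
    def update(batch: Seq[Double]): AvgState =
      AvgState(sum + batch.sum, count + batch.length)
  }

  val empty: AvgState = AvgState(0.0, 0L)
}
```

Starting from `DeltaAvg.empty`, updating with the first five latencies gives an average of 35, and folding in the sixth row (28) gives 33.83, without touching the first five rows again.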

Page 50:

Delta Update: Nested Queries

SELECT avg(latency)
FROM log
WHERE latency > (
  SELECT avg(latency)
  FROM log
)

[Figure: query plan with operators AVG, FILTER (latency > A), JOIN and SCAN, plus an inner AVG over SCAN that produces A; mini-batch (I)]

Page 51:

Delta Update: Nested Queries

[Figure: same plan with mini-batches (I) and (II); the inner answer is now labeled A']

Page 53:

Delta Update: Nested Queries

[Figure: same plan, mini-batch (I); the inner answer is shown as A ± ε]

Page 54:

Delta Update: Nested Queries

[Figure: same plan, mini-batch (I); inner answer 10 ± 2]

Page 55:

Delta Update: Nested Queries

[Figure: same plan, mini-batch (I); inner answer 10 ± 2; rows with latency < 8]

Page 56:

Delta Update: Nested Queries

[Figure: same plan, mini-batch (I); inner answer 10 ± 2; rows with latency > 12]

Page 57:

Delta Update: Nested Queries

[Figure: same plan, mini-batch (I); inner answer 10 ± 2; rows with 8 < latency < 12]

Page 58:

Delta Update: Nested Queries

[Figure: same plan, mini-batches (I) and (II); inner answer 10 ± 2; rows with 8 < latency < 12]
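One reading of the 10 ± 2 example is that the inner estimate splits the outer rows into three groups: rows above 12 pass the filter whatever the final inner average turns out to be, rows below 8 fail it, and rows between 8 and 12 stay undecided until the estimate tightens. The sketch below follows that reading; it is an interpretation of the slides rather than G-OLA's actual algorithm, the names are invented, and `refine` assumes that the refined interval lies inside the previous one.

```scala
object NestedDelta {
  final case class Partition(accepted: Seq[Double], rejected: Seq[Double], uncertain: Seq[Double])

  /** Classify rows against the inner estimate a ± eps for the predicate latency > a. */
  def classify(rows: Seq[Double], a: Double, eps: Double): Partition = {
    val (accepted, rest) = rows.partition(_ > a + eps)      // pass for every value in the interval
    val (rejected, uncertain) = rest.partition(_ < a - eps) // fail for every value in the interval
    Partition(accepted, rejected, uncertain)
  }

  /** Refinement step: only the previously uncertain rows are re-examined.
    * Assumes the new interval [a - eps, a + eps] lies inside the old one. */
  def refine(previous: Partition, a: Double, eps: Double): Partition = {
    val again = classify(previous.uncertain, a, eps)
    Partition(previous.accepted ++ again.accepted,
              previous.rejected ++ again.rejected,
              again.uncertain)
  }
}
```

With rows 5, 9, 11, 13 and 20 and an inner estimate of 10 ± 2, `classify` accepts 13 and 20, rejects 5, and keeps 9 and 11 as uncertain. A later `refine` with, say, 10.5 ± 0.2 only has to look at those two rows.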

Page 59:

High Level Take-away: Introduce Delta Update Queries as a First Class Citizen in Query Execution

Page 60:

Check out our code!

1. Code Preview: http://github.com/amplab/bootstrap-sql. Send us an email at [email protected] and [email protected] to get access!
2. Spark Package in July '15
3. Gradual Native SparkSQL Integration in 1.5, 1.6 and beyond

Page 61:

Conclusion

1. Continuous Query Execution on Samples of Data is an important means to achieve interactivity in processing large datasets
2. New SparkSQL Libraries:
   - BlinkDB for Continuous Error Bars
   - G-OLA for Continuous Partial Answers

Page 62:

Thank you.

Sameer Agarwal ([email protected])
Kai Zeng ([email protected])