Optimization of Continuous Queries in Federated Database and Stream Processing Systems

Yuanzhen Ji¹, Zbigniew Jerzak¹, Anisoara Nica¹, Gregor Hackenbroich¹, Christof Fetzer²

¹SAP SE ²TU [email protected] [email protected]

April 14, 2023 BTW 2015

Agenda

• Introduction
• Federated Continuous Query Execution
• Query Optimization Problem
• Our Optimization Solution
• Evaluation
• Conclusions

Introduction

• Problem: optimizing continuous queries (CQ) for federated execution over a native stream processing engine (SPE) and a column-oriented in-memory database (CIMDB)
– operators: select, join, project, aggregate
• Goal: maximize query throughput (amount of data processed per unit time)

[Figure: SPE and CIMDB; labels: data streams, query results, data flow]

Introduction

• Motivation:
– “No one size fits all” (Cyclops [LHB13], [Ji13])
– obtain the best of both worlds (SPE, CIMDB)
• Application scenario:
– analyzing energy consumption data collected from smart plugs installed in households (DEBS 2014 Grand Challenge)
• Main contributions:
– a static cost-based optimizer for federated systems
  • extends established optimization techniques
  • considers the feasibility property of CQ
– showed the potential of federated CQ execution over an SPE and a CIMDB
  • up to 8.5x the throughput of pure SPE-based processing
  • up to 1.8x the throughput of pure CIMDB-based processing

Federated Continuous Query Execution

• send relevant input data from SPE to CIMDB
• trigger re-evaluation of query pieces moved to CIMDB
• take results of query pieces executed in CIMDB back to SPE

[Figure: SPE with MIG operators and CIMDB; labels: data streams, query results, SQL query, data flow]
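The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SQLite (via the standard `sqlite3` module) stands in for the CIMDB, and the class `MigOperator`, the table `plug_load`, and its columns are hypothetical names that do not come from the slides. The sketch also assumes that each batch represents the complete current input of the query piece, so the table is cleared before every load.

```python
import sqlite3


class MigOperator:
    """Hypothetical stand-in for a MIG operator: ships tuples to a database,
    re-evaluates a SQL query piece there, and returns the result rows."""

    def __init__(self, conn, table, columns, sql):
        self.conn = conn
        self.table = table
        self.sql = sql
        placeholders = ", ".join("?" for _ in columns)
        self.insert_sql = f"INSERT INTO {table} VALUES ({placeholders})"
        conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")

    def process(self, batch):
        # 1. Send the relevant input data from the SPE side to the database.
        #    Assumption: the batch is the full current input, so old rows go.
        self.conn.execute(f"DELETE FROM {self.table}")
        self.conn.executemany(self.insert_sql, batch)
        # 2. Trigger re-evaluation of the query piece inside the database.
        rows = self.conn.execute(self.sql).fetchall()
        # 3. Hand the results back to the downstream SPE operators.
        return rows


conn = sqlite3.connect(":memory:")  # in-memory SQLite as a CIMDB placeholder
mig = MigOperator(
    conn,
    table="plug_load",
    columns=["plug_id", "load_w"],
    sql="SELECT plug_id, AVG(load_w) FROM plug_load "
        "GROUP BY plug_id ORDER BY plug_id",
)
result = mig.process([(1, 10.0), (1, 30.0), (2, 5.0)])
print(result)
```

In this sketch the data transfer in, the query execution, and the transfer out all happen inside one `process` call.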

Query Optimization Problem

• Problem: determine the optimal execution plan for a given CQ

– currently at deployment time

• Feasibility of continuous queries [AN04]:
– feasible execution plan: can keep up with the data arrival rate
– feasible query: has at least one feasible plan
• Feasibility-dependent optimization objective:
– feasible queries: find the feasible plan with the least resource consumption
– infeasible queries: find the plan with maximal throughput
• State of the art: considers either the feasibility of CQ but not the federation context, or the federation context but not the feasibility of CQ.

[Figure: SPE or CIMDB?]

Optimization Solution: Cost Model – Operator Cost (1)

• Operator cost: CPU cost caused by the tuples that arrive from the data sources within unit time

For an operator $O$ with $k$ direct upstream operators:

$C(O) = \sum_{i=1}^{k} l_i c_i$

– $l_i$: # tuples produced by the $i$-th upstream operator as a result of unit-time source arrivals
– $c_i$: time to process a single tuple from the $i$-th upstream operator

• $C(O) > 1$: $O$ is a bottleneck, so the plan is infeasible

Example: $l_1 = 300$, $l_2 = 200$, $c_1 = 0.001$, $c_2 = 0.002$ gives $C(O) = 300 \cdot 0.001 + 200 \cdot 0.002 = 0.7$
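The operator cost is a sum of (tuples per unit time) × (time per tuple) over the direct upstream operators. A minimal Python sketch that reproduces the slide's example numbers; the function name `operator_cost` is an illustrative choice, not a name from the slides:

```python
def operator_cost(upstream):
    """Return the CPU cost of one operator per unit time.

    `upstream` holds one (l_i, c_i) pair per direct upstream operator:
    l_i = tuples produced per unit time, c_i = time to process one tuple.
    """
    return sum(l * c for l, c in upstream)


# Values from the slide example: l1 = 300, c1 = 0.001, l2 = 200, c2 = 0.002.
cost = operator_cost([(300, 0.001), (200, 0.002)])
print(round(cost, 6))
```

A cost of 0.7 means the operator is busy for 70% of each unit of time, so it can keep up with its inputs.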

Optimization Solution: Cost Model – Operator Cost (2)

• A query piece executed in CIMDB and its corresponding MIG operator:
– treated as a composite operator and costed as a whole
– cost includes data transfer (in & out) cost and query execution cost

[Figure: SPE with a MIG operator and CIMDB; labels: data streams, query results, SQL query, data flow]

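Optimization Solution: Cost Model – Execution Plan Cost

• Execution plan cost: $C(P) = \langle C_b(P), C_u(P) \rangle$
– Two components, where $C(O_i)$ is the cost of operator $O_i$ and $m$ is the # operators in $P$:
  bottleneck cost: $C_b(P) = \max_{1 \le i \le m} C(O_i)$
  total utilization cost: $C_u(P) = \sum_{i=1}^{m} C(O_i)$
– $P$ is infeasible if $C_b(P) > 1$

Example: operators $O_1$ to $O_4$ with costs 0.5, 0.3, 1.1, and 0.7 give $C_b(P) = 1.1$ and $C_u(P) = 2.6$, so $P$ is infeasible.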
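The two plan cost components can be computed directly from the operator costs. The sketch below uses the four operator costs shown on the slide; the function names `plan_cost` and `is_feasible` are illustrative assumptions:

```python
def plan_cost(operator_costs):
    """Return (bottleneck cost, total utilization cost) for a plan."""
    bottleneck = max(operator_costs)   # most heavily loaded operator
    utilization = sum(operator_costs)  # total load across all operators
    return bottleneck, utilization


def is_feasible(operator_costs):
    """A plan keeps up with the arrival rate only if no operator cost exceeds 1."""
    bottleneck, _ = plan_cost(operator_costs)
    return bottleneck <= 1


costs = [0.5, 0.3, 1.1, 0.7]  # operator costs from the slide example
cb, cu = plan_cost(costs)
print(cb, round(cu, 6), is_feasible(costs))
```

The operator with cost 1.1 is the bottleneck, which makes the whole plan infeasible even though the other three operators have spare capacity.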

Optimization Solution: Optimal Execution Plan

• An execution plan $P$ of a CQ is an optimal plan iff, for any other plan $P'$ of the CQ, one of the following conditions is satisfied:
– Condition 1: $P$ is feasible but $P'$ is infeasible ($C_b(P) \le 1 < C_b(P')$)
– Condition 2: both $P$ and $P'$ are feasible, and $P$ has no higher total utilization cost ($C_b(P) \le 1$, $C_b(P') \le 1$, and $C_u(P) \le C_u(P')$)
– Condition 3: both $P$ and $P'$ are infeasible, and $P$ has no higher bottleneck cost ($1 < C_b(P) \le C_b(P')$)
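One way to read the three conditions is as a single sort key: feasible plans rank before infeasible ones, feasible plans are ordered by total utilization cost, and infeasible plans by bottleneck cost. The sketch below is one possible encoding, not the optimizer's actual code; `plan_key`, `pick_optimal`, and the plan names and cost values are made up for illustration.

```python
def plan_key(cb, cu):
    """Sort key: feasible plans first (by utilization), then infeasible (by bottleneck)."""
    if cb <= 1:
        return (0, cu)  # Conditions 1 and 2
    return (1, cb)      # Condition 3


def pick_optimal(plans):
    """`plans` maps a plan name to its (C_b, C_u) pair; return the best name."""
    return min(plans, key=lambda name: plan_key(*plans[name]))


# A and B are feasible; C is infeasible but has the lowest utilization.
best_mixed = pick_optimal({"A": (0.9, 2.0), "B": (0.6, 2.4), "C": (1.3, 1.5)})
# No feasible plan exists: the smaller bottleneck wins.
best_infeasible = pick_optimal({"C": (1.3, 1.5), "D": (1.1, 3.0)})
print(best_mixed, best_infeasible)
```

Plan C is never chosen in the first call despite its low utilization, because feasibility is checked before resource consumption.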

Optimization Solution: Two-Phase Optimization

• Large search space (# possible plans):
– many semantically equivalent logical plans
– a logical plan with $n$ operators -> $2^n$ possible placement decisions
• Two-phase optimization:
– Phase One: determine the optimal logical plan (consider join ordering, etc.)
– Phase Two: determine the placement of each operator in the logical plan produced in Phase One
• Bottom-up plan construction following the dynamic programming (DP) model
• Applicability of DP to the feasibility-dependent optimization objective is proved in the paper
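To make the $2^n$ growth concrete, the following sketch enumerates every placement of a small logical plan on the two engines. The operator names and the helper `enumerate_placements` are illustrative assumptions. The operator counts 6, 15, and 24 are those listed for the 2-way, 5-way, and 8-way join queries in the evaluation.

```python
from itertools import product

ENGINES = ("SPE", "DB")


def enumerate_placements(operators):
    """Yield one dict per placement decision: operator name -> engine."""
    for engines in product(ENGINES, repeat=len(operators)):
        yield dict(zip(operators, engines))


plans = list(enumerate_placements(["SELECT", "JOIN", "PROJECT"]))
print(len(plans))  # 2^3 placements

# Search-space size for the operator counts used in the evaluation.
for n in (6, 15, 24):
    print(n, 2 ** n)
```

Exhaustive enumeration is fine for 6 operators but reaches more than 16 million placements at 24, which is why Phase Two needs pruning.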

Optimization Solution: Pruning in Phase Two

• For each operator O in a logical plan, the optimal sub-plan up to O in which O is placed in the SPE can be built from the optimal sub-plans up to the direct upstream operators of O.
• For a large logical plan: divide it into smaller pieces, then optimize and compose them in post-order.

[Figure: sub-plans ($O_1^{SPE}$, $O_2^{SPE}$) and ($O_1^{DB}$, $O_2^{SPE}$) over inputs I1 and I2, compared with "<"]
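The pruning idea can be illustrated with a much simplified dynamic program. The sketch assumes a linear chain of operators (the slides cover operators with several upstream operators), a hypothetical cost function `cost(i, engine, prev_engine)` that depends only on an operator's own placement and that of its predecessor, and made-up cost numbers. It keeps one best sub-plan per engine of the last placed operator and checks the result against brute-force enumeration. It is a sketch of the principle, not the algorithm from the paper.

```python
from itertools import product

ENGINES = ("SPE", "DB")


def plan_key(cb, cu):
    """Feasible plans rank first by utilization; infeasible ones by bottleneck."""
    return (0, cu) if cb <= 1 else (1, cb)


def dp_placement(n, cost):
    """Place a chain of n operators, keeping one best sub-plan per last engine."""
    # best[e] = (C_b, C_u, placement) of the best sub-plan ending on engine e
    best = {e: (cost(0, e, None), cost(0, e, None), (e,)) for e in ENGINES}
    for i in range(1, n):
        nxt = {}
        for e in ENGINES:
            candidates = []
            for prev, (cb, cu, placement) in best.items():
                c = cost(i, e, prev)
                candidates.append((max(cb, c), cu + c, placement + (e,)))
            # Prune: only the best sub-plan ending on engine e survives.
            nxt[e] = min(candidates, key=lambda s: plan_key(s[0], s[1]))
        best = nxt
    return min(best.values(), key=lambda s: plan_key(s[0], s[1]))


def brute_force(n, cost):
    """Reference implementation: evaluate all 2^n placements."""
    plans = []
    for placement in product(ENGINES, repeat=n):
        costs = [cost(i, e, placement[i - 1] if i else None)
                 for i, e in enumerate(placement)]
        plans.append((max(costs), sum(costs), placement))
    return min(plans, key=lambda s: plan_key(s[0], s[1]))


# Made-up per-operator costs plus a made-up transfer penalty on engine changes.
BASE = {"SPE": [0.2, 0.9, 0.4, 0.3], "DB": [0.5, 0.3, 0.2, 0.6]}


def example_cost(i, engine, prev_engine):
    transfer = 0.15 if prev_engine not in (None, engine) else 0.0
    return BASE[engine][i] + transfer


cb, cu, placement = dp_placement(4, example_cost)
print(placement, round(cb, 3), round(cu, 3))
```

For a chain of n operators the DP evaluates a number of candidates linear in n, whereas the brute-force reference evaluates 2^n placements.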

Evaluation: Setup

• Setup: HP Z620 workstation with 24 cores (1.2 GHz per core) and 96 GB RAM, running SUSE Linux.
• Data: real-world energy consumption data from smart plugs installed in households (DEBS 2014 Grand Challenge).
• Tested queries:

[Figure: tested queries]

Evaluation: Optimizer effectiveness (1)

• Examine 10 source stream data rates picked from the range [1,000, 40,000] tuples/s
• Measure the throughput of the devised optimal plan

[Figure "Max. throughput comparison": y-axis "Max. throughput (thousand/s)"; bars "SELECT in SPE", "All in SPE", "All in DB"; bar labels in extraction order: 26,053.0, 3,108.0, 18,654.0]
[Figure "Actual vs. requested throughput": x-axis "Requested throughput (thousand/s)"; y-axis "Actual throughput (thousand/s)"; series "SELECT IN SPE"]
[Figure, query plan: PROJECT; INNER JOIN; AGGR (avg); AGGR (cnt); SELECT ×2; WINDOW(5 min) ×2]

Evaluation: Optimizer effectiveness (2)

• Examine data rates ranging from 1,000 to 40,000 tuples/s, in increments of 1,000 tuples/s
• Measure the throughput of the devised optimal plan

[Figure "Max. throughput comparison": y-axis "Max. throughput (thousand/s)"; bars "SELECT in SPE" (P1), "SEL, JOIN, P in SPE" (P2), "All in SPE", "All in DB"; bar labels in extraction order: 18,144.0, 28,594.0, 6,044.0, 18,047.0]
[Figure "Actual vs. requested throughput": y-axis "Actual throughput (thousand/s)"; series P1 and P2]
[Figure, query plan: PROJECT; INNER JOIN; AGGR(avg, max) ×2; SELECT ×2; WINDOW(5 min); WINDOW(1 min); plans "SELECT IN SPE (P1)" and "SEL, JOIN, P IN SPE (P2)"]

Evaluation: Influence of Feasibility Check

[Figure: y-axis "Actual throughput (thousand/s)"; series "SELECT IN SPE (with feasibility check)", "SEL, JOIN, P IN SPE (with feasibility check)", "SEL IN SPE (without feasibility check)"]
[Figure, query plan: PROJECT; INNER JOIN; AGGR(avg, max) ×2; SELECT ×2; WINDOW(5 min); WINDOW(1 min)]

Evaluation: Optimization Time

• Tested with join queries (2-way, 5-way, 8-way).

[Figure "# enumerated plans in Phase-Two (log scale)": x-axis 2-way (6), 5-way (15), 8-way (24); series "With pruning" and "Without pruning"; data labels in extraction order: 11, 312, 8411, 64, 327168, 16+ million]
[Figure "Time in millisecond (log scale)": x-axis 2-way (6), 5-way (15), 8-way (24); series "Phase-One" and "Phase-Two"; data labels in extraction order: 0.9, 68.6, 100.5, 12.3, 908.6, 61335.3]

Conclusion

• Exploits the potential of federated execution of CQ over an SPE and a CIMDB.
• Presents a static optimizer that extends traditional optimization techniques to consider the feasibility of CQ.
• Evaluation shows promising results. For the examined queries, the throughput of the devised federated plan is
– up to 8.5 times the throughput of a pure SPE-based plan
– up to 1.8 times the throughput of a pure CIMDB-based plan

References

[AN04] Ayad, A. M. & Naughton, J. F. Static Optimization of Conjunctive Queries with Sliding Windows over Infinite Streams. SIGMOD, 2004

[FKC+09] Franklin, M. J.; Krishnamurthy, S.; Conway, N.; Li, A.; Russakovsky, A. & Thombre, N. Continuous Analytics: Rethinking query processing in a network-effect world. CIDR, 2009

[KS09] Kraemer, J. & Seeger, B. Semantics and implementation of continuous sliding window queries over data streams. ACM TODS, 2009

[BCD+10] Botan, I.; Cho, Y.; Derakhshan, R.; Dindar, N.; Gupta, A.; Haas, L. M.; Kim, K.; Lee, C.; Mundada, G.; Shan, M.-C.; Tatbul, N.; Yan, Y.; Yun, B. & Zhang, J. A demonstration of the MaxStream federated stream processing system. ICDE, 2010

[LMB+10] Liu, M.; Mihaylov, S. R.; Bao, Z.; Jacob, M.; Ives, Z. G.; Loo, B. T. & Guha, S. SmartCIS: integrating digital and physical environments. SIGMOD Record, 2010

[LIM+12] Liarou, E.; Idreos, S.; Manegold, S. & Kersten, M. MonetDB/DataCell: online analytics in a streaming column-store. PVLDB, 2012

[LHB13] Lim, H.; Han, Y. & Babu, S. How to Fit when No One Size Fits. CIDR, 2013

[Ji13] Ji, Y. Database support for processing complex aggregate queries over data streams. EDBT Workshops, 2013

[CDK+14] Çetintemel, U.; Du, J.; Kraska, T.; Madden, S.; Maier, D.; Meehan, J.; Pavlo, A.; Stonebraker, M.; Sutherland, E.; Tatbul, N.; Tufte, K.; Wang, H. & Zdonik, S. B. S-Store: A streaming NewSQL system for big velocity applications. PVLDB, 2014

[DLB+11] Daum, M.; Lauterwald, F.; Baumgärtel, P.; Pollner, N. & Meyer-Wegener, K. Efficient and Cost-aware Operator Placement in Heterogeneous Stream-processing Environments. DEBS, 2011

Thank you!

Query Optimization Problem: State-of-the-Art

Approach | CQ optimization | Federation context | Optimization granularity | Feasibility-dependent opt.
[VN02, AN04] | √ | - | operator | √
Traditional distributed, federated DBMS, e.g., [DH02, BCE+05] | - | √ | operator | -
MaxStream [BCD+10] | - | √ | query | -
Cyclops [LHB13] | √ | √ | query | -
ASPEN [LMB+10] | √ | √ | operator | -
Operator placement, e.g., [DLB+11] | √ | √/X | operator | -

Semantics

• Adopt the abstract semantics defined in [ABW06], which is based on:
– Two data types:
  • Stream (S): a possibly infinite bag of elements <s, t>, where s is a tuple belonging to the schema of S and t is the timestamp of s.
  • Time-varying relation (R): a mapping from the time domain T to a finite but unbounded bag of tuples belonging to the schema of R.
– Three classes of query operators:
  • stream-to-relation (S2R) operators: produce one relation from one stream (e.g., window operators)
  • relation-to-relation (R2R) operators: produce one relation from one or more relations
  • relation-to-stream (R2S) operators: produce one stream from one relation
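The three operator classes can be illustrated with a small Python sketch. It is a simplification under stated assumptions: bags are modelled with `collections.Counter`, the S2R operator is a time-based sliding window, the R2R operator is a selection, and the R2S operator emits newly inserted tuples. The function names and the sample plug readings are made up for illustration.

```python
from collections import Counter


def window(stream, now, size):
    """S2R: time-based sliding window over <s, t> elements; returns a bag."""
    return Counter(s for s, t in stream if now - size < t <= now)


def select(relation, predicate):
    """R2R: keep the tuples of a relation that satisfy the predicate."""
    return Counter({s: n for s, n in relation.items() if predicate(s)})


def istream(relation_now, relation_before, now):
    """R2S: emit, with timestamp `now`, tuples that were not in the relation before."""
    inserted = relation_now - relation_before  # bag difference
    return [(s, now) for s in inserted.elements()]


# Stream elements <s, t>: s = (plug id, load), t = timestamp.
stream = [(("plug1", 40), 1), (("plug2", 75), 2), (("plug1", 90), 4)]


def high_load(s):
    return s[1] > 50


r3 = select(window(stream, now=3, size=3), high_load)  # relation at time 3
r4 = select(window(stream, now=4, size=3), high_load)  # relation at time 4
out = istream(r4, r3, now=4)
print(out)
```

At time 4 the reading `("plug1", 40)` has left the window and `("plug1", 90)` has entered it, so only the new high-load tuple is emitted.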

Introduction: From DBMS to SPE

• Increasing interest in processing high-velocity data streams generated in real time using continuous queries (CQ).
• Need a new processing paradigm

[Figure: DBMS with one-shot queries, stored data, and query results; SPE with a continuous query, streaming data, and query results]

Introduction: From DBMS to SPE

• However, many applications require:
– persisting input streaming data/query results for on-demand analysis
– combining streaming data with static data during processing

[Figure: SPE with a continuous query next to a DBMS with one-shot queries, stored data, and query results; labels: store data, access stored data]

Introduction: Build SPE on Top of DBMS Kernel

• Exploit and merge technologies from both worlds in an integrated way.
– Truviso Continuous Analytics [FKC+09], HP Labs work [CH10], DataCell [LIM+12], S-Store [CDK+14]

[Figure: SPE + DBMS with one-shot queries, a continuous query, stored data, and query results; labels: in-memory table, buffers in UDFs]
