Optimization of Continuous Queries in Federated Database and Stream Processing Systems
Yuanzhen Ji (1), Zbigniew Jerzak (1), Anisoara Nica (1), Gregor Hackenbroich (1), Christof Fetzer (2)
(1) SAP SE, (2) TU
April 14, 2023 BTW 2015
Agenda
• Introduction
• Federated Continuous Query Execution
• Query Optimization Problem
• Our Optimization Solution
• Evaluation
• Conclusions
Introduction
• Problem: optimizing continuous queries (CQ) for federated execution over a native stream processing engine (SPE) and a column-oriented in-memory database (CIMDB).
– operators: select, join, project, aggregate
• Goal: maximize query throughput (amount of data processed per unit time)
[Figure: SPE and CIMDB; labels: data streams, query results, data flow]
Introduction
• Motivation:
– “No one size fits all” (Cyclops [LHB13], [Ji13])
– obtain the best of both worlds (SPE, CIMDB)
• Application scenario:
– analyzing energy consumption data collected from smart plugs installed in households (DEBS 2014 Grand Challenge)
• Main contributions:
– a static cost-based optimizer for federated systems
• extends established optimization techniques
• considers the feasibility property of CQ
– shows the potential of federated CQ execution over an SPE and a CIMDB
• throughput up to 8.5x that of pure SPE-based processing
• throughput up to 1.8x that of pure CIMDB-based processing
Federated Continuous Query Execution
• send relevant input data from SPE to CIMDB
• trigger re-evaluation of query pieces moved to CIMDB
• take results of query pieces executed in CIMDB back to SPE
[Figure: SPE, MIG operators, and CIMDB; labels: data streams, query results, SQL query, data flow]
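The three steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `MigOperator`, the `readings` table, and its columns are hypothetical names, and Python's built-in `sqlite3` stands in for the CIMDB (it is neither column-oriented nor the system used in the talk). Each batch is treated as the complete current input of the query piece.

```python
import sqlite3


class MigOperator:
    """Hypothetical stand-in for a MIG operator; sqlite3 plays the CIMDB role."""

    def __init__(self, sql):
        self.sql = sql  # the query piece that was moved to the database
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE readings (plug_id TEXT, watts REAL)")

    def process(self, batch):
        # 1. Send the relevant input data from the SPE side to the database.
        self.db.execute("DELETE FROM readings")
        self.db.executemany("INSERT INTO readings VALUES (?, ?)", batch)
        # 2. Trigger re-evaluation of the query piece placed in the database.
        rows = self.db.execute(self.sql).fetchall()
        # 3. Hand the results back to the downstream SPE operators.
        return rows


mig = MigOperator(
    "SELECT plug_id, AVG(watts) FROM readings GROUP BY plug_id ORDER BY plug_id"
)
result = mig.process([("p1", 10.0), ("p1", 30.0), ("p2", 5.0)])
print(result)
```

The point of the sketch is the shape of the round trip: every re-evaluation pays for the transfer in, the query execution, and the transfer out.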
Query Optimization Problem
• Problem: determine the optimal execution plan for a given CQ
– currently at deployment time
• Feasibility of continuous queries [AN04]:
– feasible execution plan: can keep up with the data arrival rate
– feasible query: has at least one feasible plan
• Feasibility-dependent optimization objective:
– feasible queries: find the feasible plan with the least resource consumption
– infeasible queries: find the plan with maximal throughput
• State of the art: considers either the feasibility of CQ but not the federation context, or the federation context but not the feasibility of CQ.
[Figure: SPE, CIMDB, and a question mark]
Optimization Solution: Cost Model – Operator Cost (1)
• Operator cost: CPU cost caused by tuples that arrive from the data sources within unit time
• For an operator O with k direct upstream operators: C(O) = Σ_{i=1..k} l_i · c_i
– l_i: # tuples produced by the i-th upstream operator as a result of unit-time source arrivals
– c_i: time to process a single tuple from the i-th upstream operator
• Example: l_1 = 300, l_2 = 200, c_1 = 0.001, c_2 = 0.002, so C(O) = 300 * 0.001 + 200 * 0.002 = 0.7
[Figure annotations: bottleneck, infeasible plan]
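The operator cost can be checked with a small Python sketch. It assumes the cost is the sum of l_i times c_i over the direct upstream operators, which is what the example arithmetic on the slide implies; the function and parameter names are illustrative.

```python
from typing import Sequence


def operator_cost(tuple_counts: Sequence[float],
                  per_tuple_times: Sequence[float]) -> float:
    """Sum of l_i * c_i over the direct upstream operators."""
    if len(tuple_counts) != len(per_tuple_times):
        raise ValueError("need one per-tuple time for each upstream operator")
    return sum(l * c for l, c in zip(tuple_counts, per_tuple_times))


# The slide's example: two upstream operators.
cost = operator_cost([300, 200], [0.001, 0.002])
print(round(cost, 6))  # 0.7
```

A cost of 0.7 means the operator is busy for 0.7 time units per unit time of source arrivals, so it can keep up with its input.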
Optimization Solution: Cost Model – Operator Cost (2)
• A query piece executed in CIMDB and its corresponding MIG operator:
– are treated as one composite operator and costed as a whole
– cost includes data transfer (in and out) cost and query execution cost
[Figure: SPE, MIG operator, and CIMDB; labels: data streams, query results, SQL query, data flow]
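Optimization Solution: Cost Model – Execution Plan Cost
• Execution plan cost: C(P) = <Cb(P), Cu(P)> for a plan P with m operators O_1, ..., O_m
– Two components:
• bottleneck cost: Cb(P) = max_{1≤j≤m} C(O_j)
• total utilization cost: Cu(P) = Σ_{j=1..m} C(O_j)
– P is infeasible if Cb(P) > 1
• Example: a plan with four operators O1 to O4 and operator costs 0.5, 0.3, 1.1, and 0.7 has Cb(P) = 1.1 and Cu(P) = 2.6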
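A minimal Python sketch of the plan cost follows. It assumes that the bottleneck cost is the largest operator cost and the total utilization cost is the sum of all operator costs, which matches the example values 1.1 and 2.6 for operator costs 0.5, 0.3, 1.1, and 0.7. `PlanCost`, `plan_cost`, and `is_feasible` are illustrative names.

```python
from typing import NamedTuple, Sequence


class PlanCost(NamedTuple):
    bottleneck: float   # Cb(P): largest operator cost in the plan
    utilization: float  # Cu(P): sum of all operator costs in the plan


def plan_cost(operator_costs: Sequence[float]) -> PlanCost:
    """Combine per-operator costs into the two-component plan cost."""
    return PlanCost(max(operator_costs), sum(operator_costs))


def is_feasible(cost: PlanCost) -> bool:
    """A plan keeps up with the arrival rate only if no operator exceeds 1."""
    return cost.bottleneck <= 1.0


example = plan_cost([0.5, 0.3, 1.1, 0.7])
print(example.bottleneck, round(example.utilization, 6), is_feasible(example))
```

In the example the operator with cost 1.1 is the bottleneck, so the plan is infeasible even though the other three operators have spare capacity.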
Optimization Solution: Optimal Execution Plan
• An execution plan P of a CQ is an optimal plan iff, for any other plan P’ of the CQ, one of the following conditions is satisfied:
– Condition 1: P is feasible but P’ is infeasible (Cb(P) ≤ 1 < Cb(P’))
– Condition 2: both P and P’ are feasible, and P has no higher total utilization cost (Cb(P) ≤ 1, Cb(P’) ≤ 1, and Cu(P) ≤ Cu(P’))
– Condition 3: both P and P’ are infeasible, and P has no higher bottleneck cost (1 < Cb(P) ≤ Cb(P’))
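The three conditions can be folded into a single sort key, sketched below in Python. The sketch reads Condition 3 through its inequality 1 < Cb(P) ≤ Cb(P’), i.e. as the case where both plans are infeasible and the lower bottleneck cost wins. `PlanCost`, `preference_key`, and `pick_optimal` are illustrative names, and the plan costs in the usage lines are made-up numbers.

```python
from typing import Dict, NamedTuple, Tuple


class PlanCost(NamedTuple):
    bottleneck: float   # Cb(P)
    utilization: float  # Cu(P)


def preference_key(cost: PlanCost) -> Tuple[int, float]:
    """Smaller key = preferred plan.

    Feasible plans sort before infeasible ones (Condition 1), feasible plans
    are ranked by total utilization cost (Condition 2), and infeasible plans
    by bottleneck cost (Condition 3).
    """
    if cost.bottleneck <= 1.0:
        return (0, cost.utilization)
    return (1, cost.bottleneck)


def pick_optimal(plans: Dict[str, PlanCost]) -> str:
    """Return the name of a plan that no other plan beats."""
    return min(plans, key=lambda name: preference_key(plans[name]))


feasible_mix = {"A": PlanCost(0.9, 3.0), "B": PlanCost(0.6, 3.5), "C": PlanCost(1.2, 1.0)}
all_infeasible = {"X": PlanCost(1.5, 2.0), "Y": PlanCost(1.2, 9.0)}
print(pick_optimal(feasible_mix), pick_optimal(all_infeasible))
```

Note that plan C has the lowest utilization of all but still loses, because feasibility is checked before resource consumption.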
Optimization Solution: Two-Phase Optimization
• Large search space (# possible plans):
– many semantically equivalent logical plans
– a logical plan with n operators -> 2^n possible placement decisions
• Two-phase optimization:
– Phase One: determine the optimal logical plan (consider join ordering, etc.)
– Phase Two: determine the placement of each operator in the logical plan produced in Phase One
• Bottom-up plan construction following the dynamic programming (DP) model
• Applicability of DP to the feasibility-dependent optimization objective is proved in the paper.
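To make the size of the placement search space concrete, the sketch below enumerates every assignment of operators to the two engines. It assumes each operator can be placed independently on either engine, so n operators give 2 to the power of n placements; the operator names are illustrative.

```python
from itertools import product

ENGINES = ("SPE", "CIMDB")


def enumerate_placements(operators):
    """Yield every assignment of operators to engines as a dict."""
    for assignment in product(ENGINES, repeat=len(operators)):
        yield dict(zip(operators, assignment))


def placement_count(n):
    """Number of placements for a logical plan with n operators."""
    return len(ENGINES) ** n


ops = ["select", "join", "project"]
placements = list(enumerate_placements(ops))
print(len(placements), placement_count(24))
```

For the 24-operator case the count is 16,777,216, which is why exhaustive enumeration in Phase Two does not scale and pruning is needed.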
Optimization Solution: Pruning in Phase Two
• For each operator O in a logical plan, the optimal sub-plan up to O, where O is placed in the SPE, can be built from the optimal sub-plans up to the direct upstream operators of O.
• For a large logical plan: divide it into smaller pieces, then optimize and compose them in post order.
[Figure: sub-plans I1 and I2 over operators O1 and O2, with placements O1^SPE, O1^DB, and O2^SPE, compared by "<"]
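The bottom-up construction with pruning can be sketched for a deliberately simplified model. The assumptions are loud ones and are not the cost model of the talk: the logical plan is a linear chain, every operator has a fixed cost per engine, and a flat `transfer` charge is added whenever two neighbouring operators run on different engines (the talk instead costs a database query piece and its MIG operator as one composite operator). All names are illustrative.

```python
from itertools import product

ENGINES = ("SPE", "DB")


def extend(cost, op_cost):
    """Add one operator to a sub-plan cost (bottleneck, utilization)."""
    cb, cu = cost
    return (max(cb, op_cost), cu + op_cost)


def preference(cost):
    """Feasibility-dependent ranking; a smaller value is preferred."""
    cb, cu = cost
    return (0, cu) if cb <= 1.0 else (1, cb)


def step_cost(op, engine, prev_engine, transfer):
    """Cost of `op` on `engine`, plus the transfer charge on an engine switch."""
    switched = prev_engine is not None and prev_engine != engine
    return op[engine] + (transfer if switched else 0.0)


def dp_placement(chain, transfer):
    """Return (cost, placement) for a linear chain, built bottom-up."""
    # best[e] = preferred sub-plan whose last operator runs on engine e
    best = {None: ((0.0, 0.0), [])}
    for op in chain:
        nxt = {}
        for engine in ENGINES:
            candidates = [
                (extend(cost, step_cost(op, engine, prev, transfer)), plan + [engine])
                for prev, (cost, plan) in best.items()
            ]
            # Pruning: keep one sub-plan per placement of the current operator.
            nxt[engine] = min(candidates, key=lambda c: preference(c[0]))
        best = nxt
    return min(best.values(), key=lambda c: preference(c[0]))


chain = [
    {"SPE": 0.2, "DB": 0.6},
    {"SPE": 1.3, "DB": 0.4},
    {"SPE": 0.3, "DB": 0.5},
]
cost, placement = dp_placement(chain, transfer=0.1)
print(placement)  # ['SPE', 'DB', 'SPE']
```

Only two sub-plans survive after each operator, so the work grows linearly with the chain length instead of doubling with every operator.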
Evaluation: Setup
• Setup: HP Z620 workstation with 24 cores (1.2 GHz per core) and 96 GB RAM, running SUSE Linux.
• Data: real-world energy consumption data from smart plugs installed in households (DEBS 2014 Grand Challenge).
• Tested queries:
Evaluation: Optimizer effectiveness (1)
• Examine 10 source stream data rates picked from the range [1,000, 40,000] (tuples/s)
• Measure throughput of the devised optimal query plan
[Figure: Max. throughput comparison. Y axis: Max. throughput (thousand/s). Bars: SELECT in SPE 26,053.0; All in SPE 3,108.0; All in DB 18,654.0]
[Figure: Actual vs. requested throughput. X axis: Requested throughput (thousand/s). Y axis: Actual throughput (thousand/s). Series: SELECT in SPE]
[Figure: tested query with operators PROJECT, INNER JOIN, AGGR (avg), SELECT, SELECT, WINDOW (5 min), WINDOW (5 min), AGGR (cnt); devised plan: SELECT IN SPE]
Evaluation: Optimizer effectiveness (2)
• Examine data rates ranging from 1,000 to 40,000 tuples/s, in increments of 1,000 tuples/s
• Measure throughput of the devised optimal query plan
[Figure: Max. throughput comparison. Y axis: Max. throughput (thousand/s). Bars: SELECT in SPE (P1) 18,144.0; SEL, JOIN, P in SPE (P2) 28,594.0; All in SPE 6,044.0; All in DB 18,047.0]
[Figure: Actual vs. requested throughput. X axis: Requested throughput (thousand/s). Y axis: Actual throughput (thousand/s). Series: P1, P2]
[Figure: tested query with operators PROJECT, INNER JOIN, AGGR (avg, max), AGGR (avg, max), SELECT, SELECT, WINDOW (5 min), WINDOW (1 min); devised plans: SELECT IN SPE (P1); SEL, JOIN, P IN SPE (P2)]
Evaluation: Influence of Feasibility Check
[Figure: Actual vs. requested throughput. X axis: Requested throughput (thousand/s). Y axis: Actual throughput (thousand/s). Series: SELECT IN SPE (with feasibility check); SEL, JOIN, P IN SPE (with feasibility check); SEL IN SPE (without feasibility check)]
[Figure: tested query with operators PROJECT, INNER JOIN, AGGR (avg, max), AGGR (avg, max), SELECT, SELECT, WINDOW (5 min), WINDOW (1 min)]
Evaluation: Optimization Time
• Tested with join queries (2-way, 5-way, 8-way).
[Figure: # enumerated plans in Phase Two (log scale), with pruning vs. without pruning, for the 2-way (6), 5-way (15), and 8-way (24) queries. Bar labels: 11, 312, 8411, 64, 327168, 16+ million]
[Figure: Time in milliseconds (log scale), Phase One vs. Phase Two, for the 2-way (6), 5-way (15), and 8-way (24) queries. Bar labels: 0.9, 68.6, 100.5, 12.3, 908.6, 61335.3]
[Figure: tested query with operators PROJECT, INNER JOIN, AGGR (avg, max), AGGR (avg, max), SELECT, SELECT, WINDOW (5 min), WINDOW (1 min)]
Conclusion
• Exploits the potential of federated execution of CQ over an SPE and a CIMDB.
• Presents a static optimizer that extends traditional optimization techniques to consider the feasibility of CQ.
• Evaluation shows promising results. For the examined queries, the throughput of the devised federated plan is
– up to 8.5 times as high as the throughput of the pure SPE-based plan
– up to 1.8 times as high as the throughput of the pure CIMDB-based plan
References
[AN04] Ayad, A. M. & Naughton, J. F. Static Optimization of Conjunctive Queries with Sliding Windows over Infinite Streams. SIGMOD, 2004
[FKC+09] Franklin, M. J.; Krishnamurthy, S.; Conway, N.; Li, A.; Russakovsky, A. & Thombre, N. Continuous Analytics: Rethinking query processing in a network-effect world. CIDR, 2009
[KS09] Kraemer, J. & Seeger, B. Semantics and implementation of continuous sliding window queries over data streams. ACM TODS, 2009
[BCD+10] Botan, I.; Cho, Y.; Derakhshan, R.; Dindar, N.; Gupta, A.; Haas, L. M.; Kim, K.; Lee, C.; Mundada, G.; Shan, M.-C.; Tatbul, N.; Yan, Y.; Yun, B. & Zhang, J. A demonstration of the MaxStream federated stream processing system. ICDE, 2010
[LMB+10] Liu, M.; Mihaylov, S. R.; Bao, Z.; Jacob, M.; Ives, Z. G.; Loo, B. T. & Guha, S. SmartCIS: integrating digital and physical environments. SIGMOD Record, 2010
[LIM+12] Liarou, E.; Idreos, S.; Manegold, S. & Kersten, M. MonetDB/DataCell: online analytics in a streaming column-store. PVLDB, 2012
[LHB13] Lim, H.; Han, Y. & Babu, S. How to Fit when No One Size Fits. CIDR, 2013
[Ji13] Ji, Y. Database support for processing complex aggregate queries over data streams. EDBT Workshops, 2013
[CDK+14] Çetintemel, U.; Du, J.; Kraska, T.; Madden, S.; Maier, D.; Meehan, J.; Pavlo, A.; Stonebraker, M.; Sutherland, E.; Tatbul, N.; Tufte, K.; Wang, H. & Zdonik, S. B. S-Store: A streaming NewSQL system for big velocity applications. PVLDB, 2014
[DLB+11] Daum, M.; Lauterwald, F.; Baumgärtel, P.; Pollner, N. & Meyer-Wegener, K. Efficient and Cost-aware Operator Placement in Heterogeneous Stream-processing Environments. DEBS, 2011
Thank you!
Query Optimization Problem: State-of-the-Art
Table columns: CQ optimization, Federation context, Optimization granularity, Feasibility-dependent opt.
– [VN02, AN04]: √, operator, √
– Traditional distributed, federated DBMS, e.g., [DH02, BCE+05]: √, operator
– MaxStream [BCD+10]: √
– Cyclops [LHB13]: √, √, query
– ASPEN [LMB+10]: √, √, operator
– Operator placement, e.g., [DLB+11]: √, √/X, operator
query
Semantics
• Adopt the abstract semantics defined in [ABW06], which is based on:
– Two data types:
• Stream (S): a possibly infinite bag of elements <s, t>, where s is a tuple belonging to the schema of S and t is the timestamp of s.
• Time-varying relation (R): a mapping from the time domain T to a finite but unbounded bag of tuples belonging to the schema of R.
– Three classes of query operators:
• stream-to-relation (S2R) operators: produce one relation from one stream (e.g., window operators)
• relation-to-relation (R2R) operators: produce one relation from one or more relations
• relation-to-stream (R2S) operators: produce one stream from one relation
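The three operator classes can be illustrated with a small Python sketch in which a bag is a `collections.Counter`. The concrete operators are assumptions chosen for illustration: a time-based sliding window for S2R, a selection for R2R, and an insert-stream operator for R2S that emits the tuples present in the relation now but not at the previous instant. The function names and the sample stream are made up.

```python
from collections import Counter


def time_window(stream, width, now):
    """S2R: bag of tuples whose timestamp lies in (now - width, now]."""
    return Counter(s for s, t in stream if now - width < t <= now)


def select(relation, predicate):
    """R2R: keep the tuples that satisfy the predicate, with multiplicity."""
    return Counter({s: n for s, n in relation.items() if predicate(s)})


def istream(current, previous, now):
    """R2S: emit <s, now> for tuples in `current` but not in `previous`."""
    return [(s, now) for s in (current - previous).elements()]


# Stream elements are <s, t>: s = (plug id, load), t = timestamp.
stream = [(("plug1", 40), 1), (("plug2", 95), 2), (("plug1", 120), 3)]


def is_high(s):
    return s[1] > 90


at_2 = select(time_window(stream, width=2, now=2), is_high)
at_3 = select(time_window(stream, width=2, now=3), is_high)
print(istream(at_3, at_2, now=3))
```

Chaining the three functions mirrors the usual shape of a continuous query: a window turns the stream into a relation, relational operators transform it, and a final operator turns the changes back into a stream.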
Introduction: From DBMS to SPE
• Increasing interest in processing high-velocity data streams generated in real time using continuous queries (CQ).
• Need a new processing paradigm
[Figure: DBMS with one-shot queries, stored data, and query results; SPE with a continuous query, streaming data, and query results]
Introduction: From DBMS to SPE
• However, many applications require:
– persisting input streaming data/query results for on-demand analysis
– combining streaming data with static data during processing.
[Figure: DBMS with one-shot queries, stored data, and query results; SPE with a continuous query, streaming data, and query results; labels between them: store data, access stored data]
Introduction: Build SPE on Top of DBMS Kernel
• Exploit and merge technologies from both worlds in an integrated way.
– Truviso Continuous Analytics [FKC+09], HP Labs work [CH10], DataCell [LIM+12], S-Store [CDK+14]
[Figure: SPE + DBMS with one-shot queries, a continuous query, stored data, streaming data, and query results; labels: in-memory table, buffers in UDFs]