Query Optimization in Apache Tajo
Jihoon Son / Gruter Inc.
About Me
● Jihoon Son (@jihoonson)
  ○ Tajo project co-founder
  ○ Committer and PMC member of Apache Tajo
  ○ Research engineer at Gruter
Outline
● Introduction to Tajo
● Query processing in Tajo
  ○ Query plans in Tajo
  ○ Query processing example
● Query optimization in Tajo
  ○ Introduction to query optimization
  ○ Query optimization techniques in Tajo
What is Tajo?
● Apache Top-level Project
  ○ Data warehouse system
    ■ Efficient processing of analytic queries
    ■ ANSI-SQL compliant
  ○ Scalable and rapid query execution with its own engine
    ■ Distributed query processing
    ■ Fault tolerance
  ○ Beyond SQL-on-Hadoop
    ■ Supports various types of storage
      ● HDFS, S3, HBase, RDBMS, ...
Highlighted Features
● Supports long-running batch queries as well as interactive ad-hoc queries
  ○ Fast query processing
    ■ Optimized scan performance
      ● 120 MB/sec per physical disk (SATA)
  ○ Reliability
    ■ Fault tolerance
    ■ No single point of failure with HA support
Highlighted Features
● Support for various kinds of data sources
  ○ HDFS, Amazon S3, Google Cloud Storage, HBase, RDBMS, ...
● Mature SQL support
  ○ Various kinds of join support
  ○ Window function support
  ○ Cost-based query optimization
● Integration with other systems
  ○ Notebooks like Zeppelin
  ○ BI tools
Recent Release: 0.11
● Feature highlights
  ○ Query federation
  ○ JDBC-based storage support
  ○ Self-describing data formats support
  ○ Multi-query support
  ○ More stable and efficient join execution
  ○ Index support
  ○ Python UDF/UDAF support
Architecture Overview
[Figure: Tajo architecture diagram. Components: clients (JDBC client, TSQL, Web UI, REST API); Tajo Masters, each with a Catalog Server, backed by a DBMS or HCatalog; Tajo Workers, each with a Query Master, a Query Executor, and a Storage Service; the underlying storage. Arrow labels: "Submit a query", "Manage metadata", "Allocate a query", "Send tasks & monitor".]
Query Execution Steps
● Initializing a query execution
  ① Submit a query
  ② Assign a query
  ③ Build a query execution plan
● Executing a query
  ④ Send tasks & monitor
  ⑤ Read and process data
  ⑥ Send status and progress
● Finalizing the query execution
  ⑦ Store the result on storage
  ⑧ Notify that query execution is completed
  ⑨ Send the result location
  ⑩ Read the result
[Figure: the steps above drawn as arrows between the Tajo Client, the Tajo Master (with its Catalog Server and DBMS), the Tajo Workers (one of which runs the Query Master; each has a Query Executor and a Storage Service), and the storage.]
Query Processing in Tajo
Query Execution Plan
● Given a user query, a query execution plan is an ordered set of steps to execute the query
  ○ Example
    ■ Read data from storage, then join on some join keys, and finally aggregate with some aggregation keys
● In Tajo, there are three kinds of query plans
  ○ The query master generates a logical query plan and a distributed query plan
  ○ The query executor of each Tajo worker generates a local query plan
Query Planning Steps in Tajo
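[Figure: planning pipeline]
SQL → SQL Analyzer → Algebraic Expression → Logical Planner → Logical Query Plan → Global Planner → Distributed Query Plan → (distributed to Tajo workers) → Physical Planner → Local Query Plan
● Query Master: produces the logical and distributed query plans
● Query Executor: produces the local query plan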
Logical Query Plan
● A tree of relational algebra operators
● Example

< SQL >
  SELECT item.brand, sum(price)
  FROM sales, item
  WHERE sales.item_key = item.item_key
  GROUP BY item.brand

< Logical query plan >
  Group by (key: brand, func: sum(price))
    Join (key: item_key)
      Scan on item
      Scan on sales
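To make the tree shape of the logical plan concrete, here is a minimal Python sketch that models the example plan as nested nodes. It is only an illustration: the PlanNode class and its field names are hypothetical and are not Tajo's internal representation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    """Hypothetical plan node: an operator name, a note, and child nodes."""
    op: str
    detail: str = ""
    children: List["PlanNode"] = field(default_factory=list)

    def render(self, depth: int = 0) -> List[str]:
        # Produce one indented line per operator, parents before children.
        label = f"{self.op} ({self.detail})" if self.detail else self.op
        lines = ["  " * depth + label]
        for child in self.children:
            lines.extend(child.render(depth + 1))
        return lines


# Logical plan of the example query: a group by over a join over two scans.
logical_plan = PlanNode("Group by", "key: brand, func: sum(price)", [
    PlanNode("Join", "key: item_key", [
        PlanNode("Scan on item"),
        PlanNode("Scan on sales"),
    ]),
])

print("\n".join(logical_plan.render()))
```

Execution proceeds bottom-up: the scans at the leaves feed the join, and the join feeds the group by at the root.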
Distributed Query Plan
● A plan with additional annotations for distributed execution
  ○ Data exchange (shuffle) keys, methods, ...

< Distributed query plan >
  Group by (key: brand, func: sum(price))
    Range shuffle with brand
    Join (key: item_key)
      Hash shuffle with item_key
        Scan on item
      Hash shuffle with item_key
        Scan on sales
Local Query Plan
● A plan with additional annotations for local execution
  ○ In-memory algorithm, disk-based algorithm, …

< Local query plan >
  Group by (key: brand, func: sum(price)), executed as hash aggregation
    Range shuffle with brand
    Join (key: item_key), executed as sort-merge join
      Hash shuffle with item_key
        Scan on item
      Hash shuffle with item_key
        Scan on sales
Query Processing in Tajo
● A query is executed by running multiple stages one after another
  ○ A stage is the minimum unit of execution and runs at least one operator
● Each stage is processed in parallel by multiple query executors of Tajo workers
[Figure: a Join (key: item_key) over Scan on item and Scan on sales, divided into Stage 1 and Stage 2.]
Query Processing Example
● SQL
    SELECT item.brand, sum(price)
    FROM sales, item
    WHERE sales.item_key = item.item_key
    GROUP BY item.brand
● Logical query plan
    Group by (key: brand, func: sum(price))
      Join (key: item_key)
        Scan on item
        Scan on sales
● Distributed query plan
    Stage 3: Group by (key: brand, func: sum(price))
      Range shuffle with brand
    Stage 2: Join (key: item_key)
      Hash shuffle with item_key (one per scan)
    Stage 1: Scan on item, Scan on sales
● Distributed processing
  ○ Workers scan the input splits in parallel (item, item, sales, sales, sales)
  ○ After a shuffle, workers run the join in parallel
  ○ After another shuffle, workers run the group by in parallel
Query Optimization in Tajo
Query Optimization
● Most user queries are not written with performance in mind
● The query optimizer attempts to determine the most efficient way to execute a user query
  ○ It considers the possible query plans and chooses the best one
Extreme Example
● Query
  ○ select * from t where name like 'tajo%' order by id;
● Possible plans
  ○ Naive plan: Scan → Sort → Filter
    ■ Filtering out tuples after sort
    ■ Large cost for sort
  ○ Better plan: Scan with Filter → Sort
    ■ Filtering out tuples immediately after scan
    ■ Small cost for sort
    ■ Reduced number of operations
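The difference between the two plans can be sketched in a few lines of Python. This is a toy model, not Tajo code: the rows and the helper names naive_plan and better_plan are made up for illustration, and the number of rows handed to the sort stands in for the sort cost.

```python
def naive_plan(rows, prefix):
    # Scan -> Sort -> Filter: every scanned row goes through the sort.
    ordered = sorted(rows, key=lambda r: r["id"])
    result = [r for r in ordered if r["name"].startswith(prefix)]
    return result, len(rows)


def better_plan(rows, prefix):
    # Scan with Filter -> Sort: only rows that pass the filter are sorted.
    kept = [r for r in rows if r["name"].startswith(prefix)]
    return sorted(kept, key=lambda r: r["id"]), len(kept)


# Hypothetical contents of table t.
rows = [
    {"id": 3, "name": "tajo-worker"},
    {"id": 1, "name": "hive"},
    {"id": 2, "name": "tajo-master"},
    {"id": 4, "name": "spark"},
]

naive_result, naive_sorted = naive_plan(rows, "tajo")
better_result, better_sorted = better_plan(rows, "tajo")

# Both plans return the same answer; they differ in how many rows were sorted.
print(naive_result == better_result, naive_sorted, better_sorted)
```

Both plans are equivalent in result, which is what allows the optimizer to pick the cheaper one.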
Two Kinds of Query Optimization
● Rule-based optimization
  ○ A set of predefined rules is used to choose a good plan
  ○ Usually, heuristic approaches are used
    ■ Ex) filters should be pushed down to the lower part of the query plan as much as possible
● Cost-based optimization
  ○ Enumerates possible query plans and chooses the one with the lowest cost
  ○ The cost function plays an important role
● Tajo utilizes both types of optimization
Query Optimization in Tajo
● Difference from traditional query optimization
  ○ Unlike in traditional database systems, pre-collected statistics are less important
    ■ Data may be added or updated by several systems including Flume, Kafka, Tajo, …
    ■ Pre-collected statistics can be useful, but are not fully trustworthy
  ○ It is important to optimize query plans with minimal statistics
    ■ Volume of input relations
Query Optimization in Tajo
● Tajo has two different approaches to query optimization
  ○ Static optimization
    ■ Traditional approach
    ■ Optimizing the plan during the query planning phase
  ○ Progressive optimization
    ■ Optimizing the plan based on intermediate statistics while executing the query
      ● A query plan can be optimized without pre-collected statistics
      ● Especially effective for queries that require multi-stage execution
Logical Query Plan Optimization
● Rule-based optimization
  ○ Access path rewrite rule
    ■ Choosing the access path to data
    ■ Index scan has the highest priority if available
  ○ Distributivity rule
    ■ Reducing filters based on distributivity
  ○ Filter pushdown rule
    ■ Pushing down filters to the lowest part as much as possible
  ○ In-subquery rewrite rule
    ■ Transforming subqueries in 'IN' filters to semi (anti) joins
Logical Query Plan Optimization
● Rule-based optimization (cont'd)
  ○ Projection pushdown rule
    ■ Pushing down projections to the lowest part as much as possible
● Cost-based optimization
  ○ Join order optimization
    ■ Finding the join order with the lowest cost
    ■ Greedy heuristic: ordering relations from small ones to large ones
      ● Very effective in a single-node computing environment
      ● Needs improvement for parallel computing environments
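The greedy join-order heuristic (small relations first) can be sketched as a simple sort by estimated size. This is a simplified illustration, not Tajo's optimizer: the function name greedy_join_order and the relation sizes are hypothetical, and a real optimizer would also check that consecutive relations share a join condition.

```python
def greedy_join_order(relation_sizes):
    """Order relation names from the smallest to the largest estimated size."""
    return sorted(relation_sizes, key=relation_sizes.get)


# Hypothetical size estimates in bytes.
sizes = {"lineitem": 700_000_000, "nation": 2_000, "region": 400}

order = greedy_join_order(sizes)
print(" JOIN ".join(order))
```

Joining the small relations first tends to keep intermediate results small, which is why the heuristic only needs the volume of each input relation.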
Distributed Query Plan Optimization
● Rule-based optimization
  ○ Two-phase execution of operators
    ■ Operators that require data shuffling, such as aggregation, join, or sort, are executed in two phases
    ■ The first phase performs local computation to reduce the amount of shuffled data
    ■ The second phase produces the final result of the operation
Two-phase Execution Example
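● Logical query plan
● Distributed query plan
[Figure: a logical plan made of Scan, Group by, and Sort, next to its distributed plan spread over Stages 1 to 3. In the distributed plan, Local group by and Local sort operators are added ahead of the final Group by and Sort.]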
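The effect of two-phase execution on an aggregation can be sketched as follows. This is a toy model under stated assumptions: each inner list stands for the rows seen by one worker, the helper names local_group_by and final_group_by are hypothetical, and the shuffle itself is reduced to passing the partial results to a single final step.

```python
from collections import defaultdict


def local_group_by(rows):
    # Phase 1: each worker pre-aggregates its own rows before the shuffle.
    partial = defaultdict(int)
    for brand, price in rows:
        partial[brand] += price
    return dict(partial)


def final_group_by(partials):
    # Phase 2: merge the partial sums that arrive after the shuffle.
    total = defaultdict(int)
    for partial in partials:
        for brand, subtotal in partial.items():
            total[brand] += subtotal
    return dict(total)


# Hypothetical (brand, price) rows held by two workers.
worker_inputs = [
    [("a", 10), ("b", 5), ("a", 7)],
    [("b", 1), ("b", 2), ("c", 4)],
]

partials = [local_group_by(rows) for rows in worker_inputs]
raw_rows = sum(len(rows) for rows in worker_inputs)
shuffled_rows = sum(len(partial) for partial in partials)
result = final_group_by(partials)

print(raw_rows, shuffled_rows, result)
```

The first phase shrinks what has to cross the network (fewer rows are shuffled than were scanned), while the second phase still yields the same totals as a single global aggregation.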
Distributed Query Plan Optimization
● Distributed join algorithm selection
  ○ Two representative distributed join algorithms
    ■ Join cannot be performed within a single stage in distributed systems
      ● Tuples of the same join key may be distributed over cluster nodes
    ■ Repartition join
      ● Both input relations are shuffled on the join key columns
    ■ Broadcast join
      ● Small relations are broadcast to every node before the join
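The two join strategies can be contrasted with a small in-process simulation. This is a sketch, not Tajo's implementation: nodes are modeled as Python lists, the helper names are hypothetical, and the employee and department rows are invented sample data keyed by DeptName.

```python
from collections import defaultdict


def hash_join(left, right):
    """Join two lists of (key, value) pairs on the key."""
    table = defaultdict(list)
    for key, value in right:
        table[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in table.get(key, [])]


def repartition_join(left, right, num_nodes):
    # Shuffle both inputs by the hash of the join key, then join per node.
    left_parts = [[] for _ in range(num_nodes)]
    right_parts = [[] for _ in range(num_nodes)]
    for row in left:
        left_parts[hash(row[0]) % num_nodes].append(row)
    for row in right:
        right_parts[hash(row[0]) % num_nodes].append(row)
    out = []
    for lp, rp in zip(left_parts, right_parts):
        out.extend(hash_join(lp, rp))
    return out


def broadcast_join(left_parts, small_right):
    # The large input stays in place; each node gets a full copy of the small one.
    out = []
    for lp in left_parts:
        out.extend(hash_join(lp, small_right))
    return out


# Hypothetical sample data: (DeptName, employee) and (DeptName, location).
employee = [("Sales", "Kim"), ("Eng", "Lee"), ("Sales", "Park"), ("HR", "Choi")]
department = [("Sales", "Seoul"), ("Eng", "Busan")]

repartitioned = repartition_join(employee, department, num_nodes=3)
broadcasted = broadcast_join([employee[:2], employee[2:]], department)
print(sorted(repartitioned) == sorted(broadcasted))
```

Both strategies produce the same rows. The repartition join moves both inputs across the network, while the broadcast join moves only the small relation, once per node.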
Example of Repartition Join
● select … from employee e, department d where e.DeptName = d.DeptName
Example of Broadcast Join
● select … from employee e, department d where e.DeptName = d.DeptName
Distributed Join Algorithm Selection
● Repartition join vs. broadcast join
  ○ Given a set of joins, some can be executed with broadcast join while the rest are executed with repartition join
● Which joins will be executed with broadcast join?
  ○ Greedy heuristic: broadcast join is used as much as possible
    ■ The size of each input relation should be smaller than a pre-defined threshold
    ■ The total volume of broadcast relations should not exceed a pre-defined threshold
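The two threshold checks of the greedy heuristic can be sketched as below. This is an illustration under assumptions: the function name choose_broadcast, the relation sizes, and both threshold values are hypothetical (they are not Tajo defaults), and considering the smallest relations first is a choice made for this sketch.

```python
def choose_broadcast(relation_sizes, per_relation_limit, total_limit):
    """Pick relations to broadcast, smallest first, within both limits."""
    chosen, total = [], 0
    for name, size in sorted(relation_sizes.items(), key=lambda kv: kv[1]):
        if size >= per_relation_limit:
            break  # this relation and all larger ones are too big to broadcast
        if total + size > total_limit:
            break  # adding it would exceed the total broadcast volume
        chosen.append(name)
        total += size
    return chosen


# Hypothetical size estimates in bytes.
sizes = {"lineitem": 700_000_000, "nation": 2_000, "region": 400}

broadcast = choose_broadcast(sizes, per_relation_limit=5_000_000, total_limit=10_000_000)
print(broadcast)
```

Relations that are not chosen, such as the large lineitem table here, would be joined with a repartition join instead.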
Distributed Join Algorithm Selection Example
● select … from lineitem, nation, region …
Local Query Plan Optimization
● Selecting the best algorithm based on the current resource status
  ○ Aggregation
    ■ Hash aggregation, sort aggregation
  ○ Join
    ■ Hash join, sort-merge join
● For sort, hash sort is used by default, spilling data to disk when it does not fit into memory
Progressive Optimization
● Data repartition
  ○ Some operators, such as join or aggregation, require shuffling data by keys
  ○ The number of result partitions of a shuffle should be decided carefully
    ■ The number of partitions is related to the number of tasks of the next stage
● At the beginning of each stage, the number of partitions is decided based on the input size
Progressive Optimization Example
● Assuming the default task size is 1GB:
[Figure: the plan Scan on item → Group by → Sort over Stages 1 to 3, shown twice. First diagram: Scan on item (100GB), # of partitions: 100. Second diagram: Group by (50GB), # of partitions: 50, with # of tasks: 100 and # of tasks: 50 marked on the stages.]
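The numbers in this example are consistent with dividing the input size of a stage by the task size. The following Python sketch shows that arithmetic; the function name num_partitions and the rounding-up rule are assumptions made for illustration, not a description of Tajo's exact formula.

```python
GB = 1024 ** 3


def num_partitions(input_bytes, task_bytes=1 * GB):
    """Number of partitions so that each task handles at most task_bytes."""
    # Integer ceiling division; at least one partition even for tiny inputs.
    return max(1, (input_bytes + task_bytes - 1) // task_bytes)


# Decided when each stage starts, from the input size observed at that point.
scan_partitions = num_partitions(100 * GB)
group_by_partitions = num_partitions(50 * GB)
print(scan_partitions, group_by_partitions)
```

Because the input size of a later stage is only known once the earlier stage has produced its output, this decision can be made at run time without pre-collected statistics.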
Future Work
● Adding more optimization methods
● Improving cost functions for more effective cost-based optimization
● Adding new approaches for progressive optimization
  ○ Runtime query rewriting
  ○ Integrating with genetic algorithms
  ○ …
Get Involved!
● General
  ○ http://tajo.apache.org
● Getting Started
  ○ http://tajo.apache.org/docs/current/getting_started.html
● Downloads
  ○ http://tajo.apache.org/downloads.html
● Jira – Issue Tracker
  ○ https://issues.apache.org/jira/browse/TAJO
● Join the mailing list
  ○ [email protected]
  ○ [email protected]
Thanks!