Query Optimization in Apache Tajo

Jihoon Son / Gruter Inc.

About Me

● Jihoon Son (@jihoonson)
  ○ Tajo project co-founder
  ○ Committer and PMC member of Apache Tajo
  ○ Research engineer at Gruter

Outline

● Introduction to Tajo
● Query processing in Tajo
  ○ Query plans in Tajo
  ○ Query processing example
● Query optimization in Tajo
  ○ Introduction to query optimization
  ○ Query optimization techniques in Tajo

What is Tajo?

● Apache Top-level Project
  ○ Data warehouse system
    ■ Efficient processing of analytic queries
    ■ ANSI-SQL compliant
  ○ Scalable and rapid query execution with its own engine
    ■ Distributed query processing
    ■ Fault tolerance
  ○ Beyond SQL-on-Hadoop
    ■ Support for various types of storage
      ● HDFS, S3, HBase, RDBMS, ...

Highlighted Features

● Support long-running batch queries as well as interactive ad-hoc queries
  ○ Fast query processing
    ■ Optimized scan performance
      ● 120 MB/sec per physical disk (SATA)
  ○ Reliability
    ■ Fault tolerance
    ■ No single point of failure with HA support

Highlighted Features

● Support of various kinds of data sources
  ○ HDFS, Amazon S3, Google Cloud Storage, HBase, RDBMS, ...
● Mature SQL support
  ○ Various kinds of join support
  ○ Window function support
  ○ Cost-based query optimization
● Integration with other systems
  ○ Notebooks like Zeppelin
  ○ BI tools

Recent Release: 0.11

● Feature highlights
  ○ Query federation
  ○ JDBC-based storage support
  ○ Self-describing data formats support
  ○ Multi-query support
  ○ More stable and efficient join execution
  ○ Index support
  ○ Python UDF/UDAF support

Architecture Overview

[Figure: Tajo architecture. Clients (JDBC client, TSQL, Web UI, REST API) submit a query to a Tajo Master. Each Tajo Master contains a Catalog Server and manages metadata, which can be kept in a DBMS or HCatalog. The Tajo Master allocates a query to a Query Master, which sends tasks to the workers and monitors them. Each Tajo Worker contains a Query Master, a Query Executor, and a Storage Service on top of the underlying storage.]

Query Execution Steps

[Figure: Tajo Client, Tajo Master (with Catalog Server and DBMS), Tajo Workers (Query Master, Query Executor, Storage Service), and Storage, annotated with the steps below.]

● Initializing a query execution
  ○ ① Submit a query
  ○ ② Assign a query
  ○ ③ Build a query execution plan
● Executing a query
  ○ ④ Send tasks & monitor
  ○ ⑤ Read and process data
  ○ ⑥ Send status and progress
● Finalizing the query execution
  ○ ⑦ Store the result on storage
  ○ ⑧ Notify that query execution is completed
  ○ ⑨ Send the result location
  ○ ⑩ Read the result

Query Processing in Tajo


Query Execution Plan

● Given a user query, a query execution plan is an ordered set of steps to execute the query
  ○ Example
    ■ Read data from storage, then join on the join keys, and finally aggregate on the aggregation keys
● In Tajo, there are three kinds of query plans
  ○ The query master generates a logical query plan and a distributed query plan
  ○ The query executor of each Tajo worker generates a local query plan

Query Planning Steps in Tajo

● Query Master
  ○ SQL → SQL Analyzer → Algebraic Expression → Logical Planner → Logical Query Plan → Global Planner → Distributed Query Plan
● The distributed query plan is distributed to Tajo workers
● Query Executor
  ○ Distributed Query Plan → Physical Planner → Local Query Plan

Logical Query Plan

● A tree of relational algebra operators
● Example

< SQL >

    SELECT item.brand, sum(price)
    FROM sales, item
    WHERE sales.item_key = item.item_key
    GROUP BY item.brand;

< Logical query plan >

    Group by (key: brand, func: sum(price))
      Join (key: item_key)
        Scan on item
        Scan on sales
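As a rough illustration of "a tree of relational algebra operators", the example plan can be written as a small tree structure. This is a minimal sketch, not Tajo's planner code: the `PlanNode` class and the `render` helper are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    """One relational operator in a logical plan tree (hypothetical class)."""
    op: str
    detail: str = ""
    children: List["PlanNode"] = field(default_factory=list)


def render(node: PlanNode, depth: int = 0) -> str:
    """Return the tree as indented text, root first."""
    label = f"{node.op} ({node.detail})" if node.detail else node.op
    lines = ["  " * depth + label]
    for child in node.children:
        lines.append(render(child, depth + 1))
    return "\n".join(lines)


# Logical plan of the example query: the scans feed the join,
# and the join feeds the group by.
plan = PlanNode("Group by", "key: brand, func: sum(price)", [
    PlanNode("Join", "key: item_key", [
        PlanNode("Scan", "item"),
        PlanNode("Scan", "sales"),
    ]),
])

print(render(plan))
```

Data flows from the leaves (scans) up to the root (group by), which is why the root is the last operator to run.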

Distributed Query Plan

● A plan with additional annotations for distributed execution
  ○ Data exchange (shuffle) keys, methods, ...

< Logical query plan >

    Group by
      Join (key: item_key)
        Scan on item
        Scan on sales

< Distributed query plan >

    Group by
      [Range shuffle with brand]
      Join (key: item_key)
        [Hash shuffle with item_key]
        Scan on item
        Scan on sales

Local Query Plan

● A plan with additional annotations for local execution
  ○ In-memory algorithm, disk-based algorithm, …

< Distributed query plan >

    Group by
      [Range shuffle with brand]
      Join (key: item_key)
        [Hash shuffle with item_key]
        Scan on item
        Scan on sales

< Local query plan >

    Group by: Hash aggregation
      [Range shuffle with brand]
      Join (key: item_key): Sort-merge join
        [Hash shuffle with item_key]
        Scan on item
        Scan on sales

Query Processing in Tajo

● A query is executed by running multiple stages one after another
  ○ A stage is the minimum unit of execution and contains at least one operator
● Each stage is processed in parallel by the query executors of multiple Tajo workers

    Stage 2: Join (key: item_key)
    Stage 1: Scan on item, Scan on sales

Query Processing Example

● SQL

    SELECT item.brand, sum(price)
    FROM sales, item
    WHERE sales.item_key = item.item_key
    GROUP BY item.brand;

● Logical query plan

    Group by
      Join (key: item_key)
        Scan on item
        Scan on sales

● Distributed query plan

    Stage 3: Group by
    Stage 2: Join (key: item_key)
    Stage 1: Scan on item, Scan on sales

● Distributed processing
  ○ Stage 1: workers scan the splits of item and sales in parallel
  ○ Shuffle
  ○ Stage 2: workers join the shuffled data in parallel
  ○ Shuffle
  ○ Stage 3: workers run the group by in parallel
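The three stages of this example can be simulated in a few lines. This is only a toy sketch under stated assumptions: the rows, the number of workers, and the `shuffle` helper are made up, and both shuffles use hashing for simplicity even though the slides annotate the second one as a range shuffle.

```python
from collections import defaultdict

NUM_WORKERS = 2  # assumed cluster size for this toy example

# Made-up input splits: item(item_key, brand) and sales(item_key, price)
item_splits = [[(1, "A"), (2, "B")], [(3, "A")]]
sales_splits = [[(1, 10), (2, 5)], [(3, 7), (1, 3)], [(2, 1)]]


def shuffle(rows, key_fn):
    """Route each row to a worker by hashing its key."""
    parts = [[] for _ in range(NUM_WORKERS)]
    for row in rows:
        parts[hash(key_fn(row)) % NUM_WORKERS].append(row)
    return parts


# Stage 1: every split is scanned, then the rows are shuffled on item_key.
items = shuffle([r for s in item_splits for r in s], lambda r: r[0])
sales = shuffle([r for s in sales_splits for r in s], lambda r: r[0])

# Stage 2: each worker joins only the partitions it received.
joined = []
for w in range(NUM_WORKERS):
    brand_of = dict(items[w])
    joined += [(brand_of[k], price) for k, price in sales[w] if k in brand_of]

# Stage 3: shuffle on brand, then aggregate per worker.
result = {}
for part in shuffle(joined, lambda r: r[0]):
    sums = defaultdict(int)
    for brand, price in part:
        sums[brand] += price
    result.update(sums)

print(sorted(result.items()))
```

Because rows with the same key always land on the same worker, each worker can join or aggregate its partition without talking to the others.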

Query Optimization in Tajo


Query Optimization

● User queries are usually not optimized for performance
● The query optimizer attempts to determine the most efficient way to execute a user query
  ○ It considers the possible query plans and chooses the best one

Extreme Example

● Query
  ○ select * from t where name like 'tajo%' order by id;
● Possible plans
  ○ Naive plan: Scan → Sort → Filter
    ■ Filtering out tuples after sort
    ■ Large cost for sort
  ○ Better plan: Scan with Filter → Sort
    ■ Filtering out tuples immediately after scan
    ■ Small cost for sort
    ■ Reduced number of operations
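The difference between the two plans can be checked on a tiny in-memory table. This is a minimal sketch with made-up rows; the function names are hypothetical, and "cost" is approximated simply by the number of rows handed to the sort.

```python
rows = [
    {"id": 3, "name": "tajo-worker"},
    {"id": 1, "name": "hive"},
    {"id": 2, "name": "tajo-master"},
    {"id": 4, "name": "spark"},
]


def matches(row):
    # Equivalent of: name LIKE 'tajo%'
    return row["name"].startswith("tajo")


def naive_plan(table):
    """Scan -> Sort -> Filter: every row is sorted."""
    ordered = sorted(table, key=lambda r: r["id"])
    return [r for r in ordered if matches(r)], len(table)


def better_plan(table):
    """Scan with Filter -> Sort: only matching rows are sorted."""
    kept = [r for r in table if matches(r)]
    return sorted(kept, key=lambda r: r["id"]), len(kept)


naive_result, naive_sorted = naive_plan(rows)
better_result, better_sorted = better_plan(rows)
print(naive_result == better_result, naive_sorted, better_sorted)
```

Both plans return the same rows in the same order; only the amount of data that goes through the sort differs.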

Two Kinds of Query Optimization

● Rule-based optimization
  ○ A set of predefined rules is used to choose a good plan
  ○ Usually, heuristic approaches are used
    ■ Example: filters should be pushed down to the lower part of the query plan as much as possible
● Cost-based optimization
  ○ Enumerating possible query plans and choosing the one with the lowest cost
  ○ The cost function plays an important role
● Tajo utilizes both types of optimization


● Difference from traditional query optimization
  ○ Unlike in traditional database systems, pre-collected statistics are less important
    ■ Data may be added or updated by several systems, including Flume, Kafka, Tajo, …
    ■ Pre-collected statistics can be useful, but are not fully trustworthy
  ○ It is important to optimize query plans with minimal statistics
    ■ Volume of input relations


● Tajo has two different approaches for query optimization
  ○ Static optimization
    ■ Traditional approach
    ■ Optimizing the plan during the query planning phase
  ○ Progressive optimization
    ■ Optimizing the plan based on intermediate statistics collected while executing the query
      ● A query plan can be optimized without pre-collected statistics
      ● Especially effective for queries that require multi-stage execution

Logical Query Plan Optimization

● Rule-based optimization
  ○ Access path rewrite rule
    ■ Choosing the access path to data
    ■ Index scan has the highest priority if available
  ○ Distributivity rule
    ■ Reducing filters based on distributivity
  ○ Filter pushdown rule
    ■ Pushing filters down to the lowest possible part of the plan
  ○ In-subquery rewrite rule
    ■ Transforming subqueries in 'IN' filters into semi (or anti) joins
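To make the filter pushdown rule concrete, here is a minimal sketch of such a rewrite on a toy plan tree. It is not Tajo's rule implementation: the `Scan`, `Filter`, and `Join` classes are hypothetical, and each filter is assumed to refer to exactly one table.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Scan:
    table: str


@dataclass
class Filter:
    table: str        # the only table the predicate refers to (assumption)
    condition: str
    child: "Node"


@dataclass
class Join:
    left: "Node"
    right: "Node"


Node = Union[Scan, Filter, Join]


def tables(node: Node) -> set:
    """Names of all tables scanned below this node."""
    if isinstance(node, Scan):
        return {node.table}
    if isinstance(node, Filter):
        return tables(node.child)
    return tables(node.left) | tables(node.right)


def push_down(node: Node) -> Node:
    """Move each single-table filter as close to its scan as possible."""
    if isinstance(node, Scan):
        return node
    if isinstance(node, Join):
        return Join(push_down(node.left), push_down(node.right))
    child = push_down(node.child)
    if isinstance(child, Join):
        # Re-attach the filter on the join input that scans its table.
        pushed = Filter(node.table, node.condition, None)
        if node.table in tables(child.left):
            pushed.child = child.left
            return Join(push_down(pushed), child.right)
        pushed.child = child.right
        return Join(child.left, push_down(pushed))
    return Filter(node.table, node.condition, child)


plan = Filter("item", "brand = 'A'", Join(Scan("sales"), Scan("item")))
print(push_down(plan))
```

After the rewrite the filter sits directly above the scan on item, so the join receives fewer rows.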

Logical Query Plan Optimization

● Rule-based optimization (cont'd)
  ○ Projection pushdown rule
    ■ Pushing projections down to the lowest possible part of the plan
● Cost-based optimization
  ○ Join order optimization
    ■ Finding the join order with the lowest cost
    ■ Greedy heuristic: ordering relations from small ones to large ones
      ● Very effective in a single-node environment
      ● Needs improvement for parallel computing environments
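The greedy heuristic of ordering relations from small to large can be sketched as follows. This is a simplification under stated assumptions: the relation names and sizes are invented, `greedy_join_order` is a hypothetical helper, and join conditions are ignored, which a real optimizer cannot do.

```python
def greedy_join_order(relation_sizes):
    """Order relations from smallest to largest estimated size (bytes)."""
    return sorted(relation_sizes, key=relation_sizes.get)


# Hypothetical volume estimates, in bytes.
sizes = {
    "lineitem": 750_000_000,
    "nation": 2_000,
    "orders": 170_000_000,
    "region": 400,
}

order = greedy_join_order(sizes)
print(" JOIN ".join(order))
```

Joining the small relations first tends to keep intermediate results small, which is the intuition behind the heuristic.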

Distributed Query Plan Optimization

● Rule-based optimization
  ○ Two-phase execution of operators
    ■ Operators that require data shuffling, such as aggregation, join, or sort, are executed in two phases
    ■ The first phase computes locally to reduce the amount of shuffled data
    ■ The second phase produces the final result of the operation

Two-phase Execution Example

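● Logical query plan

[Figure: the logical plan Scan → Group by → Sort is split into Stage 1 (Scan), Stage 2 (Group by), and Stage 3 (Sort). With two-phase execution, a local group by and a local sort are added before the corresponding shuffles.]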
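Two-phase aggregation can be sketched with partial sums. This is a minimal sketch with made-up rows, not Tajo code; `local_group_by` and `final_group_by` are hypothetical names, and the shuffle itself is not simulated, only the number of rows that would have to cross it.

```python
from collections import Counter

# Rows held by two workers before any shuffle: (brand, price). Made-up data.
worker_rows = [
    [("A", 10), ("A", 3), ("B", 5)],
    [("A", 7), ("B", 1), ("B", 2)],
]


def local_group_by(rows):
    """First phase: partial sums computed where the data already lives."""
    partial = Counter()
    for brand, price in rows:
        partial[brand] += price
    return partial


def final_group_by(partials):
    """Second phase: merge the partial sums received through the shuffle."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds the values per key
    return dict(total)


partials = [local_group_by(rows) for rows in worker_rows]
shuffled_without_first_phase = sum(len(rows) for rows in worker_rows)
shuffled_with_first_phase = sum(len(p) for p in partials)

result = final_group_by(partials)
print(result, shuffled_without_first_phase, shuffled_with_first_phase)
```

The saving grows with the number of rows per group: the first phase sends at most one row per group and worker, no matter how many input rows there are.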

Distributed Query Plan Optimization

● Distributed join algorithm selection
  ○ Two representative distributed join algorithms
    ■ Join cannot be performed within a single stage in distributed systems
      ● Tuples with the same join key may be distributed over cluster nodes
    ■ Repartition join
      ● Both input relations are shuffled on the join key columns
    ■ Broadcast join
      ● Small relations are broadcast to every node before the join

Example of Repartition Join

● select … from employee e, department d where e.DeptName = d.DeptName


Example of Broadcast Join

● select … from employee e, department d where e.DeptName = d.DeptName
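The two join strategies can be compared on a toy version of the employee/department query. This is a sketch under stated assumptions: the rows, the two-node layout, and the column layout (name, DeptName) and (DeptName, manager) are invented, and `moved` only counts rows sent through the shuffle or broadcast.

```python
NUM_NODES = 2

# Rows already spread over two nodes (made-up data).
employee = [
    [("Ann", "Sales"), ("Bob", "HR"), ("Eve", "Dev")],
    [("Cho", "Sales"), ("Dan", "Dev"), ("Fay", "HR")],
]
department = [[("Sales", "Kim")], [("HR", "Lee"), ("Dev", "Park")]]


def local_join(emps, depts):
    """Hash join on DeptName within one node."""
    manager_of = dict(depts)
    return [(n, d, manager_of[d]) for n, d in emps if d in manager_of]


def repartition_join():
    """Shuffle both relations on DeptName, then join per node."""
    emp_parts = [[] for _ in range(NUM_NODES)]
    dept_parts = [[] for _ in range(NUM_NODES)]
    for rows in employee:
        for row in rows:
            emp_parts[hash(row[1]) % NUM_NODES].append(row)
    for rows in department:
        for row in rows:
            dept_parts[hash(row[0]) % NUM_NODES].append(row)
    moved = sum(map(len, employee)) + sum(map(len, department))
    out = []
    for n in range(NUM_NODES):
        out += local_join(emp_parts[n], dept_parts[n])
    return sorted(out), moved


def broadcast_join():
    """Copy the small relation to every node; employee rows stay in place."""
    all_depts = [row for rows in department for row in rows]
    moved = len(all_depts) * NUM_NODES
    out = [r for rows in employee for r in local_join(rows, all_depts)]
    return sorted(out), moved


rep_rows, rep_moved = repartition_join()
bc_rows, bc_moved = broadcast_join()
print(rep_rows == bc_rows, rep_moved, bc_moved)
```

Both strategies return the same rows. Broadcast join moves less data here because department is small; with a large department relation, copying it to every node would cost more than shuffling both inputs once.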


Distributed Join Algorithm Selection

● Repartition join vs. broadcast join
  ○ Given a set of joins, some parts can be executed with broadcast join while the remaining parts are executed with repartition join
● Which parts will be executed with broadcast join?
  ○ Greedy heuristic: broadcast join is used as much as possible
    ■ The size of each input relation should be smaller than a pre-defined threshold
    ■ The total volume of broadcast relations should not exceed a pre-defined threshold
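The two threshold checks can be combined into a small greedy selection. This is a sketch, not Tajo's planner logic: the function name, the smallest-first order, the relation sizes, and the threshold values are all assumptions made for illustration.

```python
def select_broadcast(relation_sizes, per_relation_limit, total_limit):
    """Greedily pick relations to broadcast, smallest first (assumed order)."""
    chosen, total = [], 0
    for name, size in sorted(relation_sizes.items(), key=lambda kv: kv[1]):
        if size >= per_relation_limit:
            break  # every remaining relation is at least this large
        if total + size > total_limit:
            break  # adding it would exceed the total broadcast volume
        chosen.append(name)
        total += size
    return chosen


MB = 1024 * 1024
# Hypothetical relation sizes and thresholds.
sizes = {
    "lineitem": 700_000 * MB,
    "nation": 1 * MB,
    "region": 1 * MB,
    "supplier": 4 * MB,
}
broadcast = select_broadcast(sizes, per_relation_limit=5 * MB, total_limit=5 * MB)
print(broadcast)
```

In this example supplier is small enough on its own, but broadcasting it together with nation and region would exceed the total limit, so it would be joined with a repartition join instead.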

Distributed Join Algorithm Selection Example

● select … from lineitem, nation, region …


Local Query Plan Optimization

● Selecting the best algorithm based on the current resource status
  ○ Aggregation
    ■ Hash aggregation, sort aggregation
  ○ Join
    ■ Hash join, sort-merge join
● For sort, hash sort is used by default, spilling data to disk when it doesn't fit into memory
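One plausible way to select a local algorithm from the resource status is to compare an estimated input size with the available memory. The slides do not state the actual rule, so the functions, the size estimates, and the "fits in memory" criterion below are all assumptions.

```python
def choose_aggregation(estimated_groups_bytes, available_memory_bytes):
    """Assumed rule: hash aggregation only if the hash table should fit in memory."""
    if estimated_groups_bytes <= available_memory_bytes:
        return "hash aggregation"
    return "sort aggregation"


def choose_join(smaller_input_bytes, available_memory_bytes):
    """Assumed rule: hash join only if the smaller input fits in memory."""
    if smaller_input_bytes <= available_memory_bytes:
        return "hash join"
    return "sort-merge join"


MB = 1024 * 1024
print(choose_aggregation(64 * MB, available_memory_bytes=512 * MB))
print(choose_join(2048 * MB, available_memory_bytes=512 * MB))
```

Since each query executor makes this choice for its own task, two workers running the same stage could in principle pick different algorithms.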

Progressive Optimization

● Data repartition
  ○ Some operators, such as join or aggregation, require shuffling data by key
  ○ The number of result partitions of a shuffle should be decided carefully
    ■ The number of partitions is related to the number of tasks of the next stage
● At the beginning of each stage, the number of partitions is decided based on the input size

Progressive Optimization Example

● If the default task size is 1GB,

[Figure: a three-stage plan (Stage 1: Scan on item, Stage 2: Group by, Stage 3: Sort). The 100GB scan input gives # of partitions: 100 and # of tasks: 100. Once the group by is found to produce 50GB, the next shuffle uses # of partitions: 50 and # of tasks: 50.]
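The arithmetic behind the example is a ceiling division of the observed input size by the task size. The sketch below assumes exactly that rule; `num_partitions` is a hypothetical helper, not a Tajo function.

```python
GB = 1024 ** 3


def num_partitions(input_bytes, task_bytes=1 * GB):
    """Partitions needed so that each next-stage task reads about task_bytes."""
    # Ceiling division on integers, with at least one partition.
    return max(1, -(-input_bytes // task_bytes))


# Sizes observed at run time, as in the example above.
after_scan = num_partitions(100 * GB)      # 100 GB scanned from item
after_group_by = num_partitions(50 * GB)   # group by produced 50 GB

print(after_scan, after_group_by)
```

Because the second number is computed from the actual group-by output rather than from a pre-collected estimate, the last stage runs with 50 tasks instead of 100.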

Future Work

● Adding more optimization methods
● Improving cost functions for more effective cost-based optimization
● Adding new approaches for progressive optimization
  ○ Runtime query rewriting
  ○ Integrating with genetic algorithms
  ○ …

Get Involved!

● General
  ○ http://tajo.apache.org
● Getting Started
  ○ http://tajo.apache.org/docs/current/getting_started.html
● Downloads
  ○ http://tajo.apache.org/downloads.html
● Jira – Issue Tracker
  ○ https://issues.apache.org/jira/browse/TAJO
● Join the mailing list
  ○ [email protected]
  ○ [email protected]


Thanks!
