Query Execution Optimizing Performance

Post on 19-Dec-2015


Page 1: Query Execution Optimizing Performance. Resolving an SQL query Since our SQL queries are very high level, the query processor must do a lot of additional

Query Execution: Optimizing Performance

Resolving an SQL query

• Since our SQL queries are very high level, the query processor must do a lot of additional processing to supply all of the missing details.
• In practice, an SQL query is translated internally into a relational algebra expression.
• One advantage of using relational algebra is that it makes alternative forms of a query easier to explore.
• The different algebraic expressions for a query are called logical query plans.
• We will focus first on the methods for executing the operations of the relational algebra.
• Then we will focus on how to transform logical query plans.

Preview

• Parsing: read SQL, output a relational algebra tree.
• Query rewrite: transform the tree into a form that is more efficient to evaluate.
• Physical plan generation: select an implementation for each operator in the tree, and for passing results up the tree.
• In this chapter we will focus on the implementation of each operator.

Relational Algebra recap

• RA union, intersection and difference correspond to UNION, INTERSECT, and EXCEPT in SQL.
• Selection corresponds to the WHERE clause in SQL.
• Projection corresponds to the SELECT clause.
• Product corresponds to the FROM clause.
• Joins correspond to JOIN, NATURAL JOIN, and OUTER JOIN in the SQL2 standard.
• Duplicate elimination corresponds to DISTINCT in the SELECT clause.
• Grouping corresponds to GROUP BY.
• Sorting corresponds to ORDER BY.


A graphical picture

• A query plan is a decomposition of the original SQL into an expression tree.

• What is the initial plan?

• Is it the best?

Initial plan (expression tree, root first):

π title, birthdate
  σ year = 1996 AND gender = 'F' AND starName = name
    ×
      MovieStar
      StarsIn

The SQL query it comes from:

SELECT title, birthdate
FROM MovieStar, StarsIn
WHERE year = 1996 AND gender = 'F'
  AND starName = name;

A better plan

π title, birthdate
  ⋈ starName = name
    σ gender = 'F'
      MovieStar
    σ year = 1996
      StarsIn
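To make the difference between the two plans concrete, here is a minimal Python sketch that evaluates both over in-memory lists. The sample rows and the function names `initial_plan` and `better_plan` are made up for illustration; only the attribute names come from the query on the slide. Both plans should return the same answer, but the second one considers fewer tuple pairs.

```python
# Hypothetical sample data; only the attribute names come from the slide.
movie_star = [
    {"name": "Alice", "gender": "F", "birthdate": "1970-01-01"},
    {"name": "Bob", "gender": "M", "birthdate": "1965-05-05"},
    {"name": "Carol", "gender": "F", "birthdate": "1980-03-03"},
]
stars_in = [
    {"title": "Movie A", "year": 1996, "starName": "Alice"},
    {"title": "Movie B", "year": 1996, "starName": "Bob"},
    {"title": "Movie C", "year": 1995, "starName": "Carol"},
]


def initial_plan(ms, si):
    """Product first, then one big selection, then projection."""
    product = [(m, s) for m in ms for s in si]
    selected = [
        (m, s) for m, s in product
        if s["year"] == 1996 and m["gender"] == "F" and s["starName"] == m["name"]
    ]
    # Also report how many pairs the plan had to materialize.
    return [(s["title"], m["birthdate"]) for m, s in selected], len(product)


def better_plan(ms, si):
    """Push each selection down to its own relation, then join."""
    females = [m for m in ms if m["gender"] == "F"]
    in_1996 = [s for s in si if s["year"] == 1996]
    pairs = [(m, s) for m in females for s in in_1996]
    joined = [(m, s) for m, s in pairs if s["starName"] == m["name"]]
    return [(s["title"], m["birthdate"]) for m, s in joined], len(pairs)


result_a, work_a = initial_plan(movie_star, stars_in)
result_b, work_b = better_plan(movie_star, stars_in)
print(result_a == result_b, work_a, work_b)
```

The pair counts are only a rough stand-in for cost; a real optimizer would estimate disk I/Os instead.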

Relational algebra for real SQL

• Keep in mind the following fact:
- A relation in algebra is a set, while a relation in SQL is in general a bag.
- In short, a bag allows duplicates.
- Not surprisingly, this can affect the cost of related operations.

Bag union, intersection, and difference

• Card(t, R) means the number of occurrences of tuple t in relation R.
• Card(t, R ∪ S) = Card(t, R) + Card(t, S)
• Card(t, R ∩ S) = min{Card(t, R), Card(t, S)}
• Card(t, R – S) = max{Card(t, R) – Card(t, S), 0}
• Example: R = {A,B,B}, S = {C,A,B,C}
• R ∪ S = {A,A,B,B,B,C,C}
• R ∩ S = {A,B}
• R – S = {B}
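The three Card rules above can be checked with Python's `collections.Counter`, whose `+`, `&` and `-` operators happen to follow the same definitions (sum of counts, minimum of counts, difference floored at zero). This is only a sketch for checking the example, not how a DBMS stores bags.

```python
from collections import Counter

# The example bags from the slide.
R = Counter(["A", "B", "B"])
S = Counter(["C", "A", "B", "C"])

bag_union = R + S          # counts are added
bag_intersection = R & S   # minimum of the two counts
bag_difference = R - S     # counts are subtracted; non-positive results are dropped

print(sorted(bag_union.elements()))
print(sorted(bag_intersection.elements()))
print(sorted(bag_difference.elements()))
```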

Physical operators

• Physical query plans are built from physical operators.
- In most cases, the physical operators are direct implementations of the relational algebra operators.
• However, there are several other physical operators for various supporting tasks.
1. Table-scan (the most basic operation we want to perform in a physical query plan)
2. Index-scan (e.g. if we have an index on some relation R, we can retrieve the blocks of R by using the index)
3. Sort-scan (takes a relation R and a specification of the attributes on which the sort is to be made, and produces R in sorted order)

Computational Model

• When comparing algorithms for the same operations we do not consider the cost of writing the output.
- The cost of writing the output on the disk depends on the size of the result, not on the way the result was computed. In other words, it is the same for any computational alternative.
- Also, we can often pipeline the result to other operators when the result is constructed in main memory, so a final output phase may not even be required.

Iterators for Implementation of Physical Operators

• An iterator is a group of three functions that allow a consumer of the result of a physical operation to get the result one tuple at a time.
• An iterator consists of three parts:
- Open: initializes data structures. Doesn’t return tuples.
- GetNext: returns the next tuple and adjusts the data structures.
- Close: cleans up afterwards.
• We assume these to be overloaded names of methods.

Iterator for table scan operator

Open(R) {
    b := the first block of R;
    t := the first tuple of block b;
    Found := TRUE;
}

GetNext(R) {
    IF (t is past the last tuple on block b) {
        increment b to the next block;
        IF (there is no next block) {
            Found := FALSE;
            RETURN;
        }
        ELSE /* b is a new block */
            t := first tuple on block b;
    }
    oldt := t; /* Now we are ready to return t and increment */
    increment t to the next tuple of b;
    RETURN oldt;
}

Close(R) {}
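The table-scan pseudocode translates fairly directly into a runnable Python sketch. It assumes the relation is a list of blocks, each block a list of tuples. The class name `TableScan` and the `found` attribute (standing in for the global Found flag) are illustrative choices, and unlike the pseudocode this version also skips empty blocks.

```python
class TableScan:
    """Iterator over a relation stored as a list of blocks (each a list of tuples)."""

    def __init__(self, blocks):
        self.blocks = blocks

    def open(self):
        self.b = 0          # index of the current block
        self.t = 0          # index of the current tuple within block b
        self.found = True

    def get_next(self):
        # Move to the next block while the current one is exhausted (or empty).
        while self.b < len(self.blocks) and self.t >= len(self.blocks[self.b]):
            self.b += 1
            self.t = 0
        if self.b >= len(self.blocks):
            self.found = False   # no next block: the relation is exhausted
            return None
        old_t = self.blocks[self.b][self.t]
        self.t += 1              # advance past the tuple we are about to return
        return old_t

    def close(self):
        pass                     # nothing to clean up for an in-memory list


scan = TableScan([[(1, "a"), (2, "b")], [], [(3, "c")]])
scan.open()
rows = []
while True:
    row = scan.get_next()
    if not scan.found:
        break
    rows.append(row)
scan.close()
print(rows)
```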

Iterator for Bag Union of R and S

Open(R,S) {
    R.Open();
    CurRel := R;
}

GetNext(R,S) {
    IF (CurRel = R) {
        t := R.GetNext();
        IF (Found) /* R is not exhausted */
            RETURN t;
        ELSE /* R is exhausted */ {
            S.Open();
            CurRel := S;
        }
    }
    /* Here we read from S */
    RETURN S.GetNext();
    /* If S is exhausted, Found will be set to FALSE by S.GetNext */
}

Close(R,S) {
    R.Close();
    S.Close();
}
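The bag-union iterator can be sketched in Python in the same way. `ListScan` is a hypothetical stand-in for a child iterator such as table-scan, `drain` is a small helper for consuming any iterator, and each object carries its own `found` flag instead of the shared Found variable of the pseudocode.

```python
class ListScan:
    """Minimal child iterator over an in-memory list (stand-in for table-scan)."""

    def __init__(self, tuples):
        self.tuples = tuples

    def open(self):
        self.pos = 0
        self.found = True

    def get_next(self):
        if self.pos >= len(self.tuples):
            self.found = False
            return None
        t = self.tuples[self.pos]
        self.pos += 1
        return t

    def close(self):
        pass


class BagUnion:
    """Returns all tuples of R, then all tuples of S; duplicates are kept."""

    def __init__(self, r, s):
        self.r, self.s = r, s

    def open(self):
        self.r.open()
        self.cur = self.r
        self.found = True

    def get_next(self):
        if self.cur is self.r:
            t = self.r.get_next()
            if self.r.found:          # R is not exhausted
                return t
            self.s.open()             # R is exhausted: switch to S
            self.cur = self.s
        t = self.s.get_next()         # here we read from S
        self.found = self.s.found     # False once S is exhausted too
        return t

    def close(self):
        self.r.close()
        self.s.close()


def drain(it):
    """Open an iterator, collect everything it returns, and close it."""
    it.open()
    out = []
    while True:
        t = it.get_next()
        if not it.found:
            break
        out.append(t)
    it.close()
    return out


union_result = drain(BagUnion(ListScan(["A", "B", "B"]), ListScan(["C", "A", "B", "C"])))
print(union_result)
```

Note that only one tuple is held at a time, which is why bag union needs no memory proportional to either relation.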

Iterator for sort-scan

• In an iterator for sort-scan:
– Open has to do all of 2PMMS (two-phase, multiway merge sort) except the merging.
– GetNext outputs the next tuple from the merging phase.
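A rough Python sketch of this split of work follows. It keeps the sorted runs in memory rather than on disk, and `run_size` is an assumed parameter standing in for the number of tuples that fit into the M buffers. The class name `SortScan` is illustrative. The merge itself is delegated to the standard-library `heapq.merge`, which is lazy, so no merging happens until `get_next` is called.

```python
import heapq


class SortScan:
    """Sort-scan sketch: open() builds sorted runs, get_next() merges them lazily."""

    def __init__(self, tuples, key, run_size):
        self.tuples = tuples        # stand-in for relation R on disk
        self.key = key              # function extracting the sort attributes
        self.run_size = run_size    # assumed: tuples that fit in main memory at once

    def open(self):
        # Phase 1 of 2PMMS: sort memory-sized chunks into runs.
        self.runs = [
            sorted(self.tuples[i:i + self.run_size], key=self.key)
            for i in range(0, len(self.tuples), self.run_size)
        ]
        # Set up the merge, but do not consume anything from it yet.
        self.merged = heapq.merge(*self.runs, key=self.key)
        self.found = True

    def get_next(self):
        # Phase 2 of 2PMMS: each call produces one tuple of the merge.
        try:
            return next(self.merged)
        except StopIteration:
            self.found = False
            return None

    def close(self):
        self.runs = []
        self.merged = iter(())


scan = SortScan([5, 3, 8, 1, 9, 2, 7], key=lambda x: x, run_size=3)
scan.open()
sorted_rows = []
while True:
    row = scan.get_next()
    if not scan.found:
        break
    sorted_rows.append(row)
scan.close()
print(sorted_rows)
```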

Algorithms for implementing RA-operators

• Classification of algorithms
– Sort-based methods
– Hash-based methods
– Index-based methods
• Degree of difficulty of algorithms
– One pass (when one relation can fit into main memory)
– Two pass (when no relation can fit in main memory, but the relations are not extremely large)
– Multi pass (when the relations are extremely large)
• Classification of operators
– Tuple-at-a-time, unary operations (σ, π)
– Full-relation, unary operations (δ, γ)
– Full-relation, binary operations (union, join, …)

Cost parameters

• In order to evaluate or estimate the cost of query resolution, we need to identify the relevant parameters.
• Typical cost parameters include:
- R = the relation on disk
- M = number of main memory buffers available (1 buffer = 1 block)
- B(R) = number of blocks of R
- T(R) = number of tuples of R
- V(R, a) = number of distinct values in column a of R
- V(R, L) = number of distinct tuples in R when restricted to the attributes in L (where L is a list of attributes or columns)
• Simple cost estimates (in disk I/Os):
- Basic scan: B(R)
- 2PMMS: 3B(R)
• Recall that final output is not counted.

One pass, tuple-at-a-time

• Selection σ_C(R) and projection π_L(R)
• Cost = B(R), or T(R) if the relation is not clustered
• Space requirement: M ≥ 1 block
• Principle:
– Read one block (or one tuple if the relation is not clustered) at a time.
– Filter in or out the tuples of this block.

One pass, unary full-relation operations

• Duplicate elimination: for each tuple decide:
– seen before: ignore
– new: output
• Principle:
– If it is the first time we have seen this tuple, we copy it to the output.
– If we have seen the tuple before, we must not output it.
• We need a main memory hash table to be efficient.
• Requirement: B(δ(R)) ≤ M
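The duplicate-elimination principle above fits in a few lines of Python, using a `set` as the main memory hash table. The function name `distinct` is illustrative, and tuples are assumed to be hashable. The set holds one copy of every distinct tuple, which is where the B(δ(R)) ≤ M requirement comes from.

```python
def distinct(tuples):
    """One-pass duplicate elimination: yield each tuple the first time it is seen."""
    seen = set()              # main memory hash table of tuples seen so far
    for t in tuples:
        if t not in seen:     # first time we see t: remember it and output it
            seen.add(t)
            yield t
        # otherwise t was seen before and is ignored


deduped = list(distinct([("A", 1), ("B", 2), ("A", 1), ("C", 3), ("B", 2)]))
print(deduped)
```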

One pass, unary full-relation operations

Grouping: accumulate the information on groups in main memory.

• We need to keep one entry for each value of the grouping attributes, through a main memory search structure (hash table).
• Then, for each group we need to keep an aggregated value (or values, if the query asks for more than one aggregation).
– For MIN/MAX we keep the min or max value seen so far for the group.
– For COUNT we keep a counter which is incremented each time we encounter a tuple belonging to the group.
– For SUM we add the value if the tuple belongs to the group.
– For AVG?

MM requirement:
• Typically, a (group) tuple will be smaller than a tuple of the input relation.
• Typically, the number of groups will be smaller than the number of tuples in the input relation. The number of groups is the number of distinct tuples in π_L(R), so the entries occupy about B(δ(π_L(R))) blocks, where L is the list of grouping attributes.
• How would you do an iterator for grouping?
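One possible answer to the iterator question, sketched in Python: because grouping is a full-relation operation, `open` has to consume the whole input before `get_next` can return the first group. For AVG, a common approach is to keep both SUM and COUNT and divide at the end. The class name `GroupBy` and the `key`/`value` extractor functions are assumptions made for this example.

```python
class GroupBy:
    """One-pass grouping iterator that keeps MIN, MAX, COUNT and SUM per group."""

    def __init__(self, tuples, key, value):
        self.tuples = tuples
        self.key = key        # extracts the grouping attributes from a tuple
        self.value = value    # extracts the attribute being aggregated

    def open(self):
        # All the work happens here: read every input tuple once.
        self.groups = {}      # main memory hash table: one entry per group
        for t in self.tuples:
            k, v = self.key(t), self.value(t)
            g = self.groups.get(k)
            if g is None:
                self.groups[k] = {"min": v, "max": v, "count": 1, "sum": v}
            else:
                g["min"] = min(g["min"], v)
                g["max"] = max(g["max"], v)
                g["count"] += 1
                g["sum"] += v
        self.pending = iter(self.groups.items())
        self.found = True

    def get_next(self):
        try:
            k, g = next(self.pending)
        except StopIteration:
            self.found = False
            return None
        avg = g["sum"] / g["count"]   # AVG is derived from SUM and COUNT
        return (k, g["min"], g["max"], g["count"], g["sum"], avg)

    def close(self):
        self.groups = {}


gb = GroupBy([("a", 1), ("b", 10), ("a", 3)], key=lambda t: t[0], value=lambda t: t[1])
gb.open()
group_rows = []
while True:
    row = gb.get_next()
    if not gb.found:
        break
    group_rows.append(row)
gb.close()
print(group_rows)
```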

One pass, binary operators

• Requirement: min(B(R), B(S)) ≤ M
• Exception: bag union
• Cost: B(R) + B(S)
• Assume R is larger than S.

How to perform the operations below:
– Set union, set intersection, set difference
– Bag intersection, bag difference
– Cartesian product, natural join

• All these operators require reading the smaller of the relations into main memory and building a search structure there (e.g. a main memory hash table) for easy search and insertion.

Set Union

• Let R and S be sets.
• We read S into M-1 buffers of main memory.
• All these tuples are also copied to the output.
• We then read each block of R into the M-th buffer, one at a time.
• For each tuple t of R we see if t is in S, and if not, we copy t to output.

Set Intersection

• Let R and S be sets or bags.
• The result will be a set.
• We read S into M-1 buffers of main memory.
• We then read each block of R into the M-th buffer, one at a time.
• For each tuple t of R we see if t is in S, and if so, we copy t to output. At the same time we delete t from S in main memory, so that t is not output again.

Set Difference

• Let R and S be sets.
• Since difference is not a commutative operator, we must distinguish between R-S and S-R.
• Read S into M-1 buffers of main memory.
• Then read each block of R into the M-th buffer, one at a time.
• To compute R-S:
– For each tuple t of R we see if t is in S, and if it is not, we copy t to output.
• To compute S-R:
– For each tuple t of R we see if t is in S, and if so, we delete t from S in main memory. At the end we output those tuples of S that remain.

Bag Intersection

• Let R and S be bags.
• Read S into M-1 buffers of main memory.
• Also, associate with each tuple a count, which initially measures the number of times the tuple occurs in S.
• Then read each block of R into the M-th buffer, one at a time.
• For each tuple t of R we see if t is in S. If not, we ignore it.
• Otherwise, if the counter is more than zero, we output t and decrement the counter.

Bag Difference

• We read S into M-1 buffers of main memory.
• Also, we associate with each tuple a count, which initially measures the number of times the tuple occurs in S.
• We then read each block of R into the M-th buffer, one at a time.
• To compute S-R:
– For each tuple t of R we see if t is in S, and if so (and its counter is still positive), we decrement its counter.
– At the end we output each tuple of S whose counter is still positive, as many times as the counter says.
• To compute R-S:
– We may think of the counter c for tuple t as having c reasons not to output t.
– When we process a tuple t of R, we check to see if that tuple appears in S. If not, we output t.
– Otherwise, we check the counter c of t. If it is 0, we output t.
– If not, we don’t output t, and we decrement c.
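The two counter-based procedures can be sketched in Python with `collections.Counter` playing the role of the in-memory copy of S with its counts. The function names are made up for this example, and the inputs reuse the bags R = {A,B,B} and S = {C,A,B,C} from the earlier slide.

```python
from collections import Counter


def bag_difference_s_minus_r(r_tuples, s_tuples):
    """S - R: each tuple of R cancels one matching copy held in memory."""
    counts = Counter(s_tuples)        # S in main memory, with occurrence counts
    for t in r_tuples:
        if counts[t] > 0:             # Counter returns 0 for tuples not in S
            counts[t] -= 1
    # Output every tuple as many times as its remaining (positive) count.
    return list(counts.elements())


def bag_difference_r_minus_s(r_tuples, s_tuples):
    """R - S: a count c for t means 'c reasons not to output t'."""
    counts = Counter(s_tuples)
    out = []
    for t in r_tuples:
        if counts[t] == 0:            # t not in S, or all its reasons are used up
            out.append(t)
        else:
            counts[t] -= 1            # use up one reason and suppress this copy
    return out


R = ["A", "B", "B"]
S = ["C", "A", "B", "C"]
print(bag_difference_r_minus_s(R, S))
print(bag_difference_s_minus_r(R, S))
```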

Product

• We read S into M-1 buffers of main memory. No special structure is needed.
• We then read each block of R into the M-th buffer, one at a time, and combine each tuple with all the tuples of S.

Natural Join

• We read S into M-1 buffers of main memory and build a search structure where the search key is the shared attributes of R and S.
• We then read each block of R into the M-th buffer, one at a time. For each tuple t of R we use the search structure to find the tuples of S that agree with t on the shared attributes, and for each such tuple we copy the joined tuple to output.
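A minimal Python sketch of this one-pass join, assuming tuples are represented as dictionaries and that the shared attributes are simply the keys the two relations have in common. The function name and the sample rows are hypothetical. The dictionary `index` plays the role of the search structure built on S.

```python
from collections import defaultdict


def one_pass_natural_join(r_tuples, s_tuples):
    """Natural join with the smaller relation S held in a main memory hash table."""
    if not r_tuples or not s_tuples:
        return []
    # Shared attributes: the column names that appear in both relations.
    shared = sorted(set(r_tuples[0]) & set(s_tuples[0]))

    # Build phase: hash every tuple of S on its shared-attribute values.
    index = defaultdict(list)
    for s in s_tuples:
        index[tuple(s[a] for a in shared)].append(s)

    # Probe phase: read R one tuple at a time and look up matching S tuples.
    out = []
    for r in r_tuples:
        for s in index.get(tuple(r[a] for a in shared), []):
            out.append({**s, **r})    # joined tuple: attributes of both
    return out


# Hypothetical sample data sharing the attribute "name".
r_rows = [
    {"title": "Movie A", "name": "Alice"},
    {"title": "Movie B", "name": "Dan"},
    {"title": "Movie C", "name": "Alice"},
]
s_rows = [{"name": "Alice", "birthdate": "1970-01-01"}]
joined_rows = one_pass_natural_join(r_rows, s_rows)
print(joined_rows)
```

Each block of R is read once and S is read once, which matches the B(R) + B(S) cost stated for one-pass binary operators.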