basics of database tuning - computing science€¦ · cmpt 454: database ii -- query optimization 5...

Query Optimization

CMPT 454: Database II -- Query Optimization 2

Evaluation of Expressions• Materialization: one operation at a time, materialize intermediate

results for subsequent use– Good for all situations– Sum of costs of individual operations + cost of writing intermediate results

to disk, may be costly• Pipeline: evaluate several operations simultaneously, no need to store

temporary results– May not be applicable in some situations

Double buffering: use two output buffers for each operation, when one is full write it to disk while the other is getting filled –allowing overlap of disk writes with computation and reducing execution time


Pipelining• Demand driven (pulling data up): the system keeps making requests

for tuples from the operation at the top of the pipeline• Producer driven (pushing data up): operations generate tuples eagerly

into output buffers until they are full • Iterators: many operations are active at once, pipelining• Join using pipelining ∏balance

σbalance<2500; use index 1

account

IteratorHasnext()

Next()


Query Optimization Example

• Query: find the names of all customers who have an account at any branch located in Brooklyn

• Relational expression:

Pushing selection down


Steps in Query Optimization• Input: an expression• Generate expressions that are logically equivalent

to the given expression– Using equivalence rules

• Estimate the cost of each evaluation plan– Using statistical information about the relations, e.g.,

size, indices, etc.• Annotate the resultant expressions in alternative

ways to generate alternative query-evaluation plans

• Cost-based optimization: find the query plan of the lowest cost


Equivalence Rules• Two relational algebra expressions are said to be equivalent if on every

legal database instance the two expressions generate the same set of tuples– Two expressions in the multiset version of the relational algebra are said to

be equivalent if on every legal database instance the two expressions generate the same multiset of tuples

– Order of tuples does not matter• An equivalence rule says that expressions of two forms are equivalent

– We can replace one form by the other•••• σθ(E1 X E2) = E1 θ E2• σθ1(E1 θ2 E2) = E1 θ1∧ θ2 E2

))(()(2121

EE θθθθ σσσ =∧))(())((

1221EE θθθθ σσσσ =

)())))((((121

EE LLnLL Π=ΠΠΠ KK


More Equivalence Rules• E1 θ E2 = E2 θ E1

• (E1 E2) E3 = E1 (E2 E3)• (E1 θ1 E2) θ2∧ θ3 E3 = E1 θ2∧ θ3 (E2 θ2 E3)• σθ0(E1 θ E2) = (σθ0(E1)) θ E2

• σθ1∧θ2 (E1 θ E2) = (σθ1(E1)) θ (σθ2 (E2))


More Equivalence Rules••• E1 ∪ E2 = E2 ∪ E1

• E1 ∩ E2 = E2 ∩ E1

• (E1 ∪ E2) ∪ E3 = E1 ∪ (E2 ∪ E3)• (E1 ∩ E2) ∩ E3 = E1 ∩ (E2 ∩ E3)• σθ (E1 – E2) = σθ (E1) – σθ(E2) and similarly for ∪ and ∩ in place of

–

• σθ (E1 – E2) = σθ(E1) – E2 and similarly for ∩ in place of –, but not for ∪

• ΠL(E1 ∪ E2) = (ΠL(E1)) ∪ (ΠL(E2))

))(())(()( 2121 2121 EEEE LLLL ∏∏=∏ ∪ θθ

)))(())((()( 2121 42312121EEEE LLLLLLLL ∪∪∪∪ ∏∏∏=∏ θθ


Example• Find the names of all customers with an account at a Brooklyn branch whose

account balance is over $1000Πcustomer_name((σbranch_city = “Brooklyn” ∧ balance > 1000 (branch (account depositor)))

• Transformation using join associatively (Rule 6a):Πcustomer_name((σbranch_city = “Brooklyn” ∧ balance > 1000 (branch account)) depositor)

• Second form provides an opportunity to apply the “perform selections early”rule, resulting in the subexpression

σbranch_city = “Brooklyn” (branch) σ balance > 1000 (account)• When we compute (σbranch_city = “Brooklyn” (branch) account ), we obtain a

relation whose schema is (branch_name, branch_city, assets, account_number, balance)

• Push projections and eliminate unneeded attributes from intermediate results to getΠcustomer_name ((Πaccount_number ( (σbranch_city = “Brooklyn” (branch) account ))

depositor )– Performing the projection as early as possible reduces the size of the relation to be

joined


Example


Join Order

• Consider the expressionΠcustomer_name ((σbranch_city = “Brooklyn” (branch)) (account

depositor))• Could compute account depositor first, and join result

with σbranch_city = “Brooklyn” (branch), but account depositor is likely to be a large relation

• Only a small fraction of the bank’s customers are likely to have accounts in branches located in Brooklyn– it is better to compute σbranch_city = “Brooklyn” (branch) account

first


Catalog Information• Stored in DBMS catalog• nr: number of tuples in a relation r• br: number of blocks containing tuples of r• lr: size of a tuple of r• fr: blocking factor of r – the number of tuples of r that fit into one block• V(A, r): number of distinct values that appear in r for attribute A; same

as the size of πA(r)• If tuples of r are stored together physically in a file, then• Maintained periodically

– Maintaining exact information is too expensive!• Statistics information in real-world systems

– More information– Histogram

• Attribute age can be divided into ranges of 0-9, 10-19, …, 90-99, 100+• The count of tuples in each range

⎥⎥⎥

⎥

⎤

⎢⎢⎢

⎢

⎡=

rfrn

rb


Obtaining Statistics in SQL Server

CREATE STATISTICS statistics_nameON { table | view } ( column [ ,...n ] )

[ WITH [ [ FULLSCAN

| SAMPLE number { PERCENT | ROWS } ] [ , ] ] [ NORECOMPUTE ]

]

CREATE STATISTICS names ON Customers (CompanyName, ContactName) WITH SAMPLE 5 PERCENT

CREATE STATISTICS anamesON authors (au_lname, au_fname)WITH SAMPLE 50 PERCENT


Selection Size Estimation• Simple case: σA=v(r)• Under uniform distribution: nr / V(A, r)

– nr: number of tuples in a relation r– V(A, r): number of distinct values that appear in r for attribute A

• Uniform distribution often does not hold in real data sets– A reasonable approximation of reality in many cases, keep our

presentation relatively simple– Histogram information can be used

• Estimation of Single Comparison– σA≤V(r)– C = 0 if v < min(A,r)

– C =

– min(A, r) and max(A, r): the minimum and maximum values that appear in rfor attribute A, where nr is number of tuples in a relation r

– If statistics information is unavailable, then estimate c as nr / 2

),min(),max(),min(

rArArAvnr −

−⋅


Estimation of Complex Cases

• Selectivity of a condition θi

– The probability that a tuple in the relation r satisfies θi

– If si is the number of satisfying tuples in r, the selectivity of θi is si /nr

• Conjunction: σθ1∧ θ2∧. . . ∧ θn (r):• Disjunction: σθ1∨ θ2 ∨. . . ∨ θn (r):

• Negation: σ¬θ(r): nr – size(σθ(r))

nr

n

jj

r n

sn

∏=⋅ 1

⎟⎟⎠

⎞⎜⎜⎝

⎛−−⋅ ∏

=

n

j r

jr n

sn

1

)1(1


Join Size Estimation• Cartesian product r x s: (nr * ns) tuples• R ∩ S = ∅: r s is the same as r x s• R ∩ S is a key for R: r s is no greater than the number of

tuples in s– A tuple of s will join with at most one tuple from r

• R ∩ S in S is a foreign key in S referencing R: r s has the same number of tuples as s

• Example– Consider the natural join R S U– Statistics for three relations R(a,b) S(b,c) U(c,d)

T(R) = 1000V(R,b) = 20

T(S) = 2000V(S,b) = 50V(S,c) = 100

T(U) = 5000

V(U,c) = 500


Example (Continued)

• Estimate for (R S) U– T(R S) is T(R)T(S) / max(V(R,b), V(S,b))– T(R S) = 1000 * 2000 / 50 = 40000– Join R S with U– T(R S)T(U) / max(V(R S,c), V(U,c))– V(R S,c) is the same as V(S,c) (=100)– The result is, 40000 * 5000 / max(100, 500), or 400000

• Could start by joining S and U first • The estimate of any natural join is the same,

regardless of how we order the joins


The Case of Non-Key

• Each value appears with equal probability

• All tuples in r produce tuples in r s– Each tuple t in r produces tuples in r s

• All tuples t in s produce tuples in r s

• Choose the smaller one as the estimation–

),( sAVnn sr ⋅

),( rAVnn sr ⋅

),( sAVns

)},(),,(max{ sAVrAVnn sr ⋅


Estimation for Other Operations• Projection: estimated size of πA(r) = V(A,r)• Aggregation: estimated size of AgF(r) = V(A,r)• Outer join

– size of r s = size of r s + size of r– size of r s = size of r s + size of r + size of s

• Inaccurate, only upper bounds on the sizes• Set operations

– Unions/intersections of selections on the same relation: rewrite and use size estimate for selections

• σθ1 (r) ∩ σθ2 (r) can be rewritten as σθ1 σθ2 (r)– Operations on different relations:

• Estimated size of r ∪ s = size of r + size of s• Estimated size of r ∩ s = minimum size of r and size of s• Estimated size of r – s = r

– Inaccurate, upper bounds on the sizes


Estimation of Number of Distinct Values for Selections σθ (r)• If θ forces A to take a specified value: V(A,σθ (r)) = 1• If θ forces A to take one of a specified set of values: V(A,σθ

(r)) = number of specified values– e.g., (A = 1 V A = 3 V A = 4)

• If the selection condition θ is of the form A op r, estimated V(A,σθ (r)) = V(A.r) * s, where s is the selectivity of the selection

• In all the other cases: use approximate estimate of min(V(A,r), nσθ (r) )

• More accurate estimate is feasible using probability theory, but the above works fine generally


Enumeration of Equivalent Plans

• Use equivalence rules to systematically generate expressions equivalent to the given expression– For each expression found so far, use all

applicable equivalence rules, and add newly generated expressions to the set of expressions found so far

– Repeat until no more expressions can be found• Reduce space cost: share common sub-

expressions


A Local Optimal Method

• Choose the best algorithm for each operator• The global effect may not be optimal

– A merge sort may be costlier than a hash join, but enables fast algorithms for later operations, e.g., duplicate elimination, insertion, or another merge join

• All possible algorithms should be considered– Consider all the possible plans and choose the best one

in a cost-based fashion– Use heuristics to choose a plan– Practical system incorporates elements from both

approaches


Cost-Based Optimization

• Generate a range of query-evaluation plans using the equivalence rules

• Choose the one with the least cost• For a complex query, the number of different

possible query plans can be large– For r1 r2 . . . Rn, there are (2(n – 1))!/(n – 1)!

different join orders!• No need to generate all the join orders

– Use dynamic programming, the least-cost join order for any subset of {r1, r2, . . . rn} is computed only once and stored for future use


Find the Best Planprocedure findbestplan(S)

if (bestplan[S].cost ≠ ∞)return bestplan[S]

// else bestplan[S] has not been computed earlier, compute it nowfor each non-empty proper subset S1 of S

P1= findbestplan(S1)P2= findbestplan(S - S1)A = best algorithm for joining results of P1 and P2cost = P1.cost + P2.cost + cost of Aif cost < bestplan[S].cost

bestplan[S].cost = costbestplan[S].plan = “execute P1.plan; execute P2.plan;

join results of P1 and P2 using A”return bestplan[S]

Complexity: O(3n)Space complexity: O(2n)


Heuristic Optimization• Deconstruct conjunctive selections into a sequence of

single selection operations • Move selection operations down the query tree for the

earliest possible execution• Execute first those selection and join operations that will

produce the smallest relations• Replace Cartesian product operations that are followed by

a selection condition by join operations• Deconstruct and move as far down the tree as possible

lists of projection attributes, creating new projections where needed

• Identify those subtrees whose operations can be pipelined, and execute them using pipelining


Left-Deep Join Order• Only consider those join orders where the right operand of each join is

one of the initial relations– Convenient for pipelined evaluation, only one input the each join is

pipelined• If only left-deep join orders are considered, time complexity is O(n!)

– By dynamic programming, the complexity is O(n2n)• For an n-way join, consider n sets of left-deep join orders such that

each set starts with a different one of the n relations

Left-deep join tree

Non-left-deep join tree

basics of database tuning - computing science€¦ · cmpt 454: database ii -- query optimization 5...

Documents