Query Processing and Optimization
TRANSCRIPT
- Slide 1
- Query Processing and Optimization
- Slide 2
- Query Processing: Efficient query processing is crucial for good, or even effective, operation of a database. Query processing depends on a variety of factors, not all of which are under the control of the DBMS. Insufficient or incorrect information can result in vastly ineffective plans. Query cataloging.
- Slide 3
- Steps in Query Execution: SQL query → scanning, parsing, validating → intermediate query representation (logical query plan tree).
- Slide 4
- Steps in Query Execution: Intermediate query representation → query optimizer → query execution plan (physical query plan tree).
- Slide 5
- Steps in Query Execution: Query execution plan → query code generator → code to execute the query.
- Slide 6
- Steps in Query Execution: Code to execute the query → run-time database processor → query results.
- Slide 7
- Query Processing: Intermediate form: usually a relational algebra form of the SQL query; uses heuristics and cost-based measures for optimization. Physical query plan: expressed in a language that is interpreted and executed on the machine, or compiled into machine code.
- Slide 8
- Physical-Query-Plan Operators: A basic set of operators that defines the language of physical query execution plans. It comprises the relational operators plus some additional required operators.
- Slide 9
- Physical-Query-Plan Operators: Table scan operator: scans and returns an entire relation R, or scans and returns only those tuples of R that satisfy a given predicate. Table scan: reads all blocks of a sorted data file in sequence. Index scan: uses an index to read all blocks of a data file in sequence.
- Slide 10
- Sorting while Scanning: The sort-scan operator sorts a relation R while scanning it into memory. If sorting is done on an indexed key attribute, nothing more is needed; scanning through the index reads R in sorted order. If R is small enough to fit in main memory, sort it after scanning. Otherwise, use external sort-merge techniques to implement sort-scan.
- Slide 11
- Iterators: Iterators are physical-query-plan operators that consist of three stages: Open(), where an iterable object (such as a relation) is opened; GetNext(), which returns the next element of the object; and Close(), which releases control of the object.
- Slide 12
- Iterators: Example of a table-scan iterator:
      Open() {
          b ← first block of R;
          t ← first tuple of b;
      }
      GetNext() {
          if (t is beyond the last tuple in b) {
              increment b;
              if (b is beyond the last block) RETURN NoMoreData;
              else t ← first tuple of b;
          }
- Slide 13
- Iterators: Example of a table-scan iterator (contd.):
          oldt ← t;
          increment t;
          RETURN oldt;
      }
      Close() { }
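The table-scan iterator can be sketched in Python as follows. This is a minimal illustration, not a DBMS implementation: the relation is assumed to be an in-memory list of blocks (each block a list of tuples), and the names `TableScan`, `NO_MORE_DATA` and `drain` are hypothetical. Unlike the slide pseudocode, the sketch loops over exhausted blocks so that empty blocks are skipped.

```python
class TableScan:
    """Open/GetNext/Close iterator over a relation stored as a list of blocks."""

    NO_MORE_DATA = object()  # sentinel standing in for NoMoreData

    def __init__(self, blocks):
        self.blocks = blocks  # assumption: each block is a list of tuples
        self.b = 0            # index of the current block
        self.t = 0            # index of the current tuple within that block

    def open(self):
        self.b = 0  # first block of R
        self.t = 0  # first tuple of that block

    def get_next(self):
        # Advance past blocks whose tuples are all consumed (or that are empty).
        while self.b < len(self.blocks) and self.t >= len(self.blocks[self.b]):
            self.b += 1
            self.t = 0
        if self.b >= len(self.blocks):
            return self.NO_MORE_DATA
        old_t = self.blocks[self.b][self.t]
        self.t += 1
        return old_t

    def close(self):
        pass  # nothing to release for an in-memory list


def drain(it):
    """Collect every tuple an iterator produces between open() and close()."""
    it.open()
    out = []
    while (t := it.get_next()) is not it.NO_MORE_DATA:
        out.append(t)
    it.close()
    return out
```

Calling `drain(TableScan(blocks))` returns the tuples in block order, which is how a parent operator would consume the scan one tuple at a time.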
- Slide 14
- Iterators: Computing bag union R+S using iterators over iterators:
      Open() {
          R.Open();
          CurRel ← R;
      }
      GetNext() {
          IF (CurRel = R) {
              t ← R.GetNext();
              IF (t != NoMoreData) RETURN t;
- Slide 15
- Iterators: Computing bag union R+S using iterators over iterators (contd.):
              ELSE { /* R is exhausted */
                  S.Open();
                  CurRel ← S;
              }
          } /* end IF (CurRel = R); from here on, read from S */
          RETURN S.GetNext();
          /* If S is exhausted, it returns NoMoreData, which is what should be returned by this GetNext as well */
      }
      Close() {
          R.Close();
          S.Close();
      }
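A minimal Python sketch of the bag-union iterator, composed from two child iterators. The child class `ListScan`, the sentinel `NO_MORE_DATA` and the helper `collect` are hypothetical names introduced only for this illustration. When R runs out, the sketch opens S and immediately returns S's first tuple in the same call, so no call to `get_next` returns without a result.

```python
NO_MORE_DATA = object()  # sentinel standing in for NoMoreData


class ListScan:
    """Hypothetical leaf iterator over an in-memory list of tuples."""

    def __init__(self, tuples):
        self.tuples = list(tuples)
        self.pos = 0

    def open(self):
        self.pos = 0

    def get_next(self):
        if self.pos >= len(self.tuples):
            return NO_MORE_DATA
        t = self.tuples[self.pos]
        self.pos += 1
        return t

    def close(self):
        pass


class BagUnion:
    """Bag union R+S: all tuples of R, then all tuples of S, duplicates kept."""

    def __init__(self, r, s):
        self.r = r
        self.s = s
        self.cur = None

    def open(self):
        self.r.open()
        self.cur = self.r

    def get_next(self):
        if self.cur is self.r:
            t = self.r.get_next()
            if t is not NO_MORE_DATA:
                return t
            # R is exhausted: switch to S and fall through to read from it.
            self.s.open()
            self.cur = self.s
        # If S is exhausted this is NO_MORE_DATA, which is also our answer.
        return self.s.get_next()

    def close(self):
        self.r.close()
        self.s.close()


def collect(it):
    """Drive an iterator from open() to close() and gather its output."""
    it.open()
    out = []
    while (t := it.get_next()) is not NO_MORE_DATA:
        out.append(t)
    it.close()
    return out
```

Because `BagUnion` exposes the same three methods as its children, it can itself be used as an input to another operator.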
- Slide 16
- Database Access Algorithms: Algorithms for database access can be broadly divided into the following categories: sorting-based methods, hash-based methods, and index-based methods.
- Slide 17
- Physical-Query-Plan Operator Types: Tuple-at-a-time unary operators: need to read only one block at a time and work with only one tuple at a time. Full-relation unary operators: require knowledge of all or most of the relation; in one-pass algorithms, small relations are read into memory. Full-relation binary operators: same as above, but on two relations.
- Slide 18
- Tuple-at-a-time Unary Operations: Examples: $\sigma(R)$, $\pi(R)$, etc. A strategy for a one-pass algorithm: R → input buffer → unary operator → output buffer.
- Slide 19
- Relation-at-a-time Unary Operations: Examples: UNIQUE, GROUP BY, etc. A strategy for a one-pass algorithm: R → input buffer → unary operator → output buffer, with the unary operator consulting a data structure that holds the history of tuples seen so far.
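As a rough illustration of a relation-at-a-time unary operator, the sketch below implements UNIQUE in one pass. It assumes the relation arrives as a sequence of blocks (lists of hashable tuples) and that the set of distinct tuples fits in memory; the function name `unique` is hypothetical.

```python
def unique(blocks):
    """One-pass duplicate elimination over a relation given as blocks."""
    seen = set()  # the data structure holding history
    for block in blocks:      # one block sits in the input buffer at a time
        for t in block:
            if t not in seen:  # first occurrence: remember it and emit it
                seen.add(t)
                yield t
```

The input side needs only one buffer; all remaining memory goes to `seen`, which is why the operator is limited by the number of distinct tuples rather than by the size of R.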
- Slide 20
- Relation-at-a-time Binary Operators: One-pass strategies for binary relation-at-a-time operators vary from operator to operator. Almost all of them require at least one of the relations to be stored completely in memory.
- Slide 21
- Strategies: Set Union $R \cup S$. Assuming R is the bigger relation: Read S into memory completely and make it accessible through an in-memory index structure. Output all tuples of S while reading. For each tuple of R, search for it in S, and output it only if it is not found.
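The set-union strategy can be sketched as below. The sketch assumes both relations are duplicate-free sets delivered as blocks of hashable tuples, and uses a Python `set` as the in-memory index over S; `one_pass_set_union` is a hypothetical name.

```python
def one_pass_set_union(r_blocks, s_blocks):
    """One-pass set union; S (the smaller relation) must fit in memory."""
    s_index = set()  # in-memory index over S
    for block in s_blocks:
        for t in block:
            if t not in s_index:
                s_index.add(t)
                yield t  # output tuples of S while reading them
    for block in r_blocks:  # R streams through one block at a time
        for t in block:
            if t not in s_index:  # emit only tuples not already output from S
                yield t
```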
- Slide 22
- Strategies: Set Intersection $R \cap S$. Assuming R is the bigger relation: Read S into memory completely and make it accessible through an in-memory index structure. For each tuple of R, search for it in S, and output it if it is found.
- Slide 23
- Strategies: Set Difference $R - S$. Assuming R is the bigger relation: Read S into memory completely and make it accessible through an in-memory index structure. For each tuple of R, search for it in S. If the tuple exists in S, ignore it; otherwise output it.
- Slide 24
- Strategies: Set Difference $S - R$. Assuming R is the bigger relation: Read S into memory completely and make it accessible through an in-memory index structure. For each tuple of R, search for it in S, and delete it from S if it exists. Output all remaining tuples of S.
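The two difference strategies are not symmetric: R - S can stream its output, while S - R must wait until all of R has been read. A minimal sketch of both, assuming hashable tuples and hypothetical function names:

```python
def one_pass_r_minus_s(r_blocks, s_tuples):
    """R - S: stream R and emit tuples that are absent from the in-memory S."""
    s_index = set(s_tuples)
    for block in r_blocks:
        for t in block:
            if t not in s_index:
                yield t


def one_pass_s_minus_r(r_blocks, s_tuples):
    """S - R: delete from the in-memory S every tuple that also appears in R."""
    remaining = set(s_tuples)
    for block in r_blocks:
        for t in block:
            remaining.discard(t)  # no-op when t is not in S
    return remaining  # only now is the result known
```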
- Slide 25
- Strategies: Cross Product $R \times S$. Assuming R is the bigger relation: Read S into memory completely and store it in a buffer. No special data structure is required. For each tuple of R, combine it with each tuple of S and output the result.
- Slide 26
- Strategies: Natural Join R * S. Assume R(X,Y) and S(Y,Z). Assuming R is the bigger relation: Read S into memory completely and store it in a balanced-tree index structure or a hash table keyed on Y. For each tuple of R, search S for tuples with a matching Y value. Output the joined tuple for every match found.
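A small sketch of the one-pass natural join, using a hash table on the join attribute. It assumes R tuples have the shape `(x, y)` and S tuples the shape `(y, z)`; the function name is hypothetical.

```python
from collections import defaultdict


def one_pass_natural_join(r_blocks, s_tuples):
    """Join R(X,Y) with S(Y,Z); S must fit in memory, R is streamed."""
    s_by_y = defaultdict(list)  # hash table on the join attribute Y
    for y, z in s_tuples:
        s_by_y[y].append(z)
    for block in r_blocks:
        for x, y in block:
            for z in s_by_y.get(y, []):  # every S tuple that matches on Y
                yield (x, y, z)
```

Using `.get` rather than indexing avoids inserting empty entries into the table for Y values that occur only in R.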
- Slide 27
- One-pass Algorithms: One-pass algorithms are applicable only when one of the relations fits completely into memory. In addition, there must be enough memory to store at least one block of the other relation. Hence, if M memory buffers are available, one of the relations must occupy at most M-1 blocks.
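The M-1 condition can be written as a one-line check. The sketch below assumes relation sizes are measured in blocks and that one buffer is reserved for the streamed relation; the function name is hypothetical.

```python
def can_run_one_pass(blocks_r, blocks_s, m):
    """True if the smaller relation fits in m - 1 buffers, leaving one for the other."""
    return min(blocks_r, blocks_s) <= m - 1
```

For example, with 100 buffers a 99-block relation can be held in memory while the other is streamed, but a 100-block relation cannot.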
- Slide 28
- One-pass Algorithms: One-pass algorithms rely on correctly estimating relation sizes and allocating memory buffers. If too many buffers are allocated, there is a possibility of thrashing. If too few buffers are allocated, one-pass algorithms may not run.
- Slide 29
- Summary: Stages in query processing. Logical query plan and physical query plan. Intermediate query language. Physical query plan language constructs. One-pass algorithms for unary and binary operators.
- Slide 30
- Query Processing and Optimization (contd.)
- Slide 31
- Query Processing: Efficient query processing is crucial for good, or even effective, operation of a database. Query processing depends on a variety of factors, not all of which are under the control of the DBMS. Insufficient or incorrect information can result in vastly ineffective plans. Query cataloging.
- Slide 32
- Steps in Query Execution: SQL query → scanning, parsing, validating → intermediate query representation (logical query plan tree).
- Slide 33
- Steps in Query Execution: Intermediate query representation → query optimizer → query execution plan (physical query plan tree).
- Slide 34
- Steps in Query Execution: Query execution plan → query code generator → code to execute the query.
- Slide 35
- Steps in Query Execution: Code to execute the query → run-time database processor → query results.
- Slide 36
- Query Processing: Intermediate form: usually a relational algebra form of the SQL query; uses heuristics and cost-based measures for optimization. Physical query plan: expressed in a language that is interpreted and executed on the machine, or compiled into machine code.
- Slide 37
- Physical-Query-Plan Operators: A basic set of operators that defines the language of physical query execution plans. It comprises the relational operators plus some additional required operators. Example operators: table-scan, index-scan, sort-scan, iterator, etc.
- Slide 38
- One-pass Algorithms: One-pass algorithms are applicable only when one of the relations fits completely into memory. In addition, there must be enough memory to store at least one block of the other relation. Hence, if M memory buffers are available, one of the relations must occupy at most M-1 blocks.
- Slide 39
- One-pass Algorithms: One-pass algorithms rely on correctly estimating relation sizes and allocating memory buffers. If too many buffers are allocated, there is a possibility of thrashing. If too few buffers are allocated, one-pass algorithms may not run.
- Slide 40
- Multi-pass Algorithms: Used when entire relations cannot be read into memory. They require alternating between computing intermediate results and retrieving them. Many multi-pass algorithms are generalizations of their corresponding two-pass algorithms.
- Slide 41
- Basic Idea: Two-pass Algorithms Based on Sorting: Suppose relation R is too big to fit in memory, which can accommodate only M blocks of data. The sorting-based two-pass algorithms have the following basic structure:
- Slide 42
- Basic Idea: Two-pass Algorithms Based on Sorting: 1. Read M blocks of records into memory and sort them. 2. Write them back to disk. 3. Repeat steps 1 and 2 until R is exhausted. 4. Use a variety of query-merge techniques to extract relevant results from all the sorted M-block sublists on disk.
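The four steps can be sketched for the simplest case, where the second pass is a plain merge that yields R in sorted order. This is an in-memory illustration only: Python lists stand in for the sorted sublists written to disk, and the limit of M-1 sublists (one input buffer per sublist plus one output buffer) is an assumption added here, not something stated on the slide. The name `two_pass_sort` is hypothetical.

```python
import heapq


def two_pass_sort(blocks, m):
    """Sort a relation given as blocks, using at most m blocks of memory at once."""
    # Pass 1: read m blocks at a time, sort them, and "write" the run out.
    runs = []
    for i in range(0, len(blocks), m):
        chunk = [t for block in blocks[i:i + m] for t in block]
        runs.append(sorted(chunk))  # stands in for a sorted sublist on disk
    # Assumption: pass 2 needs one buffer per run plus one for output.
    if len(runs) > m - 1:
        raise ValueError("too many sorted sublists for a two-pass merge")
    # Pass 2: merge the sorted runs, repeatedly taking the smallest head.
    return list(heapq.merge(*runs))
```

Other two-pass operators replace the final merge with operator-specific logic, for example skipping equal neighbours for duplicate elimination.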
- Slide 43
- Duplicate Elimination Using Sorting: 1. Let relation R, in which duplicates have to be eliminated, be too big to fit in memory. 2. Read M
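- Parse Trees: Parse tree for $\pi_{\text{DeptName}}(\sigma_{\text{Salary} > 300000}(\text{Manager} \bowtie_{\text{Manager.DNO} = \text{Dept.DNO}} \text{Dept}))$. From root to leaves: $\pi_{\text{DeptName}}$, then $\sigma_{\text{Salary} > 300000}$, then $\bowtie_{\text{Manager.DNO} = \text{Dept.DNO}}$, with leaves Manager and Dept.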
- Slide 69
- Checks on Parse Trees: Syntactic checks: is the syntax of every operator correct? Entity checks: does every relation name refer to a valid relation? View expansion: if a relation name refers to a view, replace the relation node with the parse tree of the view. Attribute checks: does every attribute name refer to a valid attribute? Type checks: does each attribute participating in an expression have the proper type?
- Slide 70
- Rewriting Parse Trees: Queries are optimized by rewriting parse trees. Rewriting is guided by a set of rewrite rules. A parse tree should be expanded to its maximum extent before rewriting (for example, views should be replaced by their parse trees). Some rewrite rules are situation-specific: they work only if certain conditions hold on the data set.
- Slide 71
- Pushing Selects: Since a select reduces the size of a relation, selects can be pushed as far down a parse tree as possible. The parse tree $\pi_{\text{DeptName}}(\sigma_{\text{Salary} > 300000}(\text{Manager} \bowtie_{\text{Manager.DNO} = \text{Dept.DNO}} \text{Dept}))$ can be rewritten as
- Slide 72
- Pushing Selects: $\pi_{\text{DeptName}}(\sigma_{\text{Salary} > 300000}(\text{Manager}) \bowtie_{\text{Manager.DNO} = \text{Dept.DNO}} \text{Dept})$. Instead of pairing all managers with their respective departments, pair only those managers who have Salary > 300000.
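The effect of pushing the select below the join can be seen on a toy data set. Everything in the sketch is illustrative: the rows, the dictionary layout and the function names are hypothetical, and a nested-loop join is used only for brevity. Both plans return the same department names, but the second feeds fewer tuples into the join.

```python
def join_on_dno(managers, depts):
    """Nested-loop join on Manager.DNO = Dept.DNO."""
    return [(m, d) for m in managers for d in depts if m["DNO"] == d["DNO"]]


def select_after_join(managers, depts):
    joined = join_on_dno(managers, depts)  # pairs every manager first
    kept = [(m, d) for m, d in joined if m["Salary"] > 300000]
    return sorted(d["DeptName"] for _, d in kept), len(joined)


def select_before_join(managers, depts):
    rich = [m for m in managers if m["Salary"] > 300000]  # select pushed down
    joined = join_on_dno(rich, depts)
    return sorted(d["DeptName"] for _, d in joined), len(joined)


# Hypothetical sample data.
MANAGERS = [
    {"Name": "A", "DNO": 1, "Salary": 400000},
    {"Name": "B", "DNO": 2, "Salary": 200000},
    {"Name": "C", "DNO": 3, "Salary": 350000},
]
DEPTS = [
    {"DNO": 1, "DeptName": "Sales"},
    {"DNO": 2, "DeptName": "HR"},
    {"DNO": 3, "DeptName": "R&D"},
]
```

Each function returns the projected names together with the size of the join's output, so the two plans can be compared directly.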
- Slide 73
- Pushing Selects: Conjunctive selects can be split and pushed to form cascading selects that progressively reduce relation size: $\sigma_{C \wedge D}(R) = \sigma_{C}(\sigma_{D}(R))$.
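The cascading rule can be checked on a small example. The relation, the predicates and the helper `select` below are hypothetical and serve only to show that the conjunctive select and the cascade return the same tuples.

```python
def select(pred, relation):
    """Relational select over a list of tuples."""
    return [t for t in relation if pred(t)]


R = [(1, "x"), (5, "y"), (8, "x"), (12, "x")]  # hypothetical relation


def c(t):
    return t[0] > 4      # condition C


def d(t):
    return t[1] == "x"   # condition D


conjunctive = select(lambda t: c(t) and d(t), R)
cascaded = select(c, select(d, R))
```

Since C and D commute, the optimizer is free to apply the more selective condition first so that the outer select sees fewer tuples.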
- Slide 74
- Pushing Selects: When a query contains a view, selects may first have to be moved up before they are moved down. Consider the relations Movie (title, year, director, language) and StarsIn (title, year, StarName, language), and the view: CREATE VIEW BengaliMovies AS SELECT * FROM Movie WHERE language = 'Bengali';
- Slide 75
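- Pushing Selects: Consider the query: which star worked under which director in Bengali movies? SELECT starname, director FROM BengaliMovies NATURAL JOIN StarsIn; Parse tree: $\pi_{\text{starname, director}}(\sigma_{\text{language} = \text{'Bengali'}}(\text{Movie}) \bowtie \text{StarsIn})$, where the subtree $\sigma_{\text{language} = \text{'Bengali'}}(\text{Movie})$ is the parse tree for BengaliMovies.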
- Slide 76
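- Pushing Selects: In $\pi_{\text{starname, director}}(\sigma_{\text{language} = \text{'Bengali'}}(\text{Movie}) \bowtie \text{StarsIn})$, all tuples of StarsIn are selected, even though they are joined only with tuples having language = 'Bengali'. Rewritten parse tree: $\pi_{\text{starname, director}}(\sigma_{\text{language} = \text{'Bengali'}}(\text{Movie}) \bowtie \sigma_{\text{language} = \text{'Bengali'}}(\text{StarsIn}))$.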
- Slide 77
- Pushing Selects: If a select over a join involves attributes of only one of the relations, move the select below the join. Consider the following query over the Movies database comprising the Movie and StarsIn relations and the BengaliMovies view: which stars acted under the direction of Satyajit Ray in Bengali movies? In SQL: SELECT starname FROM BengaliMovies NATURAL JOIN StarsIn WHERE director = 'Satyajit Ray'
- Slide 78
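- Pushing Selects: Corresponding parse tree: $\pi_{\text{starname, director}}(\sigma_{\text{director} = \text{'Satyajit Ray'}}(\sigma_{\text{language} = \text{'Bengali'}}(\text{Movie}) \bowtie \text{StarsIn}))$, where $\sigma_{\text{language} = \text{'Bengali'}}(\text{Movie})$ is the expansion of the view. Optimized tree: $\pi_{\text{starname, director}}(\sigma_{\text{director} = \text{'Satyajit Ray'}}(\sigma_{\text{language} = \text{'Bengali'}}(\text{Movie})) \bowtie \text{StarsIn})$.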
- Slide 79
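- Inserting Projects: Extra projects can be added near the leaves of the parse tree to reduce the size of tuples going up the tree. Original tree: $\pi_{\text{DeptName}}(\sigma_{\text{Salary} > 300000}(\text{Manager} \bowtie_{\text{Manager.DNO} = \text{Dept.DNO}} \text{Dept}))$. With extra projects: $\pi_{\text{DeptName}}(\sigma_{\text{Salary} > 300000}(\pi_{\text{DNO, Salary}}(\text{Manager}) \bowtie_{\text{Manager.DNO} = \text{Dept.DNO}} \pi_{\text{DNO, DeptName}}(\text{Dept})))$.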
- Slide 80
- Cost-Based Optimization: Factors affecting query cost: access cost to secondary storage; storage cost of intermediate files; computation cost; memory usage cost; communication cost (between the DBMS server and its client).
- Slide 81
- Catalogs: Catalogs in a database store information for cost estimation. Catalogs are metadata that can be table-specific, field-specific, index-specific, or database-wide.
- Slide 82
- Catalog Examples: B(R): number of blocks taken by relation R. T(R): number of tuples in relation R. V(R,a): number of distinct values relation R has for attribute a. $V(R, [a_1, a_2, \ldots, a_n])$: number of distinct values relation R has for the combined set of attributes $a_1, a_2, \ldots, a_n$.
- Slide 83
- Cost Estimation Examples: Estimating the cost of selection. Consider a select of the form $S = \sigma_{A = c}(R)$, where c is a constant and A is an attribute of R. The estimated number of tuples in S is $T(S) = T(R) / V(R,A)$. This gives a good estimate if all values of A have uniform probabilities of occurrence (in the selection query).
- Slide 84
- Cost Estimation Examples: Estimating the cost of selection. Consider an inequality condition in a select: $S = \sigma_{A \neq c}(R)$, where c is a constant and A is an attribute of R. The estimated number of tuples in S is $T(S) = T(R)\,(V(R,A) - 1) / V(R,A)$. This gives a good estimate if all values of A have uniform probabilities of occurrence (in the selection query).
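The equality and inequality estimates are simple enough to compute directly from the catalog values T(R) and V(R,A). The sketch below uses hypothetical function names and made-up catalog numbers; it assumes the uniform-distribution condition holds.

```python
def estimate_equality(t_r, v_r_a):
    """Estimated tuples for a select A = c: T(R) / V(R,A)."""
    return t_r / v_r_a


def estimate_inequality(t_r, v_r_a):
    """Estimated tuples for a select A != c: T(R) * (V(R,A) - 1) / V(R,A)."""
    return t_r * (v_r_a - 1) / v_r_a
```

With T(R) = 10000 and V(R,A) = 50, the two estimates are 200 and 9800 tuples, and they add up to T(R) because every tuple either matches c or does not.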
- Slide 85
- Cost Estimation Examples: Consider a composite condition in a select: $S = \sigma_{C \vee D}(R)$, where C and D are conditions on attributes of R. Let $T(R) = n$. Let p be the number of tuples satisfying C, and q the number of tuples satisfying D. Treating C and D as independent, the probability that a given tuple matches C OR D is $1 - (1 - p/n)(1 - q/n)$. Hence the estimate of T(S) is $n\,(1 - (1 - p/n)(1 - q/n))$.
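A short sketch of the OR estimate, with n, p and q as in the text. The function name and the sample numbers are hypothetical.

```python
def estimate_or(n, p, q):
    """Estimated tuples matching C OR D: n * (1 - (1 - p/n) * (1 - q/n))."""
    return n * (1 - (1 - p / n) * (1 - q / n))
```

For n = 1000, p = 250 and q = 500 the estimate is 625 tuples, which is less than p + q = 750 because tuples satisfying both conditions are counted once.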
- Slide 86
- Estimating Size of Natural Joins: Consider a natural join R(X,Y) * S(Y,Z). For simplicity, assume Y is a single attribute, while X and Z could be sets. Case 1: if V(R,Y) < V(S,Y), V(S,Y) appears in the denominator of the estimate. Case 2: if V(R,Y) >= V(S,Y), the opposite argument holds and V(R,Y) appears in the denominator.
- Slide 87
- Estimating Size of Natural Joins: In general, the maximum of V(R,Y) and V(S,Y) appears in the denominator: $\text{Estimate} = T(R)\,T(S) / \max(V(R,Y), V(S,Y))$. When Y is a composite of several attributes, the maximum is taken for each corresponding attribute in Y, and these maxima are multiplied together in the denominator.
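The join estimate, including the composite-attribute case, can be sketched as follows. The function name and the dictionary layout (join attribute name mapped to its V value) are assumptions made for this illustration.

```python
from math import prod


def estimate_natural_join(t_r, t_s, v_r, v_s):
    """T(R) * T(S) divided by the product, over join attributes, of max(V(R,a), V(S,a)).

    v_r and v_s map each join attribute to its distinct-value count in R and S.
    """
    if v_r.keys() != v_s.keys():
        raise ValueError("both relations must list the same join attributes")
    denominator = prod(max(v_r[a], v_s[a]) for a in v_r)
    return t_r * t_s / denominator
```

With a single join attribute the dictionaries have one entry each and the formula reduces to T(R)T(S) / max(V(R,Y), V(S,Y)).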
- Slide 88
- Summary: Index-based algorithms. Query optimization by rewriting the parse tree: pushing selects, cascading selects, pulling selects from views, extra projects. Cost estimation of query components based on catalog information.