9. Query Processing

Queries in a high-level language such as SQL are processed by Horizontal DBMSs in the following steps:
1. SCAN and PARSE (SCANNER-PARSER): The Scanner identifies the tokens, or language elements. The Parser checks for syntax (grammar) validity.
2. VALIDATE: The Validator checks for valid names and semantic correctness.
3. CONVERT: The Converter converts the query to an internal representation (usually a QUERY TREE).
4. QUERY OPTIMIZATION: The Query Optimizer devises a strategy for executing the query (chooses among alternative query trees).
5. CODE GENERATION: Generates code to implement each operator in the selected query plan (the optimizer-selected query tree).
6. RUNTIME DATABASE PROCESSING: Runs the plan code.
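As a rough illustration of step 1 only, a scanner can be sketched as a regular-expression tokenizer. The token classes, the tiny SQL subset, and the `scan` function are assumptions made for this sketch; they are not the scanner of any particular DBMS.

```python
import re

# Hypothetical token classes for a tiny SQL subset (an assumption for this sketch).
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:SELECT|FROM|WHERE|AND)\b"),
    ("STRING",  r'"[^"]*"'),
    ("NUMBER",  r"\d+"),
    ("IDENT",   r"[A-Za-z_][A-Za-z0-9_#]*"),   # allows names such as S# and C#
    ("SYMBOL",  r"[.,=;]"),
    ("SKIP",    r"\s+"),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.IGNORECASE,
)

def scan(sql):
    """Return a list of (kind, text) tokens; raise if no token class matches."""
    tokens, pos = [], 0
    while pos < len(sql):
        match = MASTER.match(sql, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {sql[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":          # whitespace is dropped
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

tokens = scan("SELECT S.SNAME FROM S WHERE S.S#=25;")
```

The parser would then check that this token sequence follows the grammar, and the validator that `S`, `SNAME`, and `S#` actually exist in the catalog.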
Section 9 # 1

The CONVERTER converts the query to an internal representation (usually a QUERY TREE). E.g., given the database:

S
|S#|SNAME |LCODE |
|25|CLAY  |NJ5101|
|32|THAISZ|NJ5102|
|38|GOOD  |FL6321|
|17|BAID  |NY2091|
|57|BROWN |NY2092|

C
|C#|CNAME|SITE|
|8 |DSDE |ND  |
|7 |CUS  |ND  |
|6 |3UA  |NJ  |
|5 |3UA  |ND  |

E
|S#|C#|GR|
|32|8 |89|
|32|7 |91|
|25|7 |68|
|25|6 |76|
|32|6 |62|

The SQL request:
SELECT S.SNAME, C.CNAME, E.GR
FROM S, C, E
WHERE E.GR=68 and C.SITE="ND" and S.LCODE="NJ5101" and C.C#=E.C# and S.S#=E.S#;
gets SCANNED, PARSED, and VALIDATED, and then may get CONVERTED to a query tree that follows the sequencing of the WHERE clause.
Section 9 # 2

CONVERTER
This is the simplest CONVERTER (it uses the ordering in the WHERE clause). For the database S, C, E given above and the request

SELECT S.SNAME, C.CNAME, E.GR
FROM S, C, E
WHERE E.GR=68 and C.SITE="ND" and S.LCODE="NJ5101" and C.C#=E.C# and S.S#=E.S#;

it produces the query tree:

        M=PROJ(L)[SNAME,CNAME,GR]
                  |
          L=SELECT(K.GR=68)
                  |
        K=SELECT(H.SITE="ND")
                  |
      H=SELECT(G.LCODE="NJ5101")
                  |
          G=JOIN(F.C#=C.C#)
               /     \
  F=JOIN(S.S#=E.S#)   C
       /   \
      S     E
Section 9 # 3

CONVERTER
Let's see the results at each step of the query tree from the previous slide (S, C, and E are the base relations given above).

F=JOIN(S.S#=E.S#)
|S#|SNAME |LCODE |C#|GR|
|25|CLAY  |NJ5101|7 |68|
|25|CLAY  |NJ5101|6 |76|
|32|THAISZ|NJ5102|8 |89|
|32|THAISZ|NJ5102|7 |91|
|32|THAISZ|NJ5102|6 |62|

G=JOIN(F.C#=C.C#)
|S#|SNAME |LCODE |C#|GR|CNAME|SITE|
|25|CLAY  |NJ5101|7 |68|CUS  |ND  |
|25|CLAY  |NJ5101|6 |76|3UA  |NJ  |
|32|THAISZ|NJ5102|8 |89|DSDE |ND  |
|32|THAISZ|NJ5102|7 |91|CUS  |ND  |
|32|THAISZ|NJ5102|6 |62|3UA  |NJ  |

H=SELECT(G.LCODE="NJ5101")
|S#|SNAME |LCODE |C#|GR|CNAME|SITE|
|25|CLAY  |NJ5101|7 |68|CUS  |ND  |
|25|CLAY  |NJ5101|6 |76|3UA  |NJ  |

K=SELECT(H.SITE="ND")
|S#|SNAME |LCODE |C#|GR|CNAME|SITE|
|25|CLAY  |NJ5101|7 |68|CUS  |ND  |

L=SELECT(K.GR=68)
|S#|SNAME |LCODE |C#|GR|CNAME|SITE|
|25|CLAY  |NJ5101|7 |68|CUS  |ND  |

M=PROJ(L)[SNAME,CNAME,GR]
|SNAME|CNAME|GR|
|CLAY |CUS  |68|
Section 9 # 4

The OPTIMIZER devises a strategy for executing the query (chooses among alternative query trees).
Is the previous query tree optimal? Is this tree better?

                M=PROJ(G)[SNAME,CNAME,GR]
                          |
                  G=JOIN(F.C#=K.C#)
                   /             \
       F=JOIN(H.S#=L.S#)      K=SEL(C.SITE=ND)
         /           \                |
H=SEL(S.LCODE=NJ5101) L=SEL(E.GR=68)  C
         |                 |
         S                 E

Results at each step (S, C, and E are the base relations given above):

K=SEL(C.SITE=ND)
|C#|CNAME|SITE|
|8 |DSDE |ND  |
|7 |CUS  |ND  |
|5 |3UA  |ND  |

L=SEL(E.GR=68)
|S#|C#|GR|
|25|7 |68|

H=SEL(S.LCODE=NJ5101)
|S#|SNAME|LCODE |
|25|CLAY |NJ5101|

F=JOIN(H.S#=L.S#)
|S#|SNAME|LCODE |C#|GR|
|25|CLAY |NJ5101|7 |68|

G=JOIN(F.C#=K.C#)
|S#|SNAME|LCODE |C#|GR|CNAME|SITE|
|25|CLAY |NJ5101|7 |68|CUS  |ND  |

M=PROJ(G)[SNAME,CNAME,GR]
|SNAME|CNAME|GR|
|CLAY |CUS  |68|

YES! This tree is better, since the intermediate files created are much smaller!
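The size difference between the two trees can be checked with a small sketch that evaluates both plans over the sample database and counts the rows in each intermediate file. The helper names (`select`, `join`, `project`, `naive_plan`, `pushed_down_plan`) and the list-of-dicts representation are assumptions made for illustration; the join is a plain nested loop, not what an optimizer would necessarily choose.

```python
# Sample database from the slides, as lists of dicts (representation is an assumption).
S = [{"S#": 25, "SNAME": "CLAY", "LCODE": "NJ5101"},
     {"S#": 32, "SNAME": "THAISZ", "LCODE": "NJ5102"},
     {"S#": 38, "SNAME": "GOOD", "LCODE": "FL6321"},
     {"S#": 17, "SNAME": "BAID", "LCODE": "NY2091"},
     {"S#": 57, "SNAME": "BROWN", "LCODE": "NY2092"}]
C = [{"C#": 8, "CNAME": "DSDE", "SITE": "ND"},
     {"C#": 7, "CNAME": "CUS", "SITE": "ND"},
     {"C#": 6, "CNAME": "3UA", "SITE": "NJ"},
     {"C#": 5, "CNAME": "3UA", "SITE": "ND"}]
E = [{"S#": 32, "C#": 8, "GR": 89}, {"S#": 32, "C#": 7, "GR": 91},
     {"S#": 25, "C#": 7, "GR": 68}, {"S#": 25, "C#": 6, "GR": 76},
     {"S#": 32, "C#": 6, "GR": 62}]

def select(rows, pred):
    return [r for r in rows if pred(r)]

def join(left, right, attr):
    """Nested-loop equijoin on an attribute name shared by both inputs."""
    return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]

def project(rows, attrs):
    """Duplicate-eliminating projection, keeping first-appearance order."""
    return list(dict.fromkeys(tuple(r[a] for a in attrs) for r in rows))

def naive_plan():
    # Joins first, then the selections in sequence.
    F = join(S, E, "S#")
    G = join(F, C, "C#")
    H = select(G, lambda r: r["LCODE"] == "NJ5101")
    K = select(H, lambda r: r["SITE"] == "ND")
    L = select(K, lambda r: r["GR"] == 68)
    sizes = {"F": len(F), "G": len(G), "H": len(H), "K": len(K), "L": len(L)}
    return project(L, ["SNAME", "CNAME", "GR"]), sizes

def pushed_down_plan():
    # Selections pushed to the leaves, then the joins.
    H = select(S, lambda r: r["LCODE"] == "NJ5101")
    L = select(E, lambda r: r["GR"] == 68)
    K = select(C, lambda r: r["SITE"] == "ND")
    F = join(H, L, "S#")
    G = join(F, K, "C#")
    sizes = {"H": len(H), "L": len(L), "K": len(K), "F": len(F), "G": len(G)}
    return project(G, ["SNAME", "CNAME", "GR"]), sizes

naive_result, naive_sizes = naive_plan()
pushed_result, pushed_sizes = pushed_down_plan()
```

Both plans return the same answer; on this data the naive plan materializes 14 intermediate rows in total and the pushed-down plan 7.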
Section 9 # 5

                          M=PROJ(G)[SNAME,CNAME,GR]
                                    |
                            G=JOIN(F.C#=K.C#)
                             /             \
                 F=JOIN(H.S#=L.S#)      K=SEL(C.SITE=ND)[[C#,CNAME]]
                   /           \                |
H=SEL(S.LCODE=NJ5101)[[S#,SNAME]] L=SEL(E.GR=68) C
                   |                 |
                   S                 E

Results at each step (S, C, and E are the base relations given above):

K=SEL(C.SITE=ND)[[C#,CNAME]]
|C#|CNAME|
|8 |DSDE |
|7 |CUS  |
|5 |3UA  |

L=SEL(E.GR=68)
|S#|C#|GR|
|25|7 |68|

H=SEL(S.LCODE=NJ5101)[[S#,SNAME]]
|S#|SNAME|
|25|CLAY |

F=JOIN(H.S#=L.S#)
|S#|SNAME|C#|GR|
|25|CLAY |7 |68|

G=JOIN(F.C#=K.C#)
|SNAME|GR|CNAME|
|CLAY |68|CUS  |

M=PROJ(G)[SNAME,CNAME,GR]
|SNAME|CNAME|GR|
|CLAY |CUS  |68|

Even better! The intermediate files created are even smaller!
Note that the following could be done:
• The SITE attribute can be projected off of K (this doesn't require elimination of duplicates because SITE is not part of the key).
• The LCODE attribute can be projected off of H (this doesn't require elimination of duplicates because LCODE is not part of the key).
• S# could be projected off of F. S# is part of the key, but duplicate elimination can be deferred until M, since it will have to be done again there anyway. Thus this projection can also be a "non-duplicate-eliminating" projection, which we denote by [[ ]]. [[ ]]-projections take almost no time, whereas duplicate-eliminating projections take a lot of time.
• C# can be (non-duplicate-eliminating) projected off of G, leaving M with just reordering the attributes and eliminating duplicates, if any.
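The difference between the two kinds of projection in the notes above can be sketched as follows. The function names `project_bag` (for the [[ ]] projection) and `project_set` are hypothetical, and the hash-based duplicate elimination is only one possible way to do it; a real system might sort instead.

```python
C = [{"C#": 8, "CNAME": "DSDE", "SITE": "ND"},
     {"C#": 7, "CNAME": "CUS", "SITE": "ND"},
     {"C#": 6, "CNAME": "3UA", "SITE": "NJ"},
     {"C#": 5, "CNAME": "3UA", "SITE": "ND"}]

def project_bag(rows, attrs):
    """[[ ]]-style projection: drop attributes in one pass, keep any duplicates."""
    return [tuple(r[a] for a in attrs) for r in rows]

def project_set(rows, attrs):
    """Duplicate-eliminating projection: every output tuple is compared
    (here by hashing) against those already produced."""
    return list(dict.fromkeys(project_bag(rows, attrs)))

# K = SEL(C.SITE=ND)[[C#,CNAME]]: C# is the key, so no duplicates can arise.
K = project_bag([r for r in C if r["SITE"] == "ND"], ["C#", "CNAME"])

# Projecting a non-key attribute alone can produce duplicates.
names_bag = project_bag(C, ["CNAME"])
names_set = project_set(C, ["CNAME"])
```

Here `names_bag` keeps both `3UA` entries while `names_set` keeps one, which is the extra work the rules try to do only once.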
Section 9 # 6

What have we learned about QP? GOOD RULES?
a. Do SELECTs first (push them to the bottom of the tree).
b. Do the attribute-elimination part of PROJECT as soon as possible (push it down).
c. Only do duplicate elimination once (at the top-most PROJECT only, or in conjunction with a later join step).
QUERY OPTIMIZATION, then, is finding an efficient strategy to implement query requests (automatically, heuristically, not necessarily optimally).
Note: In lower-level languages, the user does the query optimization by writing procedural code that specifies all the steps and their order. (Of course, there are optimizing compilers that will automatically alter your "procedures", but you are still mostly responsible for the ordering.)
Relational queries are issued at a high level (SQL or ODBC), so that the system has maximal opportunity to optimize them.
HEURISTIC RULES are used to re-order the query tree (e.g., rules a, b, and c above). Some rules depend upon size and complexity estimates.
ESTIMATION estimates the cost of different strategies and chooses the best.
Challenge: Get acceptable performance (it took 10 years to optimize the join process acceptably so that the first viable Relational DBMSs could be successfully sold!).
Section 9 # 7

Some SELECT implementations (each of S2 - S6 requires a special access path):
S1. Linear search: sequentially search every record.
S2. Binary search (for selections on a clustered or ordered attribute).
S3. Using indexes (or hash structures) for an equality comparison.
S4. Using a primary index for an inequality comparison on a key (clustered).
S5. Using a clustering index for an "=" comparison.
S6. Using a secondary B+-tree index: for "=", use the index set.
SELECTION methods with a WHERE conjunction (AND):
S7. Of the many conjunctive attributes, select on 1 attribute (usually one involving an "=").
S8. Intersection of Record Pointers: intersect the RRN-sets, then retrieve the records.
S9. If there are Bitmapped Indexes, AND the bitmaps.
    CASE-1: SELECT is on an attribute with few distinct values.
    CASE-2: SELECT is on an attribute with uniqueness (key) or near uniqueness.
S10. If there is a composite index on the attributes involved in the condition, use it.
S11. If there is a composite hash function, use it.
SELECTION methods when there is a WHERE disjunction (OR):
S12. If there is no access path (indexes or hash functions), use S1 (brute force).
S13. If there are access paths, use them and UNION the results.
S14. If there are BitMaps, take the OR of the bitmaps.
CODE GENERATION implements the operators above (e.g., SELECT, PROJECT, JOIN...).
Section 9 # 8

S1. Linear search: sequentially search every record.
Required for selections from an unordered relation with no index or access path.
SELECT C#, GR FROM ENROLL WHERE S# = 32;

ENROLL
|S#|C#|GR|
|32|8 |89|
|32|7 |91|
|25|7 |68|
|25|6 |76|
|32|6 |62|
|38|6 |98|
|17|5 |96|

S2. Binary search: For selections on a clustered (ordered) attribute (in this case, S#):
SELECT C#, GR FROM ENROLL WHERE S# = 38;

ENROLL
|RRN|S#|C#|GR|
|0  |17|5 |96|
|1  |25|7 |68|
|2  |25|6 |76|
|3  |32|8 |89|
|4  |32|7 |91|
|5  |34|6 |62|
|6  |38|6 |98|

Go halfway (to RRN=3). Since the S# there is < 38, go halfway down what's left (to RRN=5). Since the S# there is < 38, go halfway down what's left (to RRN=6). Match! Output. Scan ahead and output until no match or EoF.
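S1 and S2 can be sketched side by side over the ordered ENROLL file from the binary search example. The function names and the tuple layout `(S#, C#, GR)` are assumptions for this sketch; `binary_search` also records which RRNs it probes so the halving steps can be inspected.

```python
# ENROLL clustered (ordered) on S#; list position plays the role of the RRN.
ENROLL = [(17, 5, 96), (25, 7, 68), (25, 6, 76), (32, 8, 89),
          (32, 7, 91), (34, 6, 62), (38, 6, 98)]

def linear_search(rows, s_no):
    """S1: look at every record; works on any file, ordered or not."""
    return [(c, gr) for (s, c, gr) in rows if s == s_no]

def binary_search(rows, s_no):
    """S2: halve the search range to find the first match, then scan ahead.
    Requires rows to be ordered on S#."""
    lo, hi, probes = 0, len(rows), []
    while lo < hi:
        mid = (lo + hi) // 2
        probes.append(mid)
        if rows[mid][0] < s_no:
            lo = mid + 1            # match, if any, lies below this record
        else:
            hi = mid                # match, if any, is here or above
    out = []
    while lo < len(rows) and rows[lo][0] == s_no:   # scan ahead until no match or EoF
        out.append(rows[lo][1:])
        lo += 1
    return out, probes

result_38, probes_38 = binary_search(ENROLL, 38)
result_32, _ = binary_search(ENROLL, 32)
```

For S# = 38 the probes are RRN 3, 5, and 6, the same sequence as in the walkthrough above.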
Section 9 # 9

S3. Using indexes (or hash structures) for an equality comparison.
SELECT S#, NAME FROM STUDENT WHERE S# = 32

STUDENT
|RRN|S#|SNAME |LCODE |
|0  |25|CLAY  |NJ5101|
|1  |32|THAISZ|NJ5102|
|2  |38|GOOD  |FL6321|
|3  |17|BAID  |NY2091|
|4  |57|BROWN |NY2092|

Index on S#
|RRN|S#|
|3  |17|
|0  |25|
|1  |32|
|2  |38|
|4  |57|

The index is always clustered on the key (here S#) for binary key search.
Output: 32|THAISZ

S4. Using a primary index for an inequality comparison on a key (clustered). (Find the starting point with "=", then retrieve all records beyond that point.)
SELECT S#, NAME FROM STUDENT WHERE S# >= 32

STUDENT
|RID|S#|SNAME |LCODE |
|1,0|17|BAID  |NY2091|
|1,1|25|CLAY  |NJ5101|
|2,0|32|THAISZ|NJ5102|
|2,1|38|GOOD  |FL6321|
|3,0|57|BROWN |NY2092|

Nondense Primary Index on S#
|RID|S#|
|1,0|17|
|2,0|32|
|3,0|57|

Find the starting point (first S# >= 32), then scan ahead, taking all records until the end.
Output:
32|THAISZ
38|GOOD
57|BROWN
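A minimal sketch of S3 and S4 over the two STUDENT layouts, assuming Python lists and dicts stand in for the files and indexes. `DENSE`, `PAGES`, `SPARSE_KEYS`, `lookup_eq`, and `range_ge` are names invented for illustration.

```python
from bisect import bisect_left, bisect_right

# S3: unordered STUDENT file; list position = RRN.
STUDENT = [(25, "CLAY"), (32, "THAISZ"), (38, "GOOD"), (17, "BAID"), (57, "BROWN")]
# Dense index: one (S#, RRN) entry per record, kept sorted on S#.
DENSE = sorted((row[0], rrn) for rrn, row in enumerate(STUDENT))

def lookup_eq(s_no):
    """Binary-search the index for the key, then fetch records by RRN."""
    i = bisect_left(DENSE, (s_no,))      # (s_no,) sorts before every (s_no, rrn)
    out = []
    while i < len(DENSE) and DENSE[i][0] == s_no:
        out.append(STUDENT[DENSE[i][1]])
        i += 1
    return out

# S4: file clustered on S# and stored in pages; RID = (page, slot).
PAGES = {1: [(17, "BAID"), (25, "CLAY")],
         2: [(32, "THAISZ"), (38, "GOOD")],
         3: [(57, "BROWN")]}
SPARSE_KEYS = [17, 32, 57]               # nondense index: first S# on each page
SPARSE_PAGES = [1, 2, 3]

def range_ge(s_no):
    """Find the page that could hold the first S# >= s_no, then scan to the end."""
    start = max(bisect_right(SPARSE_KEYS, s_no) - 1, 0)
    out = []
    for page in SPARSE_PAGES[start:]:
        out.extend(row for row in PAGES[page] if row[0] >= s_no)
    return out

eq_32 = lookup_eq(32)
ge_32 = range_ge(32)
```

The nondense index needs only one entry per page because the file itself is ordered on S#; the dense index needs one entry per record because the file is not.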
Section 9 # 10

S5. Using a Clustering Index: for an "=" comparison.
SELECT C#, GR FROM ENROLL WHERE S# = 32

ENROLL=E
|RRN|S#|C#|GR|
|0  |17|5 |96|
|1  |25|7 |68|
|2  |25|6 |76|
|3  |32|8 |89|
|4  |32|7 |91|
|5  |32|6 |62|
|6  |38|6 |98|

Clustering Index on S#
|RRN|S#|
|0  |17|
|1  |25|
|3  |32|
|6  |38|

Find the first S#=32, then scan E ahead for others.
Output: 32|89, 32|91, 32|62

S6. Using a secondary B+-tree index: For "=", use the index set (assuming a B+-tree index).
SELECT NAME, CITY FROM STUDENT WHERE S# = 25

STUDENT
|RRN|S#|SNAME |CITY |ST|
|0  |57|BROWN |NY   |NY|
|1  |32|THAISZ|KNOB |NJ|
|2  |17|BAID  |NY   |NY|
|3  |38|GOOD  |GATER|FL|
|4  |25|CLAY  |OUTBK|NJ|
|5  |20|JOB   |MRHD |MN|
|6  |56|BURGUM|FARGO|ND|
|7  |35|BOYD  |FLAX |NE|

B+-tree index on S#
Index set:                  *32*38*
                   *20* n   n32* n   *56* n
Sequence set (S#): |17 20|25 32|35 38| 56|57 |
RRN pointers:      | 2  5| 4  1| 7  3|  6| 0 |

Output: CLAY|OUTBK
Section 9 # 11

S6. Using a secondary B+-tree index: For ">=", use the index set, then use the sequence set (of the B+-tree).
SELECT NAME, CITY FROM STUDENT WHERE S# >= 38
(STUDENT and the B+-tree index on S# are the same as on the previous slide.)

Sequence set (S#): |17 20|25 32|35 38| 56|57 |
RRN pointers:      | 2  5| 4  1| 7  3|  6| 0 |

Use the index set to find 38 in the sequence set, then follow the sequence set to the end.
Output:
GOOD  |GATER
BURGUM|FARGO
BROWN |NY
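The two S6 cases can be approximated with a sketch that models only the sequence set (leaf level) of the B+-tree as a sorted list of (S#, RRN) pairs. Using `bisect_left` in place of a descent through the index set is a simplification assumed here, and `find_eq` and `find_ge` are hypothetical names.

```python
from bisect import bisect_left

# STUDENT in RRN order: (S#, SNAME, CITY, ST).
STUDENT = [(57, "BROWN", "NY", "NY"), (32, "THAISZ", "KNOB", "NJ"),
           (17, "BAID", "NY", "NY"), (38, "GOOD", "GATER", "FL"),
           (25, "CLAY", "OUTBK", "NJ"), (20, "JOB", "MRHD", "MN"),
           (56, "BURGUM", "FARGO", "ND"), (35, "BOYD", "FLAX", "NE")]

# Sequence set of the secondary index: (S#, RRN), ordered on S#.
SEQUENCE_SET = [(17, 2), (20, 5), (25, 4), (32, 1), (35, 7), (38, 3), (56, 6), (57, 0)]

def find_eq(s_no):
    """'=' lookup: locate the key, then fetch the one record it points to."""
    i = bisect_left(SEQUENCE_SET, (s_no,))   # stands in for the index-set descent
    if i < len(SEQUENCE_SET) and SEQUENCE_SET[i][0] == s_no:
        return [STUDENT[SEQUENCE_SET[i][1]][1:3]]
    return []

def find_ge(s_no):
    """'>=' lookup: locate the starting leaf entry, then walk the sequence set."""
    i = bisect_left(SEQUENCE_SET, (s_no,))
    return [STUDENT[rrn][1:3] for _, rrn in SEQUENCE_SET[i:]]

eq_25 = find_eq(25)
ge_38 = find_ge(38)
```

Because the index is secondary, each RRN fetched in `find_ge` may land on a different part of the file, unlike the clustered scan in S4 and S5.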
Section 9 # 12

S7. Of the many conjunctive attributes, select on 1 attribute (usually one involving an "="), then check the other condition(s) for each retrieved record.
SELECT NAME, CITY FROM STUDENT WHERE S#>25 and ST="NE"
(STUDENT and the B+-tree index on S# are the same as for S6.)

Secondary Index on ST
|RRN|ST|
|3  |FL|
|5  |MN|
|7  |NE|
|6  |ND|
|1,4|NJ|
|0,2|NY|

Output: BOYD|FLAX
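A sketch of S7 for this query, assuming the secondary index on ST is a dict from state to RRN list: the "=" conjunct drives the index lookup, and the remaining conjunct is checked on each fetched record. `select_conj` is a hypothetical helper.

```python
# STUDENT in RRN order: (S#, SNAME, CITY, ST).
STUDENT = [(57, "BROWN", "NY", "NY"), (32, "THAISZ", "KNOB", "NJ"),
           (17, "BAID", "NY", "NY"), (38, "GOOD", "GATER", "FL"),
           (25, "CLAY", "OUTBK", "NJ"), (20, "JOB", "MRHD", "MN"),
           (56, "BURGUM", "FARGO", "ND"), (35, "BOYD", "FLAX", "NE")]

# Secondary index on ST: state -> list of RRNs.
ST_INDEX = {"FL": [3], "MN": [5], "NE": [7], "ND": [6], "NJ": [1, 4], "NY": [0, 2]}

def select_conj(st, s_no_lower_bound):
    """WHERE S# > bound AND ST = st: use the ST index for the '=' conjunct,
    then test S# > bound on each record that was retrieved."""
    out = []
    for rrn in ST_INDEX.get(st, []):
        s_no, name, city, _ = STUDENT[rrn]
        if s_no > s_no_lower_bound:
            out.append((name, city))
    return out

ne_over_25 = select_conj("NE", 25)
nj_over_25 = select_conj("NJ", 25)
```

Only one record is fetched for ST = "NE", whereas driving the query from S# > 25 would have fetched five records before filtering.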
Section 9 # 13

S7. Of the many conjunctive attributes, select on 1 attribute (here neither condition involves "=", so take S#), then check the other condition(s) for each retrieved record.
SELECT NAME, CITY FROM STUDENT WHERE S#>38 and ST<>"NE"
(STUDENT, the B+-tree index on S#, and the Secondary Index on ST are the same as on the previous slide.)

Records retrieved through the S# index, with the ST check for each:
|NAME  |CITY |ST<>"NE"|
|GOOD  |GATER|true    |
|BURGUM|FARGO|true    |
|BROWN |NY   |true    |
Section 9 # 14

S8. INTERSECTION OF RECORD POINTERS: Intersect the RRN-sets, then retrieve the records.
SELECT NAME, CITY FROM STUDENT WHERE S#>25 and (ST="NE" or ST="NY");
(This can be done in conjunction with any of the above methods. If the RRN-sets are stored ahead of time for particular selection criteria, then they can greatly speed up the execution. The question is, which should be generated and stored?)

STUDENT
|RRN|S#|SNAME |CITY |ST|
|0  |57|BROWN |NY   |NY|
|1  |32|THAISZ|KNOB |NJ|
|2  |17|BAID  |NY   |NY|
|3  |38|GOOD  |GATER|FL|
|4  |25|CLAY  |OUTBK|NJ|
|5  |20|JOB   |MRHD |MN|
|6  |56|BURGUM|FARGO|ND|
|7  |35|BOYD  |FLAX |NE|

|S#-RRN-list|ST-RRN-list|intersection|
|1,7,3,6,0  |0,2,7      |0,7         |

S9. If there are Bitmap Indexes (BMIs), AND the bitmaps. Bit positions are RRN 0 to 7, left to right.

BMI on S#
|S#|bit-filter|
|17|00100000  |
|20|00000100  |
|25|00001000  |
|32|01000000  |
|35|00000001  |
|38|00010000  |
|56|00000010  |
|57|10000000  |
OR the bit-filters from 32 to the end (S#>25), result: 11010011

BMI on ST
|ST|bit-filter|
|FL|00010000  |
|MN|00000100  |
|NE|00000001  |
|ND|00000010  |
|NJ|01001000  |
|NY|10100000  |
OR the NE and NY bit-filters: 10100001

AND the two for the result: 10000001
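Both methods can be reproduced in a short sketch over the S# and ST columns of STUDENT. Bit-filters are kept as strings with RRN 0 leftmost, matching the slide's layout; this representation and the helper names (`value_bitmaps`, `bits_or`, `bits_and`) are assumptions for readability, since a real system would use packed machine words.

```python
from functools import reduce

# Columns of STUDENT in RRN order.
S_NO = [57, 32, 17, 38, 25, 20, 56, 35]
ST = ["NY", "NJ", "NY", "FL", "NJ", "MN", "ND", "NE"]

# S8: build the two RRN-sets and intersect them.
s_rrns = {rrn for rrn, s in enumerate(S_NO) if s > 25}
st_rrns = {rrn for rrn, st in enumerate(ST) if st in ("NE", "NY")}
rrn_hits = sorted(s_rrns & st_rrns)

# S9: one bit-filter per distinct value, bit i (from the left) = RRN i.
def value_bitmaps(column):
    maps = {}
    for rrn, value in enumerate(column):
        maps.setdefault(value, ["0"] * len(column))[rrn] = "1"
    return {value: "".join(bits) for value, bits in maps.items()}

def bits_or(a, b):
    return "".join("1" if "1" in (x, y) else "0" for x, y in zip(a, b))

def bits_and(a, b):
    return "".join("1" if x == y == "1" else "0" for x, y in zip(a, b))

S_BMI = value_bitmaps(S_NO)
ST_BMI = value_bitmaps(ST)

zero = "0" * len(S_NO)
s_filter = reduce(bits_or, (bm for s, bm in S_BMI.items() if s > 25), zero)  # S# > 25
st_filter = bits_or(ST_BMI["NE"], ST_BMI["NY"])                              # NE or NY
result = bits_and(s_filter, st_filter)
bitmap_hits = [rrn for rrn, bit in enumerate(result) if bit == "1"]
```

Both routes end at RRNs 0 and 7 (BROWN and BOYD); only those two records need to be fetched from the file.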
Section 9 # 15

S8. INTERSECTION OF RECORD POINTERS: ANDing bitmaps, then retrieve the records.
SELECT NAME, CITY FROM STUDENT WHERE S#>25 and (ST="NE" or ST="NY");
BitMapped Indexes (BMIs) are used only for "low cardinality" attributes in Data Warehouses (those with a small domain, i.e., only a few possible values).
The reason is that for low-cardinality domains (e.g., MONTH, STATE, GENDER, etc.), a BMI has few entries (rows) and each bitmap is quite dense (many 1-bits). To see why this is so, consider two extremes.
CASE-1: A GENDER attribute in a relation with 80,000 tuples. The BMI looks like:
|GENDER|bit-filter       |
|Female|0111001010100...1|
|Male  |1000110101011...0|
Each bit-filter is 80,000 bits, or 10KB, so the index is ~20KB with only two distinct values. (Note that the Male entry is unnecessary, since it can be calculated from the Female bit-filter as the bit-complement. Thus, the index is only ~10KB in size altogether.)
If a regular index were used:
|GENDER|RID-list                   |
|Female|RID-F1, RID-F2, ..., RID-Fn|
|Male  |RID-M1, RID-M2, ..., RID-Mn|
Each RID takes 8 bytes (maybe more?), so with 80,000 RIDs the size is ~640KB.
Thus the BMI size could be as low as ~10KB, while the regular index size is ~640KB.
Section 9 # 16

S8. INTERSECTION OF RECORD POINTERS: ANDing bitmaps, then retrieve records. SELECT NAME,CITY FROM STUDENT WHERE S#>25 and (ST=NE or ST=NY);
BitMapped Indexes (BMIs)
CASE-2: An SSN attribute of an employee file for a large company (say, with 80,000 employees). The BMI looks like:
|SSN        |bit-filter       |
|324-66-9870|1000000000000...0|
|...        |...              |
|687-99-2536|0000000000000...1|
(Extant domain: only the SSNs of existing employees.)
Each bit-filter is 80Kb (10KB), so the index is 80,000 * ~10KB, or ~800MB, in size.
If a regular index were used:
|SSN        |RID     |
|324-66-9870|RID1    |
|...        |...     |
|687-99-2536|RID80000|
If RIDs take 8 bytes and SSN+separators take another 12 bytes, the size is ~20 bytes * 80,000 = ~1.6MB.
Thus the BMI size would be ~800MB (~800,000KB), while the regular index size would be only ~1.6MB.
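The sizes in the two cases follow from simple arithmetic, sketched below under the slide's assumptions (80,000 rows, 8-byte RIDs, 12 bytes per SSN with separators). The function names are hypothetical, and all results are in bytes.

```python
def bitmap_index_bytes(n_rows, n_distinct):
    """One n_rows-bit filter per distinct value, at 8 bits per byte."""
    return n_distinct * n_rows // 8

def rid_index_bytes(n_rows, rid_bytes=8, key_bytes_per_row=0):
    """One RID per row, plus any key bytes stored alongside each RID."""
    return n_rows * (rid_bytes + key_bytes_per_row)

N = 80_000

# CASE-1: GENDER, 2 distinct values (key storage ignored, as on the slide).
gender_bmi = bitmap_index_bytes(N, 2)        # both bit-filters stored
gender_bmi_one = bitmap_index_bytes(N, 1)    # Male derived as the complement
gender_rid = rid_index_bytes(N)

# CASE-2: SSN, one distinct value per row.
ssn_bmi = bitmap_index_bytes(N, N)
ssn_rid = rid_index_bytes(N, key_bytes_per_row=12)
```

The bitmap index grows with rows times distinct values, while the RID index grows only with rows, which is why the comparison flips between the two cases.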
Section 9 # 17

S10. If there is a composite index on the attributes involved in the condition, use it.
S11. If there is a composite hash function, use it.
Selection implementation is a matter of choosing among these alternatives (and possibly others).
SELECTION methods when there is a WHERE disjunction (OR) in the condition:
S12. If there is no access path (indexes or hash functions), use S1 (brute force).
S13. If there are access paths, use them and UNION the results, or UNION the RID-sets and then get the records (rather than intersecting them, as in the case of an AND condition).
S14. If there are BitMaps, take the OR of the BitMaps, then get the records.
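The three disjunction methods can be sketched on the STUDENT columns used earlier. The query (ST = "NE" OR S# = 57) is a made-up example, the indexes are plain dicts, and the bitmaps are Python integers with bit i standing for RRN i; all of these are assumptions for illustration.

```python
# Columns of STUDENT in RRN order.
S_NO = [57, 32, 17, 38, 25, 20, 56, 35]
ST = ["NY", "NJ", "NY", "FL", "NJ", "MN", "ND", "NE"]

# Access paths: value -> list of RRNs.
ST_INDEX = {}
for rrn, st in enumerate(ST):
    ST_INDEX.setdefault(st, []).append(rrn)
S_INDEX = {s: [rrn] for rrn, s in enumerate(S_NO)}

def or_scan(st_value, s_value):
    """No access path: brute-force scan of every record (S1)."""
    return [rrn for rrn in range(len(S_NO))
            if ST[rrn] == st_value or S_NO[rrn] == s_value]

def or_union(st_value, s_value):
    """Access paths exist: UNION the RRN-sets, then fetch the records."""
    rrns = set(ST_INDEX.get(st_value, [])) | set(S_INDEX.get(s_value, []))
    return sorted(rrns)

def or_bitmaps(st_value, s_value):
    """Bitmaps exist: OR them, then fetch the records whose bit is set."""
    st_bits = sum(1 << rrn for rrn in ST_INDEX.get(st_value, []))
    s_bits = sum(1 << rrn for rrn in S_INDEX.get(s_value, []))
    both = st_bits | s_bits
    return [rrn for rrn in range(len(S_NO)) if (both >> rrn) & 1]

hits = or_union("NE", 57)
```

All three functions return the same RRNs; they differ only in how many records or index entries have to be touched to get there.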
Section 9 # 18