Transcript
Page 1: Query Processing and Query Optmization

Query Processing and Query Optmization

Prof. Sin-Min Lee

Department of Computer Science

San Jose State University

Page 2: Query Processing and Query Optmization

Transaction ManagementUntil now, our concept of database has been one in

which programs accessing the database are run one

at a time (serially). Often this is indeed the case.

However, there are also numerous applications in

which more than one program, or different

executions of the same program, run simultaneously

(concurrently). This is where TRANSACTION

MANAGEMENT comes in handy

Page 3: Query Processing and Query Optmization

Example #1An airline reservation system, where at one time,

several agents may be selling tickets, and therefore,

changing lists of passengers and counts of available

seats. The problem here is that if we are not careful

when we allow two or more processes to access the

database, we could sell the same seat twice. Here 2

processes that read and change the value of the same

object must not be allowed to run concurrently,

because they might interact in undesirable ways.

SAFETY

Page 4: Query Processing and Query Optmization

Example #2

Statistical database, such as census data, where

many people may be querying the database at

once. Here, as long as no one is changing the

data, we do not really care in what order the

processes read data. We can let the operating

system schedule simultaneous read requests.

Here we want to allow maximum concurrent

operations, so time can be saved

SPEED

Page 5: Query Processing and Query Optmization

Important Terms

Serializable schedule -- is a linear

arrangement of the database calls from several transactions with the property: the final database state obtained by executing the calls in schedule order is the same as the obtained by running the transactions in some unspecified serial order

Page 6: Query Processing and Query Optmization

Important Terms (cont…)

Lock -- is an access privilege on a database

object, which the DBMS grants to a particular transaction. To allow the privileged transaction to complete its work without undue interference, the lock restricts the access of competing transactions to the database object.

Page 7: Query Processing and Query Optmization

Important Terms (cont…)

Two-phase locking protocol -- is a discipline

that transactions can use to ensure serializability. The first phase is characterized by the monotonic acquisition of new locks or the strengthening of existing locks. The second phase involves the monotonic downgrade or release of existing locks. This lock is the most important technique for

managing concurrency.

Page 8: Query Processing and Query Optmization

Important Terms (cont…)

Strict two-phase locking protocol -- each transaction holds all its locks until it commits or rolls back. When the commit or rollback is secure, it releases all the locks. In other words, the shrinking phase occurs all at once, immediately after the transaction terminates.

Page 9: Query Processing and Query Optmization

Important Terms (cont…)Timestamping -- is another method for

securing concurrency in the face of conflicting transactions.

Timestamp -- is a time of origin of each

transaction given by transaction manager portion of DBMS

Items -- units of data to which access is

controlled

Page 10: Query Processing and Query Optmization

Important Terms (cont…)

Transaction -- is simply a single execution of a program. This program may be a simple query expressed in one of the query languages or an elaborate host language program with embedded calls to a query language. Several independent executions of the same program may be in progress simultaneously; each is a different transaction

Page 11: Query Processing and Query Optmization

Main Idea

To a large extent, transaction management can be seen as an attempt to make complex operations appear atomic. That is, they either occur in their entirety or do not occur at all, and if they occur, nothing else apparently went on during the time of their occurrence. The normal approach to ensuring atomicity of transactions is ‘serialization’, which forces transactions to run concurrently in a way that makes it appear that they ran one-at-a-time (serially).

Page 12: Query Processing and Query Optmization

Main Idea

In reality, transaction are sequences of more

elementary steps, such as reading or writing of

single items from the database, and

performing simple arithmetic steps in the

workspace. When concurrency control is

provided, other primitive steps are also

needed, steps which set and release locks,

commit (complete) transactions, and others

Page 13: Query Processing and Query Optmization

Items

To manage concurrency, the database must be partitioned into items, which are the units of data to which access is controlled. The nature and size of items are for the system designer to choose. In the relational model of data, for example, we could choose large items, like relations, or small items like individual tuples or even components of tuples.

Page 14: Query Processing and Query Optmization

Locks

The most common way in which access to items is controlled is by “locks.” Lock manager is the part of a DBMS that records, for each item I, whether one or more transactions are reading or writing any part of I. If so, the manager will forbid another transaction from gaining access to I, provided the type of access (read or write) could cause a conflict, such as the duplicate selling of an airline seat.

Page 15: Query Processing and Query Optmization

Locks

As it is typical for only a small subset of the

items to have locks on them at any one time,

the lock manager can store the current locks in

a lock table which consists of records

(<item>,<lock type>,<transaction> The

meaning of record (I,L,T) is that transaction T

has a lock of type L on item I.

Page 16: Query Processing and Query Optmization

Example of locks

Lets consider two transaction T1 and T2. Each accesses an item A, which we assume has an integer value, and adds one to A.

Read A;A:=A+1;Write A;-----------------------------------------------------------T1: Read A A:=A+1 Write AT2: Read A A:=A+1 Write A-----------------------------------------------------------

Page 17: Query Processing and Query Optmization

Example of locks (cont…)

The most common solution to this problem is to provide a lock on A. Before reading A, a transaction T must lock A, which prevents another transaction from accessing A until T is finished with A. Furthermore, the need for T to set a lock on A prevents T from accessing A if some other transaction is already using A. T must wait until the other transaction unlocks A, which it should do only after finishing with A.

Page 18: Query Processing and Query Optmization
Page 19: Query Processing and Query Optmization
Page 20: Query Processing and Query Optmization
Page 21: Query Processing and Query Optmization
Page 22: Query Processing and Query Optmization

Introduction

Goals:

• get the right answer to a query

• get it fast

Possible query representations (after parsing)

• relational algebra (not usually)

• query graph (usually)

Page 23: Query Processing and Query Optmization

Strategies for making relational algebra queries “better”

• Push down selects – eg.: r1(r1no, a, b, c, d) r2(r2no, x, y, z, r1no)

σ r1.a = 7 (r1 JOIN r2) (σ r1.a = 7 r1) JOIN r2

• push down projects– Πr1.a, r1.b(r1 JOIN r2) Πr1.a, r1.b(

(Πr1no,a,b (r1)) JOIN r2)

Page 24: Query Processing and Query Optmization

Strategies for making relational algebra queries “better”(cntd)

• Eliminate products– σr1.r1no=r2.r1no (r1 r2) r1 JOIN r1.r1no=r2.r1no r2

• replace– σp1 and p2 (e1) by σp1 (σp2 (e1))

Page 25: Query Processing and Query Optmization

Strategies for making relational algebra queries “better”(cntd)

• maybe there is an index to support p2 solution

eg,:

σ (sal > 10000) and (name = “elmo”) emp

do name = “elmo” first with index

test results for sal > 10000

Page 26: Query Processing and Query Optmization

Strategies for making relational algebra queries “better”(cntd)

• Join relations in increasing order of size to keep intermediate results small ( saves time)

| e1| = 1

| e2| = 200

| e3| = 100000

eg.: e1 JOIN (e2 JOIN e3) (e1 JOIN e2) JOIN e3

Page 27: Query Processing and Query Optmization

Query Processing Cost Estimation

• Catalog Statistics– r = a relation– nr = |r| = # tuples in r– sr = size of record r in bytes– V(A,r) = # unique attribute A values in r

Given these statistics,• |r s| = nrns

• size of a tuple of r s = sr + ss

Page 28: Query Processing and Query Optmization

Query Processing Cost Estimation

If ‘a’ is a constant, then

|σ r.A = a| = 1 * nr

V(A,r)

Note: histograms can be better for selectivity estimation than V(A,r) statistic.

Page 29: Query Processing and Query Optmization

Selection Predicate Selectivity Estimates

sel(r.A=a) 1 / V(A,r)

sel(r.A<a) (a-min(A,r))/(max(A,r)-min(A,r))

sel(a1 <r.A< a2) (a2 - a1) / (max(A,r) - min(A,r))

sel(p(R.a)) 1/3 guess!!

Page 30: Query Processing and Query Optmization

Join Predicate Selectivity

• What is | r1 (R1) JOIN r2 (R2)| ?

– If R1 R2 is a key of r1, then at most 1 tuple of r1 joins to each r2 tuple. Hence,

| r1 JOIN r2| | r2 |

– if R1 R2 is a key of neither r1 nor r2, then,

• assume R1 R2 = {A}

• a tuple t r1 produces approximately n r2 / V(A, r2) tuples when joined to r2.

• Using the above, | r1 JOIN r2 | (n r1n r2 )/V(A, r2)

Page 31: Query Processing and Query Optmization

Join Predicate Selectivity(cntd)

• If we had used t r2 instead of r1 to compute the above estimate, we would have gotten

| r1 JOIN r2 | (n r1 n r2 )/V(A, r1)

If V(A,r1) V(A,r2), there are likely to be some dangling tuples that don’t participate in the join. Hence, the smaller of the two (join size) estimates is probably better.

Page 32: Query Processing and Query Optmization

Maintaining Statistics

• Doing it after every update is too much overhead

• Hence, most systems gather statistics periodically during slack time ( often this is initiated manually).

Page 33: Query Processing and Query Optmization

1-table selection subquery cost

• Without index– must use a sequential scan

– C I/O = cost of an I/O 11 msec

– C CPU = cost of CPU to process 1 tuple .1msec

cost(σ p (r)) = nr C CPU + pages ( r ) * C I/O

Page 34: Query Processing and Query Optmization

1-table selection subquery cost(cntd)

with clustered index• B+ tree on r.A• p= ( r.A = a)

cost (σ p (r)) = (1/ V(A,r)) * nr C CPU +

(height (r.A B+tree) +(1/V(A,r))*pages ( r )) C I/O

Page 35: Query Processing and Query Optmization

1-table selection subquery cost(cntd)

With unclustered index

• secondary B+ tree on r.A

• p = (r.A = a)

cost (σ p (r)) = (1/ V(A,r)) * nr C CPU +

(height (r.A B+tree) +(1/V(A,r)) * nr) C I/O

Page 36: Query Processing and Query Optmization

Exercise

• What is the cost of solving

σ a1 < r.A < a2 (r)

using a clustered B+ tree on R?

Page 37: Query Processing and Query Optmization

Join StrategiesNested Loop Join ( Simple Iteration) NLJto solve r1 JOINp r2

for each t1 r1 do begin

for t2 r2 do begin

if p(t1 , t2) = true then

output(t1 , t2) as part of result endend

NOTE: outer scan is over r1

inner scan is over r2

(this could have been reversed)

Page 38: Query Processing and Query Optmization

QUIZ• Should outer scan be over biggest or smallest table?• Hint: Consider buffer pool size.

• Variation of nested loop join: block oriented iteration

• Scan inner table once per outer table block instead of outer table tuple.

Page 39: Query Processing and Query Optmization

Sort Merge Join

To do r1 JOIN r2

r1.a = r2 . b

• sort r1 on a

• sort r2 on b

• merge– scan r1 sequentially

– iterate over only needed portion of r2 for each r1 tuple

Page 40: Query Processing and Query Optmization

Sort Merge Join(cntd)

r1 x a r2 b y

a 1 1 p

b 1 2 q

c 2 2 r

d 2 3 s

e 3

cost = sort cost + constant * | r1 JOIN r2|

Page 41: Query Processing and Query Optmization

Use of an Index

• In nested loop join, if there is an index on JOIN attribute of inner table, you can use it.

• In sort-merge join, if there is a B+tree index (preferably clustered) on the join attribute(s) you can avoid sorting that table (read it in sorted order).

Page 42: Query Processing and Query Optmization

Conclusion

• Query processing• Query processing operations

– sequential scan

– index scan

– join• nested loop

• sort merge

• Query processing operation costs• Next time: Hash join

Page 43: Query Processing and Query Optmization
Page 44: Query Processing and Query Optmization
Page 45: Query Processing and Query Optmization
Page 46: Query Processing and Query Optmization
Page 47: Query Processing and Query Optmization
Page 48: Query Processing and Query Optmization
Page 49: Query Processing and Query Optmization
Page 50: Query Processing and Query Optmization
Page 51: Query Processing and Query Optmization
Page 52: Query Processing and Query Optmization
Page 53: Query Processing and Query Optmization
Page 54: Query Processing and Query Optmization
Page 55: Query Processing and Query Optmization
Page 56: Query Processing and Query Optmization
Page 57: Query Processing and Query Optmization
Page 58: Query Processing and Query Optmization

Top Related