query optimization and intro to transactions
Post on 19-Jan-2016
29 Views
Preview:
DESCRIPTION
TRANSCRIPT
Query Optimizationand Intro to
Transactions
R&G, Chapter 15Chapter 16Lecture 18
Administrivia
• Homework 3 due next Tuesday, March 20 by end of class period
• Homework 4 available on class website– Implement nested loops and hash join operators for (new!)
minibase– Due date: April 10 (after Spring Break)
• Midterm 2 is 3/22, 1 week from today– In class, covers lectures 10-17– Review will be held Tuesday 3/20 7-9 pm 306 Soda Hall
• Internships at Google this summer…– See http://www.postgresql.org/developer/summerofcode– Booth at the UCB TechExpo this Thursday:– http://csba.berkeley.edu/tech_expo.html– Contact josh@postgresql.org
Review
Query Optimizationand Execution
Relational Operators
Files and Access Methods
Buffer Management
Disk Space Management
DB
We are here
•Query plans are a tree of operators that compute the result of a query•Optimization is the process of picking the best plan•Execution is the process of executing the plan
Example
Sailors:Tuples 50 bytes long, 80 tuples/page, 500 pages Unclustered B+ and Hash on sid (key)40,000 sid values10 ratingsReserves:Tuples 40 bytes long, 100 tuples/page, 1000 pagesClustered B+ tree on bid (key), Unclustered B+ on sid100 distinct bid valuesBoatsTuples 50 bytes long, 80 tuples/page, 100 pagesUnclustered B+, Clustered Hash on color, 50 distinct color values
Select S.sid, COUNT(*) AS numberFROM Sailors S, Reserves R, Boats BWHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red” GROUP BY S.sid
Reserves
Sailors
sid=sid
Boats
Sid, COUNT(*) AS numbers
GROUPBY sid
bid=bid
Color=red
Pass 1: Best access method for each relation
• Sailors– No predicates, so File Scan is
best
• Reserves– No predicates so File Scan is
best– What about Index Scan with
B+ index on bid for Reserves?– Keep it in mind…tuples
will come out in join order…might come in handy later for a join on bid…
Sailors:Tuples 50 bytes long, 80 tuples/page, 500 pages Unclustered B+ and Hash on sid (key)40,000 sid values10 ratings
Reserves
Sailors
sid=sid
Boats
Sid, COUNT(*) AS numbers
GROUPBY sid
bid=bid
Color=red
Reserves:Tuples 40 bytes long, 100 tuples/page, 1000 pagesClustered B+ tree on bid (key), Unclustered B+ on sid100 distinct bid values
Pass 1: Best access method for each relation
• Boats– File Scan
100 I/Os– B+ Index Scan on color
2-3 + 80*100/50 =
3 + 160 I/Os = 163 I/Os– Hash Index Scan on color
1.2 + (80*100/50)/80 =1.2 + 2 I/Os
BoatsTuples 50 bytes long, 80 tuples/page, 100 pagesUnclustered B+, Clustered Hash on color, 50 distinct color values
Reserves
Sailors
sid=sid
Boats
Sid, COUNT(*) AS number
GROUPBY sid
bid=bid
Color=red
Cheapest
Always keep around just in case
Book says to keep it because of interesting order
Pass 2
• For each of the plans in pass 1:– generate left deep plans for joins-
consider different order and join methods
Reserves
Boats
bid=bid
Color=red
Sailors
sid=sid
Reserves
Reserves
sid=sid
SailorsColor=red
Reserves
Boats
bid=bid
• Question: what about SailorsXBoats?XX
Pass 2
Reserves
Boats
bid=bid
Color=red
• First consider which pass 1 plans to use for access path
Reserves
Boats
bid=bid
Color=red
1. Hash index(color) Boats, File scan Reserves
2. B+(color), File Scan Reserves
• File scan Reserves, File Scan Boats
Boats: B+ tree(color), Hash(color), File ScanSailors: File ScanReserves: File Scan
Pass 2
• Note: book also includes these plans:– File Scan Sailors (outer) with Boats (inner)– Boats hash on color with Sailors (inner)– Boats Btree on color with Sailors (inner)
• Would you agree these should be considered?
• File Scan Sailors, File Scan Reserves• B+(sid), File Scan Reserves
• File scan Reserves, File Scan Sailors
Sailors
sid=sid
ReservesReserves
sid=sid
Sailors
• First consider which pass 1 plans to use for access path
Boats: B+ tree(bid), Hash(bid), File ScanSailors: File ScanReserves: File Scan
Pass 2
Reserves
Boats
bid=bid
Color=red
• Now consider join methods and consider all access paths for inner
Reserves
Boats
bid=bid
Color=red
1. Hash index(color) Boats, File scan Reserves
2. B+(color), File Scan Reserves
• File scan Reserves, File Scan Boats
Boats: B+ tree(bid), Hash(bid), File ScanSailors: File ScanReserves: File Scan
Reserves:Clustered B+ tree on bid (key), Unclustered B+ on sidBoatsUnclustered B+, Clustered Hash on color, 50 distinct color values
If you replace file scan of Reserves with B+ index scan, Index Nested Loops is a good choice!
Sort-Merge could also be a good choice becauseTuples come out in bid-order.
Pass 2
• File Scan Sailors, File Scan Reserves• B+(color), File Scan Reserves
• File scan Reserves, File Scan Sailors
Sailors
sid=sid
ReservesReserves
sid=sid
Sailors
• Now consider join methods
Boats: B+ tree(bid), Hash(bid), File ScanSailors: File ScanReserves: File Scan
Exercise: Which plans would you keep for this set of joins?
Pass 3 and beyond
• For each of the plans retained from Pass 2, taken as the outer, generate plans for the next join– For example, let’s take the sort-merge plan for
BoatsxReserves and add in Sailors:
Sailors
sid=sid
B+ index(sid) Reserves
Sort-merge
Reserves
Boats
bid=bid
Color=red
Hash index(color) Boats
B+ index(bid) Reserves
Sort-merge
From pass 2
Note that this plan will produce tuples in sid order; interesting order for the upcoming GROUP BY.
• GROUP BY, ORDER BY, AGGREGATES are all considered after the join plans are chosen.
Nested Queries (Subqueries)
SELECT S.snameFROM Sailors SWHERE EXISTS (SELECT * FROM Reserves R WHERE R.bid=103 AND R.sid=S.sid)
SELECT S.snameFROM Sailors SWHERE EXISTS (SELECT * FROM Reserves R WHERE R.day = ’01/31/07’)
• Nested queries are parsed into their own ‘query block’ and optimized separately.
• Two kinds: Uncorrelated
Correlated
• An uncorrelated subquery can be computed once per query.
-> Optimizer can choose a plan for the subquery and add temp operator to ‘cache’ the subquery results for use in the rest of the query
Subquery plan
TEMP
Plan for outer query
Nested Queries
• Optimizer chooses a plan for subquery, and treats it as a subroutine to be invoked once per tuple produced by the outer query block.
SELECT S.snameFROM Sailors SWHERE EXISTS (SELECT * FROM Reserves R WHERE R.bid=103 AND R.sid=S.sid)
Nested block to optimize: SELECT * FROM Reserves R WHERE R.bid=103 AND S.sid= outer value
Correlated
Plan for outer query
Subquery plan
Invoked once per tuple produced by outer query block
Nested Queries
• Sometimes correlated subqueries can be rewritten as a non-correlated query– When rewriting, need to be careful to preserve semantics
SELECT S.snameFROM Sailors SWHERE EXISTS (SELECT * FROM Reserves R WHERE R.bid=103 AND R.sid=S.sid)
Equivalent non-nested query:SELECT DISTINCT S.snameFROM Sailors S, Reserves RWHERE S.sid=R.sid AND R.bid=103
Correlated
Returns at most 1 tuple per sailor; need to add DISTINCT to ensure same semantics in rewritten query
• Rewritten query gives the optimizer more choices for optimization, and it is more likely to choose a good plan
Taking it one step further…
• My area of research is data integrationThis is a very common case in the world today:
Query Optimizationand Execution
Relational Operators
Files and Access Methods
Buffer Management
Disk Space Management
DB
DB21
DB22
Oracle PostgresOther DBsFilesSpreadsheets….
Query over multiple DBMS’sQuery over multiple DBMS’s What if I could use the query planning and execution of a DBMS to query over multiple, different DBMSs, and other data sources as well?
Taking it one step further…• Heterogenous Data Integration leads to all sorts of
fun with optimization …– Compensation becomes possible…
• You can do complex SQL queries over simple data sources…even if a spreadsheet can’t do joins, the DBMS optimizing the query can!
– Knowing what other sources can do becomes a factor• Oracle, DB2, Postgres all support slightly different versions of SQL• If the data source isn’t a DBMS, how do I know what operations it
can perform (e.g. can a spreadsheet do projects? Filters? Joins?)
– Where to do the work becomes a factor…• e.g., Should I always push as much of query as possible to another
DBMS?
– Network cost becomes a factor…• When do we ship tuples from original source to DBMS processing
the complete query?
Summary
• Query optimization is an important task in a relational DBMS.
• Must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
• Two parts to optimizing a query:– Consider a set of alternative plans.
• Must prune search space; typically, left-deep plans only.
– Must estimate cost of each plan that is considered.• Must estimate size of result and cost for each plan node.• Key issues: Statistics, indexes, operator implementations.
Transaction Management Overview
R & G Chapter 16
There are three side effects of acid. Enhanced long term memory, decreased short term memory, and I forget the third.
- Timothy Leary
Query Compiler
query
Execution Engine Logging/Recovery
LOCK TABLE
Concurrency Control
Storage ManagerBUFFER POOLBUFFERS
Buffer Manager
Schema Manager
Data Definition
DBMS: a set of cooperating software modules
Transaction Manager
transaction
Components of a DBMS
Concurrency Control & Recovery
• Very valuable properties of DBMSs– without these, DBMSs would be much less
useful• Based on concept of transactions with
ACID properties (yep…they’re baa-aack…)
• Concurrent execution of independent transactions– utilization/throughput (“hide” waiting for I/Os.)– response time– fairness
• Example:
Statement of Problem
t0:t1:t2:t3:t4:t5:
T1:tmp1 := read(X)
tmp1 := tmp1 – 20
write tmp1 into X
T2:
tmp2 := read(X)
tmp2 := tmp2 + 10
write tmp2 into X
T2 wins, but if there T2 wins, but if there was a slight delay, was a slight delay, maybe T1 would winmaybe T1 would win-> DBMS wants to -> DBMS wants to ensure a predictable ensure a predictable outcome for outcome for concurrent usersconcurrent users
Statement of problem (cont.)
• Arbitrary interleaving can lead to – Temporary inconsistency (ok, unavoidable)– “Permanent” inconsistency (bad!)
• Need formal correctness criteria.
Definitions
• A program may carry out many operations on the data retrieved from the database
• However, the DBMS is only concerned about what data is read/written from/to the database.
• transaction - a sequence of read and write operations (read(A), write(B), …)– DBMS’s abstract view of a user program
Correctness criteria: The ACID properties
• AA tomicity: All actions in the Xact happen, or none happen.
• CC onsistency: If each Xact is consistent, and the DB starts consistent, it ends up consistent.
• II solation: Execution of one Xact is isolated from that of other Xacts.
• D D urability: If a Xact commits, its effects persist.
Atomicity of Transactions
• Two possible outcomes of executing a transaction:– Xact might commit after completing all its actions– or it could abort (or be aborted by the DBMS) after
executing some actions.
• DBMS guarantees that Xacts are atomic. – From user’s point of view: Xact always either
executes all its actions, or executes no actions at all.
AA
Mechanisms for Ensuring Atomicity
• Main approach: LOGGING– DBMS logs all actions so that it can undo the
actions of aborted transactions.
• Logging used by modern systems, because of need for audit trail and for efficiency reasons.
AA
Transaction Consistency
• “Consistency” - data in DBMS is accurate in modeling real world and follows integrity constraints
• User must ensure transaction consistent by itself– I.e., if DBMS consistent before Xact, it will be after
also
consistent database
S1
consistent database
S2
transaction T
•Key point:
CC
Transaction Consistency (cont.)
• Recall: Integrity constraints– must be true for DB to be considered consistent– Examples:
1. FOREIGN KEY R.sid REFERENCES S2. ACCT-BAL >= 0
• System checks ICs and if they fail, the transaction rolls back (i.e., is aborted).– Beyond this, DBMS does not understand the
semantics of the data.– e.g., it does not understand how interest on a
bank account is computed
CC
Isolation of Transactions• Users submit transactions, and • Each transaction executes as if it was running by
itself.– Concurrency is achieved by DBMS, which interleaves
actions (reads/writes of DB objects) of various transactions.
• Many techniques have been developed. Fall into two basic categories:– Pessimistic – don’t let problems arise in the first
place– Optimistic – assume conflicts are rare, deal with
them after they happen.
II
Example• Consider two transactions (Xacts):
T1: BEGIN A=A+100, B=B-100 ENDT2: BEGIN A=1.06*A, B=1.06*B END
• 1st xact transfers $100 from B’s account to A’s
• 2nd credits both accounts with 6% interest.• Assume at first A and B each have $1000.
What are the legal outcomes of running T1 and T2???• $2000 *1.06 = $2120
• There is no guarantee that T1 will execute before T2 or vice-versa, if both are submitted together. But, the net effect must be equivalent to these two transactions running serially in some order.
II
Example (Contd.)• Legal outcomes: A=1166,B=954 or A=1160,B=960• Consider a possible interleaved schedule:
T1: A=A+100, B=B-100 T2: A=1.06*A, B=1.06*B
This is OK (same as T1;T2). But what about:
T1: A=A+100, B=B-100 T2: A=1.06*A, B=1.06*B
• Result: A=1166, B=960; A+B = 2126, bank loses $6
• The DBMS’s view of the second schedule:T1: R(A), W(A), R(B), W(B)T2: R(A), W(A), R(B), W(B)
II
Formal Properties of Schedules
• Serial schedule: Schedule that does not interleave the actions of different transactions.
• Equivalent schedules: For any database state, the effect of executing the first schedule is identical to the effect of executing the second schedule.
• Serializable schedule: A schedule that is equivalent to some serial execution of the transactions.
(Note: If each transaction preserves consistency, every serializable schedule preserves consistency. )
II
top related