the behind declarative -...

41
The behind declarative query processing NAME Program Year LAB ADVISOR RESEARCH TOPIC GATE RANK Permanent Address Anshuman Dutt Ph.D. (CSA) 2010 Database Systems Lab Prof. Jayant Haritsa Robust Query Processing 13 Muzaffarnagar U.P. Select * From summer_school as SS, departments as D, research_details as R, student_details as SP Where SS.speaker_id = D.student_id and ….. SS.talk like ‘%Magic Behind%’ Who is speaker for this talk ? Looks like a student here … student_details IISc Academic DB departments research_details summer_school

Upload: lethu

Post on 13-Feb-2018

213 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

The behind declarative query processing

NAME Program Year LAB ADVISOR RESEARCH TOPIC

GATE RANK

Permanent Address

Anshuman Dutt

Ph.D. (CSA)

2010 Database Systems Lab

Prof. Jayant Haritsa

Robust Query Processing

13 Muzaffarnagar U.P.

Select * From summer_school as SS, departments as D, research_details as R, student_details as SP Where SS.speaker_id = D.student_id and ….. SS.talk like ‘%Magic Behind%’

Who is speaker for this talk ?

Looks like a student here …

student_details

IISc Academic DB

departments

research_details

summer_school

Page 2: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Querying in RDBMS

GATE rank required to get into CSA ?

Select max(GATE rank) From student_details Where dept = ‘CSA’ and year = 2012

student_details

placements

Max(GATE rank) 58

IISc Academic DB

departments

Aspiring Student

10 July 2013 2/37

Page 3: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Querying in RDBMS

How much salary package would I get

after M.E. (CSA) ?

Select avg(salary) From student_details S, placements P Where S.Student ID = P.Student ID and dept = ‘CSA’

student_details

placements

avg(salary) 20,00,000

IISc Academic DB

departments

Aspiring Student

10 July 2013 3/37

Page 4: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

I know what I want (admission/query result)

but the problem is how to get it

OR Ask for professional

advice

Mr. Query Optimizer

Learn and use MAGIC

The Magic / Planning All plans give the correct answer,

only difference is time taken

10 July 2013 4/37

Page 5: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Query Processing Steps

Query Parser (SYNTAX)

Query Optimizer

Code Generator

Executor

User Queries

Validator (SEMANTICS)

Specify WHAT we want as

result

Parser - checks for spelling

errors as well as the SQL

keywords

Validator - checks whether

the relations, attributes used

in the query are VALID

Optimizer – finds HOW to

execute the query efficiently to

get the results

… self explanatory

Supposed to give us the Most

Efficient Plan

10 July 2013 5/37

Page 6: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Query Complexity • Various cases:

– Single-relation query

– Multiple-relation query

Select max(GATE rank) From students_details Where dept = ‘CSA’

Select avg(salary) From student_details S, placements P Where S.Student ID = P.Student ID and dept = ‘CSA’

10 July 2013 6/37

Page 7: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Single relation : What is there to plan about ?

• For single relation queries

– We need to read the relation from the database/stored on disk

– Can be read in two ways

• Scan all the tuples of the relation directly and for each tuple – do

the processing (predicate evaluation)

• Scan the relation INDIRECTLY (index knows pages having CSA

tuples) and do processing of selected tuples

Relation

Seq scan/Index scan

10 July 2013 7/37

Page 8: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Seq Scan Operator

Relation student_details

Select max(GATE rank) From student_details Where dept = ‘CSA’

10 July 2013 8/37

Page 9: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Index Scan Operator

Index scan - preferable if significantly less number of tuples need to be accessed

Remember: Sequentially accessing disk pages is cheaper than random access

Index on the dept column

Select max(GATE rank) From student_details Where dept = ‘CSA’

Relation student_details

10 July 2013 9/37

Page 10: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Two-Relation Queries

• Decide the scan operator for each of the relations

• Match the two sets of tuples for equality of student ID to get

the answer (this operation is termed as Join)

Select max(salary) From Students S, Placements P Where S.Student ID = P.Student ID and dept = ‘CSA’

Students

Scan

Placement

Scan

JOIN S.Student ID = P.Student ID

10 July 2013 10/37

Page 11: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Logical Plan vs Physical Plan

• Different ways of Scanning

– Sequential Scan

– Index Scan

• Different ways of joining

– Nested Loop join

– Sort Merge Join

– Hash based Join

Students

Scan

Placement

Scan

Join

Students

Seq Scan

Placement

Index Scan

Nested Loop Join

10 July 2013 11/37

Page 12: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Nested Loop Join

Student_ID Dept

1 CSA

2 EC

3 CSA

… …

999 CSA

1000 SERC

Student_ID Salary

1000 4500000

365 5000000

2 500000

… …

1 9000000

653 370000

student_details placements

10 July 2013 12/37

Page 13: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Nested Loop Join

Student_ID Dept

1 CSA

2 EC

3 CSA

… …

999 CSA

1000 SERC

Student_ID Salary

1000 4500000

365 5000000

2 500000

… …

1 9000000

653 370000

Can also utilize index on placements(student_id)

(if present) student_details placements

10 July 2013 13/37

Page 14: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Sort-Merge Join

Student_ID Dept

1 CSA

2 EC

3 CSA

… …

999 CSA

1000 SERC

Student_ID Salary

1000 4500000

365 5000000

2 500000

… …

1 9000000

653 370000

student_details placements

10 July 2013 14/37

Page 15: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Sort-Merge Join

Student_ID Dept

1 CSA

2 EC

3 CSA

… …

999 CSA

1000 SERC

student_details placements

Student_ID Salary

1 9000000

2 500000

3 9500000

… …

999 10000000

1000 8500000

Sort the relation(s) on join column

10 July 2013 15/37

Page 16: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Sort-Merge Join

Student_ID Dept

1 CSA

2 EC

3 CSA

… …

999 CSA

1000 SERC

student_details placements

Student_ID Salary

1 9000000

2 500000

3 9500000

… …

999 10000000

1000 8500000

10 July 2013 16/37

Page 17: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Sort-Merge Join

Student_ID Dept

1 CSA

2 EC

3 CSA

… …

999 CSA

1000 SERC

student_details placements

Student_ID Salary

1 9000000

2 500000

3 9500000

… …

999 10000000

1000 8500000

10 July 2013 17/37

Page 18: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Hash Join

Student_ID Dept

1 CSA

2 EC

3 CSA

… …

999 CSA

1000 SERC

Student_ID Salary

1000 4500000

365 5000000

2 500000

… …

1 9000000

653 370000

student_details placements

Build a Hash Table

HashKey (Student_id, row_id)

0 (1000,1), (365, 2) ….

1 (1,999), ….

2 (2,3) …

3 (653, 1000) …

4 …

10 July 2013 18/37

Page 19: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Hash Join

Student_ID Dept

1 CSA

2 EC

3 CSA

… …

999 CSA

1000 SERC

Student_ID Salary

1000 4500000

365 5000000

2 500000

… …

1 9000000

653 370000

student_details placements

HashKey (Student_id, row_id)

0 (1000,1), (365, 2) ….

1 (1,999), ….

2 (2,3) …

3 (653, 1000) …

4 …

Probe using key

10 July 2013 19/37

Page 20: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Multi-relation Query

• Join of Relations R1, R2, R3, ..., Rn

• Decide join order (joined two at a time)

• Decide a(ri) (access method for Rj)

• Decide ji (join method)

C D B A

bushy

B A

C

D

left-deep

A B

C

D

right-deep

10 July 2013 20/37

Page 21: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Queries Over Multiple Relations

• As the number of joins increases,

– number of alternative plans grows rapidly

– need to restrict the search space and process it efficiently.

• Once first k relations are joined, method to join the composite to (k+1)st relation

– is independent of the order of joining first k

Dynamic Programming

10 July 2013 21/37

Page 22: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Example Search Space Reduction

• Consider r1 r2 r3. There are 12 different join orders:

r1 (r2 r3)

r1 (r3 r2)

(r3 r2) r1

(r2 r3) r1

r2 (r1 r3)

r2 (r3 r1)

(r3 r1) r2

(r1 r3) r2

r3 (r1 r2)

r3 (r2 r1)

(r2 r1) r3

(r1 r2) r3

10 July 2013 22/37

Page 23: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

• To find best join order for (r1 r2 r3) r4 r5 ,

– 12 join orders for computing r1 r2 r3,

– 12 join orders for computing this result with r4 and r5.

– there appear to be 144 join orders to examine.

– find the best join order for the subset of relations {r1, r2, r3}

– use only that order for further joins with r4 and r5.

– instead of 144 choices

– only 12 + 12 = 24 choices.

Example Search Space Reduction

10 July 2013 23/37

Page 24: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Comment on Search Complexity • Consider finding the best join-tree for r1 r2 . . . rn.

• There are (2(n – 1))! / (n – 1)! different join-trees:

– n = 5, number is 1680; n = 7, number is 665280

– n = 10, the number is greater than 176 billion!

• No need to generate all join-trees.

• Using DP, the best join order for any subset of {r1, r2, . . . rn} is

computed only once and stored for future use.

• This reduces time complexity to around O(3n)

– n = 10, number is ~59000.

10 July 2013 24/37

Page 25: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Comparing plans (cost estimation)

• How do we know which plan is better/more efficient

– by estimating the amount of work to be done

• For each plan considered:

– estimate cost of each operation in plan tree and take the summation

– Cost = w1 * Page Fetches from disk (I/O work)

+ w2 * # of tuples processed by CPU (processing work)

– Need to estimate size of result (cardinality) for each operation in tree.

– # tuples (NTuples) and # pages (NPages) for each relation.

– # pages (NPages) for each index.

10 July 2013 25/37

Page 26: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Cardinality Estimation

• Consider a query:

• Maximum # tuples in result is the product of the cardinalities of

relations in the FROM clause.

• Reduction factor (RF) associated with each term reflects the impact

of the term in reducing cardinality from underlying relations.

SELECT attribute list FROM relation list WHERE term1 AND ... AND termk

10 July 2013 26/37

Page 27: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Big Picture

RDBMS

Statistical Metadata

(Execution strategy – Plan)

students

5 x 104 registration

2 x 105

SeqScan 4 x 103

SeqScan 3 x 104

Hash Join (120)

(SQL version)

select count(S.student_id) from registration R, students S where R.student_id = S. student_id and R.cname IN (‘DBMS’,’OS’, …) and S.program IN (‘Ph.D.’, ‘M.S’) Query Optimizer

Base-relation RF 60% 2%

Join RF 0.0001%

# of rows

Reduction Factor: # of rows satisfying the predicate

Number of research students in CSA systems courses

10 July 2013 27/37

Page 28: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Big Picture

RDBMS

Statistical Metadata

(Execution strategy – Plan)

students

5 x 104 registration

2 x 105

SeqScan 4 x 103

SeqScan 3 x 104

Hash Join (120)

(SQL version)

select count(S.student_id) from registration R, students S where R.student_id = S. student_id and R.cname IN (‘DBMS’,’OS’, …) and S.program IN (‘Ph.D.’, ‘M.S’) Query Optimizer

Base-relation RF 60% 2%

Join RF 0.0001%

# of rows

Reduction Factor: # of rows satisfying the predicate

Number of research students in CSA systems courses

10 July 2013 28/37

Page 29: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Big Picture

RDBMS

Statistical Metadata

(Execution strategy – Plan)

students

5 x 104 registration

2 x 105

SeqScan 4 x 103

SeqScan 3 x 104

Hash Join (120)

(SQL version)

select count(S.student_id) from registration R, students S where R.student_id = S. student_id and R.cname IN (‘DBMS’,’OS’, …) and S.program IN (‘Ph.D.’, ‘M.S’) Query Optimizer

Base-relation RF 60% 2%

Join RF 0.0001%

# of rows

Reduction Factor: # of rows satisfying the predicate

Number of research students in CSA systems courses

10 July 2013 29/37

Page 30: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Issue 1: Plan Space Construction • Operator Algorithms:

– Access: Sequential, Index

– Join: Nested Loop, Sort-merge, Hash based

• Example (recommendation of first paper)

– Only left-deep plans are considered

• Left-deep plans allow output of each operator to be pipelined

into the next operator without storing it in a temporary relation.

– Cartesian products are avoided

10 July 2013 30/37

Page 31: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Issue 2: Search/Enumeration Algorithm

• Dynamic Programming (DP)

– practical for n 10

• Other choices are

– Top down/transformation based algorithm

– Randomized algorithm

– Sampling based algorithm

10 July 2013 31/37

Page 32: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Issue 3: Cost Modeling

• Weighted combination of CPU and I/O costs

– disk resident database

– Memory resident database

• Simple counts of relations and indexes maintained in system

catalogs

– Wide variety of metadata structures have been proposed

– No easy way for intermediate cardinality estimation

10 July 2013 32/37

Page 33: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Some more Issues

• Can we exploit parallelization during query optimization ?

• Multi-query optimization

– Simultaneous optimization of multiple queries

• Adaptive/Robust Query Processing

– estimated RF values are often far from actual values

10 July 2013 33/37

Page 34: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Design Framework

• Issues

– For a given query, what plans are considered (i.e. plan space)?

– How to efficiently search through plan space for “cheapest” plan?

– How is the cost of a plan estimated (i.e. cost model, statistics)?

– Parallelization, Multi-query optimization, Robust Query Processing …

10 July 2013 34/37

Page 35: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Few selected papers in Query-Optimization

• 1979: Access Path Selection in RDBMS

• 1984: Implementation techniques for main memory DB

• 1988: Multiple query optimization

• 1989: Optimization of large join queries (randomization algos)

• 1990: Measuring Complexity of Join enumeration

• 1991: Analysis of Strategy spaces and its implications(Left

Deep vs Bushy)

TOP-Conferences: SIGMOD VLDB ICDE

10 July 2013 35/37

Page 36: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

• 1991: On propagation of errors in join result sizes

• 1994: Dynamic query evaluation plans

• 1997: Parametric query Optimization

• 1998: Efficient mid-query re-optimization

• 2002: Least Expected Cost Optimization

• 2004: Robust Query processing through Progressive Optimization

Few selected papers in Query-Optimization

10 July 2013 36/37

Page 37: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

• 2005: Proactive re-optimization

• 2007: Adaptive Join reordering at run-time

• 2007: Optimal Top-down join enumeration

• 2008: Dynamic Programming Strikes back

• 2008: Parallelizing Query Optimization

• 2009: Query Opt- time to rethink the contract

2002: Plan Selection based on Query Clustering

2005: Analyzing Plan Diagrams of Database Query Optimizers

2007: On the Production of Anorexic Plan Diagrams

2008: Identifying Robust plans through Plan-diagram reduction

2010: On the stability of Plan costs and cost of plan stability 2010: The Picasso Database Query Optimizer Analyzer 2012: Peak Power Plays in Databases

Few selected papers in Query-Optimization

DSL Contributions

10 July 2013 37/37

Page 38: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

DSL – Research Projects A

ssociation Rule M

ining

Real Time Database Systems Mining Data Cubes

Priv

acy

pres

ervi

ng D

ata

min

ing

Biological Databases

Analyzing Query Optimizers

Energy-aware Query Processing

PICASSO

CODD

10 July 2013 38/37

Page 39: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Robust Query Processing

• Plan-choice may turn out to be very-inefficient

– Due to Estimation Errors:

• outdated statistics

• attribute-value independence(AVI) assumption

• insufficient data distribution details

• multiplication of errors

Query Optimizer (Dynamic Programming)

SQL query

Cost

Optimal plan

Cost: represents the estimated

time to execute the plan

Statistical

Metadata

Estimated

RFs

(base +

join)

Long standing problem

(> 30 years)

Plan Bouquet: Query Processing without Estimation (submitted to VLDB)

10 July 2013 39/37

Page 40: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Research in Databases

• Fine combination of Theory and Systems

– Simulation

– Operating System

– Real time system

– Computer Networks

– Data mining techniques

– Clustering techniques

– Linear Algebra (Matrix theory)

– Probability, Probabilistic Graphical models

– Randomized algorithms, Online algorithms

– Statistical Inference

– Algebraic structures PODS, ICDT …

10 July 2013 40/37

Page 41: The behind declarative - events.csa.iisc.ernet.inevents.csa.iisc.ernet.in/summerschool2013/slides/magic-behind... · 2010 Database Systems ... placements avg(salary) 20,00,000 IISc

Thanks for your attention Queries ??

Plan Space

Candidate plans Most efficient plan

Search process