1 adaptive query processing | dec 22, 2005 adaptive query processing amol deshpande, university of...

1 Adaptive Query Processing | Dec 22, 2005

Adaptive Query Processing

Amol Deshpande, University of Maryland

Vijayshankar Raman, IBM Almaden Research Center


Goal

Research in query processing and optimization has taken a new track into adaptive query processing (QP)– Most new research projects that need QP use some adaptive approach

• Whether regular QP, XML QP, text, continuous QP, web QP, …

• Selinger-style query processing fallen out of flavor

– Also true in industrial projects, especially new ones.

Educate audience on what this is all about– Where adaptive QP works, where it has known problems, what is unknown

What is Out of Scope of this tutorial– Parallelism, distributed eddies, continuous query processing…


Agenda

Background on non-adaptive Query Processing– History, where it works, and where it breaks

Adaptive Query Processing– Dimensions of Adaptivity

– Evolutionary Adaptive query processing

• Late binding and parametric optimization

• Competition

• Mid-query reoptimization

– Tuple-wise adaptive Query Processing

• Eddies, SteMs, STAIRs

Commentary– What works and what fails

– Open problems


Background on Non-Adaptive Query Processing

Central value proposition of relational model: declarative queries

– You decide what data access (query) you want to do; the DBMS will figure out the “right” algorithm (plan)

– “right” usually translated to “good enough”

Standard method of realizing this: cost-based query optimization


Cost-based Query Optimization

DeclarativeQuery

Results

Query Optimizer

Query Executor

CompiledQuery Plan

Disk(s)


Non-Adaptive Query Processing

Several Improvements over the years

Wider plan space– Started out with left deep tree of binary join operators – Now:

• lots more operators• Many more query transformations• multi-block SQL• bushy plans• …

– Plan space has grown to be huge and messy

Use of statistics to get better costs

Better dynamic programming techniques (transformational, memoization)

Good surveys exist: [Graefe 1993], [Chaudhuri 1998]


Success of Non-Adaptive QP

Used pretty much everywhere– every commercial DBMS uses some form of cost-based optimization

– Why? • Plan space is too wide and combinatorial• Other ideas like randomized optimization simply don’t search widely enough

(e.g. all possible ways of using indexes)

Benefits from 25 years of tuning

Widely viewed as success story for relational databases, and as vindication of relational model– Some qualifications

• DBAs love and demand control (hints, syntax-direction, plan-fixing across releases, …)

• Applications often help the query processor by splitting a query into multiple phases (precursor of adaptivity)


Where does traditional optimization break (1)

DeclarativeQuery

Results

Query Optimizer

Query Executor

Disk(s)

In recent years, optimizers have started to break at the seams as they are extended to new environments

SELECT p.profile_id, p.account_id, p.profile_title, p.profile_default, r.profile_detail_id AS registrant_detail_id, r.profile_first_name AS registrant_first_name, r.profile_last_name AS registrant_last_name, r.profile_job_title AS registrant_job_title, r.profile_organization_name AS registrant_organization_name, r.profile_address1 AS registrant_address1, r.profile_address2 AS registrant_address2, r.profile_city AS registrant_city, r.state_id … FROM TABLE_ACCOUNT_PROFILES p LEFT JOIN TABLE_ACCOUNT_PROFILE_DETAILS r ON p.profile_id = r.profile_id AND r.profile_type_id = 1 LEFT JOIN TABLE_STATES rs ON r.state_id = rs.state_id LEFT JOIN TABLE_COUNTRIES rc ON …

very complex user queries



DeclarativeQuery

Results

Query Optimizer

Query Executor

Disk(s)

In recent years, optimizers have started to break at the seams as they are extended to new environments

SELECT p.profile_id, p.account_id, p.profile_title, p.profile_default, r.profile_detail_id AS registrant_detail_id, r.profile_first_name AS registrant_first_name, r.profile_last_name AS registrant_last_name, r.profile_job_title AS registrant_job_title, r.profile_organization_name AS registrant_organization_name, r.profile_address1 AS registrant_address1, r.profile_address2 AS registrant_address2, r.profile_city AS registrant_city, r.state_id … FROM TABLE_ACCOUNT_PROFILES p LEFT JOIN TABLE_ACCOUNT_PROFILE_DETAILS r ON p.profile_id = r.profile_id AND r.profile_type_id = 1 LEFT JOIN TABLE_STATES rs ON r.state_id = rs.state_id LEFT JOIN TABLE_COUNTRIES rc ON …

very complex user queries

Netw

ork

wide area federations

data streamsX M L



Bad Statistics

– DBA-challenged environment is the norm

– New data types like XML and text

– Deep Web and federated systems

– Correlations (many optimizers implement only marginal histograms)

Very dynamic environments

– Continuous queries – query runs for ever, so things change

– Wide area federations



Complex queries

– Large numbers of tables (usually bail to greedy beyond 10-20 tables in a join)

Queries with Parameter markers

– Select average(salary) from Employee where age > $1 and age < $2

Need for sharing work between queries – esp. in continuous queries

Asynchronous (push) data sources (e.g. data streams)

New Metrics for QP

– E.g., Interactive metric

• User preferences change too quickly for optimization to pay off

• Pipelined plans: focus on early results


Agenda





• Competition





– Open problems


Idea of Adaptive Query Processing

Interleave the optimization and execution stages– We learn better statistics as the query executes

Two main styles

Static Late Inter Intra PerPlans Binding Operator Operator Tuple

Traditional Dynamic QEP Query Scrambling Xjoin, DPHJ, EddiesDBMS Parametric Mid-query Reopt. Convergent QP Competitive Progressive Opt.

Evolutionary Revolutionary


Idea of Adaptive Query Processing

Interleave the optimization and execution stages– We learn better statistics as the query executes

Two main styles





Dimensions of Adaptation

Measure

Analyze

Plan

Actuate

Measure: monitor cardinalities, resource utilization, etc Analyze: verify compliance with performance goals

(very tricky problem during intermediate stages of a query) Plan: can involve optimizer, or be a routing policy Actuate: switch to a new query plan, hopefully without wasting too much

work


Dimensions of Adaptation

Measure

Analyze

Plan

Actuate

Traditional database system use a simple form of inter-query adaptation Measurements done on the tables directly, and analyzed during query

optimization time Also, by measuring the cardinalities during execution:

– Adaptive selectivity estimation [Chen/Roussopoulos 1994]– LEO [Volker et al 2001]– SITS [Bruno/Chaudhuri 2002]

But always a single plan is used to execute the query fully


Agenda





• Competition





– Open problems


Late Binding and Parametric Optimization

Idea: let optimizer pick multiple plans for each query

– Choose the best one just before execution

Plan

Pick one and Actuate

PlanPlans

Q



Parametric optimization [Graefe/Cole 1994, Ioannidis 1992, Ganguly 1998, etc.]

– Pick optimal plans for various regions of the parameter spaces

– Do this during dynamic programming itself

• Using plausible assumptions about cost functions

Challenges:– makes optimization too expensive

– Many plausible assumptions don’t hold in practice (see, e.g., [Haritsa 2005])

– Cannot address problems that 1st crop up during execution

Plan 1

Plan 2

Plan 4

Plan 3

Plan 5

selectivity 1

selectivity 2



Variation: Validity Ranges ([Markl/Raman/Lohman/Piraheh/Simmens 2005])

Attach to each edge of a query plan, a range of cardinalities within which plan is optimal

Easier to compute during dynamic programming optimizer– Use numerical root finding techniques; works well with real cost functions

– Extends nicely to more general notions of “robust query optimization”

• e.g., range within which plan is within 50% of optimal (Raman/Haas/Reder/Markl 2005).

But, Limited effectiveness– only tells you if a plan is broken, not how to fix it

Meshes well with mid-query reoptimization


Competition

E.g. DEC RDB [Antonshenkov 1993]: run access methods competitively Measure: monitor cardinalities, costs of access methods Analyze: decide which access method has higher rate

Advantage: simple method Challenge: avoid redundant work, avoid duplicates

– Use a shared hash table & global timestamp [Raman/Deshpande/Hellerstein 2003]

– Needs more evaluation of overheads

Measure

Analyze

Plan

Actuate a few

PlanPlans Q

Pick faster one

Run for a while


Mid-Query Reoptimization

Kabra et al 1998, Markl et al 2005, Babu et al 2005 Switch query execution plans at well-defined points

– Typically, materialization points

Advantage: seems to work well in practice Independent implementations

Challenge: coarse grained adaptation Cannot always reuse work done in prior executions

Measure

Analyze

New Plan

Actuate

QPlan

If plan is sufficiently sub-optimal, switch to new plan


Measure: monitoring cardinalities

During execution, monitor cardinalities on each edge of query plan– E.g., each call to getNext() increments count

– must refine for operators that are executed multiple times(e.g., inner of nested loops joins) [e.g. Stillger/Markl/Lohman 2001]

As we monitor, we narrow down range of possible values for true cardinality (primarily the lower bound)

Can extrapolate to expected value of true cardinality– E.g.:

– Expected value of c2 = expected value of c1 * (current c2) / (current c1)

opc1 c2


Analyze: Switching plans Mid-Query

Re-optimize the original query to form a new plan Getting a better plan:

– Plug in all cardinality information learned during this query Reusing work:

– Plug in fully materialized relations from current executionas materialized views

Materialization points

– Sorts

– Inner of hash joins

– Explicit materializations (uncorrelated inners, common subexpressions, …)

ST

R

TS

re-optimize


Analyze: when to switch plans

When no tuples have been output, and

When current plan is sufficiently sub-optimal, and – Range of true cardinalities on an edge has no overlap with

validity range for that edge

– Or, expected true cardinality falls far outside of validity range

Cost of reoptimization > cost of new plan – remaining cost of current plan– Hard to gauge this accurately; many variables

• How many intermediate results we will succeed in reusing

• how much better is the new plan

– Relatively easy to gauge this if we switch at a materialization point

• Because cost of reoptimization is negligible


Query Scrambling

Amsaleg/Franklin/Tomasic/Urhan 1996, Urhan/Franklin/Amsaleg 1998 Designed for coping with delays in a wide area setting

Network

Query Processor

Wide Area Data Sources

Measure

Analyze

New Plan

Actuate

QPlan

Measure delays

Response timeoptimization


Agenda







– Open problems


Eddies [Avnur/Hellerstein 2000]

Measure

Analyze

Plan

Actuate

Eddies combine all the dimensions of adaptivity into a single operator

Eddy


Eddies [Avnur/Hellerstein 2000]

A traditional query plan: queries executed using iterator (getNext) model

Query execution using an eddy

S E

E C

S E

Output

CEddy

S E

E C

SEC

Output

An eddy operator• Intercepts tuples from sources and output tuples from operators• Uses feedback from the operators to route


Example Database

Name Level

Joe Junior

Jen Senior

Name Course

Joe CS1

Jen CS2

Course Instructor

CS2 Smith

select *from students, enrolled, courseswhere students.name = enrolled.name and enrolled.course = courses.course

Students Enrolled

Name Level Course

Joe Junior CS1

Jen Senior CS2

Enrolled Courses

Students Enrolled

Courses

Name Level Course Instructor

Jen Senior CS2 Smith


Query Execution using Eddies

Eddy

SE

C

Insert with key hash(joe)Probe

to find matches

S E

HashTableS.Name

HashTableE.Name

E C

HashTableE.Course

HashTableC.Course

Joe Junior

Joe JuniorJoe Junior

No matches; Eddy processesthe next tuple

Output



Eddy

SE

C

InsertProbe

S E

HashTableS.Name

HashTableE.Name

E C

HashTableE.Course

HashTableC.Course

Joe Jr

Jen Sr

Joe CS1

Joe CS1Joe CS1

Joe Jr CS1

Joe Jr CS1Joe Jr CS1

Output

CS2 Smith



Eddy

SE

COutput

Probe

S E

HashTableS.Name

HashTableE.Name

E C

HashTableE.Course

HashTableC.Course

Joe Jr

Jen Sr

CS2 Smith

Jen CS2

Joe CS1

Joe Jr CS1Jen CS2

Jen CS2Jen CS2 Smith

Probe

Jen CS2 SmithJen CS2 SmithJen Sr. CS2 Smith

Jen Sr. CS2 Smith

Note: this is a symmetric hash join


Eddies: Postmortem

• Eddy executes different query execution plans for different parts of data• Eddy adapts the join order during query execution

• Access methods and join algorithms still chosen up front

Students Enrolled

Output

Courses

E C

S E

Courses Enrolled

Output

Students

E S

C E

Course Instructor

CS2 Smith

Course Instructor

CS2 Smith

Name Course

Joe CS1

Name Level

Joe Junior

Jen Senior

Name Level

Joe Junior

Jen Senior

Name Course

Jen CS2


Eddies and Pipelined Joins

Simplest way to describe eddies is in terms of symmetric hash join– This is the approach we used in our example

Approach extends directly to index nested loops joins

Can also extend this to non-pipelined joins (e.g. sort-merge, or hybrid hash)

– BUT, opportunities for adaptation are much less

• Only at end of building whole hash table (just like with mid-query re-optimizations)

– Generalized using SteMs and STAIRs


Eddies: Routing Policy – Non-adaptive

Choosing which operator to route a given tuple to– The brain of the eddy

Eddy

S E

E C

SEC

Output

Non-Adaptive Optimization

1. Run a regular optimizer to figure

out the join order

2. Map join order into order of tuple routing

Send S and E tuples hereReturns S JOIN E (SE) tuples

Students Enrolled

Output

Courses

E C

S E

Send SE, and C tuples hereReturns SEC tuples


Eddies: Routing Policy – Lottery Scheduling


Eddy

S E

E C

SEC

Output

Lottery Scheduling [Avnur 00]

Simplified Description

1. Maintain for each operator:

tuples sent

tuples returned

cost per tuple

2. Choose (roughly) based on the above

3. Explore by randomly sending tuples in

the wrong orders

sent = 100received = 2

sent = 10received = 20

Send here 99% of the timeSend to the other operator 1% of the time


Eddies: Routing Policy – Statistics Based


Eddy

S E

E C

SEC

Output

[Deshpande, Hellerstein 04]

1. Learn statistics on each input table as

data streams in

2. Choose (roughly) based on the

statistics (analogous to traditional QO)


Eddies: Routing Policy – Interactivity Metric


Eddy

S E

E C

SEC

Output

[Raman, Hellerstein 02]

Use user preferences for different kinds of

tuples and different kinds of result columns to

determine the routing order


Agenda







– Open problems


State Modules (SteMs): More aggressive adaptation

Eddies allow dynamic adaptation of join order– Within a particular spanning tree of the join graph

Other adaptations:

– Access methods: especially index vs scan access methods

– Join algorithms: e.g., index join (IJ) vs hash join (HJ)

• To guard against wrong cardinality estimates

• To simultaneously optimize for both pipelining & completion time (IJHJ)

• To deal with memory overflows (HJIJ)

R

Eddy

P PR

Eddy

P

hash jn

R

Eddy

ind.jn

+

PSteMR SteMP


Eddy execs queries by routing tuples to SteMs and AMs

SteM– Dictionary of homogeneous tuples– (insert) build and (search) probe operations

Access modules (AMs)– probe with tuples and get matches

One Example: Symmetric Hash Join– Build SteMR

– Probe SteMS with R tuple

– Build SteMS

– Probe SteMR with R tuple

Generalizes to n-ary join Can simulate other join algorithms, by using other routing policy

– Thus, changing the routing allows adaptation of join algorithm– Subtle routing constraints needed to ensure correct execution [Raman et al 2003]

Query Processing with SteMs

build R

probe ST

RST matches

SteMR

repeat

R

Eddy

S

S b

ldR bldS probe R probe


Sharing State across queries E.g. 1000 queries of the form:

select * from R where R.a = parameter Equivalent to:

select * from R, Q where R.a = Q.parameter Eddy routes tuples from R SteM and queries from Q SteM

Useful for continuous-query proc. and multi-query proc – Need to extend SteMs with eviction (deletion) operation – for window joins

– See Madden et. al paper in SIGMOD 2002

Another application of SteMs


Agenda





• Eddies, SteMs, STAIRs …


– Open problems


HashTable

E.Name STAIR

HashTable

S.Name STAIR

HashTable

E.Course STAIR

HashTable

C.Course STAIR

EddySEC

Output

Query execution using STAIRS [Deshpande/Hellerstein 2004]

(Storage, Transformation and Access for Intermediate Results)

Extend SteMs to handle reuse of intermediate results

s1s1

s1

Probe into E.Name STAIRBuild into S.Name

STAIR

s1


Agenda





• Eddies, SteMs, STAIRs …


– Open problems


So where are we…

Evolutionary techniques– Relatively well understood

– Commercial database support in recent years

But– Don’t offer fine-grained adaptivity required for the new environments

Going back to the list of reasons why adaptive query processing is required…


Evolutionary Techniques Can handle to some extent:

– Queries with parameter markers– Complex queries with large tables – Bad Statistics/correlations (if used in conjunction with sophisticated synopses)

Can not easily handle:– Deep web and federated systems (maybe)– Interactive query metrics

Can not handle:– Very dynamic environments

• Continuous queries – query runs for ever, so things change– Extensive sharing of work between queries – esp. in continuous queries– Asynchronous (push) data sources, triggers, etc.

Open– XML (depends on the complexity of the query language)– Text and other types of complex data


So where are we…

Revolutionary techniques– Show great promise

– Have the potential to handle most of the issues

But– Much harder to implement

– Seem to have non-negligible overheads in the common case

– Very immature research area

– More work is needed…


Some Open Problems

Parametric query optimization– Can potentially avoid the need for adaptive query processing in many

scenarios

• Especially parameter markers and correlations

– Few practical techniques so far

Optimization with expanded plan space– Eddies can explore a plan space much larger than traditional plan space

• They allow relations to be broken into pieces, with each piece executed separately

– Can we explore this plan space in a non-adaptive setting ?

– Recent work on:

• Conditional Planning [Deshpande et al, ICDE 2005]

• Content-based Routing [Babu et al, VLDB 2005]


Some Open Problems

Eddies– Routing policies

• Whether eddies will succeed depends on the routing policies

• Little work so far...

– SteMs, STAIRs

• Theoretical analysis of optimization space

• Practical viability analysis

• Especially in the context of continuous query processing

– Parallel, distributed environments, P2P, Grid

Disk:– Flexibility demanded by adaptive techniques at odds against the careful

scheduling typically done by DBMSs

– Very little work on understanding this


Conclusions

Declarative queries are the fundamental justification for the relational model– Otherwise we could all go back to CODASYL

Traditional query optimization does not work all that well outside of a very narrow domain

Evolutionary techniques– Well understood; Offer limited adaptivity

Revolutionary techniques– Hold much more promise; Immature area

Very exciting research lies ahead …


Questions ?

1 adaptive query processing | dec 22, 2005 adaptive query processing amol deshpande, university of...

Documents