eddies: continuously adaptive query processing based on a sigmod’2002 paper and talk by avnur and...

37
Eddies: Continuously Adaptive Query Processing Based on a SIGMOD’2002 paper and talk by Avnur and Hellerstein.

Post on 22-Dec-2015

213 views

Category:

Documents


0 download

TRANSCRIPT

Eddies: Continuously Adaptive Query Processing

Based on a SIGMOD’2002 paper and talk

by Avnur and Hellerstein.

State-of-Art in Query Optimization

• Given:1. Database state and statistics known a-priori 2. One (short) user query to process 3. Query may be run only once

• Query Processing:1. A-priori decide on a (static) query plan2. Run query using this one plan

• Also:1. Possibly update statistics sometimes (in

steady state)

Adaptive Systems: General Flavor

Repeat:1. Observe (model) environment

2. Use observation to choose behavior

3. Take action

Adaptivity in Current DBs

• Limited & coarse grain

Repeat:1.Observe (model) environment

– runstats (once per week!!): model changes in data

2.Use observation to choose behavior – query optimization: fixes a single static query plan

3.Take action– query execution: blindly follow plan

Query Optimization

– Adaptivity at a per-week frequency!• Not suited for volatile environments

A Networking Problem!?

• Networks do dataflow!• Significant history of adaptive techniques

– E.g. TCP congestion control– E.g. routing

• But traditionally much lower function– Ship bitstreams– Minimal, fixed code

• Lately, moving up the foodchain?– app-level routing– active networks

Query Plans are Dataflow

• Programming model: iterators– old idea, widely used in

DB query processing– object with three

methods:• Init(), GetNext(), Close()

– input/output types– query plan: graph of

iterators• pipelining: iterators that

return results before children Close()

Querying in Volatile Environments• Federated query processors

– No control over stats, performance, admin (DataJoiner)

• Shared-Nothing Systems– No control over “system balance”

• User control of running queries– No control over user interaction (online

aggregation)• Sensor Nets: the next killer app

– No control over anything!

Varying …• Computing resources

– Data flows unpredictably from sources– Code performs unpredictably along flows– Continuous volatility due to many decentralized

systems

• Data Characteristics– Distributions– Burstiness

• User preferences– What get fast– How much data

Toward Continuous Adaptivity

• Need much more frequent adaptivity

– Goal: adapt per tuple of each relation??

– The traditional runstats-optimize-execute loop is far too coarse-grained

– So, continuously perform all 3 functions, at runtime

– Aim for adaptivity over best-case performance (as the later never exists for long)

Road Map

• Adaptive Query Processing

• Intra-join adaptivity– Synchronization Barriers– Moments of Symmetry

• Eddies– Encapsulated, adaptive dataflow

Adaptable Operators and Plans

• Moments of symmetry = query processing stage during which pipelined query operators or inputs can be easily reordered (with no or minimal state management)

• Synchronization barriers = require inputs from different sources to be coordinated and possibly restricted to the rate of the slower input

• We need “good” operators.

Adaptable Joins, Issue 1

• Synchronization Barrier: merge join– Right input frozen,

waiting for left– Can’t adapt while waiting

for barrier!– So, favor joins that have:

• no barriers or seldom barriers• at worst, adaptable barriers

23456

20002001200220032004

Adaptable Joins, Issue 2

• Would like to reorder in-flight (pipelined) joins

• Base case: swap inputs to a join ??

• Moment of symmetry:– inputs can be swapped with no/little state

management

• Aim for frequent moments of symmetry more frequent adaptivity

Adaptable Joins, Issue 2

• Moments of Symmetry – Suppose you can adapt an in-flight query

plan• How would you do it?

– Base case: reorder inputs of a single join• Nested loops join

R

S

R SS R

• Moments of Symmetry – Suppose you can adapt an in-flight query

plan• How would you do it?

– Base case: reorder inputs of a single join• Nested loops join• Cleaner if you wait

til end of inner loop

R

S

Adaptable Joins, Issue 2

Adaptable Joins, Issue 2

• Moments of Symmetry – Suppose you can adapt an in-flight query

plan• How would you do it?

– Base case: reorder inputs of a single join• Nested loops join• Cleaner if you wait

til end of inner loop

– Hybrid Hash• Reorder while “building”?

R

S

Moments of Symmetry, cont.

• Moment of Symmetry:– Can swap join inputs w/o state modification

– Nested Loops join: end of each inner loop– Hybrid Hash join: never– Sort-Merge join: essentially always

• More frequent moments of symmetry more frequent adaptivity

Joins for Adaptivity

– Pipelined hash join (hash ripple or Xjoin) • No synchronization barriers• Continuous symmetry• Good for equi-join

– Simple (or block) ripple join• Synchronization barriers at “corners”• Moments of symmetry at “corners”• Good for non-equi-join

– When symmetry: At corners, i.e., for each “new” tuple, once it has been processed using the given operator ‘s state

R

S

Beyond Binary Joins

• Think of swapping “inners”– Can be done at a global

moment of symmetry

• Intuition: like an n-ary join– Except that each pair can be

joined by a different algorithm!

• So…– Need to introduce n-ary joins

to a query engine

Need well-behaved join algorithms

– Pipelining– Avoid synch barriers– Frequent moments of symmetry

Continuous Adaptivity Goal: Eddies

• Avoid need for traditional cost estimation• Avoid generation of a ‘good’ query plan

Eddy

Continuous Adaptivity: Eddies

• A pipelining n-ary tuple-routing iterator (just like join or sort)– works well with ops that have

frequent moments of symmetry

Eddy

Continuous Adaptivity: Eddies

• Adjusts flow adaptively– Tuples flow in different orders– Visit each op once before output

Eddy

Routing: Eddies

• Naïve routing policy:– All ops fetch from eddy as fast as possible– Previously-seen tuples precede new tuples

Eddy

Schedule : Grab when Ready?

– Two expensive selections s1 and s2

• Selectivity(s1)=Selectivity(s2)=50%• Cost(s2) = 5. • Vary Cost(s1).

– What expect? ?

• Does it make a difference at all?

Cost Factor?

– Two expensive selections, 50% selectivity• Cost(s2) = 5. Vary cost of s1.• Favors faster operation

But is it Enough?

– Given two expensive selections:

• Cost same, say cost(s1)=cost(s2)=5• Selectivity(s2) = 50%. • Vary selectivity of s1.

– Does that make a difference?

Selectivity-based?

– Two expensive selections, cost 5• Selectivity(s2) = 50%. Vary selectivity of s1.

Schedule: Selectivity-based?

– Conclude: Heavy tuple shedder early on is good.

How to choose?

• If we knew all selectivities and all costs (and they were static), maybe we could pick the best overall “schedule” here.

• Otherwise, we need a cheap means to observe their changes

• And, we need a means to react in a simply manner based on those perceived changes

An Aside: How to choose?

• A machine learning problem? – Each agent pays off differently

– Explore Or Exploit?

– Heuristics ?• Sometimes want to randomly choose

one • Usually want to go with the best• If probabilities are stationary, dampen

exploration over time

Eddies with Lottery Scheduling

• Operator gets 1 ticket when it takes a tuple– Favor operators that run fast (low cost)

• Operator loses a ticket when it returns a tuple– Favor operators that drop tuples (low selectivity)

• Winner?– Large number of tickets == measure of goodness

• Lottery Scheduling:– When two operators vie for the same tuple,

hold a lottery– Never let any operator go to zero tickets

• Support occasional random “exploration”

Lottery-Based Eddy

– Two expensive selections, cost 5• Selectivity(s2) = 50%. Vary selectivity of s1.

In a Volatile Environment

• Two index joins– Slow: 5 second delay; Fast: no delay– Toggle after 30 seconds

Related Work

– Late Binding: Dynamic, Parametric [HP88,GW89,IN+92,GC94,AC+96,LP97]

– Per Query: Mariposa [SA+96], ASE [CR94] – Competition: RDB [AZ96]– Inter-Op: [KD98], Tukwila [IF+99]– Query Scrambling: [AF+96,UFA98]

• Survey: Hellerstein, Franklin, et al., DE Bulletin 2000

Syste

m R

Late

Bin

ding

Per Q

uery

Com

petit

ion

& Sa

mpl

ing

Inte

r-Ope

rato

rQ

uery

Scr

ambl

ing

Eddie

s

Ingr

es D

ECO

MP

Frequency of Adaptivity

Futu

re W

ork

Summary

• Eddies: Continuously Adaptive Dataflow– Suited for volatile performance environments

• Changes in operator/machine peformance• Changes in selectivities (e.g. with sorted inputs)• Changes in data delivery

– Currently adapts join order• Competitive methods to adapt access & join methods?

• Requires well-behaved join algorithms– Pipelining– Avoid synch barriers– Frequent moments of symmetry

• The end of the runstats/optimizer/executor boundary!– At best, System R is good for “hints” on initial ticket

distribution