
Content-Based Routing: Different Plans for Different Data

Pedro Bizarro
Joint work with Shivnath Babu, David DeWitt, Jennifer Widom

September 1, 2005
VLDB 2005

2

Introduction

• The opportunity to improve:
  – Optimizers pick a single plan for a query
  – However, different subsets of the data may have very different statistical properties
  – It may be more efficient to use different plans for different subsets of the data

3

Overview of CBR

• Eliminates the single-plan assumption
• Identifies tuple classes
• Uses multiple plans, each customized for a different tuple class
• CBR applies to any streaming data:
  – E.g., stream systems, regular DBMS operators using iterators, and acquisitional systems
• Adaptive, low-overhead algorithm
• Implemented in TelegraphCQ as an extension to Eddies

4

Overview of Eddies

• The Eddy routes tuples through a pool of operators
• Routing decisions are based on operator characteristics: selectivity, cost, queue size, etc.
• Routing: tuples are not differentiated based on content
• We call this SBR: Source-Based Routing

[Figure: an Eddy routing a stream of input tuples through a pool of operators O1, O2, O3 and producing output tuples.]

5

Content-Based Routing Example

• Consider stream S processed by operators O1, O2, O3

Overall operator selectivities:

                O1    O2    O3
  Selectivity   30%   40%   60%

• Best routing order: O1, then O2, then O3

6

Content-Based Routing Example

• Let A be an attribute with domain {a, b, c}

Content-specific selectivities:

  Value of A   O1    O2    O3
  A=a          32%   10%   55%
  A=b          31%   20%   65%
  A=c          27%   90%   60%
  Overall      30%   40%   60%

• Best routing order for A=a: O2, O1, O3
• Best routing order for A=b: O2, O1, O3
• Best routing order for A=c: O1, O3, O2
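To make these orderings concrete, here is a minimal sketch (not from the paper) that recomputes them by brute force, assuming unit-cost, statistically independent filters; under those assumptions the expected number of operator calls per tuple for an order is 1 + s1 + s1·s2:

```python
from itertools import permutations

# Content-specific selectivities (fraction of tuples that PASS each
# operator), taken from the table above.
sel = {
    "a": {"O1": 0.32, "O2": 0.10, "O3": 0.55},
    "b": {"O1": 0.31, "O2": 0.20, "O3": 0.65},
    "c": {"O1": 0.27, "O2": 0.90, "O3": 0.60},
}

def expected_calls(order, s):
    """Expected operator invocations per tuple for a given order,
    assuming unit-cost, independent filters: 1 + s1 + s1*s2 + ..."""
    total, alive = 0.0, 1.0
    for op in order:
        total += alive   # every still-alive tuple visits this operator
        alive *= s[op]   # only passing tuples continue
    return total

for value, s in sel.items():
    best = min(permutations(s), key=lambda order: expected_calls(order, s))
    print(f"A={value}: best order {best}")
```

Running this reproduces the slide's orders: O2, O1, O3 for A=a and A=b, and O1, O3, O2 for A=c.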

7

Classifier Attributes

• Goal: identify tuple classes
  – Each with a different optimal operator ordering
• CBR considers:
  – Tuple classes distinguished by content, i.e., attribute values
• Classifier attribute (informal definition):
  – Attribute A is a classifier attribute for operator O if the value of A is correlated with the selectivity of O

8

Best Classifier Attribute Example

• Attribute A with domain {a, b, c}
• Attribute B with domain {x, y, z}

           σ     1−σ
  A=a      10%   90%
  A=b      20%   80%
  A=c      90%   10%
  Overall  40%   60%

           σ     1−σ
  B=x      43%   57%
  B=y      38%   62%
  B=z      39%   61%
  Overall  40%   60%

• Which is the best to use for routing decisions?
• Similar to an AI problem: choosing classifier attributes for decision trees
  – Two labels: passed by operator, dropped by operator
• AI solution: use GainRatio to pick the best classifier attribute

9

GainRatio to Measure Correlation

• R: random sample of tuples processed by operator O
• R_1, …, R_d: partition of R induced by the values of attribute A; σ: fraction of tuples that pass O

$\mathrm{Entropy}(R) = -\sigma \log_2 \sigma - (1-\sigma) \log_2 (1-\sigma)$

$\mathrm{InfoGain}(R, A) = \mathrm{Entropy}(R) - \sum_{i=1}^{d} \frac{|R_i|}{|R|} \, \mathrm{Entropy}(R_i)$

$\mathrm{SplitInformation}(A) = -\sum_{i=1}^{d} \frac{|R_i|}{|R|} \log_2 \frac{|R_i|}{|R|}$

$\mathrm{GainRatio}(R, A) = \frac{\mathrm{InfoGain}(R, A)}{\mathrm{SplitInformation}(A)}$

           σ     1−σ
  A=a      10%   90%
  A=b      20%   80%
  A=c      90%   10%
  Overall  40%   60%

           σ     1−σ
  B=x      43%   57%
  B=y      38%   62%
  B=z      39%   61%
  Overall  40%   60%

GainRatio(R, A) = 0.87    GainRatio(R, B) = 0.002

Formulas from T. Mitchell, Machine Learning. McGraw-Hill, '97.
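As a concrete companion to the formulas, here is a minimal Python transcription (not the TelegraphCQ code). The slide gives only per-value pass rates, so the partition sizes in the example call are hypothetical, and the printed value need not match the 0.87 shown above:

```python
from math import log2

def entropy(sigma):
    """Binary entropy of pass probability sigma, in bits."""
    if sigma in (0.0, 1.0):
        return 0.0
    return -sigma * log2(sigma) - (1 - sigma) * log2(1 - sigma)

def gain_ratio(partitions):
    """GainRatio for one attribute.

    `partitions` maps each attribute value (or partition index) to
    (n_i, p_i): the number of sampled tuples in that partition and how
    many of them passed the operator.
    """
    n = sum(n_i for n_i, _ in partitions.values())
    n_pass = sum(p_i for _, p_i in partitions.values())
    h_r = entropy(n_pass / n)
    info_gain = h_r - sum(
        (n_i / n) * entropy(p_i / n_i) for n_i, p_i in partitions.values()
    )
    split_info = -sum((n_i / n) * log2(n_i / n) for n_i, _ in partitions.values())
    return info_gain / split_info if split_info > 0 else 0.0

# Slide's per-value pass rates for A (10%, 20%, 90%), with hypothetical
# equal partition sizes of 50 tuples; |R_i| is not given on the slide.
print(gain_ratio({"a": (50, 5), "b": (50, 10), "c": (50, 45)}))
```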

10

Classifier Attributes: Definition

An attribute A is a classifier attribute for operator O if, for any large random sample R of tuples processed by O, GainRatio(R, A) > τ, for some threshold τ.

11

Content-Learns Algorithm: Learning Routes Automatically

• Content-Learns consists of two continuous, concurrent steps:
  – Optimization: for each O_l ∈ {O1, …, On}, either:
    • determine that O_l does not have a classifier attribute, or
    • find the best classifier attribute, C_l, of O_l
  – Routing: route tuples according to:
    • the selectivities of O_l, if O_l does not have a classifier attribute, or
    • the content-specific selectivities of the pair <O_l, C_l>, if C_l is the best classifier attribute of O_l

12

Content-Learns: Optimization Step

• Find C_l by profiling O_l:
  – Route a fraction of input tuples to O_l
  – For each sampled tuple:
    • For each attribute:
      – Map the attribute value to one of d partitions
      – Update pass/fail counters for that partition
  – When all sample tuples have been seen, compute C_l

[Figure: operator 3 of 4 being profiled. A sampled tuple's 3 attribute values are mapped by functions f1, f2, f3 into one of 2 partitions each; per-partition In[]/Out[] counters record tuples in and tuples out. Array CA[] holds each operator's classifier attribute (−1 if none).]
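A minimal sketch of this profiling loop, reusing gain_ratio() from the earlier sketch. The data layout, hash-based partitioning, offline sampling, and threshold are illustrative assumptions, not the TelegraphCQ data structures (Content-Learns samples online with probability P):

```python
import random

D = 2               # partitions per attribute (the defaults slide uses d = 24)
SAMPLE_SIZE = 150   # |R|, per the defaults slide

def profile_operator(op, tuples, attrs, threshold=0.1):
    """Profile one operator on a sample of tuples (dicts keyed by attribute
    name); return the index of the best classifier attribute, or -1 if no
    attribute clears the (hypothetical) GainRatio threshold."""
    # counters[j][i] = [tuples_in, tuples_out] for attribute j, partition i
    counters = [[[0, 0] for _ in range(D)] for _ in attrs]
    for t in random.sample(tuples, SAMPLE_SIZE):
        passed = op(t)
        for j, a in enumerate(attrs):
            i = hash(t[a]) % D          # stand-in for f_j: value -> partition
            counters[j][i][0] += 1
            counters[j][i][1] += int(passed)
    best_j, best_ratio = -1, threshold
    for j in range(len(attrs)):
        parts = {i: (c[0], c[1]) for i, c in enumerate(counters[j]) if c[0] > 0}
        ratio = gain_ratio(parts)
        if ratio > best_ratio:
            best_j, best_ratio = j, ratio
    return best_j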

13

Content-Learns: Routing Step

• SBR routes to O_l with probability inversely proportional to O_l's selectivity, W[l]
• CBR routes to the operator with the minimum expected selectivity:
  – If O_l does not have a classifier attribute, that selectivity is W[l]
  – If O_l has a classifier attribute, it is S[l, i], where j = CA[l] and i = f_j(t.C_j)

[Figure: 4 operators with overall selectivities W[] = [25%, 40%, 50%, 60%] and classifier attributes CA[] = [2, −1, 2, 1] (−1 = none). An incoming tuple's attribute values map to partitions (2 per attribute), selecting content-specific entries from the detailed selectivities S[] for operators with a classifier attribute, and W[l] otherwise.]
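A sketch of that routing decision (all names illustrative, matching the figure's arrays): among the operators the tuple still has to visit, pick the one expected to drop it soonest. This mirrors the slide's greedy min-selectivity rule and ignores operator cost differences:

```python
def route(t, pending_ops, W, CA, S, f):
    """Pick the next operator for tuple t: among operators t has not yet
    visited, choose the one with the lowest expected selectivity."""
    def expected_selectivity(l):
        j = CA[l]
        if j == -1:
            return W[l]      # no classifier attribute: overall selectivity
        i = f(j, t[j])       # partition for this tuple's value of attribute j
        return S[l][i]       # content-specific selectivity
    return min(pending_ops, key=expected_selectivity)
```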

15

CBR Update Overheads

• Once per tuple:
  – Update selectivities, to keep them as fresh as possible
• Once per sampled tuple:
  – Update correlations between operators and content
• Once per sample (~2500 tuples):
  – Compute GainRatio and update one entry in array CA

[Figure: the data structures involved — W[] (operator selectivities, operators 1…n), CA[] (classifier attributes), S[] (detailed selectivities), and per-attribute In[]/Out[] counters over attributes 1…k and partitions 1…d.]
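The slides do not spell out the once-per-tuple update rule; one plausible way to keep W[] "as fresh as possible" is an exponential moving average, sketched here purely as an assumption:

```python
ALPHA = 0.01  # hypothetical smoothing factor; not from the paper

def update_selectivity(W, l, passed):
    """Fold one observed outcome (tuple passed/dropped at operator l)
    into the running selectivity estimate W[l]."""
    W[l] = (1 - ALPHA) * W[l] + ALPHA * (1.0 if passed else 0.0)
```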

16

Experimental Results: Datasets

• Stream-star dataset
  – Synthetic dataset based on a star schema:

    SELECT *
    FROM stream S, d1, d2, …, dN
    WHERE s.fkd1 = d1.pk     -- Operator O1
      …
      AND s.fkdN = dN.pk;    -- Operator ON

  – Attribute S.attrC is the best classifier attribute for all operators
  – 8 other attributes in S are not correlated with operator selectivities
  – 100K records in the stream, 10K records in the dimension tables
• Lab dataset
  – Explained later

17

Experimental Results: System, Metrics, Defaults

• TelegraphCQ version 0.2
• Tao Linux release 1, 512 MB RAM, Pentium 4 at 2.4 GHz
• Metrics: % improvement in running time and in routing calls
• Default values:

  Parameter    Default      Comment
  P            6%           Tuple sampling probability
  |R|          150 tuples   Sample size
  d            24           Number of partitions
  Confidence   95%          Confidence interval in graphs

18

Experimental Results: Run-time Overheads

• Routing overhead:
  – Time to perform routing decisions (SBR, CBR)
• Learning overhead:
  – Time to update data structures (SBR, CBR), plus
  – Time to compute GainRatio (CBR only)

[Figure: per-tuple learning and routing overheads, in microseconds, for SBR vs. CBR at 4, 6, and 8 joins. Overhead increase: 30%-45%.]

19

Experimental Results: Varying Skew

• One operator with selectivity A, all others with selectivity B
• Skew is A−B; A varied from 5% to 95%
• Overall selectivity: 5%

[Figure: % improvement over SBR vs. skew (A−B), 6 joins.]

20

Experimental Results: Random Selectivities

• Attribute attrC is correlated with the selectivities of the operators
• Other attributes in the stream tuples are not correlated with the selectivities
• Random selectivities in each operator

[Figure 1: breakdown of routing calls at 4, 6, and 8 joins into: using the right classifier (77.3%-85.2% of calls), profiling (~6.3%-6.5%), not using a classifier, and using the wrong classifier.]

[Figure 2: % improvement over SBR at 4, 6, and 8 joins — routing calls: 35.1%, 29.4%, 21.2%; execution time: 12.8%, 18.8%, 19.7%.]

21

Experimental Results: Varying Aggregate Selectivity

• Aggregate selectivity in the previous experiments was 5% or ~8%
• Here we vary aggregate selectivity between 5% and 35%
• Random selectivities within these bounds

[Figure: % improvement over SBR at aggregate selectivities of 5%, 15%, 25%, and 35% (6 joins) — routing calls: 37.0%, 31.2%, 25.4%, 20.0%; execution time: 13.0%, 22.0%, 18.5%, 25.1%.]

22

Experimental Results: Datasets

• Lab dataset
  – Real data
  – 2.2 M readings from 54 sensors (Intel Research Berkeley Lab)
  – Single stream with attributes: light, humidity, temperature, voltage, sensorID, year, month, day, hours, minutes, seconds

23

Experimental Results: Challenging Adaptivity Experiment (1)

• Using the Lab dataset
• Example query:

    SELECT * FROM sensors WHERE light > 500;

• Observations:
  – Very high variation in selectivity
  – Best classifier attributes change with time
  – No classifier attribute found for over half the time

24

Experimental Results: Challenging Adaptivity Experiment (2)

• Query:

    SELECT * FROM sensors
    WHERE light BETWEEN lowL AND highL
      AND temperature BETWEEN lowT AND highT
      AND humidity BETWEEN lowH AND highH
      AND voltage BETWEEN lowV AND highV;

• lowX: random number from the lower 25% of the domain
• highX: random number from the upper 25%
• Results for 50 different queries
• Average improvement of:
  – 8% in routing calls
  – 5% in execution time
  – 7% in time spent evaluating operators
  – 18% in routing calls until a tuple is dropped
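The 50 queries can be generated mechanically; the sketch below (a hypothetical helper, with assumed attribute domains for illustration) draws each lowX from the bottom 25% of its domain and each highX from the top 25%, as the slide describes:

```python
import random

DOMAINS = {            # assumed [min, max] ranges, for illustration only
    "light": (0.0, 2000.0),
    "temperature": (0.0, 50.0),
    "humidity": (0.0, 100.0),
    "voltage": (1.0, 3.5),
}

def random_query():
    preds = []
    for attr, (lo, hi) in DOMAINS.items():
        span = hi - lo
        low = random.uniform(lo, lo + 0.25 * span)    # lower 25% of domain
        high = random.uniform(hi - 0.25 * span, hi)   # upper 25% of domain
        preds.append(f"{attr} BETWEEN {low:.2f} AND {high:.2f}")
    return "SELECT * FROM sensors WHERE " + " AND ".join(preds) + ";"

queries = [random_query() for _ in range(50)]
```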

25

Experimental Results: Varying Operator Cost

• Run a random query from the previous slide
• Run the query for periods with correlations
• Varied operator cost by running CPU-intensive computations

[Figure: % improvement over SBR in routing calls and execution time as operator cost varies from 0 to 500 microseconds, 4 operators.]

26

Conclusions

• CBR eliminates the single-plan assumption
• Exploits correlation between tuple content and operator selectivities
• Adaptive learner of correlations with negligible overhead
• Performance improvements over non-CBR routing
• Selectivities change much more than correlations do

27

Acknowledgements

• Sam Madden and Amol Deshpande for providing the Lab dataset.

• Sailesh Krishnamurthy, Amol Deshpande, Joe Hellerstein, and the rest of the TelegraphCQ team for providing TelegraphCQ and answering all my questions.

Q & A?

Extra slides

30

Motivational Example (1): Intrusion Detection Query

• "Track packets with a destination address matching a prefix in table T, and containing the 100-byte and 256-byte sequences "0xa...8" and "0x7...b", respectively, as subsequences"

    SELECT * FROM packets
    WHERE matches(destination, T)      -- O1
      AND contains(data, "0xa...8")    -- O2
      AND contains(data, "0x7...b");   -- O3

31

Motivational Example (2): Intrusion Detection Query

• Assume:
  – costs are: c3 > c1 > c2
  – selectivities are: σ3 > σ1 > σ2
• SBR routing converges to O2, O1, O3

[Figure: under SBR, almost all tuples in the stream follow the route O2, O1, O3.]

32

Motivational Example (3): Intrusion Detection Query

• Suppose an attack (O2 and O3) on a network whose prefix is not in T (O1) is underway:
  – σ2 and σ3 will be very high, σ1 will be very low
  – O1, O2, O3 will be the most efficient ordering for "attack" tuples

[Figure: SBR vs. CBR. Under SBR, almost all tuples follow a single route; under CBR, attack tuples and non-attack tuples follow different routes through O1, O2, O3.]

33

Experimental Results: Varying Skew

• One operator with selectivity A, all others with selectivity B
• Skew is A−B; A varied from 5% to 95%
• Overall selectivity: 5%

[Figures: % improvement over SBR vs. skew (A−B), for 2 joins and for 6 joins.]

34

Related Work

Adaptivity in Stream Systems
• [Avnur+] Eddies: Continuously Adaptive Query Processing. SIGMOD'00.
• [Arasu+] STREAM: The Stanford Stream Data Manager. DEBul 26(1).
• [Babu+] Adaptive ordering of pipelined stream filters. SIGMOD'04.
• [Babu+] StreaMon: An Adaptive Engine for Stream Query Processing. SIGMOD'04.
• [Carney+] Monitoring streams – a new class of data management applications. VLDB'02.
• [Chandrasekaran+] TelegraphCQ: Continuous dataflow processing for an uncertain world. CIDR'03.
• [Chandrasekaran+] PSoup: a system for streaming queries over streaming data. VLDBj 12(2).
• [Deshpande] An initial study of overheads of eddies. SIGMODrec 33(1).
• [Deshpande+] Lifting the Burden of History from Adaptive Query Processing. VLDB'04.
• [Madden+] Continuously adaptive continuous queries over streams. SIGMOD'02.
• [Raman+] Using state modules for adaptive query processing. ICDE'03.
• [Tian+] Tuple Routing Strategies for Distributed Eddies. VLDB'03.
• [Viglas+] Maximizing the Output Rate of Multi-Way Join Queries over Streaming Information Sources. VLDB'03.

AQP Surveys
• [Babu+] Adaptive Query Processing in the Looking Glass. CIDR'05.
• [Hellerstein+] Adaptive query processing: Technology in evolution. DEBul 23(2).

Adaptivity in DBMS
• [Babu+] Proactive Re-optimization. SIGMOD'05.
• [Graefe+] Dynamic query evaluation plans. SIGMOD'89.
• [Kabra+] Efficient Mid-Query Re-Optimization of Sub-Optimal Query Execution Plans. SIGMOD'98.
• [Markl+] Robust Query Processing through Progressive Optimization. SIGMOD'04.
• [Wong+] Decomposition – a strategy for query processing. TODS 1(3).

Adaptivity in Distributed Systems
• [Bouganim+] Dynamic query scheduling in data integration systems. ICDE'00.
• [Ives+] An adaptive query execution system for data integration. SIGMOD'99.
• [Ives+] Adaptive query processing for internet applications. DEBul 23(2).
• [Ives+] Adapting to Source Properties in Processing Data Integration Queries. SIGMOD'04.
• [Ives] Efficient Query Processing for Data Integration. PhD thesis.
• [Ng+] Dynamic query re-optimization. ICSSDM'99.
• [Urhan+] Dynamic pipeline scheduling for improving interactive performance of online queries. VLDB'01.
• [Urhan+] Cost based query scrambling for initial delays. SIGMOD'98.
• [Zhu+] Dynamic plan migration for continuous queries over data streams. SIGMOD'04.

35

Related Work (cont'd)

Acquisitional Systems
• [Madden+] The Design of an Acquisitional Query Processor for Sensor Networks. SIGMOD'03.
• [Deshpande+] Model-Driven Data Acquisition in Sensor Networks. VLDB'04.
• [Deshpande+] Exploiting Correlated Attributes in Acquisitional Query Processing. ICDE'05.

Multiple Plans in the Same Query
• [Antoshenkov+] Query processing and optimization in Oracle Rdb. VLDBj 5(4).
• [Bizarro+] Content-Based Routing: Different Plans for Different Data. VLDB'05.
• [Polyzotis] Selectivity-Based Partitioning: A Divide-and-Union Paradigm For Effective Query Optimization. Unpublished.

URDs and Modeling Uncertainty
• [Anderson+] Index key range estimator. U.S. Patent 4,774,657.
• [Babcock+] Towards a Robust Query Optimizer: A Principled and Practical Approach. SIGMOD'05.
• [Chu+] Least expected cost query optimization: An exercise in utility. PODS'99.
• [Ramamurthy+] Buffer-pool Aware Query Optimization. CIDR'05.
• [Viglas] Novel Query Optimization and Evaluation Techniques. PhD thesis.

Optimization, Cardinality Estimation, Correlations
• [Selinger+] Access Path Selection in a Relational Database Management System. SIGMOD'79.
• [Acharya+] Join Synopses for Approximate Query Answering. SIGMOD'99.
• [Christodoulakis] Implications of certain assumptions in database performance evaluation. TODS 9(2).
• [Graefe] Query Evaluation Techniques for Large Databases. ACM Comput. Surv. 25(2).
• [Getoor+] Selectivity Estimation using Probabilistic Models. SIGMOD'01.
• [Ioannidis+] On the Propagation of Errors in the Size of Join Results. SIGMOD'91.
• [Ilyas+] CORDS: Automatic discovery of correlations and soft functional dependencies. SIGMOD'04.
• [Stillger+] LEO – DB2's LEarning Optimizer. VLDB'01.

AI and Learning from Streams
• [Mitchell] Machine Learning. McGraw-Hill, '97.
• [Guha+] Data-streams and histograms. ACM Symp. on Theory of Computing, '01.
• [Domingos+] Mining high-speed data streams. SIGKDD'00.
• [Gehrke+] On computing correlated aggregates over continual data streams. SIGMOD'01.