exploiting correlated attributes in acquisitional query processing

32
Exploiting Correlated Exploiting Correlated Attributes in Attributes in Acquisitional Query Acquisitional Query Processing Processing Amol Deshpande Amol Deshpande University of Maryland University of Maryland Joint work with Carlos Guestrin Joint work with Carlos Guestrin @CMU @CMU , Sam , Sam Madden Madden @MIT @MIT , , Wei Hong Wei Hong @Intel Research @Intel Research

Upload: ramona

Post on 19-Feb-2016

19 views

Category:

Documents


1 download

DESCRIPTION

Exploiting Correlated Attributes in Acquisitional Query Processing. Amol Deshpande University of Maryland Joint work with Carlos Guestrin @CMU , Sam Madden @MIT , Wei Hong @Intel Research. Network. Information Sources. Motivation. Emergence of large-scale distributed information systems - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Exploiting Correlated Attributes in Acquisitional Query Processing

Exploiting Correlated Exploiting Correlated Attributes in Acquisitional Attributes in Acquisitional

Query ProcessingQuery Processing

Amol DeshpandeAmol DeshpandeUniversity of MarylandUniversity of Maryland

Joint work with Carlos GuestrinJoint work with Carlos Guestrin@CMU@CMU, Sam Madden, Sam Madden@MIT@MIT, , Wei HongWei Hong@Intel Research@Intel Research

Page 2: Exploiting Correlated Attributes in Acquisitional Query Processing

MotivationMotivation► Emergence of large-scale distributed information Emergence of large-scale distributed information

systemssystems Web, Grid, Sensor Networks etc..Web, Grid, Sensor Networks etc..

► Characterized by high data Characterized by high data acquisitionacquisition costs costs Data doesn’t reside on the local disk; must acquire itData doesn’t reside on the local disk; must acquire it Goal: reduce data acquisition as much as possibleGoal: reduce data acquisition as much as possible

Network

Information Sourcestemperature at

location 1

humidity at location 2

Page 3: Exploiting Correlated Attributes in Acquisitional Query Processing

MotivationMotivation► Many of these systems exhibit strong data Many of these systems exhibit strong data

correlationscorrelations E.g. in sensor networks…E.g. in sensor networks…

► Temperature and voltageTemperature and voltage► Temperature and lightTemperature and light► Temperature and humidityTemperature and humidity► Temperature and time of dayTemperature and time of day► etc. etc.

Page 4: Exploiting Correlated Attributes in Acquisitional Query Processing

Data is correlatedData is correlated►Temperature and voltageTemperature and voltage►Temperature and lightTemperature and light►Temperature and humidityTemperature and humidity►Temperature and time of dayTemperature and time of day►etc.etc.

Page 5: Exploiting Correlated Attributes in Acquisitional Query Processing
Page 6: Exploiting Correlated Attributes in Acquisitional Query Processing

MotivationMotivation► Many of these systems exhibit strong data Many of these systems exhibit strong data

correlationscorrelations E.g. in sensor networks…E.g. in sensor networks…

► Temperature and voltageTemperature and voltage► Temperature and lightTemperature and light► Temperature and humidityTemperature and humidity► Temperature and time of dayTemperature and time of day► etc. etc.

Observing one attribute Observing one attribute information about other information about other attributesattributes

Page 7: Exploiting Correlated Attributes in Acquisitional Query Processing

MotivationMotivation► Database systems typically ignore correlationsDatabase systems typically ignore correlations► Our agenda:Our agenda:

Exploit the naturally occuring Exploit the naturally occuring spatialspatial and and temporaltemporal correlations in novel ways to reduce data acquisitioncorrelations in novel ways to reduce data acquisition

► Model-driven Acquisition in Sensor Networks Model-driven Acquisition in Sensor Networks [VLDB’04][VLDB’04] A new way to look at data managementA new way to look at data management

► This talk:This talk: A more traditional query optimization questionA more traditional query optimization question Signficant benefits possible even hereSignficant benefits possible even here

► Even in traditional domains Even in traditional domains

Page 8: Exploiting Correlated Attributes in Acquisitional Query Processing

Conjunctive QueriesConjunctive Queries► Queries of the form:Queries of the form:

SELECT X1, X2, X3SELECT X1, X2, X3WHERE WHERE pred1pred1(X1)(X1) AND AND pred2pred2(X2)(X2) AND AND pred3pred3(X3)(X3)

► Common in acquisitional environmentsCommon in acquisitional environments Sensor networks:Sensor networks:

► Find sensors reporting temperature between 10Find sensors reporting temperature between 10ooC and 20C and 20ooC and light C and light less than 100 Luxless than 100 Lux

► Are there any sensors not within above bounds ?Are there any sensors not within above bounds ? Web data sources:Web data sources:

► What fraction of people who donated to Gore in 2000 also have What fraction of people who donated to Gore in 2000 also have patents ?patents ?

Page 9: Exploiting Correlated Attributes in Acquisitional Query Processing

Conjunctive QueriesConjunctive Queries► Queries of the form:Queries of the form:

SELECT X1, X2, X3SELECT X1, X2, X3WHERE X1 = x1WHERE X1 = x1 AND X2 = x2AND X2 = x2 AND X3 = x3AND X3 = x3

► Current approach: Current approach: Sequential PlansSequential Plans Choose a single order of attributes to observe, for all tuplesChoose a single order of attributes to observe, for all tuples

X1 X2 X3

Observe X1

If X1 = x1, then Observe X2

If X1 = x1 and X2 = x2, then Observe X3

May involve: turning a sensor on and sampling, or querying a web index over the network

Page 10: Exploiting Correlated Attributes in Acquisitional Query Processing

Finding Sequential PlansFinding Sequential Plans► Naïve:Naïve:

Order predicates by Order predicates by cost/(1 – selectivity)cost/(1 – selectivity)► selectivityselectivity = fraction of tuples that pass the predicate = fraction of tuples that pass the predicate

Ignores correlationsIgnores correlations Used by most current query optimizersUsed by most current query optimizers

► Optimal:Optimal: Find the optimal order considering correlationsFind the optimal order considering correlations NP-HardNP-Hard

► A 4-approximate solution:A 4-approximate solution: Greedily choose the attribute that maximizes the return Greedily choose the attribute that maximizes the return

[Munagala, Babu, Motwani, Widom, ICDT 05][Munagala, Babu, Motwani, Widom, ICDT 05]

Page 11: Exploiting Correlated Attributes in Acquisitional Query Processing

ObservationObservation► Cheap, correlated attributes can improve planningCheap, correlated attributes can improve planning

By improving estimates of selectivitiesBy improving estimates of selectivities Extreme Case - Perfect Correlation: Extreme Case - Perfect Correlation: X4X4 X1X1 Observing Observing X4X4 sufficient to decide whether sufficient to decide whether X1 = x1X1 = x1 is true is true

► Would be a better plan if Would be a better plan if X4X4 cheaper to acquire cheaper to acquire E.g. E.g. temperaturetemperature and and voltagevoltage in SensorNets in SensorNets

► Unfortunately, perfect correlations rarely exist Unfortunately, perfect correlations rarely exist False +ve’s and – False +ve’s and –ve’sve’s

Approach taken by Shivakumar et al [VLDB 1998]Approach taken by Shivakumar et al [VLDB 1998]► Our Approach: Choose different plans based on observed Our Approach: Choose different plans based on observed

attribute valuesattribute values Query always evaluated correctlyQuery always evaluated correctly Sometimes, observe attributes not involved in querySometimes, observe attributes not involved in query Sounds adaptive, but we generate complete plans Sounds adaptive, but we generate complete plans a prioria priori Applicable in both exact and approximate caseApplicable in both exact and approximate case

Page 12: Exploiting Correlated Attributes in Acquisitional Query Processing

ExampleExample

Light > 100 Lux

Temp < 20° C

Cost = 100Selectivity = .5

Cost = 100Selectivity = .5

Expected Cost = 150

Light > 100 Lux

Temp < 20° C

Cost = 100Selectivity = .5

Cost = 100Selectivity = .5

Expected Cost = 150

SELECT * FROM sensorsWHERE light < 100 Lux and temp > 20o

C

Page 13: Exploiting Correlated Attributes in Acquisitional Query Processing

ExampleExample

Light > 100 Lux

Temp < 20° C

Cost = 100Selectivity = .1

Cost = 100Selectivity = .9

Expected Cost = 110

Light > 100 Lux

Temp < 20° C

Cost = 100Selectivity = .1

Cost = 100Selectivity = .9

Time in [6pm, 6am]

T

F

SELECT * FROM sensorsWHERE light < 100 Lux and temp > 20o

C

A Conditional Plan

Page 14: Exploiting Correlated Attributes in Acquisitional Query Processing

Problem StatementProblem Statement► Given a conjunctive query of the formGiven a conjunctive query of the form

XX11 = x = x11 and and X X22 = x = x2 2 andand … … andand X Xmm = x = xmm

and additional attributes Xand additional attributes Xm+1m+1, …, X, …, Xn n not referenced in the not referenced in the query, find the optimal conditional planquery, find the optimal conditional plan

We will restrict ourselves to We will restrict ourselves to binarybinary conditional plans conditional plans

Observe XEvaluate predicate of form X > a

T

FTerminal Nodes

Return false

Return true

Page 15: Exploiting Correlated Attributes in Acquisitional Query Processing

Costing a conditional planCosting a conditional plan

X

< a

>= a

<a

>=a

Satisfied Predicates

Page 16: Exploiting Correlated Attributes in Acquisitional Query Processing

Costing a conditional planCosting a conditional plan

X

< a

>= a

<a

>=aCost(Cost() = C(X | ) = C(X | ))

+ P(X < a | + P(X < a | ) Cost() Cost(<a<a) ) + P(X >= a | + P(X >= a | ) Cost () Cost (>=a>=a))

Satisfied Predicates

= and (X < a)

Page 17: Exploiting Correlated Attributes in Acquisitional Query Processing

ComplexityComplexity► Given an oracle that can compute any conditional Given an oracle that can compute any conditional

probabilities in probabilities in O(1)O(1) time, deciding whether a plan time, deciding whether a plan with expected cost < with expected cost < KK exists is #P-hard exists is #P-hard Reduction from #3-SAT (counting version of 3-SAT)Reduction from #3-SAT (counting version of 3-SAT)

► Given a dataset Given a dataset DD, finding the optimal conditional , finding the optimal conditional plan for that dataset is NP-hardplan for that dataset is NP-hard Reduction from complexity of finding binary decision treesReduction from complexity of finding binary decision trees

Page 18: Exploiting Correlated Attributes in Acquisitional Query Processing

Solution StepsSolution Steps► Plan CostingPlan Costing

Need method to estimate conditional Need method to estimate conditional probabilities probabilities

► Plan EnumerationPlan Enumeration Exhaustive vs. heuristicExhaustive vs. heuristic

Page 19: Exploiting Correlated Attributes in Acquisitional Query Processing

Conditional Probability Conditional Probability EstimationEstimation

► We estimate conditional probabilities using We estimate conditional probabilities using observations over historical dataobservations over historical data

► Options:Options: Build a complete multidimensional distribution over Build a complete multidimensional distribution over

attributesattributes+ Can read off probabilities+ Can read off probabilities- Very large memory requirements- Very large memory requirements

Scan historical data as estimates are neededScan historical data as estimates are needed + Minimal memory requirements+ Minimal memory requirements

+ Can use random samples if datasets too large+ Can use random samples if datasets too large Build a model that allows quick estimation of probabilitiesBuild a model that allows quick estimation of probabilities

► E.g., graphical models E.g., graphical models ► Allows reasoning about unobserved eventsAllows reasoning about unobserved events► Avoids overfittingAvoids overfitting

Page 20: Exploiting Correlated Attributes in Acquisitional Query Processing

Solution StepsSolution Steps► Plan CostingPlan Costing

Need Method to Estimate Conditional Need Method to Estimate Conditional Probabilities Probabilities

► Plan EnumerationPlan Enumeration Exhaustive vs. heuristicExhaustive vs. heuristic

Page 21: Exploiting Correlated Attributes in Acquisitional Query Processing

Exhaustive SearchExhaustive Search►Dynamic programming applicableDynamic programming applicable

X

< a

>= a

<a

>=aSubproblem defined by conditioning predicates ()

Can solve independently

Page 22: Exploiting Correlated Attributes in Acquisitional Query Processing

Exhaustive SearchExhaustive Search►Dynamic programming applicableDynamic programming applicable

► Complexity: Complexity: O(KO(K2n2n))

► Prohibitive in most cases !!Prohibitive in most cases !! Even if we use branch-and-bound techniquesEven if we use branch-and-bound techniques

Page 23: Exploiting Correlated Attributes in Acquisitional Query Processing

Greedy Binary Split HeuristicGreedy Binary Split Heuristic► Uses optimal sequential plans as base case Uses optimal sequential plans as base case ► Chooses locally optimal splits to improve greedilyChooses locally optimal splits to improve greedily

Example Query:X1 = 1 and X2 = 1

X1 X21. Optimal Sequential Plan

X3

< 10

>= 10

Eg:

Use optimal sequential plans to solve the (smaller) subproblems

2. Check all possible splits:

Page 24: Exploiting Correlated Attributes in Acquisitional Query Processing

Greedy Binary Split HeuristicGreedy Binary Split Heuristic► Uses optimal sequential plans as base case Uses optimal sequential plans as base case ► Chooses locally optimal splits to improve greedilyChooses locally optimal splits to improve greedily

Example Query:X1 = 1 and X2 = 1

X1 X21. Optimal Sequential Plan

2. Check all possible splits:

X3

< 10

>= 10

Eg:

3. Choose locally optimal split

4. Recurse

Page 25: Exploiting Correlated Attributes in Acquisitional Query Processing

EvaluationEvaluation► Datasets from real deploymentsDatasets from real deployments

LabLab► 45 motes deployed in Intel Berkeley Lab45 motes deployed in Intel Berkeley Lab► 400,000 readings 400,000 readings ► Total 6 attributes; 3-predicate queries Total 6 attributes; 3-predicate queries

Garden-11: Garden-11: ► 11 motes deployed in a forest11 motes deployed in a forest► 3 attributes per mote, temperature, voltage, and humidity. 3 attributes per mote, temperature, voltage, and humidity. ► Total 34 attributes; 33-predicate queriesTotal 34 attributes; 33-predicate queries► Queries are issued against the sensor network as a wholeQueries are issued against the sensor network as a whole

Also experiemented with Garden-5, and synthetic datasetsAlso experiemented with Garden-5, and synthetic datasets► Separated Separated testtest from from trainingtraining► Randomly generated range queries for a given set of query Randomly generated range queries for a given set of query

variablesvariables► Java implementation, simulated execution based on cost modelJava implementation, simulated execution based on cost model

Costs represent data acquisition onlyCosts represent data acquisition only

Page 26: Exploiting Correlated Attributes in Acquisitional Query Processing

Example Plan: Lab DatasetExample Plan: Lab Dataset

Page 27: Exploiting Correlated Attributes in Acquisitional Query Processing

Garden-11 (1)Garden-11 (1)

Comparing Naive and Heuristic-10

00.20.40.60.81

1.2

>0.9 >1 >1.1 >1.5 >2 >2.5Performance Gain

Fraction of Experiments

Page 28: Exploiting Correlated Attributes in Acquisitional Query Processing

Garden-11 (2)Garden-11 (2)

Comparing CorrSeq and Heuristic-10

00.20.40.60.81

1.2

>0.85 >1 >1.1 >1.3 >1.5Performance Gain

Fraction of Experiments

Page 29: Exploiting Correlated Attributes in Acquisitional Query Processing

ExtensionsExtensions► Probabilistic queries with confidencesProbabilistic queries with confidences► General queriesGeneral queries

E.g. Disjunctive queriesE.g. Disjunctive queries► A large class of Join queriesA large class of Join queries

E.g. “Star” queries with K-FK join predicatesE.g. “Star” queries with K-FK join predicates► Existential queriesExistential queries

E.g., “tell me k answers to this query”E.g., “tell me k answers to this query”►Can order the observations so that tuples most likely to satisfy Can order the observations so that tuples most likely to satisfy

are observed firstare observed first► Adaptive conditional planningAdaptive conditional planning

With With eddieseddies

Page 30: Exploiting Correlated Attributes in Acquisitional Query Processing

Conditional Planning for Conditional Planning for Probabilistic QueriesProbabilistic Queries

► Basic idea similarBasic idea similar Cost computation is different; typically requires numerical Cost computation is different; typically requires numerical

integrationintegrationConditional Planning with BBQ

0

20

40

60

80

100

120

140

99.50% 99% 98% 95%

Confidence

Energy (mJ)

TinyDBBBQ-StaticBBQ-Static-Cond-5BBQ-DynamicBBQ-Dynamic-Cond-5

Page 31: Exploiting Correlated Attributes in Acquisitional Query Processing

ConclusionsConclusions► Large-scale distributed information Large-scale distributed information

systems systems Must acquire data carefullyMust acquire data carefully

►High disparate acquisition costsHigh disparate acquisition costs Exhibit strong temporal and spatial correlationsExhibit strong temporal and spatial correlations

► Conditional planning Conditional planning Change plans based on observed attribute Change plans based on observed attribute

valuesvalues Significant benefits even for traditional tasksSignificant benefits even for traditional tasks

►Many other opportunities to exploit such Many other opportunities to exploit such correlations…correlations…

Page 32: Exploiting Correlated Attributes in Acquisitional Query Processing

Questions ?Questions ?