Exploiting Correlated Attributes in Acquisitional Query Processing

Amol Deshpande, University of Maryland
Joint work with Carlos Guestrin @CMU, Sam Madden @MIT, Wei Hong @Intel Research

Motivation
► Emergence of large-scale distributed information systems
    Web, Grid, Sensor Networks, etc.
► Characterized by high data acquisition costs
    Data doesn't reside on the local disk; must acquire it
    Goal: reduce data acquisition as much as possible
[Figure: information sources reached over a network, e.g. temperature at location 1, humidity at location 2]

Motivation
► Many of these systems exhibit strong data correlations
    E.g. in sensor networks, data is correlated:
    ► Temperature and voltage
    ► Temperature and light
    ► Temperature and humidity
    ► Temperature and time of day
    ► etc.
► Observing one attribute gives information about other attributes

Motivation
► Database systems typically ignore correlations
► Our agenda:
    Exploit the naturally occurring spatial and temporal correlations in novel ways to reduce data acquisition
► Model-driven Acquisition in Sensor Networks [VLDB'04]
    A new way to look at data management
► This talk:
    A more traditional query optimization question
    Significant benefits possible even here
    ► Even in traditional domains

Conjunctive Queries
► Queries of the form:
    SELECT X1, X2, X3
    WHERE pred1(X1) AND pred2(X2) AND pred3(X3)
► Common in acquisitional environments
    Sensor networks:
    ► Find sensors reporting temperature between 10°C and 20°C and light less than 100 Lux
    ► Are there any sensors not within the above bounds?
    Web data sources:
    ► What fraction of people who donated to Gore in 2000 also have patents?

Conjunctive Queries
► Queries of the form:
    SELECT X1, X2, X3
    WHERE X1 = x1 AND X2 = x2 AND X3 = x3
► Current approach: Sequential Plans
    Choose a single order of attributes to observe, for all tuples
    E.g. the order X1, X2, X3:
        Observe X1
        If X1 = x1, then observe X2
        If X1 = x1 and X2 = x2, then observe X3
    Observing may involve turning a sensor on and sampling, or querying a web index over the network
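
The sequential plan above can be sketched in a few lines. This is a minimal illustration, not the talk's implementation: `run_sequential_plan` and the `acquire` callback are hypothetical names, and `acquire` stands in for whatever costly step (sampling a sensor, querying a web index) produces an attribute value.

```python
from typing import Callable, Dict, List, Tuple


def run_sequential_plan(
    order: List[str],
    query: Dict[str, int],
    acquire: Callable[[str], int],
) -> Tuple[bool, List[str]]:
    """Evaluate X1 = x1 AND X2 = x2 AND ... in one fixed attribute order.

    Returns the query result and the attributes that had to be acquired.
    """
    observed: List[str] = []
    for attr in order:
        value = acquire(attr)          # the expensive acquisition step
        observed.append(attr)
        if value != query[attr]:
            return False, observed     # stop early: the conjunction has failed
    return True, observed


# One tuple, represented as a lookup table (made-up values for illustration).
reading = {"X1": 1, "X2": 0, "X3": 1}
result, observed = run_sequential_plan(
    ["X1", "X2", "X3"],
    {"X1": 1, "X2": 1, "X3": 1},
    reading.__getitem__,
)
print(result, observed)
```

For this tuple the plan stops after X2, so X3 is never acquired. The same order is used for every tuple, which is the limitation conditional plans address.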

Finding Sequential Plans
► Naïve:
    Order predicates by cost/(1 - selectivity)
    ► selectivity = fraction of tuples that pass the predicate
    Ignores correlations
    Used by most current query optimizers
► Optimal:
    Find the optimal order considering correlations
    NP-Hard
► A 4-approximate solution:
    Greedily choose the attribute that maximizes the return
    [Munagala, Babu, Motwani, Widom, ICDT 05]
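
The naïve ordering rule can be sketched as follows. The function names and the three predicates with their costs and selectivities are made up for illustration; the expected-cost function assumes predicates are independent, which is exactly the assumption the naïve approach makes and correlated data violates.

```python
from typing import Dict, List, Tuple

# name -> (acquisition cost, selectivity); values are illustrative only
Preds = Dict[str, Tuple[float, float]]


def naive_order(preds: Preds) -> List[str]:
    """Sort predicates by cost / (1 - selectivity), smallest rank first."""
    def rank(item):
        cost, sel = item[1]
        # A predicate that never filters anything goes last.
        return float("inf") if sel >= 1.0 else cost / (1.0 - sel)
    return [name for name, _ in sorted(preds.items(), key=rank)]


def expected_cost_independent(order: List[str], preds: Preds) -> float:
    """Expected per-tuple cost of a fixed order, assuming independence."""
    total, p_reach = 0.0, 1.0
    for name in order:
        cost, sel = preds[name]
        total += p_reach * cost    # pay only if all earlier predicates passed
        p_reach *= sel
    return total


preds = {"X1": (100.0, 0.9), "X2": (100.0, 0.1), "X3": (10.0, 0.5)}
order = naive_order(preds)
print(order, expected_cost_independent(order, preds))
```

Here the cheap predicate X3 is evaluated first and the highly selective X2 second; the unselective, expensive X1 is left for last.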

Observation
► Cheap, correlated attributes can improve planning
    By improving estimates of selectivities
    Extreme case, perfect correlation between X4 and X1:
        Observing X4 is sufficient to decide whether X1 = x1 is true
        ► Would be a better plan if X4 is cheaper to acquire
        E.g. temperature and voltage in SensorNets
    ► Unfortunately, perfect correlations rarely exist
        False +ve's and -ve's
        Approach taken by Shivakumar et al [VLDB 1998]
► Our Approach: Choose different plans based on observed attribute values
    Query always evaluated correctly
    Sometimes, observe attributes not involved in the query
    Sounds adaptive, but we generate complete plans a priori
    Applicable in both exact and approximate case

Example
    SELECT * FROM sensors
    WHERE light < 100 Lux AND temp > 20°C
[Figure: the two possible sequential plans over the nodes "Light > 100 Lux" and "Temp < 20° C". Each predicate has Cost = 100 and Selectivity = .5, so either order has Expected Cost = 150.]

Example: A Conditional Plan
    SELECT * FROM sensors
    WHERE light < 100 Lux AND temp > 20°C
[Figure: a conditional plan that first tests "Time in [6pm, 6am]" and follows a different sequential plan over "Light > 100 Lux" and "Temp < 20° C" on the T and F branches. Conditioned on the time test, the predicates have Cost = 100 each and Selectivities .1 and .9; Expected Cost = 110.]
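
The two expected costs in this example can be reproduced with one line of arithmetic each. The sketch below assumes that observing the time of day is free and that, on each branch of the time test, the predicate with selectivity .1 is evaluated first; the 50/50 split between the branches is a made-up value and does not affect the result because both branches cost the same.

```python
def two_predicate_cost(first_cost: float, first_selectivity: float,
                       second_cost: float) -> float:
    """Expected cost of testing one predicate, then the other only if it passes."""
    return first_cost + first_selectivity * second_cost


# Sequential plan: both predicates cost 100 and have selectivity .5.
sequential = two_predicate_cost(100, 0.5, 100)

# Conditional plan: after the (assumed free) time test, the first predicate
# evaluated on each branch has selectivity .1.
p_true_branch = 0.5  # assumption for illustration only
conditional = (p_true_branch * two_predicate_cost(100, 0.1, 100)
               + (1 - p_true_branch) * two_predicate_cost(100, 0.1, 100))

print(sequential, conditional)
```

The saving comes entirely from the second acquisition: it is needed for half the tuples under the sequential plan but only a tenth of them under the conditional plan.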

Problem Statement
► Given a conjunctive query of the form
    X1 = x1 and X2 = x2 and … and Xm = xm
  and additional attributes Xm+1, …, Xn not referenced in the query, find the optimal conditional plan
► We will restrict ourselves to binary conditional plans
[Figure: a binary conditional plan. Each internal node observes an attribute X and evaluates a predicate of the form X > a, with T and F branches. Terminal nodes return true or return false.]

Costing a conditional plan
[Figure: a plan node that observes X and splits into a "< a" branch and a ">= a" branch, each leading to a subplan. The predicates already satisfied on the path to the node are carried along.]
    Let T be the plan rooted at this node, T<a and T>=a its two subplans, and φ the predicates satisfied so far:
    Cost(T) = C(X | φ) + P(X < a | φ) Cost(T<a) + P(X >= a | φ) Cost(T>=a)
    Satisfied predicates for the subplan T<a: φ and (X < a)
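
The recursive costing rule can be sketched as a small tree walk: the cost of a node is the cost of observing its attribute given what is already known, plus the probability-weighted costs of its two subplans, with the branch condition added to the satisfied predicates on the way down. The class names, the `acq_cost` and `prob_below` callbacks, and the numbers are all hypothetical. Treating an attribute that was already observed on the path as free is one reading of the conditional acquisition cost, stated here as an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Tuple, Union

Cond = Tuple[str, str, float]          # (attribute, "<" or ">=", threshold)
Conds = Tuple[Cond, ...]


@dataclass
class Leaf:
    cost: float = 0.0                  # terminal node: nothing left to acquire


@dataclass
class Split:
    attr: str
    threshold: float
    below: "Plan"                      # subplan followed when attr < threshold
    above: "Plan"                      # subplan followed when attr >= threshold


Plan = Union[Leaf, Split]


def expected_cost(plan: Plan, conds: Conds,
                  acq_cost: Callable[[str], float],
                  prob_below: Callable[[str, float, Conds], float]) -> float:
    """Expected acquisition cost of `plan` given the satisfied predicates `conds`."""
    if isinstance(plan, Leaf):
        return plan.cost
    already_seen = any(a == plan.attr for a, _, _ in conds)
    c = 0.0 if already_seen else acq_cost(plan.attr)   # assumption: re-reads are free
    p = prob_below(plan.attr, plan.threshold, conds)
    lo = conds + ((plan.attr, "<", plan.threshold),)
    hi = conds + ((plan.attr, ">=", plan.threshold),)
    return (c + p * expected_cost(plan.below, lo, acq_cost, prob_below)
              + (1 - p) * expected_cost(plan.above, hi, acq_cost, prob_below))


# Toy plan: observe X (cost 1); only if X >= 10 go on to observe Y (cost 10).
plan = Split("X", 10, Leaf(), Split("Y", 5, Leaf(), Leaf()))
costs = {"X": 1.0, "Y": 10.0}
probs = {"X": 0.75, "Y": 0.5}          # made-up P(attr < threshold), ignoring conds
total = expected_cost(plan, (), costs.__getitem__, lambda a, t, c: probs[a])
print(total)
```

In a real planner `prob_below` would consult the conditional probability estimates, so the same subtree can have different costs under different satisfied predicates.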

Complexity
► Given an oracle that can compute any conditional probabilities in O(1) time, deciding whether a plan with expected cost < K exists is #P-hard
    Reduction from #3-SAT (counting version of 3-SAT)
► Given a dataset D, finding the optimal conditional plan for that dataset is NP-hard
    Reduction from complexity of finding binary decision trees

Solution Steps
► Plan Costing
    Need method to estimate conditional probabilities
► Plan Enumeration
    Exhaustive vs. heuristic

Conditional Probability Estimation
► We estimate conditional probabilities using observations over historical data
► Options:
    Build a complete multidimensional distribution over attributes
        + Can read off probabilities
        - Very large memory requirements
    Scan historical data as estimates are needed
        + Minimal memory requirements
        + Can use random samples if datasets too large
    Build a model that allows quick estimation of probabilities
        ► E.g., graphical models
        ► Allows reasoning about unobserved events
        ► Avoids overfitting
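
The "scan historical data as estimates are needed" option amounts to counting. The sketch below is one possible shape for it, with a hypothetical `cond_prob` helper and a handful of made-up rows that are not taken from any real deployment.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, float]
Pred = Callable[[Row], bool]


def cond_prob(rows: List[Row], event: Pred, given: List[Pred]) -> Optional[float]:
    """Estimate P(event | given) as a frequency over the historical rows.

    Returns None when no row satisfies the conditions, since a plain scan
    cannot say anything about events it has never observed.
    """
    matching = [r for r in rows if all(g(r) for g in given)]
    if not matching:
        return None
    return sum(1 for r in matching if event(r)) / len(matching)


# Made-up history: light is usually low at night and high during the day.
history = [
    {"night": 1, "light": 20}, {"night": 1, "light": 40},
    {"night": 1, "light": 150}, {"night": 1, "light": 10},
    {"night": 0, "light": 300}, {"night": 0, "light": 80},
    {"night": 0, "light": 500}, {"night": 0, "light": 250},
]

dark = lambda r: r["light"] < 100
p_overall = cond_prob(history, dark, [])
p_night = cond_prob(history, dark, [lambda r: r["night"] == 1])
p_day = cond_prob(history, dark, [lambda r: r["night"] == 0])
print(p_overall, p_night, p_day)
```

The `None` case is where a model-based estimator helps: it can still give an answer for a combination of conditions that never occurs in the history, which a direct scan cannot.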

Solution Steps
► Plan Costing
    Need method to estimate conditional probabilities
► Plan Enumeration
    Exhaustive vs. heuristic

Exhaustive Search
► Dynamic programming applicable
    [Figure: a split on X into "< a" and ">= a" branches. Each subproblem is defined by its conditioning predicates and can be solved independently.]
► Complexity: O(K^(2n))
► Prohibitive in most cases!
    Even if we use branch-and-bound techniques

Greedy Binary Split Heuristic
► Uses optimal sequential plans as base case
► Chooses locally optimal splits to improve greedily
Example query: X1 = 1 and X2 = 1
    1. Optimal sequential plan over X1 and X2
    2. Check all possible splits, e.g. X3 < 10 vs. X3 >= 10
        Use optimal sequential plans to solve the (smaller) subproblems
    3. Choose locally optimal split
    4. Recurse
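
The four steps above can be sketched as a short recursive procedure over historical rows. Everything here is an illustrative assumption rather than the talk's implementation: the function names, the brute-force search over predicate orders (workable only for a handful of predicates), the `depth` limit on recursion, the rule that an attribute already observed on the path is free, and the made-up rows and costs.

```python
from itertools import permutations


def best_sequential(rows, preds, cost, observed):
    """Step 1: cheapest fixed predicate order on `rows`, by brute force."""
    if not rows:
        return 0.0, tuple(preds)
    best = (float("inf"), None)
    for order in permutations(preds):
        total, alive = 0.0, rows
        for attr in order:
            c = 0.0 if attr in observed else cost[attr]
            total += c * len(alive) / len(rows)   # paid only by tuples still alive
            alive = [r for r in alive if preds[attr](r[attr])]
        if total < best[0]:
            best = (total, order)
    return best


def greedy_plan(rows, preds, cost, observed=frozenset(), depth=1):
    """Steps 2-4: try every binary split, keep the best one, recurse."""
    base_cost, order = best_sequential(rows, preds, cost, observed)
    if depth == 0 or not rows:
        return base_cost, ("seq", order)
    best_cost, best_split = base_cost, None
    for attr in cost:                              # query and non-query attributes
        c = 0.0 if attr in observed else cost[attr]
        for thr in sorted({r[attr] for r in rows})[1:]:
            lo = [r for r in rows if r[attr] < thr]
            hi = [r for r in rows if r[attr] >= thr]
            obs = observed | {attr}
            total = (c + len(lo) / len(rows) * best_sequential(lo, preds, cost, obs)[0]
                       + len(hi) / len(rows) * best_sequential(hi, preds, cost, obs)[0])
            if total < best_cost:
                best_cost, best_split = total, (attr, thr, lo, hi)
    if best_split is None:
        return base_cost, ("seq", order)           # no split beats the sequential plan
    attr, thr, lo, hi = best_split
    obs = observed | {attr}
    c = 0.0 if attr in observed else cost[attr]
    lo_cost, lo_plan = greedy_plan(lo, preds, cost, obs, depth - 1)
    hi_cost, hi_plan = greedy_plan(hi, preds, cost, obs, depth - 1)
    total = c + len(lo) / len(rows) * lo_cost + len(hi) / len(rows) * hi_cost
    return total, ("split", attr, thr, lo_plan, hi_plan)


# Made-up history for the query: light < 100 AND temp > 20.
rows = [
    {"night": 0, "light": 300, "temp": 25}, {"night": 0, "light": 400, "temp": 26},
    {"night": 0, "light": 500, "temp": 24}, {"night": 0, "light": 50, "temp": 27},
    {"night": 1, "light": 10, "temp": 15}, {"night": 1, "light": 20, "temp": 14},
    {"night": 1, "light": 30, "temp": 22}, {"night": 1, "light": 15, "temp": 12},
]
preds = {"light": lambda v: v < 100, "temp": lambda v: v > 20}
cost = {"night": 1.0, "light": 100.0, "temp": 100.0}
plan_cost, plan = greedy_plan(rows, preds, cost)
print(plan_cost, plan)
```

On this toy data the heuristic splits on the cheap `night` attribute, which is not part of the query, and then uses opposite predicate orders on the two branches, mirroring the time-of-day example earlier in the talk.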

Evaluation
► Datasets from real deployments
    Lab:
    ► 45 motes deployed in Intel Berkeley Lab
    ► 400,000 readings
    ► Total 6 attributes; 3-predicate queries
    Garden-11:
    ► 11 motes deployed in a forest
    ► 3 attributes per mote: temperature, voltage, and humidity
    ► Total 34 attributes; 33-predicate queries
    ► Queries are issued against the sensor network as a whole
    Also experimented with Garden-5 and synthetic datasets
► Separated test from training
► Randomly generated range queries for a given set of query variables
► Java implementation, simulated execution based on cost model
    Costs represent data acquisition only

Example Plan: Lab Dataset

Garden-11 (1)
[Figure: Comparing Naive and Heuristic-10. x-axis: Performance Gain (>0.9, >1, >1.1, >1.5, >2, >2.5); y-axis: Fraction of Experiments (0 to 1.2).]

Garden-11 (2)
[Figure: Comparing CorrSeq and Heuristic-10. x-axis: Performance Gain (>0.85, >1, >1.1, >1.3, >1.5); y-axis: Fraction of Experiments (0 to 1.2).]

Extensions
► Probabilistic queries with confidences
► General queries
    E.g. disjunctive queries
► A large class of join queries
    E.g. "star" queries with K-FK join predicates
► Existential queries
    E.g., "tell me k answers to this query"
    ► Can order the observations so that tuples most likely to satisfy are observed first
► Adaptive conditional planning
    With eddies

Conditional Planning for Probabilistic Queries
► Basic idea similar
    Cost computation is different; typically requires numerical integration
[Figure: Conditional Planning with BBQ. x-axis: Confidence (99.50%, 99%, 98%, 95%); y-axis: Energy (mJ), 0 to 140; series: TinyDB, BBQ-Static, BBQ-Static-Cond-5, BBQ-Dynamic, BBQ-Dynamic-Cond-5.]

Conclusions
► Large-scale distributed information systems
    Must acquire data carefully
    ► Highly disparate acquisition costs
    Exhibit strong temporal and spatial correlations
► Conditional planning
    Change plans based on observed attribute values
    Significant benefits even for traditional tasks
► Many other opportunities to exploit such correlations…

Questions?