Adaptive Query Processing for Data Aggregation: Mining, Using and Maintaining Source Statistics M.S Thesis Defense by Jianchun Fan Committee Members: Dr. Subbarao Kambhampati (chair) Dr. Huan Liu Dr. Yi Chen April 13, 2006

Page 1: Adaptive Query Processing for Data Aggregation:

Adaptive Query Processing for Data Aggregation: Mining, Using and Maintaining Source Statistics

M.S Thesis Defense

by Jianchun Fan

Committee Members:

Dr. Subbarao Kambhampati (chair)

Dr. Huan Liu

Dr. Yi Chen

April 13, 2006

Page 2: Adaptive Query Processing for Data Aggregation:

Introduction
• Data Aggregation: Vertical Integration

Mediator: R (A1, A2, A3, A4, A5, A6)

S1: R1 (A1, A2, _, _, A5, A6)

S2: R2 (A1, _, A3, A4, A5, A6)

S3: R3 (A1, A2, A3, A4, A5, _)

Page 3: Adaptive Query Processing for Data Aggregation:

Introduction
• Query Processing in Data Aggregation

– Sending every query to all sources?
• Increases the workload on sources
• Consumes a lot of network resources
• Keeps users waiting

– Primary processing task: selecting the most relevant sources with respect to different user objectives, such as completeness and quality of the answers and response time

– Several types of source statistics are needed to guide source selection
• Usually not directly available

Page 4: Adaptive Query Processing for Data Aggregation:

Introduction
• Challenges

– Automatically gather various types of source statistics to optimize individual goals
• Many answers (high coverage)
• Good answers (high density)
• Answered quickly (short latency)

– Combine different statistics to support multi-objective query processing

– Maintain statistics dynamically

Page 5: Adaptive Query Processing for Data Aggregation:

System Overview

Page 6: Adaptive Query Processing for Data Aggregation:

System Overview
• Test beds:

– Bibfinder: online bibliography mediator system, integrating DBLP, IEEE Xplore, CSB, Network Bibliography, ACM Digital Library, etc.

– Synthetic test bed: 30 synthetic data sources (based on the Yahoo! Auto database) with different coverage, density and latency characteristics.

Page 7: Adaptive Query Processing for Data Aggregation:

Outline
1. Introduction & Overview
2. Coverage/Overlap Statistics
3. Learning Density Statistics
4. Learning Latency Statistics
5. Multi-Objective Query Processing
6. Other Contributions
7. Conclusion

Page 8: Adaptive Query Processing for Data Aggregation:

Coverage/Overlap Statistics
• Coverage: how many answers a source provides for a given query
• Overlap: how many common answers a set of sources share for a given query
• Based on Nie & Kambhampati [ICDE 2004]
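The two definitions above can be illustrated with a small sketch. It is not the learning method of Nie & Kambhampati; it only computes coverage and overlap for one query from answer sets that are already known. The source names, the tuple identifiers, and the choice to normalize by the union of all answers are assumptions made for illustration.

```python
from typing import Dict, List, Set

def coverage(source: str, answers: Dict[str, Set[str]]) -> float:
    """Fraction of all known answers to the query that `source` returns."""
    all_answers = set().union(*answers.values())
    return len(answers[source]) / len(all_answers) if all_answers else 0.0

def overlap(sources: List[str], answers: Dict[str, Set[str]]) -> float:
    """Fraction of all known answers that every source in `sources` returns."""
    all_answers = set().union(*answers.values())
    common = set.intersection(*(answers[s] for s in sources))
    return len(common) / len(all_answers) if all_answers else 0.0

# Hypothetical answer sets (tuple ids) returned by three sources for one query.
answers = {
    "S1": {"t1", "t2", "t3", "t4"},
    "S2": {"t3", "t4", "t5"},
    "S3": {"t4", "t6"},
}

print(coverage("S1", answers))        # 4 of 6 distinct answers
print(overlap(["S1", "S2"], answers)) # t3 and t4 are shared: 2 of 6
```

A mediator that knows S1 and S2 overlap heavily can skip the second call when completeness gains would be small.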

Page 9: Adaptive Query Processing for Data Aggregation:

Density Statistics
• Coverage measures the “vertical completeness” of the answer set
• “Horizontal completeness” is important too: the quality of the individual answers

Density statistics measure the horizontal completeness of the individual answer tuples

Page 10: Adaptive Query Processing for Data Aggregation:

Defining Density
• Density of a source w.r.t. a given query: the average density of all answers

Select A1, A2, A3, A4 (projection attribute set)

From S

Where A1 > v1 (selection predicates)

For four answer tuples over (A1, A2, A3, A4) with individual densities 1, 0.5, 0.5 and 0.75:

Density = (1 + 0.5 + 0.5 + 0.75) / 4 = 0.6875

• Learning density for every possible source/query combination? Too costly
– The number of possible queries is exponential in the number of attributes
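As a minimal sketch of this definition, the density of one answer tuple can be taken as the fraction of projected attributes that are not missing, and the density of the answer set as the mean over its tuples. The concrete rows below are hypothetical; they are chosen so that the per-tuple densities are 1, 0.5, 0.5 and 0.75, whose average is 0.6875.

```python
from typing import Optional, Sequence

Row = Sequence[Optional[str]]

def tuple_density(row: Row) -> float:
    """Fraction of projected attributes in one answer that are not missing."""
    return sum(value is not None for value in row) / len(row)

def answer_set_density(rows: Sequence[Row]) -> float:
    """Average tuple density over all answers returned for a query."""
    return sum(tuple_density(row) for row in rows) / len(rows)

# Hypothetical answers for "Select A1, A2, A3, A4 From S Where A1 > v1".
answers = [
    ("a", "b", "c", "d"),    # density 1.0
    ("a", None, "c", None),  # density 0.5
    ("a", "b", None, None),  # density 0.5
    ("a", "b", "c", None),   # density 0.75
]

print(answer_set_density(answers))  # 0.6875
```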

Page 11: Adaptive Query Processing for Data Aggregation:

Learning Density Statistics
• A more realistic solution: classify the queries and learn density statistics only w.r.t. the classes

Select A1, A2, A3, A4 (projection attribute set)

From S

Where A1 > v1 (selection predicates)

• Assumption: if a tuple t represents a real-world entity E, then whether or not t has a missing value on attribute A is independent of E’s actual value of A.

Page 12: Adaptive Query Processing for Data Aggregation:

Learning Density Statistics
• Query class for density statistics: the projection attribute set
• For queries whose projection attribute set is (A1, A2, …, Am), there are 2^m different types of answers

For the projection attribute set (A1, A2), there are 2^2 different density patterns:

dp1 = (A1, A2)

dp2 = (A1, ~A2)

dp3 = (~A1, A2)

dp4 = (~A1, ~A2)

Density([A1, A2] | S) = P(dp1 | S) * 1.0 + P(dp2 | S) * 0.5 + P(dp3 | S) * 0.5 + P(dp4 | S) * 0.0
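The weighted sum above generalizes to any projection attribute set: each density pattern contributes its probability times the fraction of attributes it has present. The sketch below assumes the pattern probabilities are already known; the numbers are hypothetical and only serve to exercise the formula.

```python
from typing import Dict, Tuple

Pattern = Tuple[bool, ...]  # True = attribute present, False = missing (~A)

def density_from_patterns(pattern_probs: Dict[Pattern, float]) -> float:
    """Sum over density patterns of P(pattern | S) * fraction of attributes present."""
    return sum(
        prob * sum(pattern) / len(pattern)
        for pattern, prob in pattern_probs.items()
    )

# Hypothetical P(dp | S) for the projection attribute set (A1, A2).
pattern_probs = {
    (True, True): 0.5,    # dp1 = (A1, A2),   weight 1.0
    (True, False): 0.25,  # dp2 = (A1, ~A2),  weight 0.5
    (False, True): 0.25,  # dp3 = (~A1, A2),  weight 0.5
    (False, False): 0.0,  # dp4 = (~A1, ~A2), weight 0.0
}

print(density_from_patterns(pattern_probs))  # 0.75
```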

Page 13: Adaptive Query Processing for Data Aggregation:

Learning Density Statistics
R(A1, A2, …, An)

2^n possible projection attribute sets:

(A1), (A1, A2), (A1, A3), …, (A1, A2, …, Am)

2^m possible density patterns for a projection attribute set (A1, A2, …, Am):

(A1, A2, …, Am), (~A1, A2, …, Am), (~A1, ~A2, …, Am), …, (~A1, ~A2, …, ~Am)

For each data source S, the mediator needs to estimate joint probabilities!

Page 14: Adaptive Query Processing for Data Aggregation:

Learning Density Statistics
• Independence Assumption: the probability of tuple t having a missing value on attribute A1 is independent of whether or not t has a missing value on attribute A2.

• For queries whose projection attribute set is (A1, A2, …, Am), only m probability values need to be assessed for each source!

Joint distribution: P(A1, ~A2 | S) = P(A1 | S) * (1 - P(A2 | S))

The probabilities are learned from a sample of the data source
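A minimal sketch of this step, assuming the sample is available as a list of rows with missing values stored as None: estimate P(Ai | S) per attribute from the sample, then multiply the per-attribute terms to get the probability of any density pattern. The sample rows and function names are hypothetical. With only the m per-attribute probabilities, the density for a projection attribute set is simply their mean (by linearity of expectation).

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]

def present_probabilities(sample: List[Row], attributes: List[str]) -> Dict[str, float]:
    """Estimate P(A | S): the fraction of sampled tuples with a value on A."""
    return {
        a: sum(row.get(a) is not None for row in sample) / len(sample)
        for a in attributes
    }

def pattern_probability(pattern: Dict[str, bool], p_present: Dict[str, float]) -> float:
    """P(pattern | S) under the independence assumption (True = present)."""
    prob = 1.0
    for attr, present in pattern.items():
        prob *= p_present[attr] if present else 1.0 - p_present[attr]
    return prob

# Hypothetical sample drawn from source S.
sample = [
    {"A1": "x", "A2": "y"},
    {"A1": "x", "A2": None},
    {"A1": "x", "A2": "y"},
    {"A1": None, "A2": "y"},
]

p = present_probabilities(sample, ["A1", "A2"])
print(p)                                                 # P(A1 | S) and P(A2 | S)
print(pattern_probability({"A1": True, "A2": False}, p)) # P(A1, ~A2 | S)
print(sum(p.values()) / len(p))                          # density of (A1, A2)
```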

Page 16: Adaptive Query Processing for Data Aggregation:

Latency Statistics
• Existing work: source-specific measurement of response time
– Variations by time, day of the week, quantity of data, etc.
• However, latency is often query specific
– For example, some attributes are indexed
• How to classify queries to learn latency?
– Binding pattern

[Figure: example queries marked as having the same or different binding patterns]

Page 17: Adaptive Query Processing for Data Aggregation:

Latency Statistics

Page 18: Adaptive Query Processing for Data Aggregation:

Using Latency Statistics
• Learning is straightforward: average over a group of training queries for each binding pattern

• Effectiveness of binding-pattern-based latency statistics
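The averaging step can be sketched as follows. The encoding of a binding pattern as a string of "b" (bound) and "f" (free) flags, the attribute names, and the training latencies are all assumptions for illustration; the slides do not specify a representation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def binding_pattern(query: Dict[str, str], attributes: List[str]) -> str:
    """Encode which attributes a query binds: 'b' = bound, 'f' = free."""
    return "".join("b" if a in query else "f" for a in attributes)

def learn_latency(
    training: List[Tuple[Dict[str, str], float]], attributes: List[str]
) -> Dict[str, float]:
    """Average the observed latency (seconds) of training queries per binding pattern."""
    sums = defaultdict(lambda: [0.0, 0])  # pattern -> [total latency, query count]
    for query, seconds in training:
        entry = sums[binding_pattern(query, attributes)]
        entry[0] += seconds
        entry[1] += 1
    return {bp: total / count for bp, (total, count) in sums.items()}

# Hypothetical attributes of a bibliography source and observed latencies.
attributes = ["title", "author", "year"]
training = [
    ({"title": "query processing"}, 2.0),
    ({"title": "data integration"}, 4.0),
    ({"author": "Smith"}, 0.5),
    ({"author": "Jones"}, 1.5),
]

latency = learn_latency(training, attributes)
print(latency)  # {'bff': 3.0, 'fbf': 1.0}
```

A new query is then assigned the learned average of its own binding pattern, so two queries that bind the same attributes with different values share one estimate.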

Page 20: Adaptive Query Processing for Data Aggregation:

Multi-Objective Query Processing
• Users may not be easy to please…
– “Give me some good answers fast”
– “I need many good answers”
– …

• These goals are often conflicting!
– A decoupled optimization strategy won’t work
– Example:
• S1 (coverage = 0.60, density = 0.10)
• S2 (coverage = 0.55, density = 0.15)
• S3 (coverage = 0.50, density = 0.50)

Page 21: Adaptive Query Processing for Data Aggregation:

Multi-Objective Query Processing
• The mediator needs to select sources that are good in many dimensions
– “Overall optimality”

• Query selection plans can be viewed as 3-dimensional vectors

• Option 1: Pareto optimal set
• Option 2: aggregating multi-dimensional vectors into scalar utility values
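Both options can be sketched on small vectors. The code below is a generic illustration, not the thesis's utility model: the coverage and density of S1 to S3 come from the earlier example, while S4, all latency values, the negation of latency so that larger is always better, and the weights are assumptions.

```python
from typing import Dict, Set, Tuple

Vector = Tuple[float, float, float]  # (coverage, density, -latency in seconds)

def dominates(a: Vector, b: Vector) -> bool:
    """True if a is at least as good as b in every dimension and better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_set(plans: Dict[str, Vector]) -> Set[str]:
    """Option 1: keep every plan that no other plan dominates."""
    return {
        name
        for name, v in plans.items()
        if not any(dominates(w, v) for other, w in plans.items() if other != name)
    }

def weighted_utility(v: Vector, weights: Vector) -> float:
    """Option 2: collapse a vector into one scalar with a weighted sum."""
    return sum(w * x for w, x in zip(weights, v))

plans = {
    "S1": (0.60, 0.10, -2.0),
    "S2": (0.55, 0.15, -1.0),
    "S3": (0.50, 0.50, -3.0),
    "S4": (0.45, 0.40, -4.0),  # hypothetical source, worse than S3 everywhere
}

print(sorted(pareto_set(plans)))  # ['S1', 'S2', 'S3']

# A user who ignores latency versus one who gives it some weight.
for weights in [(0.5, 0.5, 0.0), (0.4, 0.4, 0.2)]:
    best = max(plans, key=lambda name: weighted_utility(plans[name], weights))
    print(weights, best)
```

The Pareto set removes only S4 here, which shows why it can stay large, whereas a scalar utility always yields a single ranking that shifts with the user's weights.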

Page 22: Adaptive Query Processing for Data Aggregation:

Combining Density and Coverage

Page 23: Adaptive Query Processing for Data Aggregation:

Combining Density and Coverage

Page 24: Adaptive Query Processing for Data Aggregation:

Combining Density and Coverage

Page 25: Adaptive Query Processing for Data Aggregation:

Multi-Objective Query Processing
• Discount model

• Weighted sum model

2D coverage

Page 26: Adaptive Query Processing for Data Aggregation:

Multi-Objective Query Processing

Page 28: Adaptive Query Processing for Data Aggregation:

Other Contributions
• Incremental Statistics Maintenance (in thesis)

Page 29: Adaptive Query Processing for Data Aggregation:

Other Contributions
• A snapshot of public web services (not in thesis) [SIGMOD Record, Mar. 2005]

Implications and lessons learned:

• Most publicly available web services support simple data sensing and conversion, and can be viewed as distributed data sources

• Discovery/retrieval of public web services is not beyond what commercial search engines do.

• Composition:

• Very few services are available, with little correlation among them

• Most composition problems can be solved with existing data integration techniques

Page 30: Adaptive Query Processing for Data Aggregation:

Other Contributions
• Query Processing over Incomplete Autonomous Databases [with Hemal Khatri]
– Retrieving uncertain answers where constrained attributes are missing

– Learning Approximate Functional Dependencies and classifiers to reformulate the original user queries

Select * from cars where model = “civic”

Make | Model | Year | Location | Mileage | Price | Body Style | Color
Honda | Civic | 2000 | Tempe | 40,000 | 6,000 | Coupe | Red
Honda | Civic | 1998 | Phoenix | 60,500 | 5,000 | Coupe | White
Honda | Civic | 2001 | Chandler | 35,000 | 8,000 | Sedan | NULL
Honda | NULL | 2000 | Tempe | 50,000 | 6,800 | Sedan | Beige
Honda | NULL | 1999 | Mesa | 55,000 | 5,750 | Coupe | Silver

(Make, Body Style) → Model

Q1: select * from cars where make = Honda and BodyStyle = “sedan”

Q2: select * from cars where make = Honda and BodyStyle = “coupe”
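The rewriting idea in this example can be sketched in a few lines, assuming the dependency (Make, Body Style) → Model is already given rather than learned, and leaving out the classifier that would rank the uncertain answers. The rows repeat a subset of the columns of the table above; the function names are hypothetical.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]

rows: List[Row] = [
    {"Make": "Honda", "Model": "Civic", "Year": 2000, "BodyStyle": "Coupe"},
    {"Make": "Honda", "Model": "Civic", "Year": 1998, "BodyStyle": "Coupe"},
    {"Make": "Honda", "Model": "Civic", "Year": 2001, "BodyStyle": "Sedan"},
    {"Make": "Honda", "Model": None, "Year": 2000, "BodyStyle": "Sedan"},
    {"Make": "Honda", "Model": None, "Year": 1999, "BodyStyle": "Coupe"},
]

def certain_answers(rows: List[Row], attr: str, value: str) -> List[Row]:
    """Tuples that satisfy the original selection predicate."""
    return [r for r in rows if r[attr] == value]

def rewritten_queries(certain: List[Row], determining: List[str]) -> List[Row]:
    """One rewritten query per distinct value combination of the determining set."""
    combos = dict.fromkeys(tuple(r[a] for a in determining) for r in certain)
    return [dict(zip(determining, combo)) for combo in combos]

def uncertain_answers(rows: List[Row], attr: str, queries: List[Row]) -> List[Row]:
    """Tuples missing the constrained attribute that match a rewritten query."""
    return [
        r for r in rows
        if r[attr] is None and any(all(r[k] == v for k, v in q.items()) for q in queries)
    ]

certain = certain_answers(rows, "Model", "Civic")
queries = rewritten_queries(certain, ["Make", "BodyStyle"])
print(queries)  # corresponds to Q2 (coupe) and Q1 (sedan)
print([r["Year"] for r in uncertain_answers(rows, "Model", queries)])  # [2000, 1999]
```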

Page 31: Adaptive Query Processing for Data Aggregation:

Conclusion
• A comprehensive framework

– Automatically learns several types of source statistics

– Uses statistics to support various query processing goals
• Optimization in individual dimensions (coverage, density and latency)
• Joint optimization over multiple objectives
• Adaptive to different users’ own preferences

– Dynamically maintains source statistics
