

An Investigation of the cost and accuracy tradeoffs of Supplanting AFDs with Bayes Network in Query Processing in the Presence of Incompleteness in Autonomous Databases

MS Thesis Defense
Rohit Raghunathan
August 19th, 2011

Committee Members
Dr. Subbarao Kambhampati (Chair)
Dr. Joohyung Lee
Dr. Huan Liu


Overview of the talk

• Introduction to Incomplete Autonomous Databases

• Overview of QPIAD and shortcomings of AFD-based approaches

• Our approach: Bayes network based imputation and query rewriting


Introduction to Web databases

• Many websites allow user queries through a form-based interface and are supported by backend databases

• Consider used-car selling websites such as Cars.com, Yahoo! Autos, etc.

[Figure: autonomous database]


Incompleteness in Web databases

• Web databases are often populated by lay individuals without any curation, e.g., Cars.com, Yahoo! Autos

• Web databases are being populated using automated information extraction techniques which are inherently imperfect

• Incomplete/Uncertain tuple: A tuple in which one or more attributes have a missing value

Website          # of attributes   # of tuples   Incomplete tuples
Autotrader.com   13                25127         33.67%
Carsdirect.com   14                32564         98.74%


Problem Statement

• Many entities corresponding to tuples with missing values might be relevant to the user query

• Traditional query processing does not retrieve such tuples

Q: Make = Honda

Make   Model    Year   Body
Null   Accord   2003   Sedan


Dimensions of the problem

• Single vs. multiple missing values
– Multiple missing values require capturing the correlations between them

• Imputation vs. Query Rewriting
– Imputation can look at all available evidence
– Query Rewriting requires finding the smallest set of evidence

• Looking at all the evidence -> reduces throughput

• Looking at very little evidence -> reduces precision

• Need to find a middle ground

ID  Make  Model  Year  Body   Mileage
1   Audi               Sedan  20000
2   Audi  A8           Sedan  15000
3   Audi         2005  Sedan  23000

User query Q: Model = A8

Rewritten query: Make = Audi ^ Body = Sedan


Approximate Functional Dependencies (AFDs)

• AFDs are Functional Dependencies that hold on all but a small fraction of the database

Make    Model   Body
Honda   Civic   Sedan
Honda   Civic   Coupe
Honda   Civic   Sedan
Honda   Civic   Sedan

Model → Body : 0.75

• An AFD is of the form X → A, where X is a set of attributes and A is a single attribute

• An attribute can have multiple rules

Model → Make : 1.0

Make → Body : 0.75
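
The confidence values on this slide can be reproduced with a short sketch. It assumes the common definition of AFD confidence as the largest fraction of tuples that can be kept so that the dependency holds exactly. The function name `afd_confidence` and the list-of-dicts table are illustrative and are not taken from the thesis.

```python
from collections import Counter, defaultdict

def afd_confidence(rows, lhs, rhs):
    """Fraction of tuples kept when, for each lhs value, only the tuples
    carrying its most frequent rhs value are retained."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key][row[rhs]] += 1
    kept = sum(max(counts.values()) for counts in groups.values())
    return kept / len(rows)

# The four tuples shown on the slide.
cars = [
    {"Make": "Honda", "Model": "Civic", "Body": "Sedan"},
    {"Make": "Honda", "Model": "Civic", "Body": "Coupe"},
    {"Make": "Honda", "Model": "Civic", "Body": "Sedan"},
    {"Make": "Honda", "Model": "Civic", "Body": "Sedan"},
]

print(afd_confidence(cars, ["Model"], "Body"))  # 0.75
print(afd_confidence(cars, ["Model"], "Make"))  # 1.0
print(afd_confidence(cars, ["Make"], "Body"))   # 0.75
```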


Overview of QPIAD

• QPIAD uses AFDs and Naïve Bayes Classifiers to retrieve relevant uncertain answers

• When the mediator has access privileges to modify the underlying data source
– Missing values can be completed by a simple classification task (Imputation)
– After which traditional query processing will suffice

• When mediators do not have such privileges
– Generate a set of rewritten queries and issue them to the autonomous database (Query Rewriting)

• QPIAD uses only the highest-confidence AFD of each attribute for imputation and Query Rewriting

• Techniques for combining multiple AFDs have been shown to be ineffective

ID  Make   Model  Year  Body   Mileage
1   BMW    745    2005  Sedan  20000
2   Acura  Tl     2003         35000
3   BMW    645    2002  Convt  45000
4   BMW    745    2001         35000
5   Acura  Tl     2002  Sedan  24000

Q: Body = Sedan

Model → Body : 0.75

Issuing Q1: Model = Tl and Q2: Model = 745 will retrieve the relevant incomplete answers T2 and T4.
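
The rewriting step can be sketched on the slide's five tuples. This is a minimal illustration and not the QPIAD implementation: the autonomous database is simulated by an in-memory list, and the helper names `rewrite_queries` and `relevant_incomplete` are hypothetical.

```python
def rewrite_queries(rows, query_attr, query_value, determining_set):
    """Build rewritten queries from the determining-set values that
    occur in the certain answers (the base result set)."""
    base = [r for r in rows if r[query_attr] == query_value]
    seen, rewritten = set(), []
    for r in base:
        key = tuple((a, r[a]) for a in determining_set)
        if key not in seen and all(v is not None for _, v in key):
            seen.add(key)
            rewritten.append(dict(key))
    return base, rewritten

def relevant_incomplete(rows, query_attr, rewritten):
    """Tuples that match a rewritten query and miss the query attribute."""
    return [r for r in rows
            if r[query_attr] is None
            and any(all(r[a] == v for a, v in q.items()) for q in rewritten)]

cars = [
    {"ID": 1, "Make": "BMW", "Model": "745", "Year": 2005, "Body": "Sedan", "Mileage": 20000},
    {"ID": 2, "Make": "Acura", "Model": "Tl", "Year": 2003, "Body": None, "Mileage": 35000},
    {"ID": 3, "Make": "BMW", "Model": "645", "Year": 2002, "Body": "Convt", "Mileage": 45000},
    {"ID": 4, "Make": "BMW", "Model": "745", "Year": 2001, "Body": None, "Mileage": 35000},
    {"ID": 5, "Make": "Acura", "Model": "Tl", "Year": 2002, "Body": "Sedan", "Mileage": 24000},
]

# AFD Model -> Body, so the determining set of Body is {Model}.
base, rewritten = rewrite_queries(cars, "Body", "Sedan", ["Model"])
hits = relevant_incomplete(cars, "Body", rewritten)
print(rewritten)                 # [{'Model': '745'}, {'Model': 'Tl'}]
print([r["ID"] for r in hits])   # [2, 4]
```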


Shortcomings of AFD-based approaches

• Principles of locality and detachment do not hold for uncertain reasoning

• Model → Body (0.7)

• This intuitively means that the model of a car determines the body of the car with a probability of 0.7 when no other evidence is available.

• When other evidence is present, there is no easy way to combine the probabilities


Shortcomings of AFD-based approaches

ID  Make  Model  Year  Body   Mileage
1   Audi               Sedan  20000
2   Audi  A8           Sedan  15000
3   BMW   745    2002  Sedan  40000
4   Audi         2005  Sedan  20000
5   Audi  A8     2005  Sedan  20000
6                1999  Convt  25000

• Imputing the missing value in T2 uses a single AFD and ignores the influence of other attributes

• Imputing the missing values in T1 ignores the correlations between the attributes Model and Year

• Imputing the missing values in T6 will get AFDs into cycles: Model → Make, Make → Model


Overview of the talk

• Introduction to Incomplete Autonomous Databases

• Overview of QPIAD and shortcomings of AFD-based approaches

• Our approach: Bayes network based imputation and query rewriting
– Introduction
– Learning Bayes network models from data
– Imputation
   • Single and multiple missing values
   • Varying levels of incompleteness in test data
– Query Rewriting
   • Bayes network based rewriting
   • Comparison of Bayes network based rewriting and AFDs


Bayes network

• A Bayes network is a DAG representing the probabilistic dependencies between attributes

• It is a compact representation of the full joint distribution
– Therefore, influence from all variables is accounted for

• It represents the generative model of the autonomous database

[Figure: Bayes network over the attributes Year, Model, Make, Body, and Mileage]

CPD fragment (rows: Make, columns: Model):

Make    Civic   …
Honda   0.8     …
…       …       …

CPDs model the strength of the probabilistic dependencies


Challenges in using Bayes networks for handling incompleteness in Autonomous databases

• Learning and inference with Bayes networks are computationally harder than with AFDs
– Learning the topology and parameters from data involves searching over the space of topologies
   • But can be done offline
– Inference in a general Bayes network is intractable
   • But can use approximate inference

Question: Can we get benefits of exact inference while containing costs?


Learning a Bayes network model

• Structure & Parameter Learning From Data
– Challenge: Involves searching over topologies
– Use the Banjo software package as a black box
– Experiments show the learned topology is robust w.r.t.
   • Sample size (5-20%) – same topology
   • Search time (5-30 minutes) – same topology
   • Max parent count (2-4) – same topology; a significantly higher number of networks is examined in the case of 2


Inference in Bayes networks

• Exact Techniques
– NP-hard in the general case. Therefore, they do not scale well with increase in incompleteness
– Junction Tree (fastest, but inapplicable when query variables do not form a clique)
– Variable Elimination

• Approximate Techniques (scale well while retaining the accuracy of exact methods)
– Gibbs Sampling
– Using the Infer.NET package allows us to use Expectation Propagation inference


Imputation

• Experimental Setup
– Test Databases: Cars.com database containing 8K tuples and Adult database from the UCI repository containing 15K tuples
– Bayes net inference
   • Exact inference: Junction Tree, Variable Elimination
   • Approximate inference: Gibbs Sampling


Imputation

• Remove all the values for the attribute being predicted

• Substitute the missing value with the most likely value

• AFD-approach
– Use only the highest-confidence AFD (use all attributes if confidence is low, e.g., Mileage (Cars)). Called Hybrid-one by the authors of QPIAD.

• Bayes net
– Infer the posterior distribution of the missing attribute, given the evidence from the other attributes in the tuple
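
Bayes-net imputation as described above can be illustrated on a toy network. The sketch assumes a three-node chain Make -> Model -> Body with made-up CPD values. Neither the structure nor the numbers come from the thesis, and exact enumeration is used only because the network is tiny.

```python
# Hypothetical CPDs for a toy chain Make -> Model -> Body.
P_MODEL_GIVEN_MAKE = {
    "BMW": {"745": 0.6, "645": 0.4},
    "Audi": {"A8": 1.0},
}
P_BODY_GIVEN_MODEL = {
    "745": {"Sedan": 0.9, "Convt": 0.1},
    "645": {"Sedan": 0.2, "Convt": 0.8},
    "A8": {"Sedan": 1.0, "Convt": 0.0},
}

def posterior_model(make, body):
    """P(Model | Make=make, Body=body) by enumeration.
    The prior P(Make) cancels during normalisation."""
    scores = {
        model: p_model * P_BODY_GIVEN_MODEL[model][body]
        for model, p_model in P_MODEL_GIVEN_MAKE[make].items()
    }
    total = sum(scores.values())
    return {model: s / total for model, s in scores.items()}

def impute_model(make, body):
    """Substitute the missing Model with its most likely value."""
    post = posterior_model(make, body)
    return max(post, key=post.get)

print(impute_model("BMW", "Sedan"))  # 745
print(impute_model("BMW", "Convt"))  # 645
```

Unlike a single AFD such as Make → Model, the posterior here uses both Make and Body as evidence, so the imputed Model changes with the body style.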


Imputation- single missing attribute

• Significant difference for the attributes Model and Year.

• AFDs use only the highest-confidence rule and ignore the others.
– Attempts at combining evidence from multiple rules have been ineffective.

• Bayes nets systematically combine all evidence.

[Figure: imputation accuracy (0 to 1) of BN-Exact, BN-Gibbs, and AFDs for the attributes Make, Model, Year, Price, Mileage, and Body]

ID  Make  Model  Year  Body
1   Audi  A8           Sedan
2   BMW   745    2002  Sedan
3   Audi         2005  Sedan
4   Audi  A8     2005  Sedan


Imputation- multiple missing attributes

• AFD-approach
– Predict each missing value independently
– Can get into cycles: Make → Model, Model → Make

• Bayes net
– Computes the joint distribution over the missing attributes

Make  Model  Year  Body
BMW                Sedan
BMW          2003
BMW   745    2004  Sedan


Imputation- multiple missing attributes

• When missing attributes are correlated, AFDs often get into cycles
– Only 9 out of 20 combinations could be predicted when 3 attributes are missing

• AFD accuracies are lower as they use a single rule independently for prediction
– BNs systematically combine evidence from multiple sources and capture correlations by finding the joint distribution

• When attributes are D-separated and involve attributes which have similar prediction accuracies for both methods, there is no difference in accuracy

[Figure: accuracy of AFD, BN-Gibbs, and BN-Exact for the attribute pairs (Year, Mileage), (Body, Model), (Make, Model), (Year, Model), (Year, Make), (Mileage, Make), and (Mileage, Model)]

[Figure: Bayes network over the attributes Year, Model, Make, Body, Mileage, and Price]


Imputation- Increase in incompleteness in test data

• Evidence for predicting missing values reduces with increase in incompleteness

• AFD-approach
– Chain missing values in the determining set of the AFD

• Bayes net
– No change. Just compute the posterior distribution of the attributes to be imputed given the evidence.

Q: Model = 745

AFDs: Make, Body → Model; Year → Body

Make  Model  Year  Body
BMW                Sedan
BMW          2003
BMW   745    2004  Sedan


Imputation- Increase in incompleteness in test data

[Figures: prediction accuracy of AFD, BN-Gibbs, and BN-Exact vs. percentage of incompleteness (0.1 to 0.9); panels: Race-Occupation, Model, and Year-Body]


Time Taken For Imputation

% incompleteness   AFD (sec.)   BN-Gibbs (sec., 250 samples)   BN-Exact (sec.)
0                  0.271        44.46                          16.23
10                 0.267        47.15                          44.88
20                 0.205        52.02                          82.52
30                 0.232        54.86                          128.26
40                 0.231        56.19                          182.33
50                 0.234        58.12                          248.75
60                 0.232        60.09                          323.78
70                 0.235        61.52                          402.13
80                 0.262        63.69                          490.31
90                 0.219        66.19                          609.65

BN-Gibbs retains the accuracy edge of BN-Exact while containing costs


Query Rewriting

• When mediators do not have access privileges, missing values cannot be substituted as in the case of imputation.

• Need to generate and send “rewritten” queries to retrieve relevant uncertain answers.


Query Rewriting – Single-attribute queries

ID  Make   Model  Year  Body   Mileage
1   BMW    745    2005  Sedan  20000
2   Acura  Tl     2003         35000
3   BMW    645    2002  Convt  45000
4   BMW    745    2001         35000
5   Acura  Tl     2002  Sedan  24000

Q: Body = Sedan

CERTAIN ANSWERS (BASE RESULT SET)

1   BMW    745    2005  Sedan  20000
5   Acura  Tl     2002  Sedan  24000

Relevant incomplete answers

Can retrieve T2 with Q’1: Model = Tl

T4 with Q’2: Model = 745


Generating Rewritten Queries

Q: Body = Sedan

CERTAIN ANSWERS (BASE RESULT SET)

ID  Make   Model  Year  Body   Mileage
1   BMW    745    2005  Sedan  20000
5   Acura  Tl     2002  Sedan  24000

Bayes Networks

ATTRIBUTES: ALL ATTRIBUTES IN MARKOV BLANKET (BN-ALL-MB)

Q’1: Model = 745
Q’2: Model = Tl

[Figure: Bayes network over the attributes Year, Model, Make, Body, and Mileage]

Given evidence of all attributes in its MARKOV BLANKET, an attribute is independent of ALL other attributes

AFDs

ATTRIBUTES: DETERMINING SET OF AFD

Model → Body : 0.9

Q’1: Model = 745
Q’2: Model = Tl
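
The Markov blanket of an attribute (its parents, its children, and the other parents of its children) can be read off the DAG directly. The sketch below uses an assumed edge set, since the edges of the slide's figure did not survive. The edges were chosen so that the blanket of Body is {Model} and the blanket of Model is {Make, Year, Body}, in line with the rewritten queries and candidate sets shown in the talk.

```python
def markov_blanket(parents, node):
    """Parents, children, and co-parents of `node`.
    `parents` maps every node to the list of its parents in the DAG."""
    children = [c for c, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(node)
    return blanket

# Hypothetical DAG over the slide's attributes (edges are an assumption).
parents = {
    "Make": [],
    "Year": [],
    "Model": ["Make", "Year"],
    "Body": ["Model"],
    "Mileage": ["Year"],
}

print(markov_blanket(parents, "Body"))           # {'Model'}
print(sorted(markov_blanket(parents, "Model")))  # ['Body', 'Make', 'Year']
```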


Ranking Rewritten queries

• All queries may not be equally good in retrieving relevant answers
– Cars with model “tl” are more likely to be sedans than cars with model “745”

• Rank queries based on their expected precision (ExpPrec)

Bayes Networks: inference in the Bayes network

AFDs: use Naïve Bayes Classifiers

$\mathrm{ExpPrec}(Q) = P(A_m = v_m \mid t_i)$

where $t_i \in \Pi_{\mathrm{MB}(A_m)}(RS(Q))$ for Bayes nets

where $t_i \in \Pi_{\mathrm{dtrSet}(A_m)}(RS(Q))$ for AFDs

Q1’: Model = ‘tl’. ExpPrec(Q1’) = P(Body=Sedan|Model=tl) = 1

Q2’: Model = ‘745’. ExpPrec(Q2’) = P(Body=Sedan|Model=745) = 0.6
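
Ranking by expected precision can be sketched as follows. The probability is estimated here by simple counting over a hypothetical sample, built so that the estimates match the slide's 1 and 0.6. In the talk the value comes from Bayes network inference or a Naïve Bayes Classifier, so the counting estimator is a stand-in.

```python
def expected_precision(sample, evidence, target_attr, target_value):
    """Estimate P(target_attr = target_value | evidence) by counting over
    sample tuples that match the evidence and have the target present."""
    matching = [t for t in sample
                if all(t.get(a) == v for a, v in evidence.items())
                and t.get(target_attr) is not None]
    if not matching:
        return 0.0
    hits = sum(1 for t in matching if t[target_attr] == target_value)
    return hits / len(matching)

# Hypothetical sample: all "Tl" cars are sedans, 3 of 5 "745" cars are.
sample = (
    [{"Model": "Tl", "Body": "Sedan"}] * 2
    + [{"Model": "745", "Body": "Sedan"}] * 3
    + [{"Model": "745", "Body": "Coupe"}] * 2
)

queries = [{"Model": "745"}, {"Model": "Tl"}]
ranked = sorted(
    queries,
    key=lambda q: expected_precision(sample, q, "Body", "Sedan"),
    reverse=True,
)
print(ranked)  # [{'Model': 'Tl'}, {'Model': '745'}]
```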


Ranking Rewritten Queries- only K queries

• When database or network resources are limited, the mediator can choose to issue only the top-K queries to get the most relevant uncertain answers
– It is important to carefully trade precision with throughput

• Use the F-measure metric (idea borrowed from QPIAD)

P – expected precision (e.g. P(Model=745|Make=BMW))

R – expected recall

R = expected precision * expected selectivity

Expected selectivity = Sample Selectivity * Sample Ratio

Sample Ratio is estimated from the cardinalities of the result sets from the sample and the original database

α = 0 – only precision
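
The formula on this slide did not survive extraction. The sketch below assumes the usual α-weighted F-measure, (1 + α)·P·R / (α·P + R), which reduces to precision alone at α = 0 as the slide states. The function names and the example numbers are illustrative.

```python
def expected_recall(precision, sample_selectivity, sample_ratio):
    """R = expected precision * expected selectivity, where
    expected selectivity = sample selectivity * sample ratio."""
    return precision * sample_selectivity * sample_ratio

def f_measure(precision, recall, alpha):
    """Assumed form: (1 + alpha) * P * R / (alpha * P + R)."""
    denom = alpha * precision + recall
    if denom == 0:
        return 0.0
    return (1 + alpha) * precision * recall / denom

p = 0.6
r = expected_recall(p, sample_selectivity=0.05, sample_ratio=10)
print(f_measure(p, r, alpha=0.0))  # equals p up to rounding: precision only
print(f_measure(p, r, alpha=1.0))  # harmonic mean of P and R
```

Larger values of α give recall more weight, which is how a mediator would favour queries with higher throughput.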


Experimental Setup

• Test databases: Cars database consisting of 55K tuples and Adult database consisting of 15K tuples

• Training set: 15% of the database

• Test data split in two halves
– One half contains no incompleteness and is used to return the base result set
– In the other half all query-constrained attributes are made null
– A copy of the test data is used as the ground truth to compute precision and recall
– This is an aggressive setup since most databases have <50% incompleteness


BN-All-MB vs AFD

BN-All-MB: P(Make=bmw|model=330)

AFD: P(Make=bmw|model=330)

• When the size of the determining set is > 1, the expected precision values of AFDs (computed by NBCs) are inaccurate

• Actual precision is lower for AFDs because their expected precisions are inaccurate

Q: Make


Shortcoming of BN-All-MB

• Throughput of queries reduces drastically as the Markov blanket size increases
– Use F-measure based ranking to increase recall
– When almost all queries have very low throughput, there is simply no way to increase recall

[Figure: Bayes network over the attributes Year, Model, Make, Body, and Mileage]

Q: Model = 745

Q’1: Make ^ Body ^ Year
Q’2: Make ^ Body ^ Year
Q’3: Make ^ Body ^ Year


BN-Beam (Single-attribute queries)

Q: Model = 745

[Figure: Bayes network over the attributes Year, Model, Make, Body, and Mileage]

ID  Make   Model  Year  Body   Mileage
1   BMW    745    2005  Sedan  20000
2   BMW           2005  Sedan  35000
3   BMW    645    2002  Convt  45000
4   BMW    745    2001         35000
5   Acura  Tl     2002  Sedan  24000
6   BMW           2001  Sedan  20000

Candidate Attribute Set = {Year, Make, Body}


BN-Beam

Level 1 (best rewritten queries of size 1)

Make = BMW
Year = 2001
Body = Sedan

Pick Top-K queries at each level based on the F-measure metric

P – expected precision (e.g. P(Model=745|Make=BMW))
R – expected recall
R = expected precision * expected selectivity
Expected selectivity = Sample Selectivity * Sample Ratio
Sample Ratio is estimated from the cardinalities of the result sets from the sample and the original database

Level 2

Make = BMW ^ Year = 2001
Make = BMW ^ Year = 2005
Body = Sedan

Level L

Q’1
Q’2
Q’3

At Level L all (partial) queries have ≤ L attributes constrained

Issue to database in the increasing order of expected precision
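
The level-wise search can be sketched as a small beam search. This is an illustration under stated assumptions and not the thesis implementation: precision and selectivity are estimated by counting over a hypothetical sample instead of Bayes network inference, the F-measure form is the assumed α-weighted one, and the names `score` and `bn_beam` are invented for the example.

```python
def f_measure(p, r, alpha):
    denom = alpha * p + r
    return 0.0 if denom == 0 else (1 + alpha) * p * r / denom

def score(sample, query, target_attr, target_value, alpha):
    """F-measure of a partial query, with precision and selectivity
    estimated by counting over the sample."""
    matching = [t for t in sample
                if all(t.get(a) == v for a, v in query.items())]
    if not matching:
        return 0.0
    precision = sum(t.get(target_attr) == target_value
                    for t in matching) / len(matching)
    selectivity = len(matching) / len(sample)
    return f_measure(precision, precision * selectivity, alpha)

def bn_beam(sample, candidate_attrs, target_attr, target_value,
            k, levels, alpha):
    """Keep the top-k partial queries per level; at level L every
    surviving query constrains at most L attributes."""
    base = [t for t in sample if t.get(target_attr) == target_value]
    values = {a: sorted({t[a] for t in base if t.get(a) is not None}, key=str)
              for a in candidate_attrs}
    beam = [{}]
    for _ in range(levels):
        expanded = list(beam)
        for q in beam:
            for a in candidate_attrs:
                if a not in q:
                    expanded.extend({**q, a: v} for v in values[a])
        unique = {frozenset(q.items()): q for q in expanded if q}
        beam = sorted(
            unique.values(),
            key=lambda q: score(sample, q, target_attr, target_value, alpha),
            reverse=True,
        )[:k]
    return beam

# Hypothetical training sample.
sample = [
    {"Make": "BMW", "Model": "745", "Year": 2005, "Body": "Sedan"},
    {"Make": "BMW", "Model": "745", "Year": 2001, "Body": "Sedan"},
    {"Make": "BMW", "Model": "645", "Year": 2002, "Body": "Convt"},
    {"Make": "Acura", "Model": "Tl", "Year": 2002, "Body": "Sedan"},
]
result = bn_beam(sample, ["Make", "Year", "Body"], "Model", "745",
                 k=3, levels=2, alpha=0.0)
for q in result:
    print(q)
```

Keeping shorter queries in the beam when they outscore their extensions is what lets the search return queries with fewer constrained attributes and therefore higher throughput.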


BN-Beam vs BN-All-MB

• Increasing α does not increase recall of BN-All-MB

• BN-Beam increases recall without a catastrophic reduction in precision

Results for Top-10 queries for user query Year = 2002

[Figures: recall plot and precision plot]


Multi-attribute queries

• Contribution to QPIAD

• Aim: To retrieve relevant uncertain answers with multiple missing values on query-constrained attributes.


Multi-attribute queries

ID  Make  Model  Year  Body   Mileage
1         645    2002  Coupe  40000
2   BMW   645    2002  Convt
3         745    2001  Sedan
4         645    2002  Coupe
5   BMW   745    2001  Coupe  40000
6   BMW   645    2002  Convt  40000

Q: Make = BMW ^ Mileage = 40000

Base result set = T5, T6

QPIAD retrieves T1 and T2.

BN-Beam can also retrieve T3 and T4.

Candidate attribute set: union of the attributes in the Markov blankets of all constrained attributes

All other steps are the same as in the single-attribute query case


Comparison over multi-attribute queries

• Two AFD approaches

1. AFD-All-Attributes: Creates a conjunctive query by joining all attributes in the determining sets of the AFDs of the constrained attributes.

Consider AFDs: Model → Make, Year → Mileage

Q: Make = BMW ^ Mileage = 40000

Make = BMW: Model = 745, Model = 645

Mileage = 40000: Year = 2001, Year = 2002

Q’1: Model=745 ^ Year=2001
Q’2: Model=645 ^ Year=2001
Q’3: Model=745 ^ Year=2002
Q’4: Model=645 ^ Year=2002

Expected Precision = Product of the individual queries’ expected precisions
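
The construction above amounts to a cross product of the per-attribute rewritten queries, with the expected precisions multiplied. A minimal sketch follows; the precision values are made up for illustration and `afd_all_attributes` is a hypothetical helper name.

```python
from itertools import product

def afd_all_attributes(per_attribute_queries):
    """Combine per-attribute rewritten queries into conjunctive queries.
    Each input list holds (query_dict, expected_precision) pairs; the
    combined precision is the product of the individual precisions."""
    combined = []
    for combo in product(*per_attribute_queries):
        query, precision = {}, 1.0
        for q, p in combo:
            query.update(q)
            precision *= p
        combined.append((query, precision))
    return sorted(combined, key=lambda qp: qp[1], reverse=True)

# Rewritten queries for Make = BMW (via Model -> Make) and for
# Mileage = 40000 (via Year -> Mileage); precisions are hypothetical.
make_queries = [({"Model": "745"}, 0.9), ({"Model": "645"}, 0.8)]
mileage_queries = [({"Year": 2001}, 0.5), ({"Year": 2002}, 0.4)]

ranked = afd_all_attributes([make_queries, mileage_queries])
for query, precision in ranked:
    print(query, round(precision, 2))
```

Multiplying the precisions treats the constrained attributes as independent, which is the limitation the comparison with BN-Beam draws attention to.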


BN-Beam vs AFD-All-Attributes

Precision of BN-Beam is competitive with AFD-All Attributes

Recall of BN-Beam is higher

• AFD-All-Attributes does not consider the joint distribution between the query-constrained attributes.

• Leads to low throughput or even empty queries

Results for top-10 queries

Q: Make ^ Mileage


Comparison of multi-attribute queries

2. AFD-Highest-Confidence: Uses only the AFD of the highest confidence constrained attribute for rewriting

Q: Make = Dodge ^ Year = 2004

IGNORE all attributes other than Make

AFD: Model → Make

Q’1: Model = ram
Q’2: Model = intrepid


BN-Beam vs AFD-Highest-Confidence

Results for top-10 queries

Q: Make ^ Year (Car database)

AFD-Highest-Confidence increases recall but NOT WITHOUT a CATASTROPHIC drop in precision


Summary

• A comparison of cost and accuracy tradeoffs of using Bayes network models and AFDs for handling incompleteness in autonomous databases

• Bayes nets have a significant edge over AFDs when missing values are on highly correlated attributes and at higher levels of incompleteness in test data.

• Presented two approaches- BN-All-MB and BN-Beam for generating rewritten queries using Bayes networks. We showed that BN-Beam is able to retrieve tuples with higher recall than BN-All-MB. We compared Bayes network based rewriting with AFD based rewriting and found the former to retrieve results with higher precision and recall


Deviations From the Thesis Draft

• CAVEAT: I found two bugs in my code (Query Rewriting section)

• Corrected one bug (related to BN-based rewriting)

• Will correct the other one (related to AFD-based rewriting) after the defense

THANK YOU

QUESTIONS?