
On Provenance of Non-Answers for Queries over Extracted Data

Jiansheng Huang

Ting Chen

AnHai Doan

Jeffrey F. Naughton

Imprecise Data in Information Extraction (IE)

[Figure: Source-1, Source-2, …, Source-n are extracted and integrated into a database; queries over the database produce answers. Labels in the figure: imprecise, fuzzy, incorrect, incomplete.]

Examples: DBLife (WISC), Avatar (IBM), etc.

State of the Art

(Table columns: Answers, Non-Answers)

• Probability: Cavallo et al., 1987; Barbara et al., 1992; Lakshmanan et al., 1997; Dalvi et al., 2004.

• Provenance: Woodruff et al., 1997; Cui et al., 2000, 2001; Buneman et al., 2001; Bhagwat et al., 2004.

• Probability + Provenance: Benjelloun et al., 2006.

Motivating Example

• Crawl web to extract information related to academic job openings.

• Store result of extraction in an RDBMS.

• Ask SQL queries and try to interpret the results.

Extracted Jobs

Source: CS Dept. web sites

school_name | school_state | job_opening
harvard | ma | yes
berkeley | ca | yes
ucsd | ca | no
uc merced | ca | no
… | … | …
ucsc | ca | yes

Extracted Ranking

Source: CS Ranking Web Site

school_name | rank
harvard | 11
berkeley | 3
ucsd | 23
uc merced | null
… | …

Question Answering

What are the CS PhD programs in California (CA) that have job openings and are in the top 25?

SELECT Jobs.school_name
FROM Jobs, Ranking
WHERE Ranking.rank <= 25
  AND Jobs.job_opening = 'yes'
  AND Jobs.school_state = 'ca'
  AND Jobs.school_name = Ranking.school_name;
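The query can be tried end to end with a minimal sketch using Python's built-in sqlite3 module. The choice of SQLite, the column types, and the helper name `make_db` are assumptions for illustration; the rows are only those shown on the slides, and the rows elided as "…" are left out.

```python
import sqlite3

def make_db():
    """Build an in-memory database holding only the rows shown on the slides."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE Jobs (school_name TEXT, school_state TEXT, job_opening TEXT);
        CREATE TABLE Ranking (school_name TEXT, rank INTEGER);
        INSERT INTO Jobs VALUES
            ('harvard', 'ma', 'yes'), ('berkeley', 'ca', 'yes'),
            ('ucsd', 'ca', 'no'), ('uc merced', 'ca', 'no'), ('ucsc', 'ca', 'yes');
        -- ucsc has no Ranking row; uc merced has a row with a null rank.
        INSERT INTO Ranking VALUES
            ('harvard', 11), ('berkeley', 3), ('ucsd', 23), ('uc merced', NULL);
    """)
    return db

QUERY = """
    SELECT Jobs.school_name
    FROM Jobs, Ranking
    WHERE Ranking.rank <= 25
      AND Jobs.job_opening = 'yes'
      AND Jobs.school_state = 'ca'
      AND Jobs.school_name = Ranking.school_name
"""

answers = [row[0] for row in make_db().execute(QUERY)]
print(answers)
```

On this sample only berkeley satisfies every predicate, so every other school is a non-answer.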

Answer

• berkeley, …

The answer berkeley is justified by the base tuples:

Jobs: school_name = berkeley, school_state = ca, job_opening = yes
Ranking: school_name = berkeley, rank = 3

Non-Answers

• ucsd, uc merced, harvard, ucsc, …

• But why?

• The data exists.

• There is no mechanism to explain non-answers.

Jobs:

school_name | school_state | job_opening
harvard | ma | yes
ucsd | ca | no
uc merced | ca | no
ucsc | ca | yes

Ranking:

school_name | rank
harvard | 11
ucsd | 23
uc merced | null

Assumptions

• Relational data model

• Subset of SQL

– Selection (e.g., R.a = 2)

– Projection (e.g., return R.a)

– Join (e.g., R.a = S.b)

• Conjunctive predicates (e.g., a = 2 and b = 3)

• Satisfiable (e.g., no “a = 2 and a = 3”)

Provenance of Non-Answers

[Figure: a query over the data (x, y) does not return z. After updates x → x′, the same query over (x′, y) returns z.]

z is a potential answer; (x, y) and (x′, y) are the provenance of z.

(x → x′, y) explains why z is not an answer and how z can become an answer.

Example

Before the update, the query does not return ucsd:

Jobs: school_name = ucsd, school_state = ca, job_opening = no
Ranking: school_name = ucsd, rank = 23

After the update, the same query returns ucsd:

Jobs: school_name = ucsd, school_state = ca, job_opening = yes
Ranking: school_name = ucsd, rank = 23

ucsd is a potential answer. The set of base tuples is a provenance of ucsd.
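The hypothetical update in this example can be replayed with a small sketch, again assuming SQLite through Python's sqlite3 module and the job-openings query from the Question Answering slide. The update is applied to a scratch in-memory copy purely to show that the same query then returns ucsd; nothing in the slides suggests the update is actually executed on the real data.

```python
import sqlite3

QUERY = """
    SELECT Jobs.school_name
    FROM Jobs, Ranking
    WHERE Ranking.rank <= 25
      AND Jobs.job_opening = 'yes'
      AND Jobs.school_state = 'ca'
      AND Jobs.school_name = Ranking.school_name
"""

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Jobs (school_name TEXT, school_state TEXT, job_opening TEXT);
    CREATE TABLE Ranking (school_name TEXT, rank INTEGER);
    INSERT INTO Jobs VALUES ('ucsd', 'ca', 'no');
    INSERT INTO Ranking VALUES ('ucsd', 23);
""")

# Same query, before and after the hypothetical update job_opening: no -> yes.
before = [row[0] for row in db.execute(QUERY)]
db.execute("UPDATE Jobs SET job_opening = 'yes' WHERE school_name = 'ucsd'")
after = [row[0] for row in db.execute(QUERY)]
print(before, after)
```

The base tuples before the update, together with the update itself, are what the slides call the provenance of the non-answer.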

Another Example

Before the update:

Jobs: school_name = harvard, school_state = ma, job_opening = yes
Ranking: school_name = ucsd, rank = 23

After the update, the same query returns ucsd:

Jobs: school_name = ucsd, school_state = ca, job_opening = yes
Ranking: school_name = ucsd, rank = 23

Trust and Constraints

[Figure: the data is split into an untrusted part x and a trusted part y. Valid updates to x must satisfy constraints; updates to the trusted part y are not considered. The same query over the updated data returns z.]

Example: Using Trust

Before the update:

Jobs: school_name = harvard, school_state = ma, job_opening = yes
Ranking: school_name = ucsd, rank = 23

After the update:

Jobs: school_name = ucsd, school_state = ca, job_opening = yes
Ranking: school_name = ucsd, rank = 23

[Figure label on the slide: trust]

Factors Determining Provenance of Non-Answers

• Trusted Data

• Constraints

• Query specification

Algorithm

• Start from a user query and a specific non-answer

• Add predicates derived from the non-answer

• Add constraint predicates

• Retain only predicates on trusted attributes

• For attributes of a potential tuple

– Determine equivalent constant value (e.g., a = 2)

– If none, return a variable

• Evaluate the provenance query
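The steps above can be sketched in Python as follows. This is one illustrative reading of the bullets, not the authors' implementation: the `Predicate` class and the functions `provenance_where` and `hypothetical_value` are hypothetical names, predicates are kept as plain SQL strings, and the choice of join expression and the completeness assumptions are ignored.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Predicate:
    attrs: Tuple[str, ...]        # attributes the predicate mentions
    sql: str                      # SQL text of the predicate
    equals: Optional[str] = None  # the constant, if the predicate is "attr = constant"

def provenance_where(query_preds, non_answer_preds, constraint_preds, trusted):
    """Combine all predicates and keep those that mention only trusted attributes."""
    candidates = [*query_preds, *non_answer_preds, *constraint_preds]
    kept = [p for p in candidates if all(a in trusted for a in p.attrs)]
    return " AND ".join(p.sql for p in kept)

def hypothetical_value(attr, query_preds):
    """Return the constant an equality predicate forces on attr, else a variable."""
    for p in query_preds:
        if p.attrs == (attr,) and p.equals is not None:
            return p.equals
    return "X"

# The job-openings query, with ucsd as the non-answer.
query_preds = [
    Predicate(("J.job_opening",), "J.job_opening = 'yes'", "yes"),
    Predicate(("J.school_state",), "J.school_state = 'ca'", "ca"),
    Predicate(("R.rank",), "R.rank <= 25"),
    Predicate(("J.school_name", "R.school_name"), "J.school_name = R.school_name"),
]
non_answer = [Predicate(("J.school_name",), "J.school_name = 'ucsd'", "ucsd")]
trusted = {"J.school_name", "J.school_state", "R.school_name", "R.rank"}

where = provenance_where(query_preds, non_answer, [], trusted)
print(where)
print(hypothetical_value("J.job_opening", query_preds))
print(hypothetical_value("R.rank", query_preds))
```

The predicate on the untrusted attribute job_opening is dropped from the WHERE clause and reappears as the hypothetical value yes, while rank, which has only a range predicate, would get a variable if it were untrusted.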

Example: Why Is UCSD Not an Answer?

• Assume that we trust

– Jobs(school_name, school_state)

– Ranking(school_name, rank)

– Completeness of Jobs and Ranking

Computing Provenance of UCSD

User query:

SELECT Jobs.school_name
FROM Jobs, Ranking
WHERE Jobs.job_opening = 'yes'
  AND Jobs.school_state = 'ca'
  AND Ranking.rank <= 25
  AND Jobs.school_name = Ranking.school_name;

Provenance query (→ marks a hypothetical update and is not SQL syntax):

SELECT J.school_name, J.job_opening → yes, R.school_name, R.rank
FROM Jobs AS J, Ranking AS R
WHERE J.school_name = 'ucsd'
  AND J.school_state = 'ca'
  AND R.rank <= 25
  AND J.school_name = R.school_name;

Annotations on the slide:

• Trusted: the predicates on J.school_state and R.rank, and the join on school_name

• Specifying non-answer: J.school_name = 'ucsd'

• Hypothetical update: J.job_opening → yes

Provenance of UCSD

Jobs.school_name | job_opening | Ranking.school_name | rank
ucsd | no → yes | ucsd | 23

UCSD is a potential answer.

Why is it not an answer? Because job_opening = no.

How can it become an answer? job_opening: no → yes.
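A minimal sketch of evaluating the UCSD provenance query, assuming SQLite via Python's sqlite3 module and only the rows shown on the slides. The hypothetical-update annotation is not SQL, so the query returns the current value of job_opening and the sketch attaches the update afterwards; the variable names are illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Jobs (school_name TEXT, school_state TEXT, job_opening TEXT);
    CREATE TABLE Ranking (school_name TEXT, rank INTEGER);
    INSERT INTO Jobs VALUES
        ('harvard', 'ma', 'yes'), ('berkeley', 'ca', 'yes'),
        ('ucsd', 'ca', 'no'), ('uc merced', 'ca', 'no'), ('ucsc', 'ca', 'yes');
    INSERT INTO Ranking VALUES
        ('harvard', 11), ('berkeley', 3), ('ucsd', 23), ('uc merced', NULL);
""")

# Only predicates on trusted attributes remain; job_opening is untrusted.
PROVENANCE = """
    SELECT J.school_name, J.job_opening, R.school_name, R.rank
    FROM Jobs AS J, Ranking AS R
    WHERE J.school_name = 'ucsd'
      AND J.school_state = 'ca'
      AND R.rank <= 25
      AND J.school_name = R.school_name
"""

rows = db.execute(PROVENANCE).fetchall()

# Attach the hypothetical update the user query requires: job_opening = 'yes'.
report = [
    (j_name, f"{opening} -> yes" if opening != "yes" else opening, r_name, rank)
    for j_name, opening, r_name, rank in rows
]
print(report)
```

A non-empty result means ucsd is a potential answer, and the annotated column shows the single change that would make it an answer.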

Provenance-Assisted Debugging

• While implementing our job extraction example, we actually used provenance of non-answers to find a bug.

• Specifically, we noticed that UCSD was not an answer to “find all dept. in top 25 with job openings”.

• Informed by the provenance, we checked the UCSD web page and found that it does have a job opening.

• What happened?

Our Bug

• UCSD web page has a job opening

• Debugged extraction for UCSD instance

• Bug: a line in the source was longer than the line buffer used for reading

• Fix: increase the line buffer size

• Re-extracting and re-querying produces UCSD as an answer

Deeper Issues

• New records can be inserted.

– Use an all-null tuple as a proxy in our provenance report (not actually inserted).

• The join expression for a provenance query depends on the trust and constraints on the joined tables.

More on Join Expression

Given a join between R and S on R.c1 = S.c2, and assuming R.c1 is trusted and R is complete, the join expression for the provenance query is:

• If S.c2 is trusted:

– if S is complete, R join S;

– if S.c2 is unique, R left outer join S;

– otherwise, (R join S) union (R x {null, …}).

• If S.c2 is not trusted:

– if S is complete, R x S;

– otherwise, R x (S union {null, …}).
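The case analysis on this slide can be written as a small decision function. This is only a sketch: the function name `join_expression` and its boolean parameters are hypothetical, the result is a descriptive string rather than executable SQL, and the cases are checked in the order the slide lists them.

```python
def join_expression(s_c2_trusted: bool, s_complete: bool, s_c2_unique: bool) -> str:
    """Pick the join expression for R.c1 = S.c2, assuming R.c1 is trusted and R is complete."""
    if s_c2_trusted:
        if s_complete:
            return "R JOIN S"
        if s_c2_unique:
            return "R LEFT OUTER JOIN S"
        # S may be missing tuples, so an all-null proxy tuple is added as well.
        return "(R JOIN S) UNION (R x {null, ...})"
    if s_complete:
        return "R x S"
    return "R x (S UNION {null, ...})"

# Ranking.school_name trusted and unique, Ranking not assumed complete.
print(join_expression(s_c2_trusted=True, s_complete=False, s_c2_unique=True))
# No trust or constraints on Ranking.
print(join_expression(s_c2_trusted=False, s_complete=False, s_c2_unique=False))
```

The less that is trusted about S, the closer the expression gets to a full cross product, which is why trust and constraints matter for the size of the provenance.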

Example: Why Is UC Santa Cruz (UCSC) Not an Answer?

• Assume that we trust

– Jobs(school_name, school_state)

– Ranking(school_name, rank)

– Completeness of Jobs

• Ranking.school_name is unique

Computing Provenance of UCSC

User query:

SELECT Jobs.school_name
FROM Jobs, Ranking
WHERE Jobs.job_opening = 'yes'
  AND Jobs.school_state = 'ca'
  AND Ranking.rank <= 25
  AND Jobs.school_name = Ranking.school_name;

Provenance query (→ marks a hypothetical update and is not SQL syntax):

SELECT J.school_name, J.job_opening → yes, R.school_name → J.school_name, R.rank → X
FROM Jobs AS J LEFT OUTER JOIN Ranking AS R
WHERE J.school_state = 'ca'
  AND R.rank <= 25
  AND J.school_name = R.school_name
  AND J.school_name = 'ucsc';

Annotations on the slide: Trusted, Specifying non-answer, Hypothetical update.

Provenance of UCSC

Jobs.school_name | job_opening | Ranking.school_name | rank
ucsc | yes | null → ucsc | null → X <= 25

UCSC is a potential answer.

Why is it not an answer? Because there is no ranking for ucsc.

How can it become an answer? A new Ranking tuple is inserted: (null → ucsc, null → X <= 25).
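A runnable approximation of the UCSC provenance query, assuming SQLite via Python's sqlite3 module and only the rows shown on the slides. Two adaptations are my own assumptions, made so the query executes as ordinary SQL: the join condition is moved into an ON clause, and the rank predicate lets the null-padded proxy row through while still filtering any existing Ranking row whose rank is above 25.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Jobs (school_name TEXT, school_state TEXT, job_opening TEXT);
    CREATE TABLE Ranking (school_name TEXT, rank INTEGER);
    INSERT INTO Jobs VALUES
        ('harvard', 'ma', 'yes'), ('berkeley', 'ca', 'yes'),
        ('ucsd', 'ca', 'no'), ('uc merced', 'ca', 'no'), ('ucsc', 'ca', 'yes');
    -- There is no Ranking row for ucsc.
    INSERT INTO Ranking VALUES
        ('harvard', 11), ('berkeley', 3), ('ucsd', 23), ('uc merced', NULL);
""")

PROVENANCE = """
    SELECT J.school_name, J.job_opening, R.school_name, R.rank
    FROM Jobs AS J LEFT OUTER JOIN Ranking AS R
      ON J.school_name = R.school_name
    WHERE J.school_state = 'ca'
      AND J.school_name = 'ucsc'
      AND (R.school_name IS NULL OR R.rank <= 25)
"""

rows = db.execute(PROVENANCE).fetchall()
print(rows)
```

The two nulls on the Ranking side play the role of the all-null proxy tuple: they indicate that a new Ranking tuple for ucsc with some rank X <= 25 would have to be inserted.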

Dataset for Experiment

• Extracted CS Ph.D. program ranks from the CRA web site (108 schools).

• Extracted job openings from department web sites (108 schools).

• Assumption: Trust Jobs(school_name, school_state) and Jobs’ completeness.

Impact of Trust/Constraints on Provenance of UCSD

• No trust/constraints on Ranking: 109 provenance tuples

• Trust Ranking.school: 2 provenance tuples

• Trust Ranking.school and Ranking.school is unique: 1 provenance tuple
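The three counts can be reproduced on synthetic data with a short sketch, assuming SQLite via Python's sqlite3 module. Only the table sizes (108 schools each) come from the slides; the school names, states, and ranks below are made-up stand-ins, not the extracted CRA data, and the three queries are my reading of how the trust settings map to join expressions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Jobs (school_name TEXT, school_state TEXT, job_opening TEXT);
    CREATE TABLE Ranking (school_name TEXT, rank INTEGER);
""")
# Synthetic stand-in data: 108 schools, all in 'ca', ranked 1..108.
schools = ["ucsd"] + [f"school_{i}" for i in range(107)]
db.executemany("INSERT INTO Jobs VALUES (?, 'ca', 'no')", [(s,) for s in schools])
db.executemany("INSERT INTO Ranking VALUES (?, ?)",
               [(s, i + 1) for i, s in enumerate(schools)])

TRUSTED_JOBS = "J.school_name = 'ucsd' AND J.school_state = 'ca'"

def count(sql):
    return db.execute(sql).fetchone()[0]

# No trust/constraints on Ranking: Jobs x (Ranking plus an all-null proxy tuple).
no_trust = count(f"""
    SELECT COUNT(*) FROM Jobs AS J,
        (SELECT school_name, rank FROM Ranking
         UNION ALL SELECT NULL, NULL) AS R
    WHERE {TRUSTED_JOBS}""")

# Trust Ranking.school_name: (Jobs join Ranking) plus (Jobs x all-null proxy).
trust_name = count(f"""
    SELECT COUNT(*) FROM (
        SELECT J.school_name AS j_name, R.school_name AS r_name
        FROM Jobs AS J JOIN Ranking AS R ON J.school_name = R.school_name
        WHERE {TRUSTED_JOBS}
        UNION ALL
        SELECT J.school_name, NULL FROM Jobs AS J WHERE {TRUSTED_JOBS})""")

# Trusted and unique Ranking.school_name: left outer join.
trust_unique = count(f"""
    SELECT COUNT(*) FROM Jobs AS J LEFT OUTER JOIN Ranking AS R
      ON J.school_name = R.school_name
    WHERE {TRUSTED_JOBS}""")

print(no_trust, trust_name, trust_unique)
```

Under these assumptions the counts are 108 + 1 = 109, 1 + 1 = 2, and 1, matching the numbers on the slide.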

Impact of Trust/Constraints on Provenance Scalability

• Scale up the database by a factor of 100 and compare the number of provenance tuples for UCSD.

• No trust/constraints on Ranking: x 100.

• Trust Ranking.school: no change.

Conclusion

• Proposed a mechanism for explaining a non-answer by using data, constraints, and query.

• Showed that trust and constraints are critical for getting focused provenance.

• Some opportunities for future work

– Formal theory (e.g., in relational algebra)

– Provenance ranking, etc.

THANKS!

Original Context: Condor Project

• Distributed computing research project

• Develops and maintains the Condor system software, and supports a production distributed computing facility at UW-Madison.

Data Management in Condor

[Figure: Condor architecture. Submit machines run a scheduler, execute machines run an executor, and the central manager runs the collector and negotiator. Machine counts shown: O(1) ~ O(1000) and O(10) ~ O(1000). Arrows are labeled job, param, and machine.]

Job and system state in local log files!

CondorDB to the Rescue

[Figure: the same Condor architecture (schedule machines with a scheduler, execute machines with an executor, central manager with collector and negotiator; machine counts O(1) ~ O(1000) and O(10) ~ O(1000)) together with a CondorDB database. Arrows are labeled job, param, machine, and query.]

CondorDB Deployments

many more …

Imprecise Data in CondorDB

[Figure: Condor nodes feed a database; queries over the database produce answers. Labels in the figure: autonomous, uncontrollable, unpredictable; out of date, inconsistent, incorrect, incomplete.]

Does this problem also occur in any other application?

Example: Why Is UC Merced Not an Answer?

• Assume that we trust

– Jobs(school_name, school_state)

– Ranking(school_name, rank)

– Completeness of Jobs and Ranking

Computing Provenance of UC Merced

User query:

SELECT Jobs.school_name
FROM Jobs, Ranking
WHERE Jobs.job_opening = 'yes'
  AND Jobs.school_state = 'ca'
  AND Ranking.rank <= 25
  AND Jobs.school_name = Ranking.school_name;

Provenance query (→ marks a hypothetical update and is not SQL syntax):

SELECT J.school_name, J.job_opening → yes, R.school_name, R.rank
FROM Jobs AS J, Ranking AS R
WHERE J.school_state = 'ca'
  AND R.rank <= 25
  AND J.school_name = R.school_name
  AND J.school_name = 'uc merced';

Annotations on the slide: Trusted, Specifying non-answer, Hypothetical update.

Provenance of UC Merced

Jobs.school_name | job_opening | Ranking.school_name | rank
(no rows)

UC Merced is not a potential answer.

Why is it not a potential answer? Because the trusted data (rank) does not satisfy the query.

Example: Relaxing Trust for Provenance of UC Merced

• Assume that we trust

– Jobs(school_name, school_state)

– Ranking(school_name)

– Completeness of Jobs and Ranking

Computing Provenance of UC Merced

User query:

SELECT Jobs.school_name
FROM Jobs, Ranking
WHERE Jobs.job_opening = 'yes'
  AND Jobs.school_state = 'ca'
  AND Ranking.rank <= 25
  AND Jobs.school_name = Ranking.school_name;

Provenance query (→ marks a hypothetical update and is not SQL syntax):

SELECT J.school_name, J.job_opening → yes, R.school_name, R.rank → X <= 25
FROM Jobs AS J, Ranking AS R
WHERE J.school_state = 'ca'
  AND J.school_name = R.school_name
  AND J.school_name = 'uc merced';

Annotations on the slide: Trusted, Specifying non-answer, Hypothetical update.

Provenance of UC Merced

Jobs.school_name | job_opening | Ranking.school_name | rank
uc merced | no → yes | uc merced | null → X <= 25

UC Merced is a potential answer.

Why is it not an answer? Because job_opening = no and rank = null.

How can it become an answer? job_opening: no → yes and rank: null → X <= 25.
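The effect of relaxing trust on rank can be checked with a minimal sketch, assuming SQLite via Python's sqlite3 module and only the rows shown on the slides. The names `STRICT` and `RELAXED` are illustrative: the first query keeps the predicate on rank (rank trusted), the second drops it (rank untrusted).

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Jobs (school_name TEXT, school_state TEXT, job_opening TEXT);
    CREATE TABLE Ranking (school_name TEXT, rank INTEGER);
    INSERT INTO Jobs VALUES
        ('harvard', 'ma', 'yes'), ('berkeley', 'ca', 'yes'),
        ('ucsd', 'ca', 'no'), ('uc merced', 'ca', 'no'), ('ucsc', 'ca', 'yes');
    INSERT INTO Ranking VALUES
        ('harvard', 11), ('berkeley', 3), ('ucsd', 23), ('uc merced', NULL);
""")

BASE = """
    SELECT J.school_name, J.job_opening, R.school_name, R.rank
    FROM Jobs AS J, Ranking AS R
    WHERE J.school_state = 'ca'
      AND J.school_name = R.school_name
      AND J.school_name = 'uc merced'
"""
STRICT = BASE + " AND R.rank <= 25"   # rank is trusted: keep its predicate
RELAXED = BASE                        # rank is untrusted: drop its predicate

strict_rows = db.execute(STRICT).fetchall()
relaxed_rows = db.execute(RELAXED).fetchall()
print(strict_rows, relaxed_rows)
```

With rank trusted, the null rank fails the predicate and no provenance tuple is returned, so UC Merced is not even a potential answer; with rank untrusted, one tuple comes back and both job_opening and rank need a hypothetical update.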