1 on provenance of non-answers for queries over extracted data jiansheng huang ting chen anhai doan...

Post on 30-Dec-2015

213 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

1

On Provenance of Non-Answers for Queries over Extracted Data

Jiansheng Huang

Ting Chen

AnHai Doan

Jeffrey F. Naughton

2

Imprecise Data in Information Extraction (IE)

Source-1

Source-2

……

integrate

Source-n extract

Imprecise

Fuzzy

query

Answers

Database

Imprecise

IncorrectIncomplete

Examples:DBLife (WISC)Avatar (IBM)etc.

3

State of the Art

Answers Non-Answers

ProbabilityCavallo et al., 1987;Barbara et al., 1992;

Lakshmanan et al., 1997; Dalvi et al., 2004.

ProvenanceWoodruff et al., 1997;Cui et al., 2000, 2001;Buneman et al., 2001;Bhagwat et al., 2004;

Probability + ProvenanceBenjelloun et al., 2006

4

Motivating Example

• Crawl web to extract information related to academic job openings.

• Store result of extraction in an RDBMS.

• Ask SQL queries and try to interpret results

5

Extracted Jobs

CS Dept. web sites

yescaucsc

………

nocauc merced

nocaucsd

yescaberkeley

ma

school_state

yesharvard

job_openingschool_name

6

Extracted Ranking

CS Ranking Web Site

……

uc merced

23ucsd

3berkeley

11harvard

rankschool_name

7

Question Answering

What are the CS PhD programs in California (CA) that have job openings and are in the top 25?

SELECT Jobs.school_nameFROM Jobs, RankingWHERE Ranking.rank <= 25

AND Jobs.job_opening = yesAND Jobs.school_name = Ranking.school_name;

AND Jobs.school_state = ca

8

Answer

• berkeley …

yescaberkeley

school_state job_openingschool_name

3berkeley

rankschool_name justifies

9

Non-Answers

• ucsd, uc merced, harvard, ucsc,…

• But why?

• Data exists.

• No mechanism.

yescaucsc

nocauc merced

nocaucsd

ma

school_state

yesharvard

job _openingschool_name

uc merced

23ucsd

11harvard

rankschool_name

10

Assumptions

• Relational data model

• Subset of SQL – Selection (e.g., R.a = 2)

– Projection (e.g., return R.a)

– Join (e.g., R.a = S.b)

• Conjunctive predicates (e.g., a = 2 and b = 3)

• Satisfiable (e.g., no “a = 2 and a = 3”)

11

Provenance of Non-Answers

query

z is a potential answer, (x, y) and (x´, y) are the provenance of z.

x

samequery

zy

updates

y

(x -› x´, y) explains why z is not an answer and how z can become an answer.

12

Example

nocaucsd

school state

jobopening

schoolname

23ucsd

rankschoolname

yescaucsd

school state

jobopening

schoolname

23ucsd

rankschoolname ucsd

ucsd is a potential answer, The set of base tuples is a provenance of ucsd.

Jobs Ranking

13

Another Example

yesmaharvard

school state

jobopening

schoolname

23ucsd

rankschoolname

yescaucsd

school state

jobopening

schoolname

23ucsd

rankschoolname ucsd

Jobs Ranking

14

Trust and Constraints

Untrusted Trustedquery

x

samequery

zy

valid updates

y

Satisfy constraints

Don’t consider updates

15

Example: Using Trust

yesmaharvard

school state

jobopening

schoolname

23ucsd

rankschoolname

yescaucsd

school state

jobopening

schoolname

23ucsd

rankschoolname ucsd

trust Jobs Ranking

16

Factors DeterminingProvenance of Non-Answers

• Trusted Data

• Constraints

• Query specification

17

Algorithm

• Start from a user query and a specific non-answer

• Add predicates derived from the non-answer

• Add constraint predicates

• Retain only predicates on trusted attributes

• For attributes of a potential tuple • Determine equivalent constant value (e.g., a = 2)

• If none, return a variable

• Evaluate the provenance query

18

Example: Why is UCSD Not Answer?

• Assume that we trust– Jobs(school_name, school_state)

– Ranking(school_name, rank)

– Completeness of Jobs and Ranking

19

Computing Provenance of UCSDSELECT Jobs.school_nameFROM Jobs, RankingWHERE Jobs.job_opening = yesAND Jobs.school_state = caAND Ranking.rank <= 25AND Jobs.school_name = Ranking.school_name;

SELECT J.school_name, J.job_opening , R.school_name, R.rankFROM Jobs AS J, Ranking AS R

Trusted

Specifyingnon-answer

Hypotheticalupdate

WHERE Jobs.school_name = ucsdAND J.school_state = caAND R.rank <= 25AND J.school_name = R.school_name;

-› yes

20

Provenance of UCSD

no -› yesucsd

job_openingJobs.school_name

23ucsd

rankRanking.school_name

UCSD is a potential answer.

Why not an answer? because job_opening = no.

How to become an answer? job_opening: no -› yes.

21

Provenance-Assisted Debugging

• While implementing our job extraction example, actually used provenance of non-answers to find a bug.

• Specifically, noticed UCSD is not an answer to “find all dept. in top 25 with job openings”

• Informed by provenance, we checked UCSD web page and found it does have a job opening.

• What happened?

22

Our Bug

• UCSD web page has a job opening

• Debugged extraction for UCSD instance

• Bug: a line in source longer than the line buffer for read

• Fix: increase line buffer size

• Re-extract and re-query produces UCSD as answer

23

• New records can be inserted.– Use an all-null tuple as a proxy in our provenance

report (not actually inserted).

• The join expression for a provenance query depends on the trust and constraints on the joined tables.

Deeper Issues

24

More on Join Expression

• if S.c2 is trusted,– if S is complete, R join S;– if S.c2 is unique, R =x S;– Otherwise, (R join S) union (R x {null, …});

• if S.c2 is not trusted– If S is complete, R x S;– Otherwise, R x (S union {null, …});

Given a join between R and S on R.c1 = S.c2 andassuming R.c1 is trusted and R is complete, the join expression for the provenance query is:

25

Example: Why is UC Santa Cruz (UCSC) Not Answer?

• Assume that we trust– Jobs(school_name, school_state)

– Ranking(school_name, rank)

– Completeness of Jobs

• Ranking.school_name is unique

26

Computing Provenance of UCSCSELECT Jobs.school_nameFROM Jobs, RankingWHERE Jobs.job_opening = yesAND Jobs.school_state = caAND Ranking.rank <= 25AND Jobs.school_name = Ranking.school_name;

SELECT J.school_name, J.job_opening -› yes, R.school_name -› J.school_name, R.rank -› X FROM Jobs AS J LEFT OUTER JOIN Ranking AS RWHERE J.school_state = caAND R.rank <= 25AND J.school_name = R.school_nameAND Jobs.school_name = ucsc;

Trusted

Specifyingnon-answer

Hypotheticalupdate

27

Provenance of UCSC

yesucsc

job_openingJobs.school_name

null -› X <=25null -› ucsc

rankRanking.school_name

UCSC is a potential answer.

Why not an answer? Because no ranking for ucsc.

How to become an answer? a new ranking tuple is inserted: (null -› ucsc, null -› X <= 25)

28

Dataset for Experiment

• Extracted CS Ph.D. program ranks from the CRA web site (108 schools).

• Extracted job openings from department web sites (108 schools).

• Assumption: Trust Jobs(school_name, school_state) and Jobs’ completeness.

29

Impact of Trust/Constraints on Provenance of UCSD

• No trust/constraints on Ranking: 109 provenance tuples

• Trust Ranking.school: 2 provenance tuples

• Trust Ranking.school and Ranking.school is unique: 1 provenance tuple

30

Impact of Trust/Constraints on Provenance Scalability

• Scale up the database by a factor of 100, compare the number of provenance of tuples of UCSD.

• No trust/constraints on Ranking: x 100.

• Trust Ranking.school: no change.

31

Conclusion

• Proposed a mechanism for explaining a non-answer by using data, constraints, and query.

• Showed that trust and constraints are critical for getting focused provenance.

• Some opportunities for future work– Formal theory (e.g., in relational algebra)

– Provenance ranking, etc.

32

THANKS!

33

Original Context: Condor Project

• Distributed computing research project

• Develops and maintains the Condor system software, and supports a production distributed computing facility at UW-Madison.

34

O(10) ~ O(1000)O(1) ~ O(1000)

Data Management in Condor

Execute MachineSubmit Machine

Scheduler Executor

Central Manager

CollectorNegotiator

job

job

param

machinemachine job

Job and system state in

local log files!

35

CondorDB to the Rescue

CondorDBDatabase

O(10) ~ O(1000)O(1) ~ O(1000)

Execute MachineSchedule Machine

Scheduler Executor

Central Manager

CollectorNegotiator

job

param

machinemachine job

Query

36

CondorDB Deployments

many more …

37

Imprecise Data in CondorDB

Database Answersquery

Condornodes

uncontrollableunpredictable

Autonomous

Out of date,Inconsistent,

IncorrectIncorrect,

Incomplete

38

Does this problem also occur in any other application?

39

Example: Why is UC Merced Not Answer?

• Assume that we trust– Jobs(school_name, school_state)

– Ranking(school_name, rank)

– Completeness of Jobs and Ranking

40

Computing Provenance of UC Merced

SELECT Jobs.school_nameFROM Jobs, RankingWHERE Jobs.job_opening = yesAND Jobs.school_state = caAND Ranking.rank <= 25AND Jobs.school_name = Ranking.school_name;

SELECT J.school_name, J.job_opening -› yes, R.school_name, R.rankFROM Jobs AS J, Ranking AS RWHERE J.school_state = caAND R.rank <= 25AND J.school_name = R.school_nameAND Jobs.school_name = uc merced;

Trusted

Specifyingnon-answer

Hypotheticalupdate

41

Provenance of UC Merced

job_openingJobs.school_name rankRanking.school_name

UC Merced is not a potential answer.

Why not a potential answer?because the trusted data (rank) does not satisfy the query.

42

Example: Relaxing Trust for Provenance of UC Merced

• Assume that we trust– Jobs(school_name, school_state)

– Ranking(school_name)

– Completeness of Jobs and Ranking

43

Computing Provenance of UC Merced

SELECT Jobs.school_nameFROM Jobs, RankingWHERE Jobs.job_opening = yesAND Jobs.school_state = caAND Ranking.rank <= 25AND Jobs.school_name = Ranking.school_name;

SELECT J.school_name, J.job_opening -› yes, R.school_name, R.rank -› X <=25FROM Jobs AS J, Ranking AS RWHERE J.school_state = caAND J.school_name = R.school_nameAND Jobs.school_name = uc merced;

Trusted

Specifyingnon-answer

Hypotheticalupdate

44

Provenance of UC Merced

no -› yesuc merced

job_openingJobs.school_name

null -› X <= 25uc merced

rankRanking.school_name

UC Merced is a potential answer.

Why not an answer? because job_opening = no and rank = null.

How to become an answer? job_opening: no -› yes and rank: null -› X <= 25.

top related