1
On Provenance of Non-Answers for Queries over Extracted Data
Jiansheng Huang
Ting Chen
AnHai Doan
Jeffrey F. Naughton
2
Imprecise Data in Information Extraction (IE)
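[Diagram: data is extracted from Source-1, Source-2, ..., Source-n and integrated into a database; queries over the database return answers. Labels on the diagram: Imprecise, Fuzzy, Imprecise, Incorrect, Incomplete.]
Examples: DBLife (WISC), Avatar (IBM), etc.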
3
State of the Art
Columns on the slide: Answers, Non-Answers
Probability: Cavallo et al., 1987; Barbara et al., 1992; Lakshmanan et al., 1997; Dalvi et al., 2004.
Provenance: Woodruff et al., 1997; Cui et al., 2000, 2001; Buneman et al., 2001; Bhagwat et al., 2004.
Probability + Provenance: Benjelloun et al., 2006.
4
Motivating Example
• Crawl web to extract information related to academic job openings.
• Store result of extraction in an RDBMS.
• Ask SQL queries and try to interpret the results.
5
Extracted Jobs
Source: CS Dept. web sites
school_name   job_opening   school_state
harvard       yes           ma
berkeley      yes           ca
ucsd          no            ca
uc merced     no            ca
...           ...           ...
ucsc          yes           ca
6
Extracted Ranking
Source: CS Ranking Web Site
school_name   rank
harvard       11
berkeley      3
ucsd          23
uc merced     null
...           ...
7
Question Answering
What are the CS PhD programs in California (CA) that have job openings and are in the top 25?
SELECT Jobs.school_name
FROM Jobs, Ranking
WHERE Ranking.rank <= 25
  AND Jobs.job_opening = 'yes'
  AND Jobs.school_state = 'ca'
  AND Jobs.school_name = Ranking.school_name;
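As a quick check, the query can be run against the handful of rows shown on the Extracted Jobs and Extracted Ranking slides. This is a minimal sketch using Python's built-in sqlite3 module; the in-memory database, the `SETUP_SQL` and `run_query` names, and the choice of SQLite are assumptions made for illustration, not part of the original system.

```python
import sqlite3

# Rows transcribed from the Extracted Jobs and Extracted Ranking slides.
# uc merced has no rank value on the slide, so it is stored as NULL here.
SETUP_SQL = """
CREATE TABLE Jobs (school_name TEXT, job_opening TEXT, school_state TEXT);
CREATE TABLE Ranking (school_name TEXT, rank INTEGER);
INSERT INTO Jobs VALUES
  ('harvard', 'yes', 'ma'), ('berkeley', 'yes', 'ca'), ('ucsd', 'no', 'ca'),
  ('uc merced', 'no', 'ca'), ('ucsc', 'yes', 'ca');
INSERT INTO Ranking VALUES
  ('harvard', 11), ('berkeley', 3), ('ucsd', 23), ('uc merced', NULL);
"""

def run_query():
    """Run the slide's query and return the answering school names."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP_SQL)
    rows = conn.execute(
        """
        SELECT Jobs.school_name
        FROM Jobs, Ranking
        WHERE Ranking.rank <= 25
          AND Jobs.job_opening = 'yes'
          AND Jobs.school_state = 'ca'
          AND Jobs.school_name = Ranking.school_name
        """
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]

answers = run_query()
print(answers)  # ['berkeley'] on these sample rows
```

On these sample rows only berkeley qualifies; every other school is a non-answer, and the query result alone does not say why.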
8
Answer
• berkeley …
Justified by the base tuples:
Jobs:     school_name = berkeley, job_opening = yes, school_state = ca
Ranking:  school_name = berkeley, rank = 3
9
Non-Answers
• ucsd, uc merced, harvard, ucsc,…
• But why?
• Data exists.
• No mechanism.
Jobs:
school_name   job_opening   school_state
harvard       yes           ma
ucsd          no            ca
uc merced     no            ca
ucsc          yes           ca
Ranking:
school_name   rank
harvard       11
ucsd          23
uc merced     null
10
Assumptions
• Relational data model
• Subset of SQL
– Selection (e.g., R.a = 2)
– Projection (e.g., return R.a)
– Join (e.g., R.a = S.b)
• Conjunctive predicates (e.g., a = 2 and b = 3)
• Satisfiable (e.g., no “a = 2 and a = 3”)
11
Provenance of Non-Answers
[Diagram: the query over (x, y) does not return z; after the updates x -> x´, the same query over (x´, y) returns z.]
z is a potential answer; (x, y) and (x´, y) are the provenance of z.
(x -> x´, y) explains why z is not an answer and how z can become an answer.
12
Example
Original tuples:
Jobs:     school_name = ucsd, job_opening = no, school_state = ca
Ranking:  school_name = ucsd, rank = 23
Updated tuples:
Jobs:     school_name = ucsd, job_opening = yes, school_state = ca
Ranking:  school_name = ucsd, rank = 23
Query result: ucsd
ucsd is a potential answer. The set of base tuples is a provenance of ucsd.
13
Another Example
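Original tuples:
Jobs:     school_name = harvard, job_opening = yes, school_state = ma
Ranking:  school_name = ucsd, rank = 23
Updated tuples:
Jobs:     school_name = ucsd, job_opening = yes, school_state = ca
Ranking:  school_name = ucsd, rank = 23
Query result: ucsd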
14
Trust and Constraints
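[Diagram: x is untrusted data and y is trusted data. Valid updates x -> x´ must satisfy the constraints; updates to the trusted data y are not considered. After a valid update, the same query over (x´, y) returns z.]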
15
Example: Using Trust
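Original tuples:
Jobs:     school_name = harvard, job_opening = yes, school_state = ma
Ranking:  school_name = ucsd, rank = 23
Updated tuples:
Jobs:     school_name = ucsd, job_opening = yes, school_state = ca
Ranking:  school_name = ucsd, rank = 23
Query result: ucsd
[The slide marks the trusted data with a "trust" label.]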
16
Factors Determining Provenance of Non-Answers
• Trusted Data
• Constraints
• Query specification
17
Algorithm
• Start from a user query and a specific non-answer
• Add predicates derived from the non-answer
• Add constraint predicates
• Retain only predicates on trusted attributes
• For attributes of a potential tuple
– Determine equivalent constant value (e.g., a = 2)
– If none, return a variable
• Evaluate the provenance query
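The steps above can be sketched in a few lines of Python. This is an illustrative simplification, not the authors' implementation: the `Pred` type and `derive_provenance_query` function are hypothetical names, predicates are restricted to simple comparisons and equi-joins, attribute equivalence through joins is not tracked, and the final evaluation step is left to a database.

```python
from typing import NamedTuple

class Pred(NamedTuple):
    left: str              # attribute name, e.g. "R.rank"
    op: str                # comparison operator, e.g. "=" or "<="
    right: object          # a constant, or an attribute name if is_join
    is_join: bool = False

def derive_provenance_query(query_preds, non_answer_preds, constraint_preds,
                            trusted, output_attrs):
    """Return (predicates to keep, hypothetical updates) for one non-answer."""
    # Start from the user query, then add the predicates that name the
    # non-answer and the constraint predicates.
    preds = list(query_preds) + list(non_answer_preds) + list(constraint_preds)

    def attrs_of(p):
        return [p.left, p.right] if p.is_join else [p.left]

    # Retain only predicates that mention trusted attributes exclusively.
    kept = [p for p in preds if all(a in trusted for a in attrs_of(p))]

    # For each untrusted output attribute: use the constant the query
    # requires if there is one, otherwise a constrained variable X.
    updates = {}
    for attr in output_attrs:
        if attr in trusted:
            continue
        on_attr = [p for p in preds if not p.is_join and p.left == attr]
        constants = [p.right for p in on_attr if p.op == "="]
        if constants:
            updates[attr] = str(constants[0])
        elif on_attr:
            updates[attr] = " and ".join(f"X {p.op} {p.right}" for p in on_attr)
    return kept, updates

QUERY = [
    Pred("J.job_opening", "=", "yes"),
    Pred("J.school_state", "=", "ca"),
    Pred("R.rank", "<=", 25),
    Pred("J.school_name", "=", "R.school_name", is_join=True),
]
NON_ANSWER = [Pred("J.school_name", "=", "ucsd")]
TRUSTED = {"J.school_name", "J.school_state", "R.school_name", "R.rank"}
OUTPUT = ["J.school_name", "J.job_opening", "R.school_name", "R.rank"]

kept, updates = derive_provenance_query(QUERY, NON_ANSWER, [], TRUSTED, OUTPUT)
print(updates)  # {'J.job_opening': 'yes'}
```

With these inputs the predicate on the untrusted attribute job_opening is dropped and reported as a hypothetical update. Removing "R.rank" from the trusted set would also drop the rank predicate and report the variable "X <= 25" for it.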
18
Example: Why is UCSD Not an Answer?
• Assume that we trust
– Jobs(school_name, school_state)
– Ranking(school_name, rank)
– Completeness of Jobs and Ranking
19
Computing Provenance of UCSD
Original query:
SELECT Jobs.school_name
FROM Jobs, Ranking
WHERE Jobs.job_opening = 'yes'
  AND Jobs.school_state = 'ca'
  AND Ranking.rank <= 25
  AND Jobs.school_name = Ranking.school_name;
Provenance query:
SELECT J.school_name, J.job_opening -> yes, R.school_name, R.rank
FROM Jobs AS J, Ranking AS R
WHERE J.school_name = 'ucsd'
  AND J.school_state = 'ca'
  AND R.rank <= 25
  AND J.school_name = R.school_name;
Labels on the slide: Trusted (the retained predicates), Specifying non-answer (J.school_name = 'ucsd'), Hypothetical update (J.job_opening -> yes).
20
Provenance of UCSD
Jobs.school_name   job_opening   Ranking.school_name   rank
ucsd               no -> yes     ucsd                  23
UCSD is a potential answer.
Why not an answer? Because job_opening = no.
How to become an answer? job_opening: no -> yes.
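The UCSD provenance can be reproduced on the sample rows from the earlier slides. The sketch below uses Python's built-in sqlite3 module; the `provenance_of` function, the in-memory database, and the way the hypothetical update is rendered as a string are assumptions for illustration. The predicate on the untrusted attribute job_opening is omitted from the WHERE clause, and the required change is attached to each result row afterwards.

```python
import sqlite3

# Sample rows from the Extracted Jobs and Extracted Ranking slides.
SETUP_SQL = """
CREATE TABLE Jobs (school_name TEXT, job_opening TEXT, school_state TEXT);
CREATE TABLE Ranking (school_name TEXT, rank INTEGER);
INSERT INTO Jobs VALUES
  ('harvard', 'yes', 'ma'), ('berkeley', 'yes', 'ca'), ('ucsd', 'no', 'ca'),
  ('uc merced', 'no', 'ca'), ('ucsc', 'yes', 'ca');
INSERT INTO Ranking VALUES
  ('harvard', 11), ('berkeley', 3), ('ucsd', 23), ('uc merced', NULL);
"""

def provenance_of(school):
    """Provenance tuples for a non-answer when only job_opening is untrusted."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP_SQL)
    rows = conn.execute(
        """
        SELECT J.school_name, J.job_opening, R.school_name, R.rank
        FROM Jobs AS J, Ranking AS R
        WHERE J.school_name = ?
          AND J.school_state = 'ca'
          AND R.rank <= 25
          AND J.school_name = R.school_name
        """,
        (school,),
    ).fetchall()
    conn.close()
    report = []
    for j_name, opening, r_name, rank in rows:
        # The original query requires job_opening = 'yes'; show the change needed.
        cell = opening if opening == "yes" else f"{opening} -> yes"
        report.append((j_name, cell, r_name, rank))
    return report

print(provenance_of("ucsd"))  # [('ucsd', 'no -> yes', 'ucsd', 23)]
```

An empty result means the school is not a potential answer under the stated trust assumptions, because some trusted value already violates the query.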
21
Provenance-Assisted Debugging
• While implementing our job extraction example, we actually used provenance of non-answers to find a bug.
• Specifically, we noticed that UCSD is not an answer to “find all dept. in top 25 with job openings”.
• Informed by the provenance, we checked the UCSD web page and found that it does have a job opening.
• What happened?
22
Our Bug
• UCSD web page has a job opening
• Debugged extraction for UCSD instance
• Bug: a line in the source was longer than the line buffer used for reading
• Fix: increase the line buffer size
• Re-extracting and re-querying produces UCSD as an answer
23
Deeper Issues
• New records can be inserted.
– Use an all-null tuple as a proxy in our provenance report (not actually inserted).
• The join expression for a provenance query depends on the trust and constraints on the joined tables.
24
More on Join Expression
Given a join between R and S on R.c1 = S.c2, and assuming R.c1 is trusted and R is complete, the join expression for the provenance query is:
• If S.c2 is trusted
– If S is complete, R join S;
– If S.c2 is unique, R left outer join S;
– Otherwise, (R join S) union (R x {null, …});
• If S.c2 is not trusted
– If S is complete, R x S;
– Otherwise, R x (S union {null, …});
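The case analysis can be written as a small decision function. This is only a sketch of the rules as listed on the slide: the `join_expression` name and the string rendering are hypothetical, the cases are tested in the slide's order, and "left outer join" is the reading assumed here for the slide's outer-join symbol (consistent with the LEFT OUTER JOIN used in the UCSC example).

```python
def join_expression(s_c2_trusted: bool, s_complete: bool, s_c2_unique: bool) -> str:
    """Pick the join expression for the provenance query.

    Assumes, as on the slide, a join on R.c1 = S.c2 where R.c1 is trusted
    and R is complete. {null, ...} stands for the all-null proxy tuple.
    """
    if s_c2_trusted:
        if s_complete:
            return "R join S"
        if s_c2_unique:
            return "R left outer join S"
        return "(R join S) union (R x {null, ...})"
    if s_complete:
        return "R x S"
    return "R x (S union {null, ...})"

# Ranking.school_name trusted and unique, Ranking not known to be complete:
print(join_expression(True, False, True))   # R left outer join S
# Nothing trusted or known about S:
print(join_expression(False, False, False)) # R x (S union {null, ...})
```

The less that is trusted about S, the wider the expression becomes, ending in a cross product with every S tuple plus the null proxy.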
25
Example: Why is UC Santa Cruz (UCSC) Not an Answer?
• Assume that we trust
– Jobs(school_name, school_state)
– Ranking(school_name, rank)
– Completeness of Jobs
• Ranking.school_name is unique
26
Computing Provenance of UCSC
Original query:
SELECT Jobs.school_name
FROM Jobs, Ranking
WHERE Jobs.job_opening = 'yes'
  AND Jobs.school_state = 'ca'
  AND Ranking.rank <= 25
  AND Jobs.school_name = Ranking.school_name;
Provenance query:
SELECT J.school_name, J.job_opening -> yes, R.school_name -> J.school_name, R.rank -> X
FROM Jobs AS J LEFT OUTER JOIN Ranking AS R
WHERE J.school_state = 'ca'
  AND R.rank <= 25
  AND J.school_name = R.school_name
  AND J.school_name = 'ucsc';
Labels on the slide: Trusted, Specifying non-answer, Hypothetical update.
27
Provenance of UCSC
Jobs.school_name   job_opening   Ranking.school_name   rank
ucsc               yes           null -> ucsc          null -> X <= 25
UCSC is a potential answer.
Why not an answer? Because there is no ranking for ucsc.
How to become an answer? A new ranking tuple is inserted: (null -> ucsc, null -> X <= 25).
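The UCSC case can be reproduced with an actual outer join on the sample rows. In this sketch (Python's sqlite3; the `provenance_with_missing_ranking` name and the string rendering of updates are assumptions), the conditions on Ranking are placed in the ON clause. That placement is an adaptation: in standard SQL, leaving them in WHERE would filter out the null-extended row that stands in for the all-null proxy tuple.

```python
import sqlite3

# Sample rows from the Extracted Jobs and Extracted Ranking slides.
# Note that ucsc has a Jobs row but no Ranking row.
SETUP_SQL = """
CREATE TABLE Jobs (school_name TEXT, job_opening TEXT, school_state TEXT);
CREATE TABLE Ranking (school_name TEXT, rank INTEGER);
INSERT INTO Jobs VALUES
  ('harvard', 'yes', 'ma'), ('berkeley', 'yes', 'ca'), ('ucsd', 'no', 'ca'),
  ('uc merced', 'no', 'ca'), ('ucsc', 'yes', 'ca');
INSERT INTO Ranking VALUES
  ('harvard', 11), ('berkeley', 3), ('ucsd', 23), ('uc merced', NULL);
"""

def provenance_with_missing_ranking(school):
    """Provenance tuples when a matching Ranking tuple may have to be inserted."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP_SQL)
    rows = conn.execute(
        """
        SELECT J.school_name, J.job_opening, R.school_name, R.rank
        FROM Jobs AS J
        LEFT OUTER JOIN Ranking AS R
          ON J.school_name = R.school_name AND R.rank <= 25
        WHERE J.school_state = 'ca'
          AND J.school_name = ?
        """,
        (school,),
    ).fetchall()
    conn.close()
    report = []
    for j_name, opening, r_name, rank in rows:
        opening_cell = opening if opening == "yes" else f"{opening} -> yes"
        # A NULL Ranking side plays the role of the all-null proxy tuple:
        # it would have to become (school, some rank X <= 25).
        name_cell = r_name if r_name is not None else f"null -> {j_name}"
        rank_cell = rank if rank is not None else "null -> X <= 25"
        report.append((j_name, opening_cell, name_cell, rank_cell))
    return report

print(provenance_with_missing_ranking("ucsc"))
# [('ucsc', 'yes', 'null -> ucsc', 'null -> X <= 25')]
```

Nothing is inserted into Ranking; the null-extended row only describes the tuple that would have to exist for ucsc to become an answer.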
28
Dataset for Experiment
• Extracted CS Ph.D. program ranks from the CRA web site (108 schools).
• Extracted job openings from department web sites (108 schools).
• Assumption: Trust Jobs(school_name, school_state) and Jobs’ completeness.
29
Impact of Trust/Constraints on Provenance of UCSD
• No trust/constraints on Ranking: 109 provenance tuples
• Trust Ranking.school: 2 provenance tuples
• Trust Ranking.school and Ranking.school is unique: 1 provenance tuple
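One way to read these counts is through the join-expression rules from the "More on Join Expression" slide, applied to a Ranking table of 108 schools and the single trusted Jobs tuple for UCSD. The sketch below is an interpretation, not the authors' experiment code: `count_provenance_tuples` is a hypothetical helper and all school names other than ucsd are synthetic placeholders.

```python
def count_provenance_tuples(n_ranked, ranking_school_trusted, ranking_school_unique):
    """Count provenance tuples for UCSD under different assumptions on Ranking."""
    jobs = [("ucsd",)]  # the one Jobs tuple selected by the trusted school_name
    # Synthetic Ranking table: ucsd plus (n_ranked - 1) placeholder schools.
    ranking = [("ucsd", 23)] + [(f"school_{i}", None) for i in range(1, n_ranked)]
    null_proxy = (None, None)  # stands for a tuple that could be inserted

    if not ranking_school_trusted:
        # R x (S union {null}): any Ranking tuple could be rewritten to match.
        pairs = [(j, r) for j in jobs for r in ranking + [null_proxy]]
    elif ranking_school_unique:
        # Left outer join: the existing match, or the null proxy if none exists.
        pairs = []
        for j in jobs:
            matches = [(j, r) for r in ranking if r[0] == j[0]]
            pairs.extend(matches if matches else [(j, null_proxy)])
    else:
        # (R join S) union (R x {null}): the match plus a possible new tuple.
        pairs = [(j, r) for j in jobs for r in ranking if r[0] == j[0]]
        pairs += [(j, null_proxy) for j in jobs]
    return len(pairs)

print(count_provenance_tuples(108, False, False))  # 109
print(count_provenance_tuples(108, True, False))   # 2
print(count_provenance_tuples(108, True, True))    # 1
```

Under this reading the untrusted case grows with the size of Ranking, while the trusted cases stay constant no matter how many schools are ranked.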
30
Impact of Trust/Constraints on Provenance Scalability
• Scale up the database by a factor of 100 and compare the number of provenance tuples of UCSD.
• No trust/constraints on Ranking: the number of provenance tuples grows by a factor of 100.
• Trust Ranking.school: no change.
31
Conclusion
• Proposed a mechanism for explaining a non-answer by using data, constraints, and query.
• Showed that trust and constraints are critical for getting focused provenance.
• Some opportunities for future work
– Formal theory (e.g., in relational algebra)
– Provenance ranking, etc.
32
THANKS!
33
Original Context: Condor Project
• Distributed computing research project
• Develops and maintains the Condor system software, and supports a production distributed computing facility at UW-Madison.
34
Data Management in Condor
[Diagram: Execute Machines (O(10) ~ O(1000)) and Submit Machines (O(1) ~ O(1000)) with a Scheduler and an Executor, plus a Central Manager running the Collector and Negotiator. Arrows between the components are labeled job, param, and machine.]
Job and system state in local log files!
35
CondorDB to the Rescue
[Diagram: the same Condor components (Execute Machines, O(10) ~ O(1000); Schedule Machines, O(1) ~ O(1000); Scheduler; Executor; Central Manager with Collector and Negotiator) now send job, param, and machine data to the CondorDB database, which can be queried.]
36
CondorDB Deployments
many more …
37
Imprecise Data in CondorDB
[Diagram: Condor nodes (autonomous, uncontrollable, unpredictable) populate a database that is queried for answers. Data quality labels on the diagram: out of date, inconsistent, incorrect, incomplete.]
38
Does this problem also occur in any other application?
39
Example: Why is UC Merced Not an Answer?
• Assume that we trust
– Jobs(school_name, school_state)
– Ranking(school_name, rank)
– Completeness of Jobs and Ranking
40
Computing Provenance of UC Merced
Original query:
SELECT Jobs.school_name
FROM Jobs, Ranking
WHERE Jobs.job_opening = 'yes'
  AND Jobs.school_state = 'ca'
  AND Ranking.rank <= 25
  AND Jobs.school_name = Ranking.school_name;
Provenance query:
SELECT J.school_name, J.job_opening -> yes, R.school_name, R.rank
FROM Jobs AS J, Ranking AS R
WHERE J.school_state = 'ca'
  AND R.rank <= 25
  AND J.school_name = R.school_name
  AND J.school_name = 'uc merced';
Labels on the slide: Trusted, Specifying non-answer, Hypothetical update.
41
Provenance of UC Merced
Jobs.school_name   job_opening   Ranking.school_name   rank
(no provenance tuples)
UC Merced is not a potential answer.
Why not a potential answer? Because the trusted data (rank) does not satisfy the query.
42
Example: Relaxing Trust for Provenance of UC Merced
• Assume that we trust
– Jobs(school_name, school_state)
– Ranking(school_name)
– Completeness of Jobs and Ranking
43
Computing Provenance of UC Merced
Original query:
SELECT Jobs.school_name
FROM Jobs, Ranking
WHERE Jobs.job_opening = 'yes'
  AND Jobs.school_state = 'ca'
  AND Ranking.rank <= 25
  AND Jobs.school_name = Ranking.school_name;
Provenance query:
SELECT J.school_name, J.job_opening -> yes, R.school_name, R.rank -> X <= 25
FROM Jobs AS J, Ranking AS R
WHERE J.school_state = 'ca'
  AND J.school_name = R.school_name
  AND J.school_name = 'uc merced';
Labels on the slide: Trusted, Specifying non-answer, Hypothetical update.
44
Provenance of UC Merced
Jobs.school_name   job_opening   Ranking.school_name   rank
uc merced          no -> yes     uc merced             null -> X <= 25
UC Merced is a potential answer.
Why not an answer? Because job_opening = no and rank = null.
How to become an answer? job_opening: no -> yes and rank: null -> X <= 25.
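The two UC Merced examples differ only in whether Ranking.rank is trusted, which can be shown by toggling a single predicate. This is a sketch on the sample rows from the earlier slides using Python's sqlite3; the `provenance` function and its `trust_rank` flag are hypothetical names, and uc merced's missing rank is stored as NULL.

```python
import sqlite3

# Sample rows from the Extracted Jobs and Extracted Ranking slides.
SETUP_SQL = """
CREATE TABLE Jobs (school_name TEXT, job_opening TEXT, school_state TEXT);
CREATE TABLE Ranking (school_name TEXT, rank INTEGER);
INSERT INTO Jobs VALUES
  ('harvard', 'yes', 'ma'), ('berkeley', 'yes', 'ca'), ('ucsd', 'no', 'ca'),
  ('uc merced', 'no', 'ca'), ('ucsc', 'yes', 'ca');
INSERT INTO Ranking VALUES
  ('harvard', 11), ('berkeley', 3), ('ucsd', 23), ('uc merced', NULL);
"""

def provenance(school, trust_rank):
    """Provenance tuples for a non-answer, with or without trusting Ranking.rank."""
    # A predicate is kept in the provenance query only if its attribute is trusted.
    rank_predicate = "AND R.rank <= 25" if trust_rank else ""
    sql = f"""
        SELECT J.school_name, J.job_opening, R.school_name, R.rank
        FROM Jobs AS J, Ranking AS R
        WHERE J.school_name = ?
          AND J.school_state = 'ca'
          {rank_predicate}
          AND J.school_name = R.school_name
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP_SQL)
    rows = conn.execute(sql, (school,)).fetchall()
    conn.close()
    report = []
    for j_name, opening, r_name, rank in rows:
        opening_cell = opening if opening == "yes" else f"{opening} -> yes"
        if not trust_rank and (rank is None or rank > 25):
            # Untrusted rank that fails the query: report a variable X.
            shown = "null" if rank is None else rank
            rank_cell = f"{shown} -> X <= 25"
        else:
            rank_cell = rank
        report.append((j_name, opening_cell, r_name, rank_cell))
    return report

print(provenance("uc merced", trust_rank=True))   # []
print(provenance("uc merced", trust_rank=False))
# [('uc merced', 'no -> yes', 'uc merced', 'null -> X <= 25')]
```

With rank trusted, the NULL rank fails the retained predicate and no provenance tuple is produced, so UC Merced is not a potential answer. Once the trust on rank is relaxed, the same school becomes a potential answer that needs two updates.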