Probabilities in Databases and Logics I
Nilesh Dalvi and Dan Suciu
University of Washington
Two Lectures
•Today:
probabilistic database to model imprecisions
probabilistic logics
•Tomorrow:
probabilistic database to model incompleteness
random graphs
Motivation
Record reconciliation
Information extraction
Constraint violations
Schema matching
Problem Setting
Tables:
Movie
title            year  p
Twelve Monkeys   1995  .8
Monkey Love      1997  .4
Monkey Love      1935  .9
Monkey Love Pl   2005  .7
Review
name             rating  p
Monkey Love      good    .5
                 fair    .2
                 fair    .6
                 poor    .9
Queries: A(x,y) :- Review(x,y), Movie(x,z), z > 1991
Answers:
title            rating  p
Twelve Monkeys   fair    .53
Monkey Love      good    .42
Monkey Love Pl   fair    .15
Top k
Problem: complexity of query evaluation
Two Problems
Fixed schema S, conjunctive query Q(x,y)
Query evaluation problem
Fix answer tuple (a,b). Given database I, compute Pr(Q(a,b))
Top-k answering problem
Fix k > 0. Given database I, find k answer tuples with highest probabilities
Related Work: DB
Cavallo & Pittarelli: 1987
Barbara, Garcia-Molina, Porter: 1992
Lakshmanan, Leone, Ross & Subrahmanian: 1997
Fuhr & Roellke: 1997
Dalvi & S: 2004
Widom: 2005
Related Work: Logic
Query reliability [Gradel, Gurevich, Hirsch ’98]
Degrees of belief [Bacchus, Grove, Halpern, Koller ’96]
Probabilistic Logic [Nilsson]
Probabilistic model checking [Kwiatkowska ’02]
Probabilistic Relational Model [Taskar, Abbeel, Koller ’02]
Outline
Definitions
Query Evaluation
Top-k answering (joint with Chris Re)
Conclusions
Probabilistic Database
•Schema S, Domain D, Set of instances Inst
•Definition: A probabilistic database is a probability distribution
Pr : Inst → [0,1], ∑_I Pr[I] = 1
If Pr[I] > 0 then I is called a “possible world”
Probabilistic Database
•Representation:
•Independent tuples: I-database DB over some schema S^i
•Independent and disjoint tuples: ID-database DB over some schema S^id
Semantics:
DB “means” probability distribution Pr over schema S
I-Databases
Representation: Reviews^i(M,S,p)
Movie  Score  P
m42    good   p1
m99    good   p2
m76    poor   p3
Possible worlds semantics: Reviews(M,S)
Every subset of the three tuples is a possible world:
I1 = {}: Pr[I1] = (1-p1)*(1-p2)*(1-p3)
I4 = {(m42, good), (m99, good)}: Pr[I4] = p1*p2*(1-p3)
I8 = {(m42, good), (m99, good), (m76, poor)}: Pr[I8] = p1*p2*p3
Remaining worlds: {(m76, poor)}, {(m42, good), (m76, poor)}, {(m99, good), (m76, poor)}, {(m42, good)}, {(m99, good)}
Pr[I1] + Pr[I2] + . . . + Pr[I8] = 1
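The possible-worlds semantics of an I-database can be checked by brute force. The sketch below enumerates all subsets of the three Reviews tuples and multiplies p or 1-p per tuple. The numeric values chosen for p1, p2, p3 and the function name `possible_worlds` are assumptions made for illustration only.

```python
from itertools import product

def possible_worlds(tuples):
    """Enumerate (world, probability) pairs for a tuple-independent table.

    `tuples` maps each tuple to its marginal probability.
    """
    items = list(tuples.items())
    worlds = []
    for mask in product([False, True], repeat=len(items)):
        world = frozenset(t for (t, _), keep in zip(items, mask) if keep)
        prob = 1.0
        for (_, p), keep in zip(items, mask):
            # Each tuple is present with probability p, absent with 1 - p.
            prob *= p if keep else 1.0 - p
        worlds.append((world, prob))
    return worlds

# Hypothetical values standing in for p1, p2, p3.
reviews_i = {("m42", "good"): 0.5, ("m99", "good"): 0.8, ("m76", "poor"): 0.3}
worlds = possible_worlds(reviews_i)
total = sum(p for _, p in worlds)
empty_world_prob = dict(worlds)[frozenset()]
```

With three tuples this yields eight worlds whose probabilities add up to 1; the enumeration is exponential in the number of tuples, so it only serves as a reference for small examples.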
ID-Databases
Representation: Activities^id
Time^d  Activity  P
t       walk      p1
t       run       p2
t+1     walk      p3
Possible worlds: Activities(Time, Act)
I1 = {}: Pr[I1] = (1-p1-p2)*(1-p3)
I3 = {(t, run)}: Pr[I3] = p2*(1-p3)
I5 = {(t, walk), (t+1, walk)}: Pr[I5] = p1*p3
Remaining worlds: {(t, walk)}, {(t+1, walk)}, {(t, run), (t+1, walk)}
Pr[I1] + Pr[I2] + . . . + Pr[I6] = 1
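For ID-databases, tuples that share the same Time value are disjoint (at most one of them occurs), while different Time values are independent. A minimal sketch of the resulting world enumeration follows; the dictionary layout, the function name `id_possible_worlds`, and the numeric probabilities are assumptions for illustration.

```python
from itertools import product

def id_possible_worlds(blocks):
    """Enumerate (world, probability) pairs for an ID-table.

    `blocks` maps a key value to a list of (value, probability) alternatives.
    Alternatives inside one block are disjoint; blocks are independent.
    """
    per_block = []
    for key, alternatives in blocks.items():
        none_prob = 1.0 - sum(p for _, p in alternatives)
        # Either no tuple of the block occurs, or exactly one alternative does.
        options = [(None, none_prob)] + [((key, v), p) for v, p in alternatives]
        per_block.append(options)
    worlds = []
    for choice in product(*per_block):
        world = frozenset(t for t, _ in choice if t is not None)
        prob = 1.0
        for _, p in choice:
            prob *= p
        worlds.append((world, prob))
    return worlds

# Hypothetical values standing in for p1, p2, p3.
activities_id = {"t": [("walk", 0.3), ("run", 0.5)], "t+1": [("walk", 0.6)]}
worlds = dict(id_possible_worlds(activities_id))
```

Here the block for t has three options (nothing, walk, run) and the block for t+1 has two, which gives the six worlds of the slide.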
ID subsumes I
Reviews^id with both attributes marked d (Movie^d, Score^d, P) = Reviews^i (Movie, Score, P):
Movie  Score  P
m42    good   p1
m99    good   p2
m76    poor   p3
Note: Reviews^id with no attribute marked d (Movie, Score, P) means all tuples are disjoint
Queries
Movie^i
id   year  P
m42  1995  0.95
m99  2002  0.65
m76  2002  0.1
m05  2005  0.7
Review^i
mid  rating  p
m42  4       0.7
m42  5       0.45
m99  5       0.82
m99  4       0.68
m05  5       0.79
Q(y) :- Movie(x,y), Review(x,z), z >= 3
•Syntax: conjunctive queries over schema S
Two Query Semantics
•Possible answer sets
Given set A: Pr[{t | I ⊨ Q(t)} = A]
Used for views
•Possible tuples (this talk)
Given tuple t: Pr[I ⊨ Q(t)]
Used for query evaluation and top-k
Query Semantics
Q(y) :- Movie(x,y), Review(x,z)
Possible worlds (id, year), with their probabilities:
p1: (m42, 2004), (m99, 1901), (m76, 1902)
p2: (m99, 1935), (m05, 1903)
p3: (m76, 1995), (m99, 1935), (m05, 2004)
p4: (m87, 1934), (m44, 1904)
Tuple probabilities (top k):
year  p
1935  p2 + p3 = 0.6
2004  p1 + p3 = 0.5
1995  p3 = 0.2
. . . . . .
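Under the possible-tuples semantics, the probability of an answer is the sum of the probabilities of the worlds in which it appears. The sketch below reproduces the three sums on the slide. The values p1 = 0.3, p2 = 0.4, p3 = 0.2 follow from those sums; p4 = 0.1 is an assumption that makes the four probabilities add to 1, and treating the listed years as each world's answers is also an assumption.

```python
from collections import defaultdict

# (world probability, years returned by Q in that world).
worlds = [
    (0.3, {2004, 1901, 1902}),   # p1, derived from p1 + p3 = 0.5
    (0.4, {1935, 1903}),         # p2, derived from p2 + p3 = 0.6
    (0.2, {1995, 1935, 2004}),   # p3, given on the slide
    (0.1, {1934, 1904}),         # p4, assumed so that the total is 1
]

def tuple_probabilities(worlds):
    """Sum world probabilities per answer tuple."""
    probs = defaultdict(float)
    for p, answers in worlds:
        for t in answers:
            probs[t] += p
    return dict(probs)

def top_k(probs, k):
    """Return the k (tuple, probability) pairs with highest probability."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

probs = tuple_probabilities(worlds)
best_two = [year for year, _ in top_k(probs, 2)]
```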
Summary on Data Model
Data Model:
Semantics = possible worlds
Syntax = I-databases or ID-databases
Queries:
Syntax = unchanged (conjunctive queries)
Semantics = tuple probabilities
Outline
Definitions
Query evaluation
Top-k answering
Conclusions
Problem Definition
•Fix schema S, query Q, answer tuple t
•Problem: given I/ID-database DB, compute Pr[I ⊨ Q(t)] (notation: Pr[Q(t)])
•Conventions:
For upper bounds (P or #P): probabilities are rationals
For lower bounds (#P): probabilities are 1/2
Query Evaluation on I-Databases
•Outline
Intuition
Extensional plans: PTIME case
Hard queries: #P-complete case
Dichotomy Theorem
Intuition
Q(y) :- Movie(x,y), Review(x,z)
Movie^i
id   year  p
m42  1995  p1
m99  2002  p2
m76  2002  p3
m05  2005  p4
Review^i
mid  rate  p
m42  4     q1
m42  2     q2
m42  3     q3
m99  1     q4
m99  3     q5
m76  5     q6
Answer
Year  p
1995  p1 × (1 - (1 - q1) × (1 - q2) × (1 - q3))
2002  1 - (1 - p2 × (1 - (1 - q4) × (1 - q5))) × (1 - p3 × q6)
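The answer probabilities for Q(y) :- Movie(x,y), Review(x,z) combine two steps: for each movie, the probability that the movie tuple and at least one of its reviews are present, and then, per year, the probability that at least one movie of that year qualifies. A small sketch under assumed numeric values for p1..p4 and q1..q6 (the slide keeps them symbolic); the helper names are hypothetical.

```python
def independent_or(probs):
    """Probability that at least one of several independent events occurs."""
    none = 1.0
    for p in probs:
        none *= 1.0 - p
    return 1.0 - none

# Hypothetical values: movie id -> (year, p), movie id -> review probabilities.
movie = {"m42": (1995, 0.9), "m99": (2002, 0.6), "m76": (2002, 0.5), "m05": (2005, 0.7)}
review = {"m42": [0.5, 0.4, 0.2], "m99": [0.3, 0.5], "m76": [0.8]}

def answer_probabilities(movie, review):
    per_year = {}
    for mid, (year, p) in movie.items():
        if mid not in review:
            continue  # a movie without reviews never contributes an answer
        # p_i * (1 - prod(1 - q_j)): movie present and some review present.
        per_year.setdefault(year, []).append(p * independent_or(review[mid]))
    # Different movies are independent, so combine them per year.
    return {year: independent_or(ps) for year, ps in per_year.items()}

answers = answer_probabilities(movie, review)
```

Year 2005 gets no answer here because m05 has no review, matching the two-row answer table.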
I-Extensional Plans [Barbara92, Lakshmanan97]
Add a probability attribute p to every intermediate result:
Join ⋈: p = p1 * p2
Projection ∏: p = 1-(1-p1)(1-p2)...(1-pn)
Selection σ: p = p
Note: data complexity is PTIME
Q(y) :- Movie(x,y), Review(x,z)
Input: Movie tuple (m1, 1995) with probability p; three Review tuples for m1 with probabilities q1, q2, q3
INCORRECT plan: ∏(Movie ⋈ Review)
⋈ output: (1995, m1): pq1; (1995, m1): pq2; (1995, m1): pq3
∏ output: 1995: 1-(1-pq1)(1-pq2)(1-pq3)
CORRECT plan: ∏(Movie ⋈ ∏(Review))
∏ on Review: m1: 1 - (1-q1)(1-q2)(1-q3)
⋈ output: (1995, m1): p(1-(1-q1)(1-q2)(1-q3))
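The difference between the two plans can be checked numerically. The sketch below evaluates both extensional plans for one movie tuple with three reviews and compares them against a brute-force sum over all possible worlds. The values of p and q are hypothetical, and the function names are not from the slides.

```python
from itertools import product

p = 0.5               # hypothetical probability of the Movie tuple
q = [0.5, 0.5, 0.5]   # hypothetical probabilities of the three Review tuples

def plan_join_then_project(p, q):
    """Join first, then independent projection: treats p*q1, p*q2, p*q3 as independent."""
    none = 1.0
    for qi in q:
        none *= 1.0 - p * qi
    return 1.0 - none

def plan_project_then_join(p, q):
    """Project Review on the movie id first, then join with Movie."""
    none = 1.0
    for qi in q:
        none *= 1.0 - qi
    return p * (1.0 - none)

def brute_force(p, q):
    """Sum the probabilities of all worlds where the movie and some review exist."""
    total = 0.0
    for movie_in, *reviews_in in product([False, True], repeat=1 + len(q)):
        prob = p if movie_in else 1.0 - p
        for qi, present in zip(q, reviews_in):
            prob *= qi if present else 1.0 - qi
        if movie_in and any(reviews_in):
            total += prob
    return total

incorrect = plan_join_then_project(p, q)
correct = plan_project_then_join(p, q)
exact = brute_force(p, q)
```

The join-then-project plan overestimates because the three joined tuples all depend on the same Movie tuple, so they are not independent events.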
#P-Complete Queries
Qbad :- R^i(x), S(x,y), T^i(y)
R^i: column A, tuple probabilities p1, p2, p3, p4
S: columns A, B (no probability column)
T^i: column B, tuple probabilities q1, q2, q3, q4
Theorem: Data complexity is #P-complete
Proof:
Theorem [Provan&Ball83] Counting the number of satisfying assignments for bipartite DNF is #P-complete
Reduction: x2y3 V x1y2 V x4y3 V x3y1
Qbad :- R^i(x), S(x,y), T^i(y)
R^i
A   p
x1  1/2
x2  1/2
x3  1/2
x4  1/2
T^i
B   p
y1  1/2
y2  1/2
y3  1/2
S
A   B
x2  y3
x1  y2
x4  y3
x3  y1
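The reduction can be made concrete on the slide's instance: with every tuple of R and T present with probability 1/2, each truth assignment of the variables is one equally likely world, so Pr(Qbad) equals the number of satisfying assignments of the bipartite DNF divided by 2^7. The brute-force counter below is only a sketch (exponential time, hypothetical function name).

```python
from itertools import product

xs = ["x1", "x2", "x3", "x4"]          # tuples of R, each with probability 1/2
ys = ["y1", "y2", "y3"]                # tuples of T, each with probability 1/2
s_edges = [("x2", "y3"), ("x1", "y2"), ("x4", "y3"), ("x3", "y1")]  # tuples of S

def count_satisfying(xs, ys, edges):
    """Count assignments satisfying the bipartite DNF OR_{(x,y) in edges} (x AND y)."""
    variables = xs + ys
    count = 0
    for bits in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        if any(assignment[x] and assignment[y] for x, y in edges):
            count += 1
    return count

num_sat = count_satisfying(xs, ys, s_edges)
# Every world has probability (1/2)^7, so the query probability is a scaled count.
pr_qbad = num_sat / 2 ** (len(xs) + len(ys))
```

An algorithm computing Pr(Qbad) exactly would therefore also count satisfying assignments, which is the #P-complete problem cited above.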
I-Dichotomy
Q = boolean conjunctive query
Definition 1. For each variable x: goals(x) = set of goals that contain x
Definition 2. Q is hierarchical if for all x, y:
(a) goals(x) ∩ goals(y) = ∅, or
(b) goals(x) ⊆ goals(y), or
(c) goals(y) ⊆ goals(x)
Q :- R(x), S(x,y), T(x,y,z), K(x,v): “hierarchical”
Q :- R(x), S(x,y), T(y): “non-hierarchical”
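Definitions 1 and 2 translate directly into a check over pairs of variables. The sketch below assumes a query without self-joins, represented as a list of (relation name, variables) pairs; this encoding and the function names are assumptions for illustration.

```python
from itertools import combinations

def goals(query):
    """Map each variable to the set of goals (relation names) that contain it."""
    result = {}
    for name, variables in query:
        for v in variables:
            result.setdefault(v, set()).add(name)
    return result

def is_hierarchical(query):
    """For every pair of variables, goal sets must be disjoint or nested."""
    g = goals(query)
    for x, y in combinations(g, 2):
        gx, gy = g[x], g[y]
        if gx & gy and not (gx <= gy or gy <= gx):
            return False
    return True

# The two example queries from the slide.
q_hier = [("R", ("x",)), ("S", ("x", "y")), ("T", ("x", "y", "z")), ("K", ("x", "v"))]
q_non_hier = [("R", ("x",)), ("S", ("x", "y")), ("T", ("y",))]
```

For the second query, goals(x) = {R, S} and goals(y) = {S, T} overlap in S but neither contains the other, which is why the check fails.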
I-Dichotomy [Dalvi&S.’04]
Schema S^i = {R1^i, R2^i, . . ., Rm^i}
Theorem Let Q = conjunctive query w/o self-joins. Then one of the following holds:
Q is in PTIME
Q has a correct extensional plan
Q is hierarchical
or:
Q is #P-complete
Q has subgoals R(x,...), S(x,y,...), T(y,...)
Proof
Lemma 1. If Q is non-hierarchical, then #P-complete
Proof:
Q :- R^i(v,x), S^i(x,y), T^i(y,z), K^i(z)
rest is like for Qbad
Proof
Lemma 2. If Q is hierarchical, then PTIME
Proof:
Case 1: has no root
Pr(Q) = Pr(Q1) Pr(Q2) Pr(Q3)
This is extensional join ⋈
Proof
Case 2: has root x
Dom={a1, a2, . . ., an}
Pr(Q) = 1 - (1-Pr(Q(a1/x)))(1-Pr(Q(a2/x)))...(1-Pr(Q(an/x)))
This is an extensional projection: ∏
QED
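The two cases of Lemma 2 can be traced on a small hierarchical query, chosen here as an illustration: Q :- R(x), S(x,y) over tuple-independent tables. Variable x is a root (Case 2); after substituting a constant a for x, the goals R(a) and S(a,y) share no variable, so their probabilities multiply (Case 1). The tables, probabilities, and function names below are hypothetical, and the brute-force routine exists only to cross-check the recurrence.

```python
from itertools import product

R = {"a1": 0.5, "a2": 0.4}                                        # hypothetical Pr of R(a)
S = {("a1", "b1"): 0.5, ("a1", "b2"): 0.5, ("a2", "b1"): 0.25}    # hypothetical Pr of S(a,b)

def pr_hierarchical(R, S):
    """Pr(Q) for Q :- R(x), S(x,y) via the root-variable recurrence."""
    domain = set(R) | {a for a, _ in S}
    miss_all = 1.0
    for a in domain:
        # Root y of the subquery S(a, y): independent projection over b.
        miss_s = 1.0
        for (a2, _b), s in S.items():
            if a2 == a:
                miss_s *= 1.0 - s
        # No shared variable between R(a) and S(a, y): multiply.
        pr_q_a = R.get(a, 0.0) * (1.0 - miss_s)
        miss_all *= 1.0 - pr_q_a
    return 1.0 - miss_all

def pr_brute_force(R, S):
    """Reference value: sum over all possible worlds where Q is true."""
    facts = [("R", k, p) for k, p in R.items()] + [("S", k, p) for k, p in S.items()]
    total = 0.0
    for bits in product([False, True], repeat=len(facts)):
        prob = 1.0
        r_present, s_present = set(), set()
        for (rel, key, p), present in zip(facts, bits):
            prob *= p if present else 1.0 - p
            if present:
                (r_present if rel == "R" else s_present).add(key)
        if any(a in r_present for a, _ in s_present):
            total += prob
    return total

fast = pr_hierarchical(R, S)
slow = pr_brute_force(R, S)
```

The recurrence touches each tuple a constant number of times per domain value, while the brute-force check enumerates 2^5 worlds for this tiny instance.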
Query Evaluation on ID-Databases
ID-extensional plans
#P-complete queries
Dichotomy Theorem
Extensional Plans for ID-DBs
Only difference: two kinds of projections:
independent: 1-(1-p1)...(1-pn)
disjoint: p1 + ... + pn
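The two projection rules differ only in how duplicate-eliminated tuples are combined. A short sketch with hypothetical function names and values: the disjoint rule applies to tuples that exclude each other (for example, two activities at the same time in an ID-table), the independent rule to tuples that occur independently.

```python
def project_independent(probs):
    """Independent projection: 1 - (1-p1)...(1-pn)."""
    none = 1.0
    for p in probs:
        none *= 1.0 - p
    return 1.0 - none

def project_disjoint(probs):
    """Disjoint projection: p1 + ... + pn (the events exclude each other)."""
    return sum(probs)

# Hypothetical example: "walk" and "run" at the same time t are disjoint,
# so the probability of some activity at time t is their sum.
some_activity_at_t = project_disjoint([0.3, 0.5])
# Treating the same two tuples as independent gives a different value.
if_independent = project_independent([0.3, 0.5])
```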
#P-Complete Queries
Q1 :- R^i(x), S^i(x,y), T^i(y)
Q2 :- R^d(x^d,y), S^d(y^d,z)
Q3 :- R^d(x^d,y), S^d(z^d,y)
ID-DB Dichotomy [Dalvi&S.’04]
Schema S^id s.t. each table is either R^i or R^id
Theorem Let Q = conjunctive query w/o self-joins. Then one of the following holds:
Q is in PTIME
Q has a correct extensional plan
or:
Q is #P-complete
Q has one of Q1, Q2, Q3 as subqueries
Extensions
•Extensions of the dichotomy theorem exist for:
Mixed schemas (some relations are deterministic)
Functional dependencies
Summary on Query Evaluation
•Extensional plans: popular, efficient, BUT
“Equivalent” plans lead to different results
Some queries admit “correct” plans
•Some simple queries: #P-complete complexity
•Dichotomy theorem
•Future work: remove ‘no-self-join’ restriction
Outline
Definitions
Query evaluation
Top-k answering (joint with Chris Re)
Conclusions
Top-k Ranking Problem
•Fix schema S, query Q, number k > 0
•Problem: given I- or ID-database DB, find k answers t1,...,tk with highest probabilities
Pr[Q(t1)] > Pr[Q(t2)] > .... > Pr[Q(tk)] > ...
•Note: Checking Pr[Q(ti)] > Pr[Q(tj)] is PP-complete
Goal: efficient polynomial time approximation
Probabilities of Boolean Expressions
A   p
e1  p1
e2  p2
e3  p3
What is the probability of e1⋀e2 ⋁ e1⋀e3 ⋁ e2⋀e3?
e1 e2 e3 Pr
0 0 0 (1-p1)(1-p2)(1-p3)
0 0 1 (1-p1)(1-p2)p3
0 1 0 (1-p1)p2(1-p3)
0 1 1 (1-p1)p2p3
1 0 0 p1(1-p2)(1-p3)
1 0 1 p1(1-p2)p3
1 1 0 p1p2(1-p3)
1 1 1 p1p2p3
Answer: (1-p1)p2p3 + p1(1-p2)p3 + p1p2(1-p3) + p1p2p3
Theorem #P-hard [Valiant]
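The truth-table computation can be written as a sum over all assignments. The sketch below uses the expression e1∧e2 ∨ e1∧e3 ∨ e2∧e3 (the form that matches the four-term sum on the slide) and hypothetical values for p1, p2, p3; the function names are assumptions.

```python
from itertools import product

def probability(formula, probs):
    """Exact probability of a Boolean formula over independent events.

    Sums the weights of all satisfying rows of the truth table, so the cost
    is exponential in the number of events.
    """
    total = 0.0
    for bits in product([False, True], repeat=len(probs)):
        weight = 1.0
        for b, p in zip(bits, probs):
            weight *= p if b else 1.0 - p
        if formula(*bits):
            total += weight
    return total

def two_of_three(e1, e2, e3):
    return (e1 and e2) or (e1 and e3) or (e2 and e3)

p1, p2, p3 = 0.5, 0.4, 0.2   # hypothetical values
exact = probability(two_of_three, [p1, p2, p3])
closed_form = (1-p1)*p2*p3 + p1*(1-p2)*p3 + p1*p2*(1-p3) + p1*p2*p3
```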
Monte Carlo Simulation
Algorithm:
randomly pick each e1, e2, e3 = false or true
compute e1∧e2 ∨ e1∧e3 ∨ e2∧e3: true or false ?
repeat
Approximate probability p with frequency p’
Better: PTAS [Karp&Luby’83]
Pr( |p’-p| < ε ) > 1-δ
[Figure: the interval from p’-ε to p’+ε around p’, containing p]
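A minimal sketch of the naive simulation loop described above: sample each event according to its probability, evaluate the expression, and report the fraction of true outcomes. This is the plain frequency estimator, not the Karp and Luby algorithm; the event probabilities, step count, seed, and function names are assumptions for illustration.

```python
import random

def monte_carlo(formula, probs, steps, rng):
    """Estimate Pr[formula] as the frequency of true outcomes over `steps` samples."""
    hits = 0
    for _ in range(steps):
        # Pick each event true with its own probability, false otherwise.
        sample = [rng.random() < p for p in probs]
        if formula(*sample):
            hits += 1
    return hits / steps

def two_of_three(e1, e2, e3):
    return (e1 and e2) or (e1 and e3) or (e2 and e3)

rng = random.Random(0)   # fixed seed so the run is repeatable
estimate = monte_carlo(two_of_three, [0.5, 0.5, 0.5], 20000, rng)
```

With all three probabilities equal to 1/2 the exact value is 1/2, so the estimate should land close to it; more steps narrow the interval around p’.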
Monte Carlo Simulation
[Figure: the interval for p on the axis from 0 to 1, shown after N=0, N=1, N=2, and N=3 simulation steps]
The Multisimulation Problem
Year  P
1995  ??
2002  ??
1933  ??
1984  ??
Schedule simulation steps to find top-k
Multisimulation
How to find the top k out of n ?
Example: looking for top k=2;
[Figure: five intervals, numbered 1 to 5 and labeled p1 to p5, on the axis from 0 to 1]
Which one to simulate next ?
Multisimulation
Critical region: (k’th left, k+1’th right)
[Figure: intervals and the critical region on the axis from 0 to 1, k=2]
Multisimulation Algorithm
Case 1: pick a “double crosser” and simulate it
Case 2: pick both a “left” AND a “right” crosser
Case 3: pick a “max crosser” and simulate it
End: when critical region is “empty”
To sort the top k, find the top k-1, etc
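The stopping rule can be sketched in a few lines. The code below reads "k’th left" as the k-th largest left endpoint and "k+1’th right" as the (k+1)-th largest right endpoint of the current confidence intervals; that reading, the interval values, and the function names are assumptions, and the choice of which interval to simulate next (the three cases) is not modeled.

```python
def critical_region(intervals, k):
    """Return (c, d): k-th largest left endpoint and (k+1)-th largest right endpoint.

    `intervals` maps a candidate answer to its current (left, right) interval
    and must contain more than k candidates.
    """
    lefts = sorted((lo for lo, _ in intervals.values()), reverse=True)
    rights = sorted((hi for _, hi in intervals.values()), reverse=True)
    return lefts[k - 1], rights[k]

def top_k_if_separated(intervals, k):
    """Return the top-k set once the critical region is empty, else None."""
    c, d = critical_region(intervals, k)
    if c < d:
        return None  # region not empty: more simulation steps are needed
    ranked = sorted(intervals, key=lambda name: intervals[name][0], reverse=True)
    return set(ranked[:k])

# Hypothetical intervals before and after further simulating candidate "c".
before = {"a": (0.6, 0.9), "b": (0.5, 0.8), "c": (0.3, 0.7), "d": (0.1, 0.2)}
after = dict(before, c=(0.3, 0.45))
```

In the `before` state the region is (0.5, 0.7), so the top two are not yet determined; once the interval of "c" shrinks below 0.5 the region is empty and {a, b} is the answer.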
Multisimulation Algorithm
Theorem
(1) It runs in < 2 × Optimal # steps
(2) no other deterministic algorithm does better
Experiments
Summary on Top-k Answering
Simple algorithm, optimal (x2) w.r.t. a very powerful standard
Marriage of probabilistic and top-k answers makes probabilistic databases practical
Outline
Definitions
Query evaluation
Top-k answering
Conclusions
Conclusions
Strong motivation from practical applications
Opportunity to merge query and search technologies
Probabilistic DB’s are hard !
Great opportunity for impactful theory work
Tomorrow: applications of random graphs to model incompleteness in databases
Thank you !
Questions ?