Probabilities in Databases and Logics I

Nilesh Dalvi and Dan Suciu

University of Washington

Two Lectures

•Today:

probabilistic databases to model imprecision

probabilistic logics

•Tomorrow:

probabilistic databases to model incompleteness

random graphs


Motivation

Record reconciliation

Information extraction

Constraint violations

Schema matching

Problem Setting

Tables:

Review
name            rating   p
Monkey Love     good     .5
                fair     .2
                fair     .6
                poor     .9

Movie
title            year   p
Twelve Monkeys   1995   .8
Monkey Love      1997   .4
Monkey Love      1935   .9
Monkey Love Pl   2005   .7

Queries:

A(x,y) :- Review(x,y), Movie(x,z), z > 1991

Answers (top k):

title            rating   p
Twelve Monkeys   fair     .53
Monkey Love      good     .42
Monkey Love Pl   fair     .15

Problem: complexity of query evaluation

Two Problems

Fixed schema S, conjunctive query Q(x,y)

Query evaluation problem: fix answer tuple (a,b). Given database I, compute Pr(Q(a,b)).

Top-k answering problem: fix k > 0. Given database I, find the k answer tuples with the highest probabilities.

Related Work: DB

Cavallo & Pittarelli: 1987

Barbara, Garcia-Molina, Porter: 1992

Lakshmanan, Leone, Ross & Subrahmanian: 1997

Fuhr & Roellke: 1997

Dalvi & S: 2004

Widom: 2005

Related Work: Logic

Query reliability [Gradel, Gurevich, Hirsch ’98]

Degrees of belief [Bacchus, Grove, Halpern, Koller ’96]

Probabilistic Logic [Nilsson]

Probabilistic model checking [Kwiatkowska ’02]

Probabilistic Relational Model [Taskar, Abbeel, Koller ’02]

Outline

Definitions

Query Evaluation

Top-k answering (joint with Chris Re)

Conclusions

Probabilistic Database

•Schema S, Domain D, Set of instances Inst

•Definition: A probabilistic database is a probability distribution

Pr : Inst → [0,1], ∑_I Pr[I] = 1

If Pr[I] > 0 then I is called a “possible world”

Probabilistic Database

•Representation:

•Independent tuples: I-database DB over some schema S^i

•Independent and disjoint tuples: ID-database DB over some schema S^id

Semantics:

DB “means” probability distribution Pr over schema S

I-Databases

Representation: Reviews^i(M,S,p)

Movie   Score   P
m42     good    p1
m99     good    p2
m76     poor    p3

Possible worlds semantics: Reviews(M,S)

The three independent tuples yield eight possible worlds I1, ..., I8, one per subset of the tuples. Three of them are labeled:

I1 = { }                                          Pr[I1] = (1-p1)*(1-p2)*(1-p3)
I4 = { (m42, good), (m99, good) }                 Pr[I4] = p1*p2*(1-p3)
I8 = { (m42, good), (m99, good), (m76, poor) }    Pr[I8] = p1*p2*p3

The remaining five worlds are { (m76, poor) }, { (m42, good) }, { (m99, good) }, { (m42, good), (m76, poor) } and { (m99, good), (m76, poor) }.

Pr[I1] + Pr[I2] + . . . + Pr[I8] = 1
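The possible worlds semantics of an I-database can be checked by brute force. The sketch below is illustrative only: the function name `i_database_worlds` and the numeric values chosen for p1, p2, p3 are assumptions, since the slide keeps the probabilities symbolic.

```python
from itertools import product

def i_database_worlds(tuples):
    """Enumerate the possible worlds of an I-database.

    `tuples` maps each tuple to its marginal probability. Tuples are
    independent, so the probability of a world is the product of p for
    every tuple that is present and (1 - p) for every tuple that is absent.
    """
    items = list(tuples.items())
    worlds = []
    for present in product([False, True], repeat=len(items)):
        world = frozenset(t for (t, _), keep in zip(items, present) if keep)
        prob = 1.0
        for (_, p), keep in zip(items, present):
            prob *= p if keep else 1.0 - p
        worlds.append((world, prob))
    return worlds

# Hypothetical values for p1, p2, p3 (not taken from the slides).
reviews = {("m42", "good"): 0.5, ("m99", "good"): 0.25, ("m76", "poor"): 0.8}
worlds = i_database_worlds(reviews)
total = sum(prob for _, prob in worlds)
```

With three tuples this produces eight worlds whose probabilities sum to 1, and the empty world receives (1-p1)(1-p2)(1-p3).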

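ID-Databases

Representation: Activities^id

Time^d   Activity   P
t        walk       p1
t        run        p2
t+1      walk       p3

Possible worlds semantics: Activities(Time, Act)

Tuples with the same Time are disjoint, so there are six possible worlds I1, ..., I6. Three of them are labeled:

I1 = { }                              Pr[I1] = (1-p1-p2)*(1-p3)
I3 = { (t, run) }                     Pr[I3] = p2*(1-p3)
I5 = { (t, walk), (t+1, walk) }       Pr[I5] = p1*p3

The remaining three worlds are { (t, walk) }, { (t+1, walk) } and { (t, run), (t+1, walk) }.

Pr[I1] + Pr[I2] + . . . + Pr[I6] = 1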
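The same brute-force check works for ID-databases once tuples are grouped into blocks of mutually exclusive alternatives. This is a minimal sketch, assuming one block per Time value; the name `id_database_worlds` and the numeric probabilities are hypothetical.

```python
from itertools import product

def id_database_worlds(blocks):
    """Enumerate the possible worlds of an ID-database.

    `blocks` is a list of dicts (tuple -> probability). Tuples inside one
    block are disjoint (at most one is present); blocks are independent.
    """
    choices = []
    for block in blocks:
        none_prob = 1.0 - sum(block.values())  # no tuple of this block is present
        choices.append([(None, none_prob)] + list(block.items()))
    worlds = []
    for combo in product(*choices):
        world = frozenset(t for t, _ in combo if t is not None)
        prob = 1.0
        for _, p in combo:
            prob *= p
        worlds.append((world, prob))
    return worlds

# Hypothetical values: p1 = 0.25, p2 = 0.5, p3 = 0.5.
activities = [
    {("t", "walk"): 0.25, ("t", "run"): 0.5},   # block for time t
    {("t+1", "walk"): 0.5},                     # block for time t+1
]
worlds = id_database_worlds(activities)
total = sum(prob for _, prob in worlds)
```

The block for time t contributes three choices (nothing, walk, run) and the block for time t+1 contributes two, which gives the six worlds of the slide.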

ID subsumes I

Reviews^id:

Movie^d   Score^d   P
m42       good      p1
m99       good      p2
m76       poor      p3

=

Reviews^i:

Movie   Score   P
m42     good    p1
m99     good    p2
m76     poor    p3

Note: with no attribute marked d, Reviews^id means all tuples are disjoint:

Movie   Score   P
m42     good    p1
m99     good    p2
m76     poor    p3

Queries

•Syntax: conjunctive queries over schema S

Q(y) :- Movie(x,y), Review(x,z), z >= 3

Movie^i

id    year   P
m42   1995   0.95
m99   2002   0.65
m76   2002   0.1
m05   2005   0.7

Review^i

mid   rating   p
m42   4        0.7
m42   5        0.45
m99   5        0.82
m99   4        0.68
m05   5        0.79

Two Query Semantics

•Possible answer sets

Given set A: Pr[{t | I ⊨ Q(t)} = A]

Used for views

•Possible tuples (this talk)

Given tuple t: Pr[I ⊨ Q(t)]

Used for query evaluation and top-k

Query Semantics

Q(y) :- Movie(x,y), Review(x,z)

p1:
id    year
m42   2004
m99   1901
m76   1902

p2:
id    year
m99   1935
m05   1903

p3:
id    year
m76   1995
m99   1935
m05   2004

p4:
id    year
m87   1934
m44   1904

Tuple probabilities (top k):

year    p
1935    p2 + p3 = 0.6
2004    p1 + p3 = 0.5
1995    p3 = 0.2
. . .   . . .

Summary on Data Model

Data Model:
Semantics = possible worlds
Syntax = I-databases or ID-databases

Queries:
Syntax = unchanged (conjunctive queries)
Semantics = tuple probabilities

Outline

Definitions

Query evaluation

Top-k answering

Conclusions


Problem Definition

•Fix schema S, query Q, answer tuple t

•Problem: given I/ID-database DB, compute Pr[I ⊨ Q(t)] (notation: Pr[Q(t)])

•Conventions:
For upper bounds (P or #P): probabilities are rationals
For lower bounds (#P): probabilities are 1/2

Query Evaluation on I-Databases

•Outline

Intuition

Extensional plans: PTIME case

Hard queries: #P-complete case

Dichotomy Theorem


Intuition

Q(y) :- Movie(x,y), Review(x,z)

Movie^i

id    year   p
m42   1995   p1
m99   2002   p2
m76   2002   p3
m05   2005   p4

Review^i

mid   rate   p
m42   4      q1
m42   2      q2
m42   3      q3
m99   1      q4
m99   3      q5
m76   5      q6

Answer

Year   p
1995   p1 × (1 - (1 - q1) × (1 - q2) × (1 - q3))
2002   1 - (1 - p2 × (1 - (1 - q4) × (1 - q5))) × (1 - p3 × q6)
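The closed-form answer probabilities for Q(y) :- Movie(x,y), Review(x,z) can be compared against a brute-force sum over all possible worlds. This is a sketch under stated assumptions: the numeric values standing in for p1..p4 and q1..q6 are made up, and the helper names are hypothetical.

```python
from itertools import product

# Hypothetical numeric values for p1..p4 and q1..q6.
movie = {("m42", 1995): 0.9, ("m99", 2002): 0.6, ("m76", 2002): 0.5, ("m05", 2005): 0.7}
review = {("m42", 4): 0.5, ("m42", 2): 0.4, ("m42", 3): 0.2,
          ("m99", 1): 0.3, ("m99", 3): 0.1, ("m76", 5): 0.8}

def answer_probability(year):
    """Per movie: p * (1 - prod(1 - q)); then an independent OR across movies."""
    no_movie_matches = 1.0
    for (mid, y), p in movie.items():
        if y != year:
            continue
        no_review = 1.0
        for (rmid, _), q in review.items():
            if rmid == mid:
                no_review *= 1.0 - q
        no_movie_matches *= 1.0 - p * (1.0 - no_review)
    return 1.0 - no_movie_matches

def brute_force(year):
    """Sum the probabilities of all worlds in which `year` is an answer."""
    m_items, r_items = list(movie.items()), list(review.items())
    total = 0.0
    for bits in product([False, True], repeat=len(m_items) + len(r_items)):
        prob = 1.0
        for (_, p), keep in zip(m_items + r_items, bits):
            prob *= p if keep else 1.0 - p
        m_bits, r_bits = bits[:len(m_items)], bits[len(m_items):]
        movies_in = {t for (t, _), keep in zip(m_items, m_bits) if keep}
        reviewed = {t[0] for (t, _), keep in zip(r_items, r_bits) if keep}
        if any(y == year and mid in reviewed for mid, y in movies_in):
            total += prob
    return total
```

Both functions agree for 1995 and 2002, and 2005 gets probability 0 because m05 has no review.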

I-Extensional Plans [Barbara92, Lakshmanan97]

Add p:

Join ⋈: p = p1 * p2

Projection ∏: p = 1-(1-p1)(1-p2)...(1-pn)

Selection σ: p = p

Note: data complexity is PTIME

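Q(y) :- Movie(x,y), Review(x,z)

Data: Movie has the tuple (m1, 1995) with probability p; Review has three tuples for m1 with probabilities q1, q2, q3.

INCORRECT! plan: ∏(Movie ⋈ Review)

Join ⋈ gives:
1995   m1   pq1
1995   m1   pq2
1995   m1   pq3

Projection ∏ gives:
1995   1-(1-pq1)(1-pq2)(1-pq3)

CORRECT plan: ∏(Movie ⋈ ∏(Review))

Projection ∏ of Review gives:
m1   1 - (1-q1)(1-q2)(1-q3)

Join ⋈ with Movie gives:
1995   m1   p(1-(1-q1)(1-q2)(1-q3))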
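The gap between the two plans is easy to see numerically. The sketch below implements the extensional join and projection rules and evaluates both plans; the function names and the values chosen for p and q1, q2, q3 are assumptions for illustration.

```python
def ext_join(p1, p2):
    """Extensional join: multiply the probabilities of the joined tuples."""
    return p1 * p2

def ext_project(probs):
    """Extensional (independent) projection: 1 - (1-p1)(1-p2)...(1-pn)."""
    none = 1.0
    for p in probs:
        none *= 1.0 - p
    return 1.0 - none

# Hypothetical values: one Movie tuple with probability p, three reviews.
p = 0.5
q = [0.5, 0.5, 0.5]

# Plan 1: join first, then project. The three joined tuples all depend on
# the same Movie tuple, so treating them as independent is wrong.
project_after_join = ext_project([ext_join(p, qi) for qi in q])

# Plan 2: project Review first, then join with Movie.
project_before_join = ext_join(p, ext_project(q))
```

With these numbers the first plan yields 0.578125 while the second yields 0.4375, which is the true probability p(1-(1-q1)(1-q2)(1-q3)).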

#P-Complete Queries

Qbad :- R^i(x), S(x,y), T^i(y)

R^i(A, p): tuple probabilities p1, p2, p3, p4

S(A, B): deterministic

T^i(B, p): tuple probabilities q1, q2, q3, q4

Theorem: Data complexity is #P-complete

Proof:

Theorem [Provan&Ball83] Counting the number of satisfying assignments for bipartite DNF is #P-complete

Reduction: x2y3 V x1y2 V x4y3 V x3y1

Qbad :- R^i(x), S(x,y), T^i(y)

R^i
A    p
x1   1/2
x2   1/2
x3   1/2
x4   1/2

S
A    B
x2   y3
x1   y2
x4   y3
x3   y1

T^i
B    p
y1   1/2
y2   1/2
y3   1/2
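The reduction can be checked on the example instance: with all tuple probabilities equal to 1/2, every world has probability (1/2)^7, so Pr[Qbad] times 2^7 equals the number of satisfying assignments of the bipartite DNF. The sketch below is a brute-force illustration with hypothetical function names, not an efficient algorithm.

```python
from itertools import product

xs = ["x1", "x2", "x3", "x4"]
ys = ["y1", "y2", "y3"]
# One S tuple per clause of x2y3 V x1y2 V x4y3 V x3y1.
S = [("x2", "y3"), ("x1", "y2"), ("x4", "y3"), ("x3", "y1")]

def count_satisfying():
    """Count truth assignments that satisfy the bipartite DNF."""
    variables = xs + ys
    count = 0
    for bits in product([False, True], repeat=len(variables)):
        value = dict(zip(variables, bits))
        if any(value[x] and value[y] for x, y in S):
            count += 1
    return count

def pr_qbad():
    """Pr[Qbad] when every R and T tuple has probability 1/2."""
    world_prob = 0.5 ** (len(xs) + len(ys))
    total = 0.0
    for r_bits in product([False, True], repeat=len(xs)):
        R = {x for x, keep in zip(xs, r_bits) if keep}
        for t_bits in product([False, True], repeat=len(ys)):
            T = {y for y, keep in zip(ys, t_bits) if keep}
            # Qbad holds if some S tuple joins a present R and T tuple.
            if any(x in R and y in T for x, y in S):
                total += world_prob
    return total
```

A world of the database corresponds to a truth assignment (a tuple is present exactly when its variable is true), which is why the two computations coincide.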

I-Dichotomy

Q = boolean conjunctive query

Definition 1. For each variable x: goals(x) = set of goals that contain x

Definition 2. Q is hierarchical if forall x, y:
(a) goals(x) ∩ goals(y) = ∅, or
(b) goals(x) ⊆ goals(y), or
(c) goals(y) ⊆ goals(x)

“hierarchical”:

Q :- R(x), S(x,y), T(x,y,z), K(x,v)

goals(x) = {R, S, T, K}, goals(y) = {S, T}, goals(z) = {T}, goals(v) = {K}

“non-hierarchical”:

Q :- R(x), S(x,y), T(y)

goals(x) = {R, S}, goals(y) = {S, T}
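The hierarchy test of Definition 2 is a direct check over pairs of variables. The sketch below assumes a query without self-joins, represented as a dict from goal name to its variables; that representation and the function name are illustrative choices.

```python
from itertools import combinations

def is_hierarchical(query):
    """Check Definition 2 for a conjunctive query without self-joins.

    `query` maps each goal (relation name) to the tuple of its variables.
    """
    goals = {}
    for goal, variables in query.items():
        for v in variables:
            goals.setdefault(v, set()).add(goal)
    for x, y in combinations(goals, 2):
        gx, gy = goals[x], goals[y]
        # Violation: the goal sets overlap but neither contains the other.
        if gx & gy and not (gx <= gy or gy <= gx):
            return False
    return True

q_hier = {"R": ("x",), "S": ("x", "y"), "T": ("x", "y", "z"), "K": ("x", "v")}
q_non_hier = {"R": ("x",), "S": ("x", "y"), "T": ("y",)}
```

For the second query, goals(x) = {R, S} and goals(y) = {S, T} overlap in S without either containing the other, so the check fails.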

I-Dichotomy [Dalvi&S.’04]

Schema S^i = {R1^i, R2^i, . . ., Rm^i}

Theorem Let Q = conjunctive query w/o self-joins. Then one of the following holds:

Q is in PTIME
Q has a correct extensional plan
Q is hierarchical

or:

Q is #P-complete
Q has subgoals R(x,...), S(x,y,...), T(y,...)

Proof

Lemma 1. If Q is non-hierarchical, then #P-complete

Proof:

Q :- R^i(v,x), S^i(x,y), T^i(y,z), K^i(z)

rest is like for Qbad

Proof

Lemma 2. If Q is hierarchical, then PTIME

Proof:

Case 1: has no root

Pr(Q) = Pr(Q1) Pr(Q2) Pr(Q3)

This is extensional join ⋈

Proof

Case 2: has root x

Dom = {a1, a2, . . ., an}

Pr(Q) = 1 - (1-Pr(Q(a1/x)))(1-Pr(Q(a2/x)))...(1-Pr(Q(an/x)))

This is an extensional projection: ∏

QED

Query Evaluation on ID-Databases

ID-extensional plans

#P-complete queries

Dichotomy Theorem

Extensional Plans for ID-DBs

Only difference: two kinds of projections:
independent: 1-(1-p1)...(1-pn)
disjoint: p1 + ... + pn
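The two projection rules differ only in how the input probabilities are combined. A minimal sketch follows, with hypothetical function names and made-up probabilities for the walk and run tuples of an Activities-style table.

```python
def independent_project(probs):
    """Inputs are independent events: 1 - (1-p1)...(1-pn)."""
    none = 1.0
    for p in probs:
        none *= 1.0 - p
    return 1.0 - none

def disjoint_project(probs):
    """Inputs are mutually exclusive events: p1 + ... + pn."""
    return sum(probs)

# Hypothetical values: at time t, walk has 0.25 and run has 0.5 (disjoint);
# walk at time t+1 has 0.5 (independent of time t).
some_activity_at_t = disjoint_project([0.25, 0.5])
walk_at_some_time = independent_project([0.25, 0.5])
```

Projecting away the activity within one time point uses the disjoint rule, while projecting away the time for a fixed activity combines tuples from different blocks and uses the independent rule.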

#P-Complete Queries

Q1 :- R^i(x), S^i(x,y), T^i(y)

Q2 :- R^d(x^d,y), S^d(y^d,z)

Q3 :- R^d(x^d,y), S^d(z^d,y)

ID-DB Dichotomy [Dalvi&S.’04]

Schema S^id s.t. each table is either R^i or R^id

Theorem Let Q = conjunctive query w/o self-joins. Then one of the following holds:

Q is in PTIME
Q has a correct extensional plan

or:

Q is #P-complete
Q has one of Q1, Q2, Q3 as subqueries

Extensions

•Extensions of the dichotomy theorem exist for:

Mixed schemas (some relations are deterministic)

Functional dependencies


Summary on Query Evaluation

•Extensional plans: popular, efficient, BUT

“Equivalent” plans lead to different results

Some queries admit “correct” plans

•Some simple queries: #P-complete complexity

•Dichotomy theorem

•Future work: remove ‘no-self-join’ restriction


Outline

Definitions

Query evaluation

Top-k answering (joint with Chris Re)

Conclusions


Top-k Ranking Problem

•Fix schema S, query Q, number k > 0

•Problem: given I- or ID-database DB, find k answers t1,...,tk with highest probabilities

Pr[Q(t1)] > Pr[Q(t2)] > .... > Pr[Q(tk)] > ...

•Note: Checking Pr[Q(ti)] > Pr[Q(tj)] is PP-complete

Goal: efficient polynomial time approximation

Probabilities of Boolean Expressions

A    p
e1   p1
e2   p2
e3   p3

What is the probability of e1⋀e2 ⋁ e1⋀e3 ⋁ e2⋀e3?

e1   e2   e3   Pr
0    0    0    (1-p1)(1-p2)(1-p3)
0    0    1    (1-p1)(1-p2)p3
0    1    0    (1-p1)p2(1-p3)
0    1    1    (1-p1)p2p3
1    0    0    p1(1-p2)(1-p3)
1    0    1    p1(1-p2)p3
1    1    0    p1p2(1-p3)
1    1    1    p1p2p3

Answer (sum over the satisfying rows): (1-p1)p2p3 + p1(1-p2)p3 + p1p2(1-p3) + p1p2p3

Theorem #P-hard [Valiant]
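The truth-table computation generalizes to any Boolean expression over independent events, at exponential cost. A minimal sketch, assuming hypothetical numeric values for p1, p2, p3 and using the expression e1∧e2 ∨ e1∧e3 ∨ e2∧e3, whose satisfying rows are the four terms summed above:

```python
from itertools import product

def expression_probability(expr, probs):
    """Sum the weights of all truth-table rows on which `expr` is true."""
    total = 0.0
    for bits in product([False, True], repeat=len(probs)):
        weight = 1.0
        for bit, p in zip(bits, probs):
            weight *= p if bit else 1.0 - p
        if expr(*bits):
            total += weight
    return total

def two_of_three(e1, e2, e3):
    return (e1 and e2) or (e1 and e3) or (e2 and e3)

# Hypothetical values for p1, p2, p3.
p1, p2, p3 = 0.5, 0.25, 0.8
exact = expression_probability(two_of_three, (p1, p2, p3))
closed_form = (1 - p1) * p2 * p3 + p1 * (1 - p2) * p3 + p1 * p2 * (1 - p3) + p1 * p2 * p3
```

The loop visits 2^n rows, which is consistent with the #P-hardness result: no shortcut is known in general.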

Monte Carlo Simulation

Algorithm:

randomly pick each e1, e2, e3 = false or true
compute e1∧e2 ∨ e1∧e3 ∨ e2∧e3: true or false?
repeat

Approximate probability p with frequency p’

Better: PTAS [Karp&Luby’83]

Pr( |p’-p| < ε ) > 1-δ

[Figure: the interval from p’-ε to p’+ε around p’, containing p]
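The naive sampling loop described on this slide can be sketched as follows. Note that this is the plain Monte Carlo estimator, not the Karp and Luby algorithm; the function name, the seed parameter and the numeric probabilities are assumptions.

```python
import random

def monte_carlo(expr, probs, steps, seed=0):
    """Estimate Pr[expr] by sampling each event independently `steps` times."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(steps):
        sample = [rng.random() < p for p in probs]
        if expr(*sample):
            hits += 1
    return hits / steps

def two_of_three(e1, e2, e3):
    return (e1 and e2) or (e1 and e3) or (e2 and e3)

# Hypothetical values for p1, p2, p3; the exact probability is 0.525.
estimate = monte_carlo(two_of_three, (0.5, 0.25, 0.8), steps=20000)
```

The frequency p’ converges to p as the number of steps grows; the number of steps needed for a given ε and δ depends on the estimator used.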

Monte Carlo Simulation

[Figure: the interval for p on the axis from 0 to 1 after N=0, N=1, N=2 and N=3 simulation steps]

The Multisimulation Problem

Year   P
1995   ??
2002   ??
1933   ??
1984   ??

Schedule simulation steps to find top-k

Multisimulation

How to find the top k out of n?

Example: looking for top k=2;

[Figure: five intervals, for p1 to p5, on the axis from 0 to 1]

Which one to simulate next?

Multisimulation

Critical region: (k’th left, k+1’th right)

[Figure: the critical region on the axis from 0 to 1, for k=2]

Multisimulation Algorithm

Case 1: pick a “double crosser” and simulate it

Case 2: pick both a “left” AND a “right” crosser

Case 3: pick a “max crosser” and simulate it

End: when critical region is “empty”

To sort the top k, find the top k-1, etc
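The critical region and the stopping test can be sketched in a few lines. This is an interpretation, not the authors' implementation: it assumes each answer carries an interval (lo, hi) for its probability, reads "k’th left" as the k-th largest left endpoint and "k+1’th right" as the (k+1)-th largest right endpoint, and treats the region as empty when the former is at least the latter. The crosser selection rules are not covered.

```python
def critical_region(intervals, k):
    """Return (c, d): k-th largest left endpoint, (k+1)-th largest right endpoint.

    Assumes len(intervals) > k.
    """
    lefts = sorted((lo for lo, _ in intervals), reverse=True)
    rights = sorted((hi for _, hi in intervals), reverse=True)
    return lefts[k - 1], rights[k]

def top_k_is_separated(intervals, k):
    """True when the critical region is empty, so the top k are determined."""
    c, d = critical_region(intervals, k)
    return c >= d

# Hypothetical intervals for five answers, looking for the top k = 2.
before = [(0.1, 0.3), (0.5, 0.9), (0.6, 0.8), (0.2, 0.55), (0.0, 0.2)]
# After more simulation steps on the fourth answer, its interval shrinks.
after = [(0.1, 0.3), (0.5, 0.9), (0.6, 0.8), (0.3, 0.45), (0.0, 0.2)]
```

In `before`, the fourth interval still overlaps the second, so the region (0.5, 0.55) is non-empty; in `after`, every interval outside the top 2 lies entirely below both top intervals.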

Theorem
(1) It runs in < 2 × (optimal # of steps)
(2) no other deterministic algorithm does better

Experiments


Summary on Top-k Answering

Simple algorithm, optimal (x2) w.r.t. a very powerful standard

Marriage of probabilistic and top-k answers makes probabilistic databases practical


Outline

Definitions

Query evaluation

Top-k answering

Conclusions


Conclusions

Strong motivation from practical applications
Opportunity to merge query and search technologies

Probabilistic DBs are hard!
Great opportunity for impactful theory work

Tomorrow: applications of random graphs to model incompleteness in databases

Thank you !

Questions ?
