Probabilities in Databases and Logics I Nilesh Dalvi and Dan Suciu University of Washington


Page 1: Probabilities in Databases and Logics I

Probabilities in Databases and Logics I

Nilesh Dalvi and Dan Suciu
University of Washington

Page 2: Probabilities in Databases and Logics I

2

Two Lectures

•Today:

probabilistic databases to model imprecision

probabilistic logics

•Tomorrow:

probabilistic databases to model incompleteness

random graphs

Page 3: Probabilities in Databases and Logics I

3

Motivation

Record reconciliation

Information extraction

Constraint violations

Schema matching

Page 4: Probabilities in Databases and Logics I

5

Problem Setting

Tables:

Review

name             rating  p
Monkey Love      good    .5
                 fair    .2
                 fair    .6
                 poor    .9

Movie

title            year    p
Twelve Monkeys   1995    .8
Monkey Love      1997    .4
Monkey Love      1935    .9
Monkey Love Pl   2005    .7

Queries:

A(x,y) :- Review(x,y), Movie(x,z), z > 1991

Answers (top k):

title            rating  p
Twelve Monkeys   fair    .53
Monkey Love      good    .42
Monkey Love Pl   fair    .15

Problem: complexity of query evaluation

Page 5: Probabilities in Databases and Logics I

6

Two Problems

Fixed schema S, conjunctive query Q(x,y)

Query evaluation problem:
Fix answer tuple (a,b). Given database I, compute Pr(Q(a,b))

Top-k answering problem:
Fix k > 0. Given database I, find the k answer tuples with the highest probabilities

Page 6: Probabilities in Databases and Logics I

7

Related Work: DB

Cavallo & Pittarelli: 1987

Barbara, Garcia-Molina, Porter: 1992

Lakshmanan, Leone, Ross & Subrahmanian: 1997

Fuhr & Roellke: 1997

Dalvi & S.: 2004

Widom: 2005

Page 7: Probabilities in Databases and Logics I

8

Related Work: Logic

Query reliability [Grädel, Gurevich, Hirsch ’98]

Degrees of belief [Bacchus, Grove, Halpern, Koller ’96]

Probabilistic Logic [Nilsson]

Probabilistic model checking [Kwiatkowska ’02]

Probabilistic Relational Model [Taskar, Abbeel, Koller ’02]

Page 8: Probabilities in Databases and Logics I

9

Outline

Definitions

Query Evaluation

Top-k answering (joint with Chris Re)

Conclusions

Page 9: Probabilities in Databases and Logics I

23

Probabilistic Database

•Schema S, Domain D, Set of instances Inst

•Definition: a probabilistic database is a probability distribution

Pr : Inst → [0,1], ∑_I Pr[I] = 1

If Pr[I] > 0 then I is called a “possible world”

Page 10: Probabilities in Databases and Logics I

24

Probabilistic Database

•Representation:

Independent tuples: I-database DB over some schema S^i

Independent and disjoint tuples: ID-database DB over some schema S^id

•Semantics:

DB “means” a probability distribution Pr over schema S

Page 11: Probabilities in Databases and Logics I

26

I-Databases

Representation: Reviews^i(M,S,p)

Movie  Score  P
m42    good   p1
m99    good   p2
m76    poor   p3

Possible worlds semantics: Reviews(M,S)

The possible worlds I1, ..., I8 are the eight subsets of the three tuples:

{}
{m42 good}
{m99 good}
{m76 poor}
{m42 good, m99 good}
{m42 good, m76 poor}
{m99 good, m76 poor}
{m42 good, m99 good, m76 poor}

Pr[I1] = (1-p1)*(1-p2)*(1-p3)    for I1 = {}
Pr[I4] = p1*p2*(1-p3)            for I4 = {m42 good, m99 good}
Pr[I8] = p1*p2*p3                for I8 = {m42 good, m99 good, m76 poor}

Pr[I1] + Pr[I2] + . . . + Pr[I8] = 1
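The possible worlds semantics of an I-database can be checked by brute force. The sketch below is only an illustration: the function name `possible_worlds_independent` and the numeric values chosen for p1, p2, p3 are assumptions, not taken from the slides.

```python
from itertools import product

def possible_worlds_independent(tuples):
    """Enumerate (world, probability) pairs for a tuple-independent table.

    `tuples` maps each tuple to its marginal probability.
    """
    items = list(tuples.items())
    worlds = []
    for present in product([False, True], repeat=len(items)):
        # A world is the set of tuples that are present.
        world = frozenset(t for (t, _), keep in zip(items, present) if keep)
        prob = 1.0
        for (_, p), keep in zip(items, present):
            prob *= p if keep else 1.0 - p
        worlds.append((world, prob))
    return worlds

# Illustrative values for p1, p2, p3 (assumed, not from the slides).
reviews = {("m42", "good"): 0.5, ("m99", "good"): 0.25, ("m76", "poor"): 0.125}
worlds = possible_worlds_independent(reviews)
```

With three tuples this yields eight worlds whose probabilities sum to 1, and the empty world receives (1-p1)(1-p2)(1-p3).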

Page 12: Probabilities in Databases and Logics I

28

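ID-Databases

Representation: Activities^id

Time^d  Activity  P
t       walk      p1
t       run       p2
t+1     walk      p3

Tuples with the same Time are disjoint; tuples with different Time values are independent.

Possible worlds semantics: Activities(Time, Act)

The six possible worlds I1, ..., I6:

{}
{t walk}
{t run}
{t+1 walk}
{t walk, t+1 walk}
{t run, t+1 walk}

Pr[I1] = (1-p1-p2)*(1-p3)    for I1 = {}
Pr[I3] = p2*(1-p3)           for I3 = {t run}
Pr[I5] = p1*p3               for I5 = {t walk, t+1 walk}

Pr[I1] + Pr[I2] + . . . + Pr[I6] = 1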
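The same brute-force check works for an ID-database if tuples are grouped into blocks of mutually exclusive alternatives. This is a sketch under assumptions: the function name, the block encoding, and the numeric values for p1, p2, p3 are illustrative and not part of the slides.

```python
from itertools import product

def possible_worlds_disjoint_independent(blocks):
    """Enumerate (world, probability) pairs for an ID-table.

    `blocks` maps a key value to a list of (tuple, probability) alternatives.
    Alternatives inside one block are mutually exclusive; blocks are independent.
    """
    choices_per_block = []
    for alternatives in blocks.values():
        # The remaining probability mass means "no tuple from this block".
        none_prob = 1.0 - sum(p for _, p in alternatives)
        choices_per_block.append([(None, none_prob)] + list(alternatives))
    worlds = []
    for combo in product(*choices_per_block):
        world = frozenset(t for t, _ in combo if t is not None)
        prob = 1.0
        for _, p in combo:
            prob *= p
        worlds.append((world, prob))
    return worlds

# Illustrative values: p1 = 0.5, p2 = 0.25, p3 = 0.5 (assumed).
activities = {
    "t": [(("t", "walk"), 0.5), (("t", "run"), 0.25)],
    "t+1": [(("t+1", "walk"), 0.5)],
}
id_worlds = possible_worlds_disjoint_independent(activities)
```

The block for time t contributes three choices (walk, run, or nothing) and the block for t+1 contributes two, which gives the six worlds listed on the slide.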

Page 13: Probabilities in Databases and Logics I

29

ID subsumes I

Reviews^id

Movie^d  Score^d  P
m42      good     p1
m99      good     p2
m76      poor     p3

=

Reviews^i

Movie  Score  P
m42    good   p1
m99    good   p2
m76    poor   p3

Note: the following Reviews^id means all tuples are disjoint

Movie  Score  P
m42    good   p1
m99    good   p2
m76    poor   p3

Page 14: Probabilities in Databases and Logics I

30

Queries

Movie^i

id   year  P
m42  1995  0.95
m99  2002  0.65
m76  2002  0.1
m05  2005  0.7

Review^i

mid  rating  p
m42  4       0.7
m42  5       0.45
m99  5       0.82
m99  4       0.68
m05  5       0.79

Q(y) :- Movie(x,y), Review(x,z), z >= 3

•Syntax: conjunctive queries over schema S

Page 15: Probabilities in Databases and Logics I

31

Two Query Semantics

•Possible answer sets

Given set A: Pr[{t | I ⊨ Q(t)} = A]

Used for views

•Possible tuples (this talk)

Given tuple t: Pr[I ⊨ Q(t)]

Used for query evaluation and top-k

Page 16: Probabilities in Databases and Logics I

35

Query Semantics

Q(y) :- Movie(x,y), Review(x,z)

Possible worlds, each shown as (id, year) pairs:

Probability p1:

id   year
m42  2004
m99  1901
m76  1902

Probability p2:

id   year
m99  1935
m05  1903

Probability p3:

id   year
m76  1995
m99  1935
m05  2004

Probability p4:

id   year
m87  1934
m44  1904

Tuple probabilities (top k):

year  p
1935  p2 + p3 = 0.6
2004  p1 + p3 = 0.5
1995  p3 = 0.2
. . . . . .
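The tuple probabilities on this slide are sums of world probabilities. A minimal sketch of that computation follows; p1 = 0.3, p2 = 0.4, p3 = 0.2 are derived from the three sums shown on the slide, while p4 = 0.1 is an assumption made only so that the four values add up to 1. The dictionary layout and the function name are illustrative.

```python
from fractions import Fraction

# Years appearing in each possible world, as read off the slide.
worlds = {
    "p1": {2004, 1901, 1902},
    "p2": {1935, 1903},
    "p3": {1995, 1935, 2004},
    "p4": {1934, 1904},
}
# p1, p2, p3 follow from the sums on the slide; p4 is an assumed remainder.
prob = {
    "p1": Fraction(3, 10),
    "p2": Fraction(4, 10),
    "p3": Fraction(2, 10),
    "p4": Fraction(1, 10),
}

def tuple_probability(year):
    """Sum the probabilities of the worlds whose answer contains `year`."""
    return sum((prob[w] for w, years in worlds.items() if year in years), Fraction(0))
```

Exact fractions are used so that p2 + p3 comes out as exactly 3/5 rather than a rounded float.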

Page 17: Probabilities in Databases and Logics I

38

Summary on Data Model

Data Model:
Semantics = possible worlds
Syntax = I-databases or ID-databases

Queries:
Syntax = unchanged (conjunctive queries)
Semantics = tuple probabilities

Page 18: Probabilities in Databases and Logics I

39

Outline

Definitions

Query evaluation

Top-k answering

Conclusions

Page 19: Probabilities in Databases and Logics I

40

Problem Definition

•Fix schema S, query Q, answer tuple t

•Problem: given I/ID-database DB, compute Pr[I ⊨ Q(t)] (notation: Pr[Q(t)])

•Conventions:
For upper bounds (P or #P): probabilities are rationals
For lower bounds (#P): probabilities are 1/2

Page 20: Probabilities in Databases and Logics I

41

Query Evaluation on I-Databases

•Outline

Intuition

Extensional plans: PTIME case

Hard queries: #P-complete case

Dichotomy Theorem

Page 21: Probabilities in Databases and Logics I

42

Intuition

Q(y) :- Movie(x,y), Review(x,z)

Movie^i

id   year  p
m42  1995  p1
m99  2002  p2
m76  2002  p3
m05  2005  p4

Review^i

mid  rate  p
m42  4     q1
m42  2     q2
m42  3     q3
m99  1     q4
m99  3     q5
m76  5     q6

Answer

Year  p
1995  p1 × (1 - (1 - q1) × (1 - q2) × (1 - q3))
2002  1 - (1 - p2 × (1 - (1 - q4) × (1 - q5))) × (1 - p3 × q6)
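The intuition above can be tried on concrete numbers. The sketch below applies the same formula to the Movie^i and Review^i tables from the earlier Queries slide, with the selection z >= 3. The helper names `independent_or` and `answer_probabilities` are illustrative, and the code is not claimed to be the authors' implementation.

```python
from collections import defaultdict

# Tables from the Queries slide: id -> (year, p) and (mid, rating, p).
movie = {"m42": (1995, 0.95), "m99": (2002, 0.65), "m76": (2002, 0.1), "m05": (2005, 0.7)}
review = [("m42", 4, 0.7), ("m42", 5, 0.45), ("m99", 5, 0.82), ("m99", 4, 0.68), ("m05", 5, 0.79)]

def independent_or(probs):
    """1 - (1-p1)(1-p2)...(1-pn): at least one of several independent events holds."""
    none = 1.0
    for p in probs:
        none *= 1.0 - p
    return 1.0 - none

def answer_probabilities(movie, review, min_rating=3):
    # Project the reviews of each movie first (independent projection).
    by_movie = defaultdict(list)
    for mid, rating, q in review:
        if rating >= min_rating:
            by_movie[mid].append(q)
    # Join with Movie (product), then project on year (independent projection).
    by_year = defaultdict(list)
    for mid, (year, p) in movie.items():
        by_year[year].append(p * independent_or(by_movie[mid]))
    return {year: independent_or(ps) for year, ps in by_year.items()}

answers = answer_probabilities(movie, review)
```

Movie m76 has no review, so it contributes probability 0 to the answer for 2002.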

Page 22: Probabilities in Databases and Logics I

43

I-Extensional Plans

[Barbara92,Lakshmanan97]

Add p:

Join ⋈: p = p1 * p2
Projection ∏: p = 1-(1-p1)(1-p2)...(1-pn)
Selection σ: p = p

Note: data complexity is PTIME

Page 23: Probabilities in Databases and Logics I

46

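Q(y) :- Movie(x,y), Review(x,z)

Input: Movie contains (1995, m1) with probability p; Review contains three m1 tuples with probabilities q1, q2, q3

INCORRECT! Join Movie ⋈ Review first, then project on year:

Join:     1995  m1  pq1
          1995  m1  pq2
          1995  m1  pq3
Project:  1995      1-(1-pq1)(1-pq2)(1-pq3)

CORRECT: Project Review on the movie id first, then join with Movie:

Project:  m1        1 - (1-q1)(1-q2)(1-q3)
Join:     1995  m1  p(1-(1-q1)(1-q2)(1-q3))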
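The difference between the two plans can be confirmed against the possible worlds semantics. In the sketch below the function names and the values p = 0.5 and q1 = q2 = q3 = 0.5 are assumptions chosen for illustration.

```python
from itertools import product

p, q = 0.5, [0.5, 0.5, 0.5]  # illustrative values, not from the slides

def join_then_project(p, q):
    """Join first, then independent projection (wrongly treats joined rows as independent)."""
    none = 1.0
    for qi in q:
        none *= 1.0 - p * qi
    return 1.0 - none

def project_then_join(p, q):
    """Project Review first, then join with the single Movie tuple."""
    none = 1.0
    for qi in q:
        none *= 1.0 - qi
    return p * (1.0 - none)

def brute_force(p, q):
    """Sum the probabilities of all possible worlds in which the query holds."""
    total = 0.0
    for movie_in, *reviews_in in product([False, True], repeat=1 + len(q)):
        weight = p if movie_in else 1.0 - p
        for qi, present in zip(q, reviews_in):
            weight *= qi if present else 1.0 - qi
        if movie_in and any(reviews_in):
            total += weight
    return total
```

For these values the project-then-join plan agrees with the enumeration, while the join-then-project plan overestimates the probability because the three joined rows all depend on the same Movie tuple.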

Page 24: Probabilities in Databases and Logics I

48

#P-Complete Queries

Qbad :- R^i(x), S(x,y), T^i(y)

R^i(A, p) with probabilities p1, p2, p3, p4
S(A, B)
T^i(B, p) with probabilities q1, q2, q3, q4

Theorem: Data complexity is #P-complete

Page 25: Probabilities in Databases and Logics I

49

Proof:

Theorem [Provan&Ball83] Counting the number of satisfying assignments for bipartite DNF is #P-complete

Reduction: x2y3 ∨ x1y2 ∨ x4y3 ∨ x3y1

Qbad :- R^i(x), S(x,y), T^i(y)

R^i

A   p
x1  1/2
x2  1/2
x3  1/2
x4  1/2

S

A   B
x2  y3
x1  y2
x4  y3
x3  y1

T^i

B   p
y1  1/2
y2  1/2
y3  1/2
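For the small instance in this reduction, the correspondence between Pr[Qbad] and the number of satisfying assignments can be checked directly. The sketch below simply enumerates all assignments, which is exponential and only meant as an illustration; the function name is an assumption.

```python
from itertools import product

# Relation S from the slide: one bipartite DNF clause per tuple.
edges = [("x2", "y3"), ("x1", "y2"), ("x4", "y3"), ("x3", "y1")]
variables = sorted({v for edge in edges for v in edge})

def count_satisfying(edges, variables):
    """Count truth assignments that satisfy the DNF with one clause x AND y per edge."""
    count = 0
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if any(assignment[x] and assignment[y] for x, y in edges):
            count += 1
    return count

num_sat = count_satisfying(edges, variables)
# Every tuple of R^i and T^i has probability 1/2, so each assignment has weight 2^-7.
pr_qbad = num_sat / 2 ** len(variables)
```

Since all tuple probabilities are 1/2, Pr[Qbad] equals the number of satisfying assignments divided by 2^7, so computing the probability is at least as hard as counting.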

Page 26: Probabilities in Databases and Logics I

I-Dichotomy

Q = boolean conjunctive query

Definition 1. For each variable x: goals(x) = set of goals that contain x

Definition 2. Q is hierarchical if for all x, y:
(a) goals(x) ∩ goals(y) = ∅, or
(b) goals(x) ⊆ goals(y), or
(c) goals(y) ⊆ goals(x)

Page 27: Probabilities in Databases and Logics I

52

“hierarchical”

Q :- R(x), S(x,y), T(x,y,z), K(x,v)

“non-hierarchical”

Q :- R(x), S(x,y), T(y)
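Definitions 1 and 2 translate almost literally into code. The sketch below assumes a query without self-joins, encoded as a list of (goal name, variables) pairs with single-letter variable names; this encoding and the function names are illustrative choices.

```python
from itertools import combinations

def goals(query):
    """Map each variable to the set of goal names that contain it (Definition 1)."""
    result = {}
    for name, args in query:
        for var in args:
            result.setdefault(var, set()).add(name)
    return result

def is_hierarchical(query):
    """Check Definition 2: goal sets are pairwise disjoint or nested."""
    g = goals(query)
    for x, y in combinations(g, 2):
        gx, gy = g[x], g[y]
        if gx & gy and not (gx <= gy or gy <= gx):
            return False
    return True

# The two example queries from the slide.
q_hier = [("R", "x"), ("S", "xy"), ("T", "xyz"), ("K", "xv")]
q_non_hier = [("R", "x"), ("S", "xy"), ("T", "y")]
```

In the second query goals(x) = {R, S} and goals(y) = {S, T} overlap without either containing the other, which is exactly the pattern that makes the query non-hierarchical.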

Page 28: Probabilities in Databases and Logics I

53

I-Dichotomy [Dalvi&S.’04]

Schema S^i = {R1^i, R2^i, . . ., Rm^i}

Theorem. Let Q = conjunctive query w/o self-joins. Then one of the following holds:

Q is in PTIME
Q has a correct extensional plan
Q is hierarchical

or:

Q is #P-complete
Q has subgoals R(x,...), S(x,y,...), T(y,...)

Page 29: Probabilities in Databases and Logics I

54

Proof

Lemma 1. If Q is non-hierarchical, then #P-complete

Proof:

Q :- R^i(v,x), S^i(x,y), T^i(y,z), K^i(z)

The rest is like for Qbad

Page 30: Probabilities in Databases and Logics I

55

Proof

Lemma 2. If Q is hierarchical, then PTIME

Proof:

Case 1: has no root

Pr(Q) = Pr(Q1) Pr(Q2) Pr(Q3)

This is extensional join ⋈

Page 31: Probabilities in Databases and Logics I

56

Proof

Case 2: has root x

Dom = {a1, a2, . . ., an}

Pr(Q) = 1 - (1-Pr(Q(a1/x)))(1-Pr(Q(a2/x)))...(1-Pr(Q(an/x)))

This is an extensional projection: ∏

QED
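The two cases of the proof can be followed on a small hierarchical query. The sketch below hard-codes the recursion for Q :- R(x), S(x,y), which is an example chosen here and not taken from the slides; the tuple probabilities and function names are likewise assumptions. A brute-force enumeration is included only to check the result.

```python
from itertools import product

# Hypothetical tuple probabilities for Q :- R(x), S(x,y).
R = {"a1": 0.5, "a2": 0.25}
S = {("a1", "b1"): 0.5, ("a1", "b2"): 0.5, ("a2", "b1"): 0.75}

def independent_or(probs):
    """1 - (1-p1)...(1-pn), the extensional projection."""
    none = 1.0
    for p in probs:
        none *= 1.0 - p
    return 1.0 - none

def pr_query(R, S):
    per_root_value = []
    for a, pr_r in R.items():
        # With x fixed to a, S(a,y) has root y: Case 2 (projection over y).
        pr_s = independent_or(p for (x, _), p in S.items() if x == a)
        # R(a) and S(a,y) now share no variable: Case 1 (product).
        per_root_value.append(pr_r * pr_s)
    # Case 2 on the root variable x.
    return independent_or(per_root_value)

def pr_query_brute_force(R, S):
    facts = [("R", k, p) for k, p in R.items()] + [("S", k, p) for k, p in S.items()]
    total = 0.0
    for present in product([False, True], repeat=len(facts)):
        weight = 1.0
        world = set()
        for (rel, key, p), keep in zip(facts, present):
            weight *= p if keep else 1.0 - p
            if keep:
                world.add((rel, key))
        if any(("R", a) in world and ("S", (a, b)) in world for (a, b) in S):
            total += weight
    return total
```

The recursive evaluation touches each tuple once, while the enumeration visits all 2^5 worlds; both give the same probability for this instance.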

Page 32: Probabilities in Databases and Logics I

57

Query Evaluation on ID-Databases

ID-extensional plans

#P-complete queries

Dichotomy Theorem

Page 33: Probabilities in Databases and Logics I

58

Extensional Plans for ID-DBs

Only difference: two kinds of projections:
independent: 1-(1-p1)...(1-pn)
disjoint: p1 + ... + pn
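The two projection rules can be written side by side. This is a minimal sketch; the function names and the tolerance used in the sanity check are assumptions.

```python
def independent_project(probs):
    """Independent projection: 1 - (1-p1)...(1-pn)."""
    none = 1.0
    for p in probs:
        none *= 1.0 - p
    return 1.0 - none

def disjoint_project(probs):
    """Disjoint projection: p1 + ... + pn, valid when the tuples are mutually exclusive."""
    total = sum(probs)
    if total > 1.0 + 1e-12:
        raise ValueError("probabilities of disjoint tuples must sum to at most 1")
    return total
```

For example, two independent tuples with probability 0.5 each project to 0.75, whereas two disjoint tuples with probability 0.5 each project to 1.0.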

Page 34: Probabilities in Databases and Logics I

59

#P-Complete Queries

Q1 :- R^i(x), S^i(x,y), T^i(y)

Q2 :- R^d(x^d,y), S^d(y^d,z)

Q3 :- R^d(x^d,y), S^d(z^d,y)

Page 35: Probabilities in Databases and Logics I

60

I-DB Dichotomy[Dalvi&S.’04]

Schema S^id s.t. each table is either R^i or R^id

Theorem. Let Q = conjunctive query w/o self-joins. Then one of the following holds:

Q is in PTIME
Q has a correct extensional plan

or:

Q is #P-complete
Q has one of Q1, Q2, Q3 as subqueries

Page 36: Probabilities in Databases and Logics I

61

Extensions

•Extensions of the dichotomy theorem exist for:

Mixed schemas (some relations are deterministic)

Functional dependencies

Page 37: Probabilities in Databases and Logics I

62

Summary on Query Evaluation

•Extensional plans: popular, efficient, BUT

“Equivalent” plans lead to different results

Some queries admit “correct” plans

•Some simple queries: #P-complete complexity

•Dichotomy theorem

•Future work: remove ‘no-self-join’ restriction

Page 38: Probabilities in Databases and Logics I

65

Outline

Definitions

Query evaluation

Top-k answering (joint with Chris Re)

Conclusions

Page 39: Probabilities in Databases and Logics I

69

Top-k Ranking Problem

•Fix schema S, query Q, number k > 0

•Problem: given I- or ID-database DB, find k answers t1,...,tk with highest probabilities

Pr[Q(t1)] > Pr[Q(t2)] > .... > Pr[Q(tk)] > ...

•Note: Checking Pr[Q(ti)] > Pr[Q(tj)] is PP-complete

Goal: efficient polynomial time approximation

Page 40: Probabilities in Databases and Logics I

70

Probabilities of Boolean Expressions

A   p
e1  p1
e2  p2
e3  p3

What is the probability of e1∧e2 ∨ e1∧e3 ∨ e2∧e3?

e1  e2  e3  Pr
0   0   0   (1-p1)(1-p2)(1-p3)
0   0   1   (1-p1)(1-p2)p3
0   1   0   (1-p1)p2(1-p3)
0   1   1   (1-p1)p2p3
1   0   0   p1(1-p2)(1-p3)
1   0   1   p1(1-p2)p3
1   1   0   p1p2(1-p3)
1   1   1   p1p2p3

(1-p1)p2p3 + p1(1-p2)p3 + p1p2(1-p3) + p1p2p3

Theorem #P-hard [Valiant]
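The truth-table computation can be reproduced mechanically. The sketch below sums the weights of the satisfying rows for the expression e1∧e2 ∨ e1∧e3 ∨ e2∧e3; the function names and the sample values for p1, p2, p3 are assumptions.

```python
from itertools import product

def formula(e1, e2, e3):
    """The expression e1∧e2 ∨ e1∧e3 ∨ e2∧e3."""
    return (e1 and e2) or (e1 and e3) or (e2 and e3)

def exact_probability(formula, probs):
    """Sum the weights of all truth-table rows on which the formula is true."""
    total = 0.0
    for values in product([False, True], repeat=len(probs)):
        if formula(*values):
            weight = 1.0
            for p, v in zip(probs, values):
                weight *= p if v else 1.0 - p
            total += weight
    return total

p1, p2, p3 = 0.5, 0.25, 0.75  # illustrative values
closed_form = (1 - p1) * p2 * p3 + p1 * (1 - p2) * p3 + p1 * p2 * (1 - p3) + p1 * p2 * p3
```

The enumeration has 2^n rows for n variables, which is why this approach does not scale and the general problem is #P-hard.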

Page 41: Probabilities in Databases and Logics I

71

Monte Carlo Simulation

Algorithm:

randomly pick each of e1, e2, e3 = false or true
compute e1∧e2 ∨ e1∧e3 ∨ e2∧e3: true or false?
repeat

Approximate probability p with frequency p’

Pr( |p’-p| < ε ) > 1-δ

[Figure: the interval from p’-ε to p’+ε around p’, containing p]

Better: PTAS

[Karp&Luby’83]
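The naive simulation loop can be sketched as follows. This is the plain sampling procedure, not the Karp and Luby estimator; the function names, the fixed seed, and the sample count are assumptions, and each e_i is drawn true with its own probability p_i.

```python
import random

def formula(e1, e2, e3):
    """The expression e1∧e2 ∨ e1∧e3 ∨ e2∧e3."""
    return (e1 and e2) or (e1 and e3) or (e2 and e3)

def monte_carlo(formula, probs, num_samples, rng):
    """Estimate Pr[formula] as the frequency of true outcomes over random samples."""
    hits = 0
    for _ in range(num_samples):
        # Pick each e_i = true with probability p_i, independently.
        values = [rng.random() < p for p in probs]
        if formula(*values):
            hits += 1
    return hits / num_samples

rng = random.Random(0)  # fixed seed so the sketch is reproducible
estimate = monte_carlo(formula, [0.5, 0.5, 0.5], 20000, rng)
```

With all three probabilities equal to 1/2 the exact value is 1/2, so the estimate should land close to 0.5, and more samples shrink the interval around it.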

Page 42: Probabilities in Databases and Logics I

72

Monte Carlo Simulation

[Figure: the [0,1] axis with p marked, shown for N = 0, 1, 2, 3 simulation steps]

Page 43: Probabilities in Databases and Logics I

73

The Multisimulation

Problem

Year P

1995 ??

2002 ??

1933 ??

1984 ??

Schedule simulation steps to find top-k

0 1

Page 44: Probabilities in Databases and Logics I

74

Multisimulation

How to find the top k out of n ?

Example: looking for top k=2;

[Figure: five candidate intervals, numbered 1 to 5, on the [0,1] axis, for probabilities p1, ..., p5]

Which one to simulate next?

Page 45: Probabilities in Databases and Logics I

75

Multisimulation

Critical region: (k’th left, k+1’th right)

0 1

k=2

Page 46: Probabilities in Databases and Logics I

76

Multisimulation Algorithm

Case 1: pick a “double crosser” and simulate it

[Figure: intervals on the [0,1] axis, k=2, with the double crosser marked “this”]

Page 47: Probabilities in Databases and Logics I

77

Multisimulation Algorithm

Case 2: pick both a “left” AND a “right” crosser

[Figure: intervals on the [0,1] axis, k=2, with the two crossers marked “this” and “and this”]

Page 48: Probabilities in Databases and Logics I

78

Multisimulation Algorithm

Case 3: pick a “max crosser” and simulate it

[Figure: intervals on the [0,1] axis, k=2, with the max crosser marked “this”]

Page 49: Probabilities in Databases and Logics I

79

Multisimulation Algorithm

End: when critical region is “empty”

0 1

k=2

To sort the top k, find the top k-1, etc
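The critical region and the stopping test can be expressed over the current confidence intervals. The sketch below covers only those two pieces, not the choice of which candidate to simulate next; the representation of each candidate as a (left, right) pair and the function names are assumptions, and it requires more than k candidates.

```python
def critical_region(intervals, k):
    """Return (c, d): the k-th largest left endpoint and the (k+1)-th largest right endpoint."""
    lefts = sorted((lo for lo, _ in intervals), reverse=True)
    rights = sorted((hi for _, hi in intervals), reverse=True)
    return lefts[k - 1], rights[k]

def is_done(intervals, k):
    """The top-k set is separated once the critical region is empty, i.e. c >= d."""
    c, d = critical_region(intervals, k)
    return c >= d

# Hypothetical interval configurations for five candidates.
separated = [(0.6, 0.9), (0.5, 0.8), (0.1, 0.4), (0.2, 0.45), (0.0, 0.3)]
overlapping = [(0.3, 0.9), (0.5, 0.8), (0.1, 0.6), (0.2, 0.45), (0.0, 0.3)]
```

In the `separated` example the two best intervals lie entirely to the right of all the others, so no further simulation steps are needed for k=2; in the `overlapping` example the critical region is (0.3, 0.6) and simulation must continue.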

Page 50: Probabilities in Databases and Logics I

80

Multisimulation Algorithm

Theorem
(1) It runs in < 2 × Optimal # steps
(2) No other deterministic algorithm does better

Page 51: Probabilities in Databases and Logics I

82

Experiments

Page 52: Probabilities in Databases and Logics I

83

Summary on Top-k Answering

Simple algorithm, optimal (x2) w.r.t. a very powerful standard

Marriage of probabilistic and top-k answers makes probabilistic databases practical

Page 53: Probabilities in Databases and Logics I

85

Outline

Definitions

Query evaluation

Top-k answering

Conclusions

Page 54: Probabilities in Databases and Logics I

87

Conclusions

Strong motivation from practical applications
Opportunity to merge query and search technologies

Probabilistic DBs are hard!
Great opportunity for impactful theory work

Tomorrow: applications of random graphs to model incompleteness in databases

Page 55: Probabilities in Databases and Logics I

Thank you !

Questions ?