probabilities in databases and logics i

Probabilities in Databases and Logics I

Nilesh Dalvi and Dan Suciu, University of Washington

Two Lectures

•Today:

probabilistic database to model imprecisions

probabilistic logics

•Tomorrow:

probabilistic database to model incompleteness

random graphs

Motivation

Record reconciliation

Information extraction

Constraint violations

Schema matching

Problem Setting

Tables:

Review
name            rating  p
Monkey Love     good    .5
                fair    .2
                fair    .6
                poor    .9

Movie
title           year  p
Twelve Monkeys  1995  .8
Monkey Love     1997  .4
Monkey Love     1935  .9
Monkey Love Pl  2005  .7

Queries:

A(x,y) :- Review(x,y), Movie(x,z), z > 1991

Answers:

title           rating  p
Twelve Monkeys  fair    .53
Monkey Love     good    .42
Monkey Love Pl  fair    .15

Problem: complexity of query evaluation

Two Problems

Fixed schema S, conjunctive query Q(x,y)

Query evaluation problem: Fix answer tuple (a,b). Given database I, compute Pr(Q(a,b))

Top-k answering problem: Fix k > 0. Given database I, find k answer tuples with highest probabilities

Related Work: DB

Cavallo&Pitarelli:1987

Barbara,Garcia-Molina, Porter:1992

Lakshmanan,Leone,Ross&Subrahmanian:1997

Fuhr&Roellke:1997

Dalvi&S:2004

Widom:2005

Related Work: Logic

Query reliability [Grädel, Gurevich, Hirsch ’98]

Degrees of belief [Bacchus,Grove,Halpern,Koller’96]

Probabilistic Logic [Nilsson]

Probabilistic model checking [Kwiatkowska’02]

Probabilistic Relational Model [Taskar,Abbeel,Koller’02]

Outline

Definitions

Query Evaluation

Top-k answering (joint with Chris Re)

Conclusions

Probabilistic Database

•Schema S, Domain D, Set of instances Inst

•Definition: A probabilistic database is a probability distribution

Pr : Inst → [0,1], ∑_I Pr[I] = 1

If Pr[I] > 0 then I is called a “possible world”

Probabilistic Database

•Representation:

•Independent tuples: I-database DB over some schema S^i

•Independent and disjoint tuples: ID-database DB over some schema S^id

Semantics:

DB “means” probability distribution Pr over schema S

I-Databases

Representation: Reviews^i(M,S,p)

Movie  Score  P
m42    good   p1
m99    good   p2
m76    poor   p3

Possible worlds semantics: Reviews(M,S)

Each of the eight subsets of the three tuples is a possible world:

{}                                          I1   Pr[I1] = (1-p1)*(1-p2)*(1-p3)
{(m42,good)}
{(m99,good)}
{(m76,poor)}
{(m42,good), (m99,good)}                    I4   Pr[I4] = p1*p2*(1-p3)
{(m42,good), (m76,poor)}
{(m99,good), (m76,poor)}
{(m42,good), (m99,good), (m76,poor)}        I8   Pr[I8] = p1*p2*p3

Pr[I1] + Pr[I2] + . . . + Pr[I8] = 1
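The possible worlds semantics of an I-database can be checked by direct enumeration. The following is a minimal sketch, not the authors' implementation: the numeric values standing in for p1, p2, p3 and the helper name `possible_worlds` are assumptions made for illustration.

```python
from itertools import product

# Hypothetical numeric values standing in for p1, p2, p3 on the slide.
reviews_i = [
    (("m42", "good"), 0.5),
    (("m99", "good"), 0.25),
    (("m76", "poor"), 0.125),
]

def possible_worlds(tuples_with_probs):
    """Yield (world, probability) for every subset of independent tuples."""
    for mask in product([False, True], repeat=len(tuples_with_probs)):
        world = frozenset(
            t for (t, _), keep in zip(tuples_with_probs, mask) if keep
        )
        prob = 1.0
        for (_, p), keep in zip(tuples_with_probs, mask):
            # A present tuple contributes p, an absent one contributes 1 - p.
            prob *= p if keep else 1.0 - p
        yield world, prob

worlds = dict(possible_worlds(reviews_i))
total = sum(worlds.values())  # the world probabilities form a distribution
```

With three tuples the enumeration produces eight worlds; the cost doubles with every added tuple, which is why the rest of the talk looks for ways to avoid it.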

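ID-Databases

Representation: Activities^id

Time^d  Activity  P
t       walk      p1
t       run       p2
t+1     walk      p3

Possible worlds semantics: Activities

Tuples that share the same Time value are disjoint; tuples with different Time values are independent. This gives six possible worlds:

{}                            Pr = (1-p1-p2)*(1-p3)
{(t,walk)}
{(t,run)}                     Pr = p2*(1-p3)
{(t+1,walk)}
{(t,walk), (t+1,walk)}        Pr = p1*p3
{(t,run), (t+1,walk)}

Pr[I1] + Pr[I2] + . . . + Pr[I6] = 1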
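For an ID-database the enumeration goes block by block: within one Time value at most one alternative is chosen, and blocks are combined independently. The sketch below assumes hypothetical numeric values for p1, p2, p3 and an illustrative dictionary layout; neither comes from the slides.

```python
from itertools import product

# Hypothetical values for p1, p2, p3; p1 + p2 must not exceed 1
# because the two tuples at time t are disjoint.
activities_id = {
    "t": [("walk", 0.5), ("run", 0.25)],
    "t+1": [("walk", 0.125)],
}

def id_possible_worlds(blocks):
    """Yield (world, probability): one choice (or none) per block."""
    per_block = []
    for key, alternatives in blocks.items():
        none_prob = 1.0 - sum(p for _, p in alternatives)
        choices = [(None, none_prob)]
        choices += [((key, value), p) for value, p in alternatives]
        per_block.append(choices)
    for combo in product(*per_block):
        world = frozenset(t for t, _ in combo if t is not None)
        prob = 1.0
        for _, p in combo:
            prob *= p  # blocks are independent of each other
        yield world, prob

id_worlds = dict(id_possible_worlds(activities_id))
id_total = sum(id_worlds.values())
```

The block at time t has three outcomes (walk, run, nothing) and the block at t+1 has two, which yields the six worlds of the slide.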

ID subsumes I

Reviews^id

Movie^d  Score^d  P
m42      good     p1
m99      good     p2
m76      poor     p3

is equivalent to Reviews^i

Movie  Score  P
m42    good   p1
m99    good   p2
m76    poor   p3

Note: Reviews^id with the table

Movie  Score  P
m42    good   p1
m99    good   p2
m76    poor   p3

means all tuples are disjoint

Queries

•Syntax: conjunctive queries over schema S

Q(y) :- Movie(x,y), Review(x,z), z >= 3

Movie^i

id   year  P
m42  1995  0.95
m99  2002  0.65
m76  2002  0.1
m05  2005  0.7

Review^i

mid  rating  p
m42  4       0.7
m42  5       0.45
m99  5       0.82
m99  4       0.68
m05  5       0.79

Two Query Semantics

•Possible answer sets

Given set A: Pr[{t | I ⊨ Q(t)} = A]

Used for views

•Possible tuples (this talk)

Given tuple t: Pr[I ⊨ Q(t)]

Used for query evaluation and top-k

Query Semantics

Q(y) :- Movie(x,y), Review(x,z)

Possible worlds:

World with probability p1
id   year
m42  2004
m99  1901
m76  1902

World with probability p2
id   year
m99  1935
m05  1903

World with probability p3
id   year
m76  1995
m99  1935
m05  2004

World with probability p4
id   year
m87  1934
m44  1904

Tuple probabilities:

year  p
1935  p2 + p3 = 0.6
2004  p1 + p3 = 0.5
1995  p3 = 0.2
. . . . . .
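The tuple-probability semantics adds up the probabilities of all worlds in which a tuple is an answer. A minimal sketch follows; the values p1 = 0.3 and p2 = 0.4 follow from the sums on the slide, while p4 = 0.1 is an assumption (that these four worlds are the only ones), and the function name is illustrative.

```python
from collections import defaultdict

# (world probability, set of years that world contributes to the answer).
# p1 = 0.3, p2 = 0.4, p3 = 0.2 are consistent with the slide's sums;
# p4 = 0.1 is an ASSUMPTION so that the four worlds sum to 1.
worlds_with_answers = [
    (0.3, {2004, 1901, 1902}),
    (0.4, {1935, 1903}),
    (0.2, {1995, 1935, 2004}),
    (0.1, {1934, 1904}),
]

def tuple_probabilities(worlds):
    """Pr[I |= Q(t)] = sum of Pr[I] over the worlds I where t is an answer."""
    probs = defaultdict(float)
    for world_prob, answers in worlds:
        for t in answers:
            probs[t] += world_prob
    return dict(probs)

answer = tuple_probabilities(worlds_with_answers)
```

Sorting `answer` by value in decreasing order would give the ranking used later for top-k answering.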

Summary on Data Model

Data Model: Semantics = possible worlds; Syntax = I-databases or ID-databases

Queries: Syntax = unchanged (conjunctive queries); Semantics = tuple probabilities

Outline

Definitions

Query evaluation

Top-k answering

Conclusions

Problem Definition

•Fix schema S, query Q, answer tuple t

•Problem: given I/ID-database DB, compute Pr[I ⊨ Q(t)] (notation: Pr[Q(t)])

•Conventions: For upper bounds (P or #P): probabilities are rationals. For lower bounds (#P): probabilities are 1/2

Query Evaluation on I-Databases

•Outline

Intuition

Extensional plans: PTIME case

Hard queries: #P-complete case

Dichotomy Theorem

Intuition

Q(y) :- Movie(x,y), Review(x,z)

Movie^i

id   year  p
m42  1995  p1
m99  2002  p2
m76  2002  p3
m05  2005  p4

Review^i

mid  rate  p
m42  4     q1
m42  2     q2
m42  3     q3
m99  1     q4
m99  3     q5
m76  5     q6

Answer

Year  p
1995  p1 × (1 - (1 - q1) × (1 - q2) × (1 - q3))
2002  1 - (1 - p2 × (1 - (1 - q4) × (1 - q5))) × (1 - p3 × q6)
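The closed-form answer for year 2002 can be cross-checked against a brute-force sum over all possible worlds. This is a sketch under assumed numeric probabilities (the slide only gives the symbols p1..p4 and q1..q6); the function names are hypothetical.

```python
from itertools import product

# Hypothetical numeric probabilities for the symbolic p's and q's.
movie = {("m42", 1995): 0.5, ("m99", 2002): 0.5,
         ("m76", 2002): 0.25, ("m05", 2005): 0.75}
review = {("m42", 4): 0.5, ("m42", 2): 0.25, ("m42", 3): 0.5,
          ("m99", 1): 0.25, ("m99", 3): 0.5, ("m76", 5): 0.75}

def brute_force(year):
    """Sum Pr[I] over all worlds I in which `year` is an answer of Q."""
    facts = [("M", t, p) for t, p in movie.items()]
    facts += [("R", t, p) for t, p in review.items()]
    total = 0.0
    for mask in product([False, True], repeat=len(facts)):
        prob = 1.0
        movies, reviewed = set(), set()
        for (rel, t, p), keep in zip(facts, mask):
            prob *= p if keep else 1.0 - p
            if keep and rel == "M":
                movies.add(t)
            elif keep and rel == "R":
                reviewed.add(t[0])  # remember which movie ids have a review
        if any(y == year and mid in reviewed for mid, y in movies):
            total += prob
    return total

def closed_form_2002():
    """1 - (1 - p2(1 - (1-q4)(1-q5))) (1 - p3 q6), as in the Answer table."""
    p2, p3 = movie[("m99", 2002)], movie[("m76", 2002)]
    q4, q5, q6 = review[("m99", 1)], review[("m99", 3)], review[("m76", 5)]
    return 1 - (1 - p2 * (1 - (1 - q4) * (1 - q5))) * (1 - p3 * q6)
```

The brute-force version visits 2^10 worlds for this tiny instance, while the closed form needs a handful of multiplications.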

I-Extensional Plans [Barbara92, Lakshmanan97]

Join ⋈: p = p1 * p2

Projection ∏: p = 1-(1-p1)(1-p2)...(1-pn)

Selection σ: p = p

Note: data complexity is PTIME

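Q(y) :- Movie(x,y), Review(x,z)

Movie tuple (m1, 1995) has probability p; Review tuples for m1 have probabilities q1, q2, q3.

INCORRECT plan: join Movie ⋈ Review first, then project

1995  m1  pq1
1995  m1  pq2
1995  m1  pq3

1995  1-(1-pq1)(1-pq2)(1-pq3)

CORRECT plan: project Review first, then join with Movie

m1  1 - (1-q1)(1-q2)(1-q3)

1995  m1  p(1-(1-q1)(1-q2)(1-q3))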
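The difference between the two plans is easy to see numerically. The sketch below assumes hypothetical values for p, q1, q2, q3 and uses an illustrative helper `independent_project`; it is not taken from the cited systems.

```python
def independent_project(probs):
    """Extensional projection: 1 - (1-p1)(1-p2)...(1-pn)."""
    none_present = 1.0
    for p in probs:
        none_present *= 1.0 - p
    return 1.0 - none_present

# Hypothetical values for p (Movie tuple) and q1, q2, q3 (Review tuples).
p, qs = 0.5, [0.5, 0.5, 0.5]

# Join first, then project: treats the three joined rows as independent,
# although all of them depend on the same Movie tuple.
plan_join_first = independent_project([p * q for q in qs])

# Project Review onto the movie id first, then join with Movie.
plan_project_first = p * independent_project(qs)
```

The join-first plan overestimates the probability because the shared Movie tuple is counted as if it were three independent events.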

#P-Complete Queries

Qbad :- R^i(x), S(x,y), T^i(y)

Theorem: Data complexity is #P-complete

Proof:

Theorem [Provan&Ball83] Counting the number of satisfying assignments for bipartite DNF is #P-complete

Reduction: x2y3 V x1y2 V x4y3 V x3y1

R^i

A   p
x1  1/2
x2  1/2
x3  1/2
x4  1/2

S

A   B
x2  y3

T^i

B   p
y1  1/2
y2  1/2
y3  1/2
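The reduction rests on the fact that, when every R and T tuple has probability 1/2 and S holds one row per clause, Pr[Qbad] equals the number of satisfying assignments of the bipartite DNF divided by 2^n. A brute-force sketch for the example formula, with illustrative names, is:

```python
from itertools import product

# One (x, y) pair per clause of x2y3 V x1y2 V x4y3 V x3y1;
# these pairs play the role of the deterministic relation S.
clauses = [("x2", "y3"), ("x1", "y2"), ("x4", "y3"), ("x3", "y1")]
xs = ["x1", "x2", "x3", "x4"]
ys = ["y1", "y2", "y3"]

def count_satisfying(clauses, xs, ys):
    """Count truth assignments that satisfy at least one clause x AND y."""
    variables = xs + ys
    count = 0
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if any(assignment[x] and assignment[y] for x, y in clauses):
            count += 1
    return count

num_sat = count_satisfying(clauses, xs, ys)
# Each assignment corresponds to one possible world of R and T,
# and every such world has probability (1/2)^7.
pr_qbad = num_sat / 2 ** (len(xs) + len(ys))
```

The loop is exponential in the number of variables, which is consistent with the #P-hardness result: no polynomial-time shortcut is expected in general.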

I-Dichotomy

Q = boolean conjunctive query

Definition 1. For each variable x: goals(x) = set of goals that contain x

Definition 2. Q is hierarchical if forall x, y:
(a) goals(x) ∩ goals(y) = ∅, or
(b) goals(x) ⊆ goals(y), or
(c) goals(y) ⊆ goals(x)

Q :- R(x), S(x,y), T(x,y,z), K(x,v)   “hierarchical”

Q :- R(x), S(x,y), T(y)   “non-hierarchical”
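Definitions 1 and 2 translate directly into a small test. The sketch below assumes a query without self-joins, encoded as a list of (goal name, variables) pairs; this encoding and the function names are illustrative choices.

```python
from itertools import combinations

def goals(query):
    """Map each variable to the set of subgoal names that contain it."""
    result = {}
    for name, variables in query:
        for v in variables:
            result.setdefault(v, set()).add(name)
    return result

def is_hierarchical(query):
    """Check conditions (a), (b), (c) of Definition 2 for every pair x, y."""
    g = goals(query)
    for x, y in combinations(g, 2):
        gx, gy = g[x], g[y]
        if not (gx.isdisjoint(gy) or gx <= gy or gy <= gx):
            return False
    return True

# Each subgoal is (relation name, string of one-letter variables).
q_hier = [("R", "x"), ("S", "xy"), ("T", "xyz"), ("K", "xv")]
q_nonhier = [("R", "x"), ("S", "xy"), ("T", "y")]
```

For `q_nonhier`, goals(x) = {R, S} and goals(y) = {S, T} overlap without either containing the other, so the test fails as the slide indicates.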

I-Dichotomy [Dalvi&S.’04]

Schema S^i = {R1^i, R2^i, . . ., Rm^i}

Theorem Let Q = conjunctive query w/o self-joins. Then one of the following holds:

Q is in PTIME
Q has a correct extensional plan
Q is hierarchical

or:

Q is #P-complete
Q has subgoals R(x,...), S(x,y,...), T(y,...)

Proof

Lemma 1. If Q is non-hierarchical, then #P-complete

Proof:

Q :- R^i(v,x), S^i(x,y), T^i(y,z), K^i(z)

rest is like for

Proof

Lemma 2. If Q is hierarchical, then PTIME

Proof:

Case 1: has no root

Pr(Q) = Pr(Q1) Pr(Q2) Pr(Q3)

This is extensional join ⋈

Case 2: has root x

Dom={a1, a2, . . ., an}

Pr(Q) = 1 - (1-Pr(Q(a1/x)))(1-Pr(Q(a2/x)))...(1-Pr(Q(an/x)))

This is an extensional projection: ∏

Query Evaluation on ID-Databases

ID-extensional plans

#P-complete queries

Dichotomy Theorem

Extensional Plans for ID-DBs

Only difference: two kinds of projections:

independent: 1-(1-p1)...(1-pn)

disjoint: p1 + ... + pn
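The two kinds of projection can be written as two small functions. This sketch uses hypothetical numeric values for the Activities^id tuples (walk at t, run at t, walk at t+1); the function names are illustrative.

```python
def independent_project(probs):
    """Independent projection: 1 - (1-p1)...(1-pn)."""
    none_present = 1.0
    for p in probs:
        none_present *= 1.0 - p
    return 1.0 - none_present

def disjoint_project(probs):
    """Disjoint projection: p1 + ... + pn (the events exclude each other)."""
    total = sum(probs)
    if total > 1.0 + 1e-12:
        raise ValueError("disjoint probabilities must sum to at most 1")
    return total

# Hypothetical values for p1 (t, walk), p2 (t, run), p3 (t+1, walk).
p1, p2, p3 = 0.5, 0.25, 0.125

# "Some activity at time t": the two tuples share the key t, so disjoint.
some_activity_at_t = disjoint_project([p1, p2])

# "Walking at some time": the two tuples have different keys, so independent.
walk_at_some_time = independent_project([p1, p3])
```

Which function applies depends on whether the projected-away tuples share a key, so an ID-extensional plan has to track key attributes through the operators.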

#P-Complete Queries

Q1 :- R^i(x), S^i(x,y), T^i(y)

Q2 :- R^d(x^d,y), S^d(y^d,z)

Q3 :- R^d(x^d,y), S^d(z^d,y)

ID-DB Dichotomy [Dalvi&S.’04]

Schema S^id s.t. each table is either R^i or R^id

Theorem Let Q = conjunctive query w/o self-joins. Then one of the following holds:

Q is in PTIME
Q has a correct extensional plan

or:

Q is #P-complete
Q has one of Q1, Q2, Q3 as subqueries

Extensions

•Extensions of the dichotomy theorem exist for:

Mixed schemas (some relations are deterministic)

Functional dependencies

Summary on Query Evaluation

•Extensional plans: popular, efficient, BUT

“Equivalent” plans lead to different results

Some queries admit “correct” plans

•Some simple queries: #P-complete complexity

•Dichotomy theorem

•Future work: remove ‘no-self-join’ restriction

Outline

Definitions

Query evaluation

Top-k answering (joint with Chris Re)

Conclusions

Top-k Ranking Problem

•Fix schema S, query Q, number k > 0

•Problem: given I- or ID-database DB, find k answers t1,...,tk with highest probabilities

Pr[Q(t1)] > Pr[Q(t2)] > .... > Pr[Q(tk)] > ...

•Note: Checking Pr[Q(ti)] > Pr[Q(tj)] is PP-complete

Goal: efficient polynomial time approximation

Probabilities of Boolean Expressions

What is the probability of e1⋀e2 ⋁ e1⋀e3 ⋁ e2⋀e3?

(1-p1)p2p3 + p1(1-p2)p3 + p1p2(1-p3) + p1p2p3

e1 e2 e3 Pr

0 0 0 (1-p1)(1-p2)(1-p3)

0 0 1 (1-p1)(1-p2)p3

0 1 0 (1-p1)p2(1-p3)

0 1 1 (1-p1)p2p3

1 0 0 p1(1-p2)(1-p3)

1 0 1 p1(1-p2)p3

1 1 0 p1p2(1-p3)

1 1 1 p1p2p3

Theorem #P-hard [Valiant]
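The truth-table computation can be reproduced in a few lines for the expression e1∧e2 ∨ e1∧e3 ∨ e2∧e3, whose four satisfying rows give the sum shown above. This is a sketch with assumed numeric values for p1, p2, p3 and hypothetical function names.

```python
from itertools import product

def majority(e1, e2, e3):
    """e1∧e2 ∨ e1∧e3 ∨ e2∧e3: true when at least two events hold."""
    return (e1 and e2) or (e1 and e3) or (e2 and e3)

def exact_probability(expr, probs):
    """Sum the weights of all truth-table rows that satisfy expr."""
    total = 0.0
    for values in product([False, True], repeat=len(probs)):
        weight = 1.0
        for v, p in zip(values, probs):
            weight *= p if v else 1.0 - p
        if expr(*values):
            total += weight
    return total

# Hypothetical values for p1, p2, p3.
p1, p2, p3 = 0.5, 0.25, 0.125
exact = exact_probability(majority, [p1, p2, p3])
closed = ((1 - p1) * p2 * p3 + p1 * (1 - p2) * p3
          + p1 * p2 * (1 - p3) + p1 * p2 * p3)
```

The table has 2^n rows for n events, so this exact method only works for very small expressions.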

Monte Carlo Simulation

Algorithm:

randomly pick each e1, e2, e3 = false or true
compute e1∧e2 ∨ e1∧e3 ∨ e2∧e3: true or false?
repeat

Approximate probability p with frequency p’

Better: PTAS [Karp&Luby’83]

Pr( |p’-p| < ε ) > 1-δ

Confidence interval around p’: (p’- ε, p’+ ε)
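The naive simulation loop described on this slide can be sketched as follows. The sample-size helper uses the standard Hoeffding bound, which is swapped in here for illustration and is not the Karp and Luby estimator; the event probabilities of 1/2 and all function names are assumptions.

```python
import math
import random

def majority(e1, e2, e3):
    """e1∧e2 ∨ e1∧e3 ∨ e2∧e3."""
    return (e1 and e2) or (e1 and e3) or (e2 and e3)

def monte_carlo(expr, probs, steps, rng):
    """Estimate Pr[expr] by the frequency p' of true outcomes."""
    hits = 0
    for _ in range(steps):
        values = [rng.random() < p for p in probs]  # sample each event
        if expr(*values):
            hits += 1
    return hits / steps

def samples_needed(eps, delta):
    """Hoeffding: N >= ln(2/delta) / (2 eps^2) gives Pr(|p'-p| < eps) > 1-delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

rng = random.Random(0)  # fixed seed so the run is reproducible
estimate = monte_carlo(majority, [0.5, 0.5, 0.5], 20000, rng)
```

With all three events at probability 1/2 the exact value is 4/8, so the estimate should land close to 0.5; the additive guarantee degrades for very small p, which is one motivation for the better estimator cited on the slide.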

The Multisimulation Problem

Year P

1995 ??

2002 ??

1933 ??

1984 ??

Schedule simulation steps to find top-k

Multisimulation

How to find the top k out of n ?

Example: looking for top k=2;

Which one to simulate next?

Multisimulation

Critical region: (k’th left, k+1’th right)
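One way to read the critical region is sketched below. It assumes that each candidate answer carries a confidence interval (lo, hi), that "k'th left" means the k-th largest left endpoint, and that "k+1'th right" means the (k+1)-th largest right endpoint; this reading and the function names are assumptions, not confirmed by the slides.

```python
def critical_region(intervals, k):
    """Return (c, d): k-th largest left endpoint, (k+1)-th largest right one.

    Assumes len(intervals) > k.
    """
    lows = sorted((lo for lo, _ in intervals), reverse=True)
    highs = sorted((hi for _, hi in intervals), reverse=True)
    return lows[k - 1], highs[k]

def is_empty(region):
    """The region (c, d) is empty once c >= d."""
    c, d = region
    return c >= d

# Hypothetical intervals: the top 2 are already separated from the rest.
separated = [(0.7, 0.9), (0.6, 0.8), (0.1, 0.3), (0.2, 0.4)]
# Hypothetical intervals that still overlap around the k-th position.
overlapping = [(0.5, 0.9), (0.3, 0.7), (0.2, 0.6), (0.0, 0.1)]
```

Under this reading, an empty region means the k intervals with the largest left endpoints lie entirely above all the others, so no further simulation steps are needed to identify the top-k set.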

Multisimulation Algorithm

Case 1: pick a “double crosser” and simulate it

Case 2: pick both a “left” AND a “right” crosser

Case 3: pick a “max crosser” and simulate it

End: when critical region is “empty”

To sort the top k, find the top k-1, etc

Theorem
(1) It runs in < 2 Optimal # steps
(2) no other deterministic algorithm does better

Experiments

Summary on Top-k Answering

Simple algorithm, optimal (x2) w.r.t. a very powerful standard

Marriage of probabilistic and top-k answers makes probabilistic databases practical

Outline

Definitions

Query evaluation

Top-k answering

Conclusions

Strong motivation from practical applications

Opportunity to merge query and search technologies

Probabilistic DBs are hard!

Great opportunity for impactful theory work

Tomorrow: applications of random graphs to model incompleteness in databases

Thank you !

Questions ?
