Scrubbing Query Results from Probabilistic Databases Jianwen Chen, Ling Feng, Wenwei Xue

Upload: shaeleigh-newman

Post on 01-Jan-2016

Page 1: Scrubbing Query Results from Probabilistic Databases

Scrubbing Query Results from Probabilistic Databases

Jianwen Chen, Ling Feng, Wenwei Xue

Page 2: Scrubbing Query Results from Probabilistic Databases

A skeleton of scrubbing probabilistic database query results

Page 3: Scrubbing Query Results from Probabilistic Databases

Three probabilistic relation examples

Page 4: Scrubbing Query Results from Probabilistic Databases

Query 1: look for the year(s) where at least one movie was liked by people from northern regions

The user gets the following answer from the probabilistic database:

User: Where is the probability derived from?
System: It is based on two assumptions: Pr(x4) = 0.9 and Pr(x5) = 0.2.
User: I think the movie with MovieID = 4 is not actually liked by people from northern regions. Pr(x4) should be 0.1, not 0.9!
System: The new probability is 0.28!

How to identify the top-k uncertain assumptions for user clarification?

How to recompute the probability?

Page 5: Scrubbing Query Results from Probabilistic Databases

Pr(ee) = Pr(x4 ∨ x5) = Pr(x4) + Pr(x5) - Pr(x4) * Pr(x5) = 0.9 + 0.2 - 0.9 * 0.2 = 0.92

$$\frac{\partial \Pr(ee)}{\partial \Pr(x_4)} = 1 - \Pr(x_5) = 1 - 0.2 = 0.8$$

$$\frac{\partial \Pr(ee)}{\partial \Pr(x_5)} = 1 - \Pr(x_4) = 1 - 0.9 = 0.1$$

Top-k assumptions

EventID  Prob.  Rate
x4       0.9    0.8
x5       0.2    0.1

With Pr(x4) corrected to 0.1:

Pr(ee) = Pr(x4 ∨ x5) = Pr(x4) + Pr(x5) - Pr(x4) * Pr(x5) = 0.1 + 0.2 - 0.1 * 0.2 = 0.28
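
The numbers on this slide can be checked with a few lines of Python. This is only a minimal sketch: the helper name `pr_or` and the variable names are illustrative and do not come from the slides, and the two events are assumed to be independent.

```python
def pr_or(p, q):
    """Probability of a disjunction of two independent events."""
    return p + q - p * q

pr_x4, pr_x5 = 0.9, 0.2

answer = pr_or(pr_x4, pr_x5)   # probability first shown to the user
rate_x4 = 1 - pr_x5            # d Pr(ee) / d Pr(x4)
rate_x5 = 1 - pr_x4            # d Pr(ee) / d Pr(x5)

# The user corrects Pr(x4), the assumption with the larger rate.
scrubbed = pr_or(0.1, pr_x5)

print(round(answer, 4), round(rate_x4, 4), round(rate_x5, 4), round(scrubbed, 4))
```

A small change in Pr(x4) moves the answer eight times as much as the same change in Pr(x5), which is why x4 is the more useful assumption to ask about.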

Page 6: Scrubbing Query Results from Probabilistic Databases

Basic algorithm to compute top-k assumptions

For an event expression ee, to compute its probability Pr(ee), one can first convert it into an equivalent disjunctive normal form, and then apply the inclusion-exclusion formula.

Disjunctive normal form: ee = C1 ∨ C2 ∨ … ∨ Cm

where $C_i = e_{i1} \wedge e_{i2} \wedge \dots \wedge e_{i s_i}$ for $i = 1, \dots, m$, with $m \ge 1$ and $s_1, s_2, \dots, s_m \ge 1$

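Inclusion-exclusion formula:

$$\Pr(ee) = \Pr(C_1 \vee C_2 \vee \dots \vee C_m) = \sum_{i} \Pr(C_i) - \sum_{i<j} \Pr(C_i \wedge C_j) + \sum_{i<j<k} \Pr(C_i \wedge C_j \wedge C_k) - \dots + (-1)^{m-1} \Pr(C_1 \wedge C_2 \wedge \dots \wedge C_m)$$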
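
A minimal Python sketch of the inclusion-exclusion step follows. It rests on assumptions that are not stated on the slide: every conjunct contains only positive literals, the basic events are mutually independent (so the probability of a conjunction is the product over its distinct events), and the function name `pr_dnf` and its input layout are made up for illustration.

```python
from itertools import combinations
from math import prod

def pr_dnf(conjuncts, pr):
    """Pr(C1 v ... v Cm) by inclusion-exclusion.

    conjuncts: list of sets of basic-event names (positive literals only).
    pr: dict mapping each basic event to its probability.
    Assumes the basic events are mutually independent.
    """
    total = 0.0
    for size in range(1, len(conjuncts) + 1):
        sign = 1 if size % 2 == 1 else -1          # (-1)^(size-1)
        for group in combinations(conjuncts, size):
            events = set().union(*group)           # Ci ^ Cj ^ ... as one conjunction
            total += sign * prod(pr[e] for e in events)
    return total

print(round(pr_dnf([{"x4"}, {"x5"}], {"x4": 0.9, "x5": 0.2}), 4))
```

The loops visit every non-empty subset of the m conjuncts, so the running time grows exponentially with m.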

Page 7: Scrubbing Query Results from Probabilistic Databases

Basic algorithm to compute top-k assumptions

To compute $\frac{\partial \Pr(ee)}{\partial \Pr(e_i)}$, one can rewrite Pr(ee) as

Pr(ee) = α * Pr(ei) + β

where α and β are two sub-expressions that do not depend on Pr(ei), and

$$\frac{\partial \Pr(ee)}{\partial \Pr(e_i)} = \alpha$$

The time complexity is O(2^m), where m is the number of conjuncts in the disjunctive normal form of ee.
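
One way to obtain α and β numerically, instead of rewriting the expression symbolically, is to evaluate Pr(ee) with Pr(ei) set to 0 and to 1. Since Pr(ee) = α * Pr(ei) + β, the first evaluation gives β and the difference gives α. The sketch below is not the slides' procedure; the names `rate` and `pr_query1` are hypothetical, and independence of the basic events is assumed.

```python
def rate(prob_fn, pr, event):
    """Return (alpha, beta) such that prob_fn(pr) == alpha * pr[event] + beta."""
    beta = prob_fn({**pr, event: 0.0})            # Pr(ee) with Pr(event) = 0
    alpha = prob_fn({**pr, event: 1.0}) - beta    # slope with respect to Pr(event)
    return alpha, beta

def pr_query1(pr):
    # Pr(x4 v x5) for independent x4 and x5
    return pr["x4"] + pr["x5"] - pr["x4"] * pr["x5"]

probs = {"x4": 0.9, "x5": 0.2}
alpha, beta = rate(pr_query1, probs, "x4")
print(round(alpha, 4), round(beta, 4))
```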

Page 8: Scrubbing Query Results from Probabilistic Databases

Optimization

We restrict the event expression ee to the case where the basic events e1, e2, …, en are independent and each occurs at most once in ee. This form can be obtained for most queries (80% of the TPC/H queries) by using the well-researched optimization technique adopted in:

Dalvi, N., Suciu, D.: Efficient query evaluation on probabilistic databases. VLDB Journal 16(4) (2007) 523–544

Page 9: Scrubbing Query Results from Probabilistic Databases

Three probabilistic relation examples

Page 10: Scrubbing Query Results from Probabilistic Databases

Query 2: look for the year(s) where at least one movie was liked by people from northern regions but not by people from southern regions

The user gets the following answer from the uncertain database:

Page 11: Scrubbing Query Results from Probabilistic Databases

ee = (e1 ∧ ~e2) ∨ (e3 ∧ ~e4) ∨ (e5 ∧ ~e6)

Pr(e1) = 0.2
Pr(e2) = 0.7
Pr(e3) = 0.1
Pr(e4) = 0.9
Pr(e5) = 0.7
Pr(e6) = 0.2

Pr(ee)?

Pr(~ee) = 1 - Pr(ee)

Pr(ee1 ∧ ee2) = Pr(ee1) * Pr(ee2)

Pr(ee1 ∨ ee2) = Pr(ee1) + Pr(ee2) - Pr(ee1) * Pr(ee2)

Pr(ee)=f(Pr(e1),Pr(e2),…,Pr(e6))

$$\frac{\partial \Pr(ee)}{\partial \Pr(e_1)},\ \frac{\partial \Pr(ee)}{\partial \Pr(e_2)},\ \dots,\ \frac{\partial \Pr(ee)}{\partial \Pr(e_6)}$$

Page 12: Scrubbing Query Results from Probabilistic Databases
Page 13: Scrubbing Query Results from Probabilistic Databases

(e1 ∧ ~e2) ∨ (e3 ∧ ~e4) ∨ (e5 ∧ ~e6)

Pr(e1)=0.2

Pr(e2)=0.7

Pr(e3)=0.1

Pr(e4)=0.9

Pr(e5)=0.7

Pr(e6)=0.2

Probabilities computed bottom-up, one per node N of the expression tree:

N = ~e2: Pr(ee(N))=1-Pr(ee(leftChild(N)))=1-0.7=0.3

N = ~e4: Pr(ee(N))=1-Pr(ee(leftChild(N)))=1-0.9=0.1

N = ~e6: Pr(ee(N))=1-Pr(ee(leftChild(N)))=1-0.2=0.8

N = e1 ∧ ~e2: Pr(ee(N))=Pr(ee(leftChild(N)))*Pr(ee(rightChild(N)))=0.2*0.3=0.06

N = e3 ∧ ~e4: Pr(ee(N))=Pr(ee(leftChild(N)))*Pr(ee(rightChild(N)))=0.1*0.1=0.01

N = e5 ∧ ~e6: Pr(ee(N))=Pr(ee(leftChild(N)))*Pr(ee(rightChild(N)))=0.7*0.8=0.56

N = (e1 ∧ ~e2) ∨ (e3 ∧ ~e4): Pr(ee(N))=Pr(ee(leftChild(N)))+Pr(ee(rightChild(N)))-Pr(ee(leftChild(N)))*Pr(ee(rightChild(N)))=0.06+0.01-0.06*0.01=0.0694

N = root: Pr(ee(N))=Pr(ee(leftChild(N)))+Pr(ee(rightChild(N)))-Pr(ee(leftChild(N)))*Pr(ee(rightChild(N)))=0.0694+0.56-0.0694*0.56≈0.591
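
The per-node computations above amount to one bottom-up pass over the expression tree. The sketch below reproduces them in Python under the stated restriction that the basic events are independent and none occurs twice. The nested-tuple tree encoding and the name `evaluate` are illustrative choices, not taken from the slides.

```python
def evaluate(node, pr):
    """Pr(ee(N)) for a tree whose leaves are event names (strings)."""
    if isinstance(node, str):              # leaf: a basic event
        return pr[node]
    if node[0] == "not":
        return 1 - evaluate(node[1], pr)
    left = evaluate(node[1], pr)
    right = evaluate(node[2], pr)
    if node[0] == "and":
        return left * right
    return left + right - left * right     # "or"

pr = {"e1": 0.2, "e2": 0.7, "e3": 0.1, "e4": 0.9, "e5": 0.7, "e6": 0.2}

# ((e1 ^ ~e2) v (e3 ^ ~e4)) v (e5 ^ ~e6)
tree = ("or",
        ("or",
         ("and", "e1", ("not", "e2")),
         ("and", "e3", ("not", "e4"))),
        ("and", "e5", ("not", "e6")))

root = evaluate(tree, pr)
print(round(root, 3))
```

Each node is visited once, so the cost is linear in the size of the expression rather than exponential in the number of conjuncts.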

Page 20: Scrubbing Query Results from Probabilistic Databases

Second Optimization

Page 21: Scrubbing Query Results from Probabilistic Databases

(e1 ∧ ~e2) ∨ (e3 ∧ ~e4) ∨ (e5 ∧ ~e6)

top-2 assumptions
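
A possible way to obtain all six rates is a top-down pass over the same expression tree, applying the chain rule to the ~, ∧ and ∨ rules. The sketch below is an assumption-laden illustration, not the slides' algorithm: the tuple encoding, the names `evaluate`, `rates` and `top_k` are hypothetical, sub-tree values are recomputed rather than cached for brevity, and ranking by the magnitude of the rate is assumed because the ranking criterion used on this slide is not recoverable from the transcript.

```python
def evaluate(node, pr):
    """Pr(ee(N)) for a tree whose leaves are event names (strings)."""
    if isinstance(node, str):
        return pr[node]
    if node[0] == "not":
        return 1 - evaluate(node[1], pr)
    left, right = evaluate(node[1], pr), evaluate(node[2], pr)
    return left * right if node[0] == "and" else left + right - left * right

def rates(tree, pr):
    """d Pr(ee) / d Pr(e) for every basic event e (each event occurs once)."""
    out = {}
    def backward(node, grad):
        if isinstance(node, str):
            out[node] = out.get(node, 0.0) + grad
        elif node[0] == "not":
            backward(node[1], -grad)                 # d(1 - p)/dp = -1
        else:
            left, right = evaluate(node[1], pr), evaluate(node[2], pr)
            if node[0] == "and":                     # d(l*r)/dl = r, d(l*r)/dr = l
                backward(node[1], grad * right)
                backward(node[2], grad * left)
            else:                                    # d(l+r-l*r)/dl = 1 - r
                backward(node[1], grad * (1 - right))
                backward(node[2], grad * (1 - left))
    backward(tree, 1.0)
    return out

def top_k(tree, pr, k):
    r = rates(tree, pr)
    return sorted(r, key=lambda e: abs(r[e]), reverse=True)[:k]

pr = {"e1": 0.2, "e2": 0.7, "e3": 0.1, "e4": 0.9, "e5": 0.7, "e6": 0.2}
tree = ("or",
        ("or", ("and", "e1", ("not", "e2")), ("and", "e3", ("not", "e4"))),
        ("and", "e5", ("not", "e6")))
print({e: round(v, 5) for e, v in rates(tree, pr).items()})
```

A negative rate means that raising the event's probability lowers the probability of the answer, as happens for the negated events e2, e4 and e6.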

Page 22: Scrubbing Query Results from Probabilistic Databases

Scrub the query result

Recompute Pr((e1 ∧ ~e2) ∨ (e3 ∧ ~e4) ∨ (e5 ∧ ~e6)) with modified Pr(e2) and Pr(e5)

Page 23: Scrubbing Query Results from Probabilistic Databases

Performance Study

Page 24: Scrubbing Query Results from Probabilistic Databases

Performance Study

Page 25: Scrubbing Query Results from Probabilistic Databases

Conclusion