Page 1: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Presented by Xi Zhang

February 8th, 2008

Page 2: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Outline
- Background
- Motivation Examples
- Top-k Queries in Probabilistic Databases
- Conclusion

Page 3: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Outline
- Background
  - Probabilistic database model
  - Top-k queries & scoring functions
- Motivation Examples
- Top-k Queries in Probabilistic Databases
- Conclusion

Page 4: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Probabilistic Databases
- Motivation
  - Uncertainty/vagueness/imprecision in data
- History
  - Incomplete information in relational DB [Imielinski & Lipski 1984]
  - Probabilistic DB model [Cavallo & Pittarelli 1987]
  - Probabilistic Relational Algebra [Fuhr & Rölleke 1997 etc.]
- Comeback
  - A flourish of uncertain data in real-world applications
  - Examples: WWW, biological data, sensor networks, etc.

Page 5: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Probabilistic Database Model [Fuhr & Rölleke 1997]
- Probabilistic Database Model
  - A generalization of relational DB
- Probabilistic Relational Algebra (PRA)
  - A generalization of standard relational algebra

Page 6: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

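A Table in Probabilistic Database

DocTerm:

| DocNo | Term | Prob | Basic Event |
|-------|------|------|-------------|
| 1     | IR   | 0.9  | eDT(1, IR)  |
| 2     | DB   | 0.7  | eDT(2, DB)  |
| 3     | IR   | 0.8  | eDT(3, IR)  |
| 3     | DB   | 0.5  | eDT(3, DB)  |
| 4     | AI   | 0.8  | eDT(4, AI)  |

- Each tuple carries an event expression
- Basic events are independent events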

Page 7: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Probabilistic Relational Algebra
Just like in Relational Algebra…
- Selection (σ)
- Projection (π)
- Join (⋈)
- Union (∪)
- Difference (−)


Page 9: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

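Selection

DocTerm:

| DocNo | Term | Prob | Basic Event |
|-------|------|------|-------------|
| 1     | IR   | 0.9  | eDT(1, IR)  |
| 2     | DB   | 0.7  | eDT(2, DB)  |
| 3     | IR   | 0.8  | eDT(3, IR)  |
| 3     | DB   | 0.5  | eDT(3, DB)  |
| 4     | AI   | 0.8  | eDT(4, AI)  |

Result of selecting the tuples with Term = IR:

| DocNo | Term | Prob | Complex Event |
|-------|------|------|---------------|
| 1     | IR   | 0.9  | eDT(1, IR)    |
| 3     | IR   | 0.8  | eDT(3, IR)    |

Complex event: used in derived tables; a propositional expression of basic events.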

Page 10: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

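Projection

DocTerm:

| DocNo | Term | Prob | Basic Event |
|-------|------|------|-------------|
| 1     | IR   | 0.9  | eDT(1, IR)  |
| 2     | DB   | 0.7  | eDT(2, DB)  |
| 3     | IR   | 0.8  | eDT(3, IR)  |
| 3     | DB   | 0.5  | eDT(3, DB)  |
| 4     | AI   | 0.8  | eDT(4, AI)  |

Result of projecting on Term:

| Term | Prob | Complex Event           |
|------|------|-------------------------|
| IR   | 0.98 | eDT(1, IR) ∨ eDT(3, IR) |
| DB   | 0.85 | eDT(2, DB) ∨ eDT(3, DB) |
| AI   | 0.80 | eDT(4, AI)              |

Since the basic events are independent, Prob(IR) = 1 − (1 − 0.9)(1 − 0.8) = 0.98 and Prob(DB) = 1 − (1 − 0.7)(1 − 0.5) = 0.85.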
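The projection probabilities can be checked with a few lines of Python. This is only a sketch under the slide's assumption that basic events are independent; the names `prob_or` and `project_term` are hypothetical helpers introduced here, not part of PRA.

```python
from math import prod

def prob_or(probs):
    """P(e1 or e2 or ...) for mutually independent events."""
    return 1.0 - prod(1.0 - p for p in probs)

# DocTerm rows from the slide: (DocNo, Term, Prob)
doc_term = [(1, "IR", 0.9), (2, "DB", 0.7), (3, "IR", 0.8),
            (3, "DB", 0.5), (4, "AI", 0.8)]

def project_term(rows):
    """Project on Term: duplicates merge into a disjunction of their events."""
    groups = {}
    for _doc_no, term, p in rows:
        groups.setdefault(term, []).append(p)
    return {term: prob_or(ps) for term, ps in groups.items()}

result = project_term(doc_term)
for term, p in result.items():
    print(term, round(p, 2))
```

Merging duplicates through a disjunction is what distinguishes probabilistic projection from the deterministic one, where duplicates are simply dropped.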

Page 11: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

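Join

DocTerm:

| DocNo | Term | Prob | Basic Event |
|-------|------|------|-------------|
| 1     | IR   | 0.9  | eDT(1, IR)  |
| 2     | DB   | 0.7  | eDT(2, DB)  |

DocAu:

| DocNo | AName | Prob | Basic Event   |
|-------|-------|------|---------------|
| 1     | Bauer | 0.9  | eDU(1, Bauer) |
| 2     | Meier | 0.8  | eDU(2, Meier) |

Result:

| DocAu.DocNo | AName | DocTerm.DocNo | Term | Prob      | Complex Event              |
|-------------|-------|---------------|------|-----------|----------------------------|
| 1           | Bauer | 1             | IR   | 0.9 * 0.9 | eDU(1, Bauer) ∧ eDT(1, IR) |
| 1           | Bauer | 2             | DB   | 0.9 * 0.7 | eDU(1, Bauer) ∧ eDT(2, DB) |
| 2           | Meier | 1             | IR   | 0.8 * 0.9 | eDU(2, Meier) ∧ eDT(1, IR) |
| 2           | Meier | 2             | DB   | 0.8 * 0.7 | eDU(2, Meier) ∧ eDT(2, DB) |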

Page 12: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

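Join + Projection

DocTerm:

| DocNo | Term | Prob | Basic Event |
|-------|------|------|-------------|
| 1     | IR   | 0.9  | eDT(1, IR)  |
| 2     | DB   | 0.7  | eDT(2, DB)  |
| 3     | IR   | 0.8  | eDT(3, IR)  |
| 3     | DB   | 0.5  | eDT(3, DB)  |
| 4     | AI   | 0.8  | eDT(4, AI)  |

IR:

| DocNo | Prob | Complex Event |
|-------|------|---------------|
| 1     | 0.9  | eDT(1, IR)    |
| 3     | 0.8  | eDT(3, IR)    |

DB:

| DocNo | Prob | Complex Event |
|-------|------|---------------|
| 2     | 0.7  | eDT(2, DB)    |
| 3     | 0.5  | eDT(3, DB)    |

DocAu:

| DocNo | AName   | Prob | Basic Event     |
|-------|---------|------|-----------------|
| 1     | Bauer   | 0.9  | eDU(1, Bauer)   |
| 2     | Bauer   | 0.3  | eDU(2, Bauer)   |
| 2     | Meier   | 0.9  | eDU(2, Meier)   |
| 2     | Schmidt | 0.8  | eDU(2, Schmidt) |
| 3     | Schmidt | 0.7  | eDU(3, Schmidt) |
| 4     | Koch    | 0.9  | eDU(4, Koch)    |
| 4     | Bauer   | 0.6  | eDU(4, Bauer)   |

π_AName(DocAu ⋈ IR):

| AName   | Prob | Complex Event              |
|---------|------|----------------------------|
| Bauer   | 0.81 | eDU(1, Bauer) ∧ eDT(1, IR) |
| Schmidt | 0.56 | eDU(3, S) ∧ eDT(3, IR)     |

π_AName(DocAu ⋈ DB):

| AName   | Prob  | Complex Event                                         |
|---------|-------|-------------------------------------------------------|
| Bauer   | 0.21  | eDU(2, Bauer) ∧ eDT(2, DB)                            |
| Meier   | 0.63  | eDU(2, Meier) ∧ eDT(2, DB)                            |
| Schmidt | 0.714 | (eDU(2, S) ∧ eDT(2, DB)) ∨ (eDU(3, S) ∧ eDT(3, DB))   |

Here Prob(Schmidt) = 1 − (1 − 0.8 * 0.7)(1 − 0.7 * 0.5) = 0.714, because the two disjuncts share no basic event.

Join of the two projections (authors of both IR and DB documents):

| AName   | Prob (product of operand probabilities) | Complex Event |
|---------|-----------------------------------------|---------------|
| Bauer   | 0.81 * 0.21 = 0.1701                    | (eDU(1, B) ∧ eDT(1, IR)) ∧ (eDU(2, B) ∧ eDT(2, DB)) |
| Schmidt | 0.56 * 0.714 = 0.3998                   | (eDU(3, S) ∧ eDT(3, IR)) ∧ ((eDU(2, S) ∧ eDT(2, DB)) ∨ (eDU(3, S) ∧ eDT(3, DB))) |

Correct probability of the Schmidt event expression: 0.4368.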

Page 13: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Join + Projection: Intensional Semantics vs. Extensional Semantics

The base and intermediate tables (DocTerm, IR, DB, DocAu and the two projections on AName) are the same as on the previous page. The final result under both semantics:

| AName   | Extensional Prob       | Intensional Prob | Complex Event |
|---------|------------------------|------------------|---------------|
| Bauer   | 0.81 * 0.21 = 0.1701   | 0.1701           | (eDU(1, B) ∧ eDT(1, IR)) ∧ (eDU(2, B) ∧ eDT(2, DB)) |
| Schmidt | 0.56 * 0.714 = 0.3998  | 0.4368           | (eDU(3, S) ∧ eDT(3, IR)) ∧ ((eDU(2, S) ∧ eDT(2, DB)) ∨ (eDU(3, S) ∧ eDT(3, DB))) |

The two semantics agree for Bauer and disagree for Schmidt, whose event expression mentions eDU(3, S) twice.

Page 14: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Intensional vs. Extensional
- Intensional Semantics
  - Assume data independence of base tables
  - Keeps track of data dependence during the evaluation
- Extensional Semantics
  - Assume data independence during the evaluation
  - Could be WRONG with probability computation!
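The difference can be made concrete on the Schmidt event expression from the Join + Projection example. The sketch below is an illustration, not the evaluation method of PRA: it computes the intensional probability by brute-force enumeration of truth assignments to the basic events, and the extensional value by multiplying the operand probabilities as if they were independent. The short event names (`DU3S` and so on) are shorthand introduced here.

```python
from itertools import product

# Basic events for Schmidt and their probabilities (from DocAu and DocTerm).
basic = {"DU2S": 0.8, "DT2DB": 0.7, "DU3S": 0.7, "DT3DB": 0.5, "DT3IR": 0.8}

def event_prob(expr, probs):
    """Exact probability of a propositional expression over independent basic events."""
    names = sorted(probs)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if expr(world):
            weight = 1.0
            for n in names:
                weight *= probs[n] if world[n] else 1.0 - probs[n]
            total += weight
    return total

def ir(w):    # Schmidt wrote an IR document
    return w["DU3S"] and w["DT3IR"]

def db(w):    # Schmidt wrote a DB document
    return (w["DU2S"] and w["DT2DB"]) or (w["DU3S"] and w["DT3DB"])

def both(w):  # the joined event expression
    return ir(w) and db(w)

p_intensional = event_prob(both, basic)                       # tracks the shared event DU3S
p_extensional = event_prob(ir, basic) * event_prob(db, basic)  # pretends operands are independent

print(round(p_intensional, 4), round(p_extensional, 4))
```

The enumeration is exponential in the number of basic events, so it only serves to check small examples. The two values differ because the basic event for (3, Schmidt) occurs in both operands of the join; the independent disjunction for the DB operand evaluates to 1 − (1 − 0.56)(1 − 0.35) = 0.714.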

Page 15: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

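When Intensional = Extensional?
- When there are no identical underlying basic events in the event expression

| AName   | Extensional Prob       | Intensional Prob | Complex Event |
|---------|------------------------|------------------|---------------|
| Bauer   | 0.81 * 0.21 = 0.1701   | 0.1701           | (eDU(1, B) ∧ eDT(1, IR)) ∧ (eDU(2, B) ∧ eDT(2, DB)) |
| Schmidt | 0.56 * 0.714 = 0.3998  | 0.4368           | (eDU(3, S) ∧ eDT(3, IR)) ∧ ((eDU(2, S) ∧ eDT(2, DB)) ∨ (eDU(3, S) ∧ eDT(3, DB))) |

Identical basic event: eDU(3, S) appears twice in the Schmidt expression, so the two semantics give different results there.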

Page 16: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Fuhr & Rölleke 1997 Summary
- Probabilistic DB Model
  - Concept of event
  - Basic vs. complex event
  - Event expression
- Probabilistic Relational Algebra
  - Just like in Relational Algebra…
- Computation of event probabilities
  - Intensional vs. extensional semantics
  - Yield the same result when there is NO data dependence in event expressions

Page 17: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Outline
- Background
  - Probabilistic database model
  - Top-k queries & scoring functions
- Motivation Examples
- Top-k Queries in Probabilistic Databases
  - Semantics
  - Query Evaluation
- Conclusion

Page 18: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Top-k Queries
- Traditionally, given
  - Objects: o1, o2, …, on
  - A non-negative integer: k
  - A scoring function s that assigns a score to each object
- Question: What are the k objects with the highest score?
- Have been studied in Web, XML, Relational Databases, and more recently in Probabilistic Databases.

Page 19: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

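Scoring Function
- A scoring function s over a deterministic relation R is a function from R to the real numbers, s: R → ℝ
- For any ti and tj from R, ti is preferred to tj if and only if s(ti) > s(tj); ti and tj are tied if and only if s(ti) = s(tj)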

Page 20: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Outline
- Background
- Motivation Examples
  - Smart Environment Example
  - Sensor Network Example
- Top-k Queries in Probabilistic Databases
- Conclusion

Page 21: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Motivating Example I
- Smart Environment
  - Sample Question: “Who were the two visitors in the lab last Saturday night?”
  - Data
    - Biometric data from sensors
      - We would be able to see how those data match the profile of every candidate -- a scoring function
    - Historical statistics
      - e.g. probability of a certain candidate being in the lab on Saturday nights

Page 22: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Motivating Example I (cont.)

| Personnel | Biometrics: score(Face Detection, Voice Detection, …) | Probability of being in lab on Saturday nights |
|-----------|--------------------------------------------------------|------------------------------------------------|
| Aiden     | score(0.70, 0.60, …) = 0.65                            | 0.3                                            |
| Bob       | score(0.50, 0.60, …) = 0.55                            | 0.9                                            |
| Chris     | score(0.50, 0.40, …) = 0.45                            | 0.4                                            |

Question: Find two people in the lab last Saturday night
- A Top-2 query over the above probabilistic database under the above scoring function

Page 23: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Motivating Example II
- Sensor Network in a Habitat
  - Sample Question: “What is the temperature of the warmest spot?”
  - Data
    - Sensor readings from different sensors
    - At a sampling time, only one “real” reading from a sensor
    - Each sensor reading comes with a confidence value

Page 24: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Motivating Example II (cont.)

| Temp (F) | Prob | Part               |
|----------|------|--------------------|
| 22       | 0.6  | C1 (from Sensor 1) |
| 10       | 0.1  | C1 (from Sensor 1) |
| 25       | 0.4  | C2 (from Sensor 2) |
| 15       | 0.6  | C2 (from Sensor 2) |

Question: What is the temperature of the warmest spot?
- A Top-1 query over the above probabilistic database under the scoring function proportional to temperature

Page 25: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Outline
- Background
- Motivation Examples
- Top-k Queries in Probabilistic Databases
  - Semantics
  - Query Evaluation
- Conclusion

Page 26: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Models
- A probabilistic relation R^p = <R, p, C>
  - R: the support deterministic relation
  - p: probability function
  - C: a partition of R, such that the probabilities of the tuples in each part Ci sum to at most 1
- Simple vs. general probabilistic relation
  - Simple
    - Assume tuple independence, i.e. |C| = |R|
    - E.g. smart environment example
  - General
    - Tuples can be independent or exclusive, i.e. |C| < |R|
    - E.g. sensor network example

Page 27: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Challenges
- Given
  - A probabilistic relation R^p = <R, p, C>
  - An injective scoring function s over R
    - No ties
  - A non-negative integer k
- What is the top-k answer set over R^p? (Semantics)
- How to compute the top-k answer of R^p? (Query Evaluation)

Page 28: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

What is a “Good” Semantics?
- Desired Properties
  - Exact-k
  - Faithfulness
  - Stability

Page 29: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Properties
- Exact-k
  - If R has at least k tuples, then exactly k tuples are returned as the top-k answer
- Faithfulness
  - A “better” tuple, i.e. higher in score and probability, is more likely to be in the top-k answer, compared to a “worse” one
- Stability
  - Raising the score/prob. of a winning tuple will not cause it to lose
  - Lowering the score/prob. of a losing tuple will not cause it to win

Page 30: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Global-Topk Semantics
- Given
  - A probabilistic relation R^p = <R, p, C>
  - An injective scoring function s over R
    - No ties
  - A non-negative integer k
- What is the top-k answer set over R^p? (Semantics)
- Global-Topk
  - Return the k highest-ranked tuples according to their probability of being in top-k answers in possible worlds
- Global-Topk satisfies the aforementioned three properties

Page 31: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

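Smart Environment Example

| Personnel | Biometrics: score(Face Detection, Voice Detection, …) | Prob. |
|-----------|--------------------------------------------------------|-------|
| Aiden     | score(0.70, 0.60, …) = 0.65                            | 0.3   |
| Bob       | score(0.50, 0.60, …) = 0.55                            | 0.9   |
| Chris     | score(0.50, 0.40, …) = 0.45                            | 0.4   |

Query: Find two people in the lab last Saturday night

| Possible world      | Prob  | Top-2          |
|---------------------|-------|----------------|
| ∅                   | 0.042 | ∅              |
| {Aiden}             | 0.018 | {Aiden}        |
| {Bob}               | 0.378 | {Bob}          |
| {Chris}             | 0.028 | {Chris}        |
| {Aiden, Bob}        | 0.162 | {Aiden, Bob}   |
| {Aiden, Bob, Chris} | 0.108 | {Aiden, Bob}   |
| {Aiden, Chris}      | 0.012 | {Aiden, Chris} |
| {Bob, Chris}        | 0.252 | {Bob, Chris}   |

Global-Topk Semantics:
- Pr(Aiden in top-2) = 0.3
- Pr(Bob in top-2) = 0.9
- Pr(Chris in top-2) = 0.028 + 0.012 + 0.252 = 0.292
- Top-2 Answer: {Aiden, Bob}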
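The Global-Topk probabilities in this example can be reproduced by brute force. The following is a minimal sketch that enumerates all possible worlds of a simple (tuple-independent) relation; it is exponential in the number of tuples and only meant to mirror the definition. The function names are hypothetical.

```python
from itertools import product

# (name, score, probability) from the smart environment example
people = [("Aiden", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]

def possible_worlds(tuples):
    """Yield (world, probability) pairs, assuming independent tuples."""
    for mask in product([False, True], repeat=len(tuples)):
        world = [t for t, keep in zip(tuples, mask) if keep]
        prob = 1.0
        for (_name, _score, p), keep in zip(tuples, mask):
            prob *= p if keep else 1.0 - p
        yield world, prob

def global_topk(tuples, k):
    """Return the k tuples most likely to appear in a world's top-k, plus all probabilities."""
    in_topk = {name: 0.0 for name, _score, _p in tuples}
    for world, prob in possible_worlds(tuples):
        top = sorted(world, key=lambda t: t[1], reverse=True)[:k]
        for name, _score, _p in top:
            in_topk[name] += prob
    ranked = sorted(in_topk, key=in_topk.get, reverse=True)
    return ranked[:k], in_topk

answer, probs = global_topk(people, 2)
print(answer, {name: round(p, 3) for name, p in probs.items()})
```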

Page 32: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Other Semantics
- Soliman, Ilyas & Chang 2007
- Two Alternative Semantics
  - U-Topk
  - U-kRanks

Page 33: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

U-Topk Semantics
- Given
  - A probabilistic relation R^p = <R, p, C>
  - An injective scoring function s over R
    - No ties
  - A non-negative integer k
- What is the top-k answer set over R^p? (Semantics)
- U-Topk
  - Return the most probable top-k answer set that belongs to possible worlds
- U-Topk does not satisfy all three properties

Page 34: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Smart Environment Example

Same relation and query as in the Global-Topk example: Aiden (score 0.65, prob. 0.3), Bob (score 0.55, prob. 0.9), Chris (score 0.45, prob. 0.4); find two people in the lab last Saturday night.

| Possible world      | Prob  | Top-2          |
|---------------------|-------|----------------|
| ∅                   | 0.042 | ∅              |
| {Aiden}             | 0.018 | {Aiden}        |
| {Bob}               | 0.378 | {Bob}          |
| {Chris}             | 0.028 | {Chris}        |
| {Aiden, Bob}        | 0.162 | {Aiden, Bob}   |
| {Aiden, Bob, Chris} | 0.108 | {Aiden, Bob}   |
| {Aiden, Chris}      | 0.012 | {Aiden, Chris} |
| {Bob, Chris}        | 0.252 | {Bob, Chris}   |

U-Topk Semantics:
- Pr({Aiden, Bob}) = 0.162 + 0.108 = 0.27
- …
- Pr({Bob}) = 0.378, the most probable answer set
- Top-2 Answer: {Bob}
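A brute-force check of the U-Topk answer, again as a sketch over a tuple-independent relation: each possible world contributes its probability to the top-k set it produces, and the most probable set wins. The function name `u_topk` is a hypothetical helper, not the algorithm of Soliman, Ilyas & Chang.

```python
from collections import defaultdict
from itertools import product

# (name, score, probability) from the smart environment example
people = [("Aiden", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]

def u_topk(tuples, k):
    """Most probable top-k answer set over all possible worlds (independent tuples)."""
    answer_prob = defaultdict(float)
    for mask in product([False, True], repeat=len(tuples)):
        prob, world = 1.0, []
        for t, keep in zip(tuples, mask):
            prob *= t[2] if keep else 1.0 - t[2]
            if keep:
                world.append(t)
        top = sorted(world, key=lambda t: t[1], reverse=True)[:k]
        answer_prob[tuple(name for name, _s, _p in top)] += prob
    best = max(answer_prob, key=answer_prob.get)
    return best, dict(answer_prob)

best, answer_prob = u_topk(people, 2)
print(best, round(answer_prob[best], 3))
```

The winning set contains a single tuple even though k = 2, which is why U-Topk fails the Exact-k property on this example.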

Page 35: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

U-kRanks Semantics
- Given
  - A probabilistic relation R^p = <R, p, C>
  - An injective scoring function s over R
    - No ties
  - A non-negative integer k
- What is the top-k answer set over R^p? (Semantics)
- U-kRanks
  - For i = 1, 2, …, k, return the most probable ith-ranked tuple across all possible worlds
- U-kRanks does not satisfy all three properties

Page 36: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

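Smart Environment Example

Same relation and query as in the Global-Topk example: Aiden (score 0.65, prob. 0.3), Bob (score 0.55, prob. 0.9), Chris (score 0.45, prob. 0.4); find two people in the lab last Saturday night.

| Possible world      | Prob  | Top-2          |
|---------------------|-------|----------------|
| ∅                   | 0.042 | ∅              |
| {Aiden}             | 0.018 | {Aiden}        |
| {Bob}               | 0.378 | {Bob}          |
| {Chris}             | 0.028 | {Chris}        |
| {Aiden, Bob}        | 0.162 | {Aiden, Bob}   |
| {Aiden, Bob, Chris} | 0.108 | {Aiden, Bob}   |
| {Aiden, Chris}      | 0.012 | {Aiden, Chris} |
| {Bob, Chris}        | 0.252 | {Bob, Chris}   |

U-kRanks Semantics:

e.g. Pr(Chris at rank-2) = 0.012 + 0.252 = 0.264

| Tuple | Rank-1 | Rank-2 |
|-------|--------|--------|
| Aiden | 0.3    | 0      |
| Bob   | 0.63   | 0.27   |
| Chris | 0.028  | 0.264  |

- Highest at rank-1: Bob (0.63)
- Highest at rank-2: Bob (0.27)
- Top-2 Answer: {Bob}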
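The rank probabilities can also be checked by enumeration. This sketch accumulates, for each rank i ≤ k, the probability that a tuple sits at rank i in a possible world, assuming independent tuples; `u_kranks` is a hypothetical name.

```python
from collections import defaultdict
from itertools import product

# (name, score, probability) from the smart environment example
people = [("Aiden", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]

def u_kranks(tuples, k):
    """For each rank 1..k, find the tuple most likely to occupy that rank."""
    rank_prob = [defaultdict(float) for _ in range(k)]
    for mask in product([False, True], repeat=len(tuples)):
        prob, world = 1.0, []
        for t, keep in zip(tuples, mask):
            prob *= t[2] if keep else 1.0 - t[2]
            if keep:
                world.append(t)
        ranked = sorted(world, key=lambda t: t[1], reverse=True)[:k]
        for i, (name, _s, _p) in enumerate(ranked):
            rank_prob[i][name] += prob
    winners = [max(r, key=r.get) for r in rank_prob if r]
    return winners, rank_prob

winners, rank_prob = u_kranks(people, 2)
print(winners)
```

Because the same tuple can win at several ranks, the answer may contain fewer than k distinct tuples, as it does here.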

Page 37: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

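Properties

| Semantics   | Exact-k | Faithfulness | Stability |
|-------------|---------|--------------|-----------|
| Global-Topk | Yes     | Yes          | Yes       |
| U-Topk      | No      | Yes/No*      | Yes       |
| U-kRanks    | No      | No           | No        |

* Yes when the relation is simple, No otherwise

Global-Topk: a better semantics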

Page 38: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Challenges
- Given
  - A probabilistic relation R^p = <R, p, C>
  - An injective scoring function s over R
    - No ties
  - A non-negative integer k
- What is the top-k answer set over R^p? (Semantics): Global-Topk
- How to compute the top-k answer of R^p? (Query Evaluation)

Page 39: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Global-Topk in Simple Relation
- Given R^p = <R, p, C>, a scoring function s, a non-negative integer k
- Assumptions
  - Tuples are independent, i.e. |C| = |R|
  - R = {t1, t2, …, tn}, ordered in decreasing order of their scores, i.e. s(t1) > s(t2) > … > s(tn)

Page 40: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Global-Topk in Simple Relation
- Query Evaluation
  - Recursion
    - Pk,s(ti): Global-Topk probability of tuple ti
  - Dynamic Programming
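The recursion itself is not reproduced in this transcript. The sketch below is one straightforward O(kn) dynamic program for the same quantity, not necessarily the exact recursion on the slide: a tuple ti is in the top-k of a possible world exactly when it exists and fewer than k of the higher-scored tuples t1, …, t(i-1) exist. The function name and the `exactly` table are assumptions made for illustration.

```python
def global_topk_probs(probs, k):
    """Global-Topk probability of each tuple in a simple relation.

    probs: existence probabilities of independent tuples, listed in
           decreasing order of score.
    """
    if k <= 0:
        return [0.0] * len(probs)
    # exactly[j] = P(exactly j of the tuples processed so far exist), for j < k
    exactly = [1.0] + [0.0] * (k - 1)
    result = []
    for p in probs:
        # The tuple is in the top-k iff it exists and fewer than k better tuples exist.
        result.append(p * sum(exactly))
        # Fold the current tuple into the counts, highest count first.
        for j in range(k - 1, 0, -1):
            exactly[j] = exactly[j] * (1.0 - p) + exactly[j - 1] * p
        exactly[0] *= 1.0 - p
    return result

# Smart environment example: Aiden, Bob, Chris in decreasing score order, k = 2
topk_probs = global_topk_probs([0.3, 0.9, 0.4], 2)
print([round(p, 3) for p in topk_probs])
```

Each tuple costs O(k) work, so the whole pass is O(kn), in line with the complexity reported for simple relations.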

Page 41: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Optimization
- Threshold Algorithm (TA) [Fagin & Lotem 2001]
  - Given a system of objects, such that
    - For each object attribute, there is a sorted list ranking objects in decreasing order of their score on that attribute
    - An aggregation function f combines individual attribute scores xi, i = 1, 2, …, m, to obtain the overall object score f(x1, x2, …, xm)
    - f is monotonic: f(x1, x2, …, xm) <= f(x'1, x'2, …, x'm) whenever xi <= x'i for every i
  - TA is cost-optimal in finding the top-k objects
  - TA and its variants are widely used in ranking queries, e.g. top-k, skyline, etc.
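For readers unfamiliar with TA, here is a compact sketch of its basic form: sorted access in parallel on every list, random access to complete the score of each newly seen object, and a stop as soon as k seen objects score at least the threshold f(last seen values). It assumes every object appears in every list and that all lists have equal length; the data layout and names are illustrative choices, and recomputing the current top-k at each depth is done for brevity rather than efficiency.

```python
import heapq

def threshold_algorithm(sorted_lists, f, k):
    """sorted_lists: m lists of (object, score), each in decreasing score order.
    f: monotonic aggregation function taking m scores."""
    lookup = [dict(lst) for lst in sorted_lists]   # random access by object
    seen = {}                                      # object -> aggregated score
    for depth in range(len(sorted_lists[0])):
        last = []
        for lst in sorted_lists:
            obj, score = lst[depth]                # sorted access
            last.append(score)
            if obj not in seen:
                seen[obj] = f(*(table[obj] for table in lookup))
        threshold = f(*last)                       # best score any unseen object could have
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

# Hypothetical data: two attributes, aggregation by sum
attr1 = [("x", 0.9), ("y", 0.8), ("z", 0.1)]
attr2 = [("y", 0.9), ("x", 0.7), ("z", 0.2)]
top1 = threshold_algorithm([attr1, attr2], lambda a, b: a + b, 1)
print(top1[0][0])
```

On this data the algorithm stops after two rounds of sorted access without ever reading the entries for z.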

Page 42: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Applying TA Optimization
- Global-Topk
  - Two attributes: probability & score
  - Aggregation function: Global-Topk probability

Page 43: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Global-Topk in General Relation
- Given R^p = <R, p, C>, a scoring function s, a non-negative integer k
- Assumptions
  - Tuples are independent or exclusive, i.e. |C| < |R|
  - R = {t1, t2, …, tn}, ordered in decreasing order of their scores, i.e. s(t1) > s(t2) > … > s(tn)

Page 44: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

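Global-Topk in General Relation
- Induced Event Relation
  - For each tuple t in R, there is a probabilistic relation E^p = <E, pE, CE> generated by the following two rules
    - Rule 1: an event tuple for t itself, with probability p(t)
    - Rule 2: for each part Ci that does not contain t but contains tuples scoring higher than t, an event tuple whose probability is the sum of the probabilities of those higher-scored tuples
  - E^p is simple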

Page 45: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

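Sensor Network Example

Prob. Relation (general):

| Temp (F) | Prob | Part               |
|----------|------|--------------------|
| 22       | 0.6  | C1 (from Sensor 1) |
| 10       | 0.1  | C1 (from Sensor 1) |
| 25       | 0.4  | C2 (from Sensor 2) |
| 15       | 0.6  | C2 (from Sensor 2) |

For example: t = (Temp 15, Prob 0.6), a tuple in C2

Induced Event Relation (simple):

| Event | Prob                            | Generated by        |
|-------|---------------------------------|---------------------|
| teC1  | 0.6 = p(tuple with Temp 22)     | Rule 2, where i = 1 |
| tet   | 0.6 = p(t)                      | Rule 1              |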

Page 46: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Global-Topk in General Relation

Page 47: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Evaluating Global-Topk in General Relation
- For each tuple t, generate the corresponding induced event relation
- Compute the Global-Topk probability of t by Theorem 4.3
- Pick the k tuples with the highest Global-Topk probability
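The three steps can be sketched as follows. Theorem 4.3 is not reproduced in this transcript, so the code relies on a reading of the reduction that should be treated as an assumption: given that t exists, no other tuple of its own part exists, and each other part contributes at most one higher-scored tuple, with probability equal to the total probability of its tuples that outscore t. The function name and the tuple labels are hypothetical.

```python
def global_topk_general(parts, k):
    """Global-Topk probabilities in a general relation (k >= 1).

    parts: list of parts; each part is a list of (label, score, prob) whose
           tuples are mutually exclusive; different parts are independent.
    """
    result = {}
    for idx, part in enumerate(parts):
        for label, score, prob in part:
            # One induced event per other part: "some higher-scored tuple of that part exists".
            events = []
            for jdx, other in enumerate(parts):
                if jdx != idx:
                    q = sum(p for _l, s, p in other if s > score)
                    if q > 0:
                        events.append(q)
            # exactly[j] = P(exactly j of the induced events occur), for j < k
            exactly = [1.0] + [0.0] * (k - 1)
            for q in events:
                for j in range(k - 1, 0, -1):
                    exactly[j] = exactly[j] * (1.0 - q) + exactly[j - 1] * q
                exactly[0] *= 1.0 - q
            result[label] = prob * sum(exactly)
    return result

# Sensor network example, Top-1 query
sensors = [
    [("22F", 22, 0.6), ("10F", 10, 0.1)],   # C1 (from Sensor 1)
    [("25F", 25, 0.4), ("15F", 15, 0.6)],   # C2 (from Sensor 2)
]
top1_probs = global_topk_general(sensors, 1)
warmest = max(top1_probs, key=top1_probs.get)
print(warmest, {label: round(p, 2) for label, p in top1_probs.items()})
```

Under these assumptions the reading of 25 F has the highest Global-Top1 probability, although the reading of 22 F has a higher existence probability.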

Page 48: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Summary on Query Evaluation
- Simple (Independent Tuples)
  - Dynamic Programming
    - Tuples are ordered on their scores
    - Recursion on the tuple index and k
- General (Independent/Exclusive Tuples)
  - Polynomial reduction to simple cases

Page 49: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

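Complexity

|         | Global-Topk | U-Topk                | U-kRanks       |
|---------|-------------|-----------------------|----------------|
| Simple  | O(kn)       | O(kn)                 | O(kn)          |
| General | O(kn^2)     | Θ(m k n^(k-1) lg n)*  | Ω(m n^(k-1))*  |

* m is a rule-engine-related factor; it represents how complicated the relationship between tuples could be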

Page 50: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Outline
- Background
- Motivation Examples
- Top-k Queries in Probabilistic Databases
- Conclusion

Page 51: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Conclusion
- Three intuitive semantic properties for top-k queries in probabilistic databases
- Global-Topk semantics, which satisfies all the properties above
- Query evaluation algorithms for Global-Topk in simple and general probabilistic databases

Page 52: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Future Problems
- Weak order scoring function
  - Allow ties
  - Not clear how to extend properties
  - Not clear how to define the semantics (other than “arbitrary tie breaker”)
- Preference Strength
  - Sensitivity to Score
    - Given a prob. relation R^p, if the DB is sufficiently large, by manipulating the scores of tuples, we would be able to get different answers
    - NOT satisfied by our semantics
    - NOT satisfied by any semantics in literature
  - Need to consider preference strength in the semantics

Page 53: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Thank you !

Page 54: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Related Works
- Introduction to Probabilistic Databases
  - Probabilistic DB Model & Probabilistic Relational Algebra [Fuhr & Rölleke 1997]
- Top-k Query in Probabilistic Databases
  - On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases [Zhang & Chomicki 2008]
  - Alternative Top-k Semantics and Query Evaluation in Probabilistic Databases [Soliman, Ilyas & Chang 2007]