on the semantics and evaluation of top-k queries in probabilistic databases

On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Presented by Xi Zhang

February 8th, 2008

Outline
  Background
    Probabilistic database model
    Top-k queries & scoring functions
  Motivation Examples
  Top-k Queries in Probabilistic Databases
  Conclusion

Probabilistic Databases
  Motivation
    Uncertainty/vagueness/imprecision in data
  History
    Incomplete information in relational DB [Imielinski & Lipski 1984]
    Probabilistic DB model [Cavallo & Pittarelli 1987]
    Probabilistic Relational Algebra [Fuhr & Rölleke 1997 etc.]
  Comeback
    Uncertain data flourishes in real-world applications
    Examples: WWW, biological data, sensor networks, etc.

Probabilistic Database Model [Fuhr & Rölleke 1997]
  Probabilistic Database Model
    A generalization of relational DB
  Probabilistic Relational Algebra (PRA)
    A generalization of standard relational algebra

A Table in Probabilistic Database

DocTerm:
  DocNo  Term  Prob  Basic Event
  1      IR    0.9   eDT(1, IR)
  2      DB    0.7   eDT(2, DB)
  3      IR    0.8   eDT(3, IR)
  3      DB    0.5   eDT(3, DB)
  4      AI    0.8   eDT(4, AI)

  The Basic Event column holds each tuple's event expression.
  Basic events are independent events.

Probabilistic Relational Algebra
  Just like in Relational Algebra…
    Selection
    Projection
    Join
    Union
    Difference (-)

Selection
  Input: DocTerm (the base table above)
  Result (the tuples with Term = IR):
    DocNo  Term  Prob  Complex Event
    1      IR    0.9   eDT(1, IR)
    3      IR    0.8   eDT(3, IR)
  In a derived table, a complex event is a propositional expression of basic events.

Projection
  Input: DocTerm (the base table above)
  Result (projection on Term):
    Term  Prob  Complex Event
    IR    0.98  eDT(1, IR) ∨ eDT(3, IR)
    DB    0.85  eDT(2, DB) ∨ eDT(3, DB)
    AI    0.80  eDT(4, AI)
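The projection probabilities can be checked with a short sketch. It assumes, as the slides state, that basic events are independent, so a disjunction of distinct basic events has probability 1 minus the product of the complements. The names `disjunction_probability` and `project_on_term` are illustrative and do not come from the original deck.

```python
from math import prod

# DocTerm rows as (DocNo, Term, Prob), copied from the base table.
doc_term = [(1, "IR", 0.9), (2, "DB", 0.7), (3, "IR", 0.8), (3, "DB", 0.5), (4, "AI", 0.8)]

def disjunction_probability(probs):
    """Probability that at least one of several independent events occurs."""
    return 1.0 - prod(1.0 - p for p in probs)

def project_on_term(rows):
    """Group rows by Term and combine the grouped basic events with OR."""
    by_term = {}
    for _doc_no, term, prob in rows:
        by_term.setdefault(term, []).append(prob)
    return {term: disjunction_probability(ps) for term, ps in by_term.items()}

projected = project_on_term(doc_term)
for term, prob in projected.items():
    print(term, round(prob, 2))
```

For IR this evaluates 1 - (1 - 0.9)(1 - 0.8) = 0.98, which matches the projected table.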

Join

DocTerm:
  DocNo  Term  Prob  Basic Event
  1      IR    0.9   eDT(1, IR)
  2      DB    0.7   eDT(2, DB)

DocAu:
  DocNo  AName  Prob  Basic Event
  1      Bauer  0.9   eDU(1, Bauer)
  2      Meier  0.8   eDU(2, Meier)

Result:
  DocAu.DocNo  AName  DocTerm.DocNo  Term  Prob       Complex Event
  1            Bauer  1              IR    0.9 * 0.9  eDU(1, Bauer) ∧ eDT(1, IR)
  1            Bauer  2              DB    0.9 * 0.7  eDU(1, Bauer) ∧ eDT(2, DB)
  2            Meier  1              IR    0.8 * 0.9  eDU(2, Meier) ∧ eDT(1, IR)
  2            Meier  2              DB    0.8 * 0.7  eDU(2, Meier) ∧ eDT(2, DB)

Join + Projection
  Input: DocTerm (the base table above) and DocAu

IR:
  DocNo  Prob  Complex Event
  1      0.9   eDT(1, IR)
  3      0.8   eDT(3, IR)

DB:
  DocNo  Prob  Complex Event
  2      0.7   eDT(2, DB)
  3      0.5   eDT(3, DB)

DocAu:
  DocNo  AName    Prob  Basic Event
  1      Bauer    0.9   eDU(1, Bauer)
  2      Bauer    0.3   eDU(2, Bauer)
  2      Meier    0.9   eDU(2, Meier)
  2      Schmidt  0.8   eDU(2, Schmidt)
  3      Schmidt  0.7   eDU(3, Schmidt)
  4      Koch     0.9   eDU(4, Koch)
  4      Bauer    0.6   eDU(4, Bauer)

Join of IR and DocAu, projected on AName:
  AName    Prob  Complex Event
  Bauer    0.81  eDU(1, B) ∧ eDT(1, IR)
  Schmidt  0.56  eDU(3, S) ∧ eDT(3, IR)

Join of DB and DocAu, projected on AName:
  AName    Prob  Complex Event
  Bauer    0.21  eDU(2, B) ∧ eDT(2, DB)
  Meier    0.63  eDU(2, M) ∧ eDT(2, DB)
  Schmidt  0.91  (eDU(2, S) ∧ eDT(2, DB)) ∨ (eDU(3, S) ∧ eDT(3, DB))

Join of the two results on AName:
  AName    Prob                    Complex Event
  Bauer    0.81 * 0.21 = 0.1701    (eDU(1, B) ∧ eDT(1, IR)) ∧ (eDU(2, B) ∧ eDT(2, DB))
  Schmidt  0.56 * 0.91 = 0.5096    (eDU(3, S) ∧ eDT(3, IR)) ∧ ((eDU(2, S) ∧ eDT(2, DB)) ∨ (eDU(3, S) ∧ eDT(3, DB)))

  Probability of Schmidt's tuple computed from its event expression: 0.4368


Join + Projection: Intensional Semantics vs. Extensional Semantics

Intensional vs. Extensional
  Intensional Semantics
    Assumes data independence of base tables
    Keeps track of data dependence during the evaluation
  Extensional Semantics
    Assumes data independence during the evaluation
    Could be WRONG with probability computation!
  When Intensional = Extensional?
    No identical underlying basic events in the event expression

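  AName    Prob (extensional)
  Bauer    0.81 * 0.21 = 0.1701
  Schmidt  0.56 * 0.91 = 0.5096

  Identical basic event: eDU(3, S) occurs twice in the complex event of Schmidt's tuple, so the extensional product is not its true probability.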
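A small sketch can make the gap between the two semantics concrete for Schmidt's tuple. It evaluates the event expression intensionally by enumerating all truth assignments of the independent basic events, and compares the result with the extensional product taken from the slide. The short keys such as `DU3S` and the function names are illustrative assumptions, not part of the original deck.

```python
from itertools import product

# Basic-event probabilities copied from the DocTerm and DocAu tables.
basic = {
    "DU3S": 0.7,   # eDU(3, Schmidt)
    "DT3IR": 0.8,  # eDT(3, IR)
    "DU2S": 0.8,   # eDU(2, Schmidt)
    "DT2DB": 0.7,  # eDT(2, DB)
    "DT3DB": 0.5,  # eDT(3, DB)
}

def schmidt_event(w):
    """Complex event of Schmidt's tuple; note that DU3S appears twice."""
    return (w["DU3S"] and w["DT3IR"]) and (
        (w["DU2S"] and w["DT2DB"]) or (w["DU3S"] and w["DT3DB"])
    )

def intensional_probability(event, probs):
    """Sum the weights of all truth assignments that satisfy the event."""
    names = list(probs)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if event(world):
            weight = 1.0
            for name in names:
                weight *= probs[name] if world[name] else 1.0 - probs[name]
            total += weight
    return total

intensional = intensional_probability(schmidt_event, basic)
extensional = 0.56 * 0.91  # product of the two input tuple probabilities, as on the slide
print(round(intensional, 4), round(extensional, 4))
```

The enumeration is exponential in the number of basic events, so it only serves as a check on small examples like this one.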

Fuhr & Rölleke 1997 Summary
  Probabilistic DB Model
    Concept of event
    Basic vs. complex event
    Event expression
  Probabilistic Relational Algebra
    Just like in Relational Algebra…
  Computation of event probabilities
    Intensional vs. extensional semantics
    Yield the same result when there is NO data dependence in event expressions

Outline
  Background
    Probabilistic database model
    Top-k queries & scoring functions
  Motivation Examples
  Top-k Queries in Probabilistic Databases
    Semantics
    Query Evaluation
  Conclusion

Top-k Queries
  Traditionally, given
    Objects: o1, o2, …, on
    A non-negative integer: k
    A scoring function s that assigns a real-valued score to each object
  Question: What are the k objects with the highest score?
  Top-k queries have been studied in Web, XML, and relational databases, and more recently in probabilistic databases.

Scoring Function
  A scoring function s over a deterministic relation R is a function from R to the real numbers, s: R → ℝ
  For any ti and tj from R, ti is preferred to tj if and only if s(ti) > s(tj)

Outline
  Background
  Motivation Examples
    Smart Environment Example
    Sensor Network Example
  Top-k Queries in Probabilistic Databases
  Conclusion

Motivating Example I
  Smart Environment
    Sample Question
      “Who were the two visitors in the lab last Saturday night?”
    Data
      Biometric data from sensors
        We can see how well the data match the profile of every candidate: a scoring function
      Historical statistics
        e.g. the probability of a certain candidate being in the lab on Saturday nights

Motivating Example I (cont.)

  Personnel  Biometrics score(Face Detection, Voice Detection, …)  Probability of being in lab on Saturday nights
  Aiden      score(0.70, 0.60, …) = 0.65                            0.3
  Bob        score(0.50, 0.60, …) = 0.55                            0.9
  Chris      score(0.50, 0.40, …) = 0.45                            0.4

  Question: Find two people in the lab last Saturday night
    A Top-2 query over the above probabilistic database under the above scoring function

Motivating Example II
  Sensor Network in a Habitat
    Sample Question
      “What is the temperature of the warmest spot?”
    Data
      Sensor readings from different sensors
      At a sampling time, only one “real” reading from a sensor
      Each sensor reading comes with a confidence value

Motivating Example II (cont.)

  Part                Temp (F)  Prob
  C1 (from Sensor 1)  22        0.6
                      10        0.1
  C2 (from Sensor 2)  25        0.4
                      15        0.6

  Question: What is the temperature of the warmest spot?
    A Top-1 query over the above probabilistic database under a scoring function proportional to temperature

Outline
  Background
  Motivation Examples
  Top-k Queries in Probabilistic Databases
    Semantics
    Query Evaluation
  Conclusion

Models
  A probabilistic relation Rp = <R, p, C>
    R: the support deterministic relation
    p: probability function
    C: a partition of R, such that the probabilities of the tuples in each part sum to at most 1
  Simple vs. general probabilistic relation
    Simple
      Assume tuple independence, i.e. |C| = |R|
      E.g. smart environment example
    General
      Tuples can be independent or exclusive, i.e. |C| < |R|
      E.g. sensor network example

Challenges
  Given
    A probabilistic relation Rp = <R, p, C>
    An injective scoring function s over R (no ties)
    A non-negative integer k
  What is the top-k answer set over Rp? (Semantics)
  How to compute the top-k answer of Rp? (Query Evaluation)

What is a “Good” Semantics?
  Desired Properties
    Exact-k
    Faithfulness
    Stability

Properties
  Exact-k
    If R has at least k tuples, then exactly k tuples are returned as the top-k answer
  Faithfulness
    A “better” tuple, i.e. one higher in both score and probability, is more likely to be in the top-k answer than a “worse” one
  Stability
    Raising the score/prob. of a winning tuple will not cause it to lose
    Lowering the score/prob. of a losing tuple will not cause it to win

Global-Topk Semantics
  Given
    A probabilistic relation Rp, an injective scoring function s over R, a non-negative integer k
  What is the top-k answer set over Rp? (Semantics)
  Global-Topk
    Return the k highest-ranked tuples according to their probability of being in the top-k answers of possible worlds
    Global-Topk satisfies all three aforementioned properties

Smart Environment Example

  Personnel  Biometrics score(Face Detection, Voice Detection, …)  Prob.
  Aiden      score(0.70, 0.60, …) = 0.65                            0.3
  Bob        score(0.50, 0.60, …) = 0.55                            0.9
  Chris      score(0.50, 0.40, …) = 0.45                            0.4

  Query: Find two people in the lab last Saturday night

  Possible worlds:
    World                 Prob   Top-2
    {}                    0.042  {}
    {Aiden}               0.018  {Aiden}
    {Bob}                 0.378  {Bob}
    {Chris}               0.028  {Chris}
    {Aiden, Bob}          0.162  {Aiden, Bob}
    {Aiden, Bob, Chris}   0.108  {Aiden, Bob}
    {Aiden, Chris}        0.012  {Aiden, Chris}
    {Bob, Chris}          0.252  {Bob, Chris}

  Global-Topk Semantics:
    Pr(Aiden in top-2) = 0.3
    Pr(Bob in top-2) = 0.9
    Pr(Chris in top-2) = 0.028 + 0.012 + 0.252 = 0.292
    Top-2 Answer: {Aiden, Bob}
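The Global-Topk probabilities of this example can be reproduced by brute force. The sketch below enumerates every possible world of a simple relation (independent tuples), takes the top-k tuples of each world by score, and accumulates the world probability for every tuple that makes the cut. It is exponential in the number of tuples and only meant to illustrate the semantics; the function names are illustrative, not taken from the talk.

```python
from itertools import product

def possible_worlds(tuples):
    """Yield (world, probability) for independent tuples given as (name, score, prob)."""
    for present in product([True, False], repeat=len(tuples)):
        weight = 1.0
        world = []
        for (name, score, prob), keep in zip(tuples, present):
            weight *= prob if keep else 1.0 - prob
            if keep:
                world.append((name, score))
        yield world, weight

def global_topk_probabilities(tuples, k):
    """Probability of each tuple being among the k highest-scored tuples of a world."""
    totals = {name: 0.0 for name, _, _ in tuples}
    for world, weight in possible_worlds(tuples):
        ranked = sorted(world, key=lambda t: t[1], reverse=True)
        for name, _ in ranked[:k]:
            totals[name] += weight
    return totals

def global_topk(tuples, k):
    """Return the k tuples with the highest Global-Topk probability."""
    probs = global_topk_probabilities(tuples, k)
    return sorted(probs, key=probs.get, reverse=True)[:k]

lab = [("Aiden", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]
probs = global_topk_probabilities(lab, 2)
answer = global_topk(lab, 2)
print({name: round(p, 3) for name, p in probs.items()}, answer)
```

Aiden beats Chris only narrowly (0.3 against 0.292), which is why the answer set is {Aiden, Bob}.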

Other Semantics
  Soliman, Ilyas & Chang 2007
    Two Alternative Semantics
      U-Topk
      U-kRanks

U-Topk Semantics
  Given
    A probabilistic relation Rp, an injective scoring function s over R, a non-negative integer k
  What is the top-k answer set over Rp? (Semantics)
  U-Topk
    Return the most probable top-k answer set that belongs to possible worlds
    U-Topk does not satisfy all three properties


U-Topk on the Smart Environment Example
  Same relation, query, and possible worlds as in the Smart Environment Example.

  U-Topk Semantics:
    Pr({Aiden, Bob}) = 0.162 + 0.108 = 0.27
    …
    Pr({Bob}) = 0.378
    Top-2 Answer: {Bob}
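A brute-force sketch of U-Topk for the same example follows. It groups possible worlds by their top-k answer, sums the probabilities per answer, and returns the most probable one. This is an illustration of the semantics for a simple relation under the stated independence assumption, not the evaluation algorithm of Soliman, Ilyas & Chang; `u_topk` is an illustrative name.

```python
from collections import defaultdict
from itertools import product

def u_topk(tuples, k):
    """Return the most probable top-k answer and the probability of every answer.

    tuples: independent tuples given as (name, score, prob).
    """
    answer_prob = defaultdict(float)
    for present in product([True, False], repeat=len(tuples)):
        weight = 1.0
        world = []
        for (name, score, prob), keep in zip(tuples, present):
            weight *= prob if keep else 1.0 - prob
            if keep:
                world.append((score, name))
        # Top-k answer of this world, listed from highest to lowest score.
        top = tuple(name for _, name in sorted(world, reverse=True)[:k])
        answer_prob[top] += weight
    best = max(answer_prob, key=answer_prob.get)
    return best, dict(answer_prob)

lab = [("Aiden", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]
best, table = u_topk(lab, 2)
print(best, round(table[best], 3))
```

The winning answer contains a single tuple even though k = 2, which illustrates why U-Topk fails the Exact-k property.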

U-kRanks Semantics
  Given
    A probabilistic relation Rp, an injective scoring function s over R, a non-negative integer k
  What is the top-k answer set over Rp? (Semantics)
  U-kRanks
    For i = 1, 2, …, k, return the most probable ith-ranked tuple across all possible worlds
    U-kRanks does not satisfy all three properties


U-kRanks on the Smart Environment Example
  Same relation, query, and possible worlds as in the Smart Environment Example.

  U-kRanks Semantics:
    e.g. Pr(Chris at rank-2) = 0.012 + 0.252 = 0.264

           Rank-1  Rank-2
    Aiden  0.3     0
    Bob    0.63    0.27
    Chris  0.028   0.264

    Bob is highest at rank-1 and highest at rank-2
    Top-2 Answer: {Bob}
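The rank table above can be recomputed with a brute-force sketch, again assuming a simple relation with independent tuples. For each possible world it records which tuple sits at each rank, then picks the most probable tuple per rank. The function names are illustrative and this is not the evaluation algorithm from the U-kRanks paper.

```python
from itertools import product

def rank_probabilities(tuples, k):
    """table[name][i] = probability that the tuple is ranked (i + 1)-th in a world."""
    table = {name: [0.0] * k for name, _, _ in tuples}
    for present in product([True, False], repeat=len(tuples)):
        weight = 1.0
        world = []
        for (name, score, prob), keep in zip(tuples, present):
            weight *= prob if keep else 1.0 - prob
            if keep:
                world.append((score, name))
        for rank, (_, name) in enumerate(sorted(world, reverse=True)[:k]):
            table[name][rank] += weight
    return table

def u_kranks(tuples, k):
    """For each rank 1..k, return the tuple most likely to hold that rank."""
    table = rank_probabilities(tuples, k)
    return [max(table, key=lambda name: table[name][i]) for i in range(k)]

lab = [("Aiden", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]
ranks = rank_probabilities(lab, 2)
winners = u_kranks(lab, 2)
print(winners)
```

Because the same tuple can win several ranks, the answer may contain fewer than k distinct tuples, as it does here.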

Properties

  Semantics    Exact-k  Faithfulness  Stability
  Global-Topk  Yes      Yes           Yes
  U-Topk       No       Yes/No*       Yes
  U-kRanks     No       No            No

  * Yes when the relation is simple, No otherwise

  Global-Topk: a better semantics

Challenges
  Given
    A probabilistic relation Rp, an injective scoring function s over R, a non-negative integer k
  What is the top-k answer set over Rp? (Semantics)
  How to compute the top-k answer of Rp? (Query Evaluation)
  Global-Topk

Global-Topk in Simple Relation
  Given Rp = <R, p, C>, a scoring function s, a non-negative integer k
  Assumptions
    Tuples are independent, i.e. |C| = |R|
    R = {t1, t2, …, tn}, ordered in the decreasing order of their scores, i.e. s(t1) > s(t2) > … > s(tn)

Global-Topk in Simple Relation
  Query Evaluation
    Recursion
      Pk,s(ti): Global-Topk probability of tuple ti
    Dynamic Programming
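The slide's recursion did not survive extraction, so the following is a sketch of one standard O(kn) dynamic program for the same quantity rather than the talk's exact formulation. It relies on the observation that, with independent tuples sorted by decreasing score, ti is in the top-k exactly when ti exists and fewer than k of t1, …, t(i-1) exist. The function name and the `count` array are illustrative assumptions.

```python
def global_topk_probabilities_dp(probs, k):
    """Global-Topk probability of every tuple in a simple relation.

    probs[i] is the probability of the (i + 1)-th tuple, with tuples
    already sorted by decreasing score. Runs in O(k * n) time.
    """
    if k <= 0:
        return [0.0] * len(probs)
    # count[j] = probability that exactly j of the tuples seen so far exist,
    # tracked only for j < k; larger counts can never lead to a top-k slot.
    count = [1.0] + [0.0] * (k - 1)
    result = []
    for p in probs:
        # The tuple is in the top-k iff it exists and fewer than k
        # higher-scored tuples exist.
        result.append(p * sum(count))
        for j in range(k - 1, 0, -1):
            count[j] = count[j] * (1.0 - p) + count[j - 1] * p
        count[0] *= 1.0 - p
    return result

# Smart environment example: Aiden, Bob, Chris in decreasing score order.
dp_probs = global_topk_probabilities_dp([0.3, 0.9, 0.4], 2)
print([round(x, 3) for x in dp_probs])
```

On the smart environment example this yields 0.3, 0.9 and 0.292, the same values obtained from the possible worlds.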

Optimization
  Threshold Algorithm (TA) [Fagin, Lotem & Naor 2001]
    Given a system of objects, such that
      For each object attribute, there is a sorted list ranking objects in the decreasing order of their score on that attribute
      An aggregation function f combines individual attribute scores xi, i = 1, 2, …, m, to obtain the overall object score f(x1, x2, …, xm)
      f is monotonic: f(x1, x2, …, xm) <= f(x'1, x'2, …, x'm) whenever xi <= x'i for every i
    TA is cost-optimal in finding the top-k objects
    TA and its variants are widely used in ranking queries, e.g. top-k, skyline, etc.

Applying TA Optimization
  Global-Topk
    Two attributes: probability & score
    Aggregation function: Global-Topk probability

Global-Topk in General Relation
  Given Rp = <R, p, C>, a scoring function s, a non-negative integer k
  Assumptions
    Tuples are independent or exclusive, i.e. |C| < |R|
    R = {t1, t2, …, tn}, ordered in the decreasing order of their scores, i.e. s(t1) > s(t2) > … > s(tn)

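Global-Topk in General Relation
  Induced Event Relation
    For each tuple t in R, there is a probabilistic relation Ep = <E, pE, CE> generated by the following two rules
      Rule 1: the event t_et, with probability p(t)
      Rule 2: for each part Ci that does not contain t, the event t_eCi, with probability equal to the sum of p(t') over the tuples t' in Ci that score higher than t
    Ep is simple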

Sensor Network Example

  Prob. Relation (general):
    Part                Temp (F)  Prob
    C1 (from Sensor 1)  22        0.6
                        10        0.1
    C2 (from Sensor 2)  25        0.4
                        15        0.6

  For example, for t = (Temp 15, Prob 0.6):

  Induced Event Relation (simple), generated by Rule 1 and Rule 2:
    Event   Prob
    t_eC1   0.6 = p(22), the tuple of Ci that outscores t, where i = 1
    t_et    0.6 = p(t)

Global-Topk in General Relation
  Evaluating Global-Topk in General Relation
    For each tuple t, generate the corresponding induced event relation
    Compute the Global-Topk probability of t by Theorem 4.3
    Pick the k tuples with the highest Global-Topk probability
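The three steps can be sketched as follows, under the assumption that tuples in the same part are mutually exclusive and different parts are independent. For each tuple the code collapses every other part into a single event "some higher-scored tuple of that part exists", then counts how many such events occur with the same kind of dynamic program used for simple relations. Theorem 4.3 is not reproduced on the slides, so this is an illustration of the reduction idea rather than the paper's exact statement; `global_topk_general` and the input layout are hypothetical.

```python
def global_topk_general(parts, k):
    """Global-Topk probabilities in a general relation, assuming k >= 1.

    parts: list of parts; each part is a list of (label, score, prob)
    for mutually exclusive tuples. Different parts are independent.
    """
    result = {}
    for idx, part in enumerate(parts):
        for label, score, prob in part:
            # One induced event per other part: a higher-scored tuple of it exists.
            events = [
                sum(p for _, s, p in other if s > score)
                for j, other in enumerate(parts)
                if j != idx
            ]
            # count[j] = probability that exactly j induced events occur (j < k).
            count = [1.0] + [0.0] * (k - 1)
            for q in events:
                for j in range(k - 1, 0, -1):
                    count[j] = count[j] * (1.0 - q) + count[j - 1] * q
                count[0] *= 1.0 - q
            # t is in the top-k iff t exists and fewer than k events occur.
            result[label] = prob * sum(count)
    return result

# Sensor network example; labels are the temperature readings.
sensors = [
    [(22, 22, 0.6), (10, 10, 0.1)],  # C1 (from Sensor 1)
    [(25, 25, 0.4), (15, 15, 0.6)],  # C2 (from Sensor 2)
]
top1 = global_topk_general(sensors, 1)
warmest = max(top1, key=top1.get)
print(warmest, {t: round(p, 2) for t, p in top1.items()})
```

For the reading of 15 the single induced event has probability 0.6, as in the example above, which gives a Top-1 probability of 0.6 * (1 - 0.6) = 0.24.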

Summary on Query Evaluation
  Simple (Independent Tuples)
    Dynamic Programming
      Tuples are ordered on their scores
      Recursion on the tuple index and k
  General (Independent/Exclusive Tuples)
    Polynomial reduction to simple cases

Complexity

            Global-Topk  U-Topk                 U-kRanks
  Simple    O(kn)        O(kn)                  O(kn)
  General   O(kn^2)      Θ(m k n^(k-1) lg n)*   Ω(m n^(k-1))*

  * m is a rule-engine-related factor; it represents how complicated the relationship between tuples can be

Outline
  Background
  Motivation Examples
  Top-k Queries in Probabilistic Databases
  Conclusion

Conclusion
  Three intuitive semantic properties for top-k queries in probabilistic databases
  Global-Topk semantics, which satisfies all the properties above
  Query evaluation algorithms for Global-Topk in simple and general probabilistic databases

Future Problems
  Weak order scoring function
    Allow ties
    Not clear how to extend properties
    Not clear how to define the semantics (other than “arbitrary tie breaker”)
  Preference Strength
    Sensitivity to Score
      Given a prob. relation Rp, if the DB is sufficiently large, by manipulating the scores of tuples, we would be able to get different answers
      NOT satisfied by our semantics
      NOT satisfied by any semantics in the literature
    Need to consider preference strength in the semantics

Thank you !

Related Works
  Introduction to Probabilistic Databases
    Probabilistic DB Model & Probabilistic Relational Algebra [Fuhr & Rölleke 1997]
  Top-k Queries in Probabilistic Databases
    On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases [Zhang & Chomicki 2008]
    Alternative Top-k Semantics and Query Evaluation in Probabilistic Databases [Soliman, Ilyas & Chang 2007]
