
Probabilistic Ranking of Database Query Results

Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik

Vagelis Hristidis, VLDB 2004

Roadmap

Problem Definition
Architecture
Probabilistic Information Retrieval
Performance
Experiments
Related Work
Conclusion


Motivation

SQL Returns Unordered Sets of Results

Overwhelms Users of Information Discovery Applications

How Can Ranking be Introduced, Given that ALL Results Satisfy Query?


Example – Realtor Database

House Attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year

Query: City = 'Seattle' AND Waterfront = TRUE

Too Many Results!

Intuitively, Houses with lower Price, more Bedrooms, or a BoatDock are generally preferable


Rank According to Unspecified Attributes

Score of a Result Tuple t depends on:

Global Score: Global Importance of Unspecified Attribute Values [CIDR2003]
E.g., Newer Houses are generally preferred

Conditional Score: Correlations between Specified and Unspecified Attribute Values
E.g., Waterfront => BoatDock
Many Bedrooms => Good School District


Key Problems

Given a Query Q, How to Combine the Global and Conditional Scores into a Ranking Function?
Use Probabilistic Information Retrieval (PIR).

How to Calculate the Global and Conditional Scores?
Use Query Workload and Data.





Architecture





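PIR Review

Bayes' Rule:

$$p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}$$

Product Rule:

$$p(a, b \mid c) = p(a \mid c)\, p(b \mid a, c)$$

Document (Tuple) t, Query Q
R: Relevant Documents
$\bar{R} = D - R$: Irrelevant Documents

$$Score(t) = \frac{p(R \mid t)}{p(\bar{R} \mid t)} = \frac{p(t \mid R)\, p(R) / p(t)}{p(t \mid \bar{R})\, p(\bar{R}) / p(t)} \propto \frac{p(t \mid R)}{p(t \mid \bar{R})}$$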


Ranking Function – Adapt PIR

Query Q: X1=x1 AND … AND Xs=xs, X ={X1, …, Xs}

Result-Tuple t(X,Y), where X: Specified Attributes, Y: Unspecified Attributes

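$$Score(t) \propto \frac{p(t \mid R)}{p(t \mid D)} = \frac{p(X, Y \mid R)}{p(X, Y \mid D)} = \frac{p(X \mid R)\, p(Y \mid X, R)}{p(X \mid D)\, p(Y \mid X, D)} \propto \frac{p(Y \mid R)}{p(Y \mid X, D)}$$

$\bar{R} \approx D$, so D replaces $\bar{R}$ in the denominator.
p(X | R) and p(X | D) are common to all result tuples, so they drop out of the ranking.
R satisfies X, so p(Y | X, R) = p(Y | R).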


Ranking Function – Limited Conditional Independence

Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed.

$$p(X \mid C) = \prod_{x \in X} p(x \mid C), \quad C \text{ involves } Y, R, D$$

$$p(Y \mid C) = \prod_{y \in Y} p(y \mid C), \quad C \text{ involves } X, R, D$$

$$Score(t) \propto \prod_{y \in Y} \frac{p(y \mid R)}{p(y \mid D)} \prod_{y \in Y} \prod_{x \in X} \frac{1}{p(x \mid y, D)}$$

Use Workload: p(y | R)
Use Data: p(y | D), p(x | y, D)
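The step from the ratio p(Y | R) / p(Y | X, D) to the product form can be spelled out as follows. This is a sketch that relies only on Bayes' rule and the limited conditional independence assumption; it treats p(X | D) as a constant because it is the same for every result tuple of a fixed query.

$$
\begin{aligned}
p(Y \mid R) &= \prod_{y \in Y} p(y \mid R) \\
p(Y \mid X, D) &= \prod_{y \in Y} p(y \mid X, D)
  = \prod_{y \in Y} \frac{p(y \mid D)\, p(X \mid y, D)}{p(X \mid D)}
  \propto \prod_{y \in Y} p(y \mid D) \prod_{x \in X} p(x \mid y, D) \\
Score(t) &\propto \frac{p(Y \mid R)}{p(Y \mid X, D)}
  \propto \prod_{y \in Y} \frac{p(y \mid R)}{p(y \mid D)} \prod_{y \in Y} \prod_{x \in X} \frac{1}{p(x \mid y, D)}
\end{aligned}
$$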


Atomic Probabilities Estimation Using Workload W

If Many Queries Specify Set X of Conditions then there is Preference Correlation between Attributes in X.

Global: E.g., If Many Queries ask for Waterfront then p(Waterfront=TRUE) is high.

Conditional: E.g., If Many Queries ask for 4-Bedroom Houses in Good School Districts, then p(Bedrooms=4 | SchoolDistrict='good'), p(SchoolDistrict='good' | Bedrooms=4) are high.

Using Limited Conditional Independence:

$$p(y \mid R) \approx p(y \mid X, W) \propto p(y \mid W) \prod_{x \in X} p(x \mid y, W)$$

Global Part: p(y | W)
Conditional Part: $\prod_{x \in X} p(x \mid y, W)$

Probabilities p(x | y, W) (p(x | y, D)) Calculated Using Standard Association Rule Mining Techniques on W (D)
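As an illustration of how the workload-based atomic probabilities could be obtained, the sketch below estimates p(y | W) and p(x | y, W) by plain frequency counting over a list of past queries. The function name `atomic_probabilities`, the dict-based query representation, and the absence of smoothing are assumptions made for this example, not details taken from the slides.

```python
from collections import Counter
from itertools import permutations

def atomic_probabilities(workload):
    """Estimate p(y | W) and p(x | y, W) by counting over a list of queries.

    Each query is a dict {attribute: value}; one (attribute, value) pair
    is one condition. No smoothing is applied (an assumption of this sketch).
    """
    single = Counter()  # number of queries that contain condition y
    pair = Counter()    # number of queries that contain both x and y
    for query in workload:
        conds = set(query.items())
        single.update(conds)
        pair.update(permutations(conds, 2))  # ordered pairs (x, y), x != y
    n = len(workload)
    p_global = {y: c / n for y, c in single.items()}
    p_cond = {(x, y): c / single[y] for (x, y), c in pair.items()}
    return p_global, p_cond

# Hypothetical workload for the realtor example.
workload = [
    {"Bedrooms": 4, "SchoolDistrict": "good"},
    {"Bedrooms": 4, "SchoolDistrict": "good"},
    {"Bedrooms": 4},
    {"Waterfront": True},
]
p_global, p_cond = atomic_probabilities(workload)
bed4 = ("Bedrooms", 4)
good = ("SchoolDistrict", "good")
print(p_global[bed4])        # share of queries asking for 4 bedrooms
print(p_cond[(good, bed4)])  # p(SchoolDistrict='good' | Bedrooms=4, W)
```

The same counting applied to the rows of the database instead of the queries would give the data-based quantities p(y | D) and p(x | y, D).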





Performance


Scan Algorithm

Preprocessing - Atomic Probabilities Module
Computes and Indexes the Quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for All Distinct Values x and y

Execution
Select Tuples that Satisfy the Query
Scan and Compute Score for Each Result-Tuple
Return Top-K Tuples
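The execution phase of Scan can be sketched as below. The score of a tuple is taken to be the product, over its unspecified values y, of p(y | W) / p(y | D), times the product over all pairs (x, y) of p(x | y, W) / p(x | y, D). The function names, the dict-based tables, and the sample probabilities are all hypothetical; the sketch also assumes every needed probability is present and non-zero.

```python
import heapq
from math import prod

def score(row, query, pW, pD, pxyW, pxyD):
    """Product of global and conditional ratios over the unspecified values of row."""
    X = list(query.items())                                   # specified conditions
    Y = [(a, v) for a, v in row.items() if a not in query]    # unspecified values
    glob = prod(pW[y] / pD[y] for y in Y)
    cond = prod(pxyW[(x, y)] / pxyD[(x, y)] for y in Y for x in X)
    return glob * cond

def scan_top_k(rows, query, k, pW, pD, pxyW, pxyD):
    """Select tuples satisfying the query, score each one, return the k best."""
    selected = [r for r in rows if all(r.get(a) == v for a, v in query.items())]
    return heapq.nlargest(
        k, selected, key=lambda r: score(r, query, pW, pD, pxyW, pxyD)
    )

# Hypothetical data: one specified attribute, one unspecified attribute.
rows = [
    {"Waterfront": True, "BoatDock": True},
    {"Waterfront": True, "BoatDock": False},
    {"Waterfront": False, "BoatDock": True},
]
query = {"Waterfront": True}
x = ("Waterfront", True)
dock, no_dock = ("BoatDock", True), ("BoatDock", False)
pW = {dock: 0.6, no_dock: 0.1}
pD = {dock: 0.3, no_dock: 0.7}
pxyW = {(x, dock): 0.8, (x, no_dock): 0.2}
pxyD = {(x, dock): 0.5, (x, no_dock): 0.1}

top = scan_top_k(rows, query, 1, pW, pD, pxyW, pxyD)
print(top)  # the waterfront house with a boat dock ranks first
```

Because every selected tuple is scored, the cost of this approach grows with the number of tuples that satisfy the query.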


List Merge Algorithm

Preprocessing
For Each Distinct Value x of Database, Calculate and Store the Conditional (Cx) and the Global (Gx) Lists as follows:

For Each Tuple t Containing x Calculate

$$CondScore = \prod_{z \in t} \frac{p(x \mid z, W)}{p(x \mid z, D)}$$

$$GlobScore = \prod_{z \in t} \frac{p(z \mid W)}{p(z \mid D)}$$

and add to Cx and Gx respectively
Sort Cx, Gx by decreasing scores

Execution
Query Q: X1=x1 AND … AND Xs=xs
Execute Threshold Algorithm [Fag01] on the following lists: Cx1,…,Cxs, and Gxb, where Gxb is the shortest list among Gx1,…,Gxs

Final Formula:

$$Score(t) \propto \prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)} \prod_{y \in Y} \prod_{x \in X} \frac{p(x \mid y, W)}{p(x \mid y, D)}$$
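The execution step can be illustrated with a simplified Threshold Algorithm over score lists sorted in decreasing order, using the product of per-list scores as the aggregate. This is a minimal sketch under several assumptions: lists are in-memory Python lists of (tuple_id, score) pairs, random access is a dict lookup, scores are non-negative, and tuples missing from any list are skipped because they cannot satisfy all query conditions. The list contents and the name `threshold_top_k` are hypothetical.

```python
import heapq
from math import prod

def threshold_top_k(lists, k):
    """Top-k by product of scores across lists, with early termination.

    lists: non-empty sequence of lists of (tuple_id, score), each sorted
    by decreasing score. Returns [(total_score, tuple_id), ...], best first.
    """
    lookup = [dict(lst) for lst in lists]  # random access: tuple_id -> score
    seen, top = set(), []                  # top is a min-heap of (score, tuple_id)
    # Once the shortest list is exhausted, every tuple present in all lists was seen.
    for depth in range(min(len(lst) for lst in lists)):
        for lst in lists:                  # sorted access, one entry per list
            tid = lst[depth][0]
            if tid in seen:
                continue
            seen.add(tid)
            if all(tid in d for d in lookup):
                heapq.heappush(top, (prod(d[tid] for d in lookup), tid))
                if len(top) > k:
                    heapq.heappop(top)     # drop the current worst candidate
        # No unseen tuple can score higher than the product of the last seen scores.
        threshold = prod(lst[depth][1] for lst in lists)
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True)

# Hypothetical lists for a query with two specified values x1 and x2.
C_x1 = [("t1", 2.0), ("t2", 1.5), ("t3", 0.5)]
C_x2 = [("t2", 3.0), ("t1", 1.0), ("t3", 0.2)]
G_xb = [("t3", 4.0), ("t1", 2.0), ("t2", 1.0)]

best = threshold_top_k([C_x1, C_x2, G_xb], 1)
print(best)
```

In this example the loop stops after reading two entries from each list, since the best candidate already meets the threshold; avoiding a full pass over the selected tuples is the point of merging presorted lists.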





Quality Experiments

Compare our Conditional Ranking Method with the Global Method [CIDR03]

Surveyed 14 MSR employees

Datasets:
MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
Internet Movie Database (http://www.imdb.com)

Each User Behaved According to Various Profiles. E.g.:
singles, middle-class family, rich retirees…
teenage males, people interested in comedies of the 80s…

First Collect Workloads, Then Compare Results of 2 Methods for a Set of Queries


Quality Experiments – Average Precision

          Seattle Homes      Movies
Query     COND    GLOB       COND    GLOB
Q1        0.70    0.26       0.48    0.35
Q2        0.76    0.62       0.53    0.43
Q3        0.90    0.54       0.58    0.20
Q4        0.84    0.32       0.45    0.48
Q5        0.44    0.48       0.43    0.40

For 5 queries, ask users to Mark 10 out of a Set of 30 likely results containing: the Top-10 results of both the Conditional and Global plus a few randomly selected tuples.

Precision = Recall, since each method returns 10 tuples and each user marks 10 tuples.
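A small sketch of why the two measures coincide in this setup: when the returned list and the user's marked set have the same size, the number of hits is divided by the same number in both cases. The tuple ids below are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Precision and recall of a returned list against a set of relevant items."""
    hits = len(set(returned) & set(relevant))
    return hits / len(returned), hits / len(relevant)

# Hypothetical ids: a method's Top-10 and the 10 tuples a user marked.
top10 = list(range(10))
marked = [0, 1, 2, 3, 4, 5, 6, 20, 21, 22]

precision, recall = precision_recall(top10, marked)
print(precision, recall)  # equal, because both sets contain 10 items
```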


Quality Experiments - Fraction of Users Preferring Each Algorithm

[Figure: two bar charts, for the Seattle Homes and Movies datasets. x-axis: Query (Q1 to Q5); y-axis: FractionBetter (0 to 1); series: CONDITIONAL, GLOBAL.]

Seattle Homes and Movies Datasets
5 new queries
Top-5 Result-lists


Performance Experiments

Datasets:

Table            NumTuples    Database Size (MB)
Seattle Homes    17463        1.936
US Homes         1380762      140.432

Microsoft SQL Server 2000 RDBMS
P4 2.8-GHz PC, 1 GB RAM
C#, Connected to RDBMS through DAO

Compared Algorithms:
LM: List Merge
Scan


Performance Experiments - Precomputation

Time and Space Consumed by Index Module

Datasets    Lists Building Time    Lists Size
Seattle     1500 msec              7.8 MB
US          80000 msec             457.6 MB


Performance Experiments - Execution

Varying Number of Tuples Satisfying Selection Conditions
US Homes Database, 2-Attributes Queries

#Selected Tuples    LM Time (msec)    Scan Time (msec)
350                 800               6515
2000                700               39234
5000                600               115282
30000               550               566516
80000               500               3806531



[Figure: Time (msec) versus NumSpecifiedAttributes (1 to 3) for LM and Scan, US Homes Database.]





Related Work

[CIDR2003]
Use Workload, Focus on Empty-Answer Problem.
Drawback: Global Ranking Regardless of Query. E.g.: Proximity to a tram is desirable for cheap houses but undesirable for expensive ones.

Collaborative Filtering
Require Training Data of Queries and their Ranked Results

Relevance-Feedback Techniques for Learning Similarity in Multimedia and Relational Databases





Conclusions – Future Work

Conclusions
Completely Automated Approach for the Many-Answers Problem which Leverages Data and Workload Statistics and Correlations
Based on PIR

Future Work
Empty-Answer Problem
Handle Plain Text Attributes


Questions?



[Figure: Time (msec) versus NumSelectedTuples (0 to 4000) for LM, LMM, and Scan, Seattle Homes Database.]

LMM: List Merge where lists for one of the two specified attributes are missing, halving space
