probabilistic ranking of database query results

Probabilistic Ranking of Database Query Results

Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik

Presented by Weimin He, CSE@UTA

04/19/23 Weimin He CSE@UTA 2

Outline

Motivation
Problem Definition
System Architecture
Construction of Ranking Function
Implementation
Experiments
Conclusion and open problems


Motivating example

Realtor DB: Table D = (TID, Price, City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock)

SQL query: Select * From D Where City=“Seattle” AND View=“Waterfront”


Motivation

Many-answers problem
Two alternative solutions:
  Query reformulation
  Automatic ranking
Apply probabilistic model in IR to DB tuple ranking


Problem Definition

We are given a database table D with n tuples {t1, …, tn} over a set of m categorical attributes A = {A1, …, Am}, and a query Q: SELECT * FROM D WHERE X1=x1 AND … AND Xs=xs, where each Xi is an attribute from A and xi is a value in its domain.

The set of attributes X = {X1, …, Xs} is known as the set of attributes specified by the query, while the set Y = A – X is known as the set of unspecified attributes.

Let S ⊆ {t1, …, tn} be the answer set of Q.

How do we rank the tuples in S and return the top-k tuples to the user?


System Architecture


Intuition for Ranking Function

Select * From D Where City=“Seattle” And View=“Waterfront”

Score of a Result Tuple t depends on:
Global Score: Global Importance of Unspecified Attribute Values
  E.g., Homes with good school districts are globally desirable
Conditional Score: Correlations between Specified and Unspecified Attribute Values
  E.g., Waterfront ⇒ BoatDock


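Probabilistic Model in IR

Bayes’ Rule:
$$p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}$$

Product Rule:
$$p(a, b \mid c) = p(a \mid c)\, p(b \mid a, c)$$

$$\mathrm{Score}(t) = \frac{p(R \mid t)}{p(\bar{R} \mid t)} = \frac{p(t \mid R)\, p(R) / p(t)}{p(t \mid \bar{R})\, p(\bar{R}) / p(t)} \propto \frac{p(t \mid R)}{p(t \mid \bar{R})}$$

Document t, Query Q
R: Relevant document set
$\bar{R} = D - R$: Irrelevant document set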

Note from Vagelis Hristidis: Let's see how adapting PIR techniques to our problem lets us create a ranking function.


Adaptation of PIR to DB

Tuple t is considered as a document

Partition t into t(X) and t(Y)
t(X) and t(Y) are written as X and Y
Derive from the initial scoring function until the final ranking function is obtained


Preliminary Derivation


Limited Independence Assumptions

Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed

$$p(X \mid C) = \prod_{x \in X} p(x \mid C)$$

$$p(Y \mid C) = \prod_{y \in Y} p(y \mid C)$$


Continuing Derivation


Workload-based Estimation of p(y | R)

Assume a collection of “past” queries exists in the system

Workload W is represented as a set of “tuples”

Given query Q and specified attribute set X, approximate R as all query “tuples” in W that also request X

All properties of the relevant tuple set R can be obtained by examining only the subset of the workload that contains queries that also request X

$$p(y \mid R) \approx p(y \mid X, W)$$
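The approximation of R by the matching part of the workload can be sketched in Python. This is a minimal sketch under assumptions the slide does not state: each workload query is a dict mapping attribute names to requested values, "requesting X" is read as specifying the same values as Q for every attribute in X, and the names `workload_subset` and `estimate_p_y_given_R` are hypothetical.

```python
def workload_subset(workload, query):
    """Return the workload queries that also request every value specified in `query`."""
    return [w for w in workload
            if all(w.get(attr) == val for attr, val in query.items())]


def estimate_p_y_given_R(workload, query, attr, val):
    """Approximate p(y | R) by p(y | X, W): the relative frequency of attr=val
    among the workload queries that also request X."""
    subset = workload_subset(workload, query)
    if not subset:
        return 0.0  # assumption: no matching workload queries means no evidence
    return sum(1 for w in subset if w.get(attr) == val) / len(subset)
```

Returning 0.0 for an empty subset is only one possible choice; a real system would more likely smooth the estimate.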


Final Ranking Function
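The formula on this slide did not survive extraction. A reconstruction consistent with the four atomic probabilities the later slides pre-compute, p(y | W), p(y | D), p(x | y, W), and p(x | y, D), and with the global and conditional scores described earlier, would be the following. Treat it as an assumption rather than the slide's exact notation.

$$\mathrm{Score}(t) \propto \prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)} \; \prod_{y \in Y} \prod_{x \in X} \frac{p(x \mid y, W)}{p(x \mid y, D)}$$

Under this reading, the first product plays the role of the global score and the double product plays the role of the conditional score.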


Pre-computing Atomic Probabilities in Ranking Function

p(y | W): Relative frequency in W
p(y | D): Relative frequency in D
p(x | y, W): (# of tuples in W that contain x, y) / total # of tuples in W
p(x | y, D): (# of tuples in D that contain x, y) / total # of tuples in D


Example for Computing Atomic Probabilities

Select * From D Where City=“Seattle” And View=“Waterfront”

Y={SchoolDistrict, BoatDock, …}

|D| = 10,000
|W| = 1000
W{excellent} = 10
W{waterfront & yes} = 5

p(excellent|W) = 10/1000 = 0.01
p(excellent|D) = 10/10,000 = 0.001
p(waterfront|yes,W) = 5/1000 = 0.005
p(waterfront|yes,D) = 5/10,000 = 0.0005
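The counting behind these numbers can be sketched in Python. This is a minimal sketch, not the authors' implementation: rows are assumed to be dicts of attribute values, the function name `atomic_probabilities` is hypothetical, and the pair estimate follows the slides in dividing by the total number of tuples.

```python
from collections import Counter
from itertools import permutations


def atomic_probabilities(rows):
    """Estimate atomic probabilities from `rows`, a list of dicts (table D or workload W).

    Returns (p_single, p_pair):
      p_single[(attr, val)]        = (# rows containing attr=val) / len(rows)
      p_pair[((ax, vx), (ay, vy))] = (# rows containing both values) / len(rows)
    """
    n = len(rows)
    single, pair = Counter(), Counter()
    for row in rows:
        items = list(row.items())
        single.update(items)                 # one count per attribute value in the row
        pair.update(permutations(items, 2))  # ordered pairs of distinct attributes
    p_single = {key: count / n for key, count in single.items()}
    p_pair = {key: count / n for key, count in pair.items()}
    return p_single, p_pair


# Workload shaped like the slide's example: 1000 "tuples", 10 requesting an
# excellent school district and 5 requesting a waterfront view with a boat dock.
W = ([{"SchoolDistrict": "excellent"}] * 10
     + [{"View": "waterfront", "BoatDock": "yes"}] * 5
     + [{"City": "Seattle"}] * 985)
p_y_W, p_xy_W = atomic_probabilities(W)
print(p_y_W[("SchoolDistrict", "excellent")])
print(p_xy_W[(("View", "waterfront"), ("BoatDock", "yes"))])
```

The same function can be run over the rows of D to obtain the database-side estimates.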


Indexing Atomic Probabilities

p(y | W):
  {AttName, AttVal, Prob}
  B+ tree index on (AttName, AttVal)

p(y | D):
  {AttName, AttVal, Prob}
  B+ tree index on (AttName, AttVal)

p(x | y, W):
  {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}
  B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)

p(x | y, D):
  {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}
  B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
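The layout above can be sketched with Python's built-in sqlite3 module. This is an illustration under assumptions: the table and index names (`GlobalW`, `CondW`, and so on) are hypothetical, SQLite's ordinary B-tree indexes stand in for the B+ tree indexes named on the slide, and the Left columns are assumed to hold x while the Right columns hold y.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE GlobalW (AttName TEXT, AttVal TEXT, Prob REAL);
CREATE INDEX GlobalW_idx ON GlobalW (AttName, AttVal);
CREATE TABLE CondW (AttNameLeft TEXT, AttValLeft TEXT,
                    AttNameRight TEXT, AttValRight TEXT, Prob REAL);
CREATE INDEX CondW_idx
    ON CondW (AttNameLeft, AttValLeft, AttNameRight, AttValRight);
""")

# p(waterfront | yes, W) = 5/1000, taken from the example slide.
conn.execute("INSERT INTO CondW VALUES (?, ?, ?, ?, ?)",
             ("View", "waterfront", "BoatDock", "yes", 5 / 1000))


def lookup_cond(conn, x_name, x_val, y_name, y_val):
    """Fetch p(x | y, W) through the composite index; None if the pair is absent."""
    row = conn.execute(
        "SELECT Prob FROM CondW WHERE AttNameLeft = ? AND AttValLeft = ? "
        "AND AttNameRight = ? AND AttValRight = ?",
        (x_name, x_val, y_name, y_val)).fetchone()
    return row[0] if row is not None else None
```

The two tables for D would follow the same pattern.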


Scan Algorithm

Preprocessing - Atomic Probabilities Module
  Computes and Indexes the Quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for All Distinct Values x and y

Execution
  Select Tuples that Satisfy the Query
  Scan and Compute Score for Each Result Tuple
  Return Top-K Tuples
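The execution phase can be sketched in Python. This is a minimal sketch under assumptions: the score is taken to be the product of the ratios p(y | W)/p(y | D) and p(x | y, W)/p(x | y, D) over the unspecified values y and specified values x, the probability tables are plain dicts keyed by (attribute, value) pairs, missing entries fall back to a small constant `eps`, and the names `ranking_score` and `scan_top_k` are hypothetical.

```python
import heapq
from math import prod


def ranking_score(row, query, p_w, p_d, pc_w, pc_d, eps=1e-6):
    """Score one result tuple from pre-computed atomic probabilities.

    p_w / p_d map (attr, val) to p(y | W) / p(y | D);
    pc_w / pc_d map ((x_attr, x_val), (y_attr, y_val)) to p(x | y, W) / p(x | y, D).
    Missing entries default to eps (an assumption, to avoid dividing by zero).
    """
    specified = list(query.items())
    unspecified = [(a, v) for a, v in row.items() if a not in query and a != "TID"]
    glob = prod(p_w.get(y, eps) / p_d.get(y, eps) for y in unspecified)
    cond = prod(pc_w.get((x, y), eps) / pc_d.get((x, y), eps)
                for y in unspecified for x in specified)
    return glob * cond


def scan_top_k(table, query, k, score):
    """Select tuples that satisfy the query, score each one, and return the top k."""
    answers = (row for row in table
               if all(row.get(a) == v for a, v in query.items()))
    return heapq.nlargest(k, answers, key=score)
```

A caller would bind the probability tables first, for example `scan_top_k(table, query, 10, lambda r: ranking_score(r, query, p_w, p_d, pc_w, pc_d))`.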


Beyond Scan Algorithm

Scan algorithm is inefficient
  Many tuples in the answer set
Another extreme
  Pre-compute top-K tuples for all possible queries
  Still infeasible in practice
Trade-off solution
  Pre-compute ranked lists of tuples for all possible atomic queries
  At query time, merge ranked lists to get top-K tuples


Two Kinds of Ranked List

CondList Cx
  {AttName, AttVal, TID, CondScore}
  B+ tree index on (AttName, AttVal, CondScore)

GlobList Gx
  {AttName, AttVal, TID, GlobScore}
  B+ tree index on (AttName, AttVal, GlobScore)


Index Module


List Merge Algorithm
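The figure for this slide is not preserved. As an illustration of merging pre-computed ranked lists at query time, here is a minimal Python sketch of a threshold-style merge in the spirit of Fagin's Threshold Algorithm. It rests on assumptions: each list holds (TID, score) pairs sorted by descending non-negative score, a tuple's final score is the product of its scores across all lists, a tuple absent from any list does not satisfy the query, and the name `list_merge_top_k` is hypothetical.

```python
import heapq
from math import prod


def list_merge_top_k(ranked_lists, k):
    """Merge ranked lists of (tid, score) pairs and return the top-k (score, tid) pairs."""
    if k <= 0 or not ranked_lists:
        return []
    lookup = [dict(lst) for lst in ranked_lists]  # random access: tid -> score
    seen = set()
    top = []  # min-heap holding the best k (score, tid) pairs found so far
    # Once the shortest list is exhausted, every possible answer has been seen.
    for depth in range(min(len(lst) for lst in ranked_lists)):
        frontier = []
        for lst in ranked_lists:
            tid, score = lst[depth]  # sorted access, one entry per list
            frontier.append(score)
            if tid in seen:
                continue
            seen.add(tid)
            if all(tid in table for table in lookup):
                total = prod(table[tid] for table in lookup)
                if len(top) < k:
                    heapq.heappush(top, (total, tid))
                elif total > top[0][0]:
                    heapq.heapreplace(top, (total, tid))
        # No unseen tuple can score above the product of the current frontier.
        threshold = prod(frontier)
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True)
```

The early exit is what saves work relative to a full scan: the lists are read only as deep as needed to guarantee that no unseen tuple can enter the top k.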


Experimental Setup

Datasets:
  MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
  Internet Movie Database (http://www.imdb.com)

Software and Hardware:
  Microsoft SQL Server 2000 RDBMS
  P4 2.8-GHz PC, 1 GB RAM
  C#, Connected to RDBMS through DAO


Quality Experiments

Conducted on Seattle Homes and Movies tables

Collect a workload from users

Compare Conditional Ranking Method in the paper with the Global Method [CIDR03]


Quality Experiment - Average Precision

For each query Qi , generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples

Let each user mark 10 tuples in Hi as most relevant to Qi

Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
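One simple way to quantify the match is the fraction of returned tuples that the user also marked. The slide does not give the exact formula, so this Python sketch is an assumption, and the function name `precision` is hypothetical.

```python
def precision(user_marked, algorithm_top):
    """Fraction of the algorithm's returned tuple IDs that the user marked as relevant."""
    marked = set(user_marked)
    returned = list(algorithm_top)
    if not returned:
        return 0.0
    return sum(1 for tid in returned if tid in marked) / len(returned)
```

With 10 marked tuples and 10 returned tuples, this value equals recall as well, since both sets have the same size.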


Quality Experiment - Fraction of Users Preferring Each Algorithm

5 new queries
Users were given the top-5 results


Performance Experiments

Datasets

Table            NumTuples    Database Size (MB)
Seattle Homes    17463        1.936
US Homes         1380762      140.432

Compare 2 Algorithms:
  Scan algorithm
  List Merge algorithm


Performance Experiments – Pre-computation Time


Performance Experiments – Execution Time


Conclusion and Open Problems

Automatic ranking for many-answers

Adaptation of PIR to DB

Open problems:
  Multiple-table queries
  Non-categorical attributes
