1
Probabilistic Ranking of Database Query Results
Surajit Chaudhuri, Microsoft ResearchGautam Das, Microsoft ResearchVagelis Hristidis, Florida International UniversityGerhard Weikum, MPI Informatik
Presented by Raghunath Ravi
Sivaramakrishnan SubramaniCSE@UTA
2
Roadmap
- Motivation
- Key Problems
- System Architecture
- Construction of Ranking Function
- Implementation
- Experiments
- Conclusion and open problems
3
Motivation
The many-answers problem
Two alternative solutions:
- Query reformulation
- Automatic ranking
Apply a probabilistic model from IR to DB tuple ranking
4
Example – Realtor Database
House attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year
Query: City = 'Seattle' AND Waterfront = TRUE
Too many results!
Intuitively, houses with a lower Price, more Bedrooms, or a BoatDock are generally preferable
5
Rank According to Unspecified Attributes
The score of a result tuple t depends on:
Global score: global importance of unspecified attribute values [CIDR2003]
◦ E.g., newer houses are generally preferred
Conditional score: correlations between specified and unspecified attribute values
◦ E.g., Waterfront → BoatDock
◦ E.g., many Bedrooms → good School District
7
Key Problems
Given a query Q:
- How to combine the global and conditional scores into a ranking function: use Probabilistic Information Retrieval (PIR)
- How to calculate the global and conditional scores: use the query workload and the data
9
System Architecture
11
PIR Review
Bayes' Rule:   p(a|b) = p(b|a) p(a) / p(b)
Product Rule:  p(a,b|c) = p(a|c) p(b|a,c)

Score(t) = p(R|t) / p(R̄|t)
         = [p(t|R) p(R) / p(t)] / [p(t|R̄) p(R̄) / p(t)]
         ∝ p(t|R) / p(t|R̄)

Document (tuple) t, query Q
R: relevant documents
R̄ = D − R: irrelevant documents
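The Bayes-rule rearrangement above can be checked numerically. A minimal Python sketch with invented probabilities (the likelihoods 0.03/0.01 and prior 0.2 are illustrative, not from the paper): p(t) cancels and the prior odds p(R)/p(R̄) are the same for every tuple, so ranking by p(R|t)/p(R̄|t) is equivalent to ranking by the likelihood ratio p(t|R)/p(t|R̄).

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes' rule: p(a|b) = p(b|a) p(a) / p(b)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers for one tuple t:
p_t_given_R, p_t_given_Rbar = 0.03, 0.01
p_R, p_Rbar = 0.2, 0.8
p_t = p_t_given_R * p_R + p_t_given_Rbar * p_Rbar   # total probability

# Odds of relevance p(R|t)/p(Rbar|t); p(t) cancels, and the prior odds
# p(R)/p(Rbar) are constant across tuples, so the likelihood ratio
# p(t|R)/p(t|Rbar) induces the same ranking.
odds = bayes(p_t_given_R, p_R, p_t) / bayes(p_t_given_Rbar, p_Rbar, p_t)
```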
12
Adaptation of PIR to DB
- Tuple t is considered as a document
- Partition t into t(X) (the attributes specified in the query) and t(Y) (the unspecified attributes)
- t(X) and t(Y) are written as X and Y for short
- The final ranking function is derived step by step from the initial scoring function
13
Preliminary Derivation
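The equations on this slide were lost in transcription; based on the Bayes and product rules on the previous slide, the preliminary derivation presumably runs along these lines (a reconstruction, not the slide's exact content):

```latex
% Rank by the odds of relevance; approximate the irrelevant set \bar{R}
% by the whole database D, then split t into specified (X) and
% unspecified (Y) parts using the product rule:
\mathrm{Score}(t) \;\propto\; \frac{p(t \mid R)}{p(t \mid \bar{R})}
  \;\approx\; \frac{p(t \mid R)}{p(t \mid D)}
  \;=\; \frac{p(X, Y \mid R)}{p(X, Y \mid D)}
  \;=\; \frac{p(Y \mid R)\, p(X \mid Y, R)}{p(Y \mid D)\, p(X \mid Y, D)}
```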
14
Limited Independence Assumptions
Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed:

p(X|C) = Π_{x∈X} p(x|C)
p(Y|C) = Π_{y∈Y} p(y|C)
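A one-line sketch of the independence assumption, with invented per-value conditionals p(x|C):

```python
from math import prod

def p_joint_given_C(values, p_given_C):
    # Limited independence: p(X|C) = Π_{x∈X} p(x|C), and likewise for Y
    return prod(p_given_C[v] for v in values)

# Hypothetical per-value conditionals:
p_given_C = {"City=Seattle": 0.5, "Waterfront=TRUE": 0.2}
p_X = p_joint_given_C(["City=Seattle", "Waterfront=TRUE"], p_given_C)
```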
15
Continuing Derivation
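This slide's equations also did not survive transcription; applying the limited independence assumptions and approximating the relevant set R by the workload W presumably yields the product form precomputed on the next slide (a reconstruction):

```latex
% Independence within X and within Y, plus p(y|R) ≈ p(y|W) and
% p(x|y,R) ≈ p(x|y,W), give the final multiplicative ranking function:
\mathrm{Score}(t) \;\propto\;
  \prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}
  \;\cdot\;
  \prod_{y \in Y} \prod_{x \in X} \frac{p(x \mid y, W)}{p(x \mid y, D)}
```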
16
Pre-computing Atomic Probabilities in Ranking Function

Atomic probabilities:
- p(y|W): relative frequency of y in W (use workload)
- p(y|D): relative frequency of y in D (use data)
- p(x,y|W): (# of tuples in W that contain x and y) / (total # of tuples in W)
- p(x,y|D): (# of tuples in D that contain x and y) / (total # of tuples in D)

Final ranking function:

Score(t) ∝ Π_{y∈Y} [p(y|W) / p(y|D)] · Π_{y∈Y} Π_{x∈X} [p(x|y,W) / p(x|y,D)]

where the conditionals come from the stored joints, e.g. p(x|y,W) = p(x,y|W) / p(y|W); the workload W supplies the numerators and the data D the denominators.
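A toy end-to-end sketch of this ranking function. The tiny database D, workload W, and attribute values are invented; a real implementation would also need smoothing for values that never occur in the workload.

```python
def rel_freq(vals, coll):
    """Fraction of tuples in coll that contain every value in vals."""
    return sum(1 for t in coll if vals <= t) / len(coll)

def score(X, Y, W, D):
    """Score(t) ∝ Π_y p(y|W)/p(y|D) · Π_y Π_x p(x|y,W)/p(x|y,D),
    with p(x|y,·) obtained from the stored joints as p(x,y|·)/p(y|·)."""
    s = 1.0
    for y in Y:
        s *= rel_freq({y}, W) / rel_freq({y}, D)
        for x in X:
            p_x_given_y_W = rel_freq({x, y}, W) / rel_freq({y}, W)
            p_x_given_y_D = rel_freq({x, y}, D) / rel_freq({y}, D)
            s *= p_x_given_y_W / p_x_given_y_D
    return s

# Toy database D and workload W (tuples as sets of attribute=value strings):
D = [
    {"Waterfront=TRUE",  "BoatDock=TRUE"},
    {"Waterfront=TRUE",  "BoatDock=FALSE"},
    {"Waterfront=FALSE", "BoatDock=TRUE"},
    {"Waterfront=FALSE", "BoatDock=FALSE"},
]
W = [  # past queries favored waterfront homes with boat docks
    {"Waterfront=TRUE",  "BoatDock=TRUE"},
    {"Waterfront=TRUE",  "BoatDock=TRUE"},
    {"Waterfront=FALSE", "BoatDock=FALSE"},
]
X = {"Waterfront=TRUE"}                       # specified by the query
with_dock    = score(X, {"BoatDock=TRUE"},  W, D)
without_dock = score(X, {"BoatDock=FALSE"}, W, D)
```

As the slide's example suggests, the waterfront correlation makes the tuple with a boat dock score higher.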
18
Architecture of Ranking Systems
19
Scan Algorithm
Preprocessing – Atomic Probabilities Module:
- Computes and indexes the quantities P(y|W), P(y|D), P(x|y,W), and P(x|y,D) for all distinct values x and y
Execution:
- Select tuples that satisfy the query
- Scan and compute the score for each result tuple
- Return the top-K tuples
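A compact sketch of the Scan execution phase. The toy homes table, the query predicate, and the stand-in score function (cheaper ranks higher) are invented; the real system would plug in the pre-computed PIR ranking function.

```python
import heapq

def scan_topk(tuples, satisfies, score, k):
    # Select tuples matching the query predicate, score each, return top-K
    return heapq.nlargest(k, (t for t in tuples if satisfies(t)), key=score)

# Hypothetical homes table:
homes = [
    {"TID": 1, "City": "Seattle", "Waterfront": True,  "Price": 500_000},
    {"TID": 2, "City": "Seattle", "Waterfront": True,  "Price": 300_000},
    {"TID": 3, "City": "Tacoma",  "Waterfront": True,  "Price": 200_000},
    {"TID": 4, "City": "Seattle", "Waterfront": False, "Price": 250_000},
]
query = lambda t: t["City"] == "Seattle" and t["Waterfront"]
# Stand-in score: negated price, so cheaper matching homes rank higher
top = scan_topk(homes, query, lambda t: -t["Price"], k=1)
```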
20
Beyond the Scan Algorithm
The Scan algorithm is inefficient when there are many tuples in the answer set
The other extreme, pre-computing the top-K tuples for all possible queries, is still infeasible in practice
Trade-off solution:
- Pre-compute ranked lists of tuples for all possible atomic queries
- At query time, merge the ranked lists to get the top-K tuples
21
Output from Index Module
CondList Cx: {AttName, AttVal, TID, CondScore}
- B+ tree index on (AttName, AttVal, CondScore)
GlobList Gx: {AttName, AttVal, TID, GlobScore}
- B+ tree index on (AttName, AttVal, GlobScore)
22
Index Module
23
Preprocessing Component
Preprocessing: for each distinct value x in the database, calculate and store the conditional (Cx) and global (Gx) lists as follows:
◦ For each tuple t containing x, calculate its conditional and global scores and add them to Cx and Gx respectively
◦ Sort Cx and Gx by decreasing score
Execution: given a query Q: X1=x1 AND … AND Xs=xs,
execute the Threshold Algorithm [Fag01] on the lists Cx1, …, Cxs, and Gxb, where Gxb is the shortest list among Gx1, …, Gxs
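A sketch of the Threshold Algorithm merge step, under two simplifying assumptions not stated on the slide: every TID appears in every list (so random access always succeeds), and per-list scores combine by multiplication, matching the product-form ranking function. The lists themselves are invented.

```python
import heapq
from math import prod

def threshold_algorithm(lists, k):
    """Sketch of Fagin's Threshold Algorithm [Fag01] over sorted lists.
    lists: per-condition lists of (tid, score), sorted descending by score."""
    random_access = [dict(lst) for lst in lists]   # tid -> score, per list
    seen, top = {}, []
    for depth in range(max(len(lst) for lst in lists)):
        # Sorted access: look at the next entry of each list
        for lst in lists:
            if depth < len(lst):
                tid = lst[depth][0]
                if tid not in seen:                # random-access other lists
                    seen[tid] = prod(ra[tid] for ra in random_access)
        # Threshold: best possible aggregate score of any unseen tuple
        threshold = prod(lst[min(depth, len(lst) - 1)][1] for lst in lists)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            break                                  # no unseen tuple can win
    return top

# Invented conditional/global lists for three tuples a, b, c:
lists = [[("a", 0.9), ("b", 0.8), ("c", 0.1)],
         [("b", 0.9), ("a", 0.5), ("c", 0.4)]]
top_1 = threshold_algorithm(lists, 1)
```

The early-stop test is what makes the merge cheaper than a full scan: once the k-th best seen score reaches the threshold, no deeper list entry can matter.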
24
List Merge Algorithm
26
Experimental Setup
Datasets:
◦ MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
◦ Internet Movie Database (http://www.imdb.com)
Software and Hardware:
◦ Microsoft SQL Server 2000 RDBMS
◦ P4 2.8-GHz PC, 1 GB RAM
◦ C#, connected to the RDBMS through DAO
27
Quality Experiments
Conducted on the Seattle Homes and Movies tables
Collect a workload from users
Compare the conditional ranking method of the paper with the global method [CIDR03]
28
Quality Experiment – Average Precision
- For each query Qi, generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples
- Let each user mark 10 tuples in Hi as most relevant to Qi
- Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
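One simple way to score the match described above is the overlap between the user's 10 picks and the algorithm's top-10 (a sketch with invented TIDs; the paper's exact precision metric may be defined more precisely):

```python
def top10_overlap(user_picks, algo_top10):
    # Fraction of the user's relevant tuples found in the algorithm's top-10
    return len(set(user_picks) & set(algo_top10)) / len(set(user_picks))

# Hypothetical TIDs: user marked 1..10, algorithm returned 3..12
p = top10_overlap(range(1, 11), range(3, 13))
```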
29
Quality Experiment – Fraction of Users Preferring Each Algorithm
- 5 new queries
- Users were given the top-5 results
30
Performance Experiments
Datasets:

Table          NumTuples   Database Size (MB)
Seattle Homes  17463       1.936
US Homes       1380762     140.432

Compare 2 algorithms:
- Scan algorithm
- List Merge algorithm
31
Performance Experiments – Pre-computation Time
32
Performance Experiments – Execution Time
34
Conclusions – Future Work
Conclusions:
- Completely automated approach for the many-answers problem that leverages data and workload statistics and correlations
- Based on PIR
Drawbacks:
- Multiple-table queries
- Non-categorical attributes
Future Work:
- Empty-answer problem
- Handle plain-text attributes
35
Questions?