1
Probabilistic Ranking of Database Query Results
Surajit Chaudhuri, Microsoft ResearchGautam Das, Microsoft ResearchVagelis Hristidis, Florida International UniversityGerhard Weikum, MPI Informatik
Presented by Raghunath Ravi
Sivaramakrishnan SubramaniCSE@UTA
2
Roadmap
- Motivation
- Key Problems
- System Architecture
- Construction of Ranking Function
- Implementation
- Experiments
- Conclusion and open problems
3
Motivation
The many-answers problem
Two alternative solutions:
- Query reformulation
- Automatic ranking
Apply a probabilistic model from IR to DB tuple ranking
4
Example – Realtor Database
House attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year
Query: City = 'Seattle' AND Waterfront = TRUE
Too many results!
Intuitively, houses with a lower Price, more Bedrooms, or a BoatDock are generally preferable
5
Rank According to Unspecified Attributes
The score of a result tuple t depends on:
Global score: global importance of unspecified attribute values [CIDR2003]
◦ E.g., newer houses are generally preferred
Conditional score: correlations between specified and unspecified attribute values
◦ E.g., Waterfront → BoatDock
◦ E.g., many Bedrooms → good School District
7
Key Problems
Given a query Q:
- How to combine the global and conditional scores into a ranking function: use Probabilistic Information Retrieval (PIR)
- How to calculate the global and conditional scores: use the query workload and the data
9
System Architecture
11
PIR Review
Bayes' Rule:   p(a|b) = p(b|a) p(a) / p(b)
Product Rule:  p(a,b|c) = p(a|c) p(b|a,c)

Score(t) = p(R|t) / p(R̄|t)
         = [p(t|R) p(R) / p(t)] / [p(t|R̄) p(R̄) / p(t)]
         ∝ p(t|R) / p(t|R̄)

Document (tuple) t, query Q
R: relevant documents
R̄ = D − R: irrelevant documents
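The Bayes-rule rearrangement above can be checked numerically. A minimal Python sketch with invented probabilities (the likelihoods 0.03/0.01 and prior 0.2 are illustrative, not from the paper): p(t) cancels and the prior odds p(R)/p(R̄) are the same for every tuple, so ranking by p(R|t)/p(R̄|t) is equivalent to ranking by the likelihood ratio p(t|R)/p(t|R̄).

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes' rule: p(a|b) = p(b|a) p(a) / p(b)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers for one tuple t:
p_t_given_R, p_t_given_Rbar = 0.03, 0.01
p_R, p_Rbar = 0.2, 0.8
p_t = p_t_given_R * p_R + p_t_given_Rbar * p_Rbar   # total probability

# Odds of relevance p(R|t)/p(Rbar|t); p(t) cancels, and the prior odds
# p(R)/p(Rbar) are constant across tuples, so the likelihood ratio
# p(t|R)/p(t|Rbar) induces the same ranking.
odds = bayes(p_t_given_R, p_R, p_t) / bayes(p_t_given_Rbar, p_Rbar, p_t)
```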
12
Adaptation of PIR to DB
- Tuple t is considered as a document
- Partition t into t(X) (the attributes specified in the query) and t(Y) (the unspecified attributes)
- t(X) and t(Y) are written as X and Y for short
- The final ranking function is derived step by step from the initial scoring function
13
Preliminary Derivation
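The equations on this slide were lost in transcription; based on the Bayes and product rules on the previous slide, the preliminary derivation presumably runs along these lines (a reconstruction, not the slide's exact content):

```latex
% Rank by the odds of relevance; approximate the irrelevant set \bar{R}
% by the whole database D, then split t into specified (X) and
% unspecified (Y) parts using the product rule:
\mathrm{Score}(t) \;\propto\; \frac{p(t \mid R)}{p(t \mid \bar{R})}
  \;\approx\; \frac{p(t \mid R)}{p(t \mid D)}
  \;=\; \frac{p(X, Y \mid R)}{p(X, Y \mid D)}
  \;=\; \frac{p(Y \mid R)\, p(X \mid Y, R)}{p(Y \mid D)\, p(X \mid Y, D)}
```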
14
Limited Independence Assumptions
Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed:

p(X|C) = Π_{x∈X} p(x|C)
p(Y|C) = Π_{y∈Y} p(y|C)
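A one-line sketch of the independence assumption, with invented per-value conditionals p(x|C):

```python
from math import prod

def p_joint_given_C(values, p_given_C):
    # Limited independence: p(X|C) = Π_{x∈X} p(x|C), and likewise for Y
    return prod(p_given_C[v] for v in values)

# Hypothetical per-value conditionals:
p_given_C = {"City=Seattle": 0.5, "Waterfront=TRUE": 0.2}
p_X = p_joint_given_C(["City=Seattle", "Waterfront=TRUE"], p_given_C)
```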
15
Continuing Derivation
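This slide's equations also did not survive transcription; applying the limited independence assumptions and approximating the relevant set R by the workload W presumably yields the product form precomputed on the next slide (a reconstruction):

```latex
% Independence within X and within Y, plus p(y|R) ≈ p(y|W) and
% p(x|y,R) ≈ p(x|y,W), give the final multiplicative ranking function:
\mathrm{Score}(t) \;\propto\;
  \prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}
  \;\cdot\;
  \prod_{y \in Y} \prod_{x \in X} \frac{p(x \mid y, W)}{p(x \mid y, D)}
```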
16
Pre-computing Atomic Probabilities in Ranking Function

Atomic probabilities:
- p(y|W): relative frequency of y in W (use workload)
- p(y|D): relative frequency of y in D (use data)
- p(x,y|W): (# of tuples in W that contain x and y) / (total # of tuples in W)
- p(x,y|D): (# of tuples in D that contain x and y) / (total # of tuples in D)

Final ranking function:

Score(t) ∝ Π_{y∈Y} [p(y|W) / p(y|D)] · Π_{y∈Y} Π_{x∈X} [p(x|y,W) / p(x|y,D)]

where the conditionals come from the stored joints, e.g. p(x|y,W) = p(x,y|W) / p(y|W); the workload W supplies the numerators and the data D the denominators.
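A toy end-to-end sketch of this ranking function. The tiny database D, workload W, and attribute values are invented; a real implementation would also need smoothing for values that never occur in the workload.

```python
def rel_freq(vals, coll):
    """Fraction of tuples in coll that contain every value in vals."""
    return sum(1 for t in coll if vals <= t) / len(coll)

def score(X, Y, W, D):
    """Score(t) ∝ Π_y p(y|W)/p(y|D) · Π_y Π_x p(x|y,W)/p(x|y,D),
    with p(x|y,·) obtained from the stored joints as p(x,y|·)/p(y|·)."""
    s = 1.0
    for y in Y:
        s *= rel_freq({y}, W) / rel_freq({y}, D)
        for x in X:
            p_x_given_y_W = rel_freq({x, y}, W) / rel_freq({y}, W)
            p_x_given_y_D = rel_freq({x, y}, D) / rel_freq({y}, D)
            s *= p_x_given_y_W / p_x_given_y_D
    return s

# Toy database D and workload W (tuples as sets of attribute=value strings):
D = [
    {"Waterfront=TRUE",  "BoatDock=TRUE"},
    {"Waterfront=TRUE",  "BoatDock=FALSE"},
    {"Waterfront=FALSE", "BoatDock=TRUE"},
    {"Waterfront=FALSE", "BoatDock=FALSE"},
]
W = [  # past queries favored waterfront homes with boat docks
    {"Waterfront=TRUE",  "BoatDock=TRUE"},
    {"Waterfront=TRUE",  "BoatDock=TRUE"},
    {"Waterfront=FALSE", "BoatDock=FALSE"},
]
X = {"Waterfront=TRUE"}                       # specified by the query
with_dock    = score(X, {"BoatDock=TRUE"},  W, D)
without_dock = score(X, {"BoatDock=FALSE"}, W, D)
```

As the slide's example suggests, the waterfront correlation makes the tuple with a boat dock score higher.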
18
Architecture of Ranking Systems
19
Scan Algorithm
Preprocessing – Atomic Probabilities Module:
- Computes and indexes the quantities P(y|W), P(y|D), P(x|y,W), and P(x|y,D) for all distinct values x and y
Execution:
- Select tuples that satisfy the query
- Scan and compute the score for each result tuple
- Return the top-K tuples
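A compact sketch of the Scan execution phase. The toy homes table, the query predicate, and the stand-in score function (cheaper ranks higher) are invented; the real system would plug in the pre-computed PIR ranking function.

```python
import heapq

def scan_topk(tuples, satisfies, score, k):
    # Select tuples matching the query predicate, score each, return top-K
    return heapq.nlargest(k, (t for t in tuples if satisfies(t)), key=score)

# Hypothetical homes table:
homes = [
    {"TID": 1, "City": "Seattle", "Waterfront": True,  "Price": 500_000},
    {"TID": 2, "City": "Seattle", "Waterfront": True,  "Price": 300_000},
    {"TID": 3, "City": "Tacoma",  "Waterfront": True,  "Price": 200_000},
    {"TID": 4, "City": "Seattle", "Waterfront": False, "Price": 250_000},
]
query = lambda t: t["City"] == "Seattle" and t["Waterfront"]
# Stand-in score: negated price, so cheaper matching homes rank higher
top = scan_topk(homes, query, lambda t: -t["Price"], k=1)
```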
20
Beyond the Scan Algorithm
The Scan algorithm is inefficient when there are many tuples in the answer set
The other extreme, pre-computing the top-K tuples for all possible queries, is still infeasible in practice
Trade-off solution:
- Pre-compute ranked lists of tuples for all possible atomic queries
- At query time, merge the ranked lists to get the top-K tuples
21
Output from Index Module
CondList Cx: {AttName, AttVal, TID, CondScore}
- B+ tree index on (AttName, AttVal, CondScore)
GlobList Gx: {AttName, AttVal, TID, GlobScore}
- B+ tree index on (AttName, AttVal, GlobScore)
22
Index Module
23
Preprocessing Component
Preprocessing: for each distinct value x in the database, calculate and store the conditional (Cx) and global (Gx) lists as follows:
◦ For each tuple t containing x, calculate its conditional and global scores and add them to Cx and Gx respectively
◦ Sort Cx and Gx by decreasing score
Execution: given a query Q: X1=x1 AND … AND Xs=xs,
execute the Threshold Algorithm [Fag01] on the lists Cx1, …, Cxs, and Gxb, where Gxb is the shortest list among Gx1, …, Gxs
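A sketch of the Threshold Algorithm merge step, under two simplifying assumptions not stated on the slide: every TID appears in every list (so random access always succeeds), and per-list scores combine by multiplication, matching the product-form ranking function. The lists themselves are invented.

```python
import heapq
from math import prod

def threshold_algorithm(lists, k):
    """Sketch of Fagin's Threshold Algorithm [Fag01] over sorted lists.
    lists: per-condition lists of (tid, score), sorted descending by score."""
    random_access = [dict(lst) for lst in lists]   # tid -> score, per list
    seen, top = {}, []
    for depth in range(max(len(lst) for lst in lists)):
        # Sorted access: look at the next entry of each list
        for lst in lists:
            if depth < len(lst):
                tid = lst[depth][0]
                if tid not in seen:                # random-access other lists
                    seen[tid] = prod(ra[tid] for ra in random_access)
        # Threshold: best possible aggregate score of any unseen tuple
        threshold = prod(lst[min(depth, len(lst) - 1)][1] for lst in lists)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            break                                  # no unseen tuple can win
    return top

# Invented conditional/global lists for three tuples a, b, c:
lists = [[("a", 0.9), ("b", 0.8), ("c", 0.1)],
         [("b", 0.9), ("a", 0.5), ("c", 0.4)]]
top_1 = threshold_algorithm(lists, 1)
```

The early-stop test is what makes the merge cheaper than a full scan: once the k-th best seen score reaches the threshold, no deeper list entry can matter.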
24
List Merge Algorithm
26
Experimental Setup
Datasets:
◦ MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
◦ Internet Movie Database (http://www.imdb.com)
Software and Hardware:
◦ Microsoft SQL Server 2000 RDBMS
◦ P4 2.8-GHz PC, 1 GB RAM
◦ C#, connected to the RDBMS through DAO
27
Quality Experiments
Conducted on the Seattle Homes and Movies tables
Collect a workload from users
Compare the conditional ranking method of the paper with the global method [CIDR03]
28
Quality Experiment – Average Precision
- For each query Qi, generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples
- Let each user mark 10 tuples in Hi as most relevant to Qi
- Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
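One simple way to score the match described above is the overlap between the user's 10 picks and the algorithm's top-10 (a sketch with invented TIDs; the paper's exact precision metric may be defined more precisely):

```python
def top10_overlap(user_picks, algo_top10):
    # Fraction of the user's relevant tuples found in the algorithm's top-10
    return len(set(user_picks) & set(algo_top10)) / len(set(user_picks))

# Hypothetical TIDs: user marked 1..10, algorithm returned 3..12
p = top10_overlap(range(1, 11), range(3, 13))
```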
29
Quality Experiment – Fraction of Users Preferring Each Algorithm
- 5 new queries
- Users were given the top-5 results
30
Performance Experiments
Datasets:

Table          NumTuples   Database Size (MB)
Seattle Homes  17463       1.936
US Homes       1380762     140.432

Compare 2 algorithms:
- Scan algorithm
- List Merge algorithm
31
Performance Experiments – Pre-computation Time
32
Performance Experiments – Execution Time
34
Conclusions – Future Work
Conclusions:
- Completely automated approach for the many-answers problem that leverages data and workload statistics and correlations
- Based on PIR
Drawbacks:
- Multiple-table queries
- Non-categorical attributes
Future Work:
- Empty-answer problem
- Handle plain-text attributes
35
Questions?