Probabilistic Ranking of Database Query Results
Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik
Presented by Weimin He, CSE@UTA
04/19/23 Weimin He CSE@UTA 2
Outline
Motivation
Problem Definition
System Architecture
Construction of Ranking Function
Implementation
Experiments
Conclusion and open problems
Motivating example
Realtor DB: Table D = (TID, Price, City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock)
SQL query: Select * From D Where City=“Seattle” And View=“Waterfront”
Motivation
Many-answers problem
Two alternative solutions:
  Query reformulation
  Automatic ranking
Apply the probabilistic model from IR to DB tuple ranking
Problem Definition
Given a database table D with n tuples {t1, …, tn} over a set of m categorical attributes A = {A1, …, Am}, and a query
Q: SELECT * FROM D WHERE X1=x1 AND … AND Xs=xs
where each Xi is an attribute from A and xi is a value in its domain.
The set of attributes X = {X1, …, Xs} is known as the set of attributes specified by the query, while the set Y = A – X is known as the set of unspecified attributes.
Let S ⊆ {t1, …, tn} be the answer set of Q.
How do we rank the tuples in S and return the top-k tuples to the user?
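The problem setup can be illustrated with a minimal Python sketch. It assumes tuples are plain dicts and a query is a dict of specified attribute-value pairs; the names `answer_set`, `unspecified_attributes`, and the toy rows are hypothetical and not taken from the paper.

```python
def answer_set(table, query):
    """Return the tuples that satisfy every specified condition Xi = xi."""
    return [t for t in table if all(t.get(a) == v for a, v in query.items())]


def unspecified_attributes(attributes, query):
    """Y = A - X: the attributes the query leaves unconstrained."""
    return [a for a in attributes if a not in query]


# Hypothetical toy data in the spirit of the realtor example.
A = ["City", "View", "SchoolDistrict", "BoatDock"]
D = [
    {"TID": 1, "City": "Seattle", "View": "Waterfront", "SchoolDistrict": "Excellent", "BoatDock": "Yes"},
    {"TID": 2, "City": "Seattle", "View": "Greenbelt", "SchoolDistrict": "Good", "BoatDock": "No"},
    {"TID": 3, "City": "Seattle", "View": "Waterfront", "SchoolDistrict": "Good", "BoatDock": "No"},
    {"TID": 4, "City": "Bellevue", "View": "Waterfront", "SchoolDistrict": "Excellent", "BoatDock": "Yes"},
]
Q = {"City": "Seattle", "View": "Waterfront"}

S = answer_set(D, Q)               # answer set of Q
Y = unspecified_attributes(A, Q)   # attributes available for ranking
print([t["TID"] for t in S])
print(Y)
```

The ranking problem is then to order the tuples in `S` using only their values on `Y`.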
System Architecture
Intuition for Ranking Function
Select * From D Where City=“Seattle” And View=“Waterfront”
Score of a result tuple t depends on:
Global Score: global importance of unspecified attribute values
  E.g., homes with good school districts are globally desirable
Conditional Score: correlations between specified and unspecified attribute values
  E.g., Waterfront → BoatDock
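Probabilistic Model in IR
Bayes’ Rule: $p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}$
Product Rule: $p(a, b \mid c) = p(a \mid c)\, p(b \mid a, c)$
Document t, Query Q
R: relevant document set
$\bar{R} = D - R$: irrelevant document set
$$Score(t) = \frac{p(R \mid t)}{p(\bar{R} \mid t)} = \frac{p(t \mid R)\, p(R) / p(t)}{p(t \mid \bar{R})\, p(\bar{R}) / p(t)} \propto \frac{p(t \mid R)}{p(t \mid \bar{R})}$$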
Adaptation of PIR to DB
Tuple t is considered as a document
Partition t into t(X) and t(Y)
t(X) and t(Y) are written as X and Y
Derive from the initial scoring function until the final ranking function is obtained
Preliminary Derivation
Limited Independence Assumptions
Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed
$p(X \mid C) = \prod_{x \in X} p(x \mid C)$
$p(Y \mid C) = \prod_{y \in Y} p(y \mid C)$
Continuing Derivation
Workload-based Estimation of p(y | R)
Assume a collection of “past” queries exists in the system
Workload W is represented as a set of “tuples”
Given query Q and specified attribute set X, approximate R as all query “tuples” in W that also request X
All properties of the relevant tuple set R can be obtained by examining only the subset of the workload that contains queries that also request X
p(y | R) ≈ p(y | X, W)
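One way to read this approximation is as a relative frequency over the matching part of the workload. The sketch below assumes each past query is stored as a dict of the attribute values it requested, and that "requests X" means the past query specifies the same attribute-value pairs as the current query; the function name `p_y_given_X_W` and the toy workload are hypothetical.

```python
def p_y_given_X_W(y, X, workload):
    """Estimate p(y | X, W) as a relative frequency over past queries."""
    attr, val = y
    # Keep only the past queries that also request every condition in X.
    matching = [q for q in workload if all(q.get(a) == v for a, v in X.items())]
    if not matching:
        return 0.0  # assumption: no evidence in the workload means probability 0
    hits = sum(1 for q in matching if q.get(attr) == val)
    return hits / len(matching)


# Hypothetical workload of five past queries.
W = [
    {"City": "Seattle", "View": "Waterfront", "BoatDock": "Yes"},
    {"City": "Seattle", "View": "Waterfront", "BoatDock": "Yes", "SchoolDistrict": "Excellent"},
    {"City": "Seattle", "View": "Waterfront"},
    {"City": "Seattle", "View": "Waterfront", "Pool": "Yes"},
    {"City": "Bellevue", "BoatDock": "Yes"},
]
X = {"City": "Seattle", "View": "Waterfront"}

estimate = p_y_given_X_W(("BoatDock", "Yes"), X, W)
print(estimate)
```

Four of the five past queries request X, and two of those also ask for a boat dock, so the estimate here is 2/4.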
Final Ranking Function
Pre-computing Atomic Probabilities in Ranking Function
p(y | W): relative frequency of y in W
p(y | D): relative frequency of y in D
p(x | y, W): (# of tuples in W that contain x, y) / total # of tuples in W
p(x | y, D): (# of tuples in D that contain x, y) / total # of tuples in D
Example for Computing Atomic Probabilities
Select * From D Where City=“Seattle” And View=“Waterfront”
Y={SchoolDistrict, BoatDock, …}
D = 10,000   W = 1,000   W{excellent} = 10   W{waterfront & yes} = 5
p(excellent|W) = 10/1,000 = 0.01
p(excellent|D) = 10/10,000 = 0.001
p(waterfront|yes,W) = 5/1,000 = 0.005
p(waterfront|yes,D) = 5/10,000 = 0.0005
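The counting behind these atomic quantities can be sketched in a few lines of Python. The sketch follows the slides' definition, dividing pair counts by the total number of tuples; a strict conditional probability p(x | y) would instead divide by the number of tuples that contain y. The function name `atomic_probabilities` and the four toy rows are hypothetical.

```python
from collections import Counter
from itertools import permutations


def atomic_probabilities(rows):
    """Return (single, pair) relative-frequency tables for a list of dict rows."""
    n = len(rows)
    single = Counter()
    pair = Counter()
    for row in rows:
        items = list(row.items())  # (attribute, value) pairs of this tuple
        single.update(items)
        # Ordered pairs (x, y) of values that co-occur in the same tuple.
        pair.update(permutations(items, 2))
    p_single = {k: c / n for k, c in single.items()}
    p_pair = {k: c / n for k, c in pair.items()}
    return p_single, p_pair


# Hypothetical table; the same routine would be run once over W and once over D.
rows = [
    {"View": "Waterfront", "BoatDock": "Yes"},
    {"View": "Waterfront", "BoatDock": "No"},
    {"View": "Greenbelt", "BoatDock": "No"},
    {"View": "Waterfront", "BoatDock": "Yes"},
]
p_single, p_pair = atomic_probabilities(rows)
print(p_single[("View", "Waterfront")])
print(p_pair[(("View", "Waterfront"), ("BoatDock", "Yes"))])
```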
Indexing Atomic Probabilities
p(y | W)
  {AttName, AttVal, Prob}
  B+ tree index on (AttName, AttVal)
p(y | D)
  {AttName, AttVal, Prob}
  B+ tree index on (AttName, AttVal)
p(x | y, W)
  {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}
  B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
p(x | y, D)
  {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}
  B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
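As a rough illustration of these auxiliary tables, the sketch below creates two of them in an in-memory SQLite database through Python's `sqlite3` module. SQLite is only a stand-in for the RDBMS used in the talk, and its indexes are B-tree based rather than the B+ trees named above. The table names `GlobalProbW` and `CondProbW`, the helper functions, and the inserted probabilities are all illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE GlobalProbW (AttName TEXT, AttVal TEXT, Prob REAL);
CREATE INDEX idx_global_w ON GlobalProbW (AttName, AttVal);

CREATE TABLE CondProbW (
    AttNameLeft TEXT, AttValLeft TEXT,
    AttNameRight TEXT, AttValRight TEXT,
    Prob REAL
);
CREATE INDEX idx_cond_w
    ON CondProbW (AttNameLeft, AttValLeft, AttNameRight, AttValRight);
""")

# Illustrative values only.
conn.execute("INSERT INTO GlobalProbW VALUES (?, ?, ?)",
             ("SchoolDistrict", "Excellent", 0.01))
conn.execute("INSERT INTO CondProbW VALUES (?, ?, ?, ?, ?)",
             ("View", "Waterfront", "BoatDock", "Yes", 0.005))


def lookup_global(att, val):
    """Fetch p(y | W) for one attribute value, or None if it was never seen."""
    row = conn.execute(
        "SELECT Prob FROM GlobalProbW WHERE AttName = ? AND AttVal = ?",
        (att, val)).fetchone()
    return row[0] if row else None


def lookup_cond(x_att, x_val, y_att, y_val):
    """Fetch p(x | y, W) for one pair of attribute values, or None."""
    row = conn.execute(
        "SELECT Prob FROM CondProbW WHERE AttNameLeft = ? AND AttValLeft = ? "
        "AND AttNameRight = ? AND AttValRight = ?",
        (x_att, x_val, y_att, y_val)).fetchone()
    return row[0] if row else None


print(lookup_global("SchoolDistrict", "Excellent"))
print(lookup_cond("View", "Waterfront", "BoatDock", "Yes"))
```

The composite indexes cover exactly the columns used in the equality predicates, so each lookup is a single index probe.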
Scan Algorithm
Preprocessing - Atomic Probabilities Module
  Computes and indexes the quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for all distinct values x and y
Execution
  Select tuples that satisfy the query
  Scan and compute the score for each result tuple
  Return top-K tuples
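The execution phase can be sketched as follows. The transcript does not preserve the final ranking formula, so this sketch assumes it is the product, over unspecified values y, of P(y | W) / P(y | D), times the product over all (x, y) pairs of P(x | y, W) / P(x | y, D), which is consistent with the four atomic quantities listed above. The `eps` fallback for unseen values is an added assumption, and all names and numbers are hypothetical.

```python
import heapq


def score(t, X, Y, pW, pD, pxyW, pxyD, eps=1e-6):
    """Assumed ranking function: product of workload/database probability ratios."""
    s = 1.0
    for y_att in Y:
        y = (y_att, t[y_att])
        s *= pW.get(y, eps) / pD.get(y, eps)               # global part
        for x in X.items():
            s *= pxyW.get((x, y), eps) / pxyD.get((x, y), eps)  # conditional part
    return s


def scan_top_k(table, X, Y, k, pW, pD, pxyW, pxyD):
    """Select tuples satisfying the query, score each one, return the k best."""
    answers = [t for t in table if all(t[a] == v for a, v in X.items())]
    return heapq.nlargest(
        k, answers, key=lambda t: score(t, X, Y, pW, pD, pxyW, pxyD))


# Hypothetical table and pre-computed atomic probabilities.
table = [
    {"TID": 1, "View": "Waterfront", "BoatDock": "No"},
    {"TID": 2, "View": "Waterfront", "BoatDock": "Yes"},
    {"TID": 3, "View": "Greenbelt", "BoatDock": "Yes"},
]
X = {"View": "Waterfront"}
Y = ["BoatDock"]
wf, yes, no = ("View", "Waterfront"), ("BoatDock", "Yes"), ("BoatDock", "No")
pW = {yes: 0.4, no: 0.1}
pD = {yes: 0.2, no: 0.8}
pxyW = {(wf, yes): 0.3, (wf, no): 0.05}
pxyD = {(wf, yes): 0.1, (wf, no): 0.4}

top = scan_top_k(table, X, Y, 2, pW, pD, pxyW, pxyD)
print([t["TID"] for t in top])
```

Here the tuple with a boat dock outranks the one without, because boat docks are requested more often in the workload than they occur in the database, both globally and together with a waterfront view.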
Beyond Scan Algorithm
Scan algorithm is inefficient
  Many tuples in the answer set
Another extreme
  Pre-compute top-K tuples for all possible queries
  Still infeasible in practice
Trade-off solution
  Pre-compute ranked lists of tuples for all possible atomic queries
  At query time, merge ranked lists to get top-K tuples
Two Kinds of Ranked List
CondList Cx
  {AttName, AttVal, TID, CondScore}
  B+ tree index on (AttName, AttVal, CondScore)
GlobList Gx
  {AttName, AttVal, TID, GlobScore}
  B+ tree index on (AttName, AttVal, GlobScore)
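The transcript does not preserve the merge procedure itself, so the following is a generic threshold-style merge in the spirit of Fagin's Threshold Algorithm, not necessarily the authors' exact method. It assumes each ranked list is given as a dict from TID to a nonnegative score, that a tuple's overall score is the product of its per-list scores, and that only tuples present in every list belong to the answer. The name `threshold_merge` and the sample scores are hypothetical.

```python
import heapq
from math import prod


def threshold_merge(lists, k):
    """Return the k (TID, score) pairs with the largest product of list scores."""
    ranked = [sorted(l.items(), key=lambda kv: kv[1], reverse=True) for l in lists]
    scores = {}   # aggregated score of every tuple seen so far
    seen = set()
    for depth in range(max(len(r) for r in ranked)):
        frontier = []
        for r in ranked:
            if depth >= len(r):
                # List exhausted: no unseen tuple can occur in all lists.
                frontier.append(0.0)
                continue
            tid, s = r[depth]          # sorted access
            frontier.append(s)
            if tid not in seen:
                seen.add(tid)
                if all(tid in l for l in lists):   # random access
                    scores[tid] = prod(l[tid] for l in lists)
        # No unseen tuple can score higher than the product of frontier scores.
        threshold = prod(frontier)
        top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top                 # early termination
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])


# Hypothetical CondList and GlobList scores for four tuples.
cond_list = {1: 0.9, 2: 0.5, 3: 0.1, 4: 0.8}
glob_list = {1: 0.2, 2: 0.9, 3: 1.0, 4: 0.5}

result = threshold_merge([cond_list, glob_list], 2)
print([tid for tid, _ in result])
```

In this example the merge stops after reading three entries from each list, since by then the second-best score found already meets the threshold.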
Index Module
List Merge Algorithm
Experimental Setup
Datasets:
  MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
  Internet Movie Database (http://www.imdb.com)
Software and Hardware:
  Microsoft SQL Server 2000 RDBMS
  P4 2.8-GHz PC, 1 GB RAM
  C#, connected to RDBMS through DAO
Quality Experiments
Conducted on Seattle Homes and Movies tables
Collect a workload from users
Compare the Conditional Ranking Method in the paper with the Global Method [CIDR03]
Quality Experiment – Average Precision
For each query Qi , generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples
Let each user mark 10 tuples in Hi as most relevant to Qi
Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
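The slides do not spell out the exact formula for "how closely" the two sets match. One plausible reading, sketched below as an assumption, is the fraction of the algorithm's returned tuples that the user also marked as relevant. The function name `precision_overlap` and the TIDs are hypothetical.

```python
def precision_overlap(user_marked, algorithm_top):
    """Fraction of the algorithm's returned tuples that the user also marked."""
    returned = set(algorithm_top)
    return len(set(user_marked) & returned) / len(returned)


# Hypothetical TIDs: the user marks 10 of the 30 tuples in Hi,
# and the algorithm returns its own top 10.
user_marked = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
algorithm_top = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

precision = precision_overlap(user_marked, algorithm_top)
print(precision)
```

Averaging this value over all queries and users would give one precision figure per ranking algorithm.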
Quality Experiment – Fraction of Users Preferring Each Algorithm
5 new queries
Users were given the top-5 results
Performance Experiments
Datasets
Table            NumTuples    Database Size (MB)
Seattle Homes    17463        1.936
US Homes         1380762      140.432
Compare 2 algorithms:
  Scan algorithm
  List Merge algorithm
Performance Experiments – Pre-computation Time
Performance Experiments – Execution Time
Conclusion and Open Problems
Automatic ranking for the many-answers problem
Adaptation of PIR to DB
Open problems:
  Multiple-table queries
  Non-categorical attributes