Probabilistic Ranking of Database Query Results
Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik
Vagelis Hristidis VLDB 2004 2
Roadmap
Problem Definition
Architecture
Probabilistic Information Retrieval
Performance
Experiments
Related Work
Conclusion
Motivation
SQL Returns Unordered Sets of Results
Overwhelms Users of Information Discovery Applications
How Can Ranking be Introduced, Given that ALL Results Satisfy Query?
Example – Realtor Database
House Attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year
Query: City = 'Seattle' AND Waterfront = TRUE
Too Many Results!
Intuitively, Houses with lower Price, more Bedrooms, or BoatDock are generally preferable
Rank According to Unspecified Attributes
Score of a Result Tuple t depends on
Global Score: Global Importance of Unspecified Attribute Values [CIDR2003]
E.g., Newer Houses are generally preferred
Conditional Score: Correlations between Specified and Unspecified Attribute Values
E.g., Waterfront → BoatDock, Many Bedrooms → Good School District
Key Problems
Given a Query Q, How to Combine the Global and Conditional Scores into a Ranking Function. Use Probabilistic Information Retrieval (PIR).
How to Calculate the Global and Conditional Scores. Use Query Workload and Data.
Architecture
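PIR Review
Bayes' Rule:
\[ p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)} \]
Product Rule:
\[ p(a, b \mid c) = p(a \mid c)\, p(b \mid a, c) \]
Document (Tuple) t, Query Q
R: Relevant Documents
\(\bar{R} = D - R\): Irrelevant Documents
\[ Score(t) = \frac{p(R \mid t)}{p(\bar{R} \mid t)} = \frac{p(t \mid R)\, p(R) / p(t)}{p(t \mid \bar{R})\, p(\bar{R}) / p(t)} \propto \frac{p(t \mid R)}{p(t \mid \bar{R})} \]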
Ranking Function – Adapt PIR
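Query Q: X1=x1 AND … AND Xs=xs, X = {X1, …, Xs}
Result-Tuple t(X,Y), where X: Specified Attributes, Y: Unspecified Attributes
\[ Score(t) \propto \frac{p(t \mid R)}{p(t \mid D)} = \frac{p(X, Y \mid R)}{p(X, Y \mid D)} = \frac{p(X \mid R)\, p(Y \mid X, R)}{p(X \mid D)\, p(Y \mid X, D)} \propto \frac{p(Y \mid R)}{p(Y \mid X, D)} \]
Assumptions: \(\bar{R} \approx D\). The factor p(X | R) / p(X | D) is common to all result tuples. R satisfies X, so p(Y | X, R) = p(Y | R).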
Ranking Function – Limited Conditional Independence
Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed
\[ p(X \mid C) = \prod_{x \in X} p(x \mid C) \], where C involves Y, R, D
\[ p(Y \mid C) = \prod_{y \in Y} p(y \mid C) \], where C involves X, R, D
\[ Score(t) \propto \prod_{y \in Y} \frac{p(y \mid R)}{p(y \mid D)} \prod_{y \in Y} \prod_{x \in X} \frac{1}{p(x \mid y, D)} \]
Use Workload: p(y | R)
Use Data: p(y | D), p(x | y, D)
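One way to fill in the step between the previous slide's ratio p(Y | R) / p(Y | X, D) and the product form above is sketched here. It assumes the independence of Y values is applied with C = (X, D), the independence of X values with C = (y, D), and that p(X | D) is dropped because it is the same for every result tuple; the slide itself does not spell this step out.

```latex
\begin{align*}
p(Y \mid X, D)
  &= \prod_{y \in Y} p(y \mid X, D)
     && \text{(Y values independent given } X, D\text{)} \\
  &= \prod_{y \in Y} \frac{p(y \mid D)\, p(X \mid y, D)}{p(X \mid D)}
     && \text{(Bayes' rule)} \\
  &\propto \prod_{y \in Y} p(y \mid D) \prod_{y \in Y} \prod_{x \in X} p(x \mid y, D)
     && \text{(X values independent given } y, D\text{)} \\
p(Y \mid R)
  &= \prod_{y \in Y} p(y \mid R)
\end{align*}
```

Dividing the last line by the third gives the score as a product of per-value ratios p(y | R) / p(y | D) and factors 1 / p(x | y, D).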
Atomic Probabilities Estimation Using Workload W
If Many Queries Specify Set X of Conditions then there is Preference Correlation between Attributes in X.
Global: E.g., If Many Queries ask for Waterfront then p(Waterfront=TRUE) is high.
Conditional: E.g., If Many Queries ask for 4-Bedroom Houses in Good School Districts, then p(Bedrooms=4 | SchoolDistrict='good'), p(SchoolDistrict='good' | Bedrooms=4) are high.
\[ p(y \mid R) \approx p(y \mid X, W) \propto p(y \mid W) \prod_{x \in X} p(x \mid y, W) \]
Using Limited Conditional Independence
Global Part: p(y | W)
Conditional Part: \(\prod_{x \in X} p(x \mid y, W)\)
Probabilities p(x | y, W) (p(x | y, D)) Calculated Using Standard Association Rule Mining Techniques on W (D)
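As a rough illustration of the workload estimates, the sketch below counts how often conditions appear, and co-occur, in a toy workload. It assumes that p(y | W) is approximated by the fraction of workload queries that specify y, and p(x | y, W) by the fraction of queries specifying y that also specify x. The function name, the data layout, and the example workload are hypothetical, and plain counting stands in for the association rule mining mentioned on the slide.

```python
from collections import Counter
from itertools import permutations


def estimate_workload_probabilities(workload):
    """Estimate atomic probabilities from a workload (hypothetical helper).

    workload: list of queries, each a set of (attribute, value) conditions.
    Returns (p_global, p_cond) where
      p_global[y]    ~ p(y | W): fraction of queries that specify y
      p_cond[(x, y)] ~ p(x | y, W): fraction of queries specifying y
                       that also specify x
    """
    single = Counter()  # how many queries specify each condition
    pair = Counter()    # how many queries specify both x and y (ordered)
    for query in workload:
        conds = set(query)
        single.update(conds)
        pair.update(permutations(conds, 2))
    n = len(workload)
    p_global = {y: count / n for y, count in single.items()}
    p_cond = {(x, y): count / single[y] for (x, y), count in pair.items()}
    return p_global, p_cond


# Toy workload: three queries ask for waterfront, two of them also for a dock.
workload = [
    {("Waterfront", True), ("BoatDock", True)},
    {("Waterfront", True), ("BoatDock", True)},
    {("Waterfront", True)},
    {("Bedrooms", 4), ("SchoolDistrict", "good")},
]
p_global, p_cond = estimate_workload_probabilities(workload)
```

The same counting applied to the rows of the database, instead of the queries of the workload, would give the corresponding estimates over D.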
Performance
Scan Algorithm
Preprocessing - Atomic Probabilities Module
Computes and Indexes the Quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for All Distinct Values x and y
Execution
Select Tuples that Satisfy the Query
Scan and Compute Score for Each Result-Tuple
Return Top-K Tuples
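The execution steps of Scan can be sketched as follows. This is a minimal in-memory illustration, not the authors' implementation: it assumes the four atomic-probability tables have already been computed and are stored as dictionaries keyed by (attribute, value) pairs, and all names and numbers are hypothetical.

```python
import heapq


def score(row, query, pW, pD, pxyW, pxyD):
    """Product over unspecified values y of p(y|W)/p(y|D),
    times the product over specified values x of p(x|y,W)/p(x|y,D)."""
    xs = list(query.items())
    s = 1.0
    for attr, val in row.items():
        if attr in query:
            continue  # specified values are the same for every result tuple
        y = (attr, val)
        s *= pW[y] / pD[y]
        for x in xs:
            s *= pxyW[(x, y)] / pxyD[(x, y)]
    return s


def scan_top_k(table, query, k, pW, pD, pxyW, pxyD):
    # Step 1: select the tuples that satisfy the query.
    selected = [r for r in table if all(r.get(a) == v for a, v in query.items())]
    # Steps 2 and 3: score every selected tuple and keep the k best.
    return heapq.nlargest(
        k, selected, key=lambda r: score(r, query, pW, pD, pxyW, pxyD)
    )


# Hypothetical precomputed probabilities for the query Waterfront = TRUE.
x = ("Waterfront", True)
pW = {("BoatDock", True): 0.5, ("BoatDock", False): 0.1}
pD = {("BoatDock", True): 0.2, ("BoatDock", False): 0.8}
pxyW = {(x, ("BoatDock", True)): 0.9, (x, ("BoatDock", False)): 0.3}
pxyD = {(x, ("BoatDock", True)): 0.6, (x, ("BoatDock", False)): 0.2}

table = [
    {"Waterfront": True, "BoatDock": False},
    {"Waterfront": True, "BoatDock": True},
    {"Waterfront": False, "BoatDock": True},
]
query = {"Waterfront": True}
top = scan_top_k(table, query, 1, pW, pD, pxyW, pxyD)
```

With these made-up numbers the waterfront house with a boat dock ranks first. Because every selected tuple is scored, the cost grows with the number of tuples that satisfy the query.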
List Merge Algorithm
Preprocessing
For Each Distinct Value x of Database, Calculate and Store the Conditional (Cx) and the Global (Gx) Lists as follows
For Each Tuple t Containing x Calculate
\[ CondScore = \prod_{z \in t} \frac{p(x \mid z, W)}{p(x \mid z, D)} \]
\[ GlobScore = \prod_{z \in t} \frac{p(z \mid W)}{p(z \mid D)} \]
and add to Cx and Gx respectively
Sort Cx, Gx by decreasing scores
Execution
Query Q: X1=x1 AND … AND Xs=xs
Execute Threshold Algorithm [Fag01] on the following lists: Cx1,…,Cxs, and Gxb, where Gxb is the shortest list among Gx1,…,Gxs
Final Formula:
\[ Score(t) \propto \prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)} \prod_{y \in Y} \prod_{x \in X} \frac{p(x \mid y, W)}{p(x \mid y, D)} \]
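The execution phase relies on the Threshold Algorithm. The sketch below is a generic, simplified version over in-memory lists, offered only as an illustration: it assumes each list holds (tuple_id, score) pairs sorted by decreasing score, that a tuple's overall score is the product of its scores across the lists, and that tuples missing from any list do not satisfy the query. The function name and example lists are hypothetical, and the details of how the authors combine the Cx and Gx lists are not reproduced here.

```python
import heapq
from math import prod


def threshold_top_k(sorted_lists, k):
    """Top-k by product of per-list scores, stopping early when possible.

    sorted_lists: lists of (tuple_id, score), each sorted by decreasing score.
    Returns a list of (aggregate_score, tuple_id), best first.
    """
    if not sorted_lists or k <= 0:
        return []
    lookup = [dict(lst) for lst in sorted_lists]  # random access by tuple_id
    top = []      # min-heap holding the best k (aggregate, tuple_id) so far
    seen = set()
    depth = 0
    # Once any list is exhausted, no unseen tuple can appear in all lists.
    while all(depth < len(lst) for lst in sorted_lists):
        for lst in sorted_lists:
            tid, _ = lst[depth]  # sorted access
            if tid in seen:
                continue
            seen.add(tid)
            if all(tid in d for d in lookup):
                agg = prod(d[tid] for d in lookup)
                heapq.heappush(top, (agg, tid))
                if len(top) > k:
                    heapq.heappop(top)
        # Upper bound on the aggregate of any tuple not yet seen.
        threshold = prod(lst[depth][1] for lst in sorted_lists)
        depth += 1
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True)


# Hypothetical lists for a two-attribute query: two conditional, one global.
C_x1 = [("t1", 3.0), ("t2", 2.0), ("t3", 0.5)]
C_x2 = [("t2", 4.0), ("t1", 1.0), ("t3", 0.5)]
G_xb = [("t3", 2.0), ("t1", 1.5), ("t2", 1.0)]
best = threshold_top_k([C_x1, C_x2, G_xb], 2)
```

In this toy run the loop stops after two rounds of sorted access, because the threshold (3.0) has dropped below the second-best aggregate (4.5). Stopping without reading the lists to the end is what lets a list-merge approach avoid scoring every selected tuple.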
Quality Experiments
Compare our Conditional Ranking Method with the Global Method [CIDR03]
Surveyed 14 MSR employees
Datasets:
MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
Internet Movie Database (http://www.imdb.com)
Each User Behaved According to Various Profiles. E.g.:
singles, middle-class family, rich retirees…
teenage males, people interested in comedies of the 80s…
First Collect Workloads, Then Compare Results of 2 Methods for a Set of Queries
Quality Experiments – Average Precision
        Seattle Homes      Movies
        COND    GLOB       COND    GLOB
Q1      0.70    0.26       0.48    0.35
Q2      0.76    0.62       0.53    0.43
Q3      0.90    0.54       0.58    0.20
Q4      0.84    0.32       0.45    0.48
Q5      0.44    0.48       0.43    0.40
For 5 queries, ask users to Mark 10 out of a Set of 30 likely results containing: the Top-10 results of both the Conditional and Global plus a few randomly selected tuples.
Precision = Recall
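Precision equals recall here because each user marks 10 tuples and each method returns 10, so both ratios share the same numerator and denominator. A small sketch with made-up tuple ids (the helper name is hypothetical):

```python
def precision_recall(returned, marked):
    """Precision and recall of a returned list against user-marked tuples."""
    hits = len(set(returned) & set(marked))
    return hits / len(returned), hits / len(marked)


returned = list(range(10))                   # top-10 of one ranking method
marked = [0, 1, 2, 3, 4, 5, 6, 20, 21, 22]   # the 10 tuples a user marked
precision, recall = precision_recall(returned, marked)
```

Seven of the ten returned tuples were marked, so both values come out to 0.7.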
Quality Experiments - Fraction of Users Preferring Each Algorithm
[Figure: two bar charts, one for the Seattle Homes dataset and one for the Movies dataset, comparing CONDITIONAL and GLOBAL for queries Q1 to Q5. x-axis: Query; y-axis: Fraction Better, 0 to 1.]
Seattle Homes and Movies Datasets
5 new queries
Top-5 Result-lists
Performance Experiments
Microsoft SQL Server 2000 RDBMS
P4 2.8-GHz PC, 1 GB RAM
C#, Connected to RDBMS through DAO
Datasets:
Table            NumTuples    Database Size (MB)
Seattle Homes    17463        1.936
US Homes         1380762      140.432
Compared Algorithms:
LM: List Merge
Scan
Performance Experiments - Precomputation
Time and Space Consumed by Index Module
Datasets    Lists Building Time    Lists Size
Seattle     1500 msec              7.8 MB
US          80000 msec             457.6 MB
Performance Experiments - Execution
Varying Number of Tuples Satisfying Selection Conditions
#Selected Tuples    LM Time (msec)    Scan Time (msec)
350                 800               6515
2000                700               39234
5000                600               115282
30000               550               566516
80000               500               3806531
US Homes Database, 2-Attribute Queries
Performance Experiments - Execution
[Figure: execution time of LM and Scan on the US Homes Database. x-axis: NumSpecifiedAttributes (1, 2, 3); y-axis: Time (msec), 0 to 14000.]
Related Work
[CIDR2003]
Use Workload, Focus on Empty-Answer Problem.
Drawback: Global Ranking Regardless of Query. E.g.: a Tram is Undesirable Near Expensive Houses, but Desirable Near Cheap Ones.
Collaborative Filtering
Require Training Data of Queries and their Ranked Results
Relevance-Feedback Techniques for Learning Similarity in Multimedia and Relational Databases
Conclusions – Future Work
Conclusions
Completely Automated Approach for the Many-Answers Problem which Leverages Data and Workload Statistics and Correlations
Based on PIR
Future Work
Empty-Answer Problem
Handle Plain Text Attributes
Questions?
Performance Experiments - Execution
[Figure: execution time of LM, LMM, and Scan on the Seattle Homes Database. x-axis: NumSelectedTuples, 0 to 4000; y-axis: Time (msec), 0 to 60000.]
LMM: List Merge where lists for one of the two specified attributes are missing, halving space