![Page 1: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/1.jpg)
Automated Ranking of Database Query Results
Sanjay Agrawal, Surajit Chaudhuri, Gautam Das, Aristides Gionis
Presented By: Upa Gupta
![Page 2: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/2.jpg)
Contents
Introduction
IDF Similarity
QF Similarity
Breaking Ties
Implementation
ITA Algorithm
Conclusion
![Page 3: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/3.jpg)
Introduction
Databases use a Boolean query model.
E.g. SELECT * WHERE MFR_Country = “Germany” AND Type = “Sports” AND MFR = “Volkswagen”
Problems with this model:
Empty Answers: a query that is too selective returns an empty result set.
Many Answers: a query that is too general returns too many results.
![Page 4: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/4.jpg)
Introduction
Ranking of database query results using IR techniques.
Applying the TF-IDF concept to databases, based on the frequency of attribute values.
TF-IDF needs to be extended to numerical domains; IDF Similarity is discussed in the paper.
Collecting the WORKLOAD and using it for ranking: QF Similarity, which leverages workload information.
![Page 5: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/5.jpg)
Introduction
The Many Answers problem is addressed using Top-K query processing.
An Index-based Threshold Algorithm (ITA) is developed that exploits IDF/QF Similarity.
![Page 6: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/6.jpg)
IDF Similarity
What is the TF-IDF technique?
Given a set of documents and a query, documents are ranked based on the TF and IDF of the words in the document.
Adapting the IDF concept to a database containing only categorical attributes:
T = <t1, …, tm>: values of the attributes A1, …, Am in a tuple
n: number of tuples in the database
![Page 7: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/7.jpg)
IDF Similarity
For every value t in the domain of attribute A:
Frequency F(t) is defined as the number of tuples having A = t.
IDF is calculated as: IDF(t) = log(n/F(t))
For a pair of values u and v in the domain of attribute A: S(u,v) = IDF(u) if u = v, otherwise 0
For tuple T = <t1, …, tm> and query Q = <q1, …, qm> over all the attributes (A1…Am):
$$\mathrm{SIM}(T,Q) = \sum_{k=1}^{m} S_k(t_k, q_k)$$
![Page 8: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/8.jpg)
IDF Similarity
Example:
Query Q: SELECT * WHERE MFR_Country = “Germany” AND Type = “Sports” AND MFR = “Volkswagen”
| CAR_ID | MODEL | MFR | MFR_Country | Type |
|---|---|---|---|---|
| 1 | SLR | Mercedes | Germany | Sports |
| 2 | A6 | Audi | Germany | Executive |
| 3 | R8 | Audi | Germany | Sports |
| 4 | Gallardo | Lamborghini | Italy | Sports |
![Page 9: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/9.jpg)
IDF Similarity
n = 4
F(MFR_Country = Germany) = 3
IDF(MFR_Country = Germany) = log(n/F(MFR_Country = Germany)) = log(4/3) = 0.2877 (natural logarithm)
Similarly:
IDF(MFR_Country = Italy) = 1.3863
IDF(MFR = Audi) = 0.6931
IDF(MFR = Lamborghini) = 1.3863
IDF(MFR = Mercedes) = 1.3863
IDF(Type = Sports) = 0.2877
IDF(Type = Executive) = 1.3863
Similarity of the 1st tuple with Q:
SIM(T,Q) = S(Germany, Germany) + S(Sports, Sports) + S(Mercedes, Volkswagen)
= IDF(MFR_Country = Germany) + IDF(Type = Sports) + 0
= 0.2877 + 0.2877 + 0 = 0.5754
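The worked example can be checked with a minimal Python sketch. It assumes the natural logarithm (which matches the slide's values up to rounding) and stores the four-car table as a list of dictionaries. The names `CARS`, `idf_table`, and `idf_similarity` are illustrative and do not come from the paper.

```python
import math
from collections import Counter

# The four-car table from the example (an in-memory stand-in for the database).
CARS = [
    {"CAR_ID": 1, "MODEL": "SLR", "MFR": "Mercedes", "MFR_Country": "Germany", "Type": "Sports"},
    {"CAR_ID": 2, "MODEL": "A6", "MFR": "Audi", "MFR_Country": "Germany", "Type": "Executive"},
    {"CAR_ID": 3, "MODEL": "R8", "MFR": "Audi", "MFR_Country": "Germany", "Type": "Sports"},
    {"CAR_ID": 4, "MODEL": "Gallardo", "MFR": "Lamborghini", "MFR_Country": "Italy", "Type": "Sports"},
]


def idf_table(rows, attr):
    """Map each value t of `attr` to IDF(t) = log(n / F(t)) (natural log, assumed)."""
    n = len(rows)
    freq = Counter(row[attr] for row in rows)
    return {value: math.log(n / count) for value, count in freq.items()}


def idf_similarity(row, query, rows):
    """SIM(T, Q): add IDF(q_k) for every query attribute whose value matches the tuple."""
    total = 0.0
    for attr, q_value in query.items():
        if row[attr] == q_value:
            total += idf_table(rows, attr)[q_value]
    return total


QUERY = {"MFR_Country": "Germany", "Type": "Sports", "MFR": "Volkswagen"}
scores = {row["CAR_ID"]: round(idf_similarity(row, QUERY, CARS), 4) for row in CARS}
print(scores)
```

Tuples 1 and 3 match both Germany and Sports and therefore tie on the highest score; no tuple gains anything from the MFR condition because no car in the table is a Volkswagen.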
![Page 10: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/10.jpg)
IDF Similarity
Consider a numeric attribute in the DB, e.g. PRICE.
SIMPLE SOLUTION: discretize the data into ranges.
Consider two ranges: (0, 50) and (51, 100).
Problem: values 49 and 52 are considered completely dissimilar.
Instead, the frequency of a numeric value t of an attribute is defined as
$$F(t) = \sum_{i=1}^{n} e^{-\frac{1}{2}\left(\frac{t_i - t}{h}\right)^2}$$
i.e. the sum of contributions to t from every value t_i in the database.
IDF(t) = log(n/F(t)), where h is the bandwidth parameter.
S(t,q) = density at t of a Gaussian distribution centered at q, scaled by IDF(q):
$$S(t,q) = e^{-\frac{1}{2}\left(\frac{t - q}{h}\right)^2}\,\mathrm{IDF}(q)$$
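The numeric definitions can be sketched in a few lines of Python. This assumes the Gaussian-kernel forms F(t) = Σ exp(-½((t_i - t)/h)²) and S(t,q) = exp(-½((t - q)/h)²)·IDF(q); the price list and the bandwidth `H = 10.0` are made-up values for illustration only.

```python
import math


def numeric_frequency(t, values, h):
    """F(t): sum of Gaussian contributions to t from every value t_i in the column."""
    return sum(math.exp(-0.5 * ((ti - t) / h) ** 2) for ti in values)


def numeric_idf(t, values, h):
    """IDF(t) = log(n / F(t)), with n the number of tuples."""
    return math.log(len(values) / numeric_frequency(t, values, h))


def numeric_similarity(t, q, values, h):
    """S(t, q): Gaussian density at t centered at q, scaled by IDF(q)."""
    return math.exp(-0.5 * ((t - q) / h) ** 2) * numeric_idf(q, values, h)


PRICES = [10, 20, 49, 52, 90]  # hypothetical PRICE column
H = 10.0                       # hypothetical bandwidth parameter

near = numeric_similarity(49, 52, PRICES, H)
far = numeric_similarity(10, 52, PRICES, H)
print(near, far)
```

Unlike the discretized ranges, 49 and 52 now come out as highly similar, while 10 and 52 score close to zero.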
![Page 11: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/11.jpg)
IDF Similarity
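Consider the following query:
SELECT * WHERE MFR_Country IN (“Germany”, “Italy”, “Japan”)
Each attribute Ak now has a set Qk of query values, and the best-matching value is used:
$$\mathrm{SIM}(T,Q) = \sum_{k=1}^{m} \max_{q \in Q_k} S_k(t_k, q)$$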
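A small sketch of the IN-query case, assuming the categorical IDF similarity (IDF(q) on a match, 0 otherwise), the natural logarithm, and that the IN list applies to the country attribute of the four-car table. The function names are illustrative.

```python
import math

# Country column of the four-car example table.
CARS = [
    {"CAR_ID": 1, "MFR_Country": "Germany"},
    {"CAR_ID": 2, "MFR_Country": "Germany"},
    {"CAR_ID": 3, "MFR_Country": "Germany"},
    {"CAR_ID": 4, "MFR_Country": "Italy"},
]


def idf(rows, attr, value):
    """IDF(value) = log(n / F(value)); only called for values present in the table."""
    count = sum(1 for row in rows if row[attr] == value)
    return math.log(len(rows) / count)


def sim_in_query(row, query, rows):
    """SIM(T, Q) = sum over attributes of the max over q in Q_k of S_k(t_k, q)."""
    total = 0.0
    for attr, allowed in query.items():
        # S_k(t_k, q) is IDF(q) when t_k == q and 0 otherwise.
        total += max(idf(rows, attr, q) if row[attr] == q else 0.0 for q in allowed)
    return total


QUERY = {"MFR_Country": ["Germany", "Italy", "Japan"]}
in_scores = {row["CAR_ID"]: sim_in_query(row, QUERY, CARS) for row in CARS}
print(in_scores)
```

The Italian car scores highest because Italy is the rarer value in the table; Japan contributes nothing since no tuple matches it.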
![Page 12: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/12.jpg)
QF Similarity
Problems with IDF:
In a realtor database, more homes were built in recent years such as 2007 and 2008 than in 1980 and 1981. Thus recent years have a small IDF, yet newer homes are in higher demand.
In a bookstore DB, demand for an author is driven by factors other than the number of books the author has written.
![Page 13: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/13.jpg)
QF Similarity
WORKLOAD: past queries.
The importance of attribute values is determined by the frequency of their occurrence in the workload.
As in the example above, queries requesting homes built in 2010 are more frequent than queries for the year 1981.
![Page 14: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/14.jpg)
QF Similarity
For categorical data:
RQF(q) = raw frequency of occurrence of value q of attribute A in the query strings of the workload
RQFMax = raw frequency of the most frequently occurring value in the workload
Query frequency: QF(q) = RQF(q)/RQFMax
S(t, q) = QF(q) if q = t, otherwise 0
QF resembles TF.
![Page 15: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/15.jpg)
QF Similarity
Consider a workload containing the following values of attribute TYPE:
{Sports, Executive, Luxury, Sports, Sports, Executive}
RQF(Executive) = 2 and RQFMax = RQF(Sports) = 3, so
QF(Executive) = RQF(Executive)/RQFMax = 2/3
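The same computation as a short Python sketch. The workload is reduced to the list of TYPE values from the example; `query_frequency` is an illustrative name, not one used in the paper.

```python
from collections import Counter


def query_frequency(workload_values):
    """QF(q) = RQF(q) / RQFMax for every value q seen in the workload."""
    raw = Counter(workload_values)   # RQF(q)
    rqf_max = max(raw.values())      # RQFMax
    return {value: count / rqf_max for value, count in raw.items()}


TYPE_WORKLOAD = ["Sports", "Executive", "Luxury", "Sports", "Sports", "Executive"]
qf = query_frequency(TYPE_WORKLOAD)
print(qf)
```

The most frequently requested value always gets QF = 1, and every other value is scaled relative to it.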
![Page 16: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/16.jpg)
QF Similarity
Similarity between pairs of different categorical attribute values can also be derived from the workload, e.g. to find S(Audi, Mercedes).
The similarity coefficient between t and q in this case is defined by the Jaccard coefficient scaled by the QF factor:
S(t,q) = J(W(t),W(q)) · QF(q)
W(t) = subset of queries in workload W in which categorical value t occurs in an IN clause
J(W(t),W(q)) = |W(t) ∩ W(q)| / |W(t) ∪ W(q)|
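A hedged sketch of the workload-based similarity. It assumes the Jaccard coefficient is multiplied by QF(q), so that S(q,q) reduces to QF(q) as in the t = q case. The workload below is hypothetical: each entry is the set of MFR values in one query's IN clause.

```python
from collections import Counter

# Hypothetical workload: the MFR values appearing in each query's IN clause.
WORKLOAD = [
    {"Audi", "Mercedes"},
    {"Audi", "Mercedes", "BMW"},
    {"Audi"},
    {"Lamborghini"},
]


def queries_containing(value, workload):
    """W(t): indices of the workload queries whose IN clause mentions `value`."""
    return {i for i, in_values in enumerate(workload) if value in in_values}


def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B|, defined as 0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def query_frequency(value, workload):
    """QF(q) = RQF(q) / RQFMax over all values mentioned in the workload."""
    raw = Counter(v for in_values in workload for v in in_values)
    return raw[value] / max(raw.values())


def workload_similarity(t, q, workload):
    """S(t, q) = J(W(t), W(q)) * QF(q) (multiplicative scaling assumed)."""
    w_t = queries_containing(t, workload)
    w_q = queries_containing(q, workload)
    return jaccard(w_t, w_q) * query_frequency(q, workload)


s_audi_mercedes = workload_similarity("Audi", "Mercedes", WORKLOAD)
print(s_audi_mercedes)
```

Audi and Mercedes are often requested together in this toy workload, so they get a nonzero similarity, whereas Audi and Lamborghini never co-occur and score 0.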
![Page 17: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/17.jpg)
QF-IDF
For QF-IDF similarity:
S(t,q) = QF(q) * IDF(q) when t = q, otherwise 0
![Page 18: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/18.jpg)
BREAKING TIES
If SIM(T1, Q) = SIM(T2, Q), which tuple should be ranked higher?
QF and IDF similarity partition the database into classes of tuples with equal scores.
Q: SELECT * WHERE Type = “Sports” AND MFR_Country = “Germany”
| CAR_ID | MODEL | MFR | MFR_Country | Type |
|---|---|---|---|---|
| 1 | SLR | Mercedes | Germany | Sports |
| 2 | A6 | Audi | Germany | Executive |
| 3 | R8 | Audi | Germany | Sports |
| 4 | Gallardo | Lamborghini | Italy | Sports |
![Page 19: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/19.jpg)
Breaking Ties with QF
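Determine weights of missing attribute values that reflect their “global importance” using the workload.
$$\text{Global Imp} = \sum_{k} \log\big(QF(t_k)\big)$$
where t_k ranges over the tuple's values for the missing attributes (those not specified in the query).
Missing attributes for Q: MFR and Model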
![Page 20: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/20.jpg)
Breaking Ties with QF
Consider a workload with the following values of MFR and Model:
MFR: {Audi, Audi, Lamborghini, Mercedes, Lamborghini, Audi}
Model: {R8, A6, Gallardo, SLR, Gallardo, A6}
For tuple 1 (SLR, Mercedes, Germany, Sports):
QF(SLR) = 1/2 = 0.5
QF(Mercedes) = 1/3 = 0.33
Global Imp = log(0.5) + log(0.33)
Global Imp is negative here because every QF value is at most 1, so each logarithm is at most 0.
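The tie-break between the two German sports cars can be sketched as follows, assuming the natural logarithm and that every missing-attribute value of a tuple occurs at least once in the workload (otherwise QF would be 0 and the logarithm undefined). Function and variable names are illustrative.

```python
import math
from collections import Counter


def query_frequency(workload_values):
    """QF(q) = RQF(q) / RQFMax for every value q seen in the workload."""
    raw = Counter(workload_values)
    rqf_max = max(raw.values())
    return {value: count / rqf_max for value, count in raw.items()}


# Workload values from the example.
QF_TABLES = {
    "MFR": query_frequency(["Audi", "Audi", "Lamborghini", "Mercedes", "Lamborghini", "Audi"]),
    "MODEL": query_frequency(["R8", "A6", "Gallardo", "SLR", "Gallardo", "A6"]),
}


def global_importance(row, missing_attrs, qf_tables):
    """Sum of log(QF(t_k)) over the attributes not specified in the query."""
    return sum(math.log(qf_tables[attr][row[attr]]) for attr in missing_attrs)


car1 = {"MODEL": "SLR", "MFR": "Mercedes", "MFR_Country": "Germany", "Type": "Sports"}
car3 = {"MODEL": "R8", "MFR": "Audi", "MFR_Country": "Germany", "Type": "Sports"}
MISSING = ["MFR", "MODEL"]  # attributes absent from the query Q

gi1 = global_importance(car1, MISSING, QF_TABLES)
gi3 = global_importance(car3, MISSING, QF_TABLES)
print(gi1, gi3)
```

Both values are negative, but only their order matters: tuple 3 (Audi R8) has the larger global importance and would be ranked above tuple 1 under this workload.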
![Page 21: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/21.jpg)
Breaking Ties with IDF
Option 1: tuples with large IDF (values occurring infrequently) of missing attributes are ranked higher.
Drawback: cars which are not popular are ranked higher.
Option 2: tuples with small IDF of missing attributes are ranked higher.
Drawback: cars having a moonroof, which is a desirable feature, will be ranked lower.
![Page 22: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/22.jpg)
Implementation
Pre-processing component
Query–processing component
![Page 23: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/23.jpg)
Implementation
Pre-processing Component
Compute and store a representation of the similarity function (QF-IDF, QF, IDF) in auxiliary database tables.
![Page 24: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/24.jpg)
Implementation
Query Processing Component
Job: retrieving the Top-K results from the database.
ITA Algorithm: uses Fagin’s Threshold Algorithm together with the similarity function.
Sorted Access: along any attribute Ak, TIDs of tuples are retrieved.
Random Access: the entire tuple corresponding to a TID is retrieved.
![Page 25: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/25.jpg)
ITA Algorithm
Initialize Top-K Buffer to empty
Repeat
    For each k = 1 to p
        TID = index of the next tuple retrieved from the ordered list Lk
        T = complete tuple retrieved for TID
        Compute the value of the ranking function for T
        If the rank of T is higher than the rank of the lowest-ranking tuple in the Top-K Buffer, then update the Top-K Buffer
        If the stopping condition has been reached, then exit
    End For
Until the indexes of all tuples have been seen.
![Page 26: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/26.jpg)
ITA Algorithm
Stopping Condition
Hypothetical tuple: the current values a1, …, ap for A1, …, Ap, corresponding to the index seeks on L1, …, Lp, and qp+1, …, qm for the remaining columns, taken directly from the query.
Termination: the similarity of the hypothetical tuple to the query < the similarity of the tuple in the Top-K buffer with the least similarity.
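The loop and the stopping condition can be illustrated with a simplified in-memory sketch. It is not the paper's implementation. It assumes every query attribute has an ordered list (p = m), builds each list by sorting TIDs on the per-attribute similarity, checks the stopping condition once per round rather than after every access, and uses the categorical IDF similarity with the natural logarithm. All names are hypothetical.

```python
import heapq
import math
from collections import Counter

# The four-car example table, keyed by TID (CAR_ID).
CARS = {
    1: {"MFR": "Mercedes", "MFR_Country": "Germany", "Type": "Sports"},
    2: {"MFR": "Audi", "MFR_Country": "Germany", "Type": "Executive"},
    3: {"MFR": "Audi", "MFR_Country": "Germany", "Type": "Sports"},
    4: {"MFR": "Lamborghini", "MFR_Country": "Italy", "Type": "Sports"},
}


def make_idf_score(tuples):
    """Return score(attr, t, q) = IDF(q) if t == q else 0."""
    n = len(tuples)
    freq = {}
    for row in tuples.values():
        for attr, value in row.items():
            freq.setdefault(attr, Counter())[value] += 1

    def score(attr, t_value, q_value):
        if t_value != q_value:
            return 0.0
        return math.log(n / freq[attr][q_value])

    return score


def ita_top_k(tuples, query, score, k):
    """Threshold-style top-k: sorted access per attribute, random access per TID."""
    attrs = list(query)
    # Ordered lists L_k: TIDs by descending per-attribute similarity to q_k.
    lists = {
        a: sorted(tuples, key=lambda tid, a=a: -score(a, tuples[tid][a], query[a]))
        for a in attrs
    }
    seen = set()
    top = []  # min-heap of (similarity, tid); top[0] is the weakest buffered tuple
    for depth in range(len(tuples)):
        last = {}
        for a in attrs:
            tid = lists[a][depth]          # sorted access
            last[a] = tuples[tid][a]
            if tid in seen:
                continue
            seen.add(tid)
            row = tuples[tid]              # random access
            sim = sum(score(b, row[b], query[b]) for b in attrs)
            if len(top) < k:
                heapq.heappush(top, (sim, tid))
            elif sim > top[0][0]:
                heapq.heapreplace(top, (sim, tid))
        # Hypothetical tuple built from the values last seen on each list.
        threshold = sum(score(a, last[a], query[a]) for a in attrs)
        if len(top) == k and threshold < top[0][0]:
            break                          # strict "<", as stated on the slide
    return sorted(top, reverse=True)


top2 = ita_top_k(CARS, {"MFR_Country": "Germany", "Type": "Sports"}, make_idf_score(CARS), 2)
print(top2)
```

With the strict comparison, ties at the threshold force the scan to continue, so on this tiny table the loop reads every list to the end before stopping; the returned tuples are still the two German sports cars.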
![Page 27: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/27.jpg)
ITA for Numeric columns
Consider a query with the condition Ak = qk for a numeric column Ak.
Two index scans are performed on Ak.
The first retrieves TIDs with values > qk in increasing order.
The second retrieves TIDs with values < qk in decreasing order.
We then pick TIDs from the merged stream.
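One way to picture the merged stream is a two-pointer walk outward from qk over a sorted index, always taking whichever side is currently closer to qk (closer values have higher Gaussian similarity). The sketch below makes several assumptions: the index is an in-memory sorted list of (value, TID) pairs, values equal to qk are grouped with the upward scan, and ties in distance go to the upward scan. The data is hypothetical.

```python
import bisect


def merged_tid_stream(index, q):
    """Yield TIDs in order of increasing |value - q| from a value-sorted index."""
    values = [value for value, _ in index]
    up = bisect.bisect_left(values, q)   # upward scan: first entry with value >= q
    down = up - 1                        # downward scan: last entry with value < q
    while down >= 0 or up < len(index):
        take_up = down < 0 or (
            up < len(index) and index[up][0] - q <= q - index[down][0]
        )
        if take_up:
            yield index[up][1]
            up += 1
        else:
            yield index[down][1]
            down -= 1


# Hypothetical PRICE index: (value, TID) pairs sorted by value.
PRICE_INDEX = [(10, 1), (20, 2), (49, 3), (52, 4), (90, 5)]
stream = list(merged_tid_stream(PRICE_INDEX, 50))
print(stream)  # [3, 4, 2, 5, 1]
```

Because the stream is ordered by distance from qk, it can serve as the ordered list for a numeric column in the same way the match-first lists serve categorical columns.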
![Page 28: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/28.jpg)
Conclusion
Automated ranking infrastructure for SQL databases.
Extended TF-IDF based techniques from information retrieval to numeric and mixed data.
Implementation of a ranking function that exploits Fagin’s TA.
![Page 29: Slideshow mới up nè. ^_^](https://reader033.vdocument.in/reader033/viewer/2022060115/557c466fd8b42a23598b528d/html5/thumbnails/29.jpg)
THANK YOU