Fabrizio Silvestri, "Mining query logs to improve web search engines' operations"
DESCRIPTION
22 August 2011, seminar "RUSSIR Summer School Best Practices". Fabrizio Silvestri, "Mining query logs to improve web search engines' operations". High Performance Computing (HPC) is a research group at the ISTI institute in Pisa. One of the main activities of HPC-Lab is studying applications of query log mining to search. In recent years, members of the lab have proposed several results. In this talk we will present three recent results: i) a novel, effective, and efficient query recommendation method based on the concept of Search Shortcuts; ii) a novel recommendation paradigm based on the concept of user task instead of the well-known concept of user query; and iii) a very efficient result diversification algorithm that is based on results from i) and ii).
TRANSCRIPT
Efficient Diversification of Web Search Results
G. Capannini, F. M. Nardini, R. Perego, and F. Silvestri
ISTI - CNR, Pisa, Italy
F. Silvestri - Efficient Diversification of Web Search Results - Yandex Tech Talk 22 August 2011, Moscow
Introduction: SE Results Diversification
• Query: “Vinci”, what’s the user’s intent?
  • Information on Leonardo da Vinci?
  • Information on Vinci, the small village in Tuscany?
  • Information on Vinci, the company?
  • Others?
Query Diversification as a Coverage Problem
• Hypothesis:
  • For each user query, we can tell the set of all its possible intents.
  • For each document in the collection, we can tell all the possible user intents it represents.
  • Each intent of each document is possibly weighted by a value representing how much of the document is devoted to that intent (e.g., 1/2 of document D is related to the intent “digital photography techniques”).
• Goal:
  • Select the set of k documents in the collection covering the maximum amount of intent weight, i.e., maximize the number of satisfied users.
State-of-the-Art Methods
• IASelect:
  • Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. 2009. Diversifying search results. In Proceedings of the Second ACM International Conference on Web Search and Data Mining (WSDM '09), Ricardo Baeza-Yates, Paolo Boldi, Berthier Ribeiro-Neto, and B. Barla Cambazoglu (Eds.). ACM, New York, NY, USA, 5-14.
• xQuAD:
  • Rodrygo L. T. Santos, Craig Macdonald, and Iadh Ounis. 2010. Exploiting query reformulations for Web search result diversification. In Proceedings of the 19th International Conference on World Wide Web. ACM, Raleigh, NC, USA, 881-890.
Diversify (k)
[The formula on the slide was not captured in this transcript. The slide annotates its terms as follows:]
• intents
• the weight
• the weight is the probability of being related to intent c
• d is not pertinent to c
• no doc is pertinent to c
• at least one doc is pertinent to c
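The annotations above match the objective of IASelect (Agrawal et al., cited on the State-of-the-Art slide), so the missing formula was probably of the following form. This is a reconstruction under that assumption, not a copy of the slide. The symbols $R_q$ (candidate results for query $q$) and $C$ (set of intents) are names chosen here; $P(c \mid q)$ and $V(d \mid q, c)$ follow the notation used later in the talk.

```latex
\[
\text{Diversify}(k):\quad
S^{*} = \operatorname*{arg\,max}_{S \subseteq R_q,\ |S| = k} P(S \mid q),
\qquad
P(S \mid q) = \sum_{c \in C} P(c \mid q)
  \Bigl( 1 - \prod_{d \in S} \bigl( 1 - V(d \mid q, c) \bigr) \Bigr)
\]
```

Read term by term: $P(c \mid q)$ weights intent $c$; $1 - V(d \mid q, c)$ is the probability that $d$ is not pertinent to $c$; the product is the probability that no document in $S$ is pertinent to $c$; and one minus the product is the probability that at least one document is.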
Known Results
• Diversify(k) is NP-hard:
  • Reduction from max-weight coverage.
• Diversify(k)’s objective function is sub-modular:
  • It admits a (1-1/e)-approximation algorithm.
  • The algorithm inserts one result at a time, each time choosing the result with the maximum marginal utility.
• Quadratic complexity in the number of results to consider:
  • At each iteration, the complete list of not-yet-inserted results is scanned.
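The greedy procedure described above can be sketched as follows. This is a minimal illustration, assuming the coverage-style objective of IASelect (the probability that at least one selected document satisfies the user's intent). The names `coverage` and `greedy_diversify` and all intent and document values are hypothetical, not taken from the talk.

```python
from typing import Dict, List

def coverage(selected: List[str],
             p_intent: Dict[str, float],
             v: Dict[str, Dict[str, float]]) -> float:
    """Assumed objective: sum_c P(c|q) * (1 - prod_{d in S} (1 - V(d|q,c)))."""
    total = 0.0
    for c, pc in p_intent.items():
        miss = 1.0  # probability that no selected document is pertinent to c
        for d in selected:
            miss *= 1.0 - v[d].get(c, 0.0)
        total += pc * (1.0 - miss)
    return total

def greedy_diversify(candidates: List[str],
                     p_intent: Dict[str, float],
                     v: Dict[str, Dict[str, float]],
                     k: int) -> List[str]:
    """Insert one result at a time, always the one with max marginal utility."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        base = coverage(selected, p_intent, v)
        # Full scan of the not-yet-inserted results: this is the costly step.
        best = max(remaining,
                   key=lambda d: coverage(selected + [d], p_intent, v) - base)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical intents and per-document weights for illustration only.
P_INTENT = {"artist": 0.5, "town": 0.3, "company": 0.2}
V = {
    "d1": {"artist": 0.9},
    "d2": {"artist": 0.8},
    "d3": {"town": 0.7},
    "d4": {"company": 0.6, "artist": 0.1},
}
picked = greedy_diversify(list(V), P_INTENT, V, k=3)
print(picked)
```

Each of the k iterations rescans every remaining candidate, which is where the quadratic cost comes from when k grows with the number of candidates.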
It looks reasonable, but...
• ... we might not diversify at all!
• Consider a query returning a set Rd={a,b,c} of documents and two possible categories g, h.
• The query pertains to each category with the same probability, i.e., P(g|q) = P(h|q) = 1/2.
• The optimal selection is S={a,b}: replacing either a or b with c makes the objective function decrease its value.
d\V   V(x|q,g)   V(x|q,h)
a     1          0
b     1          0
c     1/2        1/2
xQuAD_Diversify(k)
Same problem as before... It may not diversify, at all.
Our Proposal: MaxUtility
[Figure: the query “Vinci”, its specializations with their probabilities, and the sets Rq and S.]
• Query: Vinci
• Specializations: Leonardo da Vinci, Vinci Town, Vinci Group
• Probabilities shown: 5/12, 1/4, 1/3
MaxUtility_Diversify(k)
[The formula on the slide was not captured in this transcript. The slide annotates its terms as follows:]
• Set of possible query specializations
• Probability of query q’ being a specialization of query q
Why is it Efficient?
• By using a simple arithmetic argument we can show that: [formula not captured in this transcript]
• Therefore, we can find the optimal set S of diversified documents by using a sort-based approach.
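The identity referred to above is missing from the transcript. In general, a sort-based approach is optimal whenever the objective can be written as a sum of independent per-document scores, because the best set of size k is then simply the k highest-scoring documents. The sketch below assumes that property and uses made-up scores; it illustrates only this selection step and is not the authors' OptSelect. The function name `top_k_by_utility` and the values in `scores` are hypothetical.

```python
import heapq
from typing import Dict, List

def top_k_by_utility(utility: Dict[str, float], k: int) -> List[str]:
    """Return the k documents with the highest per-document utility.

    Assumes the overall objective is the sum of these per-document values,
    so no marginal-utility rescan is needed. heapq.nlargest keeps a heap of
    size k and runs in O(n log k) for n candidates.
    """
    return heapq.nlargest(k, utility, key=utility.get)

# Hypothetical per-document utilities for illustration only.
scores = {"d1": 0.42, "d2": 0.10, "d3": 0.37, "d4": 0.05, "d5": 0.29}
best = top_k_by_utility(scores, 3)
print(best)
```

Compared with the greedy scan, a single pass with a bounded heap avoids recomputing marginal gains after every insertion.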
OptSelect
The Specialization Set Sq
• It is crucial for OptSelect to have the set of specializations available for each query.
• Our method is, thus, query log-based.
• We use a query recommender system to obtain a set of queries from which Sq is built by including the most popular recommendations (i.e., those whose frequency in the query log is greater than f(q) / s).
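The popularity filter described above can be sketched as follows. This is a minimal illustration that assumes f(q) is the frequency of the original query in the log and s is a tunable parameter; neither is defined on the slide. The function name `build_specialization_set`, the recommendation list, and all frequencies are hypothetical.

```python
from typing import Dict, List

def build_specialization_set(query: str,
                             recommendations: List[str],
                             freq: Dict[str, int],
                             s: float) -> List[str]:
    """Keep recommendations whose log frequency exceeds f(query) / s."""
    threshold = freq.get(query, 0) / s
    return [r for r in recommendations if freq.get(r, 0) > threshold]

# Hypothetical query-log frequencies for illustration only.
log_freq = {
    "vinci": 3000,
    "leonardo da vinci": 900,
    "vinci town": 450,
    "vinci group": 600,
    "vinci airports": 30,
}
recs = ["leonardo da vinci", "vinci town", "vinci group", "vinci airports"]
s_q = build_specialization_set("vinci", recs, log_freq, s=10)
print(s_q)
```

With these numbers the threshold is 300, so the rare recommendation is dropped and the three popular ones are kept.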
Probability Estimation
Usefulness of a Result
Experiments: Settings
• TREC 2009 Web track's Diversity Task framework:
  • ClueWeb-B, a subset of the TREC ClueWeb09 dataset
  • The 50 topics (i.e., queries) provided by TREC
• We evaluate α-NDCG and IA-P
• All the tests were conducted on an Intel Core 2 Quad PC with 8 GB of RAM and Ubuntu Linux 9.10 (kernel 2.6.31-22).
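For readers unfamiliar with α-NDCG, the sketch below computes the unnormalized α-DCG of a ranking, following the usual definition by Clarke et al.: a document's gain for an intent is multiplied by (1 - α) for every earlier document in the ranking that already covered that intent. α-NDCG divides this value by the α-DCG of an ideal ranking. The function name `alpha_dcg`, the judgments, and α = 0.5 are illustrative assumptions; this is not the TREC evaluation code.

```python
import math
from typing import Dict, List, Optional, Set

def alpha_dcg(ranking: List[str],
              judgments: Dict[str, Set[str]],
              alpha: float = 0.5,
              depth: Optional[int] = None) -> float:
    """Unnormalized alpha-DCG of a ranking, truncated at `depth` if given."""
    seen: Dict[str, int] = {}  # times each intent was covered so far
    score = 0.0
    for rank, doc in enumerate(ranking[:depth], start=1):
        gain = 0.0
        for intent in judgments.get(doc, ()):
            gain += (1.0 - alpha) ** seen.get(intent, 0)
            seen[intent] = seen.get(intent, 0) + 1
        score += gain / math.log2(rank + 1)
    return score

# Hypothetical relevance judgments: document -> intents it satisfies.
judged = {"d1": {"artist"}, "d2": {"artist"}, "d3": {"town"}}
redundant = alpha_dcg(["d1", "d2", "d3"], judged)
diverse = alpha_dcg(["d1", "d3", "d2"], judged)
print(redundant, diverse)
```

The ranking that places the second intent earlier scores higher, which is the behaviour a diversity-aware metric is meant to reward.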
Experiments: Quality
Experiments: Efficiency
Conclusions and Future Work
• We studied the problem of search results diversification from an efficiency point of view
• We derived a diversification method (OptSelect):
• same (or better) quality as the state of the art
• up to 100 times faster
• Future work:
• the exploitation of users' search history for personalizing result diversification
• the use of click-through data to improve our effectiveness results, and
• the study of a search architecture performing the diversification task in parallel with the document scoring phase (Done! See DDR2011 paper)
Question Time
Fabrizio Silvestri
ISTI-CNR, Pisa, Italy
http://hpc.isti.cnr.it/[email protected]