Less is More: Probabilistic Model for Retrieving Fewer Relevant Documents
Harr Chen and David R. Karger
MIT CSAIL
SIGIR 2006
4/30/2007
Abstract
• Probability Ranking Principle (PRP)
  – Rank documents in decreasing order of probability of relevance.
• Propose a greedy algorithm that approximately optimizes the following objectives
  – %no metric: the percentage of queries for which no relevant documents are retrieved.
  – The diversity of results.
Introduction
• Probability Ranking Principle
  – Rule of thumb: “optimal”.
• TREC robust track
  – %no metric
  – Question answering and finding a homepage.
• Diversity
  – For example, “Trojan horse”
  – PRP-based method may choose one “most likely” interpretation.
• Greedy algorithm
  – Fill each position in the ranking by assuming that all previous documents in the ranking are not relevant.
Introduction (Cont.)
• Other measures
  – Search length (SL)
  – Reciprocal rank (RR)
  – Instance recall: the number of different subtopics in a given result set.
• Retrieving for Diversity
  – Diversity arises automatically as a consequence of the objective function.
Related Work
• Algorithm
  – Zhai and Lafferty: a risk minimization framework
  – Bookstein: a sequential learning retrieval system
• Diversity
  – Zhai et al.: novelty and redundancy
  – Clustering is an approach to quickly cover a diverse range of query interpretations.
Evaluation Metrics
• MSL (mean search length)
• MRR (mean reciprocal rank)
• %no
  – k-call at n: 1 if at least k of the top n docs returned by the system for the given query are deemed relevant; otherwise 0.
  – mean 1-call: one minus the %no metric
  – n-call at n: perfect precision
• Instance recall at rank n
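The metrics listed above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the paper. The function names are hypothetical, search length is assumed to mean the number of non-relevant docs seen before the first relevant one, and instance recall is assumed to be the fraction of known subtopics covered.

```python
from typing import List, Sequence, Set


def k_call_at_n(rels: Sequence[bool], k: int, n: int) -> int:
    """1 if at least k of the top n results are relevant, otherwise 0."""
    return int(sum(rels[:n]) >= k)


def percent_no(queries: List[Sequence[bool]], n: int) -> float:
    """Fraction of queries with no relevant doc in the top n (1 - mean 1-call)."""
    mean_one_call = sum(k_call_at_n(r, 1, n) for r in queries) / len(queries)
    return 1.0 - mean_one_call


def reciprocal_rank(rels: Sequence[bool]) -> float:
    """1 / rank of the first relevant result; 0.0 if none is relevant."""
    for rank, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def search_length(rels: Sequence[bool]) -> int:
    """Assumed definition: non-relevant docs seen before the first relevant one.
    Returns the list length when nothing is relevant (an arbitrary convention)."""
    for i, rel in enumerate(rels):
        if rel:
            return i
    return len(rels)


def instance_recall_at_n(doc_subtopics: Sequence[Set[str]],
                         all_subtopics: Set[str], n: int) -> float:
    """Fraction of the query's subtopics covered by the top n results."""
    covered = set().union(*doc_subtopics[:n]) & all_subtopics
    return len(covered) / len(all_subtopics)
```

Averaging `reciprocal_rank` and `search_length` over a set of queries gives MRR and MSL.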
Bayesian Retrieval
• Standard Bayesian Information Retrieval
  – The documents in a corpus should be ranked by Pr[r|d]
  – By a monotonic transformation, this is equivalent to ranking by the likelihood ratio Pr[d|r] / Pr[d|¬r]
  – Focus on the objective function, so use a Naïve Bayes framework with multinomial models (θi) as the family of distributions.
  – Determine the parameters (training)
  – Dirichlet prior: prior probability distribution over the parameters (θi).
  – Estimate the parameters of the relevant distribution (i.e., those defining Pr[d|r]).
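The monotonic-transformation step can be made concrete with Bayes' rule. The following is a sketch of one standard argument, assuming Pr[d|r] > 0 and that the prior Pr[r] is the same for every document; it is not necessarily the exact form used on the original slide.

```latex
\Pr[r \mid d]
  = \frac{\Pr[d \mid r]\,\Pr[r]}
         {\Pr[d \mid r]\,\Pr[r] + \Pr[d \mid \neg r]\,\Pr[\neg r]}
  = \frac{1}{1 + \dfrac{\Pr[\neg r]}{\Pr[r]} \cdot \dfrac{\Pr[d \mid \neg r]}{\Pr[d \mid r]}}
```

Because the prior odds are constant across documents, Pr[r|d] increases exactly when the ratio Pr[d|r] / Pr[d|¬r] increases, so ranking by either quantity produces the same order.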
Objective Function
• Consider optimizing for the k-call at n metric.
  – k=1: the probability that at least one of the first n relevance variables is true
  – For arbitrary k: the probability that at least k docs are relevant
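Written out, the two objectives might look as follows. This is a sketch in assumed notation: documents and relevance variables are indexed from 0, as elsewhere in these slides, and the indicator-sum form for general k is an illustrative choice.

```latex
% k = 1: at least one of the first n results is relevant
\Pr\bigl[r_0 \cup r_1 \cup \dots \cup r_{n-1} \mid d_0, \dots, d_{n-1}\bigr]
  = 1 - \Pr\bigl[\neg r_0 \cap \dots \cap \neg r_{n-1} \mid d_0, \dots, d_{n-1}\bigr]

% arbitrary k: at least k of the first n results are relevant
\Pr\Bigl[\,\sum_{i=0}^{n-1} \mathbf{1}[r_i] \ge k \;\Bigm|\; d_0, \dots, d_{n-1}\Bigr]
```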
Optimization Methods
• NP-hard problem
  – Perfectly optimizing the k-call objective function over sets of n docs drawn from a corpus of m docs is NP-hard.
• Greedy algorithm (approximately optimizes it)
  – Successively select each result of the result set.
    1. Select the first result by applying the conventional PRP.
    2. For the ith result, hold results 1 through i-1 at their already selected values, and consider every remaining corpus document as a candidate for document i.
    3. Pick the document with the highest k-call score as the ith result.
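The three steps above can be sketched as a generic selection loop. This is a minimal illustration under stated assumptions: `greedy_select`, `toy_score`, and `TOY_DOCS` are hypothetical names, and the toy scoring function (which discounts documents whose topic is already covered) is only a stand-in for the model-based k-call score, not the paper's probabilistic model.

```python
from typing import Callable, List, Sequence


def greedy_select(corpus: Sequence[str], n: int,
                  score: Callable[[List[str], str], float]) -> List[str]:
    """Fill positions one at a time, holding earlier picks fixed.

    score(selected, candidate) is assumed to return the objective value of
    placing `candidate` next; with an empty `selected` it should reduce to
    the conventional PRP score.
    """
    selected: List[str] = []
    remaining = list(corpus)
    while remaining and len(selected) < n:
        best = max(remaining, key=lambda d: score(selected, d))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: doc id -> (interpretation of the query, stand-alone relevance score).
TOY_DOCS = {
    "a1": ("computer", 0.9),
    "a2": ("computer", 0.8),
    "b1": ("biology", 0.5),
}


def toy_score(selected: List[str], candidate: str) -> float:
    """Stand-in score: heavily discount an interpretation already covered."""
    topic, base = TOY_DOCS[candidate]
    seen = {TOY_DOCS[d][0] for d in selected}
    return base * (0.1 if topic in seen else 1.0)


ranking = greedy_select(list(TOY_DOCS), 3, toy_score)
prp_ranking = sorted(TOY_DOCS, key=lambda d: TOY_DOCS[d][1], reverse=True)
print(ranking)      # greedy order, interleaving interpretations
print(prp_ranking)  # plain sort by stand-alone score
```

Each step scores every remaining document once, so selecting n results from m documents costs on the order of n·m score evaluations rather than a search over all n-document subsets.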
Applying the Greedy Approach
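• k=1
  – First, choose the doc d0 maximizing Pr[r0|d0].
  – Then choose d1 maximizing Pr[r1 | ¬r0, d0, d1], i.e., assuming d0 is not relevant.
  – Choose d2 by maximizing Pr[r2 | ¬r0, ¬r1, d0, d1, d2].
  – In general, select the di that maximizes Pr[ri | ¬r0, ..., ¬r(i-1), d0, ..., di].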
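The reason the k=1 greedy step conditions on earlier documents being non-relevant can be seen from a short derivation. This is a sketch, assuming the relevance of d0 does not depend on which document is placed after it, i.e. Pr[r0 | d0, d1] = Pr[r0 | d0].

```latex
\Pr[r_0 \cup r_1 \mid d_0, d_1]
  = \Pr[r_0 \mid d_0] + \Pr[\neg r_0 \cap r_1 \mid d_0, d_1]
  = \Pr[r_0 \mid d_0] + \Pr[\neg r_0 \mid d_0]\,\Pr[r_1 \mid \neg r_0, d_0, d_1]
```

With d0 already fixed, the only term that depends on d1 is Pr[r1 | ¬r0, d0, d1], so maximizing the 1-call objective over d1 amounts to maximizing that conditional probability. Repeating the argument for position i gives

```latex
d_i = \arg\max_{d} \Pr[r_i \mid \neg r_0, \dots, \neg r_{i-1}, d_0, \dots, d_{i-1}, d]
```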
Applying the Greedy Approach (Cont.)
• k=n (perfect precision)
  – Select the ith document according to:
• 1<k<n
  – The objective is to maximize the probability of having at least k relevant docs in the top n.
  – Focus on k=1 and k=n cases in this paper.
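For the k=n case, a plausible form of the selection rule follows from the chain rule. This is a sketch, assuming the rule mirrors the k=1 case with the conditioning flipped (earlier documents assumed relevant rather than non-relevant) and that the relevance of a document does not depend on documents placed after it.

```latex
\Pr[r_0 \cap \dots \cap r_{n-1} \mid d_0, \dots, d_{n-1}]
  = \prod_{i=0}^{n-1} \Pr[r_i \mid r_0, \dots, r_{i-1}, d_0, \dots, d_i]

d_i = \arg\max_{d} \Pr[r_i \mid r_0, \dots, r_{i-1}, d_0, \dots, d_{i-1}, d]
```

Under this reading, each greedy step behaves like relevance feedback: the documents already chosen are treated as known relevant examples when scoring the next one.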
Optimizing for Other Metrics
• Optimizing 1-call
  – Choose greedily, conditioned on no previous document being relevant.
  – Equivalent to minimizing expected search length and maximizing expected reciprocal rank.
  – Also optimizes the instance recall metric, which measures the number of distinct subtopics retrieved.
    • If a query has t subtopics, then instance recall is the fraction of those t subtopics covered by the result set.
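The link between instance recall and 1-call can be written as an average over subtopics. This is a sketch in assumed notation, where 1-call_j at n is 1 if at least one of the top n docs is relevant to subtopic j and 0 otherwise.

```latex
\text{instance recall at } n
  = \frac{1}{t} \sum_{j=1}^{t} \bigl(\text{1-call}_j \text{ at } n\bigr)
```

Taking expectations on both sides, expected instance recall is the mean of the per-subtopic expected 1-call values, which suggests why a method that optimizes 1-call also tends to help instance recall.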
Google Examples
• Two ambiguous queries: “Trojan horse” and “virus”
  – Used the titles, summaries, and snippets of Google’s results to form a corpus of 1000 docs for each query.
Experiments
• Methods
  – 1-greedy, 10-greedy, and conventional PRP
• Datasets
  – Ad hoc topics from TREC-1, TREC-2, and TREC-3 to set the weight parameters of the model appropriately.
  – TREC 2004 robust track
  – TREC-6, 7, 8 interactive track
  – TREC-4 and TREC-6 ad hoc tracks
Tuning the Weights
• Key weight
  – For the proposed model, the key weights are the strengths of the relevant distribution and irrelevant distribution priors with respect to the strength of the docs.
• TRECs 1, 2, and 3
  – Consisting of about 724,000 docs and 150 topics (topics 51-200)
  – Used for tuning the weights
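To illustrate what "strength of the prior with respect to the strength of the docs" can mean, here is a minimal sketch of a Dirichlet-smoothed multinomial estimate. It is an illustration only, not the paper's exact parameterization. The function name and the `prior_strength` parameter are hypothetical, and every token in the document is assumed to appear in the prior's vocabulary.

```python
from collections import Counter
from typing import Dict, Sequence


def dirichlet_posterior_mean(doc_tokens: Sequence[str],
                             prior_mean: Dict[str, float],
                             prior_strength: float) -> Dict[str, float]:
    """Posterior mean of multinomial word probabilities under a Dirichlet prior.

    prior_strength acts like a pseudo-count total: the larger it is, the more
    the estimate stays near prior_mean and the less the document's own word
    counts move it.
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {
        word: (counts[word] + prior_strength * p) / (total + prior_strength)
        for word, p in prior_mean.items()
    }


prior = {"virus": 0.5, "horse": 0.5}
doc = ["virus", "virus", "virus", "horse"]

weak_prior = dirichlet_posterior_mean(doc, prior, prior_strength=0.0)
strong_prior = dirichlet_posterior_mean(doc, prior, prior_strength=4.0)
print(weak_prior["virus"], strong_prior["virus"])
```

With no prior weight the estimate equals the document's empirical frequencies; as the prior weight grows, the estimate is pulled toward the prior mean. Tuning on held-out topics chooses where to sit between those extremes.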
Robust Track Experiments
• TREC 2004 robust track
  – 249 topics in total, about 528,000 docs
  – 50 topics were selected by TREC as being “difficult” queries.
Instance Retrieval Experiments
• TREC-6, 7, and 8 interactive track
  – Tests how well the methods retrieve diverse results
  – 20 topics in total, with between 7 and 56 aspects each, and about 210,000 docs.
  – Zhai et al.’s LM approach is better for aspect retrieval.
Multiple Annotator Experiments
• TREC-4 and TREC-6
  – Multiple independent annotators are asked to make relevance judgments for the same topics over the same corpus.
  – TREC-4 had three annotators, TREC-6 had two.
Query Analysis
• A specific topic: topic 100
  – The description is:
Conclusions and Future Work
• Conclusions
  – Identify that the PRP is not optimal for every objective, and give an approach to directly optimize other desired objectives.
  – The approach is algorithmically feasible.
• Future work
  – Other objective functions
  – More sophisticated techniques, such as local search algorithms
  – The likelihood of relevance of collections of docs
    • Two-Poisson model
    • Language model