Lecture 3: Retrieval Evaluation
Maya Ramanath
Benchmarking IR Systems
Result Quality
• Data Collection
  – Ex: Archives of the NYTimes
• Query set
  – Provided by experts, identified from real search logs, etc.
• Relevance judgements
  – For a given query, is the document relevant?
Evaluation for Large Collections
• Cranfield/TREC paradigm
  – Pooling of results
• A/B testing
  – Possible for search engines
• Crowdsourcing
  – Let users decide
Precision and Recall
• Relevance judgements are binary – “relevant” or “not relevant”.
  – Partition the collection into 2 parts.
• Precision: fraction of the retrieved documents that are relevant.
• Recall: fraction of the relevant documents that are retrieved.
Can a search engine guarantee 100% recall?
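A minimal Python sketch of the two set-based measures; the document IDs and relevance judgements below are hypothetical and serve only to show the computation.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant                         # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgements for one query
retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant  = ["d2", "d4", "d6", "d7"]

p, r = precision_recall(retrieved, relevant)
print(f"precision = {p:.2f}, recall = {r:.2f}")         # precision = 0.40, recall = 0.50
```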
F-measure
• F-Measure: Weighted harmonic mean of Precision and Recall
Why use harmonic mean instead of arithmetic mean?
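A short sketch of the weighted F-measure (F_beta, with F_1 as the default), together with a made-up example that hints at the answer to the question above: a system that returns the entire collection reaches recall 1.0 with near-zero precision, and the harmonic mean stays near zero while the arithmetic mean would still look respectable.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F_beta)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# "Return everything" system: recall 1.0, precision ~0.01 (invented numbers).
p, r = 0.01, 1.0
print((p + r) / 2)        # arithmetic mean: 0.505
print(f_measure(p, r))    # harmonic mean (F1): ~0.0198
```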
Precision-Recall Curves
• Using precision and recall to evaluate ranked retrieval
Source: Introduction to Information Retrieval. Manning, Raghavan and Schuetze, 2008
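A sketch of how the points of such a curve can be computed from a single ranked list, using the standard interpolation (precision at recall level r is the highest precision observed at any recall ≥ r); the ranking and judgements are invented for illustration.

```python
def pr_points(ranking, relevant):
    """(recall, precision) after each rank position of a ranked result list."""
    relevant = set(relevant)
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    return points

def interpolated_precision(points, recall_level):
    """Highest precision at any recall >= recall_level."""
    candidates = [p for r, p in points if r >= recall_level]
    return max(candidates) if candidates else 0.0

# Hypothetical ranked list and judgements
ranking  = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = ["d3", "d4", "d2"]

points = pr_points(ranking, relevant)
curve = [interpolated_precision(points, level / 10) for level in range(11)]
print(curve)   # 11-point interpolated precision-recall curve
```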
Single Measures
• Precision at k: P@10, P@100, etc.
• and others…
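A sketch of P@k under the same assumptions as above (hypothetical ranking and judgement sets):

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    relevant = set(relevant)
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

ranking  = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = ["d3", "d4", "d2"]
print(precision_at_k(ranking, relevant, 5))   # 2 relevant in the top 5 -> 0.4
```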
Graded Relevance – NDCG
• Highly relevant documents should be given more importance than marginally relevant ones
• The higher a relevant document is ranked, the more valuable it is to the user
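A minimal NDCG sketch, assuming graded gains (0 = not relevant to 3 = highly relevant) are already attached to the returned documents and that every relevant document appears in the list, so the ideal DCG is simply the DCG of the re-sorted gains; a real evaluation would normalise against the full set of judged documents for the query.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG of the ranking, normalised by the DCG of the ideal (re-sorted) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Hypothetical graded judgements, listed in the order the system returned the documents
print(round(ndcg([3, 2, 0, 1, 2]), 3))   # ~0.96
```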
Inter-judge Agreement – Fleiss’ Kappa
N – number of results
n – number of ratings per result
k – number of grades
n_ij – number of judges who agree that the i-th result should have grade j
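A sketch of the kappa computation from the quantities defined above, using numpy; the 4-result, 3-judge rating matrix is invented for illustration.

```python
import numpy as np

def fleiss_kappa(n_ij):
    """Fleiss' kappa for an N x k matrix of rating counts.

    n_ij[i][j] = number of judges assigning grade j to result i;
    every row must sum to the same number of ratings n."""
    n_ij = np.asarray(n_ij, dtype=float)
    N, k = n_ij.shape
    n = n_ij[0].sum()                                         # ratings per result

    p_j = n_ij.sum(axis=0) / (N * n)                          # overall proportion of each grade
    P_i = (np.square(n_ij).sum(axis=1) - n) / (n * (n - 1))   # per-result agreement
    P_bar = P_i.mean()                                        # observed agreement
    P_e = np.square(p_j).sum()                                # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical matrix: 4 results, 3 judges, 2 grades (not relevant / relevant)
ratings = [[3, 0], [2, 1], [0, 3], [1, 2]]
print(round(fleiss_kappa(ratings), 3))   # ~0.333
```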
Tests of Statistical Significance
• Wilcoxon signed rank test
• Student’s paired t-test
• …and more
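A sketch of both paired tests using scipy.stats, comparing two systems on the same queries; the per-query scores (e.g. average precision) are made up for illustration.

```python
from scipy import stats

# Hypothetical per-query scores for two systems on the same 10 queries
system_a = [0.42, 0.55, 0.31, 0.78, 0.60, 0.49, 0.66, 0.35, 0.58, 0.71]
system_b = [0.38, 0.50, 0.33, 0.70, 0.52, 0.44, 0.61, 0.30, 0.55, 0.64]

# Paired tests: each query contributes one score per system
t_stat, t_p = stats.ttest_rel(system_a, system_b)
w_stat, w_p = stats.wilcoxon(system_a, system_b)

print(f"paired t-test: t = {t_stat:.3f}, p = {t_p:.4f}")
print(f"Wilcoxon test: W = {w_stat:.3f}, p = {w_p:.4f}")
```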
END OF MODULE “IR FROM 20000FT”