
Page 1:

Lecture 3: Retrieval Evaluation

Maya Ramanath

Page 2:

Benchmarking IR Systems

Result Quality

• Data Collection – e.g., archives of the NYTimes
• Query set – provided by experts, identified from real search logs, etc.
• Relevance judgements – for a given query, is the document relevant?

Page 3:

Evaluation for Large Collections

• Cranfield/TREC paradigm – pooling of results (sketched below)
• A/B testing – possible for search engines
• Crowdsourcing – let users decide
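
Pooling is easy to picture in code. Below is a minimal Python sketch of how a judgement pool is typically assembled; the run names, document IDs, and pool depth are invented for illustration and are not part of the slides:

# Hypothetical illustration of TREC-style pooling; all identifiers are made up.
from typing import Dict, List, Set

def build_pool(runs: Dict[str, List[str]], depth: int = 100) -> Set[str]:
    """Union of the top-`depth` documents from each system's ranked run.
    Only documents in this pool are shown to human assessors; anything
    outside the pool is treated as not relevant."""
    pool: Set[str] = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool

runs = {
    "system_A": ["d12", "d7", "d3", "d44"],
    "system_B": ["d7", "d90", "d12", "d5"],
}
print(sorted(build_pool(runs, depth=3)))  # documents sent for judging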

Page 4:

Precision and Recall

• Relevance judgements are binary – “relevant” or “not relevant” – partitioning the collection into two parts
• Precision
• Recall (both defined in the sketch below)

Can a search engine guarantee 100% recall?
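
A minimal sketch of the standard set-based definitions (the symbols Ret for the retrieved set and Rel for the relevant set are mine, not the slide's):

\[
  \text{Precision} = \frac{|Ret \cap Rel|}{|Ret|},
  \qquad
  \text{Recall} = \frac{|Ret \cap Rel|}{|Rel|}
\]

Note that returning the entire collection trivially achieves 100% recall (at very low precision), which is presumably the point of the question above.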

Page 5:

F-measure

• F-Measure: Weighted harmonic mean of Precision and Recall

Why use harmonic mean instead of arithmetic mean?
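
In the same notation, the weighted F-measure is conventionally written as (standard formulation, not copied from the slide):

\[
  F_\beta = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R},
  \qquad
  F_1 = \frac{2PR}{P + R} \quad (\beta = 1)
\]

The harmonic mean is dominated by the smaller of P and R, so a system cannot score well by pushing one measure to an extreme at the expense of the other, which the arithmetic mean would reward.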

Page 6:

Precision-Recall Curves

• Using precision and recall to evaluate ranked retrieval

Source: Introduction to Information Retrieval. Manning, Raghavan and Schütze, 2008
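
The curve in the referenced figure comes from computing precision and recall after each rank position; a minimal Python sketch with invented relevance judgements for a single query:

# Hypothetical example: 1 = relevant, 0 = not relevant; judgements are made up.
def pr_points(rels, total_relevant):
    """Precision and recall after each rank position of a ranked result list."""
    points, hits = [], 0
    for k, rel in enumerate(rels, start=1):
        hits += rel
        points.append((hits / k, hits / total_relevant))  # (precision@k, recall@k)
    return points

ranked_relevance = [1, 0, 1, 1, 0, 0, 1, 0]
for p, r in pr_points(ranked_relevance, total_relevant=5):
    print(f"precision={p:.2f}  recall={r:.2f}")

Plotting recall on the x-axis against precision on the y-axis (often with interpolation) gives the familiar sawtooth curve shown in the book.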

Page 7:

Single measures

Precision at k, P@10, P@100, etc.

and others…
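
P@k looks only at the top k positions and ignores recall entirely; a short sketch in the same style as above, again with an invented judgement list:

def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant (1 = relevant, 0 = not)."""
    return sum(rels[:k]) / k if k > 0 else 0.0

ranked_relevance = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
print(precision_at_k(ranked_relevance, 10))   # P@10 = 0.5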

Page 8:

Graded Relevance – NDCG

• Highly relevant documents should carry more weight than marginally relevant ones

• The higher a relevant document appears in the ranking, the more valuable it is to the user
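
One common formulation, with graded relevance rel_i at rank i (the slide's own formula may use the simpler variant with rel_i in the numerator):

\[
  \mathrm{DCG}@p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
  \qquad
  \mathrm{NDCG}@p = \frac{\mathrm{DCG}@p}{\mathrm{IDCG}@p}
\]

where IDCG@p is the DCG of the ideal ordering of the same documents, so NDCG lies in [0, 1].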

Page 9:

Inter-judge Agreement – Fleiss’ Kappa

N – number of results
n – number of ratings per result
k – number of grades
n_ij – number of judges who agree that the i-th result should have grade j
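
With those symbols, the standard Fleiss' kappa computation (reconstructed here; the formula itself does not survive in the transcript) is:

\[
  p_j = \frac{1}{Nn}\sum_{i=1}^{N} n_{ij},
  \qquad
  P_i = \frac{1}{n(n-1)}\left(\sum_{j=1}^{k} n_{ij}^{2} - n\right)
\]
\[
  \bar{P} = \frac{1}{N}\sum_{i=1}^{N} P_i,
  \qquad
  \bar{P}_e = \sum_{j=1}^{k} p_j^{2},
  \qquad
  \kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}
\]

Here \bar{P} is the observed agreement between judges and \bar{P}_e is the agreement expected by chance.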

Page 10:

Tests of Statistical Significance

• Wilcoxon signed rank test
• Student’s paired t-test
• …and more
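
Both tests compare two systems over the same query set by pairing the per-query scores. A minimal SciPy sketch; the score lists (e.g. per-query average precision) are invented:

# Hypothetical per-query scores for two systems on the same eight queries.
from scipy import stats

system_a = [0.42, 0.55, 0.31, 0.67, 0.48, 0.59, 0.36, 0.50]
system_b = [0.40, 0.61, 0.35, 0.70, 0.44, 0.63, 0.41, 0.49]

# Paired tests: each query contributes one score per system.
w_stat, w_p = stats.wilcoxon(system_a, system_b)
t_stat, t_p = stats.ttest_rel(system_a, system_b)
print(f"Wilcoxon signed-rank: p = {w_p:.3f}")
print(f"Student's paired t-test: p = {t_p:.3f}")

A small p-value suggests the observed difference between the systems is unlikely to be due to chance alone.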

Page 11:

END OF MODULE “IR FROM 20000FT”