
Page 1:

Lecture 3: Retrieval Evaluation

Maya Ramanath

Page 2:

Benchmarking IR Systems

Result Quality

• Data Collection – e.g., archives of the NYTimes
• Query set – provided by experts, identified from real search logs, etc.
• Relevance judgements – for a given query, is the document relevant?

Page 3:

Evaluation for Large Collections

• Cranfield/TREC paradigm – pooling of results (sketched below)
• A/B testing – possible for search engines
• Crowdsourcing – let users decide
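
Pooling is easy to picture in code. Below is a minimal Python sketch of how a judgement pool is typically assembled; the run names, document IDs, and pool depth are invented for illustration and are not part of the slides:

# Hypothetical illustration of TREC-style pooling; all identifiers are made up.
from typing import Dict, List, Set

def build_pool(runs: Dict[str, List[str]], depth: int = 100) -> Set[str]:
    """Union of the top-`depth` documents from each system's ranked run.
    Only documents in this pool are shown to human assessors; anything
    outside the pool is treated as not relevant."""
    pool: Set[str] = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool

runs = {
    "system_A": ["d12", "d7", "d3", "d44"],
    "system_B": ["d7", "d90", "d12", "d5"],
}
print(sorted(build_pool(runs, depth=3)))  # documents sent for judging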

Page 4:

Precision and Recall

• Relevance judgements are binary – “relevant” or “not relevant” – partitioning the collection into two parts
• Precision
• Recall (both defined in the sketch below)

Can a search engine guarantee 100% recall?
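
A minimal sketch of the standard set-based definitions (the symbols Ret for the retrieved set and Rel for the relevant set are mine, not the slide's):

\[
  \text{Precision} = \frac{|Ret \cap Rel|}{|Ret|},
  \qquad
  \text{Recall} = \frac{|Ret \cap Rel|}{|Rel|}
\]

Note that returning the entire collection trivially achieves 100% recall (at very low precision), which is presumably the point of the question above.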

Page 5:

F-measure

• F-Measure: Weighted harmonic mean of Precision and Recall

Why use harmonic mean instead of arithmetic mean?
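
In the same notation, the weighted F-measure is conventionally written as (standard formulation, not copied from the slide):

\[
  F_\beta = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R},
  \qquad
  F_1 = \frac{2PR}{P + R} \quad (\beta = 1)
\]

The harmonic mean is dominated by the smaller of P and R, so a system cannot score well by pushing one measure to an extreme at the expense of the other, which the arithmetic mean would reward.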

Page 6:

Precision-Recall Curves

• Using precision and recall to evaluate ranked retrieval

Source: Introduction to Information Retrieval. Manning, Raghavan and Schütze, 2008
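
The curve in the referenced figure comes from computing precision and recall after each rank position; a minimal Python sketch with invented relevance judgements for a single query:

# Hypothetical example: 1 = relevant, 0 = not relevant; judgements are made up.
def pr_points(rels, total_relevant):
    """Precision and recall after each rank position of a ranked result list."""
    points, hits = [], 0
    for k, rel in enumerate(rels, start=1):
        hits += rel
        points.append((hits / k, hits / total_relevant))  # (precision@k, recall@k)
    return points

ranked_relevance = [1, 0, 1, 1, 0, 0, 1, 0]
for p, r in pr_points(ranked_relevance, total_relevant=5):
    print(f"precision={p:.2f}  recall={r:.2f}")

Plotting recall on the x-axis against precision on the y-axis (often with interpolation) gives the familiar sawtooth curve shown in the book.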

Page 7:

Single measures

Precision at k, P@10, P@100, etc.

and others…
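
P@k looks only at the top k positions and ignores recall entirely; a short sketch in the same style as above, again with an invented judgement list:

def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant (1 = relevant, 0 = not)."""
    return sum(rels[:k]) / k if k > 0 else 0.0

ranked_relevance = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
print(precision_at_k(ranked_relevance, 10))   # P@10 = 0.5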

Page 8:

Graded Relevance – NDCG

• Highly relevant documents should carry more weight than marginally relevant ones

• The higher a relevant document appears in the ranking, the more valuable it is to the user
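
One common formulation, with graded relevance rel_i at rank i (the slide's own formula may use the simpler variant with rel_i in the numerator):

\[
  \mathrm{DCG}@p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
  \qquad
  \mathrm{NDCG}@p = \frac{\mathrm{DCG}@p}{\mathrm{IDCG}@p}
\]

where IDCG@p is the DCG of the ideal ordering of the same documents, so NDCG lies in [0, 1].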

Page 9:

Inter-judge Agreement – Fleiss’ Kappa

N – number of results
n – number of ratings per result
k – number of grades
n_ij – number of judges who agree that the i-th result should have grade j
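
With those symbols, the standard Fleiss' kappa computation (reconstructed here; the formula itself does not survive in the transcript) is:

\[
  p_j = \frac{1}{Nn}\sum_{i=1}^{N} n_{ij},
  \qquad
  P_i = \frac{1}{n(n-1)}\left(\sum_{j=1}^{k} n_{ij}^{2} - n\right)
\]
\[
  \bar{P} = \frac{1}{N}\sum_{i=1}^{N} P_i,
  \qquad
  \bar{P}_e = \sum_{j=1}^{k} p_j^{2},
  \qquad
  \kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}
\]

Here \bar{P} is the observed agreement between judges and \bar{P}_e is the agreement expected by chance.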

Page 10:

Tests of Statistical Significance

• Wilcoxon signed rank test
• Student’s paired t-test
• …and more
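
Both tests compare two systems over the same query set by pairing the per-query scores. A minimal SciPy sketch; the score lists (e.g. per-query average precision) are invented:

# Hypothetical per-query scores for two systems on the same eight queries.
from scipy import stats

system_a = [0.42, 0.55, 0.31, 0.67, 0.48, 0.59, 0.36, 0.50]
system_b = [0.40, 0.61, 0.35, 0.70, 0.44, 0.63, 0.41, 0.49]

# Paired tests: each query contributes one score per system.
w_stat, w_p = stats.wilcoxon(system_a, system_b)
t_stat, t_p = stats.ttest_rel(system_a, system_b)
print(f"Wilcoxon signed-rank: p = {w_p:.3f}")
print(f"Student's paired t-test: p = {t_p:.3f}")

A small p-value suggests the observed difference between the systems is unlikely to be due to chance alone.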

Page 11:

END OF MODULE “IR FROM 20000FT”