Lecture 6: Evaluation
Information Retrieval
Computer Science Tripos Part II
Ronan Cummins [1]
Natural Language and Information Processing (NLIP) Group
2016
[1] Adapted from Simone Teufel’s original slides
Overview
1 Recap/Catchup
2 Introduction
3 Unranked evaluation
4 Ranked evaluation
5 Benchmarks
6 Other types of evaluation
Summary: Ranked retrieval
In VSM one represents documents and queries as weighted tf-idf vectors
Compute the cosine similarity between the vectors to rank
Language models rank based on the probability of a document model generating the query
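As a reminder of the vector-space ranking step, here is a minimal sketch of cosine similarity over sparse term-weight vectors. The helper name `cosine`, the document ids and the weights are all made up for illustration; they are not real tf-idf scores and not taken from the lecture.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for an empty vector
    return dot / (norm_u * norm_v)

# Hypothetical term weights, for illustration only
query = {"red": 1.0, "wine": 1.0}
docs = {
    "d1": {"red": 2.0, "wine": 2.0},
    "d2": {"wine": 1.0, "beer": 3.0},
}

# Rank document ids by decreasing similarity to the query
ranking = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```

In this toy case `d1` points in the same direction as the query, so it is ranked ahead of `d2`.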
Today
[Figure: architecture of an IR system. Labelled components: Query, Document Collection, Document Normalisation, Indexer, Indexes, Query Norm., Ranking/Matching Module, UI, Set of relevant documents, Evaluation.]
Today: how good are the returned documents?
Measures for a search engine
How fast does it index?
e.g., number of bytes per hour
How fast does it search?
e.g., latency as a function of queries per second
What is the cost per query?
in dollars
Measures for a search engine
All of the preceding criteria are measurable: we can quantify speed / size / money
However, the key measure for a search engine is user happiness.
What is user happiness?
Factors include:
Speed of response
Size of index
Uncluttered UI
We can measure:
Rate of return to this search engine
Whether something was bought
Whether ads were clicked
Most important: relevance
(actually, maybe even more important: it’s free)
Note that none of these is sufficient: blindingly fast but useless answers won’t make a user happy.
Most common definition of user happiness: Relevance
User happiness is equated with the relevance of search results to the query.
But how do you measure relevance?
Standard methodology in information retrieval consists of three elements.
A benchmark document collection
A benchmark suite of queries
An assessment of the relevance of each query-document pair
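The three elements can be pictured as one small data structure. This is only a sketch: the class name `Benchmark`, its fields and the toy entries are hypothetical, and treating unjudged pairs as nonrelevant is an assumption made here, not something the slides state.

```python
from dataclasses import dataclass, field

@dataclass
class Benchmark:
    documents: dict  # doc_id -> document text (the benchmark collection)
    queries: dict    # query_id -> query text (the benchmark suite of queries)
    # (query_id, doc_id) pairs judged relevant; any pair not listed is
    # treated as nonrelevant (an assumption of this sketch)
    relevant: set = field(default_factory=set)

    def is_relevant(self, query_id, doc_id):
        """Look up the relevance assessment for one query-document pair."""
        return (query_id, doc_id) in self.relevant

    def relevant_docs(self, query_id):
        """All documents judged relevant for the given query."""
        return {d for (q, d) in self.relevant if q == query_id}

# Toy instance with made-up content
bench = Benchmark(
    documents={"d1": "red wine and heart health", "d2": "wine lobby speech"},
    queries={"q1": "red wine white wine heart attack"},
    relevant={("q1", "d1")},
)
```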
Relevance: query vs. information need
Relevance to what? The query?
Information need i
“I am looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.”
translated into:
Query q
[red wine white wine heart attack]
So what about the following document:
Document d′
At the heart of his speech was an attack on the wine industry lobby for downplaying the role of red and white wine in drunk driving.
d′ is an excellent match for query q ...
d′ is not relevant to the information need i.
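To see why d′ matches the query so well, a minimal sketch can check term overlap. The tokeniser `terms` is a hypothetical, deliberately naive helper (lowercasing, alphabetic tokens only, no stemming); real systems normalise text differently.

```python
import re

def terms(text):
    """Lowercase a text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

query = "red wine white wine heart attack"
d_prime = ("At the heart of his speech was an attack on the wine industry lobby "
           "for downplaying the role of red and white wine in drunk driving.")

# Query terms that do not occur anywhere in the document
missing = terms(query) - terms(d_prime)
all_query_terms_match = not missing
```

Every query term occurs in d′, so `missing` is empty, yet the document says nothing about wine and heart attacks. Term matching alone cannot see the information need.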
Relevance: query vs. information need
User happiness can only be measured by relevance to an information need, not by relevance to queries.
Sloppy terminology here and elsewhere in the literature: we talk about query–document relevance judgments even though we mean information-need–document relevance judgments.
Precision and recall
Precision (P) is the fraction of retrieved documents that are relevant
$$\text{Precision} = \frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})} = P(\text{relevant} \mid \text{retrieved})$$
Recall (R) is the fraction of relevant documents that are retrieved
$$\text{Recall} = \frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items})} = P(\text{retrieved} \mid \text{relevant})$$
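The two definitions translate directly into set operations. The sketch below assumes documents are identified by ids held in Python sets; the function names, the ids and the choice to return 0.0 for an empty denominator are illustrative assumptions.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0  # convention chosen here for an empty result set
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant documents that are retrieved."""
    if not relevant:
        return 0.0  # convention chosen here when nothing is relevant
    return len(retrieved & relevant) / len(relevant)

# Made-up document ids for illustration
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d7", "d8", "d9"}

p = precision(retrieved, relevant)  # 2 relevant among 4 retrieved
r = recall(retrieved, relevant)     # 2 retrieved among 5 relevant
```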
Precision and recall
| WHAT THE SYSTEM THINKS | THE TRUTH: Relevant | THE TRUTH: Nonrelevant |
|---|---|---|
| Retrieved | true positives (TP) | false positives (FP) |
| Not retrieved | false negatives (FN) | true negatives (TN) |

[Figure: Venn diagram of the Relevant and Retrieved sets, with regions labelled True Positives, False Positives, False Negatives and True Negatives.]
P = TP/(TP + FP)
R = TP/(TP + FN)
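As a small worked illustration with made-up counts (not taken from the lecture), suppose a query yields TP = 30, FP = 10 and FN = 20:

$$P = \frac{TP}{TP + FP} = \frac{30}{30 + 10} = 0.75, \qquad R = \frac{TP}{TP + FN} = \frac{30}{30 + 20} = 0.6$$

TN appears in neither formula, so the number of nonrelevant documents that were correctly left out does not affect P or R.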
Precision/recall tradeoff
You can increase recall by returning more docs.
Recall is a non-decreasing function of the number of docs retrieved.
A system that returns all docs has 100% recall!
The converse is also true (usually): It’s easy to get high precision for very low recall.
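A combined measure: F
F allows us to trade off precision against recall.
$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R} \quad \text{where} \quad \beta^2 = \frac{1 - \alpha}{\alpha}$$
$\alpha \in [0, 1]$ and thus $\beta^2 \in [0, \infty]$
Most frequently used: balanced F with $\beta = 1$ or $\alpha = 0.5$
This is the harmonic mean of P and R: $\frac{1}{F} = \frac{1}{2}\left(\frac{1}{P} + \frac{1}{R}\right)$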
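As a minimal sketch of the two equivalent forms of F, the functions below compute the measure once from β and once from α. The function names and the convention of returning 0 when precision or recall is 0 are assumptions made for illustration, not part of the slides.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F written with beta: (beta^2 + 1) P R / (beta^2 P + R)."""
    if precision == 0.0 or recall == 0.0:
        # Assumed convention: F is 0 when either component is 0.
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


def f_measure_alpha(precision: float, recall: float, alpha: float = 0.5) -> float:
    """F written with alpha: 1 / (alpha / P + (1 - alpha) / R)."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


# alpha = 0.2 corresponds to beta^2 = (1 - 0.2) / 0.2 = 4, i.e. beta = 2.
same = abs(f_measure(0.5, 0.25, beta=2.0) - f_measure_alpha(0.5, 0.25, alpha=0.2)) < 1e-12
```

With β = 1 (α = 0.5) both functions reduce to the harmonic mean of P and R.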
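Example for precision, recall, F1

| | relevant | not relevant | total |
|---|---|---|---|
| retrieved | 20 | 40 | 60 |
| not retrieved | 60 | 1,000,000 | 1,000,060 |
| total | 80 | 1,000,040 | 1,000,120 |

P = 20/(20 + 40) = 1/3
R = 20/(20 + 60) = 1/4
$F_1 = 2 \cdot \frac{1}{\frac{1}{1/3} + \frac{1}{1/4}} = 2/7$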
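The arithmetic above can be checked with exact fractions. This is only a small sketch; the variable names are chosen here for illustration.

```python
from fractions import Fraction

# Counts taken from the contingency table of this example.
tp, fp, fn = 20, 40, 60

precision = Fraction(tp, tp + fp)
recall = Fraction(tp, tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)  # 1/3 1/4 2/7
```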
Accuracy
Why do we use complex measures like precision, recall, and F?
Why not something simple like accuracy?
Accuracy is the fraction of decisions (relevant/nonrelevant) that are correct.
In terms of the contingency table above, accuracy = (TP + TN)/(TP + FP + FN + TN).
Thought experiment
Compute precision, recall and F1 for this result set:

| | relevant | not relevant |
|---|---|---|
| retrieved | 18 | 2 |
| not retrieved | 82 | 1,000,000,000 |

The snoogle search engine below always returns 0 results (“0 matching results found”), regardless of the query.
Snoogle demonstrates that accuracy is not a useful measure in IR.
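One way to work through the thought experiment is sketched below. The helper names are hypothetical, and treating precision as 0 when nothing is retrieved is an assumed convention (0/0 is otherwise undefined).

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from contingency counts (0.0 where undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(tp, fp, fn, tn):
    """Fraction of relevant/nonrelevant decisions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# The result set from the table.
tp, fp, fn, tn = 18, 2, 82, 1_000_000_000
p, r, f1 = prf1(tp, fp, fn)          # about 0.9, 0.18, 0.3
acc = accuracy(tp, fp, fn, tn)

# A system that returns nothing: every document moves to "not retrieved".
s_tp, s_fp, s_fn, s_tn = 0, 0, tp + fn, fp + tn
s_p, s_r, s_f1 = prf1(s_tp, s_fp, s_fn)
s_acc = accuracy(s_tp, s_fp, s_fn, s_tn)
```

Both systems score an accuracy above 0.9999999, yet the one that returns nothing has recall and F1 of 0, which is the point of the experiment.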
Why accuracy is a useless measure in IR
Simple trick to maximize accuracy in IR: always say no and return nothing
You then get 99.99% accuracy on most queries.
Searchers on the web (and in IR in general) want to find something and have a certain tolerance for junk.
It’s better to return some bad hits as long as you return something.
→ We use precision, recall, and F for evaluation, not accuracy.
Recall-criticality and precision-criticality
Inverse relationship between precision and recall forces general systems to go for compromise between them
But some tasks particularly need good precision whereas others need good recall:

| | Precision-critical task | Recall-critical task |
|---|---|---|
| Time | matters | matters less |
| Tolerance to cases of overlooked information | a lot | none |
| Information redundancy | There may be many equally good answers | Information is typically found in only one document |
| Examples | web search | legal search, patent search |
Difficulties in using precision, recall and F
We should always average over a large set of queries.
There is no such thing as a “typical” or “representative” query.
We need relevance judgments for information-need-document pairs – but they are expensive to produce.
For alternatives to using precision/recall and having to produce relevance judgments – see end of this lecture.
Moving from unranked to ranked evaluation
Precision/recall/F are measures for unranked sets.
We can easily turn set measures into measures of ranked lists.
Just compute the set measure for each “prefix”: the top 1, top 2, top 3, top 4 etc. results
This is called Precision/Recall at Rank
Rank statistics give some indication of how quickly the user will find relevant documents in the ranked list
Precision/Recall @ Rank

| Rank | Doc |
|---|---|
| 1 | d12 |
| 2 | d123 |
| 3 | d4 |
| 4 | d57 |
| 5 | d157 |
| 6 | d222 |
| 7 | d24 |
| 8 | d26 |
| 9 | d77 |
| 10 | d90 |

Blue documents are relevant
P@n: P@3=0.33, P@5=0.2, P@8=0.25
R@n: R@3=0.33, R@5=0.33, R@8=0.66
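A minimal sketch of P@n and R@n follows. The ranking is the one listed on the slide, but the set of relevant documents is a hypothetical choice: the colour marking does not survive in text, so `relevant` is simply one assignment that agrees with the reported values (three relevant documents, one in the top 3 and a second within ranks 6 to 8).

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / k


def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / len(relevant)


ranking = ["d12", "d123", "d4", "d57", "d157", "d222", "d24", "d26", "d77", "d90"]
# Hypothetical relevance judgments, chosen only to match the slide's numbers.
relevant = {"d123", "d222", "d90"}

p_values = [precision_at_k(ranking, relevant, k) for k in (3, 5, 8)]
r_values = [recall_at_k(ranking, relevant, k) for k in (3, 5, 8)]
```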
A precision-recall curve
Each point corresponds to a result for the top k ranked hits (k = 1, 2, 3, 4, ...)
Interpolation (in red): Take maximum of all future points
Rationale for interpolation: The user is willing to look at more stuff if both precision and recall get better.
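The curve and its interpolation can be sketched as follows. The relevance flags in the example are made up for illustration, and reading "all future points" as "this point or any later one" is an assumption (it is the usual convention).

```python
def pr_points(relevance, num_relevant):
    """(recall, precision) after each rank k, given 0/1 relevance flags."""
    points, hits = [], 0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        points.append((hits / num_relevant, hits / k))
    return points


def interpolate(points):
    """Replace each precision by the maximum precision at this or any later point."""
    interpolated, best = [], 0.0
    for recall, precision in reversed(points):
        best = max(best, precision)
        interpolated.append((recall, best))
    interpolated.reverse()
    return interpolated


# Hypothetical ranking: relevant documents at ranks 1, 3 and 6, three relevant overall.
raw = pr_points([1, 0, 1, 0, 0, 1], num_relevant=3)
smooth = interpolate(raw)
```

After interpolation, precision never increases as k grows, which removes the sawtooth shape of the raw curve.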
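Another idea: Precision at Recall r

| Rank | S1 | S2 |
|---|---|---|
| 1 | X | |
| 2 | | X |
| 3 | X | |
| 4 | | |
| 5 | | X |
| 6 | X | X |
| 7 | | X |
| 8 | | X |
| 9 | X | |
| 10 | X | |

→

| | S1 | S2 |
|---|---|---|
| p @ r 0.2 | 1.0 | 0.5 |
| p @ r 0.4 | 0.67 | 0.4 |
| p @ r 0.6 | 0.5 | 0.5 |
| p @ r 0.8 | 0.44 | 0.57 |
| p @ r 1.0 | 0.5 | 0.63 |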
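The p @ r values can be reproduced from the ranks at which each system returns a relevant document: the i-th relevant document at rank k gives recall i/5 and precision i/k. The sketch below assumes five relevant documents in total, which is what the five recall levels imply; the function name is hypothetical.

```python
def precision_at_recall_levels(relevant_ranks):
    """Precision at the rank where the 1st, 2nd, ... relevant document appears."""
    return [i / rank for i, rank in enumerate(sorted(relevant_ranks), start=1)]


# Ranks of the relevant documents (the X marks) for each system.
s1 = precision_at_recall_levels([1, 3, 6, 9, 10])
s2 = precision_at_recall_levels([2, 5, 6, 7, 8])
```

The unrounded results are 1, 2/3, 1/2, 4/9, 1/2 for S1 and 1/2, 2/5, 1/2, 4/7, 5/8 for S2, matching the table after rounding to two decimals.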
Averaged 11-point precision/recall graph
Compute interpolated precision at recall levels 0.0, 0.1, 0.2, ...
Do this for each of the queries in the evaluation benchmark
Average over queries
The curve is typical of performance levels at TREC (more later).
Averaged 11-point precision more formally
$$P_{11\text{-pt}} = \frac{1}{11} \sum_{j=0}^{10} \frac{1}{N} \sum_{i=1}^{N} \tilde{P}_i(r_j)$$
with $\tilde{P}_i(r_j)$ the precision at the jth recall point in the ith query (out of N)
Define 11 standard recall points $r_j = \frac{j}{10}$: $r_0 = 0, r_1 = 0.1, \ldots, r_{10} = 1$
To get $\tilde{P}_i(r_j)$, we can use $P_i(R = r_j)$ directly if a new relevant document is retrieved exactly at $r_j$
Interpolation for cases where there is no exact measurement at $r_j$:
$$\tilde{P}_i(r_j) = \begin{cases} \max_{r_j \le r < r_{j+1}} P_i(R = r) & \text{if } P_i(R = r) \text{ exists} \\ \tilde{P}_i(r_{j+1}) & \text{otherwise} \end{cases}$$
Note that $P_i(R = 1)$ can always be measured.
Worked avg-11-pt prec example for supervisions at end of slides.
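A sketch of this definition, under stated assumptions: each query is given as the ranks of its retrieved relevant documents plus its total number of relevant documents, and 0.0 stands in for recall levels that are never reached. The function names are hypothetical, and the example ranks are those of system S1 from the precision-at-recall slide.

```python
def eleven_point_interpolated(relevant_ranks, num_relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for one query."""
    ranks = sorted(relevant_ranks)
    interp = [0.0] * 11
    carry = 0.0  # assumption: 0.0 if no measurement exists at or above r_j
    for j in range(10, -1, -1):
        # Precisions measured at recall r = i / num_relevant with r_j <= r < r_{j+1};
        # the comparison is done in integers to avoid floating-point edge cases.
        in_bin = [i / rank for i, rank in enumerate(ranks, start=1)
                  if j * num_relevant <= 10 * i < (j + 1) * num_relevant]
        carry = max(in_bin) if in_bin else carry  # otherwise reuse the value at r_{j+1}
        interp[j] = carry
    return interp


def averaged_eleven_point(queries):
    """Average over the 11 recall points of the per-query mean precision."""
    per_query = [eleven_point_interpolated(ranks, n) for ranks, n in queries]
    n_queries = len(per_query)
    return sum(sum(q[j] for q in per_query) / n_queries for j in range(11)) / 11


s1_curve = eleven_point_interpolated([1, 3, 6, 9, 10], num_relevant=5)
s1_score = averaged_eleven_point([([1, 3, 6, 9, 10], 5)])
```

Because this follows the piecewise definition literally, the resulting values need not be monotone in recall, unlike the "maximum of all future points" interpolation.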
Mean Average Precision (MAP)
Also called “average precision at seen relevant documents”
Determine precision at each point when a new relevant document gets retrieved
Use P=0 for each relevant document that was not retrieved
Determine average for each query, then average over queries
$$\mathrm{MAP} = \frac{1}{N} \sum_{j=1}^{N} \frac{1}{Q_j} \sum_{i=1}^{Q_j} P(\mathrm{doc}_i)$$
with:
$Q_j$: number of relevant documents for query $j$
$N$: number of queries
$P(\mathrm{doc}_i)$: precision at the $i$th relevant document
![Page 132: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/132.jpg)
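Mean Average Precision: example
Only the ranks at which a relevant document (X) is retrieved are listed; all other ranks hold non-relevant documents.

Query 1 (ranks 1 to 20)
| Rank | Relevant | P(doc_i) |
|---|---|---|
| 1 | X | 1.00 |
| 3 | X | 0.67 |
| 6 | X | 0.50 |
| 10 | X | 0.40 |
| 20 | X | 0.25 |
AVG: 0.564

Query 2 (ranks 1 to 15)
| Rank | Relevant | P(doc_i) |
|---|---|---|
| 1 | X | 1.00 |
| 3 | X | 0.67 |
| 15 | X | 0.20 |
AVG: 0.623

MAP = (0.564 + 0.623) / 2 = 0.594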
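The MAP computation for the two example queries can be sketched in Python. The function names and the input format (a list of 1-based ranks of relevant documents plus the total number of relevant documents) are assumptions chosen for illustration.

```python
def average_precision(relevant_ranks, num_relevant):
    """Average precision for one query.

    relevant_ranks: 1-based ranks at which relevant documents were retrieved.
    num_relevant:   total number of relevant documents for the query; relevant
                    documents that were never retrieved contribute P = 0
                    because the sum is divided by this total.
    """
    precisions = [
        (i + 1) / rank  # i + 1 relevant documents seen within the top `rank`
        for i, rank in enumerate(sorted(relevant_ranks))
    ]
    return sum(precisions) / num_relevant


def mean_average_precision(queries):
    """queries: list of (relevant_ranks, num_relevant) pairs, one per query."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)


# The two queries from the example slide.
query1 = ([1, 3, 6, 10, 20], 5)
query2 = ([1, 3, 15], 3)
map_score = mean_average_precision([query1, query2])
```

Computed without intermediate rounding, the values are about 0.563, 0.622 and 0.593. The slide's 0.564, 0.623 and 0.594 come from averaging precisions already rounded to two decimals.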
![Page 138: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/138.jpg)
ROC curve (Receiver Operating Characteristic)
x-axis: FPR (false positive rate): FP/total actual negatives;
y-axis: TPR (true positive rate): TP/total actual positives (also called sensitivity) ≡ recall
FPR = fall-out = 1 - specificity (TNR; true negative rate)
But we are only interested in the small area in the lower left corner (blown up by the precision-recall graph)
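As a sketch of how the ROC points arise from a ranked result list, the following Python function walks down the ranking and records (FPR, TPR) after each document. The function name, the argument names, and the toy ranking are assumptions for illustration.

```python
def roc_points(ranked_relevance, total_relevant, total_nonrelevant):
    """Return the list of (FPR, TPR) points for a ranked result list.

    ranked_relevance:  booleans, True where the document at that rank is relevant.
    total_relevant:    number of relevant documents in the collection.
    total_nonrelevant: number of non-relevant documents in the collection.
    """
    points = [(0.0, 0.0)]  # nothing retrieved yet
    tp = fp = 0
    for is_relevant in ranked_relevance:
        if is_relevant:
            tp += 1  # true positive: moves the curve up
        else:
            fp += 1  # false positive: moves the curve right
        points.append((fp / total_nonrelevant, tp / total_relevant))
    return points


# Toy collection: 2 relevant and 8 non-relevant documents, top 3 ranks shown.
curve = roc_points([True, False, True], total_relevant=2, total_nonrelevant=8)
```

In a realistic collection the non-relevant documents vastly outnumber the relevant ones, so the FPR stays close to 0 over the ranks a user actually inspects. That is the lower left corner referred to above.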
![Page 142: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/142.jpg)
Variance of measures like precision/recall
For a test collection, it is usual that a system does badly on some information needs (e.g., P = 0.2 at R = 0.1) and really well on others (e.g., P = 0.95 at R = 0.1).
Indeed, it is usually the case that the variance of the same system across queries is much greater than the variance of different systems on the same query.
That is, there are easy information needs and hard ones.
![Page 153: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/153.jpg)
What we need for a benchmark
A collection of documents
Documents must be representative of the documents we expect to see in reality.
A collection of information needs
. . . which we will often incorrectly refer to as queries
Information needs must be representative of the information needs we expect to see in reality.
Human relevance assessments
We need to hire/pay “judges” or assessors to do this.
Expensive, time-consuming
Judges must be representative of the users we expect to see in reality.
![Page 158: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/158.jpg)
First standard relevance benchmark: Cranfield
Pioneering: first testbed allowing precise quantitative measures of information retrieval effectiveness
Late 1950s, UK
1398 abstracts of aerodynamics journal articles, a set of 225 queries, exhaustive relevance judgments of all query-document pairs
Too small, too untypical for serious IR evaluation today
![Page 166: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/166.jpg)
Second-generation relevance benchmark: TREC
TREC = Text Retrieval Conference
Organized by the U.S. National Institute of Standards and Technology (NIST)
TREC is actually a set of several different relevance benchmarks.
Best known: TREC Ad Hoc, used for the first 8 TREC evaluations between 1992 and 1999
1.89 million documents, mainly newswire articles, 450 information needs
No exhaustive relevance judgments: too expensive
Rather, NIST assessors’ relevance judgments are available only for the documents that were among the top k returned by some system entered in the TREC evaluation for which the information need was developed.
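The selection of documents to judge, as described in the last point, can be sketched as a union of the top k results of every participating system. This is an illustration only: the function name, the run names and the document identifiers are hypothetical, and the actual TREC procedure is not specified in these slides beyond the description above.

```python
def build_judgment_pool(runs, k):
    """Collect the documents that assessors would judge for one information need.

    runs: dict mapping a system name to its ranked list of document ids.
    k:    depth up to which each system's ranking contributes to the pool.
    """
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])  # top k of this system
    return pool


# Hypothetical runs from two systems for the same information need.
example_runs = {
    "system_a": ["d1", "d2", "d3"],
    "system_b": ["d2", "d4", "d5"],
}
pool = build_judgment_pool(example_runs, k=2)
```

Documents outside the pool receive no judgment at all, which is why the relevance judgments are not exhaustive.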
![Page 167: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/167.jpg)
Sample TREC Query
<num> Number: 508
<title> hair loss is a symptom of what diseases
<desc> Description:
Find diseases for which hair loss is a symptom.
<narr> Narrative:
A document is relevant if it positively connects the loss of head hair in humans with a specific disease. In this context, “thinning hair” and “hair loss” are synonymous. Loss of body and/or facial hair is irrelevant, as is hair loss caused by drug therapy.
![Page 168: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/168.jpg)
TREC Relevance Judgements
Humans decide which document–query pairs are relevant.
![Page 175: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/175.jpg)
Example of more recent benchmark: ClueWeb09
1 billion web pages
25 terabytes (compressed: 5 terabytes)
Collected January/February 2009
10 languages
Unique URLs: 4,780,950,903 (325 GB uncompressed, 105 GB compressed)
Total Outlinks: 7,944,351,835 (71 GB uncompressed, 24 GB compressed)
![Page 177: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/177.jpg)
Interjudge agreement at TREC
| Information need | Number of docs judged | Disagreements |
|---|---|---|
| 51 | 211 | 6 |
| 62 | 400 | 157 |
| 67 | 400 | 68 |
| 95 | 400 | 110 |
| 127 | 400 | 106 |
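To make the figures in the table easier to compare, the raw disagreement rate per information need can be computed as disagreements divided by documents judged. This is a small Python sketch; the variable names are assumptions, and the rate is a plain proportion rather than a chance-corrected agreement statistic.

```python
# (number of docs judged, disagreements) per information need, from the table.
judgments = {
    51: (211, 6),
    62: (400, 157),
    67: (400, 68),
    95: (400, 110),
    127: (400, 106),
}

# Fraction of judged documents on which the judges disagreed.
disagreement_rate = {
    need: disagreements / docs_judged
    for need, (docs_judged, disagreements) in judgments.items()
}

for need, rate in disagreement_rate.items():
    print(f"information need {need}: {rate:.1%} disagreement")
```

The rates vary widely across information needs, from under 3% for need 51 to almost 40% for need 62.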
![Page 185: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/185.jpg)
Impact of interjudge disagreement
Judges disagree a lot. Does that mean that the results of information retrieval experiments are meaningless?
No.
Large impact on absolute performance numbers
Virtually no impact on ranking of systems
Suppose we want to know if algorithm A is better than algorithm B
An information retrieval experiment will give us a reliable answer to this question . . .
. . . even if there is a lot of disagreement between judges.
![Page 193: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/193.jpg)
Evaluation at large search engines
Recall is difficult to measure on the web
Search engines often use precision at top k , e.g., k = 10 . . .
. . . or use measures that reward you more for getting rank 1right than for getting rank 10 right.
Search engines also use non-relevance-based measures.
Example 1: clickthrough on first resultNot very reliable if you look at a single clickthrough (you mayrealize after clicking that the summary was misleading and thedocument is nonrelevant) . . .
261
![Page 194: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/194.jpg)
Evaluation at large search engines
Recall is difficult to measure on the web
Search engines often use precision at top k , e.g., k = 10 . . .
. . . or use measures that reward you more for getting rank 1right than for getting rank 10 right.
Search engines also use non-relevance-based measures.
Example 1: clickthrough on first resultNot very reliable if you look at a single clickthrough (you mayrealize after clicking that the summary was misleading and thedocument is nonrelevant) . . .. . . but pretty reliable in the aggregate.
261
![Page 195: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/195.jpg)
Evaluation at large search engines
Recall is difficult to measure on the web
Search engines often use precision at top k, e.g., k = 10 ...
... or use measures that reward you more for getting rank 1 right than for getting rank 10 right.
Search engines also use non-relevance-based measures.
Example 1: clickthrough on first result
Not very reliable if you look at a single clickthrough (you may realize after clicking that the summary was misleading and the document is nonrelevant) ...
... but pretty reliable in the aggregate.
Example 2: A/B testing
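A minimal sketch of the two kinds of measure mentioned on this slide. The function names are hypothetical, and the 1/log2(rank + 1) discount is only one common way of rewarding early ranks more; the slide does not say which weighting search engines actually use.

```python
import math

def precision_at_k(relevance, k=10):
    """Fraction of the top-k results that are relevant (relevance is a list of 0/1)."""
    return sum(relevance[:k]) / k

def discounted_gain(relevance, k=10):
    """Sum of relevance weighted by 1/log2(rank + 1), so early ranks count more.

    The log2 discount is an assumption chosen for illustration.
    """
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevance[:k], start=1))

# One relevant document, placed either at rank 1 or at rank 10.
hit_at_1 = [1] + [0] * 9
hit_at_10 = [0] * 9 + [1]

print(precision_at_k(hit_at_1), precision_at_k(hit_at_10))    # identical precision at 10
print(discounted_gain(hit_at_1), discounted_gain(hit_at_10))  # rank 1 scores higher
```

Precision at k cannot tell the two rankings apart, while the discounted measure prefers the one with the relevant document at rank 1.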
![Page 203: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/203.jpg)
A/B testing
Purpose: Test a single innovation
Prerequisite: You have a large search engine up and running.
Have most users use old system
Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation
Evaluate with an "automatic" measure like clickthrough on first result
Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most
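The procedure above could be sketched roughly as follows. Everything here is an assumption made for illustration: the hash-based bucketing, the function names, the click counts, and the choice of a two-proportion z statistic to compare clickthrough rates. The slides do not prescribe any of these.

```python
import hashlib
import math

def assign_bucket(user_id, treatment_fraction=0.01):
    """Deterministically send a small share of users to the new system."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    position = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "new" if position < treatment_fraction else "old"

def two_proportion_z(clicks_old, n_old, clicks_new, n_new):
    """z statistic for the difference in clickthrough rate (new minus old)."""
    p_old = clicks_old / n_old
    p_new = clicks_new / n_new
    pooled = (clicks_old + clicks_new) / (n_old + n_new)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    return (p_new - p_old) / std_err

# Made-up counts of queries and first-result clicks, purely for illustration.
z = two_proportion_z(clicks_old=396_000, n_old=990_000,
                     clicks_new=4_200, n_new=10_000)
print(assign_bucket("user-42"), round(z, 2))
```

Hashing the user id keeps each user on the same system across queries, which is one simple way to keep the two groups separate.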
![Page 214: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/214.jpg)
Take-away today
Focused on evaluation for ad-hoc retrieval
Precision, Recall, F-measure
More complex measures for ranked retrieval
other issues arise when evaluating different tracks, e.g. QA, although typically still use P/R-based measures
Evaluation for interactive tasks is more involved
Significance testing is an issue
could a good result have occurred by chance?
is the result robust across different document sets?
slowly becoming more common
underlying population distributions unknown, so apply non-parametric tests such as the sign test
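As an illustration of the sign test mentioned above, here is a minimal sketch of an exact two-sided version applied to paired per-query scores of two systems. The function name and the scores are made up; dropping ties is one common convention, assumed here.

```python
from math import comb

def sign_test_p_value(scores_a, scores_b):
    """Two-sided exact sign test on paired per-query scores; ties are dropped."""
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Probability of k or fewer wins for the weaker side under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical per-query scores for two systems on ten queries.
system_a = [0.62, 0.48, 0.71, 0.55, 0.80, 0.33, 0.67, 0.59, 0.44, 0.73]
system_b = [0.58, 0.41, 0.69, 0.57, 0.75, 0.30, 0.61, 0.52, 0.40, 0.70]

p_value = sign_test_p_value(system_a, system_b)
print(round(p_value, 4))
```

Only the direction of each per-query difference is used, so no assumption about the distribution of the scores is needed.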
![Page 215: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/215.jpg)
Reading
MRS, Chapter 8
![Page 216: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/216.jpg)
Worked Example avg-11-pt prec: Query 1, measured datapoints
[Figure: precision (y-axis) against recall (x-axis), both from 0 to 1. Blue for Query 1; bold circles are measured.]
Query 1 (results ranked 1 to 20; only the relevant ranks are listed)

| Rank | Relevant | R | P | Coinciding recall level |
|---|---|---|---|---|
| 1 | X | 0.2 | 1.00 | P̃1(r2) = 1.00 |
| 3 | X | 0.4 | 0.67 | P̃1(r4) = 0.67 |
| 6 | X | 0.6 | 0.50 | P̃1(r6) = 0.50 |
| 10 | X | 0.8 | 0.40 | P̃1(r8) = 0.40 |
| 20 | X | 1.0 | 0.25 | P̃1(r10) = 0.25 |

Five rjs (r2, r4, r6, r8, r10) coincide directly with a measured datapoint.
![Page 217: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/217.jpg)
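Worked Example avg-11-pt prec: Query 1, interpolation
[Figure: precision (y-axis) against recall (x-axis), both from 0 to 1. Bold circles measured, thin circles interpolated.]
Query 1 (results ranked 1 to 20; only the relevant ranks are listed)

| Rank | Relevant | R | P |
|---|---|---|---|
| 1 | X | 0.20 | 1.00 |
| 3 | X | 0.40 | 0.67 |
| 6 | X | 0.60 | 0.50 |
| 10 | X | 0.80 | 0.40 |
| 20 | X | 1.00 | 0.25 |

| Recall level | Interpolated precision | Origin |
|---|---|---|
| r0 | P̃1(r0) = 1.00 | interpolated |
| r1 | P̃1(r1) = 1.00 | interpolated |
| r2 | P̃1(r2) = 1.00 | measured (rank 1) |
| r3 | P̃1(r3) = 0.67 | interpolated |
| r4 | P̃1(r4) = 0.67 | measured (rank 3) |
| r5 | P̃1(r5) = 0.50 | interpolated |
| r6 | P̃1(r6) = 0.50 | measured (rank 6) |
| r7 | P̃1(r7) = 0.40 | interpolated |
| r8 | P̃1(r8) = 0.40 | measured (rank 10) |
| r9 | P̃1(r9) = 0.25 | interpolated |
| r10 | P̃1(r10) = 0.25 | measured (rank 20) |

The six other rjs (r0, r1, r3, r5, r7, r9) are interpolated.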
![Page 218: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/218.jpg)
Worked Example avg-11-pt prec: Query 2, measured datapoints
[Figure: precision (y-axis) against recall (x-axis), both from 0 to 1. Blue: Query 1; Red: Query 2. Bold circles measured; thin circles interpolated.]
Query 2 (results ranked 1 to 15; only the relevant ranks are listed)

| Rank | Relev. | R | P | Coinciding recall level |
|---|---|---|---|---|
| 1 | X | 0.33 | 1.00 | |
| 3 | X | 0.67 | 0.67 | |
| 15 | X | 1.00 | 0.20 | P̃2(r10) = 0.20 |

Only r10 coincides with a measured data point.
![Page 219: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/219.jpg)
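Worked Example avg-11-pt prec: Query 2, interpolation
[Figure: precision (y-axis) against recall (x-axis), both from 0 to 1. Blue: Query 1; Red: Query 2. Bold circles measured; thin circles interpolated.]
Query 2 (results ranked 1 to 15; only the relevant ranks are listed)

| Rank | Relev. | R | P |
|---|---|---|---|
| 1 | X | 0.33 | 1.00 |
| 3 | X | 0.67 | 0.67 |
| 15 | X | 1.00 | 0.20 |

| Recall level | Interpolated precision | Origin |
|---|---|---|
| r0 | P̃2(r0) = 1.00 | interpolated |
| r1 | P̃2(r1) = 1.00 | interpolated |
| r2 | P̃2(r2) = 1.00 | interpolated |
| r3 | P̃2(r3) = 1.00 | interpolated |
| r4 | P̃2(r4) = 0.67 | interpolated |
| r5 | P̃2(r5) = 0.67 | interpolated |
| r6 | P̃2(r6) = 0.67 | interpolated |
| r7 | P̃2(r7) = 0.20 | interpolated |
| r8 | P̃2(r8) = 0.20 | interpolated |
| r9 | P̃2(r9) = 0.20 | interpolated |
| r10 | P̃2(r10) = 0.20 | measured (rank 15) |

10 of the rjs are interpolated.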
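The interpolated values for both queries can be reproduced with a short sketch. It assumes the usual interpolation rule, in which the interpolated precision at recall level rj = j/10 is the highest measured precision at any recall greater than or equal to rj; that rule matches every value on the slides. The function name is hypothetical.

```python
def eleven_point_interpolated(relevant_ranks, num_relevant):
    """Return [P~(r0), ..., P~(r10)] for one query.

    relevant_ranks: ranks at which relevant documents were retrieved.
    num_relevant: total number of relevant documents for the query.
    """
    # Measured datapoints: (number of relevant docs seen so far, precision).
    points = [(hits, hits / rank)
              for hits, rank in enumerate(sorted(relevant_ranks), start=1)]
    interpolated = []
    for j in range(11):
        # recall >= j/10 is equivalent to hits * 10 >= j * num_relevant;
        # integer arithmetic avoids floating-point comparison issues.
        candidates = [p for hits, p in points if hits * 10 >= j * num_relevant]
        interpolated.append(max(candidates) if candidates else 0.0)
    return interpolated

q1 = eleven_point_interpolated([1, 3, 6, 10, 20], num_relevant=5)
q2 = eleven_point_interpolated([1, 3, 15], num_relevant=3)
print([round(p, 2) for p in q1])
print([round(p, 2) for p in q2])
```

Rounded to two decimals, the two printed lists agree with the P̃1 and P̃2 columns of the worked example.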
![Page 220: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/220.jpg)
Worked Example avg-11-pt prec: averaging
[Figure: precision (y-axis) against recall (x-axis), both from 0 to 1.]
Now average at each pj over N (number of queries)
→ 11 averages
![Page 221: Lecture 6: Evaluation - University of Cambridge · 6 Other types of evaluation 224. Overview 1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks](https://reader034.vdocument.in/reader034/viewer/2022042410/5f2881acd6e54d2ef45f15de/html5/thumbnails/221.jpg)
Worked Example avg-11-pt prec: area/result
[Figures: two plots of precision (y-axis) against recall (x-axis), both from 0 to 1.]
End result:
11 point average precision
Approximation of the area under the precision-recall curve
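The averaging step and the end result can be sketched as follows, using the two-decimal interpolated values from the worked example as input. The variable and function names are hypothetical, and because the inputs are already rounded, the result is only approximate.

```python
# Interpolated precision at r0..r10, copied from the two interpolation slides.
query_1 = [1.00, 1.00, 1.00, 0.67, 0.67, 0.50, 0.50, 0.40, 0.40, 0.25, 0.25]
query_2 = [1.00, 1.00, 1.00, 1.00, 0.67, 0.67, 0.67, 0.20, 0.20, 0.20, 0.20]

def average_per_level(per_query):
    """Average the interpolated precision at each recall level over all queries."""
    n = len(per_query)
    return [sum(values) / n for values in zip(*per_query)]

averages = average_per_level([query_1, query_2])      # 11 averages
eleven_point_avg = sum(averages) / len(averages)      # single summary number

print([round(a, 3) for a in averages])
print(round(eleven_point_avg, 3))
```

For these two queries the 11 point average precision comes out at about 0.61.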