
Lecture 6: Evaluation
Information Retrieval

Computer Science Tripos Part II

Ronan Cummins¹

Natural Language and Information Processing (NLIP) Group

[email protected]

2016

¹ Adapted from Simone Teufel's original slides

Overview

1 Recap/Catchup
2 Introduction
3 Unranked evaluation
4 Ranked evaluation
5 Benchmarks
6 Other types of evaluation

1 Recap/Catchup

Summary: Ranked retrieval

In VSM one represents documents and queries as weighted tf-idf vectors

Compute the cosine similarity between the vectors to rank

Language models rank based on the probability of a document model generating the query

Today

[Figure: block diagram of an IR system, with components: query, query normalisation, UI, ranking/matching module, indexes, indexer, document normalisation, document collection, and the returned set of relevant documents; the evaluation step is highlighted as today's topic.]

Today: how good are the returned documents?

2 Introduction

Measures for a search engine

How fast does it index? e.g., number of bytes per hour

How fast does it search? e.g., latency as a function of queries per second

What is the cost per query? in dollars

Measures for a search engine

All of the preceding criteria are measurable: we can quantify speed / size / money

However, the key measure for a search engine is user happiness.

What is user happiness? Factors include:

  Speed of response
  Size of index
  Uncluttered UI

We can measure:

  Rate of return to this search engine
  Whether something was bought
  Whether ads were clicked

Most important: relevance
(actually, maybe even more important: it's free)

Note that none of these is sufficient: blindingly fast, but useless answers won't make a user happy.

Most common definition of user happiness: Relevance

User happiness is equated with the relevance of search results to the query.

But how do you measure relevance?

Standard methodology in information retrieval consists of three elements:

  A benchmark document collection
  A benchmark suite of queries
  An assessment of the relevance of each query-document pair

Relevance: query vs. information need

Relevance to what? The query?

Information need i
"I am looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine."

translated into:

Query q
[red wine white wine heart attack]

So what about the following document:

Document d′
At the heart of his speech was an attack on the wine industry lobby for downplaying the role of red and white wine in drunk driving.

d′ is an excellent match for query q . . .

d′ is not relevant to the information need i.

Relevance: query vs. information need

User happiness can only be measured by relevance to an information need, not by relevance to queries.

Sloppy terminology here and elsewhere in the literature: we talk about query–document relevance judgments even though we mean information-need–document relevance judgments.

3 Unranked evaluation

Precision and recall

Precision (P) is the fraction of retrieved documents that are relevant

  Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant|retrieved)

Recall (R) is the fraction of relevant documents that are retrieved

  Recall = #(relevant items retrieved) / #(relevant items) = P(retrieved|relevant)
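As an illustration (not part of the original slides), both set measures are straightforward to compute; this minimal Python sketch assumes `retrieved` and `relevant` are sets of document ids:

def precision(retrieved, relevant):
    # fraction of retrieved documents that are relevant
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    # fraction of relevant documents that are retrieved
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)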

Precision and recall

                             THE TRUTH
                          Relevant                Nonrelevant
WHAT THE    Retrieved     true positives (TP)     false positives (FP)
SYSTEM
THINKS      Not retrieved false negatives (FN)    true negatives (TN)

[Figure: the same four outcomes drawn as overlapping sets: the relevant and retrieved sets intersect in the true positives, with false negatives, false positives and true negatives in the remaining regions.]

P = TP/(TP + FP)

R = TP/(TP + FN)

Precision/recall tradeoff

You can increase recall by returning more docs.

Recall is a non-decreasing function of the number of docs retrieved.

A system that returns all docs has 100% recall!

The converse is also true (usually): It's easy to get high precision for very low recall.

A combined measure: F

F allows us to trade off precision against recall.

  F = 1 / (α/P + (1 − α)/R) = (β² + 1)PR / (β²P + R)   where β² = (1 − α)/α

α ∈ [0, 1] and thus β² ∈ [0, ∞]

Most frequently used: balanced F with β = 1 or α = 0.5

This is the harmonic mean of P and R: 1/F = ½ (1/P + 1/R)
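A minimal Python sketch of the general F measure, added here for illustration (the function name `f_measure` is assumed, not from the slides; the parameterisation follows the formula above):

def f_measure(p, r, beta=1.0):
    # F = (beta^2 + 1) P R / (beta^2 P + R); beta = 1 gives the
    # balanced F1, i.e. the harmonic mean of P and R
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * p * r / (b2 * p + r)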

Example for precision, recall, F1

                 relevant   not relevant
retrieved              20             40         60
not retrieved          60      1,000,000  1,000,060
                       80      1,000,040  1,000,120

P = 20/(20 + 40) = 1/3

R = 20/(20 + 60) = 1/4

F1 = 2 / (1/(1/3) + 1/(1/4)) = 2/7
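Checking the worked example with the sketch functions above (illustrative only; counts taken from the contingency table):

tp, fp, fn = 20, 40, 60
p = tp / (tp + fp)      # 1/3
r = tp / (tp + fn)      # 1/4
f1 = f_measure(p, r)    # 2 * (1/12) / (7/12) = 2/7 ≈ 0.286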

Accuracy

Why do we use complex measures like precision, recall, and F?

Why not something simple like accuracy?

Accuracy is the fraction of decisions (relevant/nonrelevant) that are correct.

In terms of the contingency table above, accuracy = (TP + TN)/(TP + FP + FN + TN).

Thought experiment

Compute precision, recall and F1 for this result set:

                 relevant    not relevant
retrieved              18               2
not retrieved          82   1,000,000,000

The snoogle search engine below always returns 0 results ("0 matching results found"), regardless of the query.

Snoogle demonstrates that accuracy is not a useful measure in IR.
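A sketch of the thought experiment in code (counts taken from the table above; "snoogle" modelled as retrieving nothing):

# the system above
tp, fp, fn, tn = 18, 2, 82, 1_000_000_000
p   = tp / (tp + fp)                    # 0.9
r   = tp / (tp + fn)                    # 0.18
f1  = 2 * p * r / (p + r)               # 0.3
acc = (tp + tn) / (tp + fp + fn + tn)   # ≈ 0.99999992

# snoogle retrieves nothing: all 100 relevant docs become false negatives
tp, fp, fn, tn = 0, 0, 100, 1_000_000_002
acc = (tp + tn) / (tp + fp + fn + tn)   # ≈ 0.9999999, despite zero recall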

Why accuracy is a useless measure in IR

Simple trick to maximize accuracy in IR: always say no and return nothing

You then get 99.99% accuracy on most queries.

Searchers on the web (and in IR in general) want to find something and have a certain tolerance for junk.

It's better to return some bad hits as long as you return something.

→ We use precision, recall, and F for evaluation, not accuracy.

Recall-criticality and precision-criticality

Inverse relationship between precision and recall forces general systems to go for compromise between them

But some tasks particularly need good precision whereas others need good recall:

                          Precision-critical task      Recall-critical task
Time                      matters                      matters less
Tolerance to cases of     a lot                        none
overlooked information
Information redundancy    There may be many equally    Information is typically found
                          good answers                 in only one document
Examples                  web search                   legal search, patent search

Difficulties in using precision, recall and F

We should always average over a large set of queries.

There is no such thing as a "typical" or "representative" query.

We need relevance judgments for information-need–document pairs – but they are expensive to produce.

For alternatives to using precision/recall and having to produce relevance judgments – see end of this lecture.

4 Ranked evaluation

Moving from unranked to ranked evaluation

Precision/recall/F are measures for unranked sets.

We can easily turn set measures into measures of ranked lists.

Just compute the set measure for each "prefix": the top 1, top 2, top 3, top 4 etc. results

This is called Precision/Recall at Rank

Rank statistics give some indication of how quickly the user will find relevant documents from a ranked list

Precision/Recall @ Rank

Rank   Doc
 1     d12
 2     d123
 3     d4
 4     d57
 5     d157
 6     d222
 7     d24
 8     d26
 9     d77
10     d90

Blue documents are relevant

P@n: P@3=0.33, P@5=0.2, P@8=0.25

R@n: R@3=0.33, R@5=0.33, R@8=0.66
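In code (illustrative; the transcript loses the colour coding, so the set of "blue" documents below is an assumption chosen to reproduce the slide's numbers: d123, d24, and one relevant document d99 outside the top 10):

def precision_at(k, ranking, relevant):
    return sum(1 for d in ranking[:k] if d in relevant) / k

def recall_at(k, ranking, relevant):
    return sum(1 for d in ranking[:k] if d in relevant) / len(relevant)

ranking  = ["d12", "d123", "d4", "d57", "d157",
            "d222", "d24", "d26", "d77", "d90"]
relevant = {"d123", "d24", "d99"}        # assumed relevance judgments

precision_at(3, ranking, relevant)       # 0.33
precision_at(5, ranking, relevant)       # 0.2
precision_at(8, ranking, relevant)       # 0.25
recall_at(8, ranking, relevant)          # 0.66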

A precision-recall curve

[Figure: a precision-recall curve, with the interpolated curve drawn in red.]

Each point corresponds to a result for the top k ranked hits (k = 1, 2, 3, 4, . . . )

Interpolation (in red): Take maximum of all future points

Rationale for interpolation: The user is willing to look at more stuff if both precision and recall get better.
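A sketch of the interpolation rule ("take the maximum of all future points"), assuming `points` is a list of (recall, precision) pairs measured at the top-k prefixes:

def interpolate(points):
    points = sorted(points)               # in order of increasing recall
    best, out = 0.0, []
    for rec, prec in reversed(points):    # sweep from high recall to low,
        best = max(best, prec)            # carrying the running maximum
        out.append((rec, best))
    return list(reversed(out))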

Another idea: Precision at Recall r

Rank   S1   S2
 1     X
 2          X
 3     X
 4
 5          X
 6     X    X
 7          X
 8          X
 9     X
10     X

             S1     S2
p @ r 0.2    1.0    0.5
p @ r 0.4    0.67   0.4
p @ r 0.6    0.5    0.5
p @ r 0.8    0.44   0.57
p @ r 1.0    0.5    0.63
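The table can be reproduced with a short sketch (assuming, as above, 5 relevant documents per query, so the ith relevant document corresponds to recall i/5; names are illustrative):

def precision_at_recall_points(relevant_ranks):
    # relevant_ranks: the ranks (1-based) at which relevant docs appear
    n = len(relevant_ranks)
    return [(i / n, i / rank)
            for i, rank in enumerate(sorted(relevant_ranks), start=1)]

precision_at_recall_points([1, 3, 6, 9, 10])   # S1: 1.0, 0.67, 0.5, 0.44, 0.5
precision_at_recall_points([2, 5, 6, 7, 8])    # S2: 0.5, 0.4, 0.5, 0.57, 0.63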

Averaged 11-point precision/recall graph

Compute interpolated precision at recall levels 0.0, 0.1, 0.2, . . .

Do this for each of the queries in the evaluation benchmark

Average over queries

The curve is typical of performance levels at TREC (more later).

Averaged 11-point precision more formally

  P11pt = (1/11) Σ_{j=0}^{10} (1/N) Σ_{i=1}^{N} P̃i(rj)

with P̃i(rj) the precision at the jth recall point in the ith query (out of N)

Define 11 standard recall points rj = j/10: r0 = 0, r1 = 0.1, ..., r10 = 1

To get P̃i(rj), we can use Pi(R = rj) directly if a new relevant document is retrieved exactly at rj

Interpolation for cases where there is no exact measurement at rj:

  P̃i(rj) = max_{rj ≤ r < rj+1} Pi(R = r)   if some Pi(R = r) is measured there
  P̃i(rj) = P̃i(rj+1)                        otherwise

Note that Pi(R = 1) can always be measured.

Worked avg-11-pt prec example for supervisions at end of slides.
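An illustrative sketch of the averaged 11-point measure (names assumed, not from the slides; each query is given as its list of (recall, precision) points, one per retrieved relevant document):

def eleven_point_average(per_query_points):
    levels = [j / 10 for j in range(11)]
    total = 0.0
    for points in per_query_points:
        for r_j in levels:
            # interpolated precision at r_j: max precision at recall >= r_j
            total += max((p for r, p in points if r >= r_j), default=0.0)
    return total / (11 * len(per_query_points))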

Mean Average Precision (MAP)

Also called "average precision at seen relevant documents"

Determine precision at each point when a new relevant document gets retrieved

Use P = 0 for each relevant document that was not retrieved

Determine average for each query, then average over queries

  MAP = (1/N) Σ_{j=1}^{N} (1/Qj) Σ_{i=1}^{Qj} P(doci)

with:
  Qj       number of relevant documents for query j
  N        number of queries
  P(doci)  precision at ith relevant document

Mean Average Precision: example  (MAP = (0.564 + 0.623)/2 = 0.594)

Query 1                       Query 2
Rank      P(doci)             Rank      P(doci)
 1    X   1.00                 1    X   1.00
 3    X   0.67                 3    X   0.67
 6    X   0.50                15    X   0.20
10    X   0.40                AVG:      0.623
20    X   0.25
AVG:      0.564

(X marks ranks at which a relevant document was retrieved; ranks without a relevant document are omitted here.)
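A sketch of MAP checked against this example (function name assumed; `relevant_ranks` lists the ranks of the X'd documents):

def average_precision(relevant_ranks, n_relevant):
    precs = [i / rank
             for i, rank in enumerate(sorted(relevant_ranks), start=1)]
    precs += [0.0] * (n_relevant - len(relevant_ranks))  # unretrieved: P = 0
    return sum(precs) / n_relevant

q1 = average_precision([1, 3, 6, 10, 20], n_relevant=5)  # ≈ 0.564
q2 = average_precision([1, 3, 15], n_relevant=3)         # ≈ 0.623
map_score = (q1 + q2) / 2                                # ≈ 0.594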

ROC curve (Receiver Operating Characteristic)

x-axis: FPR (false positive rate): FP/total actual negatives

y-axis: TPR (true positive rate): TP/total actual positives (also called sensitivity) ≡ recall

FPR = fall-out = 1 − specificity (TNR; true negative rate)

But we are only interested in the small area in the lower left corner (blown up by the precision-recall graph)
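For completeness, the two ROC axes expressed over the contingency table (a sketch, not from the slides):

def tpr(tp, fn):
    return tp / (tp + fn)   # true positive rate = recall = sensitivity

def fpr(fp, tn):
    return fp / (fp + tn)   # false positive rate = fall-out = 1 - specificity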

Variance of measures like precision/recall

For a test collection, it is usual that a system does badly on some information needs (e.g., P = 0.2 at R = 0.1) and really well on others (e.g., P = 0.95 at R = 0.1).

Indeed, it is usually the case that the variance of the same system across queries is much greater than the variance of different systems on the same query.

That is, there are easy information needs and hard ones.

Overview

1 Recap/Catchup
2 Introduction
3 Unranked evaluation
4 Ranked evaluation
5 Benchmarks
6 Other types of evaluation

What we need for a benchmark

A collection of documents
Documents must be representative of the documents we expect to see in reality.

A collection of information needs
. . . which we will often incorrectly refer to as queries
Information needs must be representative of the information needs we expect to see in reality.

Human relevance assessments
We need to hire/pay “judges” or assessors to do this.
Expensive, time-consuming
Judges must be representative of the users we expect to see in reality.

First standard relevance benchmark: Cranfield

Pioneering: first testbed allowing precise quantitative measures of information retrieval effectiveness

Late 1950s, UK

1398 abstracts of aerodynamics journal articles, a set of 225 queries, exhaustive relevance judgments of all query–document pairs

Too small, too untypical for serious IR evaluation today

Second-generation relevance benchmark: TREC

TREC = Text REtrieval Conference

Organized by the U.S. National Institute of Standards and Technology (NIST)

TREC is actually a set of several different relevance benchmarks.

Best known: TREC Ad Hoc, used for the first 8 TREC evaluations between 1992 and 1999

1.89 million documents, mainly newswire articles, 450 information needs

No exhaustive relevance judgments – too expensive

Rather, NIST assessors’ relevance judgments are available only for the documents that were among the top k returned for some system which was entered in the TREC evaluation for which the information need was developed.

Sample TREC Query

<num> Number: 508
<title> hair loss is a symptom of what diseases
<desc> Description:
Find diseases for which hair loss is a symptom.
<narr> Narrative:
A document is relevant if it positively connects the loss of head hair in humans with a specific disease. In this context, “thinning hair” and “hair loss” are synonymous. Loss of body and/or facial hair is irrelevant, as is hair loss caused by drug therapy.

TREC Relevance Judgements

Humans decide which document–query pairs are relevant.

Example of more recent benchmark: ClueWeb09

1 billion web pages

25 terabytes (compressed: 5 terabytes)

Collected January/February 2009

10 languages

Unique URLs: 4,780,950,903 (325 GB uncompressed, 105 GB compressed)

Total Outlinks: 7,944,351,835 (71 GB uncompressed, 24 GB compressed)

Interjudge agreement at TREC

information need    number of docs judged    disagreements
51                  211                      6
62                  400                      157
67                  400                      68
95                  400                      110
127                 400                      106

Impact of interjudge disagreement

Judges disagree a lot. Does that mean that the results of information retrieval experiments are meaningless?

No.

Large impact on absolute performance numbers

Virtually no impact on ranking of systems

Suppose we want to know if algorithm A is better than algorithm B

An information retrieval experiment will give us a reliable answer to this question . . .

. . . even if there is a lot of disagreement between judges.

Overview

1 Recap/Catchup
2 Introduction
3 Unranked evaluation
4 Ranked evaluation
5 Benchmarks
6 Other types of evaluation

Evaluation at large search engines

Recall is difficult to measure on the web

Search engines often use precision at top k, e.g., k = 10 (a minimal sketch follows this slide) . . .

. . . or use measures that reward you more for getting rank 1 right than for getting rank 10 right.

Search engines also use non-relevance-based measures.

Example 1: clickthrough on first result
Not very reliable if you look at a single clickthrough (you may realize after clicking that the summary was misleading and the document is nonrelevant) . . .
. . . but pretty reliable in the aggregate.

Example 2: A/B testing
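A minimal sketch of the precision-at-k measure mentioned above, assuming the judged results are given as 0/1 relevance labels in rank order (the function name is illustrative, not from the lecture):

```python
def precision_at_k(ranked_rels, k=10):
    """Precision at rank k: fraction of the top-k results that are relevant.

    ranked_rels: 0/1 relevance labels in rank order.
    Positions beyond the end of the list count as non-relevant.
    """
    return sum(ranked_rels[:k]) / k

# Example: relevant documents at ranks 1, 3, and 6 of the top 10.
print(precision_at_k([1, 0, 1, 0, 0, 1, 0, 0, 0, 0], k=10))  # 0.3
```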

A/B testing

Purpose: Test a single innovation

Prerequisite: You have a large search engine up and running.

Have most users use the old system

Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation

Evaluate with an “automatic” measure like clickthrough on first result

Now we can directly see if the innovation does improve user happiness.

Probably the evaluation methodology that large search engines trust most
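The slides leave the comparison between the two arms informal; one concrete way to decide whether the diverted slice behaves differently (an assumption here, not something the lecture prescribes) is a two-proportion z-test on first-result clickthrough. All counts below are made up for illustration:

```python
from math import erf, sqrt

def ab_clickthrough_test(clicks_a, n_a, clicks_b, n_b):
    """Two-proportion z-test on first-result clickthrough rates.

    A = old system (most traffic), B = new system (diverted slice).
    Returns (z, two-sided p-value) under H0: equal clickthrough rates.
    """
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)            # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # std. error of the difference
    z = (clicks_b / n_b - clicks_a / n_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-sided normal tail
    return z, p_value

# Hypothetical counts: 1% of a million queries diverted to the new system.
print(ab_clickthrough_test(clicks_a=52_000, n_a=990_000, clicks_b=560, n_b=10_000))
```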

Take-away today

Focused on evaluation for ad-hoc retrieval

Precision, Recall, F-measure
More complex measures for ranked retrieval
Other issues arise when evaluating different tracks, e.g. QA, although these typically still use P/R-based measures

Evaluation for interactive tasks is more involved

Significance testing is an issue
Could a good result have occurred by chance?
Is the result robust across different document sets?
Slowly becoming more common
Underlying population distributions unknown, so apply non-parametric tests such as the sign test (a small sketch follows)
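A minimal sketch of the sign test mentioned in the last bullet, assuming per-query effectiveness scores (e.g., average precision) for two systems over the same queries; the function name is invented for illustration:

```python
from math import comb

def sign_test(scores_a, scores_b):
    """Paired sign test for comparing two systems query by query.

    scores_a, scores_b: per-query effectiveness scores (same queries, same order).
    Ties are discarded; under H0 each system is equally likely to win a query,
    so the number of wins is Binomial(n, 0.5). Returns the two-sided p-value.
    """
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    n, k = wins + losses, min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n   # P(X <= k), X ~ Bin(n, 0.5)
    return min(1.0, 2 * tail)

# Example: system A beats system B on 8 of 10 queries (no ties).
print(sign_test([0.4] * 8 + [0.1] * 2, [0.3] * 8 + [0.2] * 2))  # ~0.11
```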

Reading

MRS, Chapter 8

Worked Example avg-11-pt prec: Query 1, measured data points

[Figure: precision–recall plot (blue for Query 1); bold circles mark the measured data points.]

Query 1 (relevant documents at ranks 1, 3, 6, 10, 20):

Rank   Relevant   R     P      11-pt level
1      X          0.2   1.00   P̃1(r2) = 1.00
3      X          0.4   0.67   P̃1(r4) = 0.67
6      X          0.6   0.50   P̃1(r6) = 0.50
10     X          0.8   0.40   P̃1(r8) = 0.40
20     X          1.0   0.25   P̃1(r10) = 0.25

Five of the rj (r2, r4, r6, r8, r10) coincide directly with a measured data point.

Worked Example avg-11-pt prec: Query 1, interpolation

[Figure: precision–recall plot for Query 1; bold circles measured, thin circles interpolated.]

Query 1:

Rank   Relevant   R     P      11-pt levels
1      X          0.20  1.00   P̃1(r0) = 1.00, P̃1(r1) = 1.00, P̃1(r2) = 1.00
3      X          0.40  0.67   P̃1(r3) = 0.67, P̃1(r4) = 0.67
6      X          0.60  0.50   P̃1(r5) = 0.50, P̃1(r6) = 0.50
10     X          0.80  0.40   P̃1(r7) = 0.40, P̃1(r8) = 0.40
20     X          1.00  0.25   P̃1(r9) = 0.25, P̃1(r10) = 0.25

The six other rj (r0, r1, r3, r5, r7, r9) are interpolated.
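The interpolation rule behind these values (the standard definition, consistent with the numbers above and with MRS Chapter 8) can be stated compactly:

```latex
% Interpolated precision at standard recall level r_j: the highest
% precision achieved at any recall level at or beyond r_j.
\tilde{P}(r_j) = \max_{r' \ge r_j} P(r'), \qquad r_j \in \{0.0, 0.1, \ldots, 1.0\}

% 11-point average precision over N queries: average the interpolated
% values over the 11 recall levels and over the queries.
\text{avg-11-pt prec} = \frac{1}{11} \sum_{j=0}^{10} \frac{1}{N} \sum_{i=1}^{N} \tilde{P}_i(r_j)
```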

Worked Example avg-11-pt prec: Query 2, measured data points

[Figure: precision–recall plots, blue for Query 1 and red for Query 2; bold circles measured, thin circles interpolated.]

Query 2 (relevant documents at ranks 1, 3, 15):

Rank   Relevant   R     P
1      X          0.33  1.00
3      X          0.67  0.67
15     X          1.0   0.20   P̃2(r10) = 0.20

Only r10 coincides with a measured data point.

Worked Example avg-11-pt prec: Query 2, interpolation

[Figure: precision–recall plots, blue for Query 1 and red for Query 2; bold circles measured, thin circles interpolated.]

Query 2:

Rank   Relevant   R     P      11-pt levels
1      X          0.33  1.00   P̃2(r0) = 1.00, P̃2(r1) = 1.00, P̃2(r2) = 1.00, P̃2(r3) = 1.00
3      X          0.67  0.67   P̃2(r4) = 0.67, P̃2(r5) = 0.67, P̃2(r6) = 0.67
15     X          1.0   0.20   P̃2(r7) = 0.20, P̃2(r8) = 0.20, P̃2(r9) = 0.20, P̃2(r10) = 0.20

10 of the rj are interpolated.

Worked Example avg-11-pt prec: averaging

[Figure: both interpolated precision–recall curves plotted together.]

Now average the interpolated precision values at each recall level rj over the N queries (here N = 2) → 11 averages

Worked Example avg-11-pt prec: area/result

[Figure: the averaged 11-point precision–recall curve alongside the per-query curves.]

End result:

11-point average precision

Approximation of the area under the precision–recall curve
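To tie the worked example together, here is a minimal Python sketch that reproduces the interpolated values and their average. It is an illustration, not code from the lecture: the function name is invented, and the relevance labels encode the two example queries above.

```python
def interpolated_11pt(ranked_rels, n_relevant):
    """Interpolated precision at the 11 standard recall levels 0.0, 0.1, ..., 1.0.

    ranked_rels: 0/1 relevance labels in rank order.
    n_relevant:  total number of relevant documents for the query.
    """
    # (recall, precision) at every rank where a relevant document appears
    points, hits = [], 0
    for i, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            points.append((hits / n_relevant, hits / i))
    # Interpolation rule: P~(r_j) = max precision at any recall >= r_j
    return [max((p for r, p in points if r >= j / 10 - 1e-9), default=0.0)
            for j in range(11)]

# Query 1: relevant at ranks 1, 3, 6, 10, 20 (5 relevant documents).
# Query 2: relevant at ranks 1, 3, 15 (3 relevant documents).
q1 = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1] + [0] * 9 + [1]
q2 = [1, 0, 1] + [0] * 11 + [1]

p1 = interpolated_11pt(q1, 5)   # matches P̃1(r0)..P̃1(r10) up to rounding
p2 = interpolated_11pt(q2, 3)   # matches P̃2(r0)..P̃2(r10) up to rounding

per_level = [(a + b) / 2 for a, b in zip(p1, p2)]   # average over N = 2 queries
avg_11pt = sum(per_level) / 11                      # single summary number
print([round(x, 2) for x in per_level], round(avg_11pt, 2))
```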