User Performance versus Precision Measures for Simple Search Tasks
(Don’t bother improving MAP)
Andrew Turpin
Falk Scholer
{aht,fscholer}@cs.rmit.edu.au
People in glass houses should not throw stones
http://www.hartley-botanic.co.uk/hartley_images/victorian_range/victorian_range_09.jpg
Scientists should not live in glass houses. Nor straw, nor wood…
http://www-math.uni-paderborn.de/~odenbach/pics/pigs/pig2.jpg
Scientists should do more than throw stones
www.worth1000.com/entries/161000/161483INPM_w.jpg
Overview
• How are IR systems compared?
– Mean Average Precision: MAP
• Do metrics match user experience?
• First grain (Turpin & Hersh SIGIR 2000)
• Second pebble (Turpin & Hersh SIGIR 2001)
• Third stone (Allan et al SIGIR 2005)
• This golf ball (Turpin & Scholer SIGIR 2006)
[Figure: two ranked lists, each showing relevance (0 or 1) and precision at every rank]
                           List 1         List 2
P@1                        0/1 = 0.00     0/1 = 0.00
P@5                        1/5 = 0.20     2/5 = 0.40
AP (av. of P at 1’s)       0.25           0.54

AP = (sum of all precision values at relevant documents) / (number of relevant docs in the list)
List 1: (0.25) / 1 = 0.25
List 2: (0.67 + 0.40) / 2 = 0.54

AP = (sum of all precision values at relevant documents) / (number of relevant docs in all lists)
List 1: (0.25) / 3 = 0.08
List 2: (0.67 + 0.40) / 3 = 0.36
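The two AP definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the `total_relevant` parameter are made up for the example, and the relevance vector `list_1` is an assumption (a single relevant document at rank 4) chosen because it is consistent with the List 1 figures P@1 = 0.00, P@5 = 0.20 and AP = 0.25.

```python
from typing import Optional, Sequence


def precision_at_k(relevance: Sequence[int], k: int) -> float:
    """Fraction of the top k ranked documents that are relevant."""
    return sum(relevance[:k]) / k


def average_precision(relevance: Sequence[int],
                      total_relevant: Optional[int] = None) -> float:
    """Sum of precision at each relevant rank, divided by a relevant-doc count.

    With total_relevant=None the divisor is the number of relevant documents
    found in this list; otherwise it is the supplied count (for example, the
    number of relevant documents across all lists).
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    divisor = hits if total_relevant is None else total_relevant
    return precision_sum / divisor if divisor else 0.0


# Assumed relevance vector: one relevant document, at rank 4.
list_1 = [0, 0, 0, 1, 0, 0]

print("P@1:", precision_at_k(list_1, 1))
print("P@5:", precision_at_k(list_1, 5))
print("AP (relevant in this list):", average_precision(list_1))
print("AP (3 relevant overall):", round(average_precision(list_1, 3), 2))
```

Passing the divisor explicitly keeps both definitions in one function; the second form penalises a list for relevant documents it never retrieved.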
Mean Average Precision (MAP)
• Previous example showed precision for one query
• Ideally need many queries (50 or more)
• Take the mean of the AP values over all queries: MAP
• Do a paired t-test, Wilcoxon, Tukey HSD, …
• Compares systems on the same collection and same queries
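As a rough sketch of the procedure in the bullets above: MAP is the mean of per-query AP values, and a paired t statistic is computed from the per-query differences between two systems on the same queries. The AP values below are invented for illustration (five queries, far fewer than the 50 or more recommended), and the function names are hypothetical. Only the t statistic is computed; a statistics package would normally supply the p-value.

```python
import math
import statistics
from typing import Sequence


def mean_average_precision(ap_values: Sequence[float]) -> float:
    """MAP: the mean of the per-query AP values."""
    return statistics.mean(ap_values)


def paired_t_statistic(ap_a: Sequence[float], ap_b: Sequence[float]) -> float:
    """Paired t statistic for per-query AP differences (B minus A).

    Degrees of freedom are len(ap_a) - 1. Raises ZeroDivisionError if every
    difference is identical (zero sample standard deviation).
    """
    if len(ap_a) != len(ap_b):
        raise ValueError("both systems must be scored on the same queries")
    diffs = [b - a for a, b in zip(ap_a, ap_b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


# Hypothetical per-query AP values for two systems on the same five queries.
ap_system_a = [0.20, 0.35, 0.10, 0.50, 0.40]
ap_system_b = [0.30, 0.40, 0.20, 0.55, 0.50]

print("MAP A:", round(mean_average_precision(ap_system_a), 3))
print("MAP B:", round(mean_average_precision(ap_system_b), 3))
print("paired t:", round(paired_t_statistic(ap_system_a, ap_system_b), 3))
```

The test is paired because both systems are run on the same collection and the same queries, so query difficulty cancels out in the differences.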
Similarity Measure   Simple Terms   Simple Terms + Phrases   Percentage Improvement   Significance
Lnu.ltu              0.3616         0.3758                   3.9%                     unknown
BBA-AGJ-BCA          0.3497         0.3683                   5.1%                     p=0.006
BDA-CI-BCA           0.3373         0.3586                   5.9%                     p=0.006
Turpin & Moffat SIGIR 1999
Typical IR empirical systems paper
Fang et al SIGIR 2004
Monz et al SIGIR 2005
Shi et al SIGIR 2005
Jordan et al JCDL June 2006
Implicit assumption
Having more relevant documents high in the list is good
• Do users generally want more than one relevant document?
• Do users read lists top to bottom?
• Who determines relevance? Binary? Conditional or state-based?
• While MAP is tractable, does it reflect user experience?
• Is Yahoo! really better than Google, or vice-versa?
General Experiment
• Get a collection, set of queries, relevance judgments
• Compare System A and System B using MAP (Cranfield)
• Get users to do queries with System A or System B (balanced design…)
• Did the users do better with A or B?
• Did the users prefer A or B?
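One way to read the "balanced design" step is a simple rotation, so that every user meets each system equally often and every query is seen under each system by the same number of users. The sketch below is an assumption for illustration only, not the design the authors report: the rotation rule and the function name are hypothetical, and the sizes (24 users, 6 queries, two systems) are borrowed from the Experiment 2000 slide.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def balanced_assignment(n_users: int, n_queries: int,
                        systems: Sequence[str]) -> Dict[Tuple[int, int], str]:
    """Map each (user, query) pair to a system by rotating through systems.

    The rotation is balanced when n_users and n_queries are both multiples
    of len(systems).
    """
    plan = {}
    for user in range(n_users):
        for query in range(n_queries):
            plan[(user, query)] = systems[(user + query) % len(systems)]
    return plan


plan = balanced_assignment(24, 6, ["A", "B"])

# How often user 0 meets each system, and how often query 0 runs on each.
per_user = Counter(s for (user, _), s in plan.items() if user == 0)
per_query = Counter(s for (_, query), s in plan.items() if query == 0)
print(per_user, per_query)
```

With this rotation, differences between users and between queries are spread evenly over the two systems rather than confounded with one of them.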
Experiment 2000
24 Users, 6 Queries
Engine A: MAP 0.275, instance recall (IR) 0.330
Engine B: MAP 0.324, instance recall (IR) 0.390
Experiment 2001
32 Users, 8 Queries
Engine A: MAP 0.270, QA 66%
Engine B: MAP 0.354, QA 60%
Experiment 2005
• James Allan et al, UMass, SIGIR 2005
• Passage retrieval and a recall task
• Used bpref, which “tracks MAP”
• Small benefit to users when bpref goes from 0.50 to 0.60 and from 0.90 to 0.95
• No benefit in the mid range 0.60 to 0.90
                     Predicted   Actual
Instance recall      81%         15% (p = 0.27)
Question answering   58%         -6% (p = 0.41)
Experiments 2000, 2001, 2005
[Figure: MAP; series labelled Exp 2005, Exp 2001, Exp 2002; values 20% 20%, 16% 1%, 50% 0%]
Experiment 2006
32 Users, 50 Queries (100 documents)
System A: MAP 0.55
System B: MAP 0.65
System C: MAP 0.75
System D: MAP 0.85
System E: MAP 0.95
Our Sheep
[Figure: Time required to find first relevant document. x-axis: MAP (0.55, 0.65, 0.75, 0.85, 0.95); y-axis: Time (seconds), 0 to 300]
Failures
[Figure: % of queries with no relevant answer (y-axis, 0 to 25) against MAP (x-axis: 55%, 65%, 75%, 85%, 95%)]
“Better” MAP definition
Conclusion
• MAP does allow us to compare IR systems, but the assumption that an increase in MAP translates into an increase in user performance or satisfaction is not true
– Supported by 4 different experiments
• Don’t automatically choose MAP as a metric
– P@1 for Web style tasks?
[Figure: panels labelled P@1 and P@10; y-axis: Time (seconds), 0 to 300; x-axis bins 0-10% through 90-100%]
Rank of saved/viewed docs
Number of relevant found