Evaluation in (Music) Information Retrieval through the Audio Music Similarity task
DESCRIPTION
Test-collection based evaluation in (Music) Information Retrieval has been used for half a century now as the means to evaluate and compare retrieval techniques and advance the state of the art. However, this paradigm makes certain assumptions that remain a research problem and that may invalidate our experimental results. In this talk I will approach this paradigm as an estimator of certain probability distributions that describe the final user experience. These distributions are estimated with a test collection, computing system-related distributions assumed to reliably correlate with the target user-related distributions. Using the Audio Music Similarity task as an example, I will talk about issues with our current evaluation methods, the degree to which they are problematic, how to analyze them and improve the situation. In terms of validity, we will see how the measured system distributions correspond to the target user distributions, and how this correspondence affects the conclusions we draw from an experiment. In terms of reliability, we will discuss optimal characteristics of test collections and statistical procedures. In terms of efficiency, we discuss models and methods to greatly reduce the annotation cost of an evaluation experiment.
TRANSCRIPT
Evaluation in (Music) Information Retrieval through the Audio Music Similarity task
Julián Urbano
Barcelona, Spain · January 16th 2014
Spam
• @julian_urbano
• Postdoctoral researcher
– Music Technology Group, Universitat Pompeu Fabra
• Recently: PhD, Computer Science
– (Evaluation in) (Music) Information Retrieval
Information Retrieval
• Automatic representation, storage and search of unstructured information
– Traditionally textual information
– Lately multimedia too: images, video, music
• A user has an information need and uses an IR system that retrieves the relevant or significant information from a collection of documents
Information Retrieval Evaluation
• IR systems are based on models to estimate relevance, implementing different techniques
• How good is my system? What system is better?
– Answered with IR Evaluation experiments
– “if you can’t measure it, you can’t improve it”
– But we need to be able to trust our measurements
• Research on IR Evaluation
– Improve our methods to evaluate systems
– Critical for the correct development of the field
Disclaimer
• If you see…
A system is evaluated with a test collection containing queries, documents and judgments
telling how relevant a document is to a query
• …you can think of
An algorithm is evaluated with a dataset containing queries, songs and annotations
telling how similar a song is to a query
Talk outline
• Why we want to Evaluate…
• …and what we do with Cranfield
• Validity: users versus systems
• Reliability: estimating from samples
• Efficiency: reducing annotations
Introduction: Why we want to Evaluate…
The two questions
• How good is my system?
– What does good mean?
– What is good enough?
• Is system A better than system B?
– What does better mean?
– How much better?
• Efficiency? Effectiveness? Ease?
Measure user experience
• We are interested in user-measures
– Time to complete task, idle time, success rate, failure rate, frustration, ease to learn, ease to use …
• Their distributions describe user experience, fully
– For an arbitrary user, query and document collection, what is the distribution of…
[Figure: example distributions of two user-measures, time to complete task (axis starting at 0) and frustration (none, some, much)]
The big(ger) picture
• Different user-measures attempting to assess the same thing: user satisfaction
– How likely is it that an arbitrary user, with an arbitrary query (and with an arbitrary document collection) will be satisfied by the system?
• This is the ultimate goal: the good, the better
The big(ger) question
• User satisfaction…as Bernoulli trial
• Probability of satisfaction P(Sat = yes)?
• Probability that k in n users are satisfied?
• Probability of >80% users satisfied?
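The three questions above can be sketched numerically. This is a minimal illustration, assuming users are satisfied independently and with the same probability, so that the number of satisfied users follows a binomial distribution; the function names and the value of P(Sat = yes) are hypothetical, not taken from the talk.

```python
from math import comb


def prob_k_satisfied(k: int, n: int, p_sat: float) -> float:
    """Binomial pmf: probability that exactly k of n users are satisfied.

    Assumes independent users sharing the same P(Sat = yes) = p_sat.
    """
    return comb(n, k) * p_sat**k * (1 - p_sat) ** (n - k)


def prob_more_than_fraction(fraction: float, n: int, p_sat: float) -> float:
    """Probability that strictly more than `fraction` of n users are satisfied."""
    threshold = fraction * n
    return sum(
        prob_k_satisfied(k, n, p_sat) for k in range(n + 1) if k > threshold
    )


if __name__ == "__main__":
    p = 0.7  # hypothetical P(Sat = yes), for illustration only
    print(prob_k_satisfied(7, 10, p))         # exactly 7 of 10 users satisfied
    print(prob_more_than_fraction(0.8, 10, p))  # more than 80% of 10 users satisfied
```

In practice P(Sat = yes) is unknown and has to be estimated, which is what the rest of the talk is about.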
[Figure: distribution of satisfaction over the outcomes yes and no]
Introduction: …what we do with Cranfield
Sources of variability
user-measure = f(documents, query, user, system)
• Our goal is the distribution of the user-measure for our system, which is impossible to calculate
– (Possibly?) infinite population
• The best we can do is estimate it
– Sample documents, queries and users
– Measure user experience, implicitly or explicitly
– Representativeness, cost, ethics, privacy…
Fix samples
• Hard to replicate experiment and repeat results
• Just plain impossible to reproduce results
• Get a (hopefully) good sample and fix it
– Documents and queries
• But we can’t fix the users!
Simulate users…and fix them
• Cranfield paradigm: remove users, but include a user-abstraction, fixed across experiments
– Static user component: judgments in the ground truth
– Dynamic user component: effectiveness measures
• Remove all sources of variability, except systems
user-measure = f(documents, query, user, system)
user-measure = f(system)
Test collections
• Controlled set of documents, queries and judgments, shared across researchers
• (Most?) important resource for IR research
– Experiments are inexpensive (collections are not!)
– Research becomes systematic
– Reproducibility becomes possible and easy
Wait a minute
• Are we estimating distributions about users or distributions about systems?
system-effectiveness = f(system, scale, measure)
• We come up with different distributions of system-effectiveness, depending on how we abstract users from the experiment
– Different scales to assess relevance
– Different measures to model user behavior
Assumption
• System-measures correspond to user-measures
– Users: time to complete task, idle time, success rate, failure rate, frustration, ease to learn, ease to use, satisfaction, …
– Systems: P, AP, RR, DCG, nDCG, ERR, GAP, Q, …
Experiments with Test Collections
• Our goal is the users
user-measure = f(system)
• but Cranfield tells us about systems
system-effectiveness = f(system, scale, measure)
• This poses several problems
– That we have been dealing with for over 50 years
– But hey, they’re extremely interesting!
Validity, Reliability and Efficiency
• Validity: are we measuring what we want to?
– Internal: are observed effects due to hidden factors?
– External: are queries, documents and users representative?
– Construct: do system-measures match user-measures?
– Conclusion: how good is good and how better is better?
• Reliability: how repeatable are the results?
– How large do collections need to be?
– What statistical methods should be used?
• Efficiency: how inexpensive is it to get valid and reliable results? (i.e. to build a test collection)
– Can we estimate results with fewer judgments?
In this talk
How to study and improve the validity, reliability and efficiency
of the methods used to evaluate IR systems
• Audio Music Similarity task as example
– Song as query input to system, audio signal
– Retrieve songs musically similar to it, by content
– Resembles traditional Ad Hoc retrieval in Text IR
– Important task in Music IR: music recommendation, playlist generation, plagiarism detection
Validity: Effectiveness and Satisfaction
Assumption of Cranfield
• Systems with better effectiveness are perceived by users as more useful, more satisfactory
• Tricky: different effectiveness measures and relevance scales produce different distributions
– Which one is better to predict satisfaction?
• Map system effectiveness onto user satisfaction, experimentally
– If P@10 = 0.2, how likely is it that an arbitrary user will find the results satisfactory?
– What is P(Sat | P@10 = 0.2)?
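One simple way to map effectiveness onto satisfaction experimentally is to group user answers by effectiveness score and take the fraction of satisfied answers in each group. The sketch below assumes each example is a pair (effectiveness, satisfied); the function name and the toy data are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def estimate_p_sat(examples):
    """Estimate P(Sat | effectiveness) as the fraction of satisfied answers.

    `examples` is an iterable of (effectiveness, satisfied) pairs, where
    effectiveness is a score in [0, 1] and satisfied is a bool.
    Scores are rounded to two decimals so equal scores share one group.
    """
    counts = defaultdict(lambda: [0, 0])  # score -> [satisfied, total]
    for score, satisfied in examples:
        key = round(score, 2)
        counts[key][0] += int(satisfied)
        counts[key][1] += 1
    return {key: sat / total for key, (sat, total) in sorted(counts.items())}


if __name__ == "__main__":
    # Toy data, made up for illustration: four answers at P@10 = 0.2, one at 0.5.
    toy = [(0.2, True), (0.2, False), (0.2, False), (0.2, False), (0.5, True)]
    print(estimate_p_sat(toy))
```

With few answers per group the estimates are noisy, so each group needs enough examples before the fractions can be trusted.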
User-oriented System-measures
• Effectiveness measures are generally not formulated to correlate with user-satisfaction
– If effectiveness is λ = 0, we expect P(Sat) = 0
– If effectiveness is λ = 1, we expect P(Sat) = 1
– In general, we expect P(Sat | λ) = λ
• But this is not what we have
– Effectiveness measures need to be reformulated
– Upper bounds, recall components, ideal rankings
– Many mathematical details omitted in this talk
User Components: Measures and Scales
• How is relevance measured in the judgments?
– Nominal, ordinal, interval, ratio
• How are results consumed?
– Set, list
• What determines document utility?
– Positional, cascade
– Linear, exponential
• What determines user persistence?
– Navigational, informational
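To make these user components concrete, here is a minimal sketch of a set-based measure (P@k) and a positional list-based measure (DCG@k) with linear and exponential gain. These are the common textbook formulations, not the reformulated versions used in the talk, whose details are omitted; function and parameter names are assumptions.

```python
from math import log2


def precision_at_k(rels, k, min_rel=1):
    """Set-based: fraction of the top k documents with relevance >= min_rel."""
    return sum(1 for r in rels[:k] if r >= min_rel) / k


def dcg_at_k(rels, k, gain="linear"):
    """Positional: gain of each document discounted by log2(rank + 1).

    gain="linear" uses the relevance level itself,
    gain="exponential" uses 2**level - 1.
    """
    if gain == "linear":
        gain_fn = lambda r: r
    elif gain == "exponential":
        gain_fn = lambda r: 2**r - 1
    else:
        raise ValueError("gain must be 'linear' or 'exponential'")
    return sum(gain_fn(r) / log2(rank + 1) for rank, r in enumerate(rels[:k], start=1))


if __name__ == "__main__":
    ranking = [2, 0, 1, 0, 2]  # hypothetical relevance levels of the top 5 results
    print(precision_at_k(ranking, 5))
    print(dcg_at_k(ranking, 5, gain="linear"))
    print(dcg_at_k(ranking, 5, gain="exponential"))
```

The exponential gain rewards highly relevant documents much more than the linear one, which is why the choice of gain function changes the resulting distribution of effectiveness.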
Measures and Scales
Measure Original Artificial Graded Artificial Binary
Broad Fine nℒ=3 nℒ=4 nℒ=5 ℓmin=20 ℓmin=40 ℓmin=60 ℓmin=80
P@5 X X X X
AP@5 X X X X
RR@5 X X X X
CGl@5 X X X X X P@5 P@5 P@5 P@5
CGe@5 X X X X P@5 P@5 P@5 P@5
DCGl@5 X X X X X X X X X
DCGe@5 X X X X DCGl@5 DCGl@5 DCGl@5 DCGl@5
EDCGl@5 X X X X X X X X X
EDCGe@5 X X X X EDCGl@5 EDCGl@5 EDCGl@5 EDCGl@5
Ql@5 X X X X X AP@5 AP@5 AP@5 AP@5
Qe@5 X X X X AP@5 AP@5 AP@5 AP@5
RBPl@5 X X X X X X X X X
RBPe@5 X X X X RBPl@5 RBPl@5 RBPl@5 RBPl@5
ERRl@5 X X X X X X X X X
ERRe@5 X X X X ERRl@5 ERRl@5 ERRl@5 ERRl@5
GAP@5 X X X X X AP@5 AP@5 AP@5 AP@5
ADR@5 X X X X X X X X
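The artificial scales in the table can be pictured as transformations of the original Fine judgments. The sketch below assumes Fine scores range from 0 to 100, that the binary scales mark a document relevant when its score reaches ℓmin, and that the graded scales split the range into nℒ equal-width levels; the talk does not spell out the exact mapping, so these rules and the function names are assumptions.

```python
def to_binary(fine_score, l_min):
    """Artificial binary scale: relevant (1) if the Fine score is at least l_min."""
    return 1 if fine_score >= l_min else 0


def to_graded(fine_score, n_levels, max_score=100):
    """Artificial graded scale: equal-width levels 0 .. n_levels - 1 (assumed rule)."""
    level = int(fine_score * n_levels / max_score)
    return min(level, n_levels - 1)  # the maximum score falls in the top level


if __name__ == "__main__":
    fine_scores = [5, 35, 50, 75, 100]  # hypothetical Fine judgments
    print([to_binary(s, 40) for s in fine_scores])
    print([to_graded(s, 3) for s in fine_scores])
```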
Experimental design
What can we infer?
• Preference: difference noticed by user
– Positive: user agrees with evaluation
– Negative: user disagrees with evaluation
• Non-preference: difference not noticed by user
– Good: both systems are satisfactory
– Bad: both systems are not satisfactory
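The four outcomes above can be expressed as a small classification rule. This sketch assumes each example compares systems A and B, that the evaluation prefers the one with higher effectiveness, and that the user answers with one of four options; the answer labels and the function name are hypothetical.

```python
def classify_answer(delta, answer):
    """Classify a user answer against the evaluation result.

    delta  = effectiveness(A) - effectiveness(B), must be non-zero.
    answer = "A", "B", "both_good" or "both_bad" (hypothetical labels).
    """
    if answer == "both_good":
        return "non-preference, good"
    if answer == "both_bad":
        return "non-preference, bad"
    if answer not in ("A", "B"):
        raise ValueError("unknown answer")
    if delta == 0:
        raise ValueError("the evaluation shows no difference between systems")
    evaluation_prefers = "A" if delta > 0 else "B"
    if answer == evaluation_prefers:
        return "preference, positive"  # user agrees with the evaluation
    return "preference, negative"      # user disagrees with the evaluation


if __name__ == "__main__":
    print(classify_answer(0.3, "A"))
    print(classify_answer(0.3, "B"))
    print(classify_answer(-0.1, "both_bad"))
```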
Data
• Queries, documents and judgments from MIREX
• 4115 unique and artificial examples
– At least 200 examples per (measure-scale-λ)
• 432 unique queries, 5636 unique documents
• Answers collected via Crowdsourcing
– Quality control with trap questions
• 113 unique subjects
Single system: how good is it?
• For 2045 examples (49%) users could not decide which system was better
What do we expect?
Single system: how good is it?
• Large ℓmin thresholds underestimate satisfaction
Single system: how good is it?
• Users don’t pay attention to ranking?
Single system: how good is it?
• Exponential gain underestimates satisfaction
Single system: how good is it?
• Document utility independent of others
Two systems: which one is better?
• For 2090 examples (51%) users did prefer one system over the other one
What do we expect?
Two systems: which one is better?
• Large differences are needed for users to notice them
Two systems: which one is better?
• More relevance levels are better to discriminate
Two systems: which one is better?
• Cascade and navigational user models are not appropriate
Two systems: which one is better?
• Users do prefer the (supposedly) worse system
Summary
• Effectiveness and satisfaction are clearly correlated
– There is a 20% bias: P(Sat | 0) > 0 and P(Sat | 1) < 1
– Room to improve: personalization, better user abstraction
• Magnitude of differences does matter
– Just looking at rankings is very naive
• Be careful with statistical significance
– Need Δλ≈0.4 for users to agree with effectiveness (historically, only 20% of times in MIREX)
• Differences among measures and scales
– Linear gain slightly better than exponential gain
– Informational and positional user models better than navigational and cascade
– The more relevance levels, the better
41
![Page 61: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/61.jpg)
Measures and scales
Measure Original Artificial Graded Artificial Binary
Broad Fine nℒ=3 nℒ=4 nℒ=5 ℓmin=20 ℓmin=40 ℓmin=60 ℓmin=80
P@5 X X X X
AP@5 X X X X
RR@5 X X X X
CGl@5 X X X X X P@5 P@5 P@5 P@5
CGe@5 X X X X P@5 P@5 P@5 P@5
DCGl@5 X X X X X X X X X
DCGe@5 X X X X DCGl@5 DCGl@5 DCGl@5 DCGl@5
EDCGl@5 X X X X X X X X X
EDCGe@5 X X X X EDCGl@5 EDCGl@5 EDCGl@5 EDCGl@5
Ql@5 X X X X X AP@5 AP@5 AP@5 AP@5
Qe@5 X X X X AP@5 AP@5 AP@5 AP@5
RBPl@5 X X X X X X X X X
RBPe@5 X X X X RBPl@5 RBPl@5 RBPl@5 RBPl@5
ERRl@5 X X X X X X X X X
ERRe@5 X X X X ERRl@5 ERRl@5 ERRl@5 ERRl@5
GAP@5 X X X X X AP@5 AP@5 AP@5 AP@5
ADR@5 X X X X X X X X
42
![Page 62: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/62.jpg)
![Page 63: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/63.jpg)
Validity: Satisfaction over Samples
![Page 64: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/64.jpg)
Evaluate in terms of user satisfaction
• So far, arbitrary users for a single query
– $P(\mathit{Sat} \mid Ql@5 = 0.61) = 0.7$
• Easily for n users and a single query
– $P(\mathit{Sat}_{15} = 10 \mid Ql@5 = 0.61) = 0.21$
• What about a sample of queries 𝒬?
– Map queries separately for the distribution of P(Sat)
– For easier mappings, P(Sat | λ) functions are interpolated with simple polynomials
45
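The figure for 15 users is consistent with a binomial model, assuming the users are satisfied independently and with the same probability. A minimal sketch of that arithmetic follows; the function name is illustrative and not from the talk.

```python
from math import comb

def prob_k_satisfied(n_users: int, k: int, p_sat: float) -> float:
    """P(exactly k of n_users are satisfied), assuming independent users
    who are each satisfied with probability p_sat (binomial model)."""
    return comb(n_users, k) * p_sat**k * (1 - p_sat)**(n_users - k)

# Slide values: P(Sat | Ql@5 = 0.61) = 0.7, n = 15 users, k = 10 satisfied
p = prob_k_satisfied(15, 10, 0.7)
print(round(p, 2))  # 0.21
```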
![Page 65: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/65.jpg)
Expected probability of satisfaction
• Now we can compute point and interval estimates of the expected probability of satisfaction
• Intuition fails when interpreting effectiveness
46
![Page 66: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/66.jpg)
System success
• If P(Sat) ≥ threshold the system is successful
– Setting the threshold was rather arbitrary before
– Now it is meaningful, in terms of user satisfaction
• Intuitively, we want the majority of users to find the system satisfactory
– $P(\mathit{Succ}) = P(P(\mathit{Sat}) > 0.5) = 1 - F_{P(\mathit{Sat})}(0.5)$
• Improving queries for which we are bad is worthier than further improving those for which we are already good
47
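As a rough illustration of the success probability, the sketch below estimates it from per-query P(Sat) values using the empirical cdf. The per-query values are made up for the example, and the helper names are hypothetical.

```python
def ecdf(sample):
    """Return the empirical CDF: F(x) = fraction of sample values <= x."""
    values = list(sample)
    n = len(values)

    def F(x):
        return sum(1 for v in values if v <= x) / n

    return F

def prob_success(p_sat_per_query, threshold=0.5):
    """Estimate P(Succ) = P(P(Sat) > threshold) = 1 - F(threshold)."""
    return 1.0 - ecdf(p_sat_per_query)(threshold)

# Hypothetical per-query P(Sat) values, for illustration only
p_sat = [0.2, 0.45, 0.55, 0.6, 0.7, 0.8, 0.9, 0.35]
print(prob_success(p_sat))  # 5 of the 8 queries exceed 0.5
```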
![Page 67: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/67.jpg)
Distribution of P(Sat)
• But we (will) only have a handful of queries, so estimates will probably be bad
– Need to estimate the cumulative distribution function of user satisfaction: $F_{P(\mathit{Sat})}$
– Not described by any typical distribution family
• More than ≈25 queries in the collection
– ecdf approximates better
• Fewer than ≈25 queries in the collection
– Normal for graded scales, ecdf for binary scales
• Beta is always the best with the Fine scale
– Which turns out to be the best scale, overall
48
![Page 68: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/68.jpg)
Intuition fails, again
Intuitive conclusions based on effectiveness alone contradict those based on user satisfaction
– E[Δλ] = −0.002
– E[ΔP(Sat)] = 0.001
– E[ΔP(Succ)] = 0.07
49
![Page 69: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/69.jpg)
![Page 70: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/70.jpg)
![Page 71: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/71.jpg)
Historically, in MIREX
• Systems are not as satisfactory as we thought
• But they are more successful
– Good (or bad) for some kinds of queries
50
![Page 72: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/72.jpg)
Measures and scales
Measure Original Artificial Graded Artificial Binary
Broad Fine nℒ=4 nℒ=5 ℓmin=20 ℓmin=40
P@5 X X
AP@5 X X
CGl@5 X X X X P@5 P@5
CGe@5 X X X P@5 P@5
DCGl@5 X X X X X X
DCGe@5 X X X DCGl@5 DCGl@5
Ql@5 X X X X AP@5 AP@5
Qe@5 X X X AP@5 AP@5
RBPl@5 X X X X X X
RBPe@5 X X X RBPl@5 RBPl@5
GAP@5 X X X X AP@5 AP@5
51
![Page 73: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/73.jpg)
![Page 74: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/74.jpg)
Reliability: Optimal Statistical Significance Tests
![Page 75: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/75.jpg)
Random error
• Test collections are just samples from larger, possibly infinite, populations
• If we conclude system A is better than B, how confident can we be?
– Δλ𝒬 is just an estimate of the population mean μΔλ
• Usually employ some statistical significance test for differences in location
• If it is statistically significant, we have confidence that the true difference is at least that large
54
![Page 76: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/76.jpg)
Statistical hypothesis testing
• Set two mutually exclusive hypotheses
– H0: μΔλ = 0
– H1: μΔλ ≠ 0
• Run test, obtain p-value = $P(\mu_{\Delta\lambda} \ge \Delta\lambda_{\mathcal{Q}} \mid H_0)$
– p ≤ α: statistically significant, high confidence
– p > α: statistically non-significant, low confidence
• Possible errors in the binary decision
– Type I: incorrectly reject H0
– Type II: incorrectly accept H0
55
![Page 77: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/77.jpg)
Statistical significance tests
• (Non-)parametric tests
– t-test, Wilcoxon test, Sign test
• Based on resampling
– Bootstrap test, permutation/randomization test
• They make certain assumptions about distributions and sampling methods
– Often violated in IR evaluation experiments
– Which test behaves better, in practice, knowing that assumptions are violated?
56
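To make the resampling family concrete, here is a minimal sketch of an exact paired permutation (randomization) test on per-query differences. The example differences are invented. Full enumeration is only feasible for small query sets, so larger ones are usually handled by sampling sign flips at random.

```python
from itertools import product

def paired_permutation_test(diffs, tol=1e-12):
    """Exact two-sided paired randomization test on per-query differences.

    Under H0 the sign of each difference is arbitrary, so every sign
    assignment is enumerated and the p-value is the fraction whose
    absolute mean is at least as large as the observed one.
    """
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        if stat >= observed - tol:  # tolerance guards against rounding noise
            extreme += 1
        total += 1
    return extreme / total

# Hypothetical per-query differences between systems A and B
diffs = [0.12, 0.05, -0.02, 0.20, 0.08, 0.15, -0.01, 0.10]
p_value = paired_permutation_test(diffs)
print(p_value, p_value <= 0.05)
```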
![Page 78: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/78.jpg)
Optimality criteria
• Power
– Achieve significance as often as possible (low Type II)
– Usually increases Type I error rates
• Safety
– Minimize Type I error rates
– Usually decreases power
• Exactness
– Maintain Type I error rate at α level
– Permutation test is theoretically exact
57
![Page 79: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/79.jpg)
Experimental design
• Randomly split query set in two
• Evaluate all systems with both subsets
– Simulating two different test collections
• Compare p-values with both subsets
– How well do statistical tests agree with themselves?
– At different α levels
• All systems and queries from MIREX 2007-2011
– >15M p-values
58
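The split-half design can be sketched as follows for a single pair of systems. This is an illustration under assumptions, not the procedure used in the talk: it uses an exact sign test as a stand-in for whichever test is being studied, and the returned tuple is a hypothetical way to record whether the two halves agree.

```python
import random
from math import comb

def sign_test_p(diffs):
    """Exact two-sided sign test p-value (zero differences are dropped)."""
    pos = sum(1 for d in diffs if d > 0)
    neg = sum(1 for d in diffs if d < 0)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

def split_half_outcome(diffs, alpha=0.05, rng=random):
    """Randomly split the queries in two halves and test each half.

    Returns (significant_in_first, significant_in_second, same_direction).
    """
    idx = list(range(len(diffs)))
    rng.shuffle(idx)
    half = len(idx) // 2
    first = [diffs[i] for i in idx[:half]]
    second = [diffs[i] for i in idx[half:]]
    sig1 = sign_test_p(first) <= alpha
    sig2 = sign_test_p(second) <= alpha
    same_direction = (sum(first) > 0) == (sum(second) > 0)
    return sig1, sig2, same_direction
```

Repeating the split many times, over all system pairs and at several α levels, would give the kind of agreement and conflict rates the slides report.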
![Page 80: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/80.jpg)
Power and success
• Bootstrap test is the most powerful
• Wilcoxon, bootstrap and permutation are the most successful, depending on α level
59
![Page 81: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/81.jpg)
Conflicts
• Wilcoxon and t-test are the safest at low α levels
• Wilcoxon is the most exact at low α levels, but bootstrap is for usual levels
60
![Page 82: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/82.jpg)
Summary
• Bootstrap test is the most powerful, and still it has smaller Type I error rates, so we are safe
• Power and success:
– CGl@5 > GAP@5 > DCGl@5 > RBPl@5
– Fine > Broad > binary
• Conflicts:
– Very similar across measures and scales
– Corrections for multiple comparisons (e.g. Tukey) do not seem necessary
61
![Page 83: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/83.jpg)
Reliability: Test Collection Size
![Page 84: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/84.jpg)
Acceptable sample size
• Reliability is higher with larger sample sizes
– But it is also more expensive
– What is an acceptable test collection size?
• Answer with Generalizability Theory
– G-Study: estimate variance components
– D-Study: estimate reliability of different sample sizes and experimental designs
63
![Page 85: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/85.jpg)
G-study: variance components
Fully crossed experimental design: s × q
$\lambda_{q,A} = \lambda + \lambda_A + \lambda_q + \varepsilon_{qA}$
$\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{sq}^2$
• Estimated with Analysis of Variance
64
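A minimal sketch of how the variance components could be estimated from a systems-by-queries score matrix, using the usual mean squares of a two-way ANOVA with one observation per cell. Truncating negative estimates at zero is a common convention that is assumed here, not something stated on the slide.

```python
def g_study(scores):
    """Estimate variance components for a fully crossed s x q design.

    scores[s][q] is the effectiveness of system s on query q.
    Returns (var_s, var_q, var_sq); var_sq is the system-query
    interaction, confounded with residual error.
    """
    n_s, n_q = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_s * n_q)
    sys_means = [sum(row) / n_q for row in scores]
    qry_means = [sum(scores[s][q] for s in range(n_s)) / n_s for q in range(n_q)]

    # Mean squares for systems, queries and the interaction/residual
    ms_s = n_q * sum((m - grand) ** 2 for m in sys_means) / (n_s - 1)
    ms_q = n_s * sum((m - grand) ** 2 for m in qry_means) / (n_q - 1)
    ms_sq = sum(
        (scores[s][q] - sys_means[s] - qry_means[q] + grand) ** 2
        for s in range(n_s) for q in range(n_q)
    ) / ((n_s - 1) * (n_q - 1))

    var_sq = ms_sq
    var_s = max(0.0, (ms_s - ms_sq) / n_q)  # negative estimates set to 0
    var_q = max(0.0, (ms_q - ms_sq) / n_s)
    return var_s, var_q, var_sq
```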
![Page 86: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/86.jpg)
![Page 87: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/87.jpg)
![Page 88: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/88.jpg)
![Page 89: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/89.jpg)
![Page 90: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/90.jpg)
![Page 91: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/91.jpg)
![Page 92: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/92.jpg)
![Page 93: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/93.jpg)
![Page 94: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/94.jpg)
![Page 95: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/95.jpg)
Intuition
• If $\sigma_s^2$ is small or $\sigma_q^2$ is large, we need more queries
65
[Figure: scores of systems λA, λC, λD, λE, λF, λB]
![Page 96: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/96.jpg)
Intuition
• If $\sigma_s^2$ is small or $\sigma_q^2$ is large, we need more queries
65
[Figure: scores of systems λA, λC, λD, λE, λF, λB; second panel: larger $\sigma_s^2$]
![Page 97: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/97.jpg)
Intuition
• If $\sigma_s^2$ is small or $\sigma_q^2$ is large, we need more queries
65
[Figure: scores of systems λA, λC, λD, λE, λF, λB; second panel: larger $\sigma_s^2$; third panel: smaller $\sigma_q^2$ or more queries]
![Page 98: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/98.jpg)
D-study: variance ratios
• Stability of absolute scores
$\Phi(n_q) = \frac{\sigma_s^2}{\sigma_s^2 + \frac{\sigma_q^2 + \sigma_e^2}{n_q}}$
• Stability of relative scores
$E\rho^2(n_q) = \frac{\sigma_s^2}{\sigma_s^2 + \frac{\sigma_e^2}{n_q}}$
• We can easily estimate how many queries are needed to reach some level of stability (reliability)
66
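The two ratios can be computed directly and inverted to find how many queries reach a target stability. The sketch below assumes the variance components are already estimated; the numbers in the usage lines are hypothetical and the function names are illustrative.

```python
from math import ceil

def phi(var_s, var_q, var_e, n_q):
    """Stability of absolute scores with n_q queries."""
    return var_s / (var_s + (var_q + var_e) / n_q)

def e_rho2(var_s, var_e, n_q):
    """Stability of relative scores with n_q queries."""
    return var_s / (var_s + var_e / n_q)

def queries_for_phi(var_s, var_q, var_e, target):
    """Smallest n_q with phi >= target (the ratio solved for n_q)."""
    return ceil(target * (var_q + var_e) / (var_s * (1 - target)))

def queries_for_e_rho2(var_s, var_e, target):
    """Smallest n_q with e_rho2 >= target (the ratio solved for n_q)."""
    return ceil(target * var_e / (var_s * (1 - target)))

# Hypothetical variance components, for illustration only
var_s, var_q, var_e = 0.012, 0.02, 0.03
print(queries_for_phi(var_s, var_q, var_e, 0.95))
print(queries_for_e_rho2(var_s, var_e, 0.95))
```

Because the query variance only enters the absolute ratio, reaching a given level of absolute stability never takes fewer queries than reaching the same level of relative stability.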
![Page 99: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/99.jpg)
![Page 100: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/100.jpg)
Effect of query set size
• Average absolute stability Φ = 0.97
• ≈65 queries needed for Φ = 0.95, ≈100 in worst cases
• Fine scale slightly better than Broad and binary scales
• RBPl@5 and nDCGl@5 are the most stable
67
![Page 101: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/101.jpg)
Effect of query set size
• Average relative stability Eρ² = 0.98
• ≈35 queries needed for Eρ² = 0.95, ≈60 in worst cases
• Fine scale better than Broad and binary scales
• CGl@5 and RBPl@5 are the most stable
68
![Page 102: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/102.jpg)
Effect of cutoff k
• What if we use a deeper cutoff, k=10?
– From 100 queries and k=5 to 50 queries and k=10
– Should still have stable scores
– Judging effort should decrease
– Rank-based measures should become more stable
• Tested in MIREX 2012
– In 2013 too, but not analyzed here
69
![Page 103: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/103.jpg)
Effect of cutoff k
• Judging effort reduced to 72% of the usual
• Generally stable
– From Φ = 0.81 to Φ = 0.83
– From Eρ² = 0.93 to Eρ² = 0.95
70
![Page 104: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/104.jpg)
Effect of cutoff k
• Reliability given a fixed budget for judging?
– k=10 allows us to use fewer queries, about 70%
– Slightly reduced relative stability
71
![Page 105: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/105.jpg)
Effect of assessor set size
• More assessors or simply more queries?
– Judging effort is multiplied
• Can be studied with MIREX 2006 data
– 3 different assessors per query
– Nested experimental design: s × h: q
72
![Page 106: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/106.jpg)
Effect of assessor set size
• Broad scale: $\sigma_s^2 \approx \sigma_{h:q}^2$
• Fine scale: $\sigma_s^2 \gg \sigma_{h:q}^2$
• Always better to spend resources on queries
73
![Page 107: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/107.jpg)
Summary
• MIREX collections generally larger than necessary
• For fixed budget
– More queries better than more assessors
– More queries slightly better than deeper cutoff
• Worth studying alternative user model?
• Employ G-Theory while building the collection
• Fine better than Broad, better than binary
• CGl@5 and DCGl@5 best for relative stability
• RBPl@5 and nDCGl@5 best for absolute stability
74
![Page 108: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/108.jpg)
Implications
• Fixing the number of queries across years is unrealistic
– Especially because they are not intended for reuse
• Fixing the number of queries across tasks is simply nonsense
• Need to analyze on a case-by-case basis, while building the collections
– GT4IReval, R package online
– https://github.com/julian-urbano/GT4IREval
75
![Page 109: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/109.jpg)
Efficiency: Learning Relevance Distributions
![Page 110: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/110.jpg)
Probabilistic evaluation
• The MIREX setting is still expensive
– Need to judge all top k documents from all systems
– Takes days, even weeks sometimes
• Model relevance probabilistically
– Relevance judgments are random variables over the space of possible assignments of relevance
$E[R_d] = \sum_{\ell \in \mathcal{L}} P(R_d = \ell) \cdot \ell$
$\mathrm{Var}[R_d] = \sum_{\ell \in \mathcal{L}} P(R_d = \ell) \cdot \ell^2 - E[R_d]^2$
• Effectiveness measures are also probabilistic
77
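The two moments can be computed directly from a distribution over relevance levels. In this sketch the levels are assumed to be coded as integers 0, 1, 2 (one plausible coding of a three-level scale), and the function name is illustrative.

```python
def relevance_moments(dist):
    """Mean and variance of a relevance random variable R_d.

    dist maps each relevance level to its probability P(R_d = level).
    """
    mean = sum(p * level for level, p in dist.items())
    var = sum(p * level**2 for level, p in dist.items()) - mean**2
    return mean, var

# Uninformative prior over three levels (assumed coding 0, 1, 2)
uniform = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
print(relevance_moments(uniform))

# Once a document is judged, all the mass sits on one level
judged = {2: 1.0}
print(relevance_moments(judged))
```

For a judged document the variance is zero, which is what lets confidence grow as judgments are made.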
![Page 111: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/111.jpg)
Probabilistic evaluation
• Accuracy increases as we make judgments
– $E[R_d] \leftarrow r_d$
• Reliability increases too (confidence)
– $\mathrm{Var}[R_d] \leftarrow 0$
• Iteratively estimate relevance and effectiveness
– If confidence is low, make judgments
– If confidence is high, stop
• Judge as few documents as possible
78
![Page 112: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/112.jpg)
Learning distributions of relevance
• Uniform distribution is very uninformative
• Historical distribution in MIREX has high variance
• Estimate from a set of features: $P(R_d = \ell \mid \theta_d)$
– For each document separately
– Ordinal Logistic Regression
• Three sets of features
– Output-based, can always be used
– Judgment-based, to exploit known judgments
– Audio-based, to exploit musical similarity
79
![Page 113: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/113.jpg)
Learned models
• Mout : can be used even without judgments
– Similarity between systems’ outputs
– Genre and artist metadata
• Genre is highly correlated to similarity
– Decent fit, R² ≈ 0.35
• Mjud : can be used when there are judgments
– Similarity between systems’ outputs
– Known relevance of same system and same artist
• Artist is extremely correlated to similarity
– Excellent fit, R² ≈ 0.91
80
![Page 114: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/114.jpg)
Estimation errors
• Actual vs. predicted by Mout
– 0.36 with Broad and 0.34 with Fine
• Actual vs. predicted by Mjud
– 0.14 with Broad and 0.09 with Fine
• Among assessors in MIREX 2006
– 0.39 with Broad and 0.31 with Fine
• Negligible under the current MIREX setting
81
![Page 115: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/115.jpg)
Efficiency: Probabilistic Evaluation
![Page 116: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/116.jpg)
Probabilistic effectiveness measures
• Effectiveness scores become random variables too
• Example: DCGl@k
– (Usual) deterministic formulation:
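$DCG_l@k = \frac{\sum_{i=1}^{k} r_i / \log_2(i+1)}{\sum_{i=1}^{k} (n_{\mathcal{L}} - 1) / \log_2(i+1)}$
– (New) probabilistic formulation:
$E[DCG_l@k] = \frac{1}{\eta_{DCG_l}} \sum_{i=1}^{k} \frac{E[R_i]}{\log_2(i+1)}$
$\mathrm{Var}[DCG_l@k] = \frac{1}{\eta_{DCG_l}^2} \sum_{i=1}^{k} \frac{\mathrm{Var}[R_i]}{\log_2(i+1)^2}$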
83
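A minimal sketch of the probabilistic formulation follows. It assumes that η is the normalization constant in the denominator of the deterministic formula, that relevance levels run from 0 to n_levels − 1, and that the relevance of different documents is independent (so variances add). The function name is illustrative.

```python
from math import log2

def dcg_l_moments(means, variances, n_levels):
    """Expected value and variance of DCGl@k.

    means[i] and variances[i] are E[R] and Var[R] of the document at
    rank i+1; k is the length of the lists.
    """
    k = len(means)
    # Normalization: best possible score, every document at the top level
    eta = sum((n_levels - 1) / log2(i + 1) for i in range(1, k + 1))
    expected = sum(m / log2(i + 1) for i, m in enumerate(means, start=1)) / eta
    variance = sum(
        v / log2(i + 1) ** 2 for i, v in enumerate(variances, start=1)
    ) / eta**2
    return expected, variance

# Five unjudged documents with a uniform prior over levels 0, 1, 2
exp_dcg, var_dcg = dcg_l_moments([1.0] * 5, [2 / 3] * 5, n_levels=3)
print(exp_dcg, var_dcg)
```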
![Page 117: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/117.jpg)
![Page 118: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/118.jpg)
Probabilistic effectiveness measures
• From there we can compute $\Delta DCG_l@k_{AB}$
• And averages over a sample of queries 𝒬
• Different approaches to compute estimates
– Deal with dependence of random variables
– Different definitions of confidence
• For measures based on ideal ranking (nDCGl@k and RBPl@k) we do not have a closed form
– Approximated with Delta method and Taylor series
84
![Page 119: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/119.jpg)
Ranking without judgments
1. Estimate relevance with Mout
2. Estimate relative differences and rank systems
• Average confidence in the rankings is 94%
• Average accuracy of the ranking is 92%
85
![Page 120: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/120.jpg)
Ranking without judgments
• Can we trust individual estimates?
– Ideally, we want X% accuracy when X% confidence
– Confidence slightly overestimated in [0.9, 0.99)
86
DCGl@5

| Confidence | Broad: in bin | Broad: accuracy | Fine: in bin | Fine: accuracy |
|---|---|---|---|---|
| [0.5, 0.6) | 23 (6.5%) | 0.826 | 22 (6.2%) | 0.636 |
| [0.6, 0.7) | 14 (4%) | 0.786 | 16 (4.5%) | 0.812 |
| [0.7, 0.8) | 14 (4%) | 0.571 | 11 (3.1%) | 0.364 |
| [0.8, 0.9) | 22 (6.2%) | 0.864 | 21 (6%) | 0.762 |
| [0.9, 0.95) | 23 (6.5%) | 0.87 | 19 (5.4%) | 0.895 |
| [0.95, 0.99) | 24 (6.8%) | 0.917 | 27 (7.7%) | 0.926 |
| [0.99, 1) | 232 (65.9%) | 0.996 | 236 (67%) | 0.996 |
| E[Accuracy] | | 0.938 | | 0.921 |
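The last row of the table appears consistent with the count-weighted mean of the per-bin accuracies. The short check below reproduces it from the numbers in the table; reading E[Accuracy] as that weighted mean is an interpretation, not something the slide states.

```python
# (estimates in bin, accuracy in bin), copied from the table above
broad = [(23, 0.826), (14, 0.786), (14, 0.571), (22, 0.864),
         (23, 0.87), (24, 0.917), (232, 0.996)]
fine = [(22, 0.636), (16, 0.812), (11, 0.364), (21, 0.762),
        (19, 0.895), (27, 0.926), (236, 0.996)]

def weighted_accuracy(bins):
    """Mean accuracy over all estimates, weighting each bin by its count."""
    total = sum(count for count, _ in bins)
    return sum(count * acc for count, acc in bins) / total

print(round(weighted_accuracy(broad), 3))  # 0.938
print(round(weighted_accuracy(fine), 3))   # 0.921
```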
![Page 121: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/121.jpg)
Relative estimates with judgments
1. Estimate relevance with Mout
2. Estimate relative differences and rank systems
3. While confidence is low (<95%)
   1. Select a document and judge it
   2. Update relevance estimates with Mjud when possible
   3. Update estimates of differences and rank systems
• What documents should we judge?
– Those that are the most informative
– Measure-dependent
87
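The loop above can be sketched for one pair of systems as follows. This is a simplified illustration under several assumptions that do not come from the talk: a CG-like measure (a plain sum of relevance), independent documents, confidence computed with a normal approximation, "most informative" taken to mean highest variance, and no re-estimation with Mjud after each judgment. All names are hypothetical.

```python
from math import erf, sqrt

def confidence_positive(mean, var):
    """P(difference > 0) under a normal approximation (an assumption)."""
    if var == 0:
        return 1.0 if mean > 0 else 0.0 if mean < 0 else 0.5
    return 0.5 * (1 + erf(mean / sqrt(2 * var)))

def judge_until_confident(docs_a, docs_b, estimates, oracle, target=0.95):
    """Judge documents until the sign of the A-B difference is trusted.

    docs_a, docs_b: document ids retrieved by systems A and B.
    estimates: dict doc_id -> (mean, variance) of its relevance.
    oracle: callable doc_id -> judged relevance (the human assessor).
    Returns (sign of the estimated difference, confidence, judgments made).
    """
    estimates = dict(estimates)
    # Documents retrieved by both systems cancel out in the difference
    only_a = [d for d in docs_a if d not in docs_b]
    only_b = [d for d in docs_b if d not in docs_a]
    judged = 0
    while True:
        mean = (sum(estimates[d][0] for d in only_a)
                - sum(estimates[d][0] for d in only_b))
        var = sum(estimates[d][1] for d in only_a + only_b)
        p = confidence_positive(mean, var)
        conf = max(p, 1 - p)
        pending = [d for d in only_a + only_b if estimates[d][1] > 0]
        if conf >= target or not pending:
            return (mean > 0) - (mean < 0), conf, judged
        doc = max(pending, key=lambda d: estimates[d][1])  # most uncertain
        estimates[doc] = (oracle(doc), 0.0)  # a judgment removes uncertainty
        judged += 1
```

A fuller version would also re-estimate the remaining documents after each judgment, which is the role Mjud plays in the steps above.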
![Page 122: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/122.jpg)
Relative estimates with judgments
• Judging effort dramatically reduced
– 1.3% with CGl@5, 9.7% with RBPl@5
• Average accuracy still 92%, but improved individually
– 74% of estimates with >99% confidence, 99.9% accurate
– Expected accuracy improves slightly from 0.927 to 0.931
88
![Page 123: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/123.jpg)
Absolute estimates with judgments
1. Estimate relevance with Mout
2. Estimate absolute effectiveness scores
3. While confidence is low (expected error >±0.05)
   1. Select a document and judge it
   2. Update relevance estimates with Mjud when possible
   3. Update estimates of absolute effectiveness scores
• What documents should we judge?
– Those that reduce variance the most
– Measure-dependent
89
![Page 124: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/124.jpg)
Absolute estimates with judgments
• The stopping condition is overly confident
– Virtually no judgments are even needed (supposedly)
• But effectiveness is highly overestimated
– Especially with nDCGl@5 and RBPl@5
– Mjud, and especially Mout, tend to overestimate relevance
90
![Page 125: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/125.jpg)
Absolute estimates with judgments
• Practical fix: correct variance
• Estimates are better, but at the cost of judging
– Need between 15% and 35% of judgments
91
![Page 126: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/126.jpg)
Summary
• Estimate ranking of systems with no judgments
– 92% accuracy on average, trustworthy individually
– Statistically significant differences are always correct
• If we want more confidence, judge documents
– As few as 2% needed to reach 95% confidence
– 74% of estimates have >99% confidence and accuracy
• Estimate absolute scores, judging as necessary
– Around 25% needed to ensure error <0.05
92
![Page 127: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/127.jpg)
Implications
• We do not need dozens of volunteers to make thousands of judgments over several days
• Just one person spending a couple hours is fine
• The spare manpower can be put to better use
– Redundant judgments to have better estimates
– Make annotations for other tasks
• It naturally promotes collaborative creation of test collections by iteratively adding the judgments needed in each experiment (if any)
93
![Page 128: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/128.jpg)
Future Work
![Page 129: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/129.jpg)
Validity
• User studies to understand user behavior
• What information to include in test collections
• Other forms of relevance judgment to better capture document utility
• Explicitly define judging guidelines
• Similar mapping for Text IR
– Different user models within the same task
95
![Page 130: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/130.jpg)
Reliability
• Corrections for Multiple Comparisons
• Methods to reliably estimate reliability while building test collections
96
![Page 131: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/131.jpg)
Efficiency
• Better models to estimate document relevance
• Correct variance when having just a few relevance judgments available
• Estimate relevance beyond k=5
• Other stopping conditions and document weights
97
![Page 132: Evaluation in (Music) Information Retrieval through the Audio Music Similarity task](https://reader030.vdocument.in/reader030/viewer/2022020115/5497bd1db479593d4d8b527f/html5/thumbnails/132.jpg)
Conduct similar studies
for the wealth of tasks in
Music Information Retrieval