Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain
DESCRIPTION
The Music Information Retrieval field has acknowledged the need for rigorous scientific evaluations for some time now. Several efforts have set out to develop and provide the necessary infrastructure, technology and methodologies to carry out these evaluations, out of which the annual Music Information Retrieval Evaluation eXchange emerged. The community as a whole has gained enormously from this evaluation forum, but very little attention has been paid to reliability and correctness issues. From the standpoint of the analysis of experimental validity, this paper presents a survey of past meta-evaluation work in the context of Text Information Retrieval, arguing that the music community still needs to address various issues concerning the evaluation of music systems and the IR cycle, and pointing out directions for further research and proposals in this line.
TRANSCRIPT
Information Retrieval Meta-Evaluation:
Challenges and Opportunities
in the Music Domain
Julián Urbano @julian_urbano
University Carlos III of Madrid
ISMIR 2011 · Miami, USA · October 26th
current evaluation practices
hinder the proper
development of Music IR
we lack
meta-evaluation studies
we can’t complete the IR
research & development cycle
how did we get here?
1960 → 2011
Cranfield 2 (1962-1966): the basis
MEDLARS (1966-1967): users
SMART (1961-1995): collections
TREC (1992-today): large-scale
CLEF (2000-today) & NTCIR (1999-today): multi-language & multi-modal
ISMIR (2000-today)
ISMIR 2001 resolution on the need to create standardized MIR test collections, tasks, and evaluation metrics for MIR research and development
3 workshops (2002-2003): The MIR/MDL Evaluation Project
MIREX (2005-today)
follow the steps of the Text IR folks, but carefully: not everything applies to music
>1,200 runs!
are we done already?
positive impact on MIR
Evaluation is not easy: nearly 2 decades of Meta-Evaluation in Text IR
a lot of things have happened here!
“not everything applies”… but much of it does!
some good practices inherited from here
we still have
a very long
way to go
evaluation
Cranfield Paradigm
Task
User Model
Experimental Validity
how well an experiment meets the well-grounded
requirements of the scientific method
do the results fairly and actually assess
what was intended?
Meta-Evaluation
analyze the validity of IR Evaluation experiments
[Table: which evaluation components (Task, User model, Documents, Queries, Ground truth, Systems, Measures) are concerned by each type of validity: Construct, Content, Convergent, Criterion, Internal, External and Conclusion]
experimental failures
construct validity
what?
do the variables of the experiment correspond
to the theoretical meaning of the concept
they purport to measure?
how?
thorough selection and justification
of the variables used
#fail
measure quality of a Web search engine
by the number of visits
construct validity in IR
effectiveness measures and their user model [Carterette, SIGIR2011]
set-based measures do not resemble real users [Sanderson et al., SIGIR2010]
rank-based measures are better [Järvelin et al., TOIS2002]
graded relevance is better [Voorhees, SIGIR2001][Kekäläinen, IP&M2005]
other forms of ground truth are better [Bennet et al., SIGIRForum2008]
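A toy sketch of that last contrast (relevance grades here are hypothetical): a set-based measure like P@k cannot tell two systems apart when they retrieve the same relevant set, while a rank-based graded measure like nDCG rewards the one that puts the best documents first.

```python
import math

def precision_at_k(rels, k):
    # Set-based, binary: fraction of the top-k that is relevant (grade > 0).
    return sum(1 for r in rels[:k] if r > 0) / k

def ndcg_at_k(rels, k):
    # Rank-based, graded: discounted cumulative gain, normalized by the
    # gain of the ideal (best possible) ordering of the same judgments.
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = sorted(rels, reverse=True)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Two hypothetical systems, graded judgments (0-3) for their top 5 results.
# Both retrieve the same relevant documents; only the ordering differs.
system_a = [3, 2, 1, 0, 0]  # best documents first
system_b = [0, 0, 1, 2, 3]  # best documents last

print(precision_at_k(system_a, 5), precision_at_k(system_b, 5))  # 0.6 0.6
print(round(ndcg_at_k(system_a, 5), 2), round(ndcg_at_k(system_b, 5), 2))
```

P@5 is identical for both, so under that measure the user model is "inspect an unordered set of 5"; nDCG separates them because its discount encodes a user who scans from the top.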
content validity
what?
do the experimental units reflect and represent
the elements of the domain under study?
how?
careful selection of the experimental units
#fail
measure reading comprehension
only with sci-fi books
content validity in IR
tasks closely resembling real-world settings
systems completely fulfilling real-user needs
heavy user component, difficult to control
evaluate the system component instead [Cleverdon, SIGIR2001][Voorhees, CLEF2002]
actual value of systems is really unknown [Marchionini, CACM2006]
sometimes they just do not work with real users [Turpin et al., SIGIR2001]
content validity in IR
documents resembling real-world settings
large and representative samples
careful selection of queries, diverse but reasonable [Voorhees, CLEF2002][Carterette et al., ECIR2009]
some queries are better to differentiate bad systems [Guiver et al., TOIS2009][Robertson, ECIR2011]
random selection is not good, especially for Machine Learning
convergent validity
what?
do the results agree with others, theoretical or
experimental, they should be related with?
how?
careful examination and confirmation
of the relationship between the results
and others supposedly related
#fail
measures of math skills not correlated
with abstract thinking
convergent validity in IR
ground truth data is subjective
differences across groups and over time
different results depending on who evaluates
absolute numbers change
relative differences hold for the most part [Voorhees, IP&M2000]
for large-scale evaluations or varying experience of assessors, differences do exist [Carterette et al., 2010]
convergent validity in IR
measures are precision- or recall-oriented
they should therefore be correlated with each other
but they actually are not [Kekäläinen, IP&M2005][Sakai, IP&M2007]
better correlated with others than with themselves! [Webber et al., SIGIR2008]
correlation with user satisfaction in the task [Sanderson et al., SIGIR2010]
ranks, unconventional judgments, discounted gain… [Bennet et al., SIGIRForum2008][Järvelin et al., TOIS2002]
reliability?
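One common way to check whether two measures converge is to rank the same set of systems under each and correlate the rankings, e.g. with Kendall's τ. A minimal sketch, with made-up scores for five systems under a precision-oriented and a recall-oriented measure:

```python
from itertools import combinations

def kendall_tau(scores_x, scores_y):
    # Rank correlation over the same systems: +1 means identical orderings,
    # -1 means reversed. No tie handling, for brevity.
    pairs = list(combinations(range(len(scores_x)), 2))
    concordant = sum(1 for i, j in pairs
                     if (scores_x[i] - scores_x[j]) * (scores_y[i] - scores_y[j]) > 0)
    return (2 * concordant - len(pairs)) / len(pairs)

# Hypothetical scores for systems s1..s5 under two measures.
# If the measures converged, tau would be close to +1.
prec_scores = [0.82, 0.75, 0.64, 0.58, 0.41]
rec_scores  = [0.30, 0.48, 0.52, 0.61, 0.25]

print(kendall_tau(prec_scores, rec_scores))  # -0.2: the measures disagree
```

A τ near zero (or negative, as in this toy case) means the two measures rank systems almost independently, which is exactly the convergent-validity warning the slide raises.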
criterion validity
what?
are the results correlated with those of
other experiments already known to be valid?
how?
careful examination and confirmation of the
correlation between our results and previous ones
#fail
ask if the new drink is good
instead of better than the old one
criterion validity in IR
practical large-scale methodologies: pooling [Buckley et al., SIGIR2004]
judgments by non-experts [Bailey et al., SIGIR2008]
crowdsourcing for low-cost [Alonso et al., SIGIR2009][Carvalho et al., SIGIRForum2010]
estimate measures with fewer judgments [Yilmaz et al., CIKM2006][Yilmaz et al., SIGIR2008]
select what documents to judge, by informativeness [Carterette et al., SIGIR2006][Carterette et al., SIGIR2007]
use no relevance judgments at all [Soboroff et al., SIGIR2001]
less effort, but same results?
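A minimal sketch of pooling, with hypothetical runs over a toy collection: only documents that some system ranked near the top get judged, so judging effort shrinks, while everything outside the pool is silently treated as non-relevant.

```python
def pool(runs, depth):
    # Union of the top-`depth` documents from each system's ranked run:
    # only pooled documents are judged by assessors.
    pooled = set()
    for run in runs:
        pooled.update(run[:depth])
    return pooled

# Three hypothetical ranked runs (document IDs are made up).
runs = [
    ["d1", "d2", "d3", "d4"],
    ["d2", "d5", "d1", "d6"],
    ["d7", "d2", "d8", "d1"],
]
print(sorted(pool(runs, 2)))  # judge only 4 documents instead of all 8
```

The criterion-validity question is then whether measures computed on this cheaper, incomplete ground truth still rank systems the same way as full judgments would.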
internal validity
what?
can the conclusions be rigorously drawn
from the experiment alone
and not other overlooked factors?
how?
careful identification and control of possible confounding variables and selection of design
#fail
measure usefulness of Windows vs Linux vs iOS
only with Apple employees
internal validity in IR
inconsistency: performance depends on assessors [Voorhees, IP&M2000][Carterette et al., SIGIR2010]
incompleteness: performance depends on pools
system reinforcement [Zobel, SIGIR2008]
affects reliability of measures and overall results [Sakai, JIR2008][Buckley et al., SIGIR2007]
train-test: same characteristics in queries and docs
improvements on the same collections: overfitting [Voorhees, CLEF2002]
measures must be fair to all systems, especially for Machine Learning
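The incompleteness problem can be sketched as follows (document IDs and judgments are made up): a system that did not contribute to the pool retrieves an unjudged document, which a standard measure like Average Precision scores as non-relevant, so the newcomer is penalized by the evaluation setup rather than by its actual quality.

```python
def average_precision(ranked, judged_relevant):
    # Standard AP: unjudged documents are treated as non-relevant.
    # This assumption is where pool bias against non-contributors creeps in.
    hits, score = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in judged_relevant:
            hits += 1
            score += hits / i
    return score / len(judged_relevant)

judged_relevant = {"d1", "d2"}       # judgments from a pool the newcomer never fed
contributor = ["d1", "d2", "d3"]     # this run contributed to the pool
newcomer = ["d9", "d1", "d2"]        # d9 may well be relevant, but was never judged

print(average_precision(contributor, judged_relevant))             # 1.0
print(round(average_precision(newcomer, judged_relevant), 3))      # 0.583
```

Both runs place the two judged-relevant documents consecutively, yet the newcomer loses roughly 40% of its score just for ranking an unjudged document first.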
external validity
what?
can the results be generalized
to other populations and experimental settings?
how?
careful design and justification
of sampling and selection methods
#fail
study cancer treatment mostly with teenage males
external validity in IR
weakest point of IR Evaluation [Voorhees, CLEF2002]
large-scale is always incomplete [Zobel, SIGIR2008][Buckley et al., SIGIR2004]
test collections are themselves an evaluation result
but they become hardly reusable [Carterette et al., WSDM2010][Carterette et al., SIGIR2010]
external validity in IR
cross-collection comparisons are unjustified
performance highly depends on test collection characteristics [Bodoff et al., SIGIR2007][Voorhees, CLEF2002]
systems perform differently with different collections
interpretation of results must be in terms of pairwise comparisons, not absolute numbers [Voorhees, CLEF2002]
do not claim anything about state of the art
based on a handful of experiments
baselines can be used to compare across collections [Armstrong et al., CIKM2009]: meaningful, not random!
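The baseline idea can be sketched as follows (the MAP scores are hypothetical): instead of comparing absolute numbers across collections, compare each system's relative improvement over a shared, meaningful baseline run on the same collection.

```python
def relative_improvement(system_score, baseline_score):
    # Relative gain over a shared baseline on the same collection:
    # comparable across collections, unlike the absolute scores.
    return (system_score - baseline_score) / baseline_score

# Hypothetical MAP scores on two collections of very different difficulty.
# The absolute scores suggest the system got worse on collection B;
# relative to the baseline, it improves equally on both.
map_a_system, map_a_baseline = 0.36, 0.30
map_b_system, map_b_baseline = 0.18, 0.15

print(round(relative_improvement(map_a_system, map_a_baseline), 3))  # 0.2
print(round(relative_improvement(map_b_system, map_b_baseline), 3))  # 0.2
```

The comparison is only as meaningful as the baseline: a random baseline makes any system look like a large relative improvement, which is why the slide insists on meaningful, not random, baselines.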
conclusion validity
what?
are the conclusions justified based on the results?
how?
careful selection of the measuring instruments and
statistical methods used to draw grand conclusions
#fail
more access to the Internet in China than in the US
because of the larger total number of users
conclusion validity in IR
measures should be sensitive and stable [Buckley et al., SIGIR2000]
and also powerful [Voorhees et al., SIGIR2002][Sakai, IP&M2007]
with little effort [Sanderson et al., SIGIR2005]
always bearing in mind
the user model and the task
conclusion validity in IR
statistical methods to compare score distributions [Smucker et al., CIKM2007][Webber et al., CIKM2008]
correct interpretation of the statistics
hypothesis testing is troublesome
statistical significance ≠ practical significance
increasing #queries (sample size) increases power
to detect ever smaller differences (effect-size)
eventually, everything is statistically significant
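The power argument can be simulated (all numbers here are made up): fix a practically negligible per-query improvement, add realistic per-query noise, and watch the paired t statistic grow past the usual 1.96 threshold as the query set gets larger.

```python
import math
import random

def paired_t_statistic(diffs):
    # t = mean(d) / (sd(d) / sqrt(n)): for a fixed effect and noise level,
    # it grows like sqrt(n), so any nonzero difference eventually "wins".
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

random.seed(42)

def simulate(n_queries):
    # Hypothetical per-query score differences: a tiny real effect of
    # +0.005, with per-query noise of sd 0.05.
    diffs = [0.005 + random.gauss(0, 0.05) for _ in range(n_queries)]
    return paired_t_statistic(diffs)

for n in (25, 400, 10000):
    print(n, round(simulate(n), 2))  # |t| > 1.96 reads as "significant" at alpha=0.05
```

With thousands of queries the +0.005 difference is reliably "statistically significant", even though no user would ever notice it: significance detected, practical relevance absent.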
challenges
IR Research & Development Cycle
MIR evaluation practices
do not allow us to complete this cycle
loose definition of task, intent and user model
realistic data
collections are too small and/or biased
lack of realistic, controlled public collections
private, undescribed and unanalyzed collections emerge
can’t replicate results, often leading to wrong conclusions
standard formats and evaluation software to minimize bugs
undocumented measures, no accepted evaluation software
lack of baselines as lower bound (random is not a baseline!)
proper statistics and correct interpretation of statistics
raw musical material unknown
undocumented queries and/or documents
go back to private collections: overfitting!
collections can’t be reused
blind improvements
go back to private collections: overfitting!
collections
large, heterogeneous and controlled
not a hard endeavour, except for the damn copyright
Million Song Dataset!
still problematic (new features?, actual music)
standardize collections across tasks
better understanding and use of improvements
raw music data
essential for Learning and Improvement phases
use copyright-free data
Jamendo!
study possible biases
reconsider artificial material
evaluation model
let teams run their own algorithms (needs public collections)
relieves IMIRSEL and promotes wider participation
successfully used for 20 years in Text IR venues
adopted by MusiCLEF
only viable alternative in the long run
MIREX-DIY platforms still don’t allow full completion
of the IR Research & Development Cycle
organization
IMIRSEL plans, schedules and runs everything
add a 2nd tier of organizers, task-specific
logistics, planning, evaluation, troubleshooting…
format of large forums like TREC and CLEF
smooth the process and develop tasks that really
push the limits of the state of the art
overview papers
every year, by task organizers
detail the evaluation process, data, results
discussion to boost Interpretation and Learning
perfect wrap-up for team papers
rarely discuss results, and many are not even drafted
specific methodologies
MIR has unique methodologies and measures
meta-evaluate: analyze and improve
human effects on the evaluation
user satisfaction
standard evaluation software
bugs are inevitable
open evaluation software to everybody
gain reliability
speed up the development process
serve as documentation for newcomers
promote standardization of formats
baselines
help measure the overall progress of the field
standard formats + standard software +
public controlled collections + raw music +
task-specific organization
measure the state of the art
commitment
we need to acknowledge the current problems
MIREX should not only be a place to
evaluate and improve systems
but also a place to
meta-evaluate and improve how we evaluate
and a place to
design tasks that challenge researchers
analyze our evaluation methodologies
we all need to start
questioning evaluation practices
it’s worth it
it’s not that everything we do is wrong…
it’s that we don’t know it!