Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain
DESCRIPTION
The Music Information Retrieval field has acknowledged the need for rigorous scientific evaluations for some time now. Several efforts have set out to develop and provide the necessary infrastructure, technology and methodologies to carry out these evaluations, out of which the annual Music Information Retrieval Evaluation eXchange emerged. The community as a whole has gained enormously from this evaluation forum, but very little attention has been paid to reliability and correctness issues. From the standpoint of the analysis of experimental validity, this paper presents a survey of past meta-evaluation work in the context of Text Information Retrieval, arguing that the music community still needs to address various issues concerning the evaluation of music systems and the IR cycle, and pointing out directions for further research and proposals in this line.
TRANSCRIPT
Information Retrieval Meta-Evaluation:
Challenges and Opportunities
in the Music Domain
Julián Urbano @julian_urbano
University Carlos III of Madrid
ISMIR 2011 · Miami, USA · October 26th
current evaluation practices
hinder the proper
development of Music IR
we lack
meta-evaluation studies
we can’t complete the IR
research & development cycle
how did we get here?
1960 → 2011
Cranfield 2 (1962-1966): the basis
MEDLARS (1966-1967): users
SMART (1961-1995): collections
TREC (1992-today): large-scale
CLEF (2000-today) & NTCIR (1999-today): multi-language & multi-modal
ISMIR (2000-today)
ISMIR 2001 resolution on the need to create standardized MIR test collections, tasks, and evaluation metrics for MIR research and development
3 workshops (2002-2003): The MIR/MDL Evaluation Project
MIREX (2005-today)
follow the steps of the Text IR folks, but carefully: not everything applies to music
>1,200 runs!
are we done already?
positive impact on MIR
Evaluation is not easy: nearly 2 decades of Meta-Evaluation in Text IR
a lot of things have happened here!
“not everything applies”… but much of it does!
some good practices inherited from here
we still have
a very long
way to go
evaluation
Cranfield Paradigm
Task
User Model
Experimental Validity
how well an experiment meets the well-grounded
requirements of the scientific method
do the results fairly and actually assess
what was intended?
Meta-Evaluation
analyze the validity of IR Evaluation experiments
[Table: which evaluation components (Task, User model, Documents, Queries, Ground truth, Systems, Measures) are concerned by each type of validity: Construct, Content, Convergent, Criterion, Internal, External and Conclusion]
experimental failures
construct validity
what?
do the variables of the experiment correspond
to the theoretical meaning of the concept
they purport to measure?
how?
thorough selection and justification
of the variables used
#fail
measure quality of a Web search engine
by the number of visits
construct validity in IR
effectiveness measures and their user model [Carterette, SIGIR2011]
set-based measures do not resemble real users [Sanderson et al., SIGIR2010]
rank-based measures are better [Järvelin et al., TOIS2002]
graded relevance is better [Voorhees, SIGIR2001][Kekäläinen, IP&M2005]
other forms of ground truth are better [Bennet et al., SIGIRForum2008]
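A toy sketch of that last contrast (relevance grades here are hypothetical): a set-based measure like P@k cannot tell two systems apart when they retrieve the same relevant set, while a rank-based graded measure like nDCG rewards the one that puts the best documents first.

```python
import math

def precision_at_k(rels, k):
    # Set-based, binary: fraction of the top-k that is relevant (grade > 0).
    return sum(1 for r in rels[:k] if r > 0) / k

def ndcg_at_k(rels, k):
    # Rank-based, graded: discounted cumulative gain, normalized by the
    # gain of the ideal (best possible) ordering of the same judgments.
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = sorted(rels, reverse=True)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Two hypothetical systems, graded judgments (0-3) for their top 5 results.
# Both retrieve the same relevant documents; only the ordering differs.
system_a = [3, 2, 1, 0, 0]  # best documents first
system_b = [0, 0, 1, 2, 3]  # best documents last

print(precision_at_k(system_a, 5), precision_at_k(system_b, 5))  # 0.6 0.6
print(round(ndcg_at_k(system_a, 5), 2), round(ndcg_at_k(system_b, 5), 2))
```

P@5 is identical for both, so under that measure the user model is "inspect an unordered set of 5"; nDCG separates them because its discount encodes a user who scans from the top.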
content validity
what?
do the experimental units reflect and represent
the elements of the domain under study?
how?
careful selection of the experimental units
#fail
measure reading comprehension
only with sci-fi books
content validity in IR
tasks closely resembling real-world settings
systems completely fulfilling real-user needs
heavy user component, difficult to control
evaluate the system component instead [Cleverdon, SIGIR2001][Voorhees, CLEF2002]
actual value of systems is really unknown [Marchionini, CACM2006]
sometimes they just do not work with real users [Turpin et al., SIGIR2001]
content validity in IR
documents resembling real-world settings
large and representative samples
careful selection of queries, diverse but reasonable [Voorhees, CLEF2002][Carterette et al., ECIR2009]
some queries are better to differentiate bad systems [Guiver et al., TOIS2009][Robertson, ECIR2011]
random selection is not good, especially for Machine Learning
convergent validity
what?
do the results agree with others, theoretical or
experimental, they should be related with?
how?
careful examination and confirmation
of the relationship between the results
and others supposedly related
#fail
measures of math skills not correlated
with abstract thinking
convergent validity in IR
ground truth data is subjective
differences across groups and over time
different results depending on who evaluates
absolute numbers change
relative differences hold for the most part [Voorhees, IP&M2000]
for large-scale evaluations or varying experience of assessors, differences do exist [Carterette et al., 2010]
convergent validity in IR
measures are precision- or recall-oriented
they should therefore be correlated with each other
but they actually are not [Kekäläinen, IP&M2005][Sakai, IP&M2007]
better correlated with others than with themselves! [Webber et al., SIGIR2008]
correlation with user satisfaction in the task [Sanderson et al., SIGIR2010]
ranks, unconventional judgments, discounted gain… [Bennet et al., SIGIRForum2008][Järvelin et al., TOIS2002]
reliability?
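One common way to check whether two measures converge is to rank the same set of systems under each and correlate the rankings, e.g. with Kendall's τ. A minimal sketch, with made-up scores for five systems under a precision-oriented and a recall-oriented measure:

```python
from itertools import combinations

def kendall_tau(scores_x, scores_y):
    # Rank correlation over the same systems: +1 means identical orderings,
    # -1 means reversed. No tie handling, for brevity.
    pairs = list(combinations(range(len(scores_x)), 2))
    concordant = sum(1 for i, j in pairs
                     if (scores_x[i] - scores_x[j]) * (scores_y[i] - scores_y[j]) > 0)
    return (2 * concordant - len(pairs)) / len(pairs)

# Hypothetical scores for systems s1..s5 under two measures.
# If the measures converged, tau would be close to +1.
prec_scores = [0.82, 0.75, 0.64, 0.58, 0.41]
rec_scores  = [0.30, 0.48, 0.52, 0.61, 0.25]

print(kendall_tau(prec_scores, rec_scores))  # -0.2: the measures disagree
```

A τ near zero (or negative, as in this toy case) means the two measures rank systems almost independently, which is exactly the convergent-validity warning the slide raises.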
criterion validity
what?
are the results correlated with those of
other experiments already known to be valid?
how?
careful examination and confirmation of the
correlation between our results and previous ones
#fail
ask if the new drink is good
instead of better than the old one
criterion validity in IR
practical large-scale methodologies: pooling [Buckley et al., SIGIR2004]
judgments by non-experts [Bailey et al., SIGIR2008]
crowdsourcing for low-cost [Alonso et al., SIGIR2009][Carvalho et al., SIGIRForum2010]
estimate measures with fewer judgments [Yilmaz et al., CIKM2006][Yilmaz et al., SIGIR2008]
select what documents to judge, by informativeness [Carterette et al., SIGIR2006][Carterette et al., SIGIR2007]
use no relevance judgments at all [Soboroff et al., SIGIR2001]
less effort, but same results?
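A minimal sketch of pooling, with hypothetical runs over a toy collection: only documents that some system ranked near the top get judged, so judging effort shrinks, while everything outside the pool is silently treated as non-relevant.

```python
def pool(runs, depth):
    # Union of the top-`depth` documents from each system's ranked run:
    # only pooled documents are judged by assessors.
    pooled = set()
    for run in runs:
        pooled.update(run[:depth])
    return pooled

# Three hypothetical ranked runs (document IDs are made up).
runs = [
    ["d1", "d2", "d3", "d4"],
    ["d2", "d5", "d1", "d6"],
    ["d7", "d2", "d8", "d1"],
]
print(sorted(pool(runs, 2)))  # judge only 4 documents instead of all 8
```

The criterion-validity question is then whether measures computed on this cheaper, incomplete ground truth still rank systems the same way as full judgments would.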
internal validity
what?
can the conclusions be rigorously drawn
from the experiment alone
and not other overlooked factors?
how?
careful identification and control of possible confounding variables and selection of design
#fail
measure usefulness of Windows vs Linux vs iOS
only with Apple employees
internal validity in IR
inconsistency: performance depends on assessors [Voorhees, IP&M2000][Carterette et al., SIGIR2010]
incompleteness: performance depends on pools
system reinforcement [Zobel, SIGIR2008]
affects reliability of measures and overall results [Sakai, JIR2008][Buckley et al., SIGIR2007]
train-test: same characteristics in queries and docs
improvements on the same collections: overfitting [Voorhees, CLEF2002]
measures must be fair to all systems, especially for Machine Learning
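The incompleteness problem can be sketched as follows (document IDs and judgments are made up): a system that did not contribute to the pool retrieves an unjudged document, which a standard measure like Average Precision scores as non-relevant, so the newcomer is penalized by the evaluation setup rather than by its actual quality.

```python
def average_precision(ranked, judged_relevant):
    # Standard AP: unjudged documents are treated as non-relevant.
    # This assumption is where pool bias against non-contributors creeps in.
    hits, score = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in judged_relevant:
            hits += 1
            score += hits / i
    return score / len(judged_relevant)

judged_relevant = {"d1", "d2"}       # judgments from a pool the newcomer never fed
contributor = ["d1", "d2", "d3"]     # this run contributed to the pool
newcomer = ["d9", "d1", "d2"]        # d9 may well be relevant, but was never judged

print(average_precision(contributor, judged_relevant))             # 1.0
print(round(average_precision(newcomer, judged_relevant), 3))      # 0.583
```

Both runs place the two judged-relevant documents consecutively, yet the newcomer loses roughly 40% of its score just for ranking an unjudged document first.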
external validity
what?
can the results be generalized
to other populations and experimental settings?
how?
careful design and justification
of sampling and selection methods
#fail
study cancer treatment mostly with teenage males
external validity in IR
weakest point of IR Evaluation [Voorhees, CLEF2002]
large-scale is always incomplete [Zobel, SIGIR2008][Buckley et al., SIGIR2004]
test collections are themselves an evaluation result
but they become hardly reusable [Carterette et al., WSDM2010][Carterette et al., SIGIR2010]
external validity in IR
cross-collection comparisons are unjustified
performance highly depends on test collection characteristics [Bodoff et al., SIGIR2007][Voorhees, CLEF2002]
systems perform differently with different collections
interpretation of results must be in terms of pairwise comparisons, not absolute numbers [Voorhees, CLEF2002]
do not claim anything about state of the art
based on a handful of experiments
baselines can be used to compare across collections [Armstrong et al., CIKM2009]: meaningful, not random!
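The baseline idea can be sketched as follows (the MAP scores are hypothetical): instead of comparing absolute numbers across collections, compare each system's relative improvement over a shared, meaningful baseline run on the same collection.

```python
def relative_improvement(system_score, baseline_score):
    # Relative gain over a shared baseline on the same collection:
    # comparable across collections, unlike the absolute scores.
    return (system_score - baseline_score) / baseline_score

# Hypothetical MAP scores on two collections of very different difficulty.
# The absolute scores suggest the system got worse on collection B;
# relative to the baseline, it improves equally on both.
map_a_system, map_a_baseline = 0.36, 0.30
map_b_system, map_b_baseline = 0.18, 0.15

print(round(relative_improvement(map_a_system, map_a_baseline), 3))  # 0.2
print(round(relative_improvement(map_b_system, map_b_baseline), 3))  # 0.2
```

The comparison is only as meaningful as the baseline: a random baseline makes any system look like a large relative improvement, which is why the slide insists on meaningful, not random, baselines.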
conclusion validity
what?
are the conclusions justified based on the results?
how?
careful selection of the measuring instruments and
statistical methods used to draw grand conclusions
#fail
more access to the Internet in China than in the US
because of the larger total number of users
conclusion validity in IR
measures should be sensitive and stable [Buckley et al., SIGIR2000]
and also powerful [Voorhees et al., SIGIR2002][Sakai, IP&M2007]
with little effort [Sanderson et al., SIGIR2005]
always bearing in mind
the user model and the task
conclusion validity in IR
statistical methods to compare score distributions [Smucker et al., CIKM2007][Webber et al., CIKM2008]
correct interpretation of the statistics
hypothesis testing is troublesome
statistical significance ≠ practical significance
increasing #queries (sample size) increases power
to detect ever smaller differences (effect-size)
eventually, everything is statistically significant
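The power argument can be simulated (all numbers here are made up): fix a practically negligible per-query improvement, add realistic per-query noise, and watch the paired t statistic grow past the usual 1.96 threshold as the query set gets larger.

```python
import math
import random

def paired_t_statistic(diffs):
    # t = mean(d) / (sd(d) / sqrt(n)): for a fixed effect and noise level,
    # it grows like sqrt(n), so any nonzero difference eventually "wins".
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

random.seed(42)

def simulate(n_queries):
    # Hypothetical per-query score differences: a tiny real effect of
    # +0.005, with per-query noise of sd 0.05.
    diffs = [0.005 + random.gauss(0, 0.05) for _ in range(n_queries)]
    return paired_t_statistic(diffs)

for n in (25, 400, 10000):
    print(n, round(simulate(n), 2))  # |t| > 1.96 reads as "significant" at alpha=0.05
```

With thousands of queries the +0.005 difference is reliably "statistically significant", even though no user would ever notice it: significance detected, practical relevance absent.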
challenges
IR Research & Development Cycle
MIR evaluation practices
do not allow us to complete this cycle
loose definition of task, intent and user model
realistic data
collections are too small and/or biased
lack of realistic, controlled public collections
private, undescribed and unanalyzed collections emerge
can’t replicate results, often leading to wrong conclusions
standard formats and evaluation software to minimize bugs
undocumented measures, no accepted evaluation software
lack of baselines as lower bound (random is not a baseline!)
proper statistics and correct interpretation of statistics
raw musical material unknown
undocumented queries and/or documents
go back to private collections: overfitting!
collections can’t be reused
blind improvements
go back to private collections: overfitting!
collections
large, heterogeneous and controlled
not a hard endeavour, except for the damn copyright
Million Song Dataset!
still problematic (new features?, actual music)
standardize collections across tasks
better understanding and use of improvements
raw music data
essential for Learning and Improvement phases
use copyright-free data
Jamendo!
study possible biases
reconsider artificial material
evaluation model
let teams run their own algorithms (needs public collections)
relieves IMIRSEL and promotes wider participation
successfully used for 20 years in Text IR venues
adopted by MusiCLEF
only viable alternative in the long run
MIREX-DIY platforms still don’t allow full completion
of the IR Research & Development Cycle
organization
IMIRSEL plans, schedules and runs everything
add a 2nd tier of organizers, task-specific
logistics, planning, evaluation, troubleshooting…
format of large forums like TREC and CLEF
smooth the process and develop tasks that really
push the limits of the state of the art
overview papers
every year, by task organizers
detail the evaluation process, data, results
discussion to boost Interpretation and Learning
perfect wrap-up for team papers
rarely discuss results, and many are not even drafted
specific methodologies
MIR has unique methodologies and measures
meta-evaluate: analyze and improve
human effects on the evaluation
user satisfaction
standard evaluation software
bugs are inevitable
open evaluation software to everybody
gain reliability
speed up the development process
serve as documentation for newcomers
promote standardization of formats
baselines
help measure the overall progress of the field
standard formats + standard software +
public controlled collections + raw music +
task-specific organization
measure the state of the art
commitment
we need to acknowledge the current problems
MIREX should not only be a place to
evaluate and improve systems
but also a place to
meta-evaluate and improve how we evaluate
and a place to
design tasks that challenge researchers
analyze our evaluation methodologies
we all need to start
questioning evaluation practices
it’s worth it
it’s not that everything we do is wrong…
it’s that we don’t know it!