Proceedings of the SIGIR 2008 Workshop on Beyond Binary Relevance: Preferences, Diversity, and Set-Level Judgments

Preface

The SIGIR 2008 Workshop on Beyond Binary Relevance: Preferences, Diversity, and Set-Level Judgments, held in conjunction with the 31st Annual International ACM SIGIR Conference, is the first SIGIR workshop to explore research challenges at the intersection of novel measures of relevance, novel learning methods, and core evaluation issues. The goal of this workshop is to examine how the type of response elicited from assessors and users influences the evaluation and analysis of retrieval and filtering applications. For example, research suggests that asking people which of two results they prefer is faster and more reliable than asking them to make absolute judgments about the relevance of each result. Similarly, many researchers are using implicit measures, such as clicks, to evaluate systems. New methods like preference judgments or usage data require learning methods, evaluation measures, and collection procedures designed for them. This workshop tackles these and other related issues. The workshop has received sponsorship from ACM SIGIR.

Reflecting the positioning of this workshop at the intersection of several fields, the call for papers resulted in submissions from a wide variety of viewpoints, fewer than half of which were accepted for publication in these proceedings. All papers were thoroughly reviewed by the program committee and external reviewers. The accepted papers were divided into two sessions: "Relevance and Evaluation" and "Diversity". Additionally, we have organized two complementary sessions: one on "Position Statements and Early Work" to foster discussion on related research in early stages, and another on "Algorithmics" with invited speakers to address relevant learning challenges and approaches. The proceedings also include a paper documenting a dataset and system baselines that several of the organizing committee members collected to help promote research in this arena.

We would like to thank the SIGIR 2008 main committee for its support. We are particularly thankful for the workshop chairs' (Peter Anick and Hwee Tou Ng) feedback and quick responses to questions, and for the publication chair's (Ee-Peng Lim) template for workshop proceedings.

Finally, we would like to thank the authors, program committee members, external reviewers, and contributors for their efforts in supporting the workshop.

Paul N. Bennett, Ben Carterette, Olivier Chapelle, Thorsten Joachims
Organizing Committee


Table of Contents

Organization of SIGIR 2008 Workshop on Beyond Binary Relevance: Preferences, Diversity, and Set-Level Judgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

Sponsors & Supporters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4

Workshop Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5

Session 1: Relevance and Evaluation

• Invited Talk Abstract: Empirical justification of the discount function for nDCG . . . . . . . . . . 6
Evangelos Kanoulas, Javed A. Aslam (Northeastern University)

• Learning the Gain Values and Discount Factors of DCG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Ke Zhou (Dept. of Computer Science and Engineering, Shanghai Jiao-Tong University)
Hongyuan Zha (College of Computing, Georgia Institute of Technology)
Gui-Rong Xue, Yong Yu (Dept. of Computer Science and Engineering, Shanghai Jiao-Tong University)

Session 2: Diversity

• Creating a Test Collection to Evaluate Diversity in Image Retrieval . . . . . . . . . . . . . . . . . . . . 15
Thomas Arni, Jiayu Tang, Mark Sanderson, Paul Clough (Department of Information Studies, University of Sheffield)

• Nugget-based Framework to Support Graded Relevance Assessments . . . . . . . . . . . . . . . . . . 22
Maheedhar Kolla, Olga Vechtomova, Charles L.A. Clarke (University of Waterloo)

Session 3: Position Statements and Early Work

• Is Relevance the Right Criterion for Evaluating Interactive Information Retrieval? . . . . . . . 28
Nicholas J. Belkin, Ralf Bierig, Michael Cole (Information & Library Studies, Rutgers University)

• Beyond Relevance Judgments: Cognitive Shifts and Gratification . . . . . . . . . . . . . . . . . . . . . . 28
Amanda Spink, Frances Alvarado-Albertorio, Jia T. Du (Faculty of Information Technology, Queensland University of Technology)

• Set Based Retrieval: The Potemkin Buffet Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Soo-Yeon Hwang, Paul Kantor, Michael J. Pazzani (Rutgers University)

Session 4: Algorithmics

• Invited Talk Abstract: Learning Diverse Rankings by Minimizing Abandonment . . . . . . . . 29

Filip Radlinski (Department of Computer Science, Cornell University)


• Invited Talk Abstract: Clicks-vs-Judgments and Optimization . . . . . . . . . . . . . . . . . . . . . . . . 30
Nick Craswell (Microsoft Research Cambridge)

• A Test Collection of Preference Judgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Ben Carterette (Center for Intelligent Information Retrieval, University of Massachusetts Amherst)
Paul N. Bennett (Microsoft Research)
Olivier Chapelle (Yahoo! Research)

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35


SIGIR 2008 Workshop Organization
Beyond Binary Relevance: Preferences, Diversity, and Set-Level Judgments

Organizing Committee:
Paul N. Bennett (Chair & Contact), (Microsoft Research, USA)
Ben Carterette, (University of Massachusetts Amherst, USA)
Olivier Chapelle, (Yahoo! Research, USA)
Thorsten Joachims, (Cornell University, USA)

Program Committee:
Javed Aslam, (Northeastern University, USA)
Chris J.C. Burges, (Microsoft Research, USA)
Jaime Carbonell, (Carnegie Mellon, USA)
Fernando Diaz, (Yahoo! Applied Research, Canada)
Djoerd Hiemstra, (University of Twente, Netherlands)
Paul Kantor, (Rutgers University, USA)
Oren Kurland, (Technion – Israel Institute of Technology, Israel)
Guy Lebanon, (Purdue University, USA)
Don Metzler, (Yahoo! Research, USA)
Filip Radlinski, (Cornell University, USA)
Mark Sanderson, (University of Sheffield, UK)
Ian Soboroff, (NIST, USA)
Amanda Spink, (Queensland University of Technology, Australia)
Ellen Voorhees, (NIST, USA)
Weng-Keen Wong, (Oregon State University, USA)
Emine Yilmaz, (Microsoft Research, UK)
Yisong Yue, (Cornell University, USA)
Hugo Zaragoza, (Yahoo! Research Barcelona, Spain)

External Reviewers: Azzah Al-Maskari (University of Sheffield, UK)

Sponsors:


Program for SIGIR 2008 Workshop Beyond Binary Relevance: Preferences, Diversity, and Set-Level Judgments

Workshop Begins: 8:30 a.m.

Relevance & Evaluation Session: 8:30 – 10 a.m.
• (45 min) Invited Talk: Empirical justification of the discount function for nDCG (Javed A. Aslam)
• (20 min) Learning the Gain Values and Discount Factors of DCG (Ke Zhou, Hongyuan Zha, Gui-Rong Xue, Yong Yu)
• (25 min) Discussion

Morning Tea: 10 – 10:30 a.m.

Diversity Session: 10:30 a.m. – 12:30 p.m.
• (45 min) Invited Talk: Title TBA (David Hawking)
• (20 min) Creating a Test Collection to Evaluate Diversity in Image Retrieval (Thomas Arni, Jiayu Tang, Mark Sanderson, Paul Clough)
• (20 min) Nugget-based Framework to Support Graded Relevance Assessments (Maheedhar Kolla, Olga Vechtomova, Charles L.A. Clarke)
• (35 min) Discussion

Lunch Break: 12:30 – 1:30 p.m.

Position Statements and Early Work Session: 1:30 – 2:45 p.m.
• (25 min) Is Relevance the Right Criterion for Evaluating Interactive Information Retrieval? (Nicholas J. Belkin, Ralf Bierig, Michael Cole)
• (15 min) Set Based Retrieval: The Potemkin Buffet Model (Soo-Yeon Hwang, Paul Kantor, Michael J. Pazzani)
• (15 min) Beyond Relevance Judgments: Cognitive Shifts and Gratification (Amanda Spink, Frances Alvarado-Albertorio, Jia T. Du)
• (20 min) Discussion

Algorithmic Session: 2:45 – 3:30 p.m.
• (45 min) Invited Talk: Learning Diverse Rankings by Minimizing Abandonment (Filip Radlinski)

Afternoon Tea: 3:30 – 4 p.m.

Algorithmic Session, continued: 4:00 – 5:10 p.m.
• (45 min) Invited Talk: Clicks-vs-Judgments and Optimization (Nick Craswell)
• (25 min) Discussion

Wrap-Up, Final Words, and Discussion: 5:10 – 6 p.m.
• (10 min) A Test Collection of Preference Judgments (Ben Carterette, Paul Bennett, Olivier Chapelle)
• (40 min) Wrap-Up Discussion

Workshop Ends: 6 p.m.


Empirical justification of the discount function for nDCG

Evangelos Kanoulas
Javed A. Aslam
College of Computer and Information Science, Northeastern University
360 Huntington Ave, #202 WVH, Boston, MA 02115

ABSTRACT

Jarvelin and Kekalainen proposed nDCG as a measure of system effectiveness that utilizes graded relevance. One of the major criticisms nDCG has received is that the discount function used to weight the importance of Cumulative Gain at different ranks is arbitrary. The original paper introducing nDCG proposed 1/log_b(rank) as a discount function; 1/rank has also been discussed as an option. Other researchers in the IR community employ 1/log_2(rank+1) as the discount function, which nowadays seems to dominate the literature. Nevertheless, the proposed discount functions are ad hoc, lacking intuition, theoretical foundations, or empirical justification.

In this talk we will discuss how to empirically derive a discount function that optimizes the self-convergence of the nDCG measure: that is, a discount function that minimizes the number of queries needed so that the variability in the observed mean nDCG values reflects the difference in the effectiveness of the systems as it would be measured by nDCG if the universe of all possible queries were employed. In a sense, this also leads to the maximum discriminative power of nDCG. First, we will describe the univariate component analysis framework utilized to find the optimal discount function. Then we will compare the optimal discount function to the different discount functions that have appeared in the literature with respect to self-convergence and ranking of systems.
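As a concrete illustration of the competing discount functions mentioned above (our own sketch, not part of the talk), the following Python snippet computes nDCG for one toy graded ranking under each of the three discounts; the convention of applying no discount at rank 1 for the 1/log_b(rank) variant is our assumption.

```python
import numpy as np

def ndcg(gains, discount):
    """nDCG of a fully ranked list of graded gains under a given discount function."""
    gains = np.asarray(gains, dtype=float)
    ranks = np.arange(1, len(gains) + 1)
    dcg = np.sum(gains * discount(ranks))
    idcg = np.sum(np.sort(gains)[::-1] * discount(ranks))  # ideal (re-sorted) ordering
    return dcg / idcg if idcg > 0 else 0.0

# The discount functions discussed in the abstract (with b = 2 for the first one).
discounts = {
    "1/log_b(rank)":  lambda r: 1.0 / np.maximum(np.log2(r), 1.0),  # no discount at rank 1
    "1/rank":         lambda r: 1.0 / r,
    "1/log2(rank+1)": lambda r: 1.0 / np.log2(r + 1),
}

toy_run = [3, 2, 3, 0, 1, 2]  # hypothetical graded relevance of one ranked list
for name, d in discounts.items():
    print(f"{name:>15}: {ndcg(toy_run, d):.3f}")
```

Even on this toy list, the three discounts produce different nDCG values, which is exactly the arbitrariness the talk addresses.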

Copyright is held by the author/owner(s).
SIGIR 2008 Workshop: Beyond Binary Relevance: Preferences, Diversity and Set-Level Judgments, July 24th, 2008, Singapore.
ACM 978-1-60558-164-4/08/07


Learning the Gain Values and Discount Factors of DCG

Ke Zhou
Dept. of Computer Science and Engineering, Shanghai Jiao-Tong University
No. 800, Dongchuan Road, Shanghai, China 200240

Hongyuan Zha
College of Computing, Georgia Institute of Technology
Atlanta, GA

Gui-Rong Xue, Yong Yu
Dept. of Computer Science and Engineering, Shanghai Jiao-Tong University
No. 800, Dongchuan Road, Shanghai, China 200240
{grxue,yyu}@apex.sjtu.edu.cn

ABSTRACT

Evaluation metrics are an essential part of a ranking system, and in the past many evaluation metrics have been proposed in information retrieval and Web search. Discounted Cumulated Gains (DCG) has emerged as one of the evaluation metrics widely adopted for evaluating the performance of ranking functions used in Web search. However, the two sets of parameters, gain values and discount factors, used in DCG are determined in a rather ad-hoc way. In this paper we first show that DCG is generally not coherent, meaning that comparing the performance of ranking functions using DCG very much depends on the particular gain values and discount factors used. We then propose a novel methodology that can learn the gain values and discount factors from user preferences over rankings. Numerical simulations illustrate the effectiveness of our proposed methods.

1. INTRODUCTION

Discounted Cumulated Gains (DCG) is a popular evaluation metric for comparing the performance of ranking functions [4]. It can deal with multi-grade judgments, and it explicitly incorporates the position information of the documents in the result sets through the use of discount factors. However, in the past, the selection of the two sets of parameters used in DCG, gain values and discount factors, has been rather arbitrary, and several different sets of values have been used. This is a rather unsatisfactory situation considering the popularity of DCG. In this paper, we address the following two important issues of DCG:

1. Does the parameter set matter? I.e., do different parameter sets give rise to different preferences over the ranking functions?

2. If the answer to the above question is yes, is there a principled way to select the set of parameters?

The answer to the first question is yes whenever more than two grades are used in the evaluation. This is generally the case for Web search, where multiple grades are used to indicate the degree of relevance of documents with respect to a query. We then propose a principled approach for learning the set of parameters from preferences over different rankings of the documents. As will be shown, the resulting optimization problem for learning the parameters can be solved using quadratic programming, very much like what is done in support vector machines for classification. We performed several numerical simulations that illustrate the feasibility and effectiveness of the proposed methodology. We want to emphasize that the experimental results are preliminary and limited in scope because of the use of simulation data; experiments using real-world search engine data are being considered.

2. RELATED WORK

Cumulated gain based measures such as DCG [4] have been applied to evaluate information retrieval systems. Despite their popularity, to the best of our knowledge little research has focused on analyzing the coherence of these measures. The study of [7] shows that different gain values for DCG can give rise to different judgements of ranking lists. In this study, we first show that DCG is incoherent and then propose a principled method to learn the DCG parameters as a linear utility function.

Learning to rank has attracted a lot of research interest in recent years. Several methods have been developed to learn ranking functions by directly optimizing performance metrics such as MAP and DCG [5, 8, 9]. These studies focus on learning a good ranking function with respect to given performance metrics, while the goal of this paper is to analyze the coherence of DCG and propose a learning method to determine the parameters of DCG.

As we discuss in Section 5, DCG can be viewed as a linear utility function. Therefore, the problem of learning DCG is closely related to the problem of learning a utility function. Learning utility functions is studied under the name of conjoint analysis in the marketing science community [1, 6]. The goal of conjoint analysis is to model users' preferences over products and infer the features that satisfy the demands of users. Several methods have been proposed to solve this problem [2, 3].

3. DISCOUNTED CUMULATED GAINS

We first introduce some notation used in this paper. We are interested in ranking N documents X = {x_1, ..., x_N}. We assume that we have a finite ordinal label (grade) set L = {ℓ_1, ..., ℓ_L}, and that ℓ_i is preferred over ℓ_{i+1}, i = 1, ..., L − 1. In Web search, for example, we can have L = {Perfect, Excellent, Good, Fair, Bad}, i.e., L = 5. A ranking of X is a permutation π = (π(1), ..., π(N)) of (1, ..., N), i.e., the rank of x_{π(i)} under the ranking π is i.

With each label ℓ_i is associated a gain value g_i ≡ g(ℓ_i), and g_i, i = 1, ..., L, constitute the set of gain values associated with L. The DCG for π with the associated labels is computed as

$$\mathrm{DCG}_{g,K}(\pi) = \sum_{i=1}^{K} c_i \, g_{\pi(i)}, \qquad K = 1, \ldots, N,$$

where c_1 > c_2 > ... > c_K > 0 are the so-called discount factors [4].
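For concreteness, a minimal sketch of this definition in Python follows; the specific discount values are our assumption (the definition only requires c_1 > c_2 > ... > 0).

```python
import numpy as np

def dcg_at_k(gains_in_rank_order, discounts, K):
    """DCG_{g,K}: discounted sum of the gains of the top-K ranked documents.
    gains_in_rank_order[i] is the gain of the document placed at rank i+1."""
    g = np.asarray(gains_in_rank_order, dtype=float)[:K]
    c = np.asarray(discounts, dtype=float)[:K]
    return float(np.sum(c * g))

# Example with decreasing discounts c_i = 1/log2(i+1), an assumed choice.
c = 1.0 / np.log2(np.arange(2, 12))
print(dcg_at_k([3, 2, 3, 0, 1], c, K=5))
```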

The gain vector g = [g_1, ..., g_L] is said to be compatible if g_1 > g_2 > ... > g_L. If two gain vectors g^A and g^B are both compatible, then we say they are compatible with each other. In this case, there is a transformation φ such that

$$\varphi(g^A_i) = g^B_i, \qquad i = 1, \ldots, L,$$

and the transformation φ is order preserving, i.e., φ(g_i) > φ(g_j) if g_i > g_j.

4. INCOHERENCY OF DCG

Now assume there are two rankers A and B using DCG with gain vectors g^A and g^B, respectively. We want to investigate how coherent A and B are in evaluating different rankings.

4.1 Good News

First the good news: if A and B are compatible, then A and B agree on which set of rankings is optimal, i.e., which set of rankings has the highest DCG. We first state the following well-known result.

Proposition 1. Let a_1 ≥ ... ≥ a_N and b_1 ≥ ... ≥ b_N. Then

$$\sum_{i=1}^{N} a_i b_i = \max_{\pi} \sum_{i=1}^{N} a_i b_{\pi(i)}.$$

It follows from the above result that any ranking π such that

$$g_{\pi(1)} \ge g_{\pi(2)} \ge \cdots \ge g_{\pi(K)}$$

achieves the highest DCG_{g,K}, as long as the gain vector g is compatible.

How about those rankings that have smaller DCGs? We say two compatible rankers A and B are coherent if they score any two rankings coherently, i.e., for rankings π_1 and π_2,

$$\mathrm{DCG}_{g^A,K}(\pi_1) \ge \mathrm{DCG}_{g^A,K}(\pi_2)
\quad\text{if and only if}\quad
\mathrm{DCG}_{g^B,K}(\pi_1) \ge \mathrm{DCG}_{g^B,K}(\pi_2),$$

i.e., ranker A thinks π_1 is better than π_2 if and only if ranker B thinks π_1 is better than π_2. Now the question is whether compatibility implies coherency. We have the following result.

Theorem. If L = 2, then compatibility implies coherency.

Proof. Fix K > 1, and let

$$c = \sum_{i=1}^{K} c_i.$$

When there are only two labels, let the corresponding gains be g_1, g_2. For a ranking π, define

$$c_1(\pi) = \sum_{\pi(i)=\ell_1} c_i, \qquad c_2(\pi) = c - c_1(\pi).$$

Then

$$\mathrm{DCG}_{g,K}(\pi) = c_1(\pi)\, g_1 + c_2(\pi)\, g_2.$$

For any two rankings π_1 and π_2,

$$\mathrm{DCG}_{g^A,K}(\pi_1) \ge \mathrm{DCG}_{g^A,K}(\pi_2)$$

implies that

$$c_1(\pi_1)\, g^A_1 + c_2(\pi_1)\, g^A_2 \ge c_1(\pi_2)\, g^A_1 + c_2(\pi_2)\, g^A_2,$$

which gives

$$(c_1(\pi_1) - c_1(\pi_2))(g^A_1 - g^A_2) \ge 0.$$

Since A and B are compatible, the above implies that

$$(c_1(\pi_1) - c_1(\pi_2))(g^B_1 - g^B_2) \ge 0.$$

Therefore DCG_{g^B,K}(π_1) ≥ DCG_{g^B,K}(π_2). The proof is completed by exchanging A and B in the above argument.

4.2 Bad News

Not too surprisingly, compatibility does not imply coherency when L > 2. We now present an example.

Example. Let X = {x_1, x_2, x_3}, i.e., N = 3. We consider DCG_{g,K} with K = 2. Assume the labels of x_1, x_2, x_3 are ℓ_2, ℓ_1, ℓ_3, and for ranker A the corresponding gains are 2, 3, 1/2. The optimal ranking is (2, 1, 3). Consider the following two rankings,

$$\pi_1 = (1, 3, 2), \qquad \pi_2 = (3, 2, 1).$$

Neither of them is optimal. Let the discount factors be

$$c_1 = 1 + \varepsilon, \quad c_2 = 1 - \varepsilon, \quad 1/4 < \varepsilon < 1.$$

It is easy to check that

$$\mathrm{DCG}_{g^A,2}(\pi_1) = 2 c_1 + (1/2) c_2 > (1/2) c_1 + 3 c_2 = \mathrm{DCG}_{g^A,2}(\pi_2).$$

Now let g^B = φ(g^A), where φ(t) = t^k; φ is certainly order preserving, i.e., A and B are compatible. However, it is easy to see that for k large enough we have

$$2^k c_1 + (1/2)^k c_2 < (1/2)^k c_1 + 3^k c_2,$$

which is the same as

$$\mathrm{DCG}_{g^B,2}(\pi_1) < \mathrm{DCG}_{g^B,2}(\pi_2).$$

Therefore, A thinks π_1 is better than π_2 while B thinks π_2 is better than π_1, even though A and B are compatible. This implies that A and B are not coherent.
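A quick numerical check of this example (our own, taking ε = 1/2 and k = 10): ranker A prefers π_1 while ranker B prefers π_2, even though B's gains are an order-preserving transform of A's.

```python
eps, k = 0.5, 10
c1, c2 = 1 + eps, 1 - eps

def dcg2(gain_at_rank1, gain_at_rank2):
    """DCG at K = 2 for the gains placed in the top two positions."""
    return c1 * gain_at_rank1 + c2 * gain_at_rank2

# pi_1 puts gains (2, 1/2) in the top two positions; pi_2 puts (1/2, 3).
print(dcg2(2, 0.5) > dcg2(0.5, 3))              # True:  A prefers pi_1
print(dcg2(2**k, 0.5**k) > dcg2(0.5**k, 3**k))  # False: B prefers pi_2
```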

4.3 Remarks

When we have more than two labels, which is the case for Web search, using DCG_K with K > 1 to compare the DCGs of various ranking functions will very much depend on the gain vectors used. Different gain vectors can lead to completely different conclusions about the performance of the ranking functions.

The current choice of gain vectors for Web search is rather ad hoc, and there is no criterion to judge which set of gain vectors is reasonable or natural.

5. LEARNING GAIN VALUES AND DISCOUNT FACTORS

DCG can be considered as a simple form of linear utility function. In this section, we discuss a method to learn the gain values and discount factors that constitute this utility function.

5.1 A Binary Representation

We consider a fixed K, and we use a binary vector s(π) of dimension K × L to represent a ranking π considered for DCG_{g,K}. Here L is the number of levels of the labels. In particular, the first L components of s correspond to the first position of the K-position ranking in question, the second L components to the second position, and so on. Within each block of L components, the i-th component is 1 if and only if the item in that position has label ℓ_i, i = 1, ..., L.

Example. In the Web search case, L = 5. Suppose we consider DCG_{g,3}, and for a particular ranking π the labels of the first three documents are

Perfect, Bad, Good.

Then the corresponding 15-dimensional binary vector s(π) is

[1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0].
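A small helper (a sketch under the label ordering given in Section 3) that builds this K × L indicator vector and reproduces the 15-dimensional example above:

```python
import numpy as np

LABELS = ["Perfect", "Excellent", "Good", "Fair", "Bad"]  # l_1, ..., l_L

def s_of_pi(top_labels, labels=LABELS):
    """K*L binary vector: one L-sized block per rank position, with a 1 marking the label."""
    L = len(labels)
    s = np.zeros(len(top_labels) * L, dtype=int)
    for pos, label in enumerate(top_labels):
        s[pos * L + labels.index(label)] = 1
    return s

print(s_of_pi(["Perfect", "Bad", "Good"]))
# -> [1 0 0 0 0 0 0 0 0 1 0 0 1 0 0]
```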

We postulate a utility function u(s) = w^T s, which is a linear function of s, where w is the weight vector, and we write

$$w = [w_{1,1}, \ldots, w_{1,L}, w_{2,1}, \ldots, w_{2,L}, \ldots, w_{K,1}, \ldots, w_{K,L}].$$

We distinguish two cases.

Case 1. The gain values are position independent. This corresponds to the case

$$w_{i,j} = c_i\, g_j, \qquad i = 1, \ldots, K, \; j = 1, \ldots, L.$$

That is, c_i, i = 1, ..., K, are the discount factors, and g_j, j = 1, ..., L, are the gain values. It is easy to see that

$$w^T s(\pi) = \mathrm{DCG}_{g,K}(\pi).$$

Case 2. In this framework, we can also consider the more general case in which the gain values are position dependent. Then w_{1,1}, ..., w_{1,L} are simply the products of the discount factor c_1 and the position-dependent gain values for position one, and so on. In this case, there is no need to separate the gain values and the discount factors: the weights in the weight vector w are what we need.

5.2 Learning w

We assume that we have available a partial set of preferences over the set of all rankings. For example, we can present a pair of rankings π_1 and π_2 to a user; if the user prefers π_1 over π_2, denoted by π_1 ≻ π_2, this translates into w^T s(π_1) ≥ w^T s(π_2). Let the set of preferences be

$$\pi_i \succ \pi_j, \qquad (i, j) \in S.$$

In the second case described above, we can formulate the problem as learning the weight vector w subject to a set of constraints (similar to rank SVM):

$$\min_{w,\, \xi_{ij}} \; w^T w + C \sum_{(i,j)\in S} \xi_{ij}^2 \qquad (1)$$

subject to

$$w^T (s(\pi_i) - s(\pi_j)) \ge 1 - \xi_{ij}, \quad \xi_{ij} \ge 0, \quad (i, j) \in S,$$
$$w_{k,l} \ge w_{k,l+1}, \quad k = 1, \ldots, K, \; l = 1, \ldots, L - 1.$$

For the first case, we can compute w as in Case 2 and then find c_i and g_j to fit w. It is also possible to carry out hypothesis testing to see whether the gain values are position dependent or not.
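A sketch of how the QP in Equation (1) could be solved with an off-the-shelf convex solver; cvxpy is our choice, not the paper's, and `pairs` is assumed to hold (s(π_i), s(π_j)) vectors with π_i the preferred ranking.

```python
import numpy as np
import cvxpy as cp

def learn_w(pairs, K, L, C=1.0):
    """Solve the soft-margin QP of Equation (1) with the constraints w_kl >= w_k,l+1."""
    w = cp.Variable(K * L)
    xi = cp.Variable(len(pairs), nonneg=True)             # slack variables, xi_ij >= 0
    cons = [w @ (s_i - s_j) >= 1 - xi[t]                   # preference (margin) constraints
            for t, (s_i, s_j) in enumerate(pairs)]
    cons += [w[k * L + l] >= w[k * L + l + 1]              # ordered weights within each position block
             for k in range(K) for l in range(L - 1)]
    prob = cp.Problem(cp.Minimize(cp.sum_squares(w) + C * cp.sum_squares(xi)), cons)
    prob.solve()
    return w.value
```

Any quadratic programming package would do equally well; the squared slacks keep the objective quadratic, exactly as in the formulation above.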

6. SIMULATION

In this section, we report the results of numerical simulations to show the feasibility and effectiveness of the method proposed in Equation (1).

6.1 Experimental Settings

We use a ground-truth w to obtain preferences over ranking lists. Our goal is to investigate whether we can reconstruct w by learning from these preferences. The ground-truth w is generated according to the following equation:

$$w_{kl} = \frac{G_l}{\log(k + 1)} \qquad (2)$$

For a comprehensive comparison, we distinguish two settings of G_l. Specifically, we set G_l = l in the first setting (Data 1) and G_l = 2^l − 1 in the second setting (Data 2).

The ranking lists are obtained by randomly permuting a ground-truth ranking list. For example, ranking lists can be generated by randomly permuting the list [5, 5, 4, 4, 3, 3, 2, 2, 1, 1]. We randomly generate different numbers of pairs of ranking lists and use the ground truth to judge which ranking list is preferred. Specifically, if w^T s(π_1) > w^T s(π_2), we have a preference pair π_1 ≻ π_2; otherwise we have a preference pair π_2 ≻ π_1.
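The following sketch mirrors this setup for the Data 1 setting; the 1-based indexing of k and the grade-to-column mapping are our assumptions.

```python
import numpy as np

K, L = 10, 5
rng = np.random.default_rng(0)

# Ground truth per Equation (2), w_kl = G_l / log(k+1), with G_l = l ("Data 1").
w_true = np.array([[l / np.log(k + 1) for l in range(1, L + 1)]
                   for k in range(1, K + 1)])             # shape (K, L)

def utility(w, grades):
    """Linear utility of a ranking list given as K grades in {1, ..., L} (L = best)."""
    return sum(w[pos, g - 1] for pos, g in enumerate(grades))

base = [5, 5, 4, 4, 3, 3, 2, 2, 1, 1]

def random_list():
    return [int(g) for g in rng.permutation(base)]

# Preference pairs judged by the ground truth, as described above.
pairs = []
for _ in range(200):
    a, b = random_list(), random_list()
    pairs.append((a, b) if utility(w_true, a) > utility(w_true, b) else (b, a))
```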

6.2 Evaluation Measures

Given the estimated ŵ and the ground-truth w, we apply two measures to evaluate the quality of the estimate.

The first measure is the precision on a test set. A number of pairs of ranking lists are generated as the test set. We apply ŵ to predict the preferences over the test set. The precision of ŵ is then calculated as the proportion of correctly predicted preferences in the test set.

The second measure is the similarity of ŵ and w. Given the true value of w and the estimate ŵ obtained from the above optimization problem, define the transformation

$$T(w) = (w_{11} - w_{1L}, \ldots, w_{1L} - w_{1L}, \ldots, w_{K1} - w_{KL}, \ldots, w_{KL} - w_{KL}). \qquad (3)$$

We can observe that the transformation T preserves the order between ranking lists, i.e., T(w)^T s(π_1) > T(w)^T s(π_2) iff w^T s(π_1) > w^T s(π_2). The similarity between ŵ and w is measured by

$$\mathrm{sim}(\hat{w}, w) = \frac{T(\hat{w})^T T(w)}{\|T(\hat{w})\|\,\|T(w)\|}.$$
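A direct implementation of this similarity measure (a sketch; w is stored as the K*L vector of Section 5.1):

```python
import numpy as np

def similarity(w_est, w_true, K, L):
    """Cosine similarity after the offset-removing transformation T of Equation (3)."""
    def T(w):
        W = np.asarray(w, dtype=float).reshape(K, L)
        return (W - W[:, [L - 1]]).ravel()     # subtract w_kL within each position block
    a, b = T(w_est), T(w_true)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```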

6.3 General Performance

We randomly sample a number of ranking lists and generate preference pairs according to a ground-truth w. The number of preference pairs in the training set ranges from 20 to 200. We plot the precision and similarity of the estimated ŵ with respect to the number of training pairs in Figure 1 and Figure 2. It can be observed from Figures 1 and 2 that the performance generally grows with an increasing number of training pairs, indicating that preferences over ranking lists can be utilized to improve the estimation of the utility function w. Another observation is that when about 200 preference pairs are included in the training set, the precision on the test set becomes close to 95% under both settings. This suggests that we can estimate w precisely from preferences over ranking lists. We also notice that the similarity and precision sometimes give different conclusions about the relative performance on Data 1 and Data 2. We think this is because the similarity measure is sensitive to the choice of the offset constants; for example, large offset constants will give similarity very close to 1. Currently, we use w_{1L}, ..., w_{KL} as the offset constants, as in Equation (3). In general, we regard precision as the more meaningful evaluation metric and report similarity as a complement to precision.

After the utility function ŵ is obtained, we can reconstruct the gain vector and the discount factors from it. To this end, we rewrite ŵ as a matrix W of size L × K. Assume that the singular value decomposition of the matrix W can be expressed as W = U diag(σ_1, ..., σ_n) V^T, where σ_1 ≥ ... ≥ σ_n. Then the rank-1 approximation of W is Ŵ = σ_1 u_1 v_1^T. In this case, the first left singular vector u_1 is the estimate of the gain vector and the first right singular vector v_1 is the estimate of the discount factors. We plot the estimated gain vector and discount factors against their true values in Figure 4 and Figure 3, respectively. Note that perfect estimates would give straight lines in these two figures. We can see that the discount factors for the top-ranked positions are closer to a straight line and thus are estimated more accurately. This is because the discount factors of top-ranked positions have a greater impact on the preferences between ranking lists, and are therefore captured more precisely by the constraints. A similar phenomenon can be observed for the gain vector.
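A sketch of this rank-1 recovery step with NumPy; how the singular value is split between the two factors and the sign fix are our choices, since only the shapes of the recovered vectors, up to scale, matter for the comparison.

```python
import numpy as np

def split_gains_discounts(w, K, L):
    """Recover (up to scale) the gain vector and discount factors from the learned weights."""
    W = np.asarray(w, dtype=float).reshape(K, L).T        # L x K matrix, as in the text
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    gains, discounts = U[:, 0] * np.sqrt(S[0]), Vt[0] * np.sqrt(S[0])
    if gains.sum() < 0:                                   # resolve the SVD sign ambiguity
        gains, discounts = -gains, -discounts
    return gains, discounts
```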

[Figure 1: Precision over the test set with respect to the number of training pairs (Data 1, Data 2).]

[Figure 2: Similarity between the estimated and ground-truth w with respect to the number of training pairs (Data 1, Data 2).]

[Figure 3: Estimated discount factors plotted against the true discount factors.]

[Figure 4: Estimated gain vector plotted against the true gain vector.]

6.4 Noisy Settings

In real-world scenarios, the preference pairs over ranking lists can be noisy. Therefore, it is interesting to investigate the effect of noisy pairs on performance. To this end, we fix the number of training pairs at 200 and create noisy pairs by randomly flipping a number of pairs in the training set. In our experiments, the number of noisy pairs ranges from 5 to 40. Since the trade-off value C is important to performance in the noisy setting, we select the value of C that shows the best performance on an independent validation set. We report the performance with respect to the number of noisy pairs in Figure 5 and Figure 6. We can observe that the performance decreases as the number of noisy pairs grows.

In addition to noisy preference pairs, we also consider noise in the grades of documents. In this case, we randomly modify the grades of a number of documents to introduce noise into the training set. The estimated w is then used to predict preferences on a test set. The precision with respect to the number of noisy documents is shown in Figure 7. It can be observed that the performance decreases as the number of noisy documents grows.

[Figure 5: Precision over the test set with respect to the number of noisy pairs (Data 1, Data 2).]

[Figure 6: Similarity between the estimated and ground-truth w with respect to the number of noisy pairs (Data 1, Data 2).]

[Figure 7: Precision with respect to the number of noisy grades (Data 1, Data 2).]

[Figure 8: Similarity with respect to the number of noisy grades (Data 1, Data 2).]

6.5 Optimal Rankings

We can further restrict the preference pairs by involving the optimal ranking in each pair of training data. For example, we set one ranking list of each preference pair to be the optimal ranking [5, 5, 4, 4, 3, 3, 2, 2, 1, 1]. In this case, if the other ranking list is generated by permuting the same list, Proposition 1 implies that any compatible gain vector will agree that the optimal ranking is preferred to the other ranking lists. In other words, such preference pairs do not impose any constraints on the utility function w, so the corresponding constraints are not effective in determining w. Consequently, the performance does not increase as the number of training pairs grows, as shown in Figure 9.

If the ranking lists contain different sets of grades, a fraction of the constraints can be effective. The performance then grows slowly as the number of training pairs increases, as reported in Figure 10. By comparing Figures 1 and 2 with Figure 10, we can observe that when the type of preference is restricted, the learning algorithm requires more pairs to obtain comparable performance.

We conclude from this observation that some pairs are more effective than others for determining w. Thus, if we can design an algorithm to select these pairs, the number of pairs required for training can be greatly reduced. How to design algorithms to select effective preference pairs for learning DCG will be addressed as a future research topic.

[Figure 9: Performance (precision and similarity) when training pairs are generated by permuting the same list.]

[Figure 10: Performance (precision and similarity) when training pairs are generated from different lists.]

7. AN ENHANCED MODEL

The objective function defined in Equation (1) does not consider the degree of difference between ranking lists. For example, it treats the preference pairs [5, 5, 4, 2, 1] ≻ [5, 4, 5, 2, 1] and [5, 5, 4, 2, 1] ≻ [2, 1, 5, 5, 4] in the same way, although they differ greatly in DCG. In order to overcome this problem, we propose an enhanced model that takes the degree of difference between ranking lists into consideration:

$$\min_{w,\, \xi_{ij}} \; w^T w + C \sum_{(i,j)\in S} \xi_{ij}^2 \qquad (4)$$

subject to:

$$w^T (s(\pi_i) - s(\pi_j)) \ge \mathrm{Dist}(\pi_i, \pi_j) - \xi_{ij}, \quad (i, j) \in S,$$
$$\xi_{ij} \ge 0, \quad (i, j) \in S,$$
$$w_{k,l} \ge w_{k,l+1}, \quad k = 1, \ldots, K, \; l = 1, \ldots, L - 1,$$

where Dist() is a distance measure for a pair of permutations π_1 and π_2. In principle, we would prefer Dist(π_i, π_j) to be a good approximation of DCG(π_i) − DCG(π_j). However, since we do not actually know the ground-truth w in practice, it is generally difficult to obtain a precise approximation. In our simulation, we apply the Hamming distance as the distance measure:

$$\mathrm{Ham}(\pi_1, \pi_2) = \sum_{k} \mathbf{1}[\pi_1(k) \ne \pi_2(k)] \qquad (5)$$

We performed simulations to evaluate the enhanced model. From Figure 12 and Figure 11, we can observe that the performance improvement of the enhanced model is not very significant. We suspect this is because the Hamming distance is not a precise approximation of the DCG difference. We plan to investigate this problem in our future study.
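For completeness, the distance used in the enhanced model above, applied here to ranking lists represented by their grade sequences (an assumption consistent with the simulation setup):

```python
def hamming(pi1, pi2):
    """Hamming distance of Equation (5): number of positions where the two lists differ."""
    return sum(1 for a, b in zip(pi1, pi2) if a != b)

print(hamming([5, 5, 4, 2, 1], [5, 4, 5, 2, 1]))  # 2, the "near" pair from the text
print(hamming([5, 5, 4, 2, 1], [2, 1, 5, 5, 4]))  # 5, the "far" pair from the text
```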

8. LEARNING WITHOUT DOCUMENT GRADES

When the grades of the documents are not available, we can also fit a model to predict the preference between two ranking lists. To this end, we use a K × K-dimensional binary vector s(π) to represent the ranking π. The first K components of s(π) correspond to the first position of the K-position ranking, the second K components to the second position, and so on.

For example, for the ranking list π

d3, d1, d4, d2, d5

the corresponding binary vector is

[0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1].
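A small helper (a sketch with a hypothetical document list) that builds this K × K position indicator vector and reproduces the 25-dimensional example above:

```python
import numpy as np

def s_positional(ranking, docs):
    """K*K binary vector: block k marks which document occupies rank position k."""
    K = len(ranking)
    s = np.zeros(K * K, dtype=int)
    for pos, doc in enumerate(ranking):
        s[pos * K + docs.index(doc)] = 1
    return s

docs = ["d1", "d2", "d3", "d4", "d5"]
print(s_positional(["d3", "d1", "d4", "d2", "d5"], docs))
# -> [0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1]
```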

Given a set of preferences over ranking lists, we can obtain w by solving the following optimization problem:

$$\min_{w,\, \xi_{ij}} \; w^T w + C \sum_{(i,j)\in S} \xi_{ij}^2 \qquad (6)$$

subject to:

$$w^T (s(\pi_i) - s(\pi_j)) \ge 1 - \xi_{ij}, \quad (i, j) \in S, \qquad (7)$$
$$\xi_{ij} \ge 0, \quad (i, j) \in S. \qquad (8)$$

In this case, the constraints w_{k,l} ≥ w_{k,l+1} are not included in the optimization problem, since we do not have any prior knowledge about the grades of the documents. The precision on the test set with respect to the number of training pairs is reported in Figure 13. We can observe that w can be learned precisely even without the grades of the documents. In this case, the learned utility function w can be interpreted as the relevance judgements for the documents.

9. CONCLUSIONS AND FUTURE WORK

In this paper, we investigate the coherence of DCG, which is an important performance measure in information retrieval. Our analysis shows that DCG is incoherent in general, i.e., different gain vectors can lead to different judgements about the performance of a ranking function. It is therefore vital to select reasonable parameters for DCG in order to obtain meaningful comparisons of ranking functions. We propose to learn the DCG gain values and discount factors from preference judgements over ranking lists. In particular, we develop a model that learns DCG as a linear utility function and formulate the method as a quadratic programming problem. Preliminary simulation results suggest the effectiveness of the proposed method.

We plan to further investigate the problem of learning DCG and to apply the proposed method to real-world data sets. Furthermore, we plan to generalize DCG to nonlinear utility functions to model more sophisticated requirements on ranking lists, such as diversity and personalization.

[Figure 11: Similarity of the enhanced model with respect to the number of training pairs (Data 1, Data 2, Data 1 Enhanced, Data 2 Enhanced).]

[Figure 12: Precision of the enhanced model over the test set with respect to the number of training pairs (Data 1, Data 2, Data 1 Enhanced, Data 2 Enhanced).]

[Figure 13: Performance (precision and similarity) over the test set with respect to the number of training pairs when the grades of documents are not known.]

10. REFERENCES

[1] J. D. Carroll and P. E. Green. Psychometric methods in marketing research: Part II, Multidimensional scaling. Journal of Marketing Research, 34:193–204, 1997.

[2] O. Chapelle and Z. Harchaoui. A machine learning approach to conjoint analysis. Volume 17, pages 257–264, Cambridge, MA, USA, 2005. MIT Press.

[3] T. Evgeniou, C. Boussios, and G. Zacharia. Generalized robust conjoint estimation. Marketing Science, 24(3):415–429, 2005.

[4] K. Jarvelin and J. Kekalainen. IR evaluation methods for retrieving highly relevant documents. In SIGIR '00: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 41–48, New York, NY, USA, 2000. ACM.

[5] T. Joachims. A support vector method for multivariate performance measures. In Proceedings of the 22nd International Conference on Machine Learning, 2005.

[6] J. J. Louviere, D. A. Hensher, and J. D. Swait. Stated Choice Methods: Analysis and Application. Cambridge University Press, New York, NY, USA, 2000.

[7] E. M. Voorhees. Evaluation by highly relevant documents. In SIGIR '01: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 74–82, New York, NY, USA, 2001. ACM.

[8] J. Xu and H. Li. AdaRank: a boosting algorithm for information retrieval. In Proceedings of the 30th ACM SIGIR, pages 391–398, New York, NY, USA, 2007.

[9] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. In Proceedings of ACM SIGIR, New York, NY, USA, 2007.


Creating a test collection to evaluate diversity in image retrieval

Thomas Arni, Jiayu Tang, Mark Sanderson and Paul Clough
Department of Information Studies, University of Sheffield, UK

ABSTRACT

This paper describes the adaptation of an existing test collection for image retrieval to enable diversity in the results set to be measured. Previous research has shown that a more diverse set of results often satisfies the needs of more users better than standard document rankings. To enable diversity to be quantified, it is necessary to classify images relevant to a given theme into one or more sub-topics or clusters. We describe the challenges in building (as far as we are aware) the first test collection for evaluating diversity in image retrieval. This includes selecting appropriate topics, creating sub-topics, and quantifying the overall effectiveness of a retrieval system. A total of 39 topics were augmented for cluster-based relevance, and we also provide an initial analysis of assessor agreement for grouping relevant images into sub-topics or clusters.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search

General Terms
Measurement, Experimentation, Human Factors, Verification.

Keywords
Diversity, image test collection, evaluation, image retrieval, building test collection

1. INTRODUCTION

It is common for modern search engines to ensure that duplicate or near-duplicate documents retrieved in response to a query are hidden from the user. This in turn leads to results (typically a ranked list) which offer greater diversity: describing different sub-topics, or representing different senses of a query. This functionality is particularly important when a user's query is either ambiguous or poorly specified. Users issuing such queries are likely to have a clear idea of the kind of items they wish to retrieve, but the search engine has little knowledge of users' exact preferences. A search engine that retrieves a diverse, yet relevant, set of documents at the top of a ranked list is more likely to satisfy the user [1, 2, 9]. New diversity-aware retrieval systems should not only increase diversity within the top n documents of the result set, but also reduce redundancy in the overall results set.

In turn, eliminating redundancy should promote novelty, leading to an overall set of results which is likely to be more satisfying to a user.

Although diversity is a key aspect of many commercial search engines, there are limited benchmarking resources to evaluate approaches for generating diverse results sets. To the best of our knowledge, only one test collection exists that provides some support for evaluating diversity [19]. However, in the field of image retrieval no such collection exists, despite the benefits that such a resource might offer. Building test collections is typically an expensive task in terms of time and effort [12]. However, we describe how to reduce the amount of effort involved by augmenting a pre-existing image test collection to support the evaluation of diversity.

The remainder of the paper is organized as follows: Section 2 describes related literature; Section 3 discusses principles behind evaluation and the ImageCLEFPhoto evaluation campaign, as well as this year's task; Section 4 describes the process of creating a test collection for diversity, a comparison of cluster judgements, and some statistics; Section 6 describes metrics to quantify retrieval effectiveness; and Section 7 concludes the paper and describes future work.

2. LITERATURE REVIEW

2.1 Probability Ranking Principle

The underlying principle of many classic document ranking systems is the Probability Ranking Principle (PRP). According to a common interpretation of this principle, retrieval systems should rank documents that are most similar to the query nearer the top of a ranked list, as well as maximize the number of relevant documents returned. Documents are therefore ranked in decreasing order of their predictive probabilities of relevance. Under reasonable assumptions, one can prove that ranking documents in descending order by their probability of relevance yields the maximum expected number of relevant documents, and thus maximizes the expected values of the well-known precision and recall metrics [1]. However, simply returning all relevant documents, including duplicates or near-duplicates, is not always the best way to satisfy a user's needs [2, 9]. Therefore, the common interpretation of the PRP needs to be re-thought, and consequently, new retrieval techniques are needed. To test their effectiveness and improve them, adequate test collections are essential.

2.2 Definition of Terms

There is no general consensus in the literature about the naming of a more fine-grained categorization of relevant documents for a given topic. Terms such as "Sub-Topics" [1, 5], "Topic Aspects" [19], "Topic Instances" [6] and "Facets" are used interchangeably and refer to the same concept. We refer to this concept as "Topic Clusters". The intention is always to group all relevant images into groups with similar content, or to define which group(s) a relevant image belongs to. While identifying relevant documents is one part of the search process, it is also crucial to include documents on as many sub-topics at the top of the results list as possible.

2.3 TREC Interactive Track

The TREC Interactive Tracks 6, 7 and 8 all focused on an instance recall task [16]. Searchers from each participating group were instructed to save, within 20 minutes, as many documents as possible covering different aspects/instances of a topic. TREC assessors identified, a priori, all possible aspects (or instances) of a given topic. The resulting aspectual judgements were then used to measure the diversity of results sets generated by the searchers. Instance recall and instance precision metrics were used to compare results of participating groups, and organisers concluded that methods such as relevance feedback, Okapi term weighting, and document summarisation did not improve instance recall.

It is possible to use the aspectual judgements from the TREC assessors as a test collection to quantify diversity within the results set produced by ad hoc search. However, the number of topics which can be used for evaluation is limited, because only six to eight new topics were introduced each year during the three years of the interactive track (a total of only 20 topics is available). The organisers of TREC themselves, however, argue that 25 topics is the minimum number which should be used in comparative evaluations, because system rankings become unstable with fewer topics [17].

2.4 Novelty

Novelty aims to avoid redundancy of documents in the results set, and Maximal Marginal Relevance (MMR) is one approach which has been shown to increase novelty successfully [9]. It is assumed that a user is satisfied with only one, or a few, similar relevant documents in the results set, instead of finding duplicates or near-duplicates. Reducing or eliminating redundancy in the results set should promote not only novelty but also diversity (the emphasis on novelty is therefore an indirect way of promoting diversity, and vice versa [14]). Novelty and diversity are thus related, and normally an increase in one will help the other.

2.5 Maximal Diverse Relevance

While MMR aims to optimize novelty, Maximal Diverse Relevance (MDR) tries to specifically promote diversity. Zhai [14] proposes a method with mixture language models to directly increase the diversity of the result. Zhai [ibid.] used the aspectual judgements from TREC Interactive to measure diversity. However, as mentioned earlier, the quantity of topics is at the absolute lower limit.

3. EVALUATION PRINCIPLES

3.1 History of evaluation campaigns

Cranfield is generally regarded as the first IR test collection, and it defined the model used for evaluation ever since [13]. For more than a decade, standard ad hoc retrieval campaigns such as the Text REtrieval Conference (TREC, http://trec.nist.gov), the Cross-Language Evaluation Forum (CLEF, http://www.clef-campaign.org) and the NII-NACSIS Test Collection for IR Systems (NTCIR, http://research.nii.ac.jp/ntcir) have defined the manner in which large-scale comparative testing of search engines is conducted. The goal of all evaluation campaigns, and their test collections, is to measure and improve retrieval algorithms and methods for specific tasks (e.g. filtering, routing, ad hoc retrieval). Metrics like precision and recall have been used to measure retrieval effectiveness, and research has led to an understanding of which retrieval approaches are best suited to optimising precision and recall.

One drawback of most collections is that all judged relevant documents per topic in the assessment file (known as the qrels) are independent of each other. Further implications of this assumption are that all relevant documents are equally desirable, the user information need (expressed in the test collection topic) is static and the list of relevant documents is complete [12]. However, not all relevant duplicate or near-duplicate documents are equally desirable from a user’s perspective. Practical concerns of building test collections in a tractable number of person months were the reasons that drove the making of this assumption, because non-independent relevance judgements require much more effort to obtain.

Evidence for this derives, in part, from the experiences in building a test collection as part of TREC Interactive. Some support for measuring diversity in result sets was provided; however, the assessor effort required to build the collection was high, and subsequently only a limited number of queries were created. There is increasing evidence that ambiguous, ill-specified queries are common [18], which has led to increased research in the fields of diversity and novelty [6, 9]. Therefore, it is necessary to re-examine the creation of test collections that support diversity measurement, to try to determine the means of creating such collections in a tractable amount of time.

3.2 ImageCLEFPhoto 2008

The need for retrieval systems to produce diverse results is as strong in the field of image retrieval as it is in document retrieval. As the organizers of the ImageCLEFPhoto task (http://www.imageclef.org/2008/photo), we have tried to address the growing need for diversity from image search engines. Hence, we created an image test collection which specifically allows diversity to be measured. The guiding principles for the creation of this new collection were to ensure that result diversity could be measured effectively and to make the use of the collection as easy as possible.

3.2.1 Collection overview

The ImageCLEFPhoto task uses the IAPR TC-12 image collection, which comprises 20,000 images with annotations in three different languages [3]. Sixty topics are available in 15 different languages [4], and the collection has been used in various ImageCLEFPhoto tasks during the past two years [7, 10].



3.2.2 Diversity in the collection

The existing 60 topics of the IAPR TC-12 image collection were derived from a log file analysis and the domain knowledge of the topic authors. The topics embrace various search patterns, such as locations, tourist destinations, accommodation, animals, people, objects, actions or landscapes. Each topic was also classified as to how "visual" it was considered to be. Here, the word "visual" refers to the consistency of visual information that can be interpreted from the relevant images of a particular topic: the more "visual" a topic is, the more consistent the visual information that can be extracted from its relevant images. The degree to which a topic is "visual" is measured by a rating between 1 and 5, based on the score of a content-based retrieval system as well as the opinions of three experts in the field of image analysis [4]. Topics classified as highly "visual" are more likely to produce good results for content-based retrieval participants. The topics were also classified by their "complexity", which reflects the difficulty for a retrieval system of returning relevant images [4].

The collection contains many different images with similar visual content. This is because most images were offered by a travel company, which repeated fixed itineraries on a regular basis. Therefore, similar images in the collection vary in illumination, viewing angle, weather conditions and background [3].

3.2.3 Task

Participants in the ImageCLEFPhoto 2008 task run each provided topic on their image search system to produce a ranking of images. In the top 20 results, there should be as many relevant images as possible that are representative of the different clusters within the overall set of results. The definition of what constitutes diversity varies across the topics, but a clear indication of the clustering criteria used by the evaluators is given in each topic. Participating groups return, for each topic, a ranked list of image IDs. We determine which images are relevant and count how many clusters are represented in the ranking. To make the task as straightforward as possible for all participants, they only have to submit a standard TREC-style results list; they are not required to explicitly identify clusters or their labels. At the time of writing, results from all participants have been submitted. We are in the process of evaluating all submissions using the "gold standard" assessment file we created, which records which cluster(s) each relevant image belongs to. In the following, we will discuss the details of the creation of clusters.
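As an illustration only (the official effectiveness metrics are discussed in Section 6 and are not reproduced here), counting the clusters covered by relevant images in the top 20 of a submitted run might look like the following sketch; the function name and data layout are our assumptions.

```python
def cluster_coverage_at_k(ranked_ids, clusters_of, k=20):
    """Fraction of a topic's clusters represented by relevant images in the top k results.
    clusters_of maps a relevant image ID to the set of cluster labels it belongs to."""
    seen = set()
    for image_id in ranked_ids[:k]:
        seen |= clusters_of.get(image_id, set())      # non-relevant images contribute nothing
    all_clusters = set().union(*clusters_of.values()) if clusters_of else set()
    return len(seen) / len(all_clusters) if all_clusters else 0.0
```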

4. COLLECTION CREATION

4.1 Topic selection and enrichment 4.1.1 Deciding on cluster type To measure the diversity of a result set, the relevant images of each topic first have to be grouped into clusters. We examined each of the existing topics in the collection to identify which would be good candidates for clustering.

For the majority of the topics, the clustering was clear. For example, if a topic asked for images of beaches in Brazil, clusters were formed based on location; the cluster type in this case is the location of the beach. If a topic asked for photos of animals, clusters were formed based on animal type; typical clusters for the cluster type "animal" would be elephant, lion, crocodile, etc. However, there is room for subjective interpretation regarding how an optimal clustering should be defined. This is not a new problem: all existing test collection topics involve some degree of subjective relevance assessment, and the same is true for cluster assessment. Therefore, to form the clusters, a cluster type was determined for each topic and all relevant images were classified according to it.

4.1.2 Topic selection Out of the 60 existing topics, 39 were judged appropriate for the evaluation. The remaining 21 topics were either (a) too specific, (b) lacking diversity within the relevant images, or (c) considered too difficult to cluster.

4.1.3 Cluster types As mentioned, we had to decide on a cluster type for each topic. Without an explicitly defined cluster type, it would not be possible to compare the diversity of the result sets of the groups participating in ImageCLEFPhoto, so all participants must cluster the images according to the given cluster type. The 39 selected topics can be classified into two main groups of cluster types: Geographical and Miscellaneous. Table 1 shows the different cluster types as well as the number of corresponding topics. We believe that the 39 topics and their cluster types are well balanced and diverse, and should present a retrieval challenge to participants using either text and/or low-level visual analysis techniques for creating clusters. We also expect that, due to the detailed geo-referencing in the image annotations, the use of geographical knowledge will help.

Table 1. Overview of the cluster types

Group           Cluster type                            Number of topics
Geographical    Country                                 12
                City                                    5
                Region / state                          5
Miscellaneous   Animal                                  4
                Sport                                   2
                Vehicle                                 2
                Composition                             2
                Weather condition                       1
                Venue / tourist attraction / landmark   4
                Religious statue                        1
                Volcano                                 1

4.2 Cluster assessment For each topic in the ImageCLEFPhoto set, the relevant images need to be manually clustered into sub-topics. Cluster relevance judgements are required to indicate which cluster a relevant image belongs to. Relevance assessors were instructed to look for simple clusters based on the cluster type of a topic. Three assessors were invited to cluster the relevant images of a topic into sub-topics. First, two assessors, A1 and A2, were asked to manually judge all relevant images from each of the 39 topics according to the pre-defined cluster type. A third assessor, A3, was then asked to cluster only those topics on which the first two assessors could not reach sufficient agreement. No time limit was given to any assessor to complete their judgements. The first two assessors used a graphical web tool showing all relevant images of one topic at a time. Each assessor recorded the ID of each image and assigned it to their chosen clusters; they were allowed to assign an image to more than one cluster if this seemed appropriate. Topics that needed further investigation were given to the third assessor, A3.

4.2.1 Cluster assessment file The cluster assessments from the three assessors are stored in three separate files. For each of the 39 topics, each file records which cluster(s) each relevant image belongs to. These judgements were then compared to build a "gold standard" cluster assessment file, as described in the following section.

4.3 Cluster comparison Different judgements were observed across the assessors for some of the clusters. The geography-based cluster types shown in Table 1 had almost complete agreement between the first two assessors, and several topics from the Miscellaneous group also showed very high agreement. Topics for which the judgements from assessors A1 and A2 were nearly or completely consistent were not analyzed further. Some small inconsistencies were found to be errors by one of the assessors. For example, for topic 6 (straight road in the USA), which has to be clustered by state, assessor A2 placed some images in a cluster "San Francisco"; since San Francisco is not a state, as required by the topic's cluster type, the correct cluster for these pictures is "California". For 8 of the 39 topics, however, there was real variation. The reasons were analysed and found to be mainly (a) different notions of granularity, (b) different domain knowledge, (c) different interpretations of the topic, or (d) assignment of images to multiple clusters. These 8 topics were then given to assessor A3.
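The paper does not specify how agreement between assessors A1 and A2 was quantified. One simple possibility, sketched below in Python, is a pairwise (Rand-index-style) agreement over the relevant images of a topic; the data structures, cluster names and any threshold for "sufficient agreement" are assumptions for illustration only.

from itertools import combinations

def pairwise_agreement(assign1, assign2):
    # Fraction of image pairs on which two assessors agree about co-clustering.
    # assign1, assign2: dicts mapping image ID -> set of cluster labels (multi-cluster allowed).
    # Two images are "together" for an assessor if their cluster sets overlap.
    images = sorted(set(assign1) & set(assign2))
    agree = total = 0
    for a, b in combinations(images, 2):
        same1 = bool(assign1[a] & assign1[b])
        same2 = bool(assign2[a] & assign2[b])
        agree += (same1 == same2)
        total += 1
    return agree / float(total) if total else 1.0

# Hypothetical example: A2 lumps all birds together, A1 splits them by species.
a1 = {"i1": {"pelican"}, "i2": {"condor"}, "i3": {"dolphin"}}
a2 = {"i1": {"bird"}, "i2": {"bird"}, "i3": {"dolphin"}}
print(pairwise_agreement(a1, a2))  # 2/3: the assessors disagree only on the (i1, i2) pair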

4.3.1 Granularity It was found that assessors can have a slightly different understanding of cluster granularity; such differences were observed in the 8 topics. For example, assessor A2 chose to place animal images in a cluster "bird", whereas assessor A1 had chosen specific bird types such as pelican and condor alongside other clusters such as dolphins and monkeys. However, it was not the case that one assessor consistently used more specific clusters; each assessor was in some cases more specific than the other two. Figure 1 compares the number of clusters generated by the different assessors. Table 2 gives more details on the 8 topics, i.e. a description of the theme of each topic and the cluster type used.

Table 2. Topics with different granularity

Topic Nr.   Topic                                Cluster type
3           religious statue in the foreground   statue
5           animal swimming                      animal
10          destinations in Venezuela            location
16          people in San Francisco              landmark
20          close-up photograph of an animal     animal
37          sights along the Inca-Trail          tourist attraction
44          mountains on mainland Australia      location
48          vehicle in South Korea               vehicle type

4.3.2 Multi clusters The first two assessors were encouraged to classify an image into more than one cluster if this seemed appropriate to them. Despite these instructions, only 13 of the 2401 relevant images, spread over 6 different topics, belong to more than one cluster. An example is clustering images by famous landmarks in Sydney: in one and the same image, both the Harbour Bridge and the Opera House can be prominent, so such images are classified in both the Opera House and the Harbour Bridge clusters. In case of doubt, an image was classified in both clusters, so that participants are not disadvantaged regardless of which cluster they chose for the image. In contrast to the interactive TREC experiments, no large overlap of documents between the clusters within a topic was observed. We regard this as a positive quality: in search settings where diversity is desired, such as dealing with ambiguous queries, overlap between clusters would be expected to be low.

4.3.3 Unknown clusters In four topics, there are relevant images that cannot be assigned to a specific cluster, either because some specific information is missing in the image's caption or in the image itself, or because the assessor lacked the necessary domain knowledge.

Figure 1. Granularity shown by the number of clusters


In one topic, for example, where the clustering must be done by city, a specific city is not given in the annotation/caption, so such images are assigned to an "unknown" cluster. However, only 48 of the 2401 relevance-judged images are assigned to an unknown cluster, spread over 4 of the 39 topics; within these 4 topics, the proportion of unknown images varies between 1% and 29% of all relevant images. Due to the lack of domain knowledge, no further attempt was made to assign these unknown images to the existing clusters or to new ones. Note that images in an unknown cluster do not necessarily belong together, because they need not be similar to each other; given the small number of occurrences, no further sub-clustering was applied.

4.3.4 Creating "gold standard" cluster judgements After comparing the clusters from each assessor, a "gold standard" was defined. Although a true "gold standard" hardly exists in real life, because different people have different views of an appropriate clustering, we tried to create one that is as general as possible based on the judgements of the three assessors. For the 8 topics with different granularities shown in Figure 1, a granularity acceptable to all three assessors was agreed. For all other topics the gold standard was already obvious due to the high level of cluster agreement; small disagreements were corrected in a way all assessors could agree on. In one case, however (topic 10), there was not only a different granularity but also a totally different understanding of the cluster type, which resulted in diverging clusters. Further analysis, and perhaps more assessors, will be needed before agreement can be reached; if no agreement can be found, we will have to consider dropping this topic because the assessors could not identify a common denominator.

Table 3. Cluster statistics

Description                        Mean          Median   Range [min, max]
Number of clusters/topic           7.9 ± 5.0     7        [2, 23]
Number of relevant images/topic    61.6 ± 33.7   60       [18, 184]

4.4 Statistics Table 3 provides descriptive statistics for the 39 topics, the corresponding number of clusters and the number of relevant images. On average there are 7.9 clusters per topic and 61.6 relevant images per topic. The whole collection comprises 20,000 images, of which 2401 are judged relevant to the 39 topics; of these 2401 relevance-judged images, 2130 are unique. The reason for the difference is that an image can be judged relevant to more than one topic: in our case, 107 images are relevant to 2 topics and 19 are relevant to 3 topics.

5. EVALUATION METRICS Evaluating IR systems for diversity is not as straightforward as evaluating ordinary ones. Intuitively, a good diversity-aware IR system ranks relevant documents that cover many different sub-topics early in the ranked list while avoiding covering the same sub-topics repeatedly. Compared with ordinary retrieval, this leads to two new criteria: 1) maximize the number of sub-topics covered; 2) minimize the redundancy in covering sub-topics. Zhai et al. proposed two metrics for evaluating sub-topic retrieval, namely sub-topic recall (S-recall) and sub-topic precision (S-precision) [6], and argued that S-recall and S-precision are natural generalizations of ordinary recall and precision. S-recall at rank K is defined as the percentage of sub-topics covered by the first K documents in the list:

\[ \text{S-recall at } K = \frac{\left| \bigcup_{i=1}^{K} \mathrm{subtopics}(d_i) \right|}{n_A} \]

where d_i is the i-th document, subtopics(d_i) is the set of sub-topics that d_i belongs to, and n_A is the total number of sub-topics for the given topic. S-precision at S-recall level r (0 < r < 1) is defined as:

\[ \text{S-precision at } r \equiv \frac{\mathrm{MinRank}(S_{\mathrm{opt}}, r)}{\mathrm{MinRank}(S, r)} \]

where MinRank(S, r) is the minimal rank K at which an IR system S reaches S-recall r, and S_opt is a system that produces the optimal ranking for reaching sub-topic recall r. In other words, MinRank(S_opt, r) is the smallest K at which an S-recall of r can possibly be obtained. Calculating MinRank(S_opt, r) can be viewed as a minimum set cover problem, in which one tries to find the smallest subset of documents whose union covers a fraction r of the whole set of sub-topics of a topic. S-recall and S-precision can be used as evaluation metrics in the same way as ordinary recall and precision, for example in conventional recall-precision curves. For ImageCLEFPhoto 2008, we announced that we will evaluate the participants' results based on the top 20 documents per topic. For a more comprehensive comparison, choosing a cutoff greater than 20 might be necessary: plotting recall-precision curves, for example, requires S-precision at S-recall levels up to 0.9 and 1.0. It is likely that the participants' systems cannot cover all sub-topics within the top 20 documents, so S-recall < 1.0 (or even S-recall << 1.0) at rank 20, which makes calculating S-precision at high S-recall levels difficult due to the lack of results. We therefore use two measures: ordinary precision at rank 20 and sub-topic recall (S-recall) at rank 20. These two metrics, however, do not capture the second criterion mentioned earlier, the redundancy of sub-topic coverage. As an extreme example, given a topic with 2 sub-topics, one system might retrieve 10 documents belonging to one sub-topic and 10 belonging to the other, while another system retrieves 19 documents from the first sub-topic and only one from the second. Both systems obtain exactly the same ordinary precision and S-recall, yet users may find the first system much better. Intuitively, a more balanced coverage of sub-topics is desirable, so metrics that evaluate how balanced the coverage is across sub-topics may be useful. In addition, rather than calculating precision and S-recall only at rank 20, we could also calculate them at ranks 5, 10, 15 and 20.
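The following Python sketch illustrates how S-recall and S-precision can be computed from a ranking and a mapping of documents to sub-topics. Since MinRank(S_opt, r) is a minimum set cover, which is NP-hard in general, the sketch uses the standard greedy approximation (which can overestimate the true optimum); an exact solver would be preferable for an official evaluation, and all data structures here are assumptions.

def s_recall_at_k(ranking, subtopics, n_subtopics, k):
    # Fraction of a topic's sub-topics covered by the first K documents (S-recall at K).
    covered = set()
    for doc in ranking[:k]:
        covered |= subtopics.get(doc, set())
    return len(covered) / float(n_subtopics)

def min_rank(ranking, subtopics, n_subtopics, r):
    # Smallest K at which this ranking reaches S-recall r (None if it never does).
    for k in range(1, len(ranking) + 1):
        if s_recall_at_k(ranking, subtopics, n_subtopics, k) >= r:
            return k
    return None

def greedy_min_rank_opt(subtopics, n_subtopics, r):
    # Greedy set-cover approximation of MinRank(S_opt, r): repeatedly pick the document
    # covering the most not-yet-covered sub-topics until a fraction r is covered.
    # Note: the greedy count only approximates (upper-bounds) the true optimum.
    needed = r * n_subtopics
    covered, picked, candidates = set(), 0, dict(subtopics)
    while len(covered) < needed and candidates:
        best = max(candidates, key=lambda d: len(candidates[d] - covered))
        if not candidates[best] - covered:
            break  # nothing new can be added; S-recall r is unreachable
        covered |= candidates.pop(best)
        picked += 1
    return picked

def s_precision_at_r(ranking, subtopics, n_subtopics, r):
    # S-precision at S-recall level r = MinRank(S_opt, r) / MinRank(S, r).
    k = min_rank(ranking, subtopics, n_subtopics, r)
    return greedy_min_rank_opt(subtopics, n_subtopics, r) / float(k) if k else 0.0

# Toy example: 4 sub-topics, documents mapped to the sub-topics they cover.
subs = {"d1": {"dolphin"}, "d2": {"dolphin"}, "d3": {"turtle", "pelican"}, "d4": {"alligator"}}
run = ["d1", "d2", "d3", "d4"]
print(s_recall_at_k(run, subs, 4, 2))       # 0.25
print(s_precision_at_r(run, subs, 4, 1.0))  # 3/4 = 0.75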


5.1 Example evaluation By way of example, consider the following topic:

<top>
  <num> Number: 5 </num>
  <title> animals swimming </title>
  <cluster> animal </cluster>
</top>

For this example topic, the collection contains four clusters/sub-topics (dolphin, turtle, alligator and pelican), each of which contains at least one relevant image. In this example, we are interested in the cluster recall at position 20 of the result set. A search system that ignores diversity will produce a result set containing groups of similar documents, as illustrated in the left result set of Figure 2. The task is to promote diversity and bring at least one relevant document from each relevant sub-topic (dolphin, turtle, alligator and pelican) into the first 20 results; the targeted result set is illustrated on the right of Figure 2. The order of the diverse and relevant documents within the first 20 is not a consideration in the calculation of cluster recall: the relevant documents from the 4 sub-topics can appear in any order within the first 20 without affecting it. In this case the cluster recall (N = 20) is 1, because at least one relevant document from each cluster is retrieved within the first 20 documents.

6. CONCLUSIONS AND FUTURE WORK While the standard ad hoc retrieval tasks of several evaluation campaigns are appropriate in many settings, they are not universally the best approach for obtaining a relevant yet diverse result set. Users care about factors such as diversity and the avoidance of redundant documents. With the construction of a new diversity image test collection, which supports measuring both standard precision/recall and diversity, we address the increasing need to promote diversity in the result set. We showed what has to be considered when selecting and augmenting existing topics and how to cluster the relevant images of all topics. We also presented appropriate evaluation metrics that can be used to measure the effectiveness of the results. The whole test collection was set up so that participation is as simple and straightforward as possible: participants are not required to submit cluster labels or any further information about their clusters. They simply provide a standard result list, from which we can determine both recall/precision and the diversity of the results.

Due to the increasing interest in diversity, we intend to extend our work on building test collections for diversity. A much bigger collection, and collections in other domains, would allow us to look at ambiguous queries as a starting point for creating appropriate topics. Further research could also address how the clusters are built and how they are labelled. Studying the ordering of the clusters and the number of images in each cluster is another area where more research is needed.

Future work should also compare the diversity-optimized retrieval results of this year's participants with the standard ad hoc results (not optimized for diversity) from previous years. This will show whether retrieval systems that promote diversity have the expected impact on retrieval effectiveness. One way of doing this would be a user-oriented effectiveness study, in which users judge result lists from the ad hoc and the diversity-oriented runs and state their preferences.

ACKNOWLEDGEMENTS We would like to thank Michael Grubinger for providing the data collection and queries that formed the basis of the ImageCLEFPhoto task for 2008. We also wish to thank the reviewers for their helpful comments. The work described in this paper is supported by the EU-funded TrebleCLEF project (Grant agreement: 215231), the MultiMatch project (contract No. IST-2005-2.5.10) and by Memoir (contract No. RU112355).

Figure 2. Example of a diverse result after re-ranking the initial list to promote diversity

REFERENCES
[1] Chen, H. and Karger, D. R. (2006). Less is more: probabilistic models for retrieving fewer relevant documents. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Seattle, Washington, USA, August 6-11, 2006). SIGIR '06. ACM, New York, NY, 429-436.

[2] Song, K., Tian, Y., Gao, W. and Huang, T. (2006). Diversifying the image retrieval results. In Proceedings of the 14th Annual ACM International Conference on Multimedia (Santa Barbara, CA, USA, October 23-27, 2006). MULTIMEDIA '06. ACM, New York, NY, 707-710.

[3] Grubinger, M., Clough, P., Müller, H. and Deselaers, T. (2006). The IAPR TC-12 Benchmark: A New Evaluation Resource for Visual Information Systems. In Proceedings of the International Workshop OntoImage'2006: Language Resources for Content-Based Image Retrieval, held in conjunction with LREC'06, Genoa, Italy, pp. 13-23.

[4] Grubinger, M. and Clough, P. (2007). On the Creation of Query Topics for ImageCLEFPhoto. In Proceedings of the Third MUSCLE / ImageCLEF Workshop on Image and Video Retrieval Evaluation, Budapest, Hungary, 19-21 September 2007.

[5] Hersh, W. R. and Over, P. (1999). TREC-8 interactive track report. In Proceedings of TREC-8.

[6] Zhai, C., Cohen, W. W. and Lafferty, J. (2003). Beyond independent relevance: Methods and evaluation metrics for subtopic retrieval. In Proceedings of ACM SIGIR 2003, pages 10-17.

[7] Grubinger, M., Clough, P., Hanbury, A. and Müller, H. (2007). Overview of the ImageCLEFPhoto 2007 photographic retrieval task. In Working Notes of the 2007 CLEF Workshop, Budapest, Hungary, September 2007.

[8] Carterette, B. and Bennett, P. N. (2008). A Test Collection of Preference Judgments. In SIGIR 2008 Workshop: Beyond Binary Relevance: Preferences, Diversity, and Set-Level Judgments. Edited by P. Bennett, B. Carterette, O. Chapelle and T. Joachims. URL: http://ciir.cs.umass.edu/~carteret/bbr-overview.pdf

[9] Carbonell, J. and Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335-336, Melbourne, Australia.

[10] Clough, P., Grubinger, M., Deselaers, T., Hanbury, A. and Müller, H. (2007). Overview of the ImageCLEF 2006 Photographic Retrieval and Object Annotation Tasks. In Evaluation of Multilingual and Multi-modal Information Retrieval: 7th Workshop of the Cross-Language Evaluation Forum, CLEF 2006, Alicante, Spain, September 20-22, 2006.

[11] Li, J. (2005). Two-scale image retrieval with significant meta-information feedback. In Proceedings of the 13th Annual ACM International Conference on Multimedia, November 6-11, 2005, Hilton, Singapore.

[12] Voorhees, E. M. (2001). The philosophy of information retrieval evaluation. In Proceedings of the Second Workshop of the Cross-Language Evaluation Forum on Evaluation of Cross-Language Information Retrieval Systems, pages 355-370.

[13] Cleverdon, C. (1967). The Cranfield test of index language devices. Reprinted in Readings in Information Retrieval, 1998, pages 47-59.

[14] Zhai, C. (2002). Risk Minimization and Language Modeling in Text Retrieval. PhD thesis, Carnegie Mellon University.

[15] Ali, K., Chang, C. and Juan, Y. F. (2005). Exploring cost-effective approaches to human evaluation of search engine relevance. In ECIR '05: 27th European Conference on Information Retrieval, Santiago de Compostela, Spain, March 2005.

[16] Hersh, W. and Over, P. (2000). TREC-8 interactive track report. In Proceedings of the 8th Text REtrieval Conference (TREC-8), Gaithersburg, MD: NIST, 57-64.

[17] Buckley, C. and Voorhees, E. (2000). Evaluating evaluation measure stability. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece. ACM Press, 33-40.

[18] Sanderson, M. (2008). Ambiguous Queries: Test Collections Need More Sense. To appear in the Proceedings of ACM SIGIR 2008.

[19] Over, P. (1997). TREC-5 Interactive Track Report. In Proceedings of the Fifth Text REtrieval Conference (TREC-5), E. M. Voorhees and D. K. Harman (Eds.), 29-56.


Nugget-based Framework to Support Graded Relevance Assessments

Maheedhar Kolla
University of Waterloo
Canada, N2L 3G1
[email protected]

Olga Vechtomova
University of Waterloo
Canada, N2L 3G1
[email protected]

Charles L.A. Clarke
University of Waterloo
Canada, N2L 3G1
[email protected]

ABSTRACT
Graded relevance assessments indicate the degree to which a given document would satisfy the information need of the user. Graded assessments, however, are rarely used in retrieval evaluation due to the difficulties faced in obtaining such assessments. In this paper, we propose a framework where we identify the nuggets associated with an information need. The graded relevance value of a document is then determined by the number of nuggets covered by the document. We conduct an exercise to compile a test collection based on our framework. From our analysis, we observe that the user notion of document relevance is consistent with that of nuggets covered.

1. INTRODUCTION
In traditional evaluation, systems are ranked by their ability to retrieve relevant documents close to the top of the retrieved list. In all such experiments, the relevance of a document is tightly bound to its topicality (the extent to which the document would satisfy the user's needs) [4]. Documents are judged on a binary scale, relevant or not relevant, based on their topicality.

Such binary judgements treat all relevant documents as the same, and it is assumed that all relevant documents would be equally liked by users. This contradicts the observation that the relevance of a document increases progressively with its topicality [9].

The idea of graded relevance assessments was explored by Jarvelin and Kekalainen [6] in order to distinguish highly relevant documents from others. They proposed gain-based measures to reward systems that retrieve highly relevant documents near the top of the list, before marginally relevant or non-relevant documents. A recent study [1] assessing the quality of web pages retrieved by a commercial search engine found that user satisfaction correlates with the Cumulative Gain (CG) measure.

Sormunen [11] re-assessed a set of documents, judged by NIST assessors in the context of TREC-7 and TREC-8, on a four-point scale of relevance: irrelevant, marginally relevant, relevant and highly relevant. Around 50% of the documents previously judged relevant were re-judged as only marginally relevant, and a smaller fraction of the documents (16%) was judged as highly relevant.

However, binary relevance judgements are popular and usually preferred over graded relevance judgements, due to the ease with which assessors can classify documents on a binary scale. One major drawback of obtaining graded relevance assessments is identifying the optimal number of relevance categories [13]; another is that assessors find it hard to distinguish between the various relevance levels.

We propose a framework that allows us to divide a given information need into a set of associated nuggets. Each nugget is a piece of information that a user associates with a relevant document. The graded relevance value of a document can then be obtained from the number of associated nuggets covered by the document. Through an exercise, we observe that nugget-based relevance assessments agree with other kinds of relevance assessments.

2. RELATED WORK
Carterette et al. [3] proposed an alternative method to obtain graded relevance assessments from pairwise preference judgements. Assessors are shown pairs of documents, and their preference for one document over the other is used to relatively sort the documents. Once sorted, the top-ranked documents (preferred over other documents) are considered to be of higher relevance than the bottom-ranked ones. They proposed measures such as "precision of preferences" (ppref), representing the fraction of preferences correctly ordered by the search systems, and observed a close correlation between wpref and nDCG values.

Although their method requires assessors to exhaustively make binary preference decisions, it simplifies the task of obtaining graded relevance values without the need to define an optimal number of relevance categories. Unlike standard relevance assessments, which are based on the topicality of the document, preference-based evaluation enables researchers to investigate document features that influence users' assessment of document relevance.

The task of dividing an information need into smaller subtopics was studied earlier by Zhai et al. [15]. Their work focused on diversifying the retrieved results with an MMR-style ranking function [2] so as to cover more subtopics.


3. FRAMEWORK
Our framework is modeled on the principle of identifying the nuggets associated with an information need. Similar methods of evaluation have been proposed in the fields of summarization and question answering. In summarization [8], summaries (manual and system-generated) are divided into Semantic Content Units (SCUs), where each SCU is a clause-like unit referring to a certain semantic label; for example, SCUs like "Madrid" and "capital of Spain" would both refer to the semantic label "capital of Spain". A system-generated summary is then evaluated by the number of SCUs it has in common with the manual summaries. In question answering [14], a series of questions is identified for each target (a person, organization, etc.). For example, the series of questions for the target "The Daily Show" is shown below:

The Daily Show appears on what cable channel?
The Daily Show parodies what other type of TV program?
Who is host of The Daily Show?
At what time is The Daily Show initially televised?
Who is the creator of The Daily Show?
What celebrities have appeared on The Daily Show?

Table 1: Question series for the target "The Daily Show"

Jimmy Lin [7] compared question answering systems and retrieval systems by the number of facts (or questions) answered after going through a list of retrieved documents. Clarke et al. [5] proposed a framework in which a nugget refers to any possible kind of need (informational, navigational or transactional) that could arise from the query.

In our framework, we define a nugget as an answer to a simple factoid question or an extensive answer to a complex question. We then attribute the number of nuggets covered (or answered) by a document to its relevance:

\[ \sum_{n=1}^{k} d_{ni} \qquad (1) \]

where k is the number of nuggets associated with the topic and d_{ni} is a binary value indicating the presence (1) or absence (0) of nugget n in document i. Although we consider each nugget to be of equal importance here, we could differentiate between nuggets by assigning each nugget a weight that indicates its importance.
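A minimal Python sketch of Equation 1, together with the weighted variant just mentioned; the nugget names and weights are purely illustrative.

def nugget_grade(doc_nuggets, topic_nuggets, weights=None):
    # Relevance grade of a document: the (optionally weighted) number of the topic's
    # nuggets it covers (Equation 1). Weights default to 1 per nugget.
    weights = weights or {n: 1.0 for n in topic_nuggets}
    return sum(weights[n] for n in topic_nuggets if n in doc_nuggets)

topic = ["symptoms", "causes", "treatment", "similar illnesses"]  # nuggets for "Schizophrenia"
print(nugget_grade({"symptoms", "treatment"}, topic))  # 2
print(nugget_grade({"symptoms"}, topic,
                   {"symptoms": 2.0, "causes": 1.0, "treatment": 1.0, "similar illnesses": 0.5}))  # 2.0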

3.1 Evaluation Measures
Gain-based measures [6] express the information accumulated by the user while going through a list of documents. Since our framework assigns the relevance of a document based on the information gained (in terms of nuggets covered), we can easily adapt measures such as CG, DCG and nDCG to evaluate search systems within our framework.
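For instance, the nugget-based grade of each ranked document can be plugged in as its gain when computing DCG and nDCG. The sketch below uses the common log2 discount; the paper does not commit to a particular discount function, so this choice, like the example gain values, is an assumption.

import math

def dcg(gains, k=None):
    # Discounted cumulative gain with the common log2(i + 1) discount (rank 1 undiscounted).
    gains = gains[:k] if k else gains
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains, ideal_gains, k=None):
    # DCG normalized by the DCG of an ideal (descending-gain) ordering.
    ideal = dcg(sorted(ideal_gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

# gains[i] = number of nuggets covered by the i-th ranked document (hypothetical values)
gains = [3, 0, 2, 1, 0]
print(ndcg(gains, ideal_gains=gains, k=5))  # about 0.93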

4. PILOT STUDY
Through an exercise, we compiled a test collection using our framework of evaluation. The goal of the exercise is to investigate the agreement between nugget-based relevance assessment and other methods of relevance assessment: absolute, graded and preference-based.

4.1 Topics and Corpus
A snapshot of Wikipedia^1 was selected as the corpus for our exercise. The task at hand is informational search, where a user searches the collection (Wikipedia) for articles that would help satisfy his/her information need. In total, 26 students with sufficient background knowledge about retrieval evaluation, herein called topic creators, authored the test topics and assessed the documents retrieved for their topics. The topic creators agreed to release the topics and relevance judgements under a Creative Commons license.

Topic creation was modeled as an informational ad hoc search task. An assessor, in our context, is interested in finding articles that would satisfy his/her information need. We adopted a TREC style of topic compilation, where an information need is expressed through <title> and <description> fields:

<title> Schizophrenia </title>
<desc> User is interested in Schizophrenia, and wants to know general information about the condition. </desc>

In addition, for each topic, the topic creator identified the list of nuggets associated with their need. Although a nugget could represent any of the three kinds of needs (informational, navigational and transactional), we limited them to nuggets representing informational needs. Table 2 shows the list of nuggets identified by the topic creator for the topic "Schizophrenia".

Topic: Schizophrenia

Symptoms of a person with Schizophrenia
Causes of Schizophrenia
Common methods of treatment
Similar mental illnesses

Table 2: Nuggets associated with the topic "Schizophrenia"

Using terms extracted from the title and/or description fields as query terms, we retrieved documents using several automatic ranking methods implemented in various existing retrieval systems. From the retrieved results, we compiled a pool of 108 documents per topic for judging. Along with topic-related articles, Wikipedia contains other document types such as redirect pages, image pages, disambiguation pages, and so on. In our study, we treated these document types as not relevant and filtered them out while selecting the documents for the judging phase.

4.2 Relevance Judgements
Our aim is to investigate the correspondence and agreement/disagreement between nugget-based graded relevance assessments and other graded relevance assessments. We obtained three sets of judgements for 49 topics to carry out our investigation.

4.2.0.1 Topic Creators.
In the first set of judgements, each topic creator judged the documents retrieved for their topic using the interface shown in Figure 1. For each document, we display the title and description of the topic near the top of the page. On the right-hand side of the page, we display the nuggets associated with the topic, identified during the topic creation phase of the exercise. The topic creators judged each document on a three-point absolute scale based on its topicality: non-relevant (0), relevant (1) and highly relevant (2). For documents judged relevant or highly relevant, the topic creators selected, on the right-hand side of the page, all the nuggets covered (or answered) by the document.

1 http://en.wikipedia.org

As mentioned in the previous section, the relevance of a document is then equated to the number of nuggets covered. Since we do not exhaustively list all nuggets associated with a topic, a document could be relevant yet cover none of the nuggets identified during the topic creation phase. To avoid this scenario, we implicitly consider that each relevant or highly relevant document covers one nugget simply by being on topic, and modify Equation 1 to obtain

\[ \sum_{n=1}^{k} d_{ni} + t_i \qquad (2) \]

where t_i is equal to 1 for relevant and highly relevant documents, indicating that they are on topic.
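A short sketch of the adjustment in Equation 2: a document judged relevant or highly relevant receives one extra unit of gain for being on topic, so a relevant document that covers none of the listed nuggets still obtains a non-zero grade. The function and argument names are illustrative.

def nugget_grade_with_topicality(doc_nuggets, topic_nuggets, on_topic):
    # Equation 2: nugget count plus t_i, where t_i = 1 iff the document was judged
    # relevant or highly relevant (i.e., it is on topic).
    t_i = 1 if on_topic else 0
    return sum(1 for n in topic_nuggets if n in doc_nuggets) + t_i

print(nugget_grade_with_topicality(set(), ["symptoms", "causes"], on_topic=True))        # 1
print(nugget_grade_with_topicality({"causes"}, ["symptoms", "causes"], on_topic=True))   # 2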

4.2.0.2 Assessor1.
For the second set of judgements, we selected documents that had been judged relevant or highly relevant by the topic creators and judged them on a four-point relevance scale: Highly relevant (3), Relevant (2), Marginally Relevant (1) and Not Relevant (0). Since each topic creator judged documents retrieved for their own topic, one of the co-authors, who was not involved in the topic creation or relevance judgements, assigned graded relevance judgements for these documents.

Figure 2: Right-hand side of the interface to obtain multi-level graded relevance assessments

Our motivation behind this set of judgements is to compare the agreement between the nuggets marked (as covered in a document) by the topic creators and multi-level graded relevance assessments. We modified the interface used by the topic creators so that it did not display the nuggets they compiled. The right-hand side of the interface was modified as shown in Figure 2, so as to obtain graded relevance assessments without displaying the nuggets associated with the topic.

4.2.0.3 Assessor2.
Carterette et al. [3] proposed a method to obtain graded relevance assessments from pairwise preference judgements. Through the third set of judgements, we wish to investigate the agreement between the documents preferred and the number of nuggets covered by the preferred documents. Another co-author of the paper, herein called "assessor2", who did not take part in any of the previous judgements, recorded their document preference when shown two documents at a time.

We designed our interface (Figure 3) to be similar to the one described in [3]. For each pair of documents selected for a given topic, we displayed the two documents along with the title and description of the topic. As in the case of assessor1, we did not show the nuggets identified by the topic creator. For each document pair, assessor2 had a choice of selecting: 1) the document shown on the left (Left), 2) the document shown on the right (Right), 3) both are equally preferred (Equally Good), or 4) both are equally bad for the topic (Equally Bad).

5. RESULTS
In total, we obtained relevance judgements from the topic creators for 5106 documents over 49 topics. In Table 3, we show the distribution of documents judged by the topic creators against the fraction of nuggets covered by those documents. We divided the documents into five bins based on their nugget coverage: up to 20%, 40%, 60%, 80% and 100%. Each entry in the table is the fraction of documents in a bin that were assigned the corresponding absolute relevance value by the topic creator. As the table shows, the more nuggets a document covers, the more likely the topic creators were to judge it highly relevant (instead of merely relevant). This (unsurprising) result indicates that the topic creators associated the relevance of a document with the number of nuggets covered.

           Percentage of nuggets covered
Rel        < 20%   < 40%   < 60%   < 80%   < 100%
rel        87%     82%     72%     33%     19%
high rel   13%     18%     28%     67%     81%

Table 3: Absolute relevance judgements versus nuggets covered by a document (topic creators)

In Table 4, we compare the multi-level graded relevance assessments obtained from the second set of judgements (assessor1) against the fraction of nuggets covered by the document. As above, we divide the documents into bins based on their nugget coverage. The majority of the documents were judged relevant (2); this is not surprising, as the documents selected for this assessor had already been judged relevant or highly relevant by the topic creators. We also observed that the proportion of documents judged marginally relevant (versus highly relevant) decreases as the number of nuggets covered increases.

In Table 5, we compare the agreement between the preference judgements (assessor2) and the nuggets covered (topic creators). For each pair, we check whether the preferred document covers more nuggets than the other one.


Figure 1: Interface for obtaining absolute judgements from topic creators

Figure 3: Interface to obtain preference judgements

Each row in the table indicates the preference value obtained from assessor2 (e.g., PA > PB refers to the instances where document A was preferred over document B). Each column corresponds to the relationship between the documents based on the number of nuggets covered (e.g., NA > NB indicates that document A covers more associated nuggets than document B, as judged by the topic creator).


           Percentage of nuggets covered
Rel        < 20%   < 40%   < 60%   < 80%   < 100%
0          12%     3%      1%      4%      1.3%
1          24%     19%     19%     16%     9.2%
2          51%     60%     52%     56%     43.4%
3          13%     18%     28%     24%     46.1%

Table 4: Assessor1 judgements: graded relevance values against percentage of nuggets covered. The percentage of documents judged as highly relevant (3) increases as we traverse from left to right.

We observe that the assessor's preference between documents is consistent with the number of nuggets covered.

In Table 6, we compare the agreement between the graded relevance assessments obtained from assessor1 and the preference-based judgements from assessor2. As in Table 5, each row indicates the document preference obtained from assessor2 (e.g., PA > PB refers to the instances where document A was preferred over document B). Each column corresponds to the relationship between the documents based on the multi-level graded relevance assessments (e.g., CA > CB indicates that assessor1 assigned a higher relevance level to document A than to document B).

From these two tables (Tables 5 and 6), we observe the following scenarios of agreement and disagreement:

S1: Both sets of judgements agree that the two documents shown are not equal, and both concur on the relative ordering of the documents (i.e. in both sets of judgements A is considered more relevant than B).

S2: Both sets of judgements agree that the documents differ, but the document preferences are reversed.

S3: The two sets of judgements disagree on the equality of the documents, e.g. one set of judgements prefers document A while the other considers the two documents equal.

S4: Both sets of judgements agree that the documents displayed are equal.

As observed from the two tables, there are two kinds of disagreement between the judgements obtained from one assessor and the other:

• Pairs where the choice of preferred document is reversed between the two sets of judgements.

• Pairs where a document is preferred in one set of judgements, but the two documents are considered equal in the other.

In the former case, the two sets of judgements take opposing views about which document in a given pair is preferred, but they still agree that the documents displayed have different levels of relevance. We define such a reversal of judgements as opposition and measure it as follows:

\[ \text{opposition} = \frac{\#\text{pairs in } S2}{\#\text{pairs in } S1 + \#\text{pairs in } S2} \qquad (3) \]

In the other scenario, the two sets of judgements disagree not only about the relative importance of the documents but also about whether the documents are equal. We define such a reversal as disagreement between the two sets of judgements and compute it as follows:

\[ \text{disagreement} = \frac{\#\text{pairs in } S2 + \#\text{pairs in } S3}{\#\text{total pairs judged}} \qquad (4) \]
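Assuming each judged pair has been labelled with one of the scenarios S1-S4 defined above, the two measures can be computed as in the following sketch (the labels and counts are hypothetical).

from collections import Counter

def opposition_and_disagreement(scenarios):
    # scenarios: iterable of labels "S1".."S4", one per judged document pair.
    # Returns (opposition, disagreement) as defined in Equations 3 and 4.
    c = Counter(scenarios)
    total = sum(c.values())
    opposition = c["S2"] / float(c["S1"] + c["S2"]) if (c["S1"] + c["S2"]) else 0.0
    disagreement = (c["S2"] + c["S3"]) / float(total) if total else 0.0
    return opposition, disagreement

# Hypothetical distribution over 10 judged pairs:
print(opposition_and_disagreement(["S1"] * 5 + ["S2"] * 1 + ["S3"] * 2 + ["S4"] * 2))
# -> (0.1666..., 0.3)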

           NA > NB   NA = NB   NA < NB   Total
PA > PB    0.685     0.165     0.149     308
PAB, Good  0.37      0.30      0.32      149
PA < PB    0.168     0.21      0.618     309
PAB, Bad   0.34      0.43      0.23      96

Table 5: Agreement between nugget-based judgements (topic creators) and preference-based judgements (assessor2)

           CA > CB   CA = CB   CA < CB   Total
PA > PB    0.57      0.35      0.07      308
PAB, Good  0.22      0.55      0.22      149
PA < PB    0.08      0.307     0.605     309
PAB, Bad   0.125     0.75      0.125     96

Table 6: Multi-level graded relevance assessments (assessor1) versus preference judgements (assessor2)

                disagreement   opposition
pref vs. nugg   0.43           0.196
pref vs. grad   0.39           0.137

Table 7: Disagreement and opposition measures between different sets of judgements

We observe that nugget-based measures are consistent with the assessors' preference of documents when both sets of judgements agree that the documents shown are not equal (S1 and S2).

6. NOVELTY AND DIVERSITY
Almost all existing test collections are compiled under the assumption of one interpretation per query: each query is tightly associated with a single interpretation, and the relevance of a document is measured with respect to that one interpretation only. Sparck Jones et al. [12] explored ambiguity in user-submitted queries, where different sets of users have different information needs when submitting the same query and thereby a different notion of document relevance. A related study was conducted by Sanderson [10], who examined ambiguity in proper nouns such as names of people and places. Both studies stress the importance of having a test collection for evaluating ambiguous user queries.

Clarke et al. [5] examined the properties of novelty and diversity in retrieved results with the help of Question Answering track data. They proposed measures that reward retrieval systems for retrieving novel information (not present in previously read documents) and penalize systems for retrieving redundant information.

We wish to extend our framework of evaluation to compile test collections for evaluating ambiguous queries.

7. COLLECTION
The test collection compiled through this study is available for download^2. We modified the downloaded Wikipedia collection by removing the redirect pages and assigning each article a unique document identifier. We provide two sets of qrels with the collection: one follows the standard format, and the other lists the nuggets marked as covered in each document. The collection, along with the relevance judgements and test topics, is released under a Creative Commons license agreed upon by all topic creators.

8. CONCLUSION AND FUTURE WORK
In this paper, we proposed a framework for obtaining graded relevance assessments of a document based on the information nuggets covered by the document. Graded relevance values are assigned based on the number of associated nuggets covered. We then compiled a test collection to compare the graded relevance assessments obtained through three sets of judgements.

In our investigation, we observed that assessors implicitly prefer documents that cover more nuggets: when compared with pairwise preference judgements, the assessor tends to prefer the documents that contain more nuggets.

In this evaluation, we assumed that each associated nugget is of equal importance. In the future, we wish to extend the framework by assigning weights to nuggets indicating their importance for a given topic.

Although our framework provides a unified model for evaluating informational, navigational and transactional types of needs, we restricted our nuggets to one kind (informational). In the future, we would like to explore combinations of different kinds of needs.

We would also like to carry out a large-scale user study to compare the agreement among different kinds of graded relevance judgements.

Acknowledgements
We would like to thank Azin Ashkan, Ian McKinnon and Gerald Leung for their help in the compilation of our test collection.

9. REFERENCES
[1] A. Al-Maskari, M. Sanderson, and P. Clough. The relationship between IR effectiveness measures and user satisfaction. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 773-774, 2007.

[2] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR '98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335-336, New York, NY, USA, 1998. ACM Press.

2 http://ir.uwaterloo.ca/resources/wikiCollection.html

[3] B. Carterette, P. N. Bennett, D. M. Chickering, and S. T. Dumais. Here or there: Preference judgments for relevance. In Proceedings of the 30th European Conference on IR Research, pages 16-27, Glasgow, Scotland, 2008.

[4] Z. Chen and Y. Xu. User-oriented relevance judgment: A conceptual model. In HICSS '05: Proceedings of the 38th Annual Hawaii International Conference on System Sciences, Track 4, page 101.2, Washington, DC, USA, 2005. IEEE Computer Society.

[5] C. L. A. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova, A. Ashkan, S. Buttcher, and I. MacKinnon. Novelty and Diversity in Information Retrieval Evaluation. In Proceedings of the 31st ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2008), Singapore, July 2008.

[6] K. Jarvelin and J. Kekalainen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422-446, 2002.

[7] J. Lin. Is question answering better than information retrieval? A task-based evaluation framework for question series. In Proceedings of the 2007 Human Language Technology Conference and the North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT/NAACL 2007), pages 212-219, Rochester, New York, 2007.

[8] A. Nenkova, R. Passonneau, and K. McKeown. The pyramid method: Incorporating human content selection variation in summarization evaluation. ACM Transactions on Speech and Language Processing, 4(2):4, 2007.

[9] S. Robertson and N. Belkin. Ranking in principle. Journal of Documentation, 34(2):93-100, 1978.

[10] M. Sanderson. Ambiguous queries: Test collections need more sense. In Proceedings of the 31st ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2008), Singapore, 2008.

[11] E. Sormunen. Liberal relevance criteria of TREC: Counting on negligible documents? In SIGIR '02: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 324-330, New York, NY, USA, 2002. ACM.

[12] K. Sparck Jones, S. E. Robertson, and M. Sanderson. Ambiguous requests: Implications for retrieval tests. SIGIR Forum, 41(2):8-17, 2007.

[13] R. Tang, W. M. Shaw, Jr., and J. L. Vevea. Towards the identification of the optimal number of relevance categories. Journal of the American Society for Information Science, 50(3):254-264, 1999.

[14] E. M. Voorhees and H. T. Dang. Overview of the TREC 2005 question answering track. In Proceedings of the Fourteenth Text REtrieval Conference, Gaithersburg, Maryland, 2005.

[15] C. X. Zhai, W. W. Cohen, and J. Lafferty. Beyond independent relevance: Methods and evaluation metrics for subtopic retrieval. In SIGIR '03: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 10-17, New York, NY, USA, 2003. ACM Press.


Position Statements and Early Work Session

Is Relevance the Right Criterion for Evaluating Interactive Information Retrieval?
Nicholas J. Belkin, Ralf Bierig, Michael Cole, Information & Library Studies, Rutgers University
Abstract: The standard criterion on which to evaluate information retrieval systems has long been the relevance of document(s) to an information problem (or some operationalization of it, such as a query), and this criterion has been the basis of the two measures, recall and precision, usually based on a binary interpretation of relevance. However, there is also a long history of alternatives to relevance as a criterion, and a good deal of recent experimentation has adopted both other criteria and quite different measures. In this presentation, we suggest that: a) usefulness is a more appropriate evaluation criterion than relevance, in particular for interactive information retrieval (IIR); and b) there are unlikely to be universally applicable measures based on usefulness, and therefore measures for evaluating IIR need to be tailored not only to the task that led to the information-seeking behavior, but also to the searching tasks or information-seeking strategies in which a searcher is engaged at any point in an information-seeking episode. We present several examples of usefulness-based measures applicable to different work tasks, searching tasks, and information-seeking strategies in order to support our claim.

Beyond Relevance Judgments: Cognitive Shifts and Gratification
Amanda Spink, Frances Alvarado-Albertorio, Jia T. Du, Faculty of Information Technology, Queensland University of Technology
Abstract: Relevance judgments are an important element of evaluation for the field of information retrieval (IR). Recent studies also highlight the value of measuring users' cognitive shifts and gratification judgments during interactive IR and Web searching. This argues for an approach to interactive IR as a multi-level evaluation process, including users' relevance judgments, cognitive shifts and gratification judgments.

Set Based Retrieval: The Potemkin Buffet Model
Soo-Yeon Hwang, Paul Kantor, Michael J. Pazzani, Rutgers University
Abstract: We present an alternative scheme for the speedy presentation of a sample that can help a system and a user disambiguate each other's behavior and focus relevance feedback on the correct sector of an ambiguous query. This approach recognizes that retrieval valuation is intrinsically set-based, and that in some cases the most desirable set is one that broadly spans the multiple facets of a query.


Learning Diverse Rankings by Minimizing Abandonment

Filip Radlinski
Department of Computer Science
Cornell University, Ithaca, NY, USA
[email protected]

ABSTRACT
Ranking functions are usually evaluated using independent document relevance judgments: the relevance of each individual document is assessed by human judges. Optimizing metrics that aggregate over independent document scores, such as MAP and NDCG, does not take into account redundancy between the documents presented to users. This means that the most highly scoring rankings in terms of MAP and related metrics can be repetitive. In contrast, diverse rankings that capture different possible interpretations for a query are more likely to allow more users to quickly find at least one relevant document. As a result, post-processing is often used to diversify rankings presented to users.

In this talk, I will show how clickthrough data can be used in an online learning setting to directly learn diverse rankings of documents. The first part of this talk will present learning to rank using a regret minimization formulation. I will also describe the multi-armed bandit model, often used to analyze regret minimization problems. The second part of the talk will show how standard multi-armed bandit algorithms can be adapted for learning rankings. In particular, I will present two algorithms that maximize the fraction of users who click on at least one search result. Optimizing this measure minimizes abandonment, and learns diverse rankings.
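As a rough illustration of the kind of approach the abstract describes (and not necessarily either of the algorithms presented in the talk), the Python sketch below runs one epsilon-greedy bandit per rank position and rewards a position's bandit only when its own choice is clicked; when users click at most one result, the total reward per user equals one minus abandonment. The document names, click simulator and epsilon-greedy arm selection are all assumptions.

import random

class EpsilonGreedyBandit:
    # One bandit per rank position; the arms are the candidate documents.
    def __init__(self, arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {a: 0 for a in arms}
        self.values = {a: 0.0 for a in arms}   # estimated click rate per document

    def select(self, exclude):
        candidates = [a for a in self.values if a not in exclude]
        if random.random() < self.epsilon:
            return random.choice(candidates)
        return max(candidates, key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]  # running mean

def present_and_learn(bandits, user_clicks):
    # Build one ranking (one document per position), observe the clicks, and reward each
    # position's bandit only if the document it chose was clicked.
    chosen = []
    for bandit in bandits:
        chosen.append(bandit.select(exclude=set(chosen)))
    clicked = user_clicks(chosen)               # set of clicked documents (simulated here)
    for bandit, doc in zip(bandits, chosen):
        bandit.update(doc, 1.0 if doc in clicked else 0.0)
    return chosen

# Tiny simulation: two user intents, each clicking only its own document if present.
docs = ["d_sports", "d_finance", "d_other1", "d_other2"]
bandits = [EpsilonGreedyBandit(docs) for _ in range(2)]     # learn a 2-position ranking
for _ in range(2000):
    intent = random.choice(["d_sports", "d_finance"])
    present_and_learn(bandits, lambda ranking: {intent} & set(ranking))

ranking, used = [], set()
for b in bandits:                               # read off the learned (greedy) ranking
    doc = max((a for a in b.values if a not in used), key=lambda a: b.values[a])
    ranking.append(doc)
    used.add(doc)
print(ranking)  # with enough traffic this tends to cover both intents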


Clicks-vs-Judgments and Optimization

Nick Craswell
[email protected]
Microsoft Research Cambridge
7 JJ Thomson Ave
Cambridge, UK

ABSTRACT
Relevance judgments give us some information about user satisfaction. Usage logs give us another view. Then there are other criteria, like avoiding duplication, promoting diversity, eliminating detrimental results, and adjusting the results according to the personal profile of the user and their recent activity. Then we can ask a user "were you satisfied?" and "which is better?". Given the goal of optimizing a ranking ("learning to rank"), how can we take multiple criteria into account? This talk gives an overview of some approaches and open questions in the area. It also analyzes the relative strengths and weaknesses of click-log and relevance-judgment information in a Web search context.


A Test Collection of Preference Judgments

Ben Carterette
CIIR, UMass Amherst
140 Governors Drive
Amherst, MA 01003
[email protected]

Paul N. Bennett
Microsoft Research
One Microsoft Way
Redmond, WA
[email protected]

Olivier Chapelle
Yahoo! Research
2821 Mission College Blvd
Santa Clara, CA 95054
[email protected]

ABSTRACT
We describe an initial release of a set of binary preference judgments over a subset of the LETOR data. These judgments are meant to serve as a starting point for research into questions of evaluation and learning over non-binary, multi-item assessments.

1. INTRODUCTION
Information retrieval test collections traditionally contain binary judgments of the relevance of documents to queries. Recently there has been interest in generalizing to non-binary judgments: graded scales, aspect relevance, and preferences of the form "document A is preferred to document B". Preferences in particular are the foundation of several learning algorithms, such as the ranking SVM [6] and RankNet [2].

Though test collections exist for graded relevance and aspect relevance, most research on preferences to date has used preferences inferred from binary or graded judgments. We have undertaken to construct a test collection of true preference judgments. The initial release, described in more detail below, contains preference judgments for documents judged for the Topic Distillation task of the TREC 2003 Web track [5].

2. PREFERENCE TEST COLLECTION
The construction of this test collection is partially motivated by the use of preferences in algorithms for learning to rank. We decided to assemble preference judgments over a subset of the standard corpus used in learning-to-rank research: the LETOR (LEarning TO Rank) dataset [7].

The LETOR data consists of features and relevance judgments for about 1,000 documents judged for the TREC 2003 and 2004 Web tracks as well as features and relevance judgments for the medical abstracts in the OHSUMED corpus. This initial release includes preferences for the TREC 2003 queries only.

Topics. The topics are the 50 Topic Distillation topics from the TREC 2003 Web track. Each topic consists of a short title query and a longer description of the information need.

Corpus. The corpus is web pages in the .gov domain, crawled for the TREC 2003 Web track. LETOR provides features for the top 1,000 of these documents by an implementation of BM25 along with all documents judged relevant for the Topic Distillation task of the Web track. For this release, we have further restricted the corpus to all documents judged relevant for TREC, rounding out to 50 documents total by including the top documents ranked by LETOR's BM25 feature.

Judgments. Assessors were shown two documents and asked which they prefer for a given query. They could also judge a single document not relevant, effectively indicating that they would prefer every other (relevant) document to that one. They were not allowed to say two documents were equally relevant (except in the case of duplicates). Figure 1 shows a screenshot of the interface.

Based on our previous work [4], we assumed assessors would be transitive in their preferences and used that assumption to select pairs of documents to show. Along with the ability to judge documents nonrelevant, this reduced the total amount of effort needed to construct the collection. Generally one of the two documents was held fixed until it had been compared to all other documents.
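[Editorial illustration.] To make the savings from the transitivity assumption concrete, here is a small hypothetical Python sketch (not the authors' collection software, and a simplification of their pair-selection strategy): documents the assessor marks bad are set aside without any pairwise comparisons, and the remaining ones are ordered by a comparison sort, which under transitivity needs only O(n log n) explicit judgments rather than all n(n-1)/2 pairs. The is_bad and ask_preference callbacks stand in for the judging interface.

from functools import cmp_to_key

def collect_preferences(docs, is_bad, ask_preference):
    """is_bad(d): True if the assessor judges d nonrelevant outright.
    ask_preference(a, b): -1 if the assessor prefers a, 1 if b is preferred."""
    # A "bad" judgment implies every relevant document is preferred to d,
    # so such documents never need to be shown in a pair.
    relevant = [d for d in docs if not is_bad(d)]

    asked = []  # record which pairs were actually shown to the assessor

    def compare(a, b):
        asked.append((a, b))
        return ask_preference(a, b)

    # Sorting with the assessor as the comparator: assumed transitivity lets
    # the sort infer all remaining preferences from the judgments it asks for.
    ordered = sorted(relevant, key=cmp_to_key(compare))
    return ordered, asked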

The judgments were vetted and obvious errors corrected by one of the authors of this paper.

Agreement with TREC judgments. Our assessors did not always agree with the TREC judgments on whether a document was relevant or not. There are several explanations for this:

1. Task. The TREC judgments are for the Topic Distillation task, which is more specific than the traditional ad hoc task. Topic distillation focuses on retrieving home pages that act as gateways to the information and this is reflected in the judgments. Our preference judgments are closer to the ad hoc type, so that many pages that would have been judged nonrelevant for distillation are relevant (though perhaps not greatly preferred) for an ad hoc task by virtue of containing information about the topic.

2. Noise in TREC judgments. In the process of vetting our assessors' judgments, we found many documents that had been judged relevant for the distillation task but were tenuously related to the topic at best. Bernstein & Zobel reported a similar phenomenon in judgments for the GOV2 corpus [1].

3. Noise in our judgments. Judging preferences requires O(n log n) judgments, which for large n can become quite tiring. This may have contributed to noise in our preferences and disagreement with the TREC judgments.


Figure 1: A screenshot of the preference interface. The query and description are shown at the top. The two pages are shown in inline windows. The parent window can be resized to provide more space for the inline windows. Query terms are highlighted in the web pages. The progress indicator in the upper right lets the assessor know how close they are to completing the query.

3. DATA FORMAT
We have released two different versions of the preference judgments. The first gives the DOCNOs in the .gov corpus. The second uses the docids in the LETOR dataset and is intended for joining with those features.

Each line of both sets has four fields: the query number, two document IDs, and the preference judgment. A judgment of -1 indicates that the first document ID was preferred to the second; 1 indicates the opposite. A 0 means the two documents were judged to be duplicates. If either of the docids is NA, then the other document was judged to be nonrelevant. A 2 or -2 in the judgment column indicates this explicitly.

For the purposes of distribution, the preferences for each topic were reduced to the 50 necessary to reconstruct all 1,225 preferences (assuming transitivity). An example is shown in Figure 2.
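[Editorial illustration.] As a reading aid only, the hypothetical Python sketch below parses the four-field lines described above and then closes the chained judgments under transitivity to recover the full preference set; the field conventions follow this section, but the function names and data structures are ours.

from collections import defaultdict

def read_preferences(path):
    """Parse lines of the form: query_id doc_a doc_b judgment."""
    prefs, nonrelevant, duplicates = [], [], []
    with open(path) as f:
        for line in f:
            qid, a, b, judgment = line.split()
            j = int(judgment)
            if j in (2, -2):
                # One doc id is NA: the other document was judged nonrelevant.
                nonrelevant.append((qid, a if b == "NA" else b))
            elif j == 0:
                duplicates.append((qid, a, b))
            elif j == -1:
                prefs.append((qid, a, b))   # first document preferred
            else:                           # j == 1: second document preferred
                prefs.append((qid, b, a))
    return prefs, nonrelevant, duplicates

def close_under_transitivity(prefs):
    """Expand chains like A>B, B>C into A>C within each query."""
    better = defaultdict(set)
    for qid, a, b in prefs:
        better[(qid, a)].add(b)
    changed = True
    while changed:
        changed = False
        for (qid, a), worse in list(better.items()):
            for b in list(worse):
                extra = better.get((qid, b), set()) - worse
                if extra:
                    worse |= extra
                    changed = True
    return better

Run on the chains shown in Figure 2, for instance, the closure recovers the implied preference of G00-00-1006224 over G12-90-0628070.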

4. EVALUATION MEASURES
Retrieval systems are typically evaluated by some combination of precision, the proportion of retrieved documents that are relevant, and recall, the proportion of relevant documents that were retrieved. When "retrieved" is defined in terms of whether a document is ranked before some cutoff k, precision and recall can be calculated at any rank k.

We have proposed a generalization of precision and recall to preference judgments [3]. First we define a few new terms. We will say a pair of documents (i, j) is ordered by the system if one or both of i, j appears above rank k. A pair is unordered if neither i nor j is above k. A pair is correctly ordered if the system's ordering matches assessor preferences, and incorrectly ordered otherwise.

We then define precision of preferences (ppref) as the ratio of correctly ordered pairs to ordered pairs. For example, at rank k = 5 a system has effectively specified an ordering of five documents, and for each of these, orderings in relation to the remaining n - 5 documents (where n is the total corpus size). This yields 5(5 - 1)/2 + 5(n - 5) = 5n - 15 ordered pairs, and more generally k(2n - k - 1)/2 ordered pairs.

1 G00-00-1006224 G00-10-3849661 -1
1 G00-10-3849661 G12-90-0628070 -1
...
1 G37-09-0021242 G04-99-1871403 -1
1 G04-99-1871403 G34-06-2520482 -1
1 G34-06-2520482 NA -2

1 96 5044 -1
1 5044 322933 -1
...
1 848686 136972 -1
1 136972 783579 -1
1 783579 NA -2

Figure 2: Example preference judgments. The top judgments use the .gov DOCNOs; the assessor preferred document G00-00-1006224 to G00-10-3849661 and G00-10-3849661 to G12-90-0628070 (and hence preferred G00-00-1006224 to G12-90-0628070). Document G34-06-2520482 was judged nonrelevant. The bottom judgments are for the same documents, but using their corresponding LETOR docids.

Likewise, recall of preferences (rpref) is defined as the ratio of correctly ordered pairs to the total number of preferences made by assessors. For the example above, rpref would be the proportion of the full set of preferences that are correctly ordered among the 5n - 15 ordered pairs.

Note that ppref and rpref at rank k = ∞ are proportional to Kendall's τ rank correlation, which is a function of the number of incorrectly ordered pairs, i.e., the number of misclassified pairs.


Ties. There are several situations that can be considered "ties": for a given pair of documents, an assessor either judged them to be identical, judged both to be bad, or did not specify anything about them at all. These pairs may be ordered by a system, but it is not immediately clear how they should be treated for calculation of ppref and rpref. The solution we adopt is to simply not count them as either ordered or unordered, excluding them from both numerator and denominator of ppref and rpref.

Summary measures. Like traditional precision and recall, ppref and rpref can be plotted against each other for increasing k to create a precision-recall curve. ppref can be interpolated to create smooth curves, or averaged over ranks at which rpref increases, producing average precision of preferences (APpref).

Weighted preferences. Strictly speaking, precision and recall can only be calculated for binary relevance. Discounted cumulative gain (DCG) is a precision-like measure that supports graded (non-binary) relevance and discounting by rank. We can incorporate this idea into ppref and rpref as well, for when preferences have gradations ("strongly prefer", "slightly prefer", etc.) and to discount pairs by rank. We define a weight w_ij for each pair of ranks (i, j). By analogy to a commonly-used formulation of DCG, we set

    w_ij = (2^|pref_ij| - 1) / log2(min{i, j} + 1)

where pref_ij is the degree of preference between the documents at ranks i and j. Weighted precision of preferences (wppref) is the sum of the weights over ranks j > i for which the documents are correctly ordered divided by the total weight of all ordered pairs. Normalized wppref (nwppref), like normalized DCG (NDCG), is wppref divided by the best possible wppref at the same rank.

An implementation of these measures is available at http://ciir.cs.umass.edu/~carteret/preferences.html.
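[Editorial illustration.] The linked code is the reference implementation; purely as an illustration of the definitions above, here is a simplified Python sketch of ppref and rpref at a cutoff k, under one natural reading in which unranked documents are treated as ranked below everything and tied or unjudged pairs are simply skipped. The pair_weight helper mirrors the w_ij formula for the weighted variants. Input conventions are our own assumptions.

import math

def ppref_rpref(ranking, prefs, k):
    """ranking -- list of doc ids in system order (best first)
    prefs -- set of (better, worse) pairs expressed by the assessor,
             already closed under transitivity, with ties removed."""
    pos = {d: r for r, d in enumerate(ranking)}  # 0-based ranks
    unranked = len(ranking)                      # treat missing docs as last

    correct = ordered = 0
    for better, worse in prefs:
        rb = pos.get(better, unranked)
        rw = pos.get(worse, unranked)
        if rb >= k and rw >= k:
            continue                             # neither doc above rank k: unordered
        ordered += 1
        if rb < rw:                              # system agrees with the assessor
            correct += 1
    ppref = correct / ordered if ordered else 0.0
    rpref = correct / len(prefs) if prefs else 0.0
    return ppref, rpref

def pair_weight(pref_degree, rank_i, rank_j):
    """DCG-style weight for graded, rank-discounted preferences (1-based ranks)."""
    return (2 ** abs(pref_degree) - 1) / math.log2(min(rank_i, rank_j) + 1)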

5. BASELINE RESULTS
Joachims' RankSVM is an adaptation of the support vector machine to learning a ranker [6]. It optimizes a loss function based on preferences to learn a partial ordering of items. Its success has been demonstrated on the LETOR data [7], which consists of binary relevance judgments on the .gov documents referenced above.

We trained and tested a RankSVM using the preferences directly, with all O(n²) preferences inferred from the n that are provided for each query. We joined the preference labels with LETOR features, comprising such information as query term frequency and inverse document frequency, BM25 score, language modeling score, features of the HTML markup, and features of the link graph. Features were normalized to have standard deviation 1.
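[Editorial illustration.] The trained model is the RankSVM of [6]; as a hedged sketch of the same general idea rather than the authors' pipeline, the snippet below normalizes features to unit standard deviation and turns preference pairs into difference-vector examples for a linear classifier, the standard reduction for training a linear pairwise ranker. The use of scikit-learn and all names here are our own choices.

import numpy as np
from sklearn.svm import LinearSVC

def normalize(features):
    """Scale each feature to unit standard deviation, as described above."""
    X = np.vstack(list(features.values()))
    std = X.std(axis=0)
    std[std == 0] = 1.0
    return {d: v / std for d, v in features.items()}

def pairwise_training_set(features, prefs):
    """features -- dict mapping doc id to a numpy feature vector
    prefs -- iterable of (better, worse) doc id pairs."""
    X, y = [], []
    for better, worse in prefs:
        diff = features[better] - features[worse]
        X.append(diff)
        y.append(1)      # "better minus worse" labeled positive
        X.append(-diff)
        y.append(-1)     # mirrored pair labeled negative
    return np.vstack(X), np.array(y)

# Usage sketch: fit a linear SVM on the difference vectors and score documents
# by the learned weight vector (C would be chosen on validation queries, 5.1).
# norm = normalize(features)
# X, y = pairwise_training_set(norm, prefs)
# ranker = LinearSVC(C=1.0).fit(X, y)
# scores = {d: float(v @ ranker.coef_[0]) for d, v in norm.items()}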

We tested the standard linear kernel. We compared to two baselines: a random ranking of labeled documents and an SVM classifier trained with binary labels obtained by inference from preferences: documents marked "bad" were considered nonrelevant and all others relevant.

5.1 Training and Testing
The RankSVM was trained using the partitioning described by Liu et al. [7]: five folds, each consisting of 30 training queries, 10 validation queries, and 10 testing queries. The validation set was used to select RankSVM parameter C, the misclassification cost.
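[Editorial illustration.] A minimal sketch of this per-fold model selection, with hypothetical train and score helpers and an illustrative grid of C values:

def select_c_per_fold(folds, train, score, grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """folds -- list of dicts with 'train', 'validation', and 'test' query sets
    train(queries, c) -- fits a ranker with misclassification cost c
    score(model, queries) -- validation/test metric, e.g. APpref."""
    results = []
    for fold in folds:
        # Pick the C that maximizes the metric on the 10 validation queries.
        best_c = max(grid, key=lambda c: score(train(fold["train"], c),
                                               fold["validation"]))
        model = train(fold["train"], best_c)
        results.append((best_c, score(model, fold["test"])))
    return results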

[Figure 3 plot: x-axis recall of preferences (0.0 to 0.6), y-axis precision of preferences (0.5 to 1.0); curves for RankSVM with preferences and SVM with inferred binary judgments.]

Figure 3: Preference precision-recall curve for the linear-kernel RankSVM trained with preferences and the SVM trained with inferred binary judgments.

5.2 Results
Results for the RankSVM, the binary classifier, and the random ranker are shown in Table 1. Since 50 documents were judged for each query, no more than 50 were ranked by any of the methods; ppref@max and rpref@max therefore refer to the ppref and rpref at the maximum of 50 or the last rank at which any document was ranked. Figure 3 shows the preference precision-recall curve for the RankSVM trained with preferences. Note that over half the documents were judged "relevant", i.e. not judged by the assessors to be obviously bad, so ppref cannot be lower than 0.5, while rpref cannot be higher than ppref. Figure 3 also shows the preference precision-recall curve for the binary SVM (note that while the SVM was trained with binary judgments, its results were evaluated with the preference judgments, which are the "truth" in this setting). While the precision of the binary classifier is slightly higher at the top-most ranks, it drops off faster. The two curves hew closely to each other, but the preference curve is superior over most of the ranking.

The difference in performance between preferences and binary labels is small, but preferences provide superior performance for every evaluation measure. Furthermore, preferences are superior at nearly every point in the preference precision-recall curve, and where binary judgments give better performance, it is only a relatively small gain. The random baseline is quite high, likely because so many documents were judged "relevant". Additional research is necessary to understand differences between training over the two different types of data, but these results serve as a useful baseline for future research.


method                  ppref@10  ppref@25  ppref@max  rpref@10  rpref@25  rpref@max  APpref
RankSVM-linear (pref)   0.5997    0.5663    0.5549     0.2545    0.4484    0.5549     0.6039
SVM-linear (binary)     0.5835    0.5452    0.5455     0.2368    0.4072    0.5244     0.5987
random                  0.4835    0.4959    0.5003     0.1789    0.3663    0.4821     0.5278

Table 1: RankSVM results for preference data along with SVM results for binary classification and results for a random ranker.

6. CITATION
When using this data, please cite:

B. Carterette, P. N. Bennett, and O. Chapelle. A Test Collection of Preference Judgments. In SIGIR 2008 Workshop: Beyond Binary Relevance: Preferences, Diversity, and Set-Level Judgments. Edited by Bennett, Carterette, Chapelle, and Joachims.

Acknowledgments
Thanks to Susan Dumais and Microsoft Research, whose generous donation made the preference collection possible, to Thorsten Joachims for helpful feedback, and to our assessors. This work was supported in part by the Center for Intelligent Information Retrieval and in part by Microsoft Live Labs. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors.

7. REFERENCES
[1] Y. Bernstein and J. Zobel. Redundant documents and search effectiveness. In Proceedings of CIKM, pages 736–743, 2005.
[2] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of ICML, pages 89–96, 2005.
[3] B. Carterette and P. N. Bennett. Evaluation measures for preference judgments. In Proceedings of SIGIR, 2008. To appear.
[4] B. Carterette, P. N. Bennett, D. M. Chickering, and S. T. Dumais. Here or there: Preference judgments for relevance. In Proceedings of ECIR, pages 16–27, 2008.
[5] N. Craswell and D. Hawking. Overview of the TREC 2003 Web track. In Proceedings of TREC, pages 78–92, 2003.
[6] T. Joachims. Evaluating retrieval performance using clickthrough data. In Text Mining, pages 79–96. 2003.
[7] T.-Y. Liu, J. Xu, T. Qin, W. Xiong, and H. Li. LETOR: Benchmark dataset for research on learning to rank for information retrieval. In Learning to Rank for Information Retrieval workshop in conjunction with SIGIR, 2007.


Author Index

Alvarado-Albertorio, F. . . . . 28
Arni, T. . . . . 15
Aslam, J.A. . . . . 6
Belkin, N.J. . . . . 28
Bennett, P.N. . . . . 31
Bierig, R. . . . . 28
Carterette, B. . . . . 31
Chapelle, O. . . . . 31
Clarke, C.L.A. . . . . 22
Clough, P. . . . . 15
Cole, M. . . . . 28
Craswell, N. . . . . 30
Du, J.T. . . . . 28
Hwang, S. . . . . 28
Kanoulas, E. . . . . 6
Kantor, P. . . . . 28
Kolla, M. . . . . 22
Pazzani, M.J. . . . . 28
Radlinski, F. . . . . 29
Sanderson, M. . . . . 15
Spink, A. . . . . 28
Tang, J. . . . . 15
Vechtomova, O. . . . . 22
Xue, G. . . . . 7
Yu, Y. . . . . 7
Zha, H. . . . . 7
Zhou, K. . . . . 7
