
Relevance Ranking Metrics for Learning Objects

    Xavier Ochoa and Erik Duval, Member, IEEE Computer Society

Abstract: The main objective of this paper is to improve the current status of learning object search. First, the current situation is analyzed and a theoretical solution, based on relevance ranking, is proposed. To implement this solution, this paper develops the concept of relevance in the context of learning object search. Based on this concept, it proposes a set of metrics to estimate the topical, personal, and situational relevance dimensions. These metrics are calculated mainly from usage and contextual information and do not require any explicit information from users. An exploratory evaluation of the metrics shows that even the simplest ones provide a statistically significant improvement in the ranking order over the most common algorithmic relevance metric. Moreover, combining the metrics through learning algorithms sorts the result list 50 percent better than the baseline ranking.

Index Terms: Learning objects, relevance ranking, metadata, learning object repository, RankNet.

    1 INTRODUCTION

In a broad definition, learning objects are any digital documents that can be used for learning. Learning Object Repositories (LORs) exist to enable sharing of such resources [1]. To be included in a repository, learning objects are described by a metadata record, usually provided at publishing time. All current LORs provide or are coupled with some sort of search facility.

In the early stages of Learning Object deployment, these repositories were isolated and only contained a small number of learning objects [2]. The search facility usually provided users with an electronic form where they could select the values for their desired learning object. For example, through the early ARIADNE Search and Indexation tool [3], a user could select "English" as the language of the object, "Databases" as the subdiscipline, and "Slide" as the learning result type. The search engine then compared the values entered in the query with the values stored in the metadata of all objects and returned those which complied with those criteria. While initially this approach seems appropriate to find relevant learning objects, experience shows that it presents three main problems: 1) Common users (i.e., non-metadata experts) found this query approach too difficult and even overwhelming [4]. The cognitive load required to express their information need in the metadata standard used in the repository was too high. Metadata standards are useful as a way to interchange information between repositories but not as a user query interface. 2) In order for this approach to work, the metadata fields entered by the indexers need to correspond with the metadata fields used by the searchers. A usability study by Najjar et al. [5] found that this is usually not the case. Finally, 3) the high precision of this approach often leads to a low recall [6]. Because the repositories were small, most searches produced no results, discouraging the users.

Given these problems with metadata-based search, most repositories provided a Simple Search approach, based on the success of text-based retrieval exemplified by Web search engines [7]. In this approach, users only need to express their information needs in the form of keywords or query terms. The learning object search engine then compared those keywords with the text contained in the metadata, returning all the objects that contained the same words. This approach solved the three problems of metadata-based search: the searchers express their queries as a sequence of keywords, the metadata completeness was not as important as before because the query terms could be matched with any field or even the text of the object, and finally, the recall of the query's results increased. This approach seemed to be the solution for small repositories. However, working with small, isolated repositories also meant that an important percentage of users did not find what they were looking for because no relevant object was present in the repository [4].

Current research in the Learning Object community has produced technologies and tools that solve the scarcity problem. Technologies like SQI [8] and OAI-PMH [9] enable search over several repositories simultaneously. Another technology, ALOCOM [10], decomposes complex learning objects into smaller components that are easier to reuse. Finally, automatic generation of metadata based on contextual information [11] enables the conversion of the learning content of Learning Management Systems (LMSs) into metadata-annotated Learning Objects, ready to be stored in an LOR. Although these technologies are solving the scarcity problem, they are creating an inverse problem, namely, abundance of choice [12]. The user is no longer able to review several pages of results in order to manually pick the relevant objects.


. X. Ochoa is with the Centro de Tecnologías de Información, Escuela Superior Politécnica del Litoral, Campus Gustavo Galindo, Via Perimetral Km. 30.5, Apartado Guayaquil 09-01-5863, Ecuador. E-mail: [email protected].

. E. Duval is with the Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200 A, B-3001 Leuven, Belgium. E-mail: [email protected].

Manuscript received 21 Mar. 2008; accepted 19 June 2008; published online 17 July 2008. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TLTSI-2008-03-0030.

Digital Object Identifier no. 10.1109/TLT.2008.1. 1939-1382/08/$25.00 © 2008 IEEE. Published by the IEEE CS & ES.


The bias of the search engines toward recall only exacerbates this problem. The final result is that even if a very relevant object is present in the result list, the user still may not find it, again reducing the perceived usefulness of LORs.

While doing a stricter filtering of results (increasing precision at the expense of recall) could solve the oversupply problem, it could also lead again to the initial problem of scarcity. A proven solution for this problem is ranking, or ordering the result list based on its relevance. In this way, it does not matter how long the list is, because the most relevant results will be at the top and the user can manually review them. As almost all search engines use this method, searchers are not only used to working with these sorted lists of results but expect them [13]. To help the user find relevant learning objects, Duval [12] proposed the creation of LearnRank, a ranking function used to define the relevance of learning objects similarly to how PageRank [14] defines the relevance of web pages. Also, in a previous paper [15], the authors explore how Contextualized Attention Metadata (CAM) [16] could be mined to obtain meaningful information about the relevance of a specific learning object for a specific user and context.

This paper provides important progress in this direction, proposing and testing a set of multidimensional relevance ranking metrics. These metrics use external sources of information in addition to what is explicitly stated in the user query to provide a more meaningful relevance ranking than current query-matching implementations. The development of these metrics addresses three main questions: 1) What does relevance mean in the context of Learning Objects? 2) How do we convert the multidimensional relevance concept into numerical values that could be used for sorting? and 3) Can the proposed metrics outperform current generic ranking practices in LOR search?

The structure of this paper is as follows: Section 2 analyzes the current state of Learning Object Ranking. Section 3 discusses different dimensions of the relevance concept and how they translate to the context of Learning Object search. These relevance dimensions are used as guidelines in Section 4 to propose and compare a set of metrics that can rank a list of learning objects based on usage and contextual information. Section 5 presents different mechanisms by which these metrics could be combined into a unique rank value. To obtain a rough estimate of the benefit that these metrics could have in a real implementation, an exploratory study, where the metrics are compared against human relevance rankings and existing ranking methods, is presented in Section 6. This experiment also tests the efficacy of the metric combination. The paper closes with conclusions and open research questions.

2 CURRENT STATUS OF LEARNING OBJECT RANKING

Current LOR search interfaces already provide ranking functionalities based on generic ranking strategies. In this section, we present three categories of those strategies found in practice and in the literature. The advantages and disadvantages of each one are discussed. Finally, the profile of the ideal ranking strategy for learning objects is contrasted with these approaches.

    2.1 Ranking Based on Human Review

Some LORs sort their results mainly by peer evaluation. In the case of MERLOT [17], for instance, a special group of expert users has the ability to review the objects and grade their Content Quality, Effectiveness, and Ease of Use. The average of these three grades is considered the rating for the object. Peer reviewers also provide an explanation of the decisions behind the grade. The main advantage of this system is that it provides the searching user with a meaningful evaluation of the overall quality of the object and the possible scenarios where it could be useful. However, there are two main disadvantages of this approach. First, this is a very laborious, manual process. Not surprisingly, in the case of MERLOT for instance, only 10 percent of the objects contained in the repository have been peer-reviewed [18] (a more recent value of 12.3 percent was obtained through direct observation). This means that, even if an object is relevant for a user and it is returned in the result list but belongs to the 90 percent of not peer-reviewed material, the user will probably not find it, as it will be hidden in a deep result page. Even worse, an object that received a low score in the review will still be ranked higher than any other not-rated object regardless of its quality. A solution tried by MERLOT for this problem is to allow users to comment on and rate the objects. While this helps to increase the number of rated objects (circa 25 percent), it is still difficult to reach a majority of objects. Also, the user reviews are less detailed than the peer review, providing less help to the searcher.

The second disadvantage is that a human measurement of quality is static and does not adapt to different users' needs. For example, searching for "databases" in MERLOT will present, among the first results, highly rated educational databases of content. This answer, while useful for users that are searching for other repositories of learning materials, will not help the user that is looking for learning resources about relational databases, such as MySQL or Oracle. A similar approach is taken by Vargo et al. [19]. They use the data generated by users' evaluation of the quality of the learning object to sort the results. Users measure the LO quality using the Learning Object Review Instrument, a set of nine quality parameters that the learning object should meet. The main drawback of this ranking approach is the previously mentioned problem of the lack of scalability of user review.

In summary, using manual review of learning objects for ranking suffers from the same problem as indexing by experts or librarians: in most situations, humans do not scale [20]. While normally highly meaningful for the user, this approach will break down in a projected ecosystem where millions of objects are created every day. Another perceived problem of manual review ranking is that it cannot be easily adapted to different users or contexts. The recall of the top-k elements in the result list tends to be low, as relevant objects will not have been evaluated. The precision also varies depending on whether or not the context of the human evaluator was the same as that of the searcher.

    2.2 Ranking Based on Text Similarity

A completely different approach is followed by other repositories. We will take SMETE [21] as an example.



It relies on content-based calculations in order to assign a relevance value to all the objects returned in a search. In the case of SMETE, the tool calculates the similarity between the query terms and the text contained in the metadata of the learning objects, using some variation of vector-space algorithms [22]. This kind of algorithm creates a vector for each document and for the query, where each dimension is a word. The magnitude of the document or query vector in each word dimension is the frequency of the word divided by its frequency in the whole repository. This is similar to the algorithms used for basic text information retrieval [23] and early Web search engines [24]. Other examples are presented in the work of Chellappa [25], which summarizes the methodology followed by several repositories: adapting full-text search approaches to rank the learning objects based only on the similarity between the query terms and the text fields of the metadata record. This approach, using simple text-based metrics, has the advantage that it can be computed easily for each one of the objects in the list. Nonetheless, it presents two main disadvantages. First, the amount of text normally contained in the learning object metadata is low. This leads to equal values for several objects and to underperformance compared with the use of the same algorithm against a full-text index. Second, the order of the final list reflects how many times the query words appear in the metadata fields of the object, but it does not transmit to the user any notion of the quality or relevance of the object itself. Fine-grained but very relevant differences between learning objects, for example the targeted age group or educational context, are very difficult to capture with any of the current text analysis or clustering mechanisms. In order to obtain a more meaningful ranking, more contextual information is needed. For example, all current Web search engines use ranking algorithms that are heavily based on the analysis of the Web network [14] or click-through information [26] in order to improve the sorting of the result list.

In summary, using the distance between the text in the metadata and the query terms to rank learning objects, especially using advanced approaches that deal with synonyms such as Latent Semantic Analysis [27], leads to high recall. However, using only a text-based approach reduces the precision of the top-k results. The lack of additional information to address learning-specific aspects of the relevance ranking makes it uncertain whether objects in the top positions correlate well with the real information need of the learner.

    2.3 Ranking Based on User Profile

Olmedilla [28] proposes to compare topics provided in the user profile with the classification of the learning object. The closer the values of the profile and the object are in the taxonomy, the higher the relevance of the object. From a theoretical point of view, this approach should lead to a high precision in the top-k. However, in practice, this approach presents several handicaps: 1) Users need to explicitly select their interests from a taxonomy before performing the search. 2) It can only be applied to objects that have been classified with the same taxonomy as the one presented to the users.

Personalized ranking based on user profile has, for instance, been implemented in the HCD-Online tool. The personalized ranking of this tool was evaluated by Law et al. [29]. The results of the evaluation show that the text-based ranking (based on a Lucene [30] index) outperforms any of the personalized rankings. In a similar work, Dolog et al. [31] propose a rule-based personalization based on the semantic description of both the user profile and the learning object. The main disadvantage of this approach is that it requires very rich manual metadata annotation of both the user and the object in order to work.

In summary, while the idea of using the profile to personalize the search results works in similar environments [32], the way in which it is implemented could lead to unwanted results. Manually generated user profiles usually do not capture the real information need of the user [33], as this need is always changing depending on the context of the task that she is performing. The implicit learning of user profiles based on her interactions with the system seems to adapt better to changes in the needs of the user.

    2.4 Current Approaches versus Ideal Approach

All current approaches, namely, manually rating the objects, using only document information, or asking the user to provide a profile, have serious disadvantages. The first is not scalable, the second does not carry enough insight into the quality (and thus relevance) of the object, and the third does not integrate well into the normal workflow of the user. To enable a new generation of more useful search facilities for learning objects, an ideal ranking approach should take into account human-generated information (to be meaningful). It should be possible to calculate its value automatically, no matter the size of the repository (to be scalable), and it should not require conscious intervention from the user (to be transparent).

Other communities of practice, for example Web search and journal impact measurement, have already developed strategies that approximate the ideal approach. The PageRank [14] metric used to rank web pages is scalable: it is routinely and automatically recalculated for more than 11.5 billion pages [34]. It is also meaningful, as the top results are usually the most relevant pages for a query. And it is transparent, because it uses the information stored in the structure of the already existing Web graph to obtain its value. In the field of Scientometrics, the Impact Factor [35] metric calculates the relevance of a scientific journal in a given field. It is automatically calculated from more than 12 million citations, its results are perceived as useful, and editors do not need to provide any information beyond what is already contained in the published papers. These two examples prove that the implementation of ranking metrics that are scalable, meaningful, and transparent to the final user is feasible and desirable. Following the ideas that led to the development of these examples, this paper proposes and evaluates metrics to automatically calculate the relevance of learning objects based on usage and contextual information generated by the interaction of the user with learning object tools.

    3 RELEVANCE RANKING OF LEARNING OBJECTS

The first step in building metrics to rank learning objects by relevance is to understand what relevance means in the context of Learning Object search.



Borlund [36], after an extensive review of previous research on the definition of relevance for Information Retrieval, concludes that relevance is a multidimensional concept with no single measurement mechanism. Borlund defines four independent types of relevance:

1. System or Algorithmic relevance that represents how well the query and the object match.

2. Topical relevance that represents the relation between an object and the real-world topic of which the query is just a representation.

3. Pertinence, Cognitive, or Personal relevance that represents the relation between the information object and the information need, as perceived by the user.

4. Situational relevance that represents the relation between the object and the work task that generated the information need.

To operationalize these abstract relevance dimensions, they need to be interpreted for the specific domain of learning object search. Duval [12] defined nine quality in context or relevance characteristics of learning objects. These nine characteristics can be disaggregated into 11 atomic characteristics. In this paper, we map those characteristics to the dimensions proposed by Borlund. Table 1 presents those characteristics, the dimension to which they were mapped, and a brief explanation of their meaning. For a deeper discussion of the rationale behind those characteristics, we refer to [12].

Because the relevance characteristics of learning objects deal with the information need of the user and her preferences and context, but not with how the query is formulated, they do not map to the algorithmic dimension of relevance.

Those relevance characteristics that are related to what the learner wants to learn are mapped into the Topical Relevance dimension. Therefore, the only relevance characteristic mapped in this dimension is the learning goal. For example, if a learner is looking for materials about the concept of inheritance in Object Oriented Programming, the topical relevance of the object is related to how useful the object has been to learners studying courses related to Object Oriented Programming.

The relevance characteristics that are intrinsic to the learner and do not change with place, and only slowly with time, are mapped into the Personal Relevance dimension. Inside this group are the motivation, culture, language, educational level, and accessibility needs. Further elaborating the previous example, we can imagine that the same learner feels more comfortable with objects in Spanish (her mother tongue) and is more motivated by visual information. The learner will find a slide presentation with graphics and descriptions in Spanish more relevant than a text document in English about the same subject, even if both have been successfully used to explain inheritance in Object Oriented Programming courses.

Finally, those relevance characteristics that deal with conditions and limitations that depend on the learning task, as well as the device, location, and time, are mapped into the Situational Relevance dimension. Continuing the example, if the learner is doing her learning on a mobile device while commuting on the train, she will find material that can be formatted for the limited screen size more relevant for that context.

The information needed to estimate these relevance dimensions is not only contained in the query parameters and the learning object metadata but also in records of historical usage and the context where the query takes place. It is assumed that this information is available to the relevance ranker. This could seem unrealistic for classical Learning Object search, where the users, usually anonymous, perform their queries directly against the LOR through a Web interface and the only information available is the query terms. On the other hand, new implementations of LMSs, or plug-ins for older implementations such as Moodle [37] and BlackBoard [38], as well as plug-ins for authoring environments [39], enable the capture of this information by providing logged-in users with learning object search capabilities as part of their workflow during the creation and consultation of courses and lessons. Moreover, the development of CAM [16] to log the user interactions with different tools in a common format will help with the collection, and simplify the analysis, of usage and contextual information.

While this interpretation of the relevance concept is exemplified in a traditional or academic learning environment, it is at least as valid in less structured or informal settings such as corporate training or in situ learning, given that the environments used to assist such learning also store information about the user and the context where the search takes place. This information takes the form of personal profiles, preferences, problem descriptions, and previous and required competences.

A ranking mechanism that could measure some combination of the abovementioned learning relevance characteristics should provide the user with meaningfully ordered learning objects in the result list. Section 4 will propose pragmatic metrics that estimate those characteristics based on usage and contextual information in order to create a set of multidimensional relevance ranking metrics for learning objects. While not every characteristic is considered, at least two metrics are proposed for each relevance dimension.


TABLE 1. Map of Duval's Quality in Context Characteristics into Borlund's Relevance Dimensions.


    4 RANKING METRICS FOR LEARNING OBJECTS

To enable learning object search tools to estimate the relevance characteristics described in the previous section, those characteristics should be operationalized as ranking metrics that can be calculated automatically. The metrics proposed here are inspired by methods currently used to rank other types of objects, for example books [40], scientific journals [35], TV programs [41], and so forth. They are adapted to be calculable from the information available from the usage and context of learning objects. These metrics, while not proposed as a complete or optimal way to compute the real relevance of a learning object for a given user and task, are a first step to set a strong baseline implementation with which the effectiveness and efficiency of more advanced learning-specific metrics can be compared.

The following metrics are grouped according to the Relevance dimension (Table 1) that they estimate. There are at least two metrics for each dimension, describing different methods in which that relevance can be calculated from different information sources. Each metric is described below by 1) the raw data it requires and 2) the algorithm to convert that data into concrete ranking values. Also, for each metric, 1) an example is provided to illustrate its calculation and 2) methods to bootstrap the calculation of the metric in a real environment are discussed.

At the end of this section, the metrics are compared and a selection table is provided according to the desired relevance dimension and the information availability.

    4.1 Topical Relevance Ranking Metrics

Metrics to estimate the Topical Relevance should establish which objects are more related to what a given user wants to learn. The first step in the calculation of this type of metric is to estimate which topic interests the user. The second step is to establish the topic to which each learning object in the result list belongs. There are several ways in which the first part, the topic that interests the user, can be obtained: the query terms used, the course from which the search was generated, and the previous interactions of the user with the system [42]. For the second part, establishing the topicality of the objects, the information can be obtained from the classification in the learning object metadata, from the topical preference of previous learners that have used the object, or from the topic of the courses that the object belongs to. Once the topical need of the user and the topic described by the object are obtained, the Topical Relevance metric is calculated as the distance between the two. The following sections describe three possible Topical Metrics based on different sources of information.

    4.1.1 Basic Topical Relevance Metric (BT)

This metric makes two naïve assumptions. The first assumption is that the topic needed by the user is fully expressed in the query. The second assumption is that each object is relevant to just one topic. As a consequence of these two assumptions, the degree of relevance of an object to the topic can be easily estimated as the relevance of the object to that specific query. That relevance is calculated by counting the number of times the object has been previously selected from the result list when the same (or similar) query terms have been used. Defining NQ as the total number of similar queries of which the system keeps record, the BT relevance metric is the sum of the times that the object has been selected in any of those queries (2). This metric is an adaptation of the Impact Factor metric [35], in which the relevance of a journal in a field is calculated by simply counting the number of references to papers in that journal during a given period of time:

$$\mathrm{selected}(o, q) = \begin{cases} 1, & \text{if } o \text{ was clicked in } q, \\ 0, & \text{otherwise,} \end{cases} \qquad (1)$$

$$BT(o, q) = \sum_{i=1}^{NQ} \mathrm{distance}(q, q_i) \cdot \mathrm{selected}(o, q_i). \qquad (2)$$

In (1) and (2), o represents the learning object to be ranked, q is the query performed by the user, and q_i is the representation of the ith previous query. The distance between q and q_i can be seen as the similarity between the two queries. This similarity can be calculated either as the semantic distance between the query terms (for example, their distance in WordNet [43]) or as the number of objects that both queries have returned in common. NQ is the total number of queries.

Example. We assume that the query history of the search engine consists of queries QA, QB, and QC. In QA, objects O1 and O2 were selected; in QB, objects O2 and O3; and in QC, objects O1 and O2. A new query Q is performed, and objects O1, O2, O3, and O4 are present in the result list. The distance between Q and QA is 1 (both are the same query), between Q and QB is 0.8 (they are similar queries), and between Q and QC is 0 (unrelated queries). The BT metric value of O1 is equal to 1·1 + 0.8·0 + 0·1 = 1; for O2, it is 1.8; for O3, it is 0.8; and for O4, it is 0. The order of the final result list ranked by BT would be (O2, O1, O3, O4).

Data and initialization. In order to calculate this metric, the search engine needs to log the selections made for each query. If no information is available, the metric assigns the value of 0 to all objects, basically not affecting the final rank. When information, in the form of user selections, starts entering the system, the BT rank starts to boost previously selected objects higher in the result list. One way to avoid this initial training phase is to provide query-object pairs given by experts or obtained from information logged in previous versions of the search engine.
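As a concrete illustration of (1) and (2), the following minimal Python sketch computes BT scores from a query log. The log layout (a mapping from past queries to the objects clicked for them), the query_distance callback, and all identifiers are assumptions made for this sketch, not part of the original system.

```python
# Sketch of the Basic Topical relevance metric (BT), eq. (2).
# Assumes a query log of the form {query_id: set_of_clicked_object_ids}
# and a caller-supplied query_distance(q, qi) in [0, 1]; both interfaces
# are hypothetical.

def bt_scores(query, result_objects, query_log, query_distance):
    scores = {o: 0.0 for o in result_objects}
    for past_query, clicked in query_log.items():
        sim = query_distance(query, past_query)
        if sim == 0:
            continue
        for o in result_objects:
            if o in clicked:          # selected(o, q_i) = 1
                scores[o] += sim      # distance(q, q_i) * selected(o, q_i)
    return scores

# Reproducing the worked example above:
log = {"QA": {"O1", "O2"}, "QB": {"O2", "O3"}, "QC": {"O1", "O2"}}
dist = lambda q, qi: {"QA": 1.0, "QB": 0.8, "QC": 0.0}[qi]
print(bt_scores("Q", ["O1", "O2", "O3", "O4"], log, dist))
# expected: {'O1': 1.0, 'O2': 1.8, 'O3': 0.8, 'O4': 0.0}
```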

4.1.2 Course-Similarity Topical Relevance Ranking (CST)

In the context of formal learning objects, the course in which the object will be reused can be directly used as the topic of the query. Objects that are used in similar courses should be ranked higher in the list. The main problem in calculating this metric is to establish which courses are similar. A very common way to establish this relationship is described by SimRank [44], an algorithm that analyzes object-to-object relationships to measure the similarity between those objects. In this metric, the relation graph is established between courses and learning objects. Two courses are considered similar if they have a predefined percentage of learning objects in common.



This relationship can be calculated by constructing a two-partite graph where courses are linked to the objects published in them. This graph is folded over the object partition, leaving a graph representing the existing relationships and strengths between courses. The number of objects shared between two courses, represented in this new graph as the number of links between them, determines the strength of the relationship. A graphical representation of these procedures can be seen in Fig. 1. The ranking metric is then calculated by counting the number of times that a learning object in the list has been used in the universe of courses (5). This metric is similar to the calculation made by e-commerce sites such as Amazon [40], where, in addition to the current item, other items are recommended based on their probability of being bought together:

$$\mathrm{present}(o, c) = \begin{cases} 1, & \text{if } o \in c, \\ 0, & \text{otherwise,} \end{cases} \qquad (3)$$

$$\mathrm{SimRank}(c_1, c_2) = \sum_{i=1}^{NO} \mathrm{present}(o_i, c_1) \cdot \mathrm{present}(o_i, c_2), \qquad (4)$$

$$CST(o, c) = \sum_{i=1}^{NC} \mathrm{SimRank}(c, c_i) \cdot \mathrm{present}(o, c_i). \qquad (5)$$

In (3), (4), and (5), o represents the learning object to be ranked, c is the course where it will be inserted or used, c_i is the ith course present in the system, NC is the total number of courses, and NO is the total number of objects.

Example (Fig. 1). We assume that three courses are registered in the system: C1, C2, and C3. Objects O1, O3, and O4 are used in C1; objects O2, O4, and O6 in C2; and objects O2, O3, O5, and O6 in C3. The SimRank between C1 and C2 is 1, between C1 and C3 is 1, and between C2 and C3 is 2. A query is performed from C2 and the result list contains the objects O1, O3, and O5. The CST value for O1 is 1·1 + 2·0 = 1; for O3, it is 3; for O5, it is 2. The order of the final result list ranked by CST would be (O3, O5, O1).

Data and initialization. To apply the CST, the search engine should have access to the information from one or several LMSs, such as Moodle or Blackboard, where learning objects are being searched and inserted. First, it needs to create a graph with the current courses and the objects that they use in order to calculate the SimRank between courses. Second, it needs to obtain, along with the query terms, the course where the query was performed. In a system without this information, the CST will return 0, leaving the rank of the results unaffected. When the first insertion results are obtained from the LMS, the CST can start to calculate course similarities and, therefore, rankings for the already used objects. This metric could be bootstrapped from the information already contained in common LMSs or Open Courseware initiatives [45].
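A minimal Python sketch of (3)-(5) is shown below, assuming course membership is available as a mapping from courses to the sets of objects they contain; that data layout and the identifiers are illustrative assumptions.

```python
# Sketch of the Course-Similarity Topical relevance metric (CST), eqs. (3)-(5).

def sim_rank(courses, c1, c2):
    # Number of objects the two courses share (eq. 4).
    return len(courses[c1] & courses[c2])

def cst_scores(query_course, result_objects, courses):
    scores = {o: 0.0 for o in result_objects}
    for ci, members in courses.items():
        if ci == query_course:
            continue                  # the query course itself adds nothing here
        weight = sim_rank(courses, query_course, ci)
        for o in result_objects:
            if o in members:          # present(o, c_i) = 1
                scores[o] += weight
    return scores

# Worked example above (query issued from C2):
courses = {"C1": {"O1", "O3", "O4"}, "C2": {"O2", "O4", "O6"},
           "C3": {"O2", "O3", "O5", "O6"}}
print(cst_scores("C2", ["O1", "O3", "O5"], courses))
# expected: {'O1': 1.0, 'O3': 3.0, 'O5': 2.0}
```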

    4.1.3 Internal Topical Relevance Ranking (IT)

If there is no usage information available, but there exists a linkage between objects and courses, the Basic Topical Relevance Rank can be refined using an adaptation of the HITS algorithm [46] proposed to rank web pages. This algorithm states the existence of hubs, pages that mostly point to other useful pages, and authorities, pages with comprehensive information about a subject. The algorithm presumes that a good hub is a document that points to many good authorities, and a good authority is a document that many good hubs point to. In the context of learning objects, courses can be considered as hubs and learning objects as authorities. To calculate the metric, a two-partite graph is created with each object in the list linked to its containing courses. The hub value of each course is then calculated as the number of in-bound links that it has. A graphical representation can be seen in Fig. 2. Finally, the rank of each object is calculated as the sum of the hub values of the courses where it has been used:

$$IT(o) = \mathrm{authority}(o) = \sum_{i=1}^{N} \mathrm{degree}(c_i) \;\big|\; c_i \text{ includes } o. \qquad (6)$$

In (6), o represents the learning object to be ranked, c_i represents the ith course where o has been used, and N is the total number of courses where o has been used.

Example (Fig. 2). We assume that in response to a query, objects O1, O2, O3, O4, and O5 are returned. From the information stored in the system, we know that O1 is used in course C1; O2, O3, and O4 in C2; and O4 and O5 in C3. The hub value of C1 (its degree in the graph) is 1, of C2 is 3, and of C3 is 2. The IT metric for O1 is 1, the hub value of C1. For O2 and O3, the value is 3, the hub value of C2. For O4, IT is the sum of the hub values of C2 and C3, i.e., 5. For O5, it is 2. The order of the final result list ranked by IT would be (O4, O2, O3, O5, O1).

Data and initialization. The calculation of IT needs information from LMSs. Similarly to CST, IT uses the relationship between courses and objects. On the other hand, IT does not need information about the course at query time (QT), so it can be used in anonymous Web searches.


    Fig. 1. Calculation of SimRank between courses for CST.

    Fig. 2. Calculation of Internal Topical Relevance Ranking (IT).


Course-Object relationships can be extracted from existing LMSs that contribute objects to the LOR and can be used as bootstrapping data for this metric. An alternative calculation of this metric can use User-Object relationships in case LMS information is not available.
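The following Python sketch illustrates (6) under the assumption that the Course-Object relationships are available as a mapping from courses to object sets; the hub value of a course is its degree in the graph built from the returned objects only, as described above.

```python
# Sketch of the Internal Topical relevance metric (IT), eq. (6).

def it_scores(result_objects, courses):
    returned = set(result_objects)
    # Hub value of each course: how many of the returned objects it contains.
    hub = {c: len(members & returned) for c, members in courses.items()}
    # Authority (IT) value of each object: sum of the hub values of its courses.
    return {o: sum(hub[c] for c, members in courses.items() if o in members)
            for o in result_objects}

# Worked example above (Fig. 2):
courses = {"C1": {"O1"}, "C2": {"O2", "O3", "O4"}, "C3": {"O4", "O5"}}
print(it_scores(["O1", "O2", "O3", "O4", "O5"], courses))
# expected: {'O1': 1, 'O2': 3, 'O3': 3, 'O4': 5, 'O5': 2}
```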

    4.2 Personal Relevance Ranking Metrics

As discussed in Section 3, the Personal Relevance metrics should try to establish the learning preferences of the user and compare them with the characteristics of the learning objects in the result list. The most difficult part of these metrics is to obtain, transparently, an accurate representation of the personal preferences. The richest source of information about these preferences is the attention metadata that could be collected from the user [47]. There are different ways in which this metadata could be used to determine a profile for each user or the similarity between users. For example, Mobasher et al. [48] present some strategies to build user profiles from Web access data, and Pampalk et al. [49] discuss the generation of playlists based on user skipping behavior. The second step in this metric calculation is to obtain the characteristics of the objects. If metadata is present, this process is vastly simplified, because there already exists a description of the characteristics of the object. However, if metadata is incomplete or inaccurate, contextual and usage information can be used to automatically generate the desired metadata values [11]. The following sections present the calculation of two possible Personal Relevance metrics for learning objects.

    4.2.1 Basic Personal Relevance Ranking (BP)

The easiest and least intrusive way to generate user preference information is to analyze the characteristics of the learning objects that the user has previously used. First, for a given user, a set of the relative frequencies of the different metadata field values present in those objects is obtained:

$$\mathrm{cont}(o, f, v) = \begin{cases} 1, & \text{if } \mathrm{val}(o, f) = v, \\ 0, & \text{otherwise,} \end{cases} \qquad (7)$$

$$\mathrm{freq}(u, f, v) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{cont}(o_i, f, v) \;\big|\; o_i \text{ used by } u. \qquad (8)$$

In these equations, val(o, f) represents the value of the field f in the object o. The frequencies for each metadata field are calculated by counting the number of times that a given value is present in the given field of the metadata. For example, if a user has accessed 30 objects, of which 20 had Spanish as language and 10 had English, the relative frequency set for the field Language will be (es = 0.66, en = 0.33). This calculation can be easily performed for each of the categorical fields (fields that can only take a value from a fixed vocabulary). Other types of fields (numerical and free text) can also be used in this calculation if they are categorized. For example, the numerical field Duration, which contains the estimated time to review the object, can be transformed into a categorical one by clustering the duration values into meaningful buckets: (0-5 minutes, 5-30 minutes, 30 minutes-1 hour, 1-2 hours, more than 2 hours). For text fields, keywords present in a predefined thesaurus could be extracted. An example of this technique is presented in [50].

Once the frequencies are obtained, they can be compared with the metadata values of the objects in the result list. If a value present in the user preference set is also present in the object, the object receives a boost in its rank equal to the relative frequency of the value. This procedure is repeated for all the values present in the preference set and the NF selected fields of the metadata standard:

$$BP(o, u) = \sum_{i=1}^{NF} \mathrm{freq}(u, f_i, \mathrm{val}(o, f_i)) \;\big|\; f_i \text{ present in } o. \qquad (9)$$

This metric is similar to that used for automatically recording TV programs in Personal Video Recorders [41]. The metadata of the programs watched by the user, such as genre, actors, director, and so forth, is averaged and compared against the metadata of new programs to select which ones will be recorded.

In (7), o represents the learning object to be ranked, f represents a field in the metadata standard, and v is a value that the field f could take. Additionally, in (8), u is the user, o_i is the ith object previously used by u, and N is the total number of those objects. In (9), f_i is the ith field considered for the calculation of the metric and NF is the total number of those fields.

Example. We assume that a given learner has previously used three objects: O1, O2, and O3. O1 is a Computer Science-related slide presentation in English. O2 is a Computer Science-related slide presentation in Spanish. O3 is a Math-related text document in Spanish. If the previously mentioned technique is used to create the profile of the learner, the result will be learner = (Classification: ComputerScience = 0.67, Math = 0.33; LearningResourceType: slide = 0.67, narrative text = 0.33; Language: en = 0.33, es = 0.67). The learner performs a query and the result list contains the objects O4, O5, and O6. O4 is a Computer Science-related text document in English, O5 is a Math-related figure in Dutch, and O6 is a Computer Science-related slide presentation in Spanish. The BP value for O4 is 0.67·1 + 0.33·1 + 0.33·1 = 1.33. For O5, it is 0.33. For O6, it is 1.66. The order of the final result list ranked by BP would be (O6, O4, O5).

Data and initialization. The BP metric requires the metadata information about the objects previously selected by the users. The identifiers of the user and the objects can be obtained from the logs of the search engine (given that the user is logged in at the moment of the search). Once the identifiers are known, the metadata can be obtained from the LOR. A profile for each user can be created offline and updated regularly. To bootstrap this metric, the contextual information of the user can be transformed into the first profile. For example, if the user is registered in an LMS, we will have information about her major and educational level. Information collected at the registration phase could also be used to estimate user age and preferred language.
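A minimal Python sketch of (7)-(9) follows. Object metadata is represented here as a plain dictionary of categorical fields; this representation and the field names are assumptions for illustration only.

```python
# Sketch of the Basic Personal relevance metric (BP), eqs. (7)-(9).

from collections import Counter

def build_profile(used_objects_metadata, fields):
    # Relative frequency of each value per field over the user's history (eq. 8).
    n = len(used_objects_metadata)
    return {f: {v: c / n
                for v, c in Counter(md[f] for md in used_objects_metadata if f in md).items()}
            for f in fields}

def bp_score(obj_metadata, profile):
    # Sum of the profile frequencies of the values present in the object (eq. 9).
    return sum(profile.get(f, {}).get(v, 0.0) for f, v in obj_metadata.items())

# Worked example above: history of the learner and candidate object O4.
history = [
    {"classification": "ComputerScience", "type": "slide", "language": "en"},
    {"classification": "ComputerScience", "type": "slide", "language": "es"},
    {"classification": "Math", "type": "narrative text", "language": "es"},
]
profile = build_profile(history, ["classification", "type", "language"])
o4 = {"classification": "ComputerScience", "type": "narrative text", "language": "en"}
print(round(bp_score(o4, profile), 2))   # expected: 1.33
```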

4.2.2 User-Similarity Personal Relevance Ranking (USP)

The Basic Personal Relevance Metric relies heavily on the metadata of the learning object in order to be effective. But metadata is not always complete or reliable [51].



A more robust strategy to rank objects according to personal preferences is to find the number of times similar users have reused the objects in the result list. To find similar users, we can apply the SimRank algorithm previously used to obtain the CST metric. A two-partite graph contains the objects linked to the users who have reused them. The graph is folded over the object partition and a relationship between the users is obtained. The relationship graph is used to calculate the USP metric, as in (11). The final calculation is performed by adding the number of times similar users have reused the object. This kind of metric is the one used, for example, by Last.fm and other music recommenders [52], which present new songs based on what similar users are listening to; similarity is defined in this context as the number of shared songs in their playlists:

$$\mathrm{hasReused}(o, u) = \begin{cases} 1, & \text{if } o \text{ was used by } u, \\ 0, & \text{otherwise,} \end{cases} \qquad (10)$$

$$USP(u, o) = \sum_{i=1}^{NU} \mathrm{SimRank}(u, u_i) \cdot \mathrm{hasReused}(o, u_i). \qquad (11)$$

In (10) and (11), o represents the learning object to be ranked, u is the user that performed the query, u_i is the representation of the ith user, and NU is the total number of users.

Example. We assume that there are four users registered in the system: U1, U2, U3, and U4. User U1 has previously downloaded objects O1, O2, and O3; user U2, objects O2, O3, and O5; user U3, objects O2, O5, and O6; user U4, objects O5 and O6. User U1 performs a query and objects O4, O5, and O6 are present in the result list. The SimRank between U1 and U2 is 2, between U1 and U3 is 1, and between U1 and U4 is 0. The USP metric for O4 is 2·0 + 1·0 + 0·0 = 0; for O5, it is 3; and for O6, it is 1. The order of the final result list ranked by USP would be (O5, O6, O4).

Data and initialization. The USP metric uses the User-Object relationships. These relationships can be obtained from the logging information of search engines (if the user is logged in during their interactions with the learning objects). The USP does not need metadata information about the learning objects and can work over repositories that do not store a rich metadata description. If no data is available, the metric returns 0 for all objects, not affecting the final ranking. To bootstrap this metric when there is no previous User-Object relationship information, the User-Course and Course-Object relationships obtainable from LMS systems could be used.
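A minimal Python sketch of (10)-(11), assuming the reuse history is available as a mapping from users to the sets of objects they have reused; SimRank between two users is simply the number of objects they share, as in the CST metric.

```python
# Sketch of the User-Similarity Personal relevance metric (USP), eqs. (10)-(11).

def usp_scores(query_user, result_objects, histories):
    scores = {o: 0.0 for o in result_objects}
    for ui, reused in histories.items():
        if ui == query_user:
            continue
        sim = len(histories[query_user] & reused)   # SimRank(u, u_i)
        for o in result_objects:
            if o in reused:                         # hasReused(o, u_i) = 1
                scores[o] += sim
    return scores

# Worked example above (query performed by U1):
histories = {"U1": {"O1", "O2", "O3"}, "U2": {"O2", "O3", "O5"},
             "U3": {"O2", "O5", "O6"}, "U4": {"O5", "O6"}}
print(usp_scores("U1", ["O4", "O5", "O6"], histories))
# expected: {'O4': 0.0, 'O5': 3.0, 'O6': 1.0}
```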

    4.3 Situational Relevance Ranking Metrics

The Situational Relevance metrics try to estimate the relevance of the object in the result list to the specific task that caused the search. In the learning object environment, this relevance is related to the learning environment in which the object will be used, as well as the time, space, and technological constraints that are imposed by the context where the learning will take place. Contextual information is needed in order to establish the nature of the task and its environment. When some description of the context is extracted from this information, it can be used to rank the objects. Again, these characteristics could be extracted from the object metadata or from information already captured about the previous usage of the objects. The following sections present two alternative methods to calculate Situational Relevance metrics.

    4.3.1 Basic Situational Relevance Ranking (BS)

In formal learning contexts, the description of the course, lesson, or activity in which the object will be inserted is a source of contextual information. Such information is usually written by the instructor to indicate to the students what the course, lesson, or activity will be about. Keywords can be extracted from these texts and used to calculate a ranking metric based on the similarity between the keyword list and the content of the textual fields of the metadata record. To perform this calculation, the similarity is defined as the cosine distance between the TF-IDF vector of contextual keywords and the TF-IDF vector of words in the text fields of the metadata of the objects in the result list:

$$BS(o, c) = \frac{\sum_{i=1}^{M} tv_i \cdot ov_i}{\sqrt{\sum_{i=1}^{M} tv_i^2} \cdot \sqrt{\sum_{i=1}^{M} ov_i^2}}. \qquad (12)$$

The TF-IDF is a measure of the importance of a word in a document that belongs to a collection. TF is the Term Frequency, the number of times that the word appears in the current text. IDF is the Inverse Document Frequency, the inverse of the number of documents in the collection in which the word is present. This procedure is based on the vector space model for information retrieval [53]. A parallel application of this type of metric has been developed by Yahoo for the Y!Q service [54], which can perform contextualized searches based on the content of the web page in which the search box is located.

In (12), o represents the learning object to be ranked, c is the course where the object will be used, tv_i is the ith component of the TF-IDF vector representing the keywords extracted from the course description, ov_i is the ith component of the TF-IDF vector representing the text in the object description, and M is the dimensionality of the vector space (the number of different words).

Example. We assume that an instructor creates a new lesson inside an LMS with the following description: "Introduction to Inheritance in Java". The instructor then searches for learning objects using the term "inheritance". The result list is populated with three objects. O1 has as description "Introduction to Object-Oriented languages: Inheritance", O2 has "Java Inheritance", and O3 has "Introduction to Inheritance". The universe of words extracted from the descriptions of the objects would be (introduction, inheritance, java, object-oriented, languages). The TF-IDF vector for the terms in the lesson description is then (1/2, 1/3, 1/1, 0/1, 0/1). For the description of object O1, the vector is (1/2, 0/3, 0/1, 1/1, 1/1). For O2, it is (0/2, 1/3, 1/1, 0/1, 0/1). For O3, it is (1/2, 1/3, 0/1, 0/1, 0/1). The cosine distance between the vector of the lesson description and O1 is (0.5·0.5 + 0.33·0 + 1·0 + 0·1 + 0·1) / (√(0.5² + 0.33² + 1²) · √(0.5² + 1² + 1²)) ≈ 0.14. For O2, it is 0.90, and for O3, it is 0.51. The order of the final result list ranked by BS would be (O2, O3, O1).



Data and initialization. To calculate the BS metric, the only information needed is the text available in the context and the object metadata. The text information of the context should be provided at QT. The information needed to bootstrap this metric is a corpus with the text available in the object metadata, to provide the value of the IDF of each word.
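The Python sketch below illustrates (12). The weighting (binary term presence divided by the document frequency over the object descriptions) and the small stopword list are assumptions chosen to mirror the worked example; exact scores may differ slightly from the text depending on tokenization, but the resulting order is the same.

```python
# Sketch of the Basic Situational relevance metric (BS), eq. (12):
# cosine similarity between a vector built from the lesson/course description
# and a vector built from each object's metadata text.

import math
import re

STOPWORDS = {"to", "in", "a", "an", "the", "of", "for", "and"}   # minimal list, an assumption

def tokenize(text):
    return [w for w in re.findall(r"[a-z\-]+", text.lower()) if w not in STOPWORDS]

def bs_score(context_text, object_text, corpus_texts):
    # Vocabulary and document frequencies come from the object descriptions.
    vocab = sorted({w for t in corpus_texts for w in tokenize(t)})
    df = {w: sum(w in set(tokenize(t)) for t in corpus_texts) for w in vocab}
    def vec(text):
        words = set(tokenize(text))
        return [1.0 / df[w] if w in words else 0.0 for w in vocab]
    t, o = vec(context_text), vec(object_text)
    num = sum(a * b for a, b in zip(t, o))
    den = math.sqrt(sum(a * a for a in t)) * math.sqrt(sum(b * b for b in o))
    return num / den if den else 0.0

# Worked example above: rank the three objects against the lesson description.
descriptions = ["Introduction to Object-Oriented languages: Inheritance",
                "Java Inheritance",
                "Introduction to Inheritance"]
lesson = "Introduction to Inheritance in Java"
scores = [bs_score(lesson, d, descriptions) for d in descriptions]
print(sorted(range(len(scores)), key=lambda i: -scores[i]))
# expected: [1, 2, 0], i.e., O2 first, then O3, then O1, as in the text
```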

4.3.2 Context-Similarity Situational Relevance Ranking (CSS)

A fair representation of the kind of objects that are relevant in a given context can be obtained from objects that have already been used under similar conditions. For example, if we consider the case where the course represents the context, objects already present in the course are a good representation of what is relevant in that context. Similar to the calculation of the BP metric, the N objects contained in the course are averaged to create a set of relative frequencies for different fields of the learning object metadata record:

$$\mathrm{freq}(c, f, v) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{cont}(o_i, f, v) \;\big|\; o_i \text{ included in } c. \qquad (13)$$

This set of frequencies is then compared with the objects in the result list. The relative frequencies of the values present in the objects' metadata are added to compute the final rank value:

$$CSS(o, c) = \sum_{i=1}^{NF} \mathrm{freq}(c, f_i, \mathrm{val}(o, f_i)) \;\big|\; f_i \text{ present in } o. \qquad (14)$$

This method can be seen as creating a different user profile for each context (in this case, the course) in which the learner is involved. The method can also be applied to more complex descriptions of context. For example, if the query is issued during the morning, a frequency profile can be obtained from objects that the learner has used during similar hours in the morning. That time-of-day profile can later be used to rank the result list using the same approach presented above. Other contextual descriptors that can be used are place, type of task, access device, and so forth.

In (13) and (14), o represents the learning object to be ranked, c is the course where the object will be used, o_i is the ith object contained in the course c, f represents a field in the metadata standard, and v is a value that the field f could take. val(o, f) returns the value of the field f in the object o. f_i is the ith field considered for the calculation of the metric and NF is the total number of those fields. cont(o, f, v) is presented in (7). Here, N represents the number of objects contained in the course.

Example. We assume that a learner issues a query from course C. Course C has three objects: O1, O2, and O3. O1 is a Flash animation whose duration is between 0 and 5 minutes and is for higher education. O2 is another Flash animation whose duration is between 5 and 10 minutes and is for higher education. O3 is a video of 20 minutes, also targeted to higher education. The profile for that specific course will be C = (LearningResourceType: animation = 0.67, video = 0.33; Duration: 0-5 minutes = 0.33, 5-10 minutes = 0.33, 10-30 minutes = 0.33; Context: higher education = 1). The result list contains the following objects: O4, a text document with an estimated learning time of 1 hour, for higher education; O5, a video whose duration is between 0 and 5 minutes, targeted to primary education; and O6, a Flash animation whose duration is between 10 and 30 minutes, targeted to higher education. The CSS value for O4 is 0 + 0 + 1 = 1. For O5, it is 0.66. For O6, it is 2. The order of the final result list ranked by CSS would be (O6, O4, O5).

Data and initialization. The CSS metric depends on the contextual information that can be captured during previous interactions of the user with the learning objects, as well as at QT. The most basic context that can be obtained from an LMS is the course from which the user submitted the query. Also, using the course as the context facilitates capturing the information about previous objects used in the same context, helping in the bootstrapping of the metric. Nevertheless, more advanced context definitions can be used to calculate variations of this metric, at the cost of a more detailed logging of user actions.
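A minimal Python sketch of (13)-(14) follows; it mirrors the BP sketch but builds the frequency profile from the objects already in the course. The dictionary representation of object metadata and the field names are illustrative assumptions.

```python
# Sketch of the Context-Similarity Situational relevance metric (CSS), eqs. (13)-(14).

from collections import Counter

def course_profile(objects, fields):
    # Relative frequency of each value per field over the objects in the course (eq. 13).
    n = len(objects)
    return {f: {v: c / n for v, c in Counter(o[f] for o in objects if f in o).items()}
            for f in fields}

def css_score(obj, profile):
    # Sum the course-profile frequencies of the values present in the object (eq. 14).
    return sum(profile.get(f, {}).get(v, 0.0) for f, v in obj.items())

# Worked example above (course C and result objects O4, O5, O6):
course = [
    {"type": "animation", "duration": "0-5 min", "context": "higher education"},
    {"type": "animation", "duration": "5-10 min", "context": "higher education"},
    {"type": "video", "duration": "10-30 min", "context": "higher education"},
]
profile = course_profile(course, ["type", "duration", "context"])
results = {
    "O4": {"type": "narrative text", "duration": "30-60 min", "context": "higher education"},
    "O5": {"type": "video", "duration": "0-5 min", "context": "primary education"},
    "O6": {"type": "animation", "duration": "10-30 min", "context": "higher education"},
}
for name, obj in results.items():
    print(name, round(css_score(obj, profile), 2))
# expected: O4 1.0, O5 0.67, O6 2.0 (the text rounds the O5 value to 0.66)
```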

    4.4 Ranking Metrics Comparison

Different metrics estimate different relevance dimensions and consume different types of raw data. Therefore, not all the metrics can or should be implemented in every environment. For example, if the searching environment does not include information from an LMS or similar system, some of the metrics (CST and CSS) could not be calculated. This section presents a comparison between the proposed relevance ranking metrics based on their related relevance characteristics and the origin of the data needed for their calculation.

Table 2 presents the correspondence of the ranking metrics with the Relevance or Quality Characteristics presented in Section 3. It can be clearly seen that each metric only covers a small percentage of the characteristics. Also, their correspondence has different levels: a metric can correspond strongly with some characteristics and weakly with others. For example, CST, by its definition, can be used to estimate the Learning Goal of the user. However, given that it is based on the similarity of courses, it is also correlated, at a lower level, with the Learning Setting (similar courses sometimes use similar learning approaches) and, at an even weaker level, with the Language (courses sharing the same objects are usually in the same language) and the Cultural characteristics (the decision to choose similar material could be related to the cultural context).


TABLE 2. Correspondence of the Ranking Metrics with the Quality Characteristics and Relevance Dimensions (S = Strong, M = Medium, W = Weak, A = After Adaptation).


Another interesting example is BP. This metric is based on the metadata of the objects that a learner has previously used. While it is designed to estimate the Personal Relevance, the presence in the metadata of context-related fields like Duration, Interactivity Type, and Technical Requirements also correlates it with some of the Contextual Relevance characteristics. Some metrics need to be adapted to address different Contextual Relevance characteristics. For example, CSS can be calculated from different types of contextual information to estimate the relevance for different Learning Settings, Times, or Spaces.

Table 2 also shows that the proposed metrics as a whole correspond with most of the relevance characteristics. However, Learner Motivation and the Contextual characteristics are not well covered. This is a reminder that the proposed metrics are not a comprehensive set but a first formal proposal of multidimensional relevance ranking metrics for learning objects.

The implementation of the metrics in real systems is bound to be dependent on the availability of the raw data for their calculation. Table 3 presents a summary of the data needed for the different metrics. It is important to note that some data are required at Query Time (QT), for example, the user identification in the Personal Relevance metrics. Other information is needed for offline (OL) calculations; for example, the similarity between queries used in the BT metric can be precalculated from the Query-Object relationships. As expected, all metrics rely on usage and contextual information provided by an LMS or on the capture of CAM. If only information from an LMS is available, the best metrics to cover most of the relevance dimensions will be CST, IT, and BS. If the system is not connected to an LMS but has CAM from the users, then BT, BP, and USP are the most appropriate metrics.

All the metrics need some sort of OL calculation. Even BS, which only uses terms in the context and terms in the text of the object, needs an index with the frequency of the different words in order to be calculated. Any system that implements these metrics must therefore provide some kind of temporal storage. Moreover, depending on the scale of data collection, the solutions for data storage and processing could become the principal concern in the metric calculation system.

In summary, the different origins and targets of the ranking metrics make them strong when they are seen as a group but weak if they are taken alone. That is the reason why metrics in real-world search engines are combined in order to produce a final rank. Section 5 discusses the different methods to combine all the proposed ranking metrics into a unique LearnRank estimation.

    5 LEARNING TO (LEARN)RANK

In order to be useful, the different metrics should be combined to produce a unique ranking value that can easily be used to order result lists. This combination of metrics is not a trivial task. A series of workshops, entitled Learning to Rank [55], has been conducted in recent years to discuss and analyze different methods to combine metric values into a final, optimal ranker. All these methods share a similar approach:

1. Obtain human-generated values of relevance (explicitly or implicitly) for different result lists.
2. Calculate the metrics for the same objects.
3. Use the metric values as input and the human-generated relevance values as output to train a machine learning algorithm.
4. Use the resulting trained model as the final ranking metric.

The most basic approach to learning a ranking function from numerical metrics is multivariable linear regression [56]. In this approach, the human ranking is considered the dependent variable and the ranking metrics are the independent variables. The coefficients that produce the best fit of the learned function to the human-generated relevance values are estimated, and the final function takes the form of a linear combination of the metrics. While simple, the main problem with this approach is that it overconstrains the problem: we want to learn the order in which objects should be ranked, not the actual value assigned to each object [57]. More advanced approaches do not use the numerical value of the human relevance estimation as the learning target but only the relative positions in the human-generated rank. The machine learning algorithm is trained to rank the objects in the same order as a human would, without regard to the individual rank values. These approaches have been shown to be much more effective [58].
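As an illustration of this regression baseline, the sketch below fits such a linear combination with ordinary least squares. The metric values, the relevance grades, and the column layout are invented for illustration; this is a minimal sketch of the general idea, not the setup used later in the evaluation.

import numpy as np

# rows = result-list items; columns = [algorithmic, BT, BP, BS] metric values
X = np.array([
    [0.91, 5.0, 1.0, 0.66],
    [0.84, 2.0, 0.0, 0.41],
    [0.80, 7.0, 1.0, 0.72],
    [0.77, 0.0, 0.0, 0.12],
])
y = np.array([5.0, 2.0, 7.0, 1.0])   # human relevance grades (1-7 scale)

# add an intercept column and solve the least-squares problem
A = np.hstack([X, np.ones((X.shape[0], 1))])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

def linear_rank_score(metrics):
    """Linear combination of the metrics learned by the regression."""
    return float(np.dot(np.append(metrics, 1.0), coeffs))

# order the items by the learned score (highest first)
print(sorted(range(len(X)), key=lambda i: -linear_rank_score(X[i])))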

To generate LearnRank, a metric that combines the different relevance metrics to rank learning objects, this paper uses one of the order-based learning strategies. The selected algorithm is RankNet [57]. The selection was based on the effectiveness of this algorithm [59], as well as on its commercial success (it is the rank-learning algorithm behind MSN Search). RankNet uses a neural network to learn the optimal ranking based on the values of the original metrics. The training is conducted using pairs of results ranked with respect to each other, and the neural net is trained to produce the smallest error with respect to these training pairs (the cost function).


TABLE 3. Source Data Needed to Calculate the Ranking Metrics (QT = Query Time, OL = Off-Line).


For example, if it is known that R1 should be ranked higher than R2 but the network output indicates that LearnRank(R2) is higher than LearnRank(R1), a corrective term is backpropagated to adjust the weights of the neurons of the net. The set of weights that produces the smallest difference between the calculated and the human-generated ordering is selected. More details about the properties of the RankNet algorithm are presented in [57].
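The sketch below illustrates this pairwise training scheme with a small numpy network. It is an illustrative re-implementation of the idea, not the RankNet code used here; the learning rate and toy training pairs are assumptions, and the single hidden layer of 10 neurons simply mirrors the configuration used in the evaluation of Section 6.

import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, lr = 4, 10, 0.05

# network parameters: one hidden layer, one output score per object
W1 = rng.normal(0.0, 0.1, (n_hidden, n_features))
b1 = np.zeros(n_hidden)
w2 = rng.normal(0.0, 0.1, n_hidden)
b2 = 0.0

def score(x):
    """Forward pass: the LearnRank estimate for one vector of metric values."""
    h = np.tanh(W1 @ x + b1)
    return w2 @ h + b2, h

def train_pair(x_hi, x_lo):
    """One gradient step on a pair where x_hi should outrank x_lo."""
    global W1, b1, w2, b2
    s_hi, h_hi = score(x_hi)
    s_lo, h_lo = score(x_lo)
    # pairwise cross-entropy cost: L = log(1 + exp(-(s_hi - s_lo)))
    grad_o = -1.0 / (1.0 + np.exp(s_hi - s_lo))    # dL/d(s_hi - s_lo)
    dW1, db1 = np.zeros_like(W1), np.zeros_like(b1)
    dw2, db2 = np.zeros_like(w2), 0.0
    for x, h, sign in ((x_hi, h_hi, 1.0), (x_lo, h_lo, -1.0)):
        g = sign * grad_o                           # dL/ds for this endpoint
        dw2 += g * h
        db2 += g
        dh = g * w2 * (1.0 - h ** 2)                # backpropagate through tanh
        dW1 += np.outer(dh, x)
        db1 += dh
    W1 -= lr * dW1
    b1 -= lr * db1
    w2 -= lr * dw2
    b2 -= lr * db2

# toy training pairs: (metrics of the preferred object, metrics of the other)
pairs = [(rng.random(n_features) + 0.5, rng.random(n_features)) for _ in range(200)]
for _ in range(20):
    for x_hi, x_lo in pairs:
        train_pair(x_hi, x_lo)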

The main advantage of using a learning mechanism based on relative relevance is that the human-generated relevance data needed to learn and improve the ranking can be extracted automatically from the interactions of the users with the search (or recommendation) tool. It has been shown that users review the result list from the first to the last item in order [26]. Therefore, the position of the selected object gives information about its relevance relative to the preceding objects. For example, if a user confronted with a result list selects only the third object, she considered it more relevant than the first and second objects. That information can be converted into relative relevance pairs and fed into the RankNet algorithm in order to improve the ranking. The next time the user is confronted with the same result list, the third object should appear in a higher position.
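A minimal sketch of this conversion, assuming a simple log of result lists and selections (the function and identifiers are hypothetical), could look as follows. Each selected object is taken to be preferred over the unselected objects shown above it.

def pairs_from_click(result_list, selected_ids):
    """Return (preferred_id, other_id) pairs from one displayed result list."""
    pairs = []
    for pos, obj_id in enumerate(result_list):
        if obj_id in selected_ids:
            # the selected object beats every skipped object ranked above it
            pairs.extend((obj_id, skipped)
                         for skipped in result_list[:pos]
                         if skipped not in selected_ids)
    return pairs

# Example: the user selects only the third result.
print(pairs_from_click(["lo-1", "lo-2", "lo-3"], {"lo-3"}))
# [('lo-3', 'lo-1'), ('lo-3', 'lo-2')]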

    6 EVALUATION EXPERIMENT

In order to evaluate the potential impact of the proposed metrics on the relevance ranking of learning object searches, an exploratory study was performed. The study consisted of an experiment in which subjects were asked to simulate the creation of a lesson inside an LMS. The subjects were required to quantify the relevance of a list of top-10 learning objects, ranked using the default text-based TF-IDF metric provided by Lucene [30]. They also had to select from the list the objects they considered appropriate for the lesson. The TF-IDF ranking is compared with the subjects' ranking to create a baseline performance score. The proposed basic metrics for each of the relevance dimensions, as well as the best-fitting linear combination and the trained RankNet, are then used to reorder the list. Finally, the reordered lists are compared against the human-generated rank.

    6.1 Experimental Setup

Ten users (eight professors and two research assistants) from the Computer Science field were required to create 10 lessons related to different computer science concepts presented in Table 1. For each lesson, the subjects were required to write a brief description of the lesson for hypothetical students. The subject was then presented with a list of 10 objects. These objects were obtained from an LOR containing all the PDF learning objects currently available on the MIT OCW website [60] (34,640 objects). The objects belong to all the majors taught at MIT, not only to Computer Science. This LOR was queried with a different query phrase for each lesson, as listed in Table 4. The title, description, and keyword fields were text-matched with the query terms, and the top-10 objects of each result list were used in the experiment. The subject then graded the relevance of each object to the lesson on a seven-value scale, from Not Relevant at All to Extremely Relevant. Moreover, subjects were required to select the objects they would include in the lesson. The data collection was conducted using a Web application.

The initial rank of the objects was produced by the Lucene ranking algorithm, which is based on vector space retrieval [30]. This algorithm can be considered a good representation of current algorithmic relevance ranking. The basic topical relevance metric (BT) was calculated by counting the number of times each object was selected for inclusion in a lesson; the selections of a given subject were left out when evaluating that subject's own relevance estimates. The basic personal relevance metric (BP) was calculated using historical information about the objects that the subjects had published in their LMS courses. Three fields were captured: main discipline classification, document type, and context level. These fields were selected on the basis of the information available in the LOM records of the MIT OCW learning objects and the metadata available from objects previously published by the participants. The specific course of the previous objects was not taken into account because the participants do not necessarily teach the experimental topics. The basic situational relevance metric (BS) captured the text entered by the subjects in the description of the lesson. Stopwords were eliminated and the resulting keywords were used to expand the query submitted to Lucene; the new ranking of the 10 objects was then extracted from the expanded-query result list.
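The following sketch illustrates how these three basic metrics could be computed from such data. The data structures and field names are assumptions made for illustration; they are not the scripts used in the experiment.

def bt_score(object_id, selections, current_user):
    """BT: how often other users selected this object for their lessons."""
    return sum(1 for user, obj in selections
               if obj == object_id and user != current_user)

def bp_score(obj_metadata, user_profile,
             fields=("classification", "document_type", "context_level")):
    """BP: Boolean agreement between object metadata and the user's history."""
    return sum(1 for f in fields if obj_metadata.get(f) == user_profile.get(f))

def bs_expanded_query(query, lesson_description, stopwords):
    """BS: expand the query with the non-stopword terms of the lesson text."""
    extra = [t for t in lesson_description.lower().split() if t not in stopwords]
    return query + " " + " ".join(extra)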

Once the values of the metrics were calculated, they were combined. In order to include in the combination a metric from each of the relevance dimensions, the relevance score provided by Lucene was used as an estimate of the Algorithmic relevance. Two methods were used to combine the metrics. First, the assigned human relevance was used to compute the coefficients of a linear combination of the metrics through multivariable linear regression; this combination is referred to as the Linear Combination. Second, the relative relevance pairs, also generated from the human ranking, were used to train a two-layer RankNet with 10 hidden neurons, with the values of the different metrics as the input of the neural net; this combination is referred to as the RankNet Combination.

In order to avoid overfitting the combined metrics, training and testing were conducted using a 10-fold approach. The human-generated rank data were divided into 10 sets according to the reviewer to whom they belonged.


TABLE 4. Task Performed during the Experiment and Their Corresponding Query Phrase.


The learning algorithm was trained on nine of the sets and tested on the remaining set. The results reported in this experiment are those obtained in the test phase.
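A sketch of this reviewer-based 10-fold procedure is shown below. The functions `train` and `evaluate` stand for whatever learner and error measure are plugged in, and the data layout (pairs grouped by reviewer) is an assumption.

def ten_fold_by_reviewer(pairs_by_reviewer, train, evaluate):
    """Train on nine reviewers' data, test on the held-out reviewer, repeat."""
    scores = []
    reviewers = list(pairs_by_reviewer)
    for held_out in reviewers:
        train_pairs = [p for r in reviewers if r != held_out
                         for p in pairs_by_reviewer[r]]
        model = train(train_pairs)
        scores.append(evaluate(model, pairs_by_reviewer[held_out]))
    return scores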

Once all the metrics and the two combinations were calculated, they were compared against the manual rank produced by the human reviewers. In order to measure the difference between the manual rank and each of the automated ranks, a variation of the Kendall tau metric that deals with ties in the rank was used [61]. This metric measures the distance between two permutations and is proportional to the number of swaps needed to convert one list into the other using bubble sort. If two ranks are identical, the Kendall tau distance is 0; if they are in inverse order, it is 1.
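As an illustration, the sketch below computes a normalized Kendall tau distance of this kind, assigning a penalty of 1/2 when a pair is tied in one ranking but ordered in the other. This is one common way of handling ties and is only an approximation of the exact variant cited in [61].

def kendall_tau_distance(rank_a, rank_b):
    """rank_a, rank_b: dicts mapping object id -> rank value (lower = better)."""
    items = list(rank_a)
    n = len(items)
    penalty = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            da = rank_a[items[i]] - rank_a[items[j]]
            db = rank_b[items[i]] - rank_b[items[j]]
            if da * db < 0:                 # pair ordered in opposite directions
                penalty += 1.0
            elif (da == 0) != (db == 0):    # tied in exactly one of the rankings
                penalty += 0.5
    return penalty / (n * (n - 1) / 2)      # 0 = identical order, 1 = reversed

# Example: a completely reversed ranking of three objects gives distance 1.0
print(kendall_tau_distance({"a": 1, "b": 2, "c": 3}, {"a": 3, "b": 2, "c": 1}))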

    6.2 Results

Only 12 percent of the objects presented to the users were manually rated Very Relevant (5), Highly Relevant (6), or Extremely Relevant (7). This implies that pure algorithmic relevance ranking does a mediocre job of providing relevant results in the top-10 positions of the result list, especially if the repository contains a large number of objects on different topics. Some searches, for example human-computer interaction, returned almost only Not Relevant at All results, even though the repository contained material for courses about Interface Design and Human Centered Computing. This was due to the fact that several unrelated objects in the test repository contained the words human, computer, and interaction.

The Kendall tau distance between the Base Rank (based on the Lucene algorithmic relevance metric) and the human ranking has a mean value of 0.4 over all the searches. For query terms that commonly appear in other disciplines besides Computer Science, such as trees (5) and human-computer interaction (6), it borders on 0.5, meaning that there is no relation between the relevance given by the automatic ranker and the human review. For example, the Lucene algorithm considered objects about the biological evolution of natural trees to be highly relevant. However, for very specific Computer Science query terms, such as xml (6) and operating systems (7), it provides a lower value, around 0.3, implying a stronger correlation between the manual and automatic ranks. These tau values are consistent with the low quality of the retrieval.

If the top-10 results provided by Lucene are reordered using the basic metrics, the topical relevance metric (BT) provides the best ranking, with an improvement of 31 percent over the Lucene ranking. The situational relevance metric (BS) provides an improvement of 21 percent. The worst-performing metric was the personal relevance metric (BP), but even alone it still produces an improvement of 16 percent over the baseline ranking. If the metrics are combined, the RankNet Combination produces a much better result than the Linear Combination and any of the individual metrics, with an improvement of 51 percent over the Lucene ranking. The Linear Combination, on the other hand, produces a result (22 percent) comparable with those of the individual metrics. The summary of the results, as well as their statistical significance, can be seen in Table 5. Figs. 3 and 4 show the disaggregated tau values for each of the queries for the individual and combined metrics.

6.3 Discussion of the Results

The basic topical relevance metric (BT) provides the best correlation with the manual ranking among the individual metrics. It was the metric most directly related to human choice, as highly relevant items were normally the ones selected for inclusion in the lessons. It performed better than the baseline ranking in all the searches. However, this result was affected by the fact that all the subjects participating in the experiment belong to the same field. In a real situation, a lower performance is expected, as noise will be present in the data used to calculate this metric. This noise comes from unrelated searches using similar query terms. For example, if a biology professor were also using the same search engine, the query term tree would be expected to produce two different patterns of selections.


TABLE 5. Results of the Evaluation of the Basic Metrics.

Fig. 3. Results of the Kendall tau distance from the manual ranking of the individual metrics.


This problem can be solved by applying a more advanced topical relevance metric, such as CST.

The basic personal relevance metric (BP) presented some problems in certain queries. This can be explained by errors or unexpected values in the metadata records of the objects: while an object was relevant for a given lesson, its metadata values did not always match the user preferences. For example, in search number 10 (routing), the topical classification of the objects was Electrical Engineering, different from the Computer Science value that all the subjects had in their profiles. Another case that exemplifies this problem occurred in search number 1 (inheritance), where the objects found most relevant came from a Programming course of the Civil Engineering department, a value different from the one present in the subjects' profiles. This problem could be addressed by measuring the distance between different metadata values instead of performing the current Boolean comparison. An interesting method to measure this distance using classification ontologies is proposed by Olmedilla [28].

The basic situational relevance metric (BS) provided an improvement over the baseline rank in all but one search. It performed better for ambiguous query terms (note search numbers 4 and 5) while hardly affecting the performance of very specific query terms (searches 6 and 7). This result was expected given similar studies on query expansion using contextual descriptions [62].

By far the best performance was obtained by the RankNet Combination of the metrics. It outperformed the baseline ranking and all the other rankings in most of the searches. However, given that it is still a combination of the metrics, it is bound to underperform individual metrics in specific situations, especially when all the metrics provide a very similar ranking. The clearest example of this effect is the tau value obtained for operating systems (case 7): all the metrics provide a similar tau value, and the neural network does not have sufficient input information to produce a better ranking. The Linear Combination, on the other hand, behaves like an average of the individual metrics: it is better than the baseline and BP, but worse than BT. The use of linear regression to learn the ranking function is therefore not recommended.

In conclusion, combining the metrics using RankNet provides a significant increase in ranking performance compared with the baseline rank (the Lucene text-based ranking). These results suggest that a full-fledged implementation of these metrics in a real environment, learning from the continuing interaction of the users with the system, will lead to a meaningful, scalable, and transparent way to rank learning objects according to their relevance for a given query, topic, user, and context.

    6.4 Experiment Limitations

Given its exploratory nature, the experiment has several limitations that must be taken into account. The two most important are: 1) Reordering of the same objects. Only objects present in the top-k results of the algorithmic relevance search were used. The reason for this choice was to limit the amount of manual relevance ranking needed. This limitation, nonetheless, does not affect the result of the evaluation, for two reasons. First, the evaluation only compares relative ordering and not absolute relevance scores. Second, the bias introduced works against the proposed metrics, as they were not able to bring more relevant results from beyond the top-10 objects. Given that the results show that the metrics outperformed the baseline rank, eliminating this bias would only reinforce the conclusion. 2) Limited subject variety. All the subjects were selected from the same field and had similar teaching styles. While this homogeneity boosts the result of the basic topical metric because of the absence of noise in the data, it can also be seen as the result of applying a filter based on user topic preference before calculating BT. Future evaluations in a real system should work with a multidisciplinary sample.

    7 CONCLUSION

The main contribution of this paper is the development and evaluation of a set of metrics related to different dimensions of learning object relevance. The conclusions of this paper can be summarized in the following points:

. Information about the usage of the learning objects, as well as the context where this use took place, can be converted into a set of automatically calculable metrics related to all the dimensions of relevance proposed by Borlund [36] and Duval [12]. This information can be obtained implicitly from the interaction of the user with the system.

. The evaluation of the metrics through an exploratory study shows that all the proposed basic metrics outperformed the ranking based on a purely text-based approach. This experiment shows that the most common of the current ranking methods is far from optimal, and that the addition of even simple metrics could improve the relevance of the results for LOR users.

. The use of methods to learn ranking functions, for example RankNet, leads to a significant improvement of more than 50 percent over the baseline ranking. This result is very encouraging for the development of ranking metrics for learning objects, given that this improvement was reached with only four metrics as contributors to the ranking function.

Fig. 4. Results of the Kendall tau distance from the manual ranking of the combined metrics.

The metrics proposed here have the characteristics needed by the theoretical LearnRank. The very nature of the presented metrics and their combination makes them scalable. They consume information implicitly collected through attention metadata, making them transparent for the user. Finally, the results of the experiment suggest that they are good estimators of the human perception of the relevance of a learning object, making them at least more meaningful than text-based algorithms. Even if they are not proposed as an optimal solution, these metrics could be used to improve current LORs. More importantly for this field of research, these metrics could be considered the new baseline against which new, more advanced metrics can be compared.

The main task left for further work is to execute an empirical study with both a full implementation of the metrics and real users. Once enough data have been collected, the interaction of the users with the system and the progress of the different metrics can be analyzed to shed light on these questions. We also hope that other researchers will start proposing improvements to this initial approach.

    ACKNOWLEDGMENTS

This work was supported by the cooperation agreement between FWO (Belgium) and Senacyt (Ecuador) under the project Smart Tools to Find and Reuse Learning Objects.

    REFERENCES

[1] R. McGreal, Learning Objects: A Practical Definition, Intl J. Instructional Technology and Distance Learning, vol. 1, no. 9, p. 9, 2004.
[2] F. Neven and E. Duval, Reusable Learning Objects: A Survey of LOM-Based Repositories, Proc. 10th ACM Intl Conf. Multimedia (MULTIMEDIA 02), pp. 291-294, 2002.
[3] E. Duval, K. Warkentyne, F. Haenni, E. Forte, K. Cardinaels, B. Verhoeven, R. Van Durm, K. Hendrikx, M.W. Forte, N. Ebel, and M. Macowicz, The Ariadne Knowledge Pool System, Comm. ACM, vol. 44, no. 5, pp. 72-78, 2001.
[4] J. Najjar, J. Klerkx, R. Vuorikari, and E. Duval, Finding Appropriate Learning Objects: An Empirical Evaluation, Proc. Ninth European Conf. Research and Advanced Technology for Digital Libraries (ECDL 05), A. Rauber, S. Christodoulakis, and A.M. Tjoa, eds., pp. 323-335, 2005.
[5] J. Najjar, S. Ternier, and E. Duval, User Behavior in Learning Objects Repositories: An Empirical Analysis, Proc. World Conf. Educational Multimedia, Hypermedia and Telecomm. (ED-MEDIA 04), L.C. Cantoni and C. McLoughlin, eds., pp. 4373-4378, 2004.
[6] L. Sokvitne, An Evaluation of the Effectiveness of Current Dublin Core Metadata for Retrieval, Proc. VALA (Libraries, Technology, and the Future) Biennial Conf., 2000.
[7] H. Chu and M. Rosenthal, Search Engines for the World Wide Web: A Comparative Study and Evaluation Methodology, Proc. 59th Ann. Meeting of the Am. Soc. Information Science, S. Hardin, ed., vol. 33, pp. 127-135, 1996.
[8] B. Simon, D. Massart, F. van Assche, S. Ternier, E. Duval, S. Brantner, D. Olmedilla, and Z. Miklos, A Simple Query Interface for Interoperable Learning Repositories, Proc. First Workshop Interoperability of Web-Based Educational Systems (WBES 05), N. Saito, D. Olmedilla, and B. Simon, eds., pp. 11-18, May 2005.
[9] H. Van de Sompel, M. Nelson, C. Lagoze, and S. Warner, Resource Harvesting within the OAI-PMH Framework, D-Lib Magazine, vol. 10, no. 12, pp. 1082-9873, 2004.
[10] K. Verbert, J. Jovanovic, D. Gasevic, and E. Duval, Repurposing Learning Object Components, Proc. Move to Meaningful Internet Systems 2005: OTM Workshops, R. Meersman, Z. Tari, and P. Herrero, eds., pp. 1169-1178, 2005.
[11] X. Ochoa, K. Cardinaels, M. Meire, and E. Duval, Frameworks for the Automatic Indexation of Learning Management Systems Content into Learning Object Repositories, Proc. World Conf. Educational Multimedia, Hypermedia, and Telecomm. (ED-MEDIA 05), P. Kommers and G. Richards, eds., pp. 1407-1414, June 2005.
[12] E. Duval, Policy and Innovation in Education - Quality Criteria, European Schoolnet, chapter LearnRank: The Real Quality Measure for Learning Materials, pp. 457-463, 2005.
[13] S. Kirsch, Infoseek's Experiences Searching the Internet, SIGIR Forum, vol. 32, no. 2, pp. 3-7, 1998.
[14] L. Page, S. Brin, R. Motwani, and T. Winograd, The Pagerank Citation Ranking: Bringing Order to the Web, technical report, Stanford Digital Library Technologies Project, 1998.
[15] X. Ochoa and E. Duval, Use of Contextualized Attention Metadata for Ranking and Recommending Learning Objects, Proc. First Intl Workshop Contextualized Attention Metadata (CAMA 06), pp. 9-16, 2006.
[16] J. Najjar, M. Wolpers, and E. Duval, Attention Metadata: Collection and Management, Proc. 15th Intl Conf. World Wide Web Workshop Logging Traces of Web Activity (WWW 06), C. Goble and M. Dahlin, eds., p. 4, 2006.
[17] J. Nesbit, K. Belfer, and J. Vargo, A Convergent Participation Model for Evaluation of Learning Objects, Canadian J. Learning and Technology, vol. 28, no. 3, pp. 105-120, 2002.
[18] R. Zemsky and W. Massy, Thwarted Innovation: What Happened to e-Learning and Why, technical report, Univ. of Pennsylvania and Thomson Corporation, 2004.
[19] J. Vargo, J.C. Nesbit, K. Belfer, and A. Archambault, Learning Object Evaluation: Computer-Mediated Collaboration and Inter-Rater Reliability, Intl J. Computers and Applications, vol. 25, pp. 198-205, 2003.
[20] S. Weibel, Border Crossings: Reflections on a Decade of Metadata Consensus Building, D-Lib Magazine, vol. 11, nos. 7/8, p. 6, 2005.
[21] A. Agogino, Visions for a Digital Library for Science, Mathematics, Engineering and Technology Education (SMETE), Proc. Fourth ACM Digital Libraries Conf. (DL 99), pp. 205-206, 1999.
[22] G. Salton and C. Buckley, Term-Weighting Approaches in Automatic Text Retrieval, Information Processing and Management, vol. 24, no. 5, pp. 513-523, 1988.
[23] G. Salton and M. McGill, Introduction to Modern Information Retrieval. McGraw-Hill, 1986.
[24] R. Stata, K. Bharat, and F. Maghoul, The Term Vector Database: Fast Access to Indexing Terms for Web Pages, Computer Networks, vol. 33, nos. 1-6, pp. 247-255, 2000.
[25] V. Chellappa, Content-Based Searching with Relevance Ranking for Learning Objects, PhD dissertation, Univ. of Kansas, 2004.
[26] T. Joachims and F. Radlinski, Search Engines that Learn from Implicit Feedback, Computer, vol. 40, no. 8, pp. 34-40, 2007.
[27] T. Landauer, P. Foltz, and D. Laham, An Introduction to Latent Semantic Analysis, Discourse Processes, vol. 25, nos. 2-3, pp. 259-284, 1998.
[28] D. Olmedilla, Realizing Interoperability of e-Learning Repositories, PhD dissertation, Universidad Autonoma de Madrid, May 2007.
[29] E.L.-C. Law, T. Klobucar, and M. Pipan, User Effect in Evaluating Personalized Information Retrieval Systems, Proc. First European Conf. Technology Enhanced Learning (EC-TEL 06), W. Nejdl and K. Tochterman, eds., pp. 257-271, 2006.
[30] E. Hatcher and O. Gospodnetic, Lucene in Action (In Action Series). Manning, 2004.
[31] P. Dolog, N. Henze, W. Nejdl, and M. Sintek, Personalization in Distributed e-Learning Environments, Proc. 13th Intl World Wide Web Conf. (WWW 04), M. Najork and C. Wills, eds., pp. 170-179, 2004.
[32] K. Sugiyama, K. Hatano, and M. Yoshikawa, Adaptive Web Search Based on User Profile Constructed without Any Effort from Users, Proc. 13th Intl Conf. World Wide Web (WWW 04), pp. 675-684, 2004.
[33] L.M. Quiroga and J. Mostafa, Empirical Evaluation of Explicit versus Implicit Acquisition of User Profiles in Information Filtering Systems, Proc. Fourth ACM Conf. Digital Libraries (DL 99), N. Rowe and E.A. Fox, eds., pp. 238-239, 1999.
[34] A. Gulli and A. Signorini, The Indexable Web Is More than 11.5 Billion Pages, Proc. 14th Intl Conf. World Wide Web (WWW 05), F. Douglis and P. Raghavan, eds., pp. 902-903, 2005.
[35] E. Garfield, The Impact Factor, Current Contents, vol. 25, no. 20, pp. 3-7, 1994.
[36] P. Borlund, The Concept of Relevance in IR, J. Am. Soc. Information Science and Technology, vol. 54, no. 10, pp. 913-925, May 2003.
[37] J. Broisin, P. Vidal, M. Meire, and E. Duval, Bridging the Gap between Learning Management Systems and Learning Object Repositories: Exploiting Learning Context Information, Proc. Advanced Industrial Conf. Telecomm./Service Assurance with Partial and Intermittent Resources Conf./E-Learning on Telecomm. Workshop (AICT-SAPIR-ELETE 05), pp. 478-483, 2005.
[38] P. Vandepitte, L. Van Rentergem, E. Duval, S. Ternier, and F. Neven, Bridging an LCMS and an LMS: A Blackboard Building Block for the Ariadne Knowledge Pool System, Proc. World Conf. Educational Multimedia, Hypermedia, and Telecomm. (ED-MEDIA 03), D.L.C. McNaught, ed., pp. 423-424, 2003.
[39] K. Verbert and E. Duval, Evaluating the Alocom Approach for Scalable Content Repurposing, Proc. Second European Conf. Technology Enhanced Learning (ECTEL 07), E. Duval, R. Klamma, and M. Wolpers, eds., vol. 4753, pp. 364-377, 2007.
[40] G. Linden, B. Smith, and J. York, Amazon.com Recommendations: Item-to-Item Collaborative Filtering, IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, Jan. 2003.
[41] A. Pigeau, G. Raschia, M. Gelgon, N. Mouaddib, and R. Saint-Paul, A Fuzzy Linguistic Summarization Technique for TV Recommender Systems, Proc. IEEE Intl Conf. Fuzzy Systems (FUZZ-IEEE 03), O. Nasraoui, H. Frigui, and J.M. Keller, eds., vol. 1, pp. 743-748, 2003.
[42] E.H. Chi, P. Pirolli, K. Chen, and J. Pitkow, Using Information Scent to Model User Information Needs and Actions and the Web, Proc. SIGCHI Conf. Human Factors in Computing Systems (CHI 01), J. Jacko and A. Sears, eds., pp. 490-497, 2001.
[43] A. Budanitsky and G. Hirst, Semantic Distance in Wordnet: An Experimental, Application-Oriented Evaluation of Five Measures, Proc. Workshop WordNet and Other Lexical Resources, Second Meeting of the North Am. Chapter of the Assoc. Computational Linguistics, pp. 29-34, 2001.
[44] G. Jeh and J. Widom, Simrank: A Measure of Structural-Context Similarity, Proc. Eighth ACM SIGKDD Intl Conf. Knowledge Discovery and Data Mining (KDD 02), D. Hand, D. Keim, and R. Ng, eds., pp. 538-543, 2002.
[45] S. Downes, Models for Sustainable Open Educational Resources, Interdisciplinary J. Knowledge and Learning Objects, vol. 3, pp. 29-44, 2007.
[46] J.M. Kleinberg, Authoritative Sources in a Hyperlinked Environment, J. ACM, vol. 46, no. 5, pp. 604-632, 1999.
[47] A.M.A. Wasfi, Collecting User Access Patterns for Building User Profiles and Collaborative Filtering, Proc. Fourth Intl Conf. Intelligent User Interfaces (IUI 99), M. Maybury, P. Szekely, and C.G. Thomas, eds., pp. 57-64, 1999.
[48] B. Mobasher, R. Cooley, and J. Srivastava, Automatic Personalization Based on Web Usage Mining, Comm. ACM, vol. 43, no. 8, pp. 142-151, 2000.
[49] E. Pampalk, T. Pohle, and G. Widmer, Dynamic Playlist Generation Based on Skipping Behavior, Proc. Sixth Intl Conf. Music Information Retrieval (ISMIR 05), T. Crawford and M. Sandler, eds., pp. 634-637, 2005.
[50] O. Medelyan and I. Witten, Thesaurus Based Automatic Keyphrase Indexing, Proc. Sixth ACM/IEEE CS Joint Conf. Digital Libraries (JCDL 06), M.L. Nelson and C.C. Marshall, eds., pp. 296-297, 2006.
[51] M. Sicilia, E. Garcia, C. Pages, and J. Martinez, Complete Metadata Records in Learning Object Repositories: Some Evidence and Requirements, Intl J. Learning Technology, vol. 1, no. 4, pp. 411-424, 2005.
[52] S. Upendra, Social Information Filtering for Music Recommendation, master's thesis, Massachusetts Inst. of Technology, http://
[53] A. Aizawa, An Information-Theoretic Perspective of TF-IDF Measures, Information Processing and Management, vol. 39, no. 1, pp. 45-65, Jan. 2003.
[54] R. Kraft, F. Maghoul, and C.C. Chang, Y!Q: Contextual Search at the Point of Inspiration, Proc. 14th ACM Intl Conf. Information and Knowledge Management (CIKM 05), F. Dougl