


Contents lists available at ScienceDirect

Information Systems

Information Systems 34 (2009) 766–791

0306-4379/$ - see front matter © 2009 Elsevier B.V. All rights reserved.
doi:10.1016/j.is.2009.04.002
Expanded version of paper originally published at CIKM 2007 [32]. More details are in Yu's thesis [31].
Corresponding author. Tel.: +61 3 9925 2992.
E-mail addresses: [email protected] (J. Yu), [email protected] (J.A. Thom), [email protected] (A. Tam).

journal homepage: www.elsevier.com/locate/infosys

Requirements-oriented methodology for evaluating ontologies

Jonathan Yu, James A. Thom, Audrey Tam

School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne, Australia


Keywords:

Ontology evaluation

Browsing

User studies

Wikipedia

Abstract

Many applications benefit from the use of a suitable ontology but it can be difficult to determine which ontology is best suited to a particular application. Although ontology evaluation techniques are improving as more measures and methodologies are proposed, the literature contains few specific examples of cohesive evaluation activity that links ontologies, applications and their requirements, and measures and methodologies. In this paper, we present ROMEO, a requirements-oriented methodology for evaluating ontologies, and apply it to the task of evaluating the suitability of some general ontologies (variants of sub-domains of the Wikipedia category structure) for supporting browsing in Wikipedia. The ROMEO methodology identifies requirements that an ontology must satisfy, and maps these requirements to evaluation measures. We validate part of this mapping with a task-based evaluation method involving users, and report on our findings from this user study.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Many applications benefit from the use of a suitable ontology. However, when faced with the task of choosing an ontology for a particular application, perhaps from a number of similar ontologies, it can be difficult to determine which ontology is best suited to the application: how can the suitability of an ontology be measured or evaluated, before it is deployed? Although ontology evaluation techniques are improving as more measures and methodologies are proposed, the literature contains few specific examples of cohesive evaluation activity that links ontologies, applications and their requirements, and measures and methodologies. In this paper, we present ROMEO, a requirements-oriented methodology for evaluating ontologies, and apply it to the task of evaluating the suitability of particular ontologies to supporting browsing in Wikipedia.


Wikipedia allows users to create and edit articles across a broad range of subject areas, and maintains a category structure that users can browse to access articles. Wikipedia's category structure may be seen as an information hierarchy. It is by no means a strict and logically grounded ontology, as it has many inconsistencies and is loose in its definition of relationships. However, it can be seen as an ontology because it has an explicit, shared and agreed upon conceptualisation, even though it does not meet the requirements for a simple ontology in the spectrum of ontology specifications described by McGuinness [20]. In this manner, it can be seen as one of the largest public ontologies available on the web: it has a large coverage of information, is utilised by many users, and is constantly evolving. Sub-domains in the Wikipedia category structure are also ontologies; in this paper, we vary two sub-domains to obtain comparable ontologies, and apply the ROMEO methodology to perform ontology evaluation measures on the two sub-domains and their variants.

Section 2 presents an overview of ontology evaluation approaches, criteria and measures. Section 3 describes ROMEO, our requirements-oriented methodology for evaluating ontologies. Section 4 discusses the requirements that the Wikipedia category structure should satisfy in order to support browsing, and applies the


ROMEO methodology to map these requirements to evaluation measures. In Section 5, we seek to validate one of these mappings with a task-based evaluation method involving users, and report on our findings from this user study. Section 6 summarises our conclusions and discusses some future directions.

2. Ontology evaluation

With more ontologies being made available, evaluating which ontology is suitable becomes a problem. It is difficult to discern whether one ontology is better than another. If one is picked, it may lack definitions, axioms or relations required in a domain or application. If none is suitable, an ontology may need to be built from scratch or be composed of several smaller ontologies or ontology modules. However, the process by which ontologies are specified can be ad hoc at times. Whether an ontology is to be selected from a set of candidate ontologies or an ontology is to be constructed, methods for evaluating its suitability and applicability are needed. In this section, we discuss the main existing approaches for ontology evaluation, including criteria and measures.

2.1. Ontology evaluation approaches

There are three main approaches to ontology evaluation.

Gold standard evaluation: This approach compares an ontology with another ontology that is deemed to be the benchmark. Typically, this kind of evaluation is applied to an ontology that is generated (semi-automatically or according to a learning algorithm) to assess the effectiveness of the generating process. Maedche and Staab [18] give an example of a gold standard ontology evaluation, and propose ways to empirically measure similarities between ontologies, both lexically and conceptually, based on the overlap in relations. These measures determine the accuracy of discovered relations generated from their proposed ontology learning system compared with an existing ontology, but are not so useful outside the domain of ontology learning, because if a known gold standard ontology exists then there is no need to evaluate other ontologies.

Criteria-based evaluation: This approach takes the ontology and evaluates it based on criteria [7] such as consistency, completeness, conciseness, expandability and sensitivity. It depends on external semantics to perform the kind of evaluation that only humans are currently able to do, since it is difficult to construct automated tests to compare ontologies using such criteria [2]. These criteria focus on the characteristics of the ontology in isolation from the application. So, while the ontology criteria may be met, the ontology may not satisfy all the needs of the application, even if some of those needs correspond with the ontology criteria.

Task-based evaluation: This approach evaluates an ontology based on the competency of the ontology in completing tasks. In taking such an approach, we can judge whether an ontology is suitable for the application or task in a quantitative manner by measuring its performance within the context of the application. The disadvantage of this approach is that an evaluation for one application or task may not be comparable with another task. Hence, evaluations need to be taken for each task being considered.

2.2. Ontology evaluation methodologies

Broadly speaking, the purpose of evaluation is "to find areas for improvement and/or to generate an assessment" [3] by "the systematic determination of the quality or value of something" [24]. Evaluation also helps determine when the desired level of quality has been attained. Moeller and Paulish [21] state (rephrasing DeMarco [4] and inspired by Lord Kelvin), "you cannot easily manage what you cannot measure". Another motivation for evaluation is to challenge assumptions and accepted practices.

In the case of ontologies, evaluation is carried out for ontology selection and for tracking progress in ontology development. Ontology evaluation methodologies are distinguished from ontology engineering methodologies in that they provide a framework for defining appropriate methods for evaluating ontologies. In the following, we present two influential ontology evaluation methodologies: OntoClean and OntoMetric.

2.2.1. OntoClean

The OntoClean methodology evaluates ontologies using formal notions from philosophy, with the goal of making "modelling assumptions clear" [28]. Applying OntoClean may help an ontology meet the evaluation criterion of correctness. Correctness refers to whether the modelled entities and properties in an ontology correctly represent entities in the world being modelled. OntoClean addresses this by introducing meta-properties to capture various characteristics of classes, and constraints upon those meta-properties, which help to assess the correct usage of the subsumption relation between classes in an ontology [14,28].

The OntoClean methodology seeks to correct classes in an ontology that are modelled as subclasses when they should instead be modelled as properties, as a subclass of another concept, or even as a separate class of their own. The principles of rigidity, identity, unity and dependence are used to determine this. These four meta-properties are used as ontological tools to analyse subclasses in an ontology [12].

2.2.2. OntoMetric

Lozano-Tello and Gomez-Perez [17] propose the OntoMetric methodology, which uses application constraints as a basis for ontology selection. OntoMetric uses an adapted version of the Analytic Hierarchy Process (AHP) method proposed by Saaty [23] to aid decision-making on multiple criteria. A key component is the multilevel tree of characteristics (MTC), which is a taxonomy of characteristics. The top level of this taxonomy has five dimensions of an ontology: content, language, methodology used to build the ontology, tools used to build the ontology, and the cost to utilise the ontology. A set of factors is associated with each dimension, and with each factor, characteristics are associated in turn. These characteristics are taken from existing work and include design qualities, ontology evaluation criteria, cost, and language characteristics. The methodology uses the following steps.

Step 1: Analyse project aims. From the analysis of an ontology engineer, a set of objectives is specified according to guidelines for a suitable ontology given by the respective organisation seeking to adopt an ontology. "They must decide on the importance of the terms of the ontology, the precision of definitions, the suitability of relations between concepts, the reliability of the methodology used to build the ontology, etc." [17].

Step 2: Obtain a customised MTC. Based on the set of objectives from the project aims, the MTC described above is modified to include a set of desired characteristics. This results in a customised MTC.

Step 3: Weigh up each characteristic against each other. Pairs of characteristics are weighted against each other to indicate the importance of one characteristic over another: for each characteristic, a weight wt is assigned against each other characteristic. This pairwise comparison forms a comparison matrix, and eigenvectors are calculated from this matrix. This is used to determine the suitable ontologies in Step 5.

Step 4: Assign a linguistic score for each characteristic of a candidate ontology. For each candidate ontology c, assess its characteristics and assign a score wc along the linguistic scale for the given characteristic. The linguistic scale is applied for each characteristic and varies according to the characteristic. A typical set of linguistic values may be ⟨very low, low, medium, high, very high⟩ or ⟨non-supported, supported⟩.

Step 5: Select the most suitable ontology. The similarity between the characteristics of each candidate ontology is evaluated. This is achieved by comparing the vectors of weights wt and wc of the candidate ontology c and the modified taxonomy of characteristics in the customised MTC.

OntoMetric is a criteria-based ontology evaluation methodology for choosing an ontology based on a set of ontology characteristics. OntoMetric provides a way to compare ontologies based on various objectives. The methodology bases its evaluation on multiple criteria which link directly to the objectives. However, there are some limitations with this methodology, which we outline below:

- Determining the customised MTC for ontology selection depends on manual specification, which may be subjective and inconsistent. For example, in Step 1 of OntoMetric, the methodology instructs ontology engineers to determine the importance of aspects of an ontology such as the set of terms and relationships. However, OntoMetric does not help guide ontology engineers through the process of mapping these objectives to specific aspects of ontologies, especially evaluating the content of an ontology. A crucial part of this methodology is that users need to be familiar with the set of ontology characteristics available.

- The list of characteristics for evaluating content is limited. There are other existing measures proposed in literature, which we present in Section 2.4.

- The linguistic scale does not use specific measurements of an ontology characteristic. It is up to the user to assign values of an ontology characteristic for a candidate ontology according to the linguistic scale. There are no quantifiable indicators for a given association of a value on the scale to the criteria, which may limit how meaningful each comparison is. For example, a value of high given for the number of concepts in a candidate ontology may refer to 100 concepts or 1000 concepts.

- OntoMetric does not help ontology evaluation in the case of ontology development. Rather, OntoMetric can only be used to decide which ontology is the most suitable from a set of candidate ontologies. In the case that a new ontology is being developed, this methodology does not help to determine a suitable method for evaluating that ontology.

Table 1. Proposed ontology evaluation criteria.

Gruber [9]: Clarity, coherence, extendibility, minimal ontological commitment, minimal encoding bias
Gruninger and Fox [11]: Competency
Gomez-Perez [7]: Consistency, completeness, conciseness, expandability, sensitiveness
Guarino [12]: Correctness (identity and dependence)
Guarino and Welty [14]: Correctness (essence, rigidity, identity and unity)
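Step 3's pairwise comparison can be sketched in code. This is an illustrative reconstruction rather than OntoMetric's implementation: the 3×3 comparison matrix below is hypothetical (Saaty's 1–9 scale), and the weight vector is approximated as the principal eigenvector of the matrix via power iteration.

```python
# Hypothetical pairwise comparison matrix for three ontology characteristics:
# A[i][j] holds the judged importance of characteristic i relative to j.
A = [
    [1.0, 3.0, 5.0],
    [1 / 3.0, 1.0, 2.0],
    [1 / 5.0, 1 / 2.0, 1.0],
]

def ahp_weights(matrix, iterations=100):
    """Approximate the principal eigenvector of a pairwise comparison
    matrix by power iteration, normalised so the weights sum to 1."""
    n = len(matrix)
    w = [1.0 / n] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        w = [x / total for x in w]
    return w

weights = ahp_weights(A)  # the most important characteristic gets the largest weight
```

The resulting weights play the role of wt in Step 5's comparison of candidate ontologies.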

2.3. Ontology evaluation criteria

Various criteria have been proposed for the evaluation of ontologies, as listed in Table 1. These criteria can be used to evaluate the design of an ontology and to aid requirements analysis.

Some of these criteria can be successfully determined using ontology tools. Reasoners, such as FaCT and RACER, provide the means to check for errors in ontologies, such as redundant terms, inconsistencies between definitions, and definitions referred to but not defined. Dong et al. [5] have used existing software engineering tools and techniques to check for errors in ontologies in the military domain.

Some criteria, such as clarity and expandability, can be difficult to evaluate as there are no means in place to determine them. Moreover, while the completeness of an ontology can be demonstrated, it cannot be proven.

Other criteria can be more challenging to evaluate as they may not be easily quantifiable. They require manual inspection of the ontology. For example, correctness requires a domain expert or ontology engineer to manually verify that the definitions are correct with reference to the real world. This may not always be feasible for a large ontology or even a repository of many ontologies.

Upon analysis, some of the criteria proposed by the different researchers address similar aspects and do overlap. We have previously described existing criteria proposed in literature and summarised these as eight distinct criteria [33]: clarity, correctness, consistency, completeness, conciseness, minimal ontological commitment, expandability and minimal encoding bias.

2.4. Ontology evaluation measures

Ontology evaluation measures are a quantitative means for assessing various aspects of an ontology. Gomez-Perez [8] outlines a list of measures looking at possible errors that could manifest with regards to ontology consistency, completeness and conciseness, and we present these in Table 2.

Given an application and a text corpus that represents the knowledge in that domain, Brewster et al. [2] present some approaches to identify a suitable ontology from a given set of ontologies: counting the number of overlapping terms, a vector space similarity measure, structural fit by clustering and mapping terms, and using conditional probability to evaluate the "best fit" of an ontology. Two of the above approaches involve ontology evaluation measures for analysing coverage over a given domain; these are also presented in Table 2.
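The vector space similarity measure can be illustrated with a standard cosine similarity between term-frequency vectors. This is a generic sketch of the idea, not Brewster et al.'s exact formulation, and the term vectors shown are hypothetical.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine similarity between two term-frequency vectors,
    each a dict mapping term -> count."""
    shared = set(v1) & set(v2)
    dot = sum(v1[t] * v2[t] for t in shared)
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Hypothetical term vectors for an ontology's labels and a domain corpus.
ontology_terms = {"category": 3, "article": 2, "film": 1}
corpus_terms = {"category": 5, "article": 4, "actor": 2}
similarity = cosine_similarity(ontology_terms, corpus_terms)
```

A similarity near 1 suggests the ontology's vocabulary fits the domain corpus closely; near 0 suggests a poor fit.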

Guarino [13] proposes measures of precision and coverage. In information retrieval [1] these are referred to as precision and recall (which we adopt here, so as not to confuse them with the coverage criterion). Precision is a measure of correctness, as it examines the overlap between what is modelled in the ontology and the intended domain being modelled, as a proportion of what is modelled in the ontology itself. Recall is a measure of completeness, as it examines the same overlap as a proportion of what is modelled in the intended domain. Thus, it can be used to examine which definitions are deficient in the ontology.

Table 2. Measures for consistency, completeness and conciseness proposed by Gomez-Perez [8], and coverage proposed by Brewster et al. [2].

Consistency:
- Circularity error (at distance 0, 1, n)
- Partition error (common instances in disjoint decompositions and partitions; external classes in exhaustive decompositions and partitions; external instances in exhaustive decompositions and partitions)
- Semantic inconsistency error

Completeness:
- Incomplete concept classification
- Partition errors (disjoint knowledge omission)
- Partition errors (exhaustive knowledge omission)

Conciseness:
- Grammatical redundancy error (redundancies of "subclass-of", redundancies of "instance-of")
- Identical formal definition (classes)
- Identical formal definition (instances)

Coverage:
- Number of overlapping concepts
- Vector space similarity measure

Measures focusing on structural aspects of an ontology have also been proposed in literature, in particular the quality of its instantiation and how classes interact with its instances in the knowledge base. Gangemi et al. [6] present a suite of measures focusing on the structure of an ontology, and we present these measures in Table 3. Tartir et al. [26] propose measures to evaluate an ontology's capacity or "potential for knowledge representation", and we present these in Table 4 along with some coupling measures proposed by Orm et al. [22].

2.4.1. Detailed descriptions of selected ontology measures

Below we outline specific details for a selected set of ontology measures, many of which we use later in this paper.

There are differing formal definitions of ontologies proposed in literature [13,15,18,25]. We adopt the following simplified definition of an ontology for the purpose of describing the evaluation measures.

- An ontology O = ⟨Oc, Oi, Or⟩, where Oc is the set of concepts in the ontology, Oi is the set of instances in the ontology, and Or is the set of relationships in the ontology (which is also the union of the set of relationships between concepts Ocr and the set of relationships between instances Oir).

- A frame of reference F ∈ F*, where F = ⟨Fc, Fi, Fr⟩, F* is the set of all frames of reference, Fc is the set of concepts in a frame of reference, Fi is the set of instances in a frame of reference, and Fr is the set of relationships in a frame of reference (which is the union of the set of relationships between concepts Fcr and the set of relationships between instances Fir).

Table 3. Structural measures proposed by Gangemi et al. [6].

- Depth: absolute, average, maximal
- Breadth: absolute, average, maximal
- Tangledness
- Fanoutness: absolute leaf cardinality, ratio of leaf fanoutness, weighted ratio of leaf fanoutness, maximal leaf fanoutness, absolute sibling cardinality, ratio of sibling fanoutness, weighted ratio of sibling fanoutness, average sibling fanoutness, maximal sibling fanoutness, average sibling fanoutness without metric space, average sibling fanoutness without list of values
- Differentia specifica: ratio of sibling nodes featuring a shared differentia specifica, ratio of sibling nodes featuring a shared differentia specifica among elements
- Density
- Modularity: modularity rate, module overlapping rate, logical adequacy, consistency ratio, generic complexity, anonymous classes ratio, cycle ratio, inverse relations ratio, individual/class ratio
- Meta-logical adequacy: meta-consistency ratio
- Degree distribution

Table 4. Schema and instance measures proposed by Tartir et al. [26], and coupling measures proposed by Orm et al. [22].

- Schema measures: relationship richness, attribute richness, inheritance richness
- Instance measures: class richness, average population, cohesion, importance, fullness, inheritance richness, relationship richness, connectivity, readability
- Coupling measures: number of external classes (NEC), reference to external classes (REC), referenced includes (RI)
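To make the structural measures concrete, here is a minimal sketch (our own illustration, with a hypothetical `children` mapping) of the maximal depth measure from Table 3 over a subclass hierarchy:

```python
def max_depth(children, root):
    """Maximal depth of a concept hierarchy, counting the root as level 1.
    children maps each concept to the list of its direct subconcepts."""
    kids = children.get(root, [])
    if not kids:
        return 1
    return 1 + max(max_depth(children, k) for k in kids)

# Hypothetical fragment of a category hierarchy.
children = {"Science": ["Physics", "Biology"], "Physics": ["Optics"]}
depth = max_depth(children, "Science")  # Science -> Physics -> Optics
```

Average depth and the breadth measures follow the same pattern, aggregating over levels of the hierarchy instead of paths.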

The above definition includes possible frames of reference for the ontology, but for simplicity we exclude axioms, properties and attributes (as they are not required for the measures we discuss). Examples of frames of reference (as described by Gomez-Perez [8]) are: the real world, a set of requirements, or a set of competency questions. A frame of reference can also be the set of concepts, instances and relationships of a particular domain.

Ontology evaluation measures that use a frame of reference can be problematic. If the frame of reference is specified, then there is no need to also specify the ontology, thus nullifying the need for an evaluation measure. Thus, evaluation against a frame of reference may not seem operationalisable. However, there are many situations that require the evaluation of newly created or existing ontologies. For example, how do we know whether the set of relationships of an ontology has been adequately defined? Thus, the need for measures which use some kind of comparison remains. In other words, measures using an operationalisable frame of reference are useful.

Since a well defined frame of reference for the measures presented above is not practicable, a possible solution to this problem is to approximate a frame of reference using various techniques such as natural language processing and statistical text analysis. A related example of a text analysis approach is proposed by Brewster et al. [2]. Statistical text analysis is well researched in the information retrieval literature. Utilising some of these techniques, it may be possible to determine whether a concept in the ontology is equivalent to a concept in the frame of reference within some degree of confidence.

Number of overlapping terms: This measure, proposed by Brewster et al. [2], counts the number of terms that overlap between the set of concepts in an ontology Oc and the set of terms extracted from a frame of reference Fc, such as a text corpus representing a given domain:

overlapping_terms(Oc, Fc) = |terms(Oc) ∩ terms(Fc)|   (1)

where terms(Oc) is the set of terms in the labels associated with a given concept c in Oc, and terms(Fc) is the set of terms in the labels associated with a given concept c in Fc. This measure may also take as a parameter Oi, the set of instances of a given ontology.

Precision: Guarino [13] proposes use of the precision measure, which is the percentage of concepts in an ontology Oc that overlap with the intended model, that is, a set of terms from a frame of reference Fc. This is given by

precision(Oc, Fc) = |Oc ∩ Fc| / |Oc|   (2)

This measure may also take as parameters Oi and Or, the set of instances and the relationships between instances and concepts for a given ontology, respectively.

Recall: Guarino [13] proposes the coverage measure, which we refer to as recall. Recall is the percentage of the overlap between a set of terms from the domain and the set of concepts in an ontology. This is given by

recall(Oc, Fc) = |Oc ∩ Fc| / |Fc|   (3)

This measure may also take as parameters Oi and Or, the set of instances and the relationships between instances and concepts for a given ontology, respectively.

Meta-consistency ratio: Gangemi et al. [6] propose the meta-consistency ratio, which is a measure of the correctness criterion. The measure examines the percentage of concepts in the ontology that have meta-consistency against the total number of concepts in the ontology. A meta-consistent concept subsumes another concept according to meta-logical property constraints, such as those used by Guarino and Welty [14] in OntoClean, as discussed in Section 2.2.1. An example of a meta-logical property from OntoClean is rigidity: (1) a rigid property is "a property that is essential to all its instances"; (2) "a non-rigid property is a property that is not essential to some of its instances"; and (3) "an anti-rigid property is a property that is not essential to all its instances". A constraint on a concept with the rigid property is that it cannot be subsumed by a concept with an anti-rigid property. Thus, with regard to the rigidity meta-property constraint of the OntoClean methodology, an example of a meta-consistent concept is a rigid concept subsuming another rigid concept.

ratio_of_meta-consistent_concepts(Oc) = |meta-consistent(Oc)| / |Oc|   (4)

where meta-consistent(Oc) is a function which examines the subsumption relationships between concepts and determines the set of meta-consistent concepts.
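As an illustration of Eq. (4), the following sketch checks only OntoClean's rigidity constraint (a rigid concept must not be subsumed by an anti-rigid one); the full meta-consistent function covers further meta-properties, and the example taxonomy is hypothetical.

```python
def meta_consistency_ratio(parents, rigid):
    """Fraction of concepts whose subsumption links satisfy the rigidity
    constraint. parents maps a concept to its parent concepts;
    rigid maps a concept to True (rigid) or False (anti-rigid)."""
    def consistent(c):
        # A rigid concept subsumed by an anti-rigid concept violates the constraint.
        return not (rigid[c] and any(not rigid[p] for p in parents.get(c, [])))
    ok = [c for c in rigid if consistent(c)]
    return len(ok) / len(rigid)

# Hypothetical example: Person (rigid) under Student (anti-rigid) is a violation.
parents = {"Person": ["Student"], "Student": [], "Dog": []}
rigid = {"Person": True, "Student": False, "Dog": True}
ratio = meta_consistency_ratio(parents, rigid)  # 2 of 3 concepts are meta-consistent
```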

Tangledness: Gangemi et al. [6] propose tangledness, which considers the proportion of concepts with multiple parents to the set of concepts in the ontology. This measure assumes multiple inheritance in an ontology:

t_Gangemi(Oc) = |Oc| / |{c ∈ Oc : p(c) > 1}|

where p(c) is the number of parents of concept c ∈ Oc, and {c ∈ Oc : p(c) > 1} is the set of concepts that have more than one parent in the ontology.

We find this definition of tangledness t_Gangemi to be counter-intuitive. With the above definition, the tangledness of an ontology ranges from 1, which denotes that every concept has multiple parents, to infinity, which denotes no tangledness. A more intuitive definition is simply the inverse:

t(Oc) = |{c ∈ Oc : p(c) > 1}| / |Oc|   (5)

The revised tangledness measure t ranges from 0 to 1, where 0 denotes no tangledness and 1 denotes that every concept in the ontology has multiple parents.
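The revised measure t of Eq. (5) can be computed directly from a parent mapping; a minimal sketch with a hypothetical hierarchy:

```python
def tangledness(parents):
    """Eq. (5): fraction of concepts with more than one parent.
    parents maps each concept to the set of its parent concepts."""
    multi = [c for c, ps in parents.items() if len(ps) > 1]
    return len(multi) / len(parents)

# Hypothetical hierarchy: "Opera" sits under both "Music" and "Theatre".
parents = {"Music": set(), "Theatre": set(), "Opera": {"Music", "Theatre"}}
t = tangledness(parents)  # 1 of 3 concepts has multiple parents
```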

Consistency measures: For the consistency criterion, Gomez-Perez [8] proposes measures of circularity errors and inconsistent classes.

Circularity error is a measure of the consistency criterion, that is, whether cycles occur in an ontology. This is useful for evaluating a taxonomy in which cycles are not allowed. Gomez-Perez [8] gives three kinds of circularity errors:

$$\text{circularity errors at distance } 0 = \text{cycles}(O, 0) \qquad (6)$$

where $\text{cycles}(O, 0)$ is the number of cycles detected between a concept in an ontology and itself;

$$\text{circularity errors at distance } 1 = \text{cycles}(O, 1) \qquad (7)$$

where $\text{cycles}(O, 1)$ is the number of cycles detected between a concept and an adjacent concept;

$$\text{circularity errors at distance } n = \text{cycles}(O, n) \qquad (8)$$

where $\text{cycles}(O, n)$ is the number of cycles detected between a concept and another concept $n$ concepts away.
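To make the $\text{cycles}(O, n)$ counts concrete, here is a hedged sketch that counts, for each distance, the concepts from which a subsumption path of length $n + 1$ leads back to the concept itself; the graph representation and names are illustrative assumptions, not the authors' formulation:

```python
def cycles_at_distance(subclass_of, n):
    """Count circularity errors at distance n: concepts c for which a
    subclass-of path of length n+1 leads from c back to c.

    `subclass_of` maps each concept to the set of its direct parents.
    Distance 0 counts self-loops; distance 1 counts cycles between
    adjacent concepts, and so on."""
    def reachable_in(c, steps):
        frontier = {c}
        for _ in range(steps):
            frontier = {p for x in frontier
                        for p in subclass_of.get(x, set())}
        return frontier

    return sum(1 for c in subclass_of if c in reachable_in(c, n + 1))

# Hypothetical taxonomy with a two-concept cycle: "a" and "b" are
# each other's parents, so both trigger a distance-1 error.
taxonomy = {"a": {"b"}, "b": {"a"}, "c": {"a"}}
print(cycles_at_distance(taxonomy, 0))  # 0
print(cycles_at_distance(taxonomy, 1))  # 2
```

Distance-0 errors would require a concept listed among its own parents.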

For inconsistent classes, Gomez-Perez [8] proposes measuring the number of subclasses with common classes, and the number of classes with common instances.

Conciseness measures: Gomez-Perez [8] proposes some conciseness measures that count the number of semantically identical concepts and instances, as well as counting the redundant subclass-of and instance-of relations.

Completeness measures: Gomez-Perez [8] proposes a completeness measure that counts the number of subclasses for concept c that are not explicitly expressed as being disjoint:

$$\text{incomplete concept classification}(O_c, F^*) = |\{c : \forall \langle F_c, F_i, F_r \rangle \in F^*,\ \exists (c \in F_c) \wedge (c \notin O_c)\}| \qquad (9)$$

This measure may also be applied to a single frame of reference:

$$\text{incomplete concept classification}(O_c, F_c) = |\{c : (c \in F_c) \wedge (c \notin O_c)\}| \qquad (10)$$
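Under the set-based reading of Eq. (10), the measure reduces to a set difference; the concept names below are hypothetical:

```python
def incomplete_concept_classification(ontology_concepts, frame_concepts):
    """Eq. (10): concepts required by a frame of reference F_c
    but missing from the ontology O_c."""
    return frame_concepts - ontology_concepts

# Hypothetical sets: the frame of reference expects "lecturer",
# which the ontology does not model.
O_c = {"person", "student"}
F_c = {"person", "student", "lecturer"}
missing = incomplete_concept_classification(O_c, F_c)
print(sorted(missing))  # ['lecturer']
```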

Recall may also be used to measure completeness, by applying it over all relevant frames of reference to obtain a measure of average recall. The range of values for average recall is from 0, which denotes that no entity in any frame of reference is in the ontology, to 1, which denotes that all entities in all relevant frames of reference are modelled in the ontology.

J. Yu et al. / Information Systems 34 (2009) 766–791

$$\text{average-recall}(O_c, F^*) = \frac{\sum_{\langle F_c, F_i, F_r \rangle \in F^*} \text{recall}(O_c, F_c)}{|F^*|} \qquad (11)$$

This measure may also be given the parameters $O_i$, $O_r$, $O_{cr}$ and $O_{ir}$, denoting the set of all instances, the set of all relationships, the set of all relationships between concepts, and the set of all relationships between instances in a given ontology, respectively.
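The recall and average-recall measures can be sketched over plain sets; the entity names are hypothetical, and each frame of reference is given as a set of required entities:

```python
def recall(ontology_entities, frame_entities):
    """recall(O, F): the fraction of the frame of reference that the
    ontology covers; 1.0 for an empty frame by convention."""
    if not frame_entities:
        return 1.0
    return len(ontology_entities & frame_entities) / len(frame_entities)

def average_recall(ontology_entities, frames):
    """Eq. (11): the mean recall over all relevant frames of reference."""
    return sum(recall(ontology_entities, f) for f in frames) / len(frames)

# Hypothetical example: full coverage of one frame, half of the other.
O_c = {"person", "student"}
frames = [{"person", "student"}, {"person", "lecturer"}]
print(average_recall(O_c, frames))  # (1.0 + 0.5) / 2 = 0.75
```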

3. The ROMEO methodology

Ontology evaluation assesses ontologies with regard to a set of desirable qualities or ontology requirements, whether in the process of ontology engineering or in the task of ontology selection. It depends on the definition of measures for examining ontology qualities, as well as the correct use of these measures to address ontology requirements. However, there is currently no suitable method for obtaining appropriate mappings between these ontology requirements and related measures. Part of the problem is that the set of requirements for a given ontology may differ from one application to another. For example, a method to evaluate an ontology for one application based on one criterion, such as minimal ontological commitment, may not apply to another application, such as one requiring broad coverage of a particular domain. Thus, a methodology is needed that determines an appropriate ontology evaluation method by mapping a set of ontology requirements to suitable measures.

It is with this motivation that we propose a requirements-oriented methodology for evaluating ontologies. The product of the ROMEO methodology is a method of evaluation with mappings of requirements to questions and of questions to measures. The role of the ontology describes the ways in which the ontology is used in an application. Ontology requirements specify the qualities, that is, the degree of particular characteristics, needed from a suitable ontology for the given application. Questions relate requirements to measures, and may be based on aspects of an ontology evaluation criterion. Measures are quantifiable, but not all measures are applicable to all questions; thus, appropriate measures are selected for each question. Fig. 1 shows the components involved in ROMEO, which begins from the intended set of roles of the ontology and then links to the corresponding ontology requirements and their respective questions and measures.

Fig. 1. Overview of the ROMEO methodology.

In this section, we elaborate on the ROMEO methodology and its components. We specifically consider requirements in Section 3.1, questions in Section 3.2 and measures in Section 3.3. We also discuss this methodology in Section 3.5 and compare it with existing methodologies.

3.1. Ontology requirements

An ontology requirement reflects a specific competency or quality that a given ontology must possess in the context of its role in the application. Ontology requirements may be drawn from relevant application requirements of ontology-driven applications, as some application requirements may be relevant in describing specific ontology requirements of the application. The process of defining a set of ontology requirements involves establishing the roles of the ontology, to establish how it is used in an application. The roles of the ontology also help to distinguish applicable ontology requirements from other application requirements.

To aid the specification of ontology requirements, a customised template is provided based on the GQM goal description template. This requirements description template is shown in Table 5. The use of the requirements description template helps to constrain the definition of ontology requirements into a standard specification. The template also allows aspects of the requirement and its related information to be elaborated upon, such as the context, users, and the motivation for the requirement. A description of the motivation for the ontology requirement should be included to specify its relevance



Table 5
Requirements description template.

Requirement: Description of the requirement
Analyze: Candidate ontologies being evaluated
For the purpose of: Ontology evaluation, selection, or refinement
With respect to: Quality or aspect being considered
From the viewpoint of: Stakeholders and those parties affected by the adoption of the ontology
In the context of: A specific ontology role
Motivation: Description of the motivation for the requirement. Include reasons for the importance of the requirement and the aspects being considered in the evaluation of candidate ontologies.

Table 6
Question template.

Questions for requirement: Requirement description
Q1: Question addressing the given requirement
Discussion: A brief discussion giving reasons why the questions are relevant to the requirement.


to the evaluation of candidate ontologies for the given application.

Establishing the roles of the ontology: Defining the roles of an ontology helps to give an understanding of how the ontology is used in the context of an application. This step requires a brief description of the application and the qualities of a suitable ontology, and a discussion of each role.

By eliciting the roles of an ontology in the ontology-driven application, we can use them to determine an appropriate set of ontology requirements. The roles also help to decide whether certain requirements apply, as there may be application requirements which are relevant but are not necessarily ontology requirements.

Obtaining a set of ontology requirements: A variety of sources may be used in obtaining the set of ontology requirements. We describe below the cases that may occur.

Case 1: Existing ontology requirements. In the event that a set of ontology requirements for this ontology exists, we adopt it as our ontology requirements.

Case 2: Existing application requirements. A set of ontology requirements may be derived from existing application requirements. Application requirements are specific to a given application, whereas ontology requirements are specific to ontologies in the context of one or more ontology-driven applications. The set of application requirements may be examined to determine which requirements are affected by the content and the quality of the adopted ontology. There may be aspects of a given application requirement that are relevant with respect to how ontologies are used in the application. These application requirements may be drawn from interviews with application developers or from a review of the relevant documentation.

Case 3: No application requirements or documentation exist. Where application requirements have not been specified and documented, an analysis of the application and its requirements may be necessary for the sole purpose of determining a set of ontology requirements. It is outside the scope of the ROMEO methodology to perform a complete requirements engineering analysis for the application.

The guiding principle in determining what is included in the set of ontology requirements is that requirements should outline specific competencies and qualities of a suitable ontology in the context of its role in a given application. The set of ontology requirements may vary greatly depending on the role of the ontology for a given application. Without a clear idea of the role of the ontology, it may be difficult to make decisions about which requirements apply.

3.2. Questions

After defining a set of ontology requirements to use for ontology evaluation, one or more questions are specified for each identified ontology requirement. Questions help to explore the various aspects of a given requirement and, in turn, provide a deeper understanding of the requirement.

A question is specified such that its associated requirement is considered satisfied when the question is answered and the answer falls within a specified range. Each question should consider aspects of a specific ontology quality, criterion or characteristic. The aim is to interpret a given requirement, formalise it into one or more questions and provide a basis for the inclusion of relevant measures in answering each question. The content of a question may be based on an aspect of an ontology evaluation criterion. Questions allow an ontology engineer to map specific aspects of an ontology evaluation criterion that relate directly to a given requirement, rather than trying to map a whole ontology evaluation criterion.

We provide the ROMEO methodology template for questions in Table 6. The template collects the questions for a given requirement and also prompts a discussion to justify the included questions.

3.3. Measures

Measures seek to quantify and provide answers to the questions with specific measurements. Each question is typically associated with one measure which is able to answer the question sufficiently. We provide the ROMEO methodology template for measures in Table 7. The template should be used to include details of the possible range of values, and an optimal range of values, for the measure. In selecting appropriate measures, a discussion should be included to outline the reasons for each measure


Table 7
Measures template.

Measures for question: Question
M1: Name of the measure
Optimal value(s): Value indicating that the given question has been answered
Range: Possible range of measurements
Discussion: A brief discussion of the reasons the set of measures listed apply to the given question. Also, a discussion of the optimal value and range expected with the set of measures listed.

used in the template. This discussion may also include a description of what the measure does and how it answers the associated question, as well as an accompanying example.

Table 8
Mappings of criteria–question to measures: consistency and conciseness.

Consistency
- Does the ontology include two or more concepts that share the same set of children? → subclasses with common classes (O)
- Does the ontology include two or more concepts that share the same set of instances? → classes with common instances (O)
- How many circularity errors are found in the ontology? → circularity errors at distance 0, 1, n

Conciseness
- How many identical concepts are modelled using different names? → semantically identical concepts ($O_c$)
- How many identical instances are modelled using different names? → semantically identical instances ($O_i$)
- How many redundant subclass-of relationships are found in the ontology? → redundant relationships ($O_c$, isa)
- How many redundant instance-of relationships are found in the ontology? → redundant relationships ($O$, instance_of)
- Does the ontology model concepts outside of the frame of reference? → precision ($O_c$, $F_c$)
- Does the ontology model relationships outside of the frame of reference? → precision ($O_r$, $F_r$)
- Does the ontology model instances outside of the frame of reference? → precision ($O_i$, $F_i$)

3.4. Criteria–questions and suggested mappings to measures

To aid the mapping of requirements to questions and measures, we list some template questions for each criterion below in Tables 8–10 (with mappings to existing measures) and Table 11 (with no mappings to measures). We refer to these questions as criteria–questions. Criteria–questions serve as a starting point for specifying questions. In specifying a set of questions for a particular ROMEO analysis, these criteria–questions may be adapted to suit a given ontology requirement. We also provide suggested mappings for these criteria–questions as a starting point for users of ROMEO in determining appropriate mappings to measures.

In specifying questions, a new ontology evaluation criterion and associated questions may be encountered. As this list is not exhaustive, there may be situations where additional criteria–questions may be defined based on an aspect of an existing ontology evaluation criterion. Questions for a particular ROMEO analysis need not come from this list of criteria–questions, and may not fit any existing ontology evaluation criterion. Consequently, new ontology evaluation criteria and questions may be encountered.

Furthermore, we contend that in the context of many requirements, simply identifying corresponding criteria is not sufficient, as it is unlikely that all possible measures for a given criterion will apply. The plain-language questions corresponding to each measure are intended to help ontology engineers and those specifying the requirements to agree on the choice of appropriate criteria, questions, and measures.

Table 8 presents mappings of criteria–questions to existing measures for the consistency and conciseness criteria. Mappings of the consistency and conciseness criteria–questions to the respective measures are proposed by Gomez-Perez [8]. For the conciseness criteria–questions, we include the precision measure proposed by Guarino [13], since precision measures the proportion of the concepts, instances and relationships in the ontology that are also present in the frame of reference. Thus a concise ontology, which does not contain irrelevant concepts, instances and relationships, maps to an ontology that has high precision with respect to the frame of reference; that is, it does not model concepts, instances and relationships outside of a frame of reference.
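As a sketch of how precision captures conciseness (entity names hypothetical): an ontology that models entities outside the frame of reference scores below 1:

```python
def precision(ontology_entities, frame_entities):
    """precision(O, F): the fraction of ontology entities that are also
    in the frame of reference; a concise ontology scores close to 1.0.
    By convention an empty ontology scores 1.0."""
    if not ontology_entities:
        return 1.0
    return len(ontology_entities & frame_entities) / len(ontology_entities)

# "unicorn" is modelled but lies outside the frame of reference,
# so only two of the three modelled concepts are relevant.
O_c = {"person", "student", "unicorn"}
F_c = {"person", "student", "lecturer"}
print(precision(O_c, F_c))  # 2/3
```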

Table 9 presents the mappings of criteria–questions for completeness and coverage. Gomez-Perez [8] proposes some mappings for completeness. However, we also include recall as a measure for the respective completeness criteria–questions. For the completeness criterion, measures must account for the relevant frames of reference of the world being modelled, to help determine whether an ontology is incomplete. For coverage, the same set of measures is used to address the set of coverage criteria–questions, except that the ontology is compared only with a given frame of reference, which may be a domain. There are no existing measures applicable for minimal ontological commitment.

Table 10 presents mappings of criteria–questions to existing measures for the correctness criterion. Correctness is about whether the right concepts, relationships and


Table 9
Mappings of criteria–question to measures: completeness and coverage.

Completeness
- Does the ontology have concepts missing with regard to the relevant frames of reference? → incomplete concept classification($O$, $F^*$); average-recall($O_c$, $F^*$)
- Does the ontology have subclass concepts missing from a given parent concept with regard to the relevant frames of reference? → exhaustive subclass partition omission($O_c$, $F^*$)
- Does the ontology have instances missing with regard to the relevant frames of reference? → average-recall($O_i$, $F^*$)
- Does the ontology have relationships between concepts missing with regard to the relevant frames of reference? → average-recall($O_{cr}$, $F^*$)
- Does the ontology have relationships between instances missing with regard to the relevant frames of reference? → average-recall($O_{ir}$, $F^*$)

Coverage
- Do the concepts in the ontology adequately cover the concepts in the domain? → incomplete concept classification($O$, $F$); recall($O_c$, $F_c$)
- Do the instances in the ontology adequately cover the instances in the domain? → recall($O_i$, $F_i$)
- Do the relationships between concepts in the ontology adequately cover the relationships between concepts in the domain? → recall($O_{cr}$, $F_{cr}$)
- Do the relationships between instances in the ontology adequately cover the relationships between instances in the domain? → recall($O_{ir}$, $F_{ir}$)

Table 10
Mappings of criteria–question to measures: correctness.

Correctness
- Does the ontology capture concepts of the domain correctly? → precision($O_c$, $F_c$)
- Does the ontology capture instances of the domain correctly? → precision($O_i$, $F_i$)
- Does the ontology capture relationships between concepts of the domain correctly? → precision($O_{cr}$, $F_{cr}$)
- Does the ontology capture relationships between instances of the domain correctly? → precision($O_{ir}$, $F_{ir}$)


instances have been modelled according to the frame of reference or universe of discourse. We propose mappings of the existing measures of the ratio of meta-consistent concepts and precision for the set of correctness criteria–questions. The ratio of meta-consistent concepts applies to whether concepts are captured correctly for a given frame of reference, such as a domain. The measure of precision is also mapped, as it measures whether the concepts in an ontology have been captured according to the frame of reference. Thus precision can be applied to both conciseness and correctness.

Table 11 presents criteria–questions that have no mappings to existing measures: clarity, expandability and minimal ontological commitment. Clarity, introduced by Gruber [9], refers to the criterion of having definitions in an ontology defined unambiguously. A clear ontology should "effectively communicate the intended meaning of defined terms" and, where possible, definitions should be stated formally [10]. The expandability criteria–question relates to whether an ontology can be extended to describe more fine-grained concepts and relationships while maintaining the current definitions within the ontology. Minimal ontological commitment refers to minimising the ontological commitment of an ontology to allow more freedom in its usage; ontological commitment refers to an ontology being able to be agreed upon by users or ontology adopters. However, these criteria and the suggested criteria–questions have no existing measures, and their assessment is left to the experience and knowledge of the ontology engineer.

3.5. Discussion

As we have seen in this section, ontology evaluationusing ROMEO is driven by ontology requirements for an


Table 11
Criteria–questions with no mappings to measures: clarity, expandability, minimal ontological commitment.

Clarity
- Does the ontology have concepts objectively defined?
- Does the ontology have instances objectively defined?
- Does the ontology have relationships between concepts objectively defined?
- Does the ontology have relationships between instances objectively defined?

Expandability
- Does the set of concepts defined allow for future definitions of subclasses?

Minimal ontological commitment
- Does the ontology define any concepts that are overstated for the domain? That is, are there any concepts that do not need to be included in the ontology?
- Does the ontology define any instances that are overstated for the domain?
- Does the ontology define any relationships between concepts that are overstated for the domain?
- Does the ontology define any relationships between instances that are overstated for the domain?


application. ROMEO seeks to associate them with relevant measures through a set of questions. The resulting product of ROMEO is a set of mappings from requirements to questions, and from questions to measures. The ontology evaluation measures can be used as a basis for determining the suitability of an ontology for a given application. ROMEO is also a tool for ontology refinement, as it is able to identify applicable measures from ontology requirements; in turn, these can be used to measure aspects of the ontology that are deficient.
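As a minimal sketch of the product of a ROMEO analysis, the requirement-to-question and question-to-measure mappings can be represented as nested records. The class and field names below are our own assumptions; the paper prescribes templates (Tables 5–7), not code:

```python
from dataclasses import dataclass, field

@dataclass
class Measure:
    """A measure template entry (cf. Table 7): name, optimal value, range."""
    name: str
    optimal: str
    value_range: str

@dataclass
class Question:
    """A question (cf. Table 6) with the measures selected to answer it."""
    text: str
    measures: list = field(default_factory=list)

@dataclass
class Requirement:
    """A requirement (cf. Table 5) mapped to one or more questions."""
    description: str
    questions: list = field(default_factory=list)

# Hypothetical coverage requirement mapped to one question and one measure.
coverage = Requirement(
    description="The ontology should cover the concepts of the domain.",
    questions=[Question(
        text="Do the concepts in the ontology adequately cover the domain?",
        measures=[Measure("recall(Oc, Fc)",
                          optimal="close to 1", value_range="[0, 1]")],
    )],
)
print(len(coverage.questions))  # 1
```

Such a structure makes the mappings reusable across applications, since the question-to-measure links are decoupled from any one requirement.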

ROMEO is flexible and is able to accommodate additional measures should other measures be proposed in the future. As application requirements may change over time, ROMEO provides a systematic approach for reassessing which measures apply, thus allowing room to adapt or update the appropriate set of measures as more applicable measures are proposed.

ROMEO allows for the reuse of mappings in a ROMEO analysis. The individual mappings determined from requirements to questions (which may be based on existing ontology evaluation criteria), and from questions to ontology evaluation measures, may be reused from one application to another. If there are requirements that are common between different applications, the resulting questions ought to be similar, if not the same. However, mappings of questions to measures are more reusable, as they are decoupled from a given application. In Section 5, we consider empirical validation experiments of the mappings from questions to measures, in addition to the ROMEO methodology itself.

ROMEO is adapted from the GQM methodology. Although the use of GQM as the basis for determining appropriate measures is similar to the application of GQM to knowledge bases proposed by Lethbridge [16], ROMEO is used for the content-based evaluation of ontologies for an application. Lethbridge [16] uses GQM to derive a set of measures for knowledge bases according to general tasks in knowledge engineering. ROMEO differs in that it adapts GQM for ontology engineers, who use questions to associate requirements with existing measures and so determine an appropriate ontology evaluation method for the ontology-driven application.

3.5.1. Comparison of GQM, OntoMetric and ROMEO

GQM, OntoMetric and ROMEO are all methodologies that rely on analysis performed by a person, giving rise to human error or bias, which may result in inappropriate evaluation. In the case of ROMEO, there may be cases where an incorrect interpretation of application requirements occurs, or where the ontology engineer has an incorrect understanding of the effect of the measures to apply. However, ROMEO and GQM provide a framework for users to justify each action taken, as analysis occurs at each stage. In the case of ROMEO, the ontology engineer can review and correct the decisions made at each step.

The main difference between GQM and ROMEO is that, while GQM is a general methodology for process improvement, ROMEO is a methodology for ontology evaluation. The GQM methodology is driven by a prior analysis of goals, while the ROMEO methodology is driven by a prior analysis of requirements. Goals in GQM and requirements in ROMEO are used in the respective methodologies to determine which evaluations to carry out. Questions in ROMEO are used in a similar way as in GQM. A question in GQM defines a specific aspect that is relevant for measurement; a question in ROMEO reflects a specific ontology quality or aspect of an ontology criterion that is relevant for ontology evaluation. Questions may be derived from the suggested list of criteria–questions. Measures in ROMEO and metrics in GQM are equivalent, although in ROMEO there is a list of existing ontology evaluation measures to choose from.

Both OntoMetric and ROMEO adopt a top-down approach for determining appropriate ontology evaluation measures. The OntoMetric methodology considers a wider range of issues for ontology selection, referred to as dimensions [17], such as the language used to build the ontology and the cost of utilising a given ontology. However, both the ROMEO methodology and OntoMetric evaluate the content of an ontology.

In OntoMetric, ontology evaluation is carried out by formulating an objective, associating various criteria with each objective and, through a series of steps, determining a set of measures of the relevant characteristics. Using the analysis of each objective, we begin with the multilevel tree of characteristics (MTC), and form a decision tree by pruning and customising it. The MTC has a complete set of dimensions, criteria and characteristics. Each candidate ontology is then compared using this


Table 12
Comparison of evaluation methodologies.

Comparison | GQM | OntoMetric | ROMEO
Analysis relies on human judgements ✓ ✓ ✓
Determines evaluation using goals/objectives ✓ ✓ ✓
Customises appropriate evaluation method for an application ✓ ✓ ✓
Incorporates existing ontology evaluation criteria and measures ✓ ✓
Useful in ontology selection ✓ ✓
Ontology evaluation based on content ✓ ✓
Uses quantitative measures ✓ ✓
Useful in ontology refining ✓ ✓
Useful in building ontologies ✓ ✓


decision tree. The top-down approach of breaking down the more conceptual objectives into specific characteristics and measures is similar to how ROMEO considers requirements and their relevant measures.

The difference between OntoMetric and ROMEO, however, lies in the way ontology criteria, characteristics and qualities are associated with specific measures. In ROMEO, this is achieved through the specification of questions, whereas in OntoMetric an entire criterion is associated with an objective, which requires a complete understanding of the criterion to apply it effectively. We have found that the use of questions offers a more flexible way to determine relevant ontology qualities or characteristics. For example, a criteria–question may be used initially and then customised for a specific requirement. Questions are also not limited to the set of criteria–questions available: users of ROMEO may specify a question outside the existing set of criteria–questions to suit a particular requirement.

Regarding the use of measures, OntoMetric adopts a linguistic scale for each measure considered and, although not designed for refining or building new ontologies, OntoMetric could be used for this purpose. In comparison, the ROMEO methodology associates relevant measures with each question identified. The measures are used to carry out quantitative measurements and collect data for answering a question; thus they help to determine whether a given requirement is met. Quantitative measurements are preferable, as linguistic scales are subjective, as discussed in Section 2.2.2.

Table 12 summarises the comparison of GQM, OntoMetric and ROMEO.

4. Wikipedia

Wikipedia is a free, online encyclopedia begun in 2001. It has grown into a large collection of articles on various topics and now has instantiations in 253 different languages. The English version of Wikipedia^1 has over 2.3 million articles^2 which are created, verified, updated and maintained by volunteer editors and authors worldwide.

^1 http://en.wikipedia.org
^2 http://en.wikipedia.org/wiki/Wikipedia as of January 2008.

Wikipedia deviates in many ways from a conventional encyclopedia. Wikipedia relies on a wiki for content authoring, which enables documents to be authored in a collaborative fashion using simple formatting markup. Wikipedia allows a variety of access methods to its collection of articles, for example, by search, through portals, and by browsing the category structure. Wikipedia's category structure deviates from conventional category structures in that it is defined in a bottom-up manner, by allowing users to attach a set of category associations to articles.

In this section, we describe the application of browsing Wikipedia articles using categories and the role of ontologies in this application, which will be used from Section 4.2 onwards to determine an appropriate ontology evaluation method using the ROMEO methodology. We limit our discussion to the English instantiation of Wikipedia. First, in Section 4.1, we give a brief background to Wikipedia, the article collection of the English instantiation, and various methods for navigating the article collection. We discuss the methods of exploring the article collection, and highlight one method of browsing articles, namely using the Wikipedia category structure to browse the article collection. We then examine the specific role of the ontology, that is, the category structure, in this application.

4.1. Wikipedia articles and categories

A key aspect of Wikipedia is providing appropriate navigation methods for the article collection. There are numerous ways of exploring the Wikipedia article collection. A common method of accessing article content is to use the search tool in Wikipedia to match on article title and content. Alternatively, users may issue a query on Wikipedia content using an external online search engine.

Users may also navigate using an index, which is designed to help the reader find information using terms and their associated content. The two main indices in Wikipedia are the alphabetical index of articles by title and the Wikipedia category structure. The alphabetical index is an index of all articles sorted alphabetically by title. The category structure is an index of categories, arranged by subject. The Wikipedia category structure is the main topic of our study.


Categories are used to organise articles in Wikipedia topically. However, the category structure was not part of the original design of Wikipedia. There were inadequacies in the reliance on the links in an article, the alphabetical indexing of articles, the use of the search tool, and the reference lists to help users find articles [30]. Thus, in late 2003, a proposal was made for a category structure as an additional method for finding articles. It was subsequently implemented in early 2004, which allowed the category structure to emerge and subsequently evolve.

Categories are created by annotating an article with the category title, as opposed to specifying a category structure independently of articles. Subcategories may be created by annotating a given category page with the parent category. Each category has associated articles and can have multiple parent and child categories. The intention was for users to associate related categories and articles with each other. In 2007, Wikipedia had 1,079,246 articles in 111,287 categories.^3
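A minimal sketch of this bottom-up annotation scheme (all page titles here are hypothetical examples, not drawn from the paper): categories and subcategories arise purely from annotations on pages, and a category's article listing is recovered by inverting those annotations:

```python
# Article pages annotated with category titles.
article_categories = {
    "Ontology (information science)": {"Knowledge representation"},
    "Wikipedia": {"Online encyclopedias", "Wikis"},
}

# Category pages annotated with their parent categories; note that
# "Online encyclopedias" has multiple parents, so the structure is
# a graph rather than a strict tree.
category_parents = {
    "Knowledge representation": {"Artificial intelligence"},
    "Online encyclopedias": {"Encyclopedias", "Websites"},
    "Wikis": {"Websites"},
}

def articles_in(category):
    """Invert the annotations to list a category's associated articles."""
    return {a for a, cats in article_categories.items() if category in cats}

print(sorted(articles_in("Wikis")))  # ['Wikipedia']
```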

4.1.1. Design of category structure

The Wikipedia category structure allows for the navigation of information that would otherwise be difficult to locate by submitting a set of query terms to a search engine, or if the user's information need is vaguely defined. Brent Gulanowski, a contributor who initiated the proposal for a category structure, is quoted below from an archived discussion page in Wikipedia regarding the reason for proposing a category structure for Wikipedia.4

The primary reason for proposing this system is to ensure that a user of the Wikipedia can see, at a glance, all pages which relate to a particular page that they are reading or a topic that they are interested in. Users should be spared the necessity of following links or performing fruitless searches in the cases where their unfamiliarity with a subject means they do not even know the terms for which they are searching.

In addition to helping with navigation, the Wikipedia category structure may help authors and editors to group articles and maintain a certain consistency in style and granularity of content, for example, the headings and content for articles on film actors, and marking articles for review by editors.

4.1.2. Wikipedia category structure as an ontology

Overall, the Wikipedia category structure does not conform to a tree structure or strict hierarchy, as articles may belong to more than one category, and categories may also have multiple parents (although parts of the category structure may be tree-like). The category structure may intersect to allow multiple schemes to co-exist. Parts of the category structure may implement multiple classification systems such that they overlap. These overlapping trees or classifications are allowed in Wikipedia to provide different but valid views on a given set of categories.

3 Source: Category statistics from http://labs.systemone.at/wikipedia3 and article statistics from http://download.wikimedia.org/enwiki accessed August 2007.
4 Source: http://en.wikipedia.org/wiki/User:Brent_Gulanowski/Categorization accessed March 2008.

The Wikipedia category structure is an evolving information hierarchy that includes concepts and relationships between concepts, that is, a set of categories and subcategory relationships. Also, relationships exist between a given article and a set of categories. There are, however, no complex relationships and logical constraints in the Wikipedia category structure. It is a general ontology, in the sense that it has coverage over multiple domains. In Wikipedia, there can be many root categories, or starting points. Category:Categories was the absolute root category for the snapshot of the dataset we used, circa end of 2004 (it has subsequently been replaced by Category:Contents). However, in our dataset we took Category:Fundamental as the root category from which the actual content categories could be accessed. The subcategories of Category:Fundamental were: Information, Nature, Society, Structure, and Thought. These are abstract and may be taken as upper-level concepts, but they encompass general topic areas that a given Wikipedia article may fall under.

As the Wikipedia category structure is used for browsing, its role is to help users navigate articles through an exploratory mode of browsing between categories, even though only some users navigate in this manner. We also include the role of the administration and editing of articles.

4.2. ROMEO ontology requirements for Wikipedia

In this section, we outline the ontology requirements around the activity of browsing for relevant articles using the ROMEO template introduced in Section 3. We present each of the requirements below based on the Wikipedia guidelines for the category structure and use the appropriate ROMEO template to aid our analysis.

For this case study, ontology requirements are elicited directly from the online documentation of the guidelines for the Wikipedia category structure. We specifically use the online documentation that describes Wikipedia and its category structure. These online documents have been established through discussion and scrutiny of the Wikipedia editorial process, which involves Wikipedia authors, users and editors and provides guidelines for creating, organising and implementing categories.5

For this case study, the scope of discussion for the ontology requirements includes guidelines relating to the desired quality of content and structure for the Wikipedia category structure. In eliciting a set of ontology requirements, we exclude the editorial role of the ontology from the ROMEO analysis. We focus on the evaluation of the content of the category structure that is used for browsing articles, rather than categories that are used for administrative and editorial purposes.

5 http://en.wikipedia.org/wiki/Wikipedia:Categorization


The overall requirement with regards to the content and structure of the ontology is to "make decisions about the structure of categories and subcategories that make it easy for users to browse through similar articles" [29]. From this requirement, there are several guidelines that editors outline in the online documentation regarding the category structure. We summarise specific guidelines regarding the content of the Wikipedia category structure below and determine whether they are applicable as ontology requirements.

Categories do not form a tree: Editors of Wikipedia do not impose a strict hierarchy on the Wikipedia category structure, so that any interesting and potentially useful category associations may be incorporated [29]. The guideline from Wikipedia is that "Categories do not form a tree"; that is, the category structure is intended to be more like a directed acyclic graph that includes the intersection of various categories, although in practice cycles may exist. This allows multiple views to co-exist. However, the usefulness of such an intersection of categories depends on having an adequate level of intersection without so much intersection that it impedes user navigation. The structure is intended to give an intersection of domains that makes it interesting, sensible and useful for browsing. Thus, we refine the guideline of "Categories do not form a tree" as the ontology requirement OR1 of having an adequate level of category intersection.

Categories should be appropriately grouped: This guideline of appropriate grouping of categories is about ensuring that the category structure is factual, rich, interesting, and useful for browsing articles. Standard classifications, such as those used in science, may be incorporated and adapted to accommodate alternate views, as long as they achieve the main purpose of aiding users to browse. Another suggested method for grouping categories is a functional one, where related categories are grouped on a function or theme; for example, the category of "World War II" has parent categories of "Contemporary French history" and "Nuclear warfare": both are different classifications but are combined to give an interesting categorisation using a common intersecting category. As this relates to the structure and content of the category structure, we consider as ontology requirement OR2 the quality of the categorisation and in particular how categories should be grouped, that is, the organisation of parent and child categories for each category.

Cycles should usually be avoided: This guideline relates to the relationships between a set of categories and its subcategories. Cycles may occur for various reasons. The simplest way for a cycle to occur is when a given category is made a subcategory of its child. Although there may be cases where cycles are useful, they generally impede the usability of the Wikipedia category structure, for example, with users getting lost in sequential browsing of a particular cycle of categories. Cycles also impede certain automated processing of the subcategories using computer programs [29]. Thus, there is a need to detect and avoid cycles that are not useful. As this relates to the structure of the category structure, we consider this to be ontology requirement OR3 of avoiding cycles in the category structure.
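The detection that OR3 calls for can be sketched as a depth-first search over the category-to-subcategory graph, reporting the first cycle found. The function name and toy categories below are illustrative, not part of the original study.

```python
# Hypothetical sketch: finding a cycle among category -> subcategory links.

def find_cycle(subcats):
    """Return one category cycle as a list, or None if the graph is acyclic.

    subcats maps a category to the list of its subcategories.
    """
    WHITE, GREY, BLACK = 0, 1, 2              # unvisited / on stack / finished
    colour = {c: WHITE for c in subcats}
    parent = {}

    def visit(start):
        stack = [(start, iter(subcats.get(start, [])))]
        colour[start] = GREY
        while stack:
            node, children = stack[-1]
            child = next(children, None)
            if child is None:                 # all subcategories explored
                colour[node] = BLACK
                stack.pop()
                continue
            if colour.get(child, WHITE) == GREY:   # back edge: cycle found
                cycle = [node]
                while cycle[-1] != child:
                    cycle.append(parent[cycle[-1]])
                cycle.reverse()
                return cycle
            if colour.get(child, WHITE) == WHITE:
                colour[child] = GREY
                parent[child] = node
                stack.append((child, iter(subcats.get(child, []))))
        return None

    for cat in subcats:
        if colour[cat] == WHITE:
            cycle = visit(cat)
            if cycle:
                return cycle
    return None
```

For example, `find_cycle({"Foods": ["Beverages"], "Beverages": ["Foods"]})` reports the two-category cycle, while an acyclic structure yields `None`.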

Appropriate categorisation of articles: The guidelines recommend restraint on the number of categories attributed to an article [29]. An article may have too many categories associated with it due to the following reasons:

(1) Possible lack of an appropriate set of categories for the given article.
(2) Incorrect association of categories to the article.
(3) Violation of Neutral Point of View (NPOV), that is, association of articles to categories based on opinion or bias.
(4) Spam, that is, deliberate vandalism of the article.

Item 1 relates to a possible lack of content in the Wikipedia category structure. An inadequate set of categories may lead to articles being placed in too many categories. As such, the specification of new categories may be required. This item forms ontology requirement OR4 of ensuring the set of categories available is complete. Item 2 relates to the incorrect association of a set of categories to a given article. Items 3 and 4 violate Wikipedia's policies of ensuring NPOV and verifiability, respectively. These items relate to the correctness of the defined relationships between article and category within the category structure. Thus, this is ontology requirement OR5 of ensuring the set of categories associated is correct.

In summary, we identify five ontology requirements:

OR1: adequate level of category intersection;
OR2: how categories should be grouped;
OR3: avoiding cycles in the category structure;
OR4: ensuring the set of categories available is complete;
OR5: ensuring the set of categories associated is correct.

In the remainder of Section 4.2, we explain the application of each step in the ROMEO methodology to the first of these ontology requirements; we then list the resulting questions and measures for all the ontology requirements.

Table 13 presents the ROMEO analysis for ontology requirement OR1 using the requirements description template.

4.3. Questions

In considering the ontology requirement OR1 outlined in the previous section, we present below a question that helps to address that requirement. Accompanying the question is a discussion of how we arrived at it and some qualities to consider for this requirement.

OR1 is not just about how intersected an ontology is. A highly intersected ontology would reduce the quality of the ontology and would not be useful for users browsing it. Conversely, an ontology with very little category intersection for Wikipedia may not be as rich and interesting for users browsing it. The key for this requirement is determining whether there is adequate intersectedness in the category structure, as emphasised by the question presented for this requirement in Table 14. Intersectedness does not appear in the list of existing ontology evaluation criteria and there is no defined criteria-question for this quality; it was discovered by applying the ROMEO methodology for this case study.

Table 13. Ontology requirement 1.

Requirement OR1: Adequate level of category intersection.
Analyze: Wikipedia Category Structure
For the purpose of: Evaluating the Category Structure
With respect to: Adequate level of category intersection
From the viewpoint of: Wikipedia Users, Editors and Authors
In the context of: Browsing Wikipedia Categories
Motivation: Having an intersecting category structure facilitates user navigation of articles based on the given information need through the ability to browse alternate but somewhat related domains. In doing so, users may browse useful articles that they did not expect to encounter. For example, if we consider the Food and Drink category, it may intersect with categories such as Culture, Health, and Digestive system. All are related and may be in the mind of users as they browse.

Table 14. Questions for OR1.

Questions for OR1: Adequate intersection of categories.
Q1: Does the category structure have an adequate intersection of categories?
Discussion: An adequate degree of intersectedness is sought in Q1. Q1 considers the degree of intersectedness in the category structure. Having too little category intersection may result in an increased amount of browsing, which impedes the task of finding information. Conversely, if a given category is completely connected to all other concepts in the category structure, then the resulting category structure is as good as an index of all available categories, which is not useful.

Table 15. Measures for Q1.

Measures for Q1: Does the category structure have an adequate intersection of categories?
M1: Tangledness: t(Oc)
Suitable range of values: between 0.2 and 0.8
Range: 0–1
Discussion: Tangledness measures the ratio of categories with multiple parents. This measure may help us understand how intersected the category structure is. Measurements for this measure are values between 0 and 1, indicating the level of category intersection (0 indicating no intersection and 1 indicating a completely intersected ontology). Using this measure, we evaluate the threshold of an appropriate value for indicating adequate intersection of categories. The optimal value for this measure is yet to be determined, but having considered some categories and their tangledness values, between 0.2 and 0.8 is the proposed threshold.

Table 16. Questions and corresponding measures for OR1 to OR5.

Question for OR1: Does the category structure have an adequate intersection of categories?
M1: Tangledness: t(Oc). Suitable range of values: between 0.2 and 0.8. Range: 0–1.

Question for OR2: Does the ontology capture the relationships between concepts of the domain correctly?
M2: Meta-consistency: ratio of meta-consistent concepts (Oc). Optimal value(s): 1. Range: 0–1.

Question for OR3: Are there any cycles currently?
M3: Consistency: circularity error at distance 0. Optimal value(s): 0. Range: 0–1.
M4: Consistency: circularity error at distance 1. Optimal value(s): 0. Range: 0–1.
M5: Consistency: circularity error at distance N. Optimal value(s): 0. Range: 0–1.

Question for OR4: Does the ontology have concepts missing with regards to the relevant frames of reference?
M6: Completeness: incomplete concept classification (O, F).
M7: Completeness: average-recall (Oc, F). Optimal value(s): 1. Range: 0–1.

Question for OR5: Is the set of categories correctly associated with a given article?
M8: Number of non-related associations between categories and articles. Optimal value(s): 0. Range: 0–1.
M9: Number of NPOV violations in relationships between categories and articles. Optimal value(s): 0. Range: 0–1.

4.4. Measures

In this section, we associate one or more measures with each of the questions specified in the previous section. Each measure seeks to answer its question quantitatively. Accompanying each description of a measure is a discussion summarising its intent, an optimal value or range of values, and the range of possible values for the measure. The proposed measure for Q1 is shown in the ROMEO template in Table 15.

For Q1, the appropriate measure is tangledness, and we use the formula for tangledness defined in Section 2 in Eq. (5). Tangledness gives the ratio of categories with multiple parents to the total number of categories in the structure. The optimal value for the tangledness of a category structure for Wikipedia may vary and is dependent on a given domain. Overall, the Wikipedia category structure has a tangledness value of 0.69. However, for different sets of subcategories of the category structure, this value may vary from 0.2 to 0.8. What may be an optimal value in one domain may not apply in another domain, so we suggest a range of values for tangledness of 0.2–0.8 to give a reasonable bound for a desirable tangledness value.

Fig. 2. White-box validation process.

Fig. 3. Black-box validation process.
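The description above (the proportion of categories with more than one parent) can be sketched directly; the function name and toy data below are ours, with the toy data mirroring the Foods/Barley excerpt used later in the paper.

```python
# Illustrative sketch of the tangledness measure t(O): the ratio of
# categories with multiple parents to the total number of categories.

def tangledness(parents):
    """parents maps each category to the set of its parent categories."""
    if not parents:
        return 0.0
    multi = sum(1 for ps in parents.values() if len(ps) > 1)
    return multi / len(parents)

structure = {
    "Foods": set(),                          # root category
    "Staple foods": {"Foods"},
    "Cereals": {"Foods"},
    "Barley": {"Staple foods", "Cereals"},   # two parents => tangled
}
print(round(tangledness(structure), 2))  # 0.25
```

With one of four categories having multiple parents, the measure is 0.25, which falls inside the proposed 0.2–0.8 range.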

Applying the ROMEO analysis to all the ontology requirements leads to the measures listed in Table 16. Some of these measures were listed earlier in the paper, but M8 and M9 are new measures that are specific to the application requirements of Wikipedia for its category structure. Unfortunately, several of these measures cannot be automated and would require manual inspection of categories and their associated articles.

5. Validation process

The validation process is used to verify the mappings of a requirement to a question, and of a question to a measure, that are established using the ROMEO methodology. The process validates a mapping using a set of tasks carried out in a controlled experimental setup to compare the performance of a given task on a set of ontologies. The validation environment may use an ontology-driven application to benchmark the performance of each task, or direct measures to compare each ontology for each task. This step of validation is separate from the ROMEO methodology. However, it complements the ROMEO methodology in that it helps to validate the mappings used in the ROMEO analysis. Carrying out these validation experiments allows the ontology engineer to observe the actual working of the measures, that is, whether the right measures are used to measure the ontology characteristics from the ROMEO question and measure mapping.

5.1. The validation environment

The validation of a given mapping may be carried out using one of two types of validation experimental setup. The first type of experimental setup examines aspects of the ontology directly, like a white-box test, for example, matching the concepts of an ontology with concepts in a given domain. Fig. 2 illustrates an overview of this type of validation, where a set of tasks is performed in a validation environment on a set of ontologies that is varied according to a specific ontology characteristic. We then analyse measurements from the validation experiment.

The other type of experimental setup compares the performance of a set of tasks carried out on an ontology-driven application, like a black-box test. Fig. 3 illustrates this type of validation process, where a set of tasks is performed in a validation environment, which includes the use of a set of ontologies that is varied according to a specific ontology characteristic. The performance of each task is then benchmarked against each other. We then compare the effect that each ontology had on the tasks performed. We specifically examine whether an ontology that varies on a specific characteristic helps with the tasks. An experiment conducted in this way performs an indirect measurement of the ontology characteristic by varying the base ontology. This seeks to observe the performance of the base ontology and its variants in a given application environment.

5.2. Obtaining comparable ontologies

The set of ontologies used in empirical validation experiments ought to be comparable; for example, if we were validating the measure of an ontology's coverage, comparable ontologies should describe the same domain, rather than a variety of domains. They should vary according to the characteristic being considered in a way that is independent of any other characteristics.

The problem with obtaining comparable ontologies from a collection of existing ontologies is that they are not always available. Furthermore, concepts and relations in the ontologies may be vastly different, as they may include aspects of other domains or have different levels of detail. Hence, we may not be able to make a fair comparison of these existing ontologies.

A solution to the problem of obtaining comparable ontologies is to vary an ontology to derive a set of ontologies. This solution takes a candidate ontology and systematically alters it to vary according to a given characteristic or measure; for example, a given ontology may be altered to have an increased size. The drawback is that this approach may not be possible in every case. Not all ontology characteristics can be varied systematically. For example, ontology breadth is difficult to alter unless more children are introduced for a set of concepts or there is an increase in the set of concepts to allow for more breadth. Nevertheless, for characteristics that can be varied systematically, the set of ontologies produced can be used to compare the performance of ontologies via empirical validation experiments.

Fig. 4. Black-box validation process for the tangledness measure on Wikipedia category structure.


5.3. Select appropriate tasks and benchmarking standards

In carrying out a task-based approach to ontology evaluation, we propose to model the task on the browsing of an information space using a given category structure, much in the same way users would when browsing categories in Wikipedia. In this section, we describe the dataset used and the ontologies taken from Wikipedia's category hierarchy, present the experimental design for the evaluations, and present outcomes from a user study we undertook.

6 http://download.wikimedia.org/enwiki
7 http://labs.systemone.at/wikipedia3

5.4. Validating intersectedness mapping

We validate a mapping of a question regarding adequate category intersection to the measure of tangledness. This mapping is taken from the ROMEO analysis performed on the Wikipedia application in Section 4 and is shown in Table 15.

This question examines the adequate intersection of categories in the ontology, or in this case, the category structure. The appropriate measure for this question is tangledness, as determined in the ROMEO analysis of Wikipedia considered in Section 4. The tangledness measure is defined as the proportion of nodes in the graph that have more than one parent (as defined in Eq. (5)).

The validation experiment involves a group of users browsing an information space using a given category structure, much in the same way users would when browsing categories in Wikipedia. The set of tasks used includes different browsing task types. As we are examining tangledness, the set of ontologies we use in this experiment varies in tangledness. If this mapping is valid, we will observe a corresponding variation in the performance on the tasks carried out.

In this section, we present the experimental design for the evaluations that are undertaken; specifically, we describe the dataset used and the ontologies that were taken from Wikipedia's category structure. Last, we present outcomes from a user study we undertook for validating our mapping of the adequate category intersection question to the tangledness measure.

5.4.1. Experimental setup

The experimental setup used here compares the performance of a set of ontologies that vary in tangledness. This experiment involves a user study that observes the performance of users browsing Wikipedia articles using the set of varied ontologies. The goal was to examine the browsability of an ontology in completing a range of tasks to find information in the set of Wikipedia articles, that is, the ability of users to locate information by browsing using the category structure for articles about a given information need. The set of Wikipedia articles is taken directly from the Wikipedia database, which is available online.6 Fig. 4 summarises our task-based evaluation for validating this ROMEO mapping.

Ontologies used: For this user study, we used the category structure and the associated set of articles from the English language version of Wikipedia. The Wikipedia category structure is obtained from System One's RDF representation of it, also available online.7 This was constructed from the Wikipedia database dated March 2006, in which each article and category is represented as an RDF triple with category and inter-article relations. The relations represented in the Wikipedia categories are category–subcategory, category–article and article–article

relations. For a given category, no restrictions are put on the number of parent and subcategories; there may be multiple parent and child categories. Also, there are no restrictions on the number of categories to associate an article with (as long as they are related). However, there are some limitations with regards to the Wikipedia categories. Some categories are administrative in nature, for example, "Sporting stubs": an article in such a category has little information written for it but has been linked from another article previously written. Also, a given article may not have any categories associated with it, which means that some articles are not viewable by navigating the category structure. Despite this, the Wikipedia category structure is a rich organisation, and is used here as the basis for the validation experiment. In processing the categories, we traversed the subtree in breadth-first fashion starting from the category "Category:Fundamental", which we take to be the root of the content section, to obtain a set of measures of the Wikipedia category structure. We present these measures in Table 17.

Table 17. Measures for the Wikipedia category structure.

Number of categories: 111,287
Number of articles: 1,079,246
Average articles per category: 25.7
Categories with multiple parents: 7788
Number of parents: 23,978
Average no. parents: 2.0
Maximum parents for any given child: 56
Number of leaf categories: 87,309
Average no. children: 4.64
Maximum children: 1760
Average breadth: 8559.5
Maximum breadth: 33,331
Average depth: 5.8
Maximum depth: 13
Fanout factor: 0.78
Tangledness: 0.69

8 http://glaros.dtc.umn.edu/gkhome/views/cluto

From Table 17, we observe that the Wikipedia categories have a ratio of about 1:10 between the number of categories and their associated articles. Also, the category structure is not deep considering the number of articles and categories, with 14 levels. Instead, it is quite broad, with an average breadth of 8559.5 at a given level. The overall Wikipedia category structure is also quite tangled, with 69% of all Wikipedia categories having multiple parents.
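A breadth-first traversal of the kind described above could compute the per-level statistics behind Table 17 (depth and breadth) roughly as follows; this is a sketch under our own naming and toy data, not the authors' code.

```python
from collections import deque

# Sketch: profile a category subtree by breadth-first traversal from the
# root, counting how many categories are first reached at each depth.

def level_profile(subcats, root):
    """Return {depth: number of categories first reached at that depth}."""
    seen = {root}
    levels = {0: 1}
    frontier = deque([(root, 0)])
    while frontier:
        cat, depth = frontier.popleft()
        for child in subcats.get(cat, []):
            if child not in seen:             # first visit wins (BFS => shortest depth)
                seen.add(child)
                levels[depth + 1] = levels.get(depth + 1, 0) + 1
                frontier.append((child, depth + 1))
    return levels

# Invented miniature category structure for illustration.
subcats = {
    "Fundamental": ["Information", "Nature", "Society"],
    "Nature": ["Foods"],
    "Foods": ["Staple foods", "Cereals"],
}
profile = level_profile(subcats, "Fundamental")
max_depth = max(profile)            # deepest level reached
max_breadth = max(profile.values()) # widest level
```

Maximum depth and maximum breadth then fall out as the largest key and the largest value of the profile, respectively.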

For this user study, we needed to obtain a set of ontologies that vary in tangledness. Additionally, these ontologies had to be based on the original subtree, be semantically reasonable, utilise all the categories in the subtree, and be comparable to the original subtree. We were faced with two options: either vary the original Wikipedia subtree, or generate a subtree category structure according to an automated technique, which was a variation on a document clustering technique. We carried out both methods for producing untangled ontologies in this experiment. We present the two methods below.

Method for removing tangledness: Removing tangledness meant removing occurrences of multiple parents for a given category. The specific algorithm we used was Dijkstra's algorithm for finding a single-source shortest path tree. This is the most appropriate shortest path algorithm available, as we know the root of the subtree. Where there was more than one candidate parent category, we chose the category that was most similar to the category being considered. For the similarity measure here, we used the cosine similarity from TF-IDF measures of the article titles within the categories considered. We found this kept the subtree semantically equivalent. For example, untangling the excerpt of the Foods subtree in Fig. 5(a) resulted in Fig. 5(b). In this example, the relationship between Cereals and Barley is omitted to remove tangledness, since the shortest path from the Foods category to Barley is via the Staple Foods category.
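Under the stated assumptions (known root, shortest paths, similarity as tie-breaker), the untangling step might be sketched as below. The `similarity` argument is a placeholder for the paper's TF-IDF cosine similarity over article titles, and the toy data mirrors the Foods/Barley excerpt of Fig. 5; the function names are ours.

```python
from collections import deque

# Hedged sketch: keep, for each multi-parent category, only the parent on a
# shortest path from the root, breaking ties with a similarity score.

def untangle(parents, root, similarity):
    """parents maps each category to the set of its parent categories.
    Returns a tree where every non-root category keeps exactly one parent."""
    # Invert the edges and compute BFS distances from the root.
    children = {}
    for cat, ps in parents.items():
        for p in ps:
            children.setdefault(p, []).append(cat)
    dist = {root: 0}
    q = deque([root])
    while q:
        cat = q.popleft()
        for child in children.get(cat, []):
            if child not in dist:
                dist[child] = dist[cat] + 1
                q.append(child)
    # Keep one parent per category: minimal distance, then maximal similarity.
    tree = {root: set()}
    for cat, ps in parents.items():
        if cat == root or not ps:
            continue
        best = min(ps, key=lambda p: (dist.get(p, float("inf")),
                                      -similarity(p, cat)))
        tree[cat] = {best}
    return tree
```

On the Foods excerpt, with a similarity function that rates Staple foods as the closer parent of Barley, the Cereals-to-Barley link is dropped, matching the outcome described for Fig. 5(b).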

Method for generating subtrees: For a given subtree of the Wikipedia category hierarchy, we removed all category relations from it and applied a document clustering technique over the categories contained in the base subtree. We used partitional-based, criterion-driven document clustering based on features gathered from a combination of the category title and associated article information [34], provided in the Cluto clustering toolkit.8

Algorithm 1. Varying a subtree

Let N := maximum number of elements in a cluster
Add root of subtree to queue q
repeat
    Let cl := next item in q
    Obtain clusters I for cl from elements in its cluster
    for all i in I do
        Nominate element in i as representative category r for i
        Add r as subcategory of cl
        Let clustersize := number of elements in cluster cl − 1
        if clustersize >= N then
            Add i to queue
        end if
    end for
until queue has no more clusters to process
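One possible Python reading of Algorithm 1 follows. Here `cluster` and `representative` are stand-ins for the Cluto-based clustering and representative-selection steps, which the paper does not spell out, and attaching small clusters directly as leaves is our assumption about the base case.

```python
from collections import deque

# Hypothetical sketch of Algorithm 1: recursively cluster a flat set of
# categories into a tree, nominating one category per cluster as its parent.

def build_subtree(categories, root, cluster, representative, max_size):
    """Generate a category tree by recursively clustering `categories`.

    cluster(members) -> list of lists (a partition of members)
    representative(group) -> the category nominated to head the group
    max_size: clusters at least this large are clustered again
    """
    tree = {root: []}
    queue = deque([(root, categories)])
    while queue:
        parent, members = queue.popleft()
        for group in cluster(members):
            rep = representative(group)           # nominated subcategory
            tree.setdefault(parent, []).append(rep)
            rest = [c for c in group if c != rep]
            if len(rest) >= max_size:             # large cluster: recurse
                queue.append((rep, rest))
            else:                                 # small cluster: leaves
                tree[rep] = rest
    return tree
```

With a trivial clustering function that splits a list in half, four categories under a root yield a two-level tree, which illustrates the queue-driven recursion of the pseudocode.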

We used the category title and clustered on a few varying data parameters: category title, category title and the associated article titles, and category title and the associated article text. We also varied the clustering technique based on the number of features considered and also the resulting number of clusters in each clustering event. We used the cosine similarity function for this.

Using the two methods discussed above, we obtained from the original subtree two varied subtrees b and c for a given domain, where

a: Wikipedia original.
b: Wikipedia original (tangledness removed).
c: Generated (untangled).

Tasks and domains: Below we outline the tasks and domains in our user studies. First, we discuss some background on browsing activity from the literature, and determine what types of tasks are appropriate to perform in the validation experiment.

Browsing: A browsing activity is different from a search activity. Both have goals in mind; however, Baeza-Yates and Ribeiro-Neto [1] differentiate search from browse by the clarity of user goals. In search, users enter into a system keywords that are related to their information need. They are then presented with a list of results the system returns as similar, and users can decide to select one or more of the relevant results or refine their search query. In comparison, browsing is a different type of behaviour. There may not be a specific query as such associated with it. However, answers to user goals and information needs can be readily recognised in a browsing activity. Thus, the clarity and mode of accessing this information differs in browsing.

Fig. 5. Removing tangledness example. (a) Foods subtree excerpt (original). (b) Foods subtree excerpt (untangled).

Table 18. Tasks used for comparing ontologies for browsing articles in Racing Sports/Foods domains.

Racing Sports (X): T1: International racing competitions; T2: Racing sports without wheeled vehicles; T3: Makers of F1 racing cars
Foods (Y): T4: Non-alcoholic beverages; T5: Different cuisines of the world; T6: Wine regions in Australia

Table 19. Task sequence of Tasks 1–6 (t1–t6) for each user in the validation user experiment comparing subtrees a (base), b (untangled) and c (generated). Participants on the left completed domain X then Y; participants on the right completed Y then X.

Subtree order a, b, c:
User 1: t1 t2 t3 | t4 t5 t6    User 10: t4 t5 t6 | t1 t2 t3
User 2: t2 t3 t1 | t5 t6 t4    User 11: t5 t6 t4 | t2 t3 t1
User 3: t3 t1 t2 | t6 t4 t5    User 12: t6 t4 t5 | t3 t1 t2

Subtree order b, c, a:
User 4: t1 t2 t3 | t4 t5 t6    User 13: t4 t5 t6 | t1 t2 t3
User 5: t2 t3 t1 | t5 t6 t4    User 14: t5 t6 t4 | t2 t3 t1
User 6: t3 t1 t2 | t6 t4 t5    User 15: t6 t4 t5 | t3 t1 t2

Subtree order c, a, b:
User 7: t1 t2 t3 | t4 t5 t6    User 16: t4 t5 t6 | t1 t2 t3
User 8: t2 t3 t1 | t5 t6 t4    User 17: t5 t6 t4 | t2 t3 t1
User 9: t3 t1 t2 | t6 t4 t5    User 18: t6 t4 t5 | t3 t1 t2

Marchionini [19] outlines the range of browsing types from a directed browse to an undirected browse. A directed or intentional browsing behaviour is usually associated with tasks that are closed or specific. These refer to a task where there are usually not more than a few answers to the information need. On the other hand, an undirected browse is exploratory in nature, and this browsing behaviour is associated with tasks that are more open or broad. These refer to a task where there may be many answers to the information need.

The efficiency of browsing is affected by the user's knowledge of the domain and the specificity of the browse task. It is characterised by movement. Thompson and Croft [27] describe browsing as an "informal or heuristic search through a well connected collection of records in order to find information relevant to one's need". In a browsing activity, users evaluate the information that is currently displayed, its value to their information need, and what further action to take.

The type of browsing that users would perform on the Wikipedia category structure is "Structure Guided Browsing" [1] and can encompass broad and undirected browsing through to directed and specific browsing. Thus, the tasks that were used incorporated the different kinds of browsing behaviour discussed above.

Browsing tasks: Each participant was given a set of tasks to complete within a 10 min duration (inclusive of pre- and post-task questions). The given tasks were domain specific, and hence would not be comparable in other domains. We chose domains that were as separate from each other as possible so as to reduce the learning effect from completing tasks on a given domain. Also, we chose three levels of specificity for the tasks (see Table 18). We proposed Tasks 1–3 and 4–6 to have increasing levels of specificity, from broad to specific, in their respective domains X and Y. For example, International racing competitions (Task 1) covered a broad range of possible answers within the Racing Sports domain (X), whereas Makers of F1 racing cars (Task 3) was a very specific task in the same domain.

Table 19 outlines the task sequence for each user for the experiment we used to compare various aspects of the generated subtrees. In Table 19, the original Wikipedia


Table 20. Characteristics of the set of subtrees used.

Measure                              Racing Sports (X)       Foods (Y)
Number of categories                 1185                    642
Number of articles                   18,178                  12,630
Average articles per category        15.3                    19.7

                                     Subtree                 Subtree
                                     a      b      c         a      b      c
Levels                               7      7      4         9      9      3
Number of parents                    305    213    292       187    135    51
Categories with multiple parents     293    0      0         132    0      0
Average no. parents                  1.3    0.995  0.997     1.2    0.993  0.981
Max. no. parents for a given child   5      1      1         4      1      1
Leaf nodes                           880    972    893       455    507    580
Average children                     4.9    5.6    4.1       4.2    4.8    12.4
Maximum children                     54     53     20        48     48     51
Tangledness                          0.25   0.00   0.00      0.21   0.00   0.00


subtree a is compared with the same subtree altered to remove multiple parents (subtree b), hence being untangled. We considered an additional subtree for this experiment, which appears as subtree c. This subtree was generated using a document clustering technique.
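The paper does not give the untangling procedure itself; a minimal sketch of one plausible approach is shown below. The category data is a hypothetical excerpt, and the keep-first-parent policy is an assumption, not necessarily the authors' method:

```python
def untangle(parents):
    """Reduce a tangled category structure (a DAG) to a tree by keeping a
    single parent for every multi-parent category.
    `parents` maps each category to the list of its parent categories.
    Policy here: keep the first listed parent (illustrative only)."""
    return {child: plist[:1] for child, plist in parents.items()}

# Hypothetical excerpt of a tangled Foods subtree:
tangled = {
    "Wine": ["Beverages", "Grape varieties"],   # two parents: tangled
    "Beverages": ["Foods"],
    "Grape varieties": ["Foods"],
}
untangled = untangle(tangled)
assert all(len(p) <= 1 for p in untangled.values())  # no multiple parents remain
```

Any single-parent selection rule yields an untangled structure; which parent is kept determines where multi-parent categories such as Wine end up in the resulting tree.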

For this experiment, we used the Latin squares method to determine the order in which participants used the subtrees being compared. We did this to remove the learning factor of users progressing from one subtree to another in a given domain. Using this configuration, each user has a unique task sequence. We also applied blocking on the domain. Last, we rotated the domain after nine users.
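The rotation pattern in Table 19 (subtree order rotated per block of three users, task order rotated within each block) can be generated from two cyclic Latin squares. A sketch for Users 1–9, who start with domain X:

```python
def latin_square_schedule(items):
    """Return the cyclic Latin square of a list: row i is the list
    rotated left by i positions, so every item appears once per column."""
    return [items[i:] + items[:i] for i in range(len(items))]

subtree_orders = latin_square_schedule(["a", "b", "c"])
task_orders = latin_square_schedule(["t1", "t2", "t3"])

# One block of three users per subtree order; task order rotates within
# each block, mirroring Users 1-9 of Table 19 (domain X first).
schedule = []
for block, subtrees in enumerate(subtree_orders):
    for row, tasks in enumerate(task_orders):
        user = block * 3 + row + 1
        schedule.append((user, list(zip(subtrees, tasks))))

# e.g. User 4 (second block) does t1 on b, t2 on c, t3 on a, as in Table 19.
```

Rotating the domain for Users 10–18 is then just applying the same schedule with the domains swapped.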

5.4.2. Analysis of varied ontologies

After varying the subtree for each of the two domains, we took measurements on these to analyse the changes, and present them in Table 20. The Racing Sports domain (X) has 1185 categories; the Foods domain (Y) has 642 categories. These were ideal sizes for the time given to each user to browse through, in that they were sufficiently large that users would probably not look at all categories. The average numbers of articles per category are 15.3 and 19.7, respectively.

We observe that for each domain, subtree b does not have any multiple parents. Having an untangled subtree reduces the total number of parents compared with the original Wikipedia subtree (subtree a). The generated subtrees (c) had fewer levels as they were generally broader than the others. The effect of this is presenting the user with about twice as many narrower category links compared with the other subtrees. Figs. 6(a) and (b) present a visualisation of the original and untangled subtrees used in the two domains.9

9 OWL source for these subtrees is available from http://www.cs.rmit.edu.au/~jyu/wikipedia-userstudies/owl/.
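The tangledness figures in Table 20 are consistent with defining tangledness as the fraction of categories that have more than one parent (293 of 1185 ≈ 0.25 for Racing Sports; 132 of 642 ≈ 0.21 for Foods). A sketch under that assumed definition, with synthetic data shaped to match the Racing Sports counts:

```python
def tangledness(parents):
    """Fraction of categories with more than one parent.
    `parents` maps category -> list of parent categories (root: empty list).
    Assumed definition, inferred from Table 20's figures."""
    n = len(parents)
    multi = sum(1 for plist in parents.values() if len(plist) > 1)
    return multi / n if n else 0.0

# Synthetic structure matching Table 20's Racing Sports counts:
# 293 of 1185 categories have two parents.
racing = {f"c{i}": (["p1", "p2"] if i < 293 else ["p1"]) for i in range(1185)}
assert round(tangledness(racing), 2) == 0.25
```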

5.4.3. Benchmarking

To benchmark the performance of users with regard to browsing and marking relevant articles for a given task, we observed the browsing efficiency and effectiveness of users. For efficiency, we looked at the number of backtracking actions a user performs. Included are the number of clicks a user made to:

(1) go back to the previous category;
(2) go back to the top category;
(3) click on a past category or article from history links.
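The three backtracking actions above could be tallied from a click log along the following lines; the event names and log format are hypothetical, not taken from the paper's instrumentation:

```python
# Hypothetical event names for the three backtracking actions listed above.
BACKTRACK_EVENTS = {"back_to_previous", "back_to_top", "history_link"}

def backtrack_stats(click_log):
    """Return (backtracking clicks, percentage of all clicks that are
    backtracks) for one user's task. `click_log` is a list of event names."""
    total = len(click_log)
    backtracks = sum(1 for event in click_log if event in BACKTRACK_EVENTS)
    pct = 100.0 * backtracks / total if total else 0.0
    return backtracks, pct

log = ["category", "category", "back_to_previous", "article",
       "category", "history_link", "article", "mark_relevant"]
assert backtrack_stats(log) == (2, 25.0)  # 2 backtracks out of 8 clicks
```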

For effectiveness, we considered the number of relevant articles users marked for each task. For each article marked, we evaluated the marked article as:

Not relevant: does not relate to the task.
Somewhat relevant: has bits of relevant information.
Mostly relevant: most of the article is relevant.
Definitely relevant: all of the article is relevant.

Significance testing: For significance testing, we used a two-tailed unpaired unequal-variance t-test. The p-value indicates the probability that the values for the users' performance for the specific comparison are from the same distribution. We consider the performance on a given subtree to differ from another with statistical significance if the p-value is lower than 0.05; that is, there is less than a 5% chance that the two distributions are from the same population.
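The two-tailed unpaired unequal-variance t-test is Welch's t-test. A stdlib sketch of the statistic and the Welch–Satterthwaite degrees of freedom is below; the per-user counts are hypothetical, and in practice `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the two-tailed p-value directly:

```python
import math
from statistics import mean, variance

def welch_t(xs, ys):
    """Welch's unpaired unequal-variance t statistic and the
    Welch-Satterthwaite degrees of freedom. The two-tailed p-value is
    then read from the t distribution with that many degrees of freedom."""
    nx, ny = len(xs), len(ys)
    vx, vy = variance(xs), variance(ys)          # sample variances
    se2 = vx / nx + vy / ny                      # squared standard error
    t = (mean(xs) - mean(ys)) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

# Hypothetical per-user counts of definitely relevant articles found
# on subtree a vs subtree b for one task (six users each).
t, df = welch_t([8, 7, 9, 6, 8, 10], [11, 12, 13, 10, 11, 12])
```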

5.4.4. Results

We present the results of the user study experiment below. Specifically, we present the major performance comparisons for tasks that users completed using the set of ontologies. The results for the user study experiments


Fig. 6. Subtrees. (a) Racing Sports subtrees. (b) Foods subtrees.


carried out are summarised in the graphs in Figs. 7–9. Fig. 7 presents an overview of the main differences between the results of the tasks undertaken by users. The graph shows the results for relevant articles found and the number of backtracking clicks performed by users in completing the set of tasks. Fig. 8 outlines detailed results showing a comparison of the average clicks of users for a given task on a given system, and the breakdown of those clicks into average backtracks made, average number of category clicks, average number of article clicks and other clicks. "Other clicks" refer to clicks users made to mark relevant articles and view marked articles. Fig. 9 shows a comparison of systems on a given task for relevant articles retrieved by a user. The breakdown on each bar includes articles ranging from definitely relevant to not relevant at all.

Best method for obtaining an untangled ontology: Both subtrees b and c are ontologies that are not tangled, that is, they do not have any multiple parents. For subtree b, the resulting category organisation for each domain is based


[Fig. 7. Relevant articles found (definite) vs. backtracking clicks for subtrees a (base), b (untangled), and c (generated), across Tasks 1–6. Axes: related articles found (definite) against number of backtracking clicks.]

[Fig. 8. Average number of clicks by users for subtrees a (base), b (untangled), and c (generated), for Tasks 1–6, broken down into backtrack, category, article and other clicks.]

[Fig. 9. Number of relevant articles retrieved for subtrees a (base), b (untangled), and c (generated), for Tasks 1–6, broken down from non-relevant to definitely relevant.]


on the original Wikipedia category structure. Subtree c is a generated category organisation based on a clustering technique. The resulting category organisation for subtree c was selected from several versions that were varied on different parameters; the best subtree was chosen on the basis of the number of category–subcategory relations in the generated subtree (c) that appeared in the original Wikipedia subtree (a). This tended to yield broad subtrees using our method for generating subtrees. We also observed that broader subtrees had the effect of reducing subtree depth.

Table 21. Results and significance tests for subtrees a (base) and b (untangled).

Task  Measure        Subtree a  Subtree b  t-Test (a–b)
1     % backtrack    4.48%      3.00%      0.4641
      Definitely     8          11.5       0.2782
      Mostly         0.5        0.7        0.6867
      Somewhat       2.8        2          0.3813
      Relevant       11.3       14.2       0.4237
      Non-relevant   1.7        2.2        0.6890
2     % backtrack    6.93%      2.96%      0.3243
      Definitely     8          7.5        0.8996
      Mostly         0.3        0.2        0.6643
      Somewhat       6.2        9.2        0.4524
      Relevant       14.5       16.8       0.5638
      Non-relevant   0.7        3.3        0.1350
3     % backtrack    5.22%      8.28%      0.6076
      Definitely     7.5        2.5        0.0309*
      Mostly         0          0.3        0.3409
      Somewhat       2.5        2          0.5490
      Relevant       10         4.8        0.0339*
      Non-relevant   1.2        1.2        1.0000
4     % backtrack    5.96%      3.67%      0.3928
      Definitely     6.8        11.7       0.1758
      Mostly         0.2        0.2        1.0000
      Somewhat       0.5        0.8        0.5413
      Relevant       7.5        12.7       0.1909
      Non-relevant   3.2        4.3        0.5849
5     % backtrack    3.58%      5.07%      0.6880
      Definitely     22.2       18         0.5374
      Mostly         0          0          –
      Somewhat       0.8        0.8        1.0000
      Relevant       23         18.8       0.5361
      Non-relevant   4          1.7        0.2198
6     % backtrack    6.70%      24.88%     0.0065*
      Definitely     3.5        0          0.0571
      Mostly         0          0          –
      Somewhat       0.3        0          0.1449
      Relevant       3.8        0          0.0437*
      Non-relevant   3          0.5        0.0646

*Denotes p < 0.05.
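The selection rule for the generated subtree (count the category–subcategory relations a candidate shares with the original, keep the candidate with the highest count) can be sketched as follows; the function names and the tiny example edge lists are hypothetical:

```python
def shared_relations(original_edges, candidate_edges):
    """Count category-subcategory (parent, child) pairs that a candidate
    generated subtree shares with the original Wikipedia subtree."""
    return len(set(original_edges) & set(candidate_edges))

def best_candidate(original_edges, candidates):
    """Pick the candidate subtree that shares the most relations with the
    original, as in the selection rule described above."""
    return max(candidates, key=lambda c: shared_relations(original_edges, c))

# Hypothetical edge lists: (parent category, child category) pairs.
original = [("Foods", "Beverages"), ("Beverages", "Wine"), ("Foods", "Cuisine")]
candidate1 = [("Foods", "Beverages"), ("Foods", "Wine")]        # 1 shared
candidate2 = [("Foods", "Beverages"), ("Beverages", "Wine")]    # 2 shared
assert best_candidate(original, [candidate1, candidate2]) == candidate2
```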

The results from Figs. 8 and 9 show that users generally performed the set of tasks better using subtree b than with subtree c. The exception is the results for users performing Task 6, and we account for this in further discussion of results later in this section. Because of the difference in performance, we limit our analysis to comparing tangledness using the better ontology (subtree b) with the original ontology (subtree a).

Comparing subtrees a (base) and b (untangled): We summarise our findings in Tables 21 and 22, giving further details on the main measures for each subtree and including significance tests. The major rows are grouped by individual tasks in Table 21, and by task types in Table 22. The minor rows are grouped into backtracking clicks and measures of relevant articles found. We sum the measures of definitely-, mostly- and somewhat-relevant articles found and present this in the table as relevant articles found. The first column shows the task comparisons. We list the corresponding averages for each measure on each subtree beside it. The last column in each of Tables 21 and 22 presents p-values for the significance tests carried out on each measure.

A table of results with more specific observations of significant differences from the experiment is presented

Table 22. Results and significance tests combining tasks for subtrees a (base) and b (untangled).

Task     Measure        Subtree a  Subtree b  t-Test (a–b)
1 and 4  % backtrack    5.17%      3.33%      0.2462
         Definitely     7.4        11.6       0.0663
         Mostly         0.3        0.4        0.7314
         Somewhat       1.7        1.4        0.6985
         Relevant       9.4        13.4       0.1193
         Non-relevant   2.4        3.3        0.4981
2 and 5  % backtrack    5.07%      3.91%      0.6713
         Definitely     15.1       12.8       0.6082
         Mostly         0.2        0.1        0.6591
         Somewhat       3.5        5.0        0.5348
         Relevant       18.8       17.8       0.8140
         Non-relevant   2.3        2.5        0.8979
3 and 6  % backtrack    5.93%      15.82%     0.0223*
         Definitely     5.5        1.3        0.0066*
         Mostly         0.0        0.2        0.3282
         Somewhat       1.4        1.0        0.4919
         Relevant       6.9        2.4        0.0169*
         Non-relevant   2.1        0.8        0.0831

*Denotes p < 0.05.

in Table 23. For each observation included in Table 23, results for both subtrees a and b are presented, together with the p-value from the two-tailed unpaired unequal-variance t-test.

Overall, we found that, in comparison to subtree b (untangled), subtree a (original) enabled users to be more efficient in finding relevant articles. Of the set of articles viewed, the number of definitely relevant answers found in subtree a was 6% better than in subtree b. This was statistically significant.

Looking at the individual tasks in Table 21, the main differences found between subtrees a and b were highlighted in Tasks 3 and 6. Of the set of tasks, Tasks 3 and 6 were the most specific.

In Task 3, users were asked to find articles about "Formula one car makers". Here we found that users marked articles from a more diverse set of categories using subtree a than with subtree b. On average, users marked articles in 3.5 subcategories when browsing with subtree a, compared with 1.7 subcategories when browsing with subtree b. This was statistically significant according to the t-test.

On average, users browsing with subtree a found 7.5 definitely relevant articles, compared with 2.5 for users browsing with subtree b. However, this was not statistically significant. Upon closer inspection of the category structure, the Formula One section of the subtree had many related categories with multiple parents, which explains how users were able to browse more effectively. We also found that users performed three times better using subtree a in finding more relevant articles.

In Task 6, users were asked to find wine regions in Australia. Subtree a did significantly better than subtree b. Using subtree a, four out of the six users found related articles for this task, of which three users found definitely relevant articles, while all users using subtree b failed to find any related articles.

In observing users perform Tasks 3 and 6, the key was finding the specific gateway category. This gateway category opened up relevant categories, which were often clustered together around the gateway. For Task 6, this gateway category was more difficult to find in subtree b because there were relations missing from the categories where users were looking. The key category for Task 6 in subtree b was located in a related but obscure category called "Herbs and Medicinal herbs". In contrast, users performing the task on subtree a tended to find the key category Wine as a multiple parent of "Grape varieties", which helped them perform this task well.

5.4.5. Outcome of validation experiment

In this section, we carried out user studies on the browsing of Wikipedia articles using a set of ontologies which varied on tangledness. This was to validate the performance of ontologies based on the ROMEO mapping identified in Section 4. The results from the validation experiment showed that tangledness impacts a user's performance in browsing articles using a given category


Table 23. Results for subtree a vs. b.

Measure                                                Orig (a)  Untangled (b)  t-Test
Overall
  % definitely relevant articles/total no. articles    40%       34%            0.0375*
Domain X
  % categories with marked articles/number of
  visited categories                                   55.4%     41.7%          0.0143*
Task 3
  Categories with marked articles                      3.5       1.7            0.0199*
  Definitely relevant articles                         7.5       2.5            0.1133
Task 6
  % backtracking                                       6.7%      24.9%          0.0041*
  Path length                                          16.2      7.5            0.0010*
  Unique categories visited                            24.8      16.2           0.0101*
  Marked articles                                      6.8       0.5            0.0041*
  Relevant articles                                    10.8      0              0.0692

*Denotes p < 0.05.


structure. Overall, subtree a, which had a degree of tangledness, enabled users to perform tasks more efficiently than subtree b. For the specific task types, namely Tasks 3 and 6, users performed with greater effectiveness and efficiency using subtree a to browse the categories to find relevant articles than they did using subtree b. Thus, in carrying out this empirical validation experiment, the ROMEO mapping between the question "Does the category structure have an adequate intersection of categories?" and the measure of tangledness is found to be valid.

6. Conclusion and future work

In this paper, we presented ROMEO, a requirements-oriented methodology for evaluating ontologies, and applied it to the task of evaluating the suitability of some general-purpose ontologies for supporting browsing in Wikipedia. Following the ROMEO methodology, we identified requirements that an ontology must satisfy in order to support browsing in Wikipedia, and mapped these requirements to evaluation measures. We validated part of this mapping by conducting a task-based user study that compared variants of two sub-domains of the Wikipedia category structure; variants were obtained by untangling the original (tangled) sub-domains. We also experimented with a technique for generating a variant ontology by clustering categories, but the resulting ontologies were not comparable with the original Wikipedia sub-domains. The user study confirmed that tangledness might be desirable in ontologies and category structures that support browsing in general knowledge application areas like Wikipedia; this is especially significant for tasks that require the user to locate specific information.

However, we found no significant differences in performance for tasks that require the user to locate more general information. Thus, for future work, we propose further task-based user studies into the effects of depth, breadth and fanout. Other avenues for further work are identifying better untangling algorithms, as well as looking at the effect of adding tangledness according to an automated process and evaluating those structures for browsing.

Acknowledgements

We would like to thank Mingfang Wu at RMIT for her advice regarding the user experiments, and also the anonymous referees who provided valuable feedback. Jonathan Yu was supported with a scholarship from the Australian Research Council.

References

[1] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, ACM Press, Addison-Wesley, 1999.

[2] C. Brewster, H. Alani, S. Dasmahapatra, Y. Wilks, Data driven ontology evaluation, in: Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal, 2004, European Language Resources Association.

[3] E.J. Davidson, Evaluation Methodology Basics, Sage Publications, London, UK, 2005.

[4] T. deMarco, Controlling Software Projects: Management, Measurement, and Estimates, Prentice-Hall PTR, Upper Saddle River, NJ, USA, 1986.

[5] J.S. Dong, C.H. Lee, H.B. Lee, Y.F. Li, H. Wang, A combined approach to checking web ontologies, in: Proceedings of the International Conference on World Wide Web, ACM Press, 2004, pp. 714–722.

[6] A. Gangemi, C. Catenacci, M. Ciaramita, J. Lehmann, Ontology evaluation and validation, Technical Report, Laboratory for Applied Ontology, 2005.

[7] A. Gomez-Perez, Towards a framework to verify knowledge sharing technology, Expert Systems with Applications 11 (4) (1996) 519–529.

[8] A. Gomez-Perez, Evaluation of ontologies, International Journal of Intelligent Systems 16 (2001) 391–409.

[9] T.R. Gruber, A translation approach to portable ontology specifications, Knowledge Acquisition 5 (2) (1993) 199–220.


[10] T.R. Gruber, Toward principles for the design of ontologies used for knowledge sharing, International Journal of Human–Computer Studies 43 (5–6) (1995) 907–928.

[11] M. Gruninger, M. Fox, Methodology for the design and evaluation of ontologies, in: Workshop on Basic Ontological Issues in Knowledge Sharing, IJCAI'95, 1995.

[12] N. Guarino, Some ontological principles for designing upper level lexical resources, in: Proceedings of the 1st International Conference on Lexical Resources and Evaluation (LREC), May 1998.

[13] N. Guarino, Towards a formal evaluation of ontology quality (in Why evaluate ontology technologies? Because they work!), IEEE Intelligent Systems 19 (4) (2004) 74–81.

[14] N. Guarino, C. Welty, Evaluating ontological decisions with OntoClean, Communications of the ACM 45 (2) (2002) 61–65.

[15] P. Haase, A. Hotho, L. Schmidt-Thieme, Y. Sure, Collaborative and usage-driven evolution of personal ontologies, in: Proceedings of the 2nd European Semantic Web Conference, Lecture Notes in Computer Science, vol. 3532, Springer, London, UK, 2005, pp. 486–499.

[16] T. Lethbridge, Metrics for concept-oriented knowledge bases, Software Engineering and Knowledge Engineering 8 (2) (1998) 161–188.

[17] A. Lozano-Tello, A. Gomez-Perez, OntoMetric: a method to choose the appropriate ontology, Journal of Database Management 15 (2) (2004) 1–18.

[18] A. Maedche, S. Staab, Measuring similarity between ontologies, in: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management, Springer, London, UK, 2002, pp. 251–263.

[19] G. Marchionini, Information Seeking in Electronic Environments, Cambridge University Press, Cambridge, 1995.

[20] D.L. McGuinness, Ontologies come of age, in: Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential, MIT Press, Cambridge, MA, 2002, pp. 171–195 (Chapter 6).

[21] K. Moeller, D. Paulish, Software Metrics: A Practitioner's Guide to Improved Product Development, IEEE Press, New York, 1993.

[22] A. Orme, H. Yao, L. Etzkorn, Coupling metrics for ontology-based systems, IEEE Software 23 (2) (2006) 102–108.

[23] T. Saaty, How to make a decision: the analytic hierarchy process, Interfaces 24 (6) (1994) 19–43.

[24] M. Scriven, Evaluation Thesaurus, fourth ed., Sage Publications, Newbury Park, CA, 1991.

[25] J. Sowa, Knowledge Representation: Logical, Philosophical and Computational Foundations, Brooks Cole, 2000.

[26] S. Tartir, I. Arpinar, M. Moore, A. Sheth, B. Aleman-Meza, OntoQA: metric-based ontology quality analysis, in: Proceedings of the Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources at the IEEE International Conference on Data Mining (ICDM), Texas, USA, November 2005, pp. 45–53.

[27] R. Thompson, W. Croft, Support for browsing in an intelligent text retrieval system, International Journal of Man–Machine Studies 30 (6) (1989) 639–668.

[28] C. Welty, N. Guarino, Supporting ontological analysis of taxonomic relationships, Data and Knowledge Engineering 39 (1) (2001) 51–74.

[29] Wikipedia online editing guidelines for categories, http://en.wikipedia.org/wiki/Wikipedia:Category, accessed January 2008.

[30] Wikipedia guidelines for category proposals and implementations (original), http://meta.wikimedia.org/wiki/Categorization_requirements, accessed January 2008.

[31] J. Yu, Requirements-oriented methodology for evaluating ontologies, Ph.D. Thesis, RMIT University, Melbourne, Australia, 2008.

[32] J. Yu, J. Thom, A. Tam, Ontology evaluation: using Wikipedia categories for browsing, in: Proceedings of the 16th Conference on Information and Knowledge Management, ACM Press, 2007, pp. 223–232.

[33] J. Yu, J.A. Thom, A. Tam, Evaluating ontology criteria for requirements in a geographic travel domain, in: Proceedings of the International Conference on Ontologies, DataBases and Applications of Semantics, 2005.

[34] Y. Zhao, G. Karypis, Hierarchical clustering algorithms for document datasets, Data Mining and Knowledge Discovery 10 (2) (2005) 141–168.