
Semantic Web 0 (2013) 1
IOS Press

Using ontologies to model human navigation behavior in information networks: A study based on Wikipedia

Daniel Lamprecht a, Markus Strohmaier a,b,*, Denis Helic a, Csongor Nyulas b, Tania Tudorache b, Natalya F. Noy b and Mark A. Musen b

a Knowledge Management Institute, Graz University of Technology, Austria
b Stanford Center for Biomedical Informatics Research, Stanford University, USA

Abstract. The need to examine the behavior of different user groups is a fundamental requirement when building information systems. In this paper, we present Ontology-based Decentralized Search (OBDS), a novel method to model the navigation behavior of users equipped with different types of background knowledge. Ontology-based Decentralized Search combines decentralized search, an established method for navigation in social networks, and ontologies to model navigation behavior in information networks. The method uses ontologies as an explicit representation of background knowledge to inform the navigation process and guide it towards navigation targets. By using different ontologies, users equipped with different types of background knowledge can be represented. We demonstrate our method using four biomedical ontologies and their associated Wikipedia articles. We compare our simulation results with baseline approaches and with results obtained from a user study. We find that our method produces click paths that have properties similar to those originating from human navigators. The results suggest that our method can be used to model human navigation behavior in systems that are based on information networks, such as Wikipedia. This paper makes the following contributions: (i) To the best of our knowledge, this is the first work to demonstrate the utility of ontologies in modeling human navigation and (ii) it yields new insights and understanding about the mechanisms of human navigation in information networks.

Keywords: Navigation, Decentralized Search, Ontology

1. Introduction

One of the challenges in building information systems is the need to develop interfaces suited to a range of different types of users. Different types of users, such as novices, experts, generalists or specialists will, in general, display considerably different knowledge about a given domain. This specific knowledge in turn influences their interactions with an information system. Gaining insight into human navigation behavior supports the construction of easy-to-use software and information systems that are ready to accommodate a broad range of user types.

*Corresponding author – [email protected]

In this paper, we investigate ways of modeling navigational behavior of human users in information networks. Humans navigating an information network (such as Wikipedia) generally do not know the network topology in its entirety. They are therefore not always familiar with the global network structure but navigate based on assumptions and local information only. Experiments by Stanley Milgram and others [30][20] have shown that humans are very effective at finding short paths based on local information in offline as well as in online social networks.

In this paper we present a novel method for simulating human navigational click behavior in information networks using ontologies as background knowledge and examine its suitability to model actual human navigation behavior. The method, which we call Ontology-based Decentralized Search (OBDS), builds on decentralized search [15], a well-established navigation method in social networks which is based on local information only. Decentralized search has been successfully applied to navigation in information networks in previous research, where it has been used to model the behavior of users and to produce simulated click data [11]. OBDS uses decentralized search with ontologies as background knowledge to model the search process and to point an algorithmic searcher in the direction of the target.

1570-0844/13/$27.50 © 2013 – IOS Press and the authors. All rights reserved

This method is new in that it uses an explicit representation of the background knowledge in the form of an ontology. Research in psychology suggests that humans store concepts in their minds hierarchically [7]. In our method, we model different groups of users by using different ontologies as background knowledge.

Research Questions: In this work, we will address the following three research questions:

RQ1 Can ontologies contribute useful information to modeling navigation in information networks? And how does OBDS perform in comparison to randomly generated ontologies and random walks?

RQ2 Does Ontology-based Decentralized Search (OBDS) produce valid results, i.e., are the simulated navigation paths similar to those produced by human navigation?

RQ3 When using OBDS, what ontology is best suited to produce human-like navigation results?

To demonstrate our method, we use the information network formed by a set of biomedical Wikipedia articles and the connections (hyperlinks) between them. We show that several different biomedical ontologies can be used as background knowledge to inform navigation simulations, much as humans use their acquired knowledge for navigation.

Contributions: Our main contribution is the demonstration of the general suitability of existing real-world ontologies to inform models of human navigation, specifically decentralized search on information networks such as Wikipedia. To the best of our knowledge, our work presents a novel method and a novel application of ontologies. By comparing the navigational paths generated by our simulations with several baseline approaches and with data obtained from a user study, we show that our method yields results similar to those produced by actual human users. The results suggest that OBDS can be used to simulate human navigational behavior in information networks, which can be useful for addressing issues arising in the development of systems that are based on networked information. These findings are relevant for researchers interested in new applications for ontologies and for researchers interested in modeling navigation in information networks using ontologies as background knowledge.

The rest of this paper is structured as follows: In Section 2 we place our work in the context of previous research and related work. In Section 3 we discuss materials and methods, and we present the results in Section 4. We end with a discussion.

2. Related work

In the context of this paper, the related areas can be divided into three: navigation in social networks, navigation in information networks and ontologies.

2.1. Navigation in social networks

This paper particularly addresses navigation in social networks via decentralized search algorithms. Fundamentally, decentralized search describes a way of solving a pathfinding problem in a social network. Starting from an arbitrary start node (i.e., a person) within the network, the objective of decentralized search is to find a way to a given target node. The algorithm, however, does not possess global knowledge of the network and can therefore only take decisions based on local knowledge. The term decentralized stems from the fact that the search proceeds by forwarding the search problem from one node to another, which, in a social network, involves a different person taking the decisions at every node.

The idea of decentralized search, as used in our navigation simulations, was made popular by Stanley Milgram's widely discussed small-world experiment [30][20] in the 1960s. In the experiment, participants in Boston and Nebraska received a letter containing information about a target person (a Boston stock broker). They were then asked to forward the letter to one of their acquaintances, so as to bring the letter closer to the target person. The resulting median chain length of six intermediates for successful chains of letters gave rise to the term "six degrees of separation". By taking only the limited knowledge of each participant into account at each step, the search effectively constituted a form of decentralized search. The result illustrated the so-called small-world phenomenon, as it seemed possible to connect two arbitrary persons across the United States through a very small number of hops.

In 1998, Watts and Strogatz [32] characterized networks exhibiting small-world characteristics as having a high clustering coefficient and a low characteristic path length and demonstrated the actual existence of this type of small-world network in a film actor collaboration network, the power grid of the western United States and the neural network of C. elegans.

In 2000, Jon Kleinberg [13] proved that for the type of small-world networks proposed by Watts and Strogatz, no effective decentralized search algorithm could exist that always found a path connecting two nodes in subpolynomial time. However, Kleinberg presented a more generalized version of the model for which he then proved that a decentralized algorithm capable of finding short paths existed.

Kleinberg later extended his model of decentralized search to include hierarchies [14], where the term hierarchy denotes a tree that includes all network nodes (and may contain more nodes). He showed that when the network nodes were embedded as the leaf nodes of a hierarchy and links in the network were formed according to distances in this hierarchy, the resulting network was also efficiently searchable. To form an efficiently searchable graph, nodes were connected with a probability that decreased with their distance in the tree, i.e., the height of their closest common ancestor. Provided with the hierarchy as background knowledge, the search could then proceed to the target effectively. In this paper, we use ontologies as this type of background knowledge.
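To make the notion of hierarchy distance concrete, the following is a minimal Python sketch that computes the height of the closest common ancestor of two leaves in a small tree. The tree, the function names and the exponentially decaying link weight (with the parameters branching and alpha) are illustrative assumptions and are not taken from the cited work or from our simulator.

```python
def path_to_root(parent, node):
    """Return the list of nodes from `node` up to the root (node first)."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca_height(parent, u, v):
    """Height of the closest common ancestor above the nodes u and v.

    `parent` maps each non-root node to its parent. If u and v sit at
    different depths, the larger of the two step counts is returned.
    """
    steps_from_v = {n: i for i, n in enumerate(path_to_root(parent, v))}
    for steps_from_u, node in enumerate(path_to_root(parent, u)):
        if node in steps_from_v:
            return max(steps_from_u, steps_from_v[node])
    raise ValueError("nodes are not part of the same tree")


def link_weight(parent, u, v, branching=2, alpha=1.0):
    """Unnormalized link weight that decays with hierarchy distance.

    Assumption for illustration: weight = branching ** (-alpha * height).
    """
    return branching ** (-alpha * lca_height(parent, u, v))


# Hypothetical hierarchy: root r with inner nodes x, y and leaves a, b, c, d.
toy_parent = {"a": "x", "b": "x", "c": "y", "d": "y", "x": "r", "y": "r"}
print(lca_height(toy_parent, "a", "b"), lca_height(toy_parent, "a", "c"))
```

In this toy tree, leaves below the same inner node receive a larger weight than leaves that only meet at the root, which mirrors the idea that links between hierarchically close nodes are more likely.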

Miao et al. [19] have studied decentralized search in collaboration networks. Collaboration networks differ from information or social networks in that the information flow in them is driven by tasks. This means that the edges in the network are formed by collaboration on tasks. In their study, the tasks were software bugs. Developers who were assigned a bug they could not eliminate themselves forwarded it to another developer who they believed could handle it. By establishing several forwards in a row, this workflow constituted a type of decentralized search, as all decisions about the next hop were taken independently by multiple participants.

Adamic and Adar [4] studied decentralized search in the e-mail network of HP Labs and found that decentralized search according to hierarchies based on connectedness and office cubicle distance worked best.

Decentralized search is also used in peer-to-peer file sharing protocols such as Gnutella or KaZaA. With a low characteristic path length and a high clustering coefficient, the Gnutella network displayed small-world characteristics in 2003 [17].

2.2. Navigation in information networks

In this paper, decentralized search, a navigation model originally developed for social networks, is applied to information networks.

One of the most prominent related models of search in information networks is information foraging [25]. Information foraging is based on foraging theory in biology. In order to survive, animals have adopted methods which maximize the energy gained from food sources. In the theory of information foraging, search in information networks is not guided by background knowledge but by information scent, with each article and link emanating a distinct scent, which is dependent on the target of the search. For instance, when searching for information on penguins, a link leading to an article about Antarctica would provide more scent than a link leading to an article about the Sahara desert.

In this paper, information networks are studied using the example of Wikipedia. However, genuine navigation paths from Wikipedia are difficult to obtain, as the goals of users are often not explicitly visible and logs of click trails are hard to come by. At 60–70%, the fraction of teleports is furthermore significantly higher on Wikipedia than on general web sites [8]. This might be due to the fact that users visit Wikipedia to satisfy specific information demands rather than to browse articles. However, there exist valid reasons to navigate Wikipedia, which will be detailed in the description of the navigation scenarios in Section 3.5.

Due to the difficulty of obtaining Wikipedia navigation paths, wiki games have been a popular replacement for them in recent research. Wiki games, such as Wikispeedia (www.wikispeedia.net), Wikipedia Maze (www.wikipediamaze.com) or Wiki Game (www.thewikigame.com), allow users to play games on the information network formed by the Wikipedia articles and links between them. Click trails from wiki games have enabled researchers to gain insight into navigational behavior on Wikipedia. In 2009, West et al. [34] used wiki game data to infer semantic distances between concepts by studying game click paths. In 2012, West and Leskovec [33] found that in wiki games, players tend to navigate to hubs (articles with a large number of outlinks) first, and subsequently home in on target nodes.

In our own group, we have used decentralized search (with non-ontological background knowledge) in different contexts:

In 2011, Helic and Strohmaier compared the navigability of different tag hierarchy generation algorithms on data from Bibsonomy, CiteULike, Delicious, Flickr and LastFm [10]. The paper evaluated the suitability of tag hierarchies for navigation on tagging networks and proposed a novel tag hierarchy generation algorithm.

In 2012, Strohmaier, Helic et al. compared different folksonomy induction algorithms through decentralized search [28]. They showed that, based on evaluation through navigation, clustering algorithms developed for social tagging systems performed better than standard hierarchical clustering algorithms.

Helic et al. applied decentralized search to broad and narrow folksonomies on data from Mendeley [9] and found broad folksonomies better suited to supporting navigation.

Trattner et al. [29] compared decentralized search and human navigation behavior in information networks and showed that the simulation of decentralized search yielded results very similar to actual human navigation data on Wikipedia. In their work, Trattner et al. investigated different types of hierarchies as background knowledge and found that decentralized search based on a hierarchy generated from network features such as in- and outdegree simulated human navigation better than comparable hierarchies generated from external knowledge.

In ongoing research, Helic, Strohmaier et al. are studying the influence of stochasticity and different methods of selecting the next hop in decentralized search [11].

The previous work did not tap into existing ontologies as background knowledge, but used other approaches (such as automated methods) for this purpose. This paper goes beyond previous research by extending the simulation framework with ontologies and by applying Ontology-based Decentralized Search to the case of Wikipedia and to concrete ontologies for the biomedical domain.

2.3. Ontologies

Ontologies have been used in previous research to facilitate navigation in digital libraries. Papazoglou and Hoppenbrouwers [23] have used ontologies to retrieve related work when searching digital libraries. The research of Rajapakse et al. [27] shows efforts to navigate the digitally available literature related to dengue fever. Villela Dantes et al. [31] have studied the ontology-guided insertion of links into web pages. In their work, they classified web pages according to an ontology and subsequently inserted links to related topics into web pages to facilitate navigation.

These research papers share the effort to use ontologies to aid navigation. The objective of this paper lies in explaining and modeling user behavior by using ontologies as background knowledge. The ontologies are hence not used to guide human users but to simulate and possibly explain behavior.

This paper uses three ontologies from the biomedical domain. Biomedical ontologies play an important role in biomedical research [6] and are used for a range of purposes. In the biomedical domain, ontologies have been adopted more frequently than in other disciplines [22].

3. Materials and methods

3.1. Introductory example

To illustrate our work, let us introduce the following example, depicted in Figure 1: Alice accompanied her father to a physician, who diagnosed him with a certain cardiovascular disease. Back at home, Alice realizes that she forgot the exact name of the condition. However, she remembers that the disease was somehow related to heart rhythm problems. Trying to recover the exact name, she goes to Wikipedia, but since she does not know the exact name of the target article she cannot use the search function to jump to the article directly. Alice instead starts from a (hypothetical) Wikipedia portal containing links to a number of common diseases. She first chooses to click the portal link leading to the article on Cardiovascular disease, as this seems to be a good starting point. Next, she navigates to the article on Vascular disease, then to Stroke, clicks the link to Cardiac dysrhythmia and finally arrives at Supraventricular tachycardia, which she recognizes as the disease the doctor had diagnosed her father with.


[Figure 1: (a) Background knowledge (ICD-10), showing ICD-10 relations, Wikipedia mappings, the ICD-10 path and the shortest path; (b) Wikipedia graph, showing Wikipedia links, the ICD-10 path and the shortest path.]

Fig. 1. Alice's Wikipedia Navigation Scenario. Looking for a disease, Alice goes to Wikipedia and starts from a hypothetical portal containing links to a number of common diseases. Alice then navigates her way through the Wikipedia network. In the figure, we assume that Alice's background knowledge is represented by the ICD-10 ontology. Figure a) shows a part of ICD-10 and the corresponding Wikipedia articles. Figure b) shows a subgraph of the Wikipedia link network. Alice's path in the graph (red, dashed) is guided by ICD-10, which differs from the shortest path (green, solid). The numbers along the ICD-10 path show the distance to the target, according to ICD-10.

At each step, Alice is only aware of the links leading away from the current article. She is familiar with some of the article titles, and is able to relate them to one another through what we refer to as her background knowledge. She recognizes some of the links and knows what their target article could likely be about. Since Alice is only making use of the local article content and its outgoing links at each step, she performs what is called decentralized search.

To simulate Alice's usage of Wikipedia, we first mapped a subset of biomedical Wikipedia articles to their corresponding ontology concepts in three biomedical ontologies. Given these mappings, the simulation could then calculate distance information on the ontology. For each potential outgoing link that Alice could click, the simulation computed the shortest path on the ontology between the concept of the article behind that link and the concept of the target article. The length of this path was used as a proxy measure to estimate the distance to the target article in the Wikipedia network. The distance information gained this way was not necessarily optimal or even correct, but generally provided a good guess to guide the navigation.

In this manner, the simulator was able to make an educated guess about what link to follow, just as Alice could roughly place the outgoing articles into categories.
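The link selection step described above can be sketched in a few lines of Python. This is a simplified illustration and not our simulator: the ontology fragment, the article-to-concept mapping and the names ontology_distances and choose_link are hypothetical, and the ontology is treated as an undirected graph in which distance is measured in hops.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (concept, concept) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def ontology_distances(graph, target_concept):
    """Breadth-first search: hop distance of every concept to the target."""
    dist = {target_concept: 0}
    queue = deque([target_concept])
    while queue:
        concept = queue.popleft()
        for neighbor in graph[concept]:
            if neighbor not in dist:
                dist[neighbor] = dist[concept] + 1
                queue.append(neighbor)
    return dist


def choose_link(outlinks, article_to_concept, dist):
    """Pick the outgoing link whose mapped concept is closest to the target.

    Articles without a mapping are treated as infinitely far away.
    """
    def estimate(article):
        return dist.get(article_to_concept.get(article), float("inf"))
    return min(outlinks, key=estimate)


# Hypothetical ICD-10-like fragment (illustrative only, not the real data).
edges = [("ICD-10", "I"), ("ICD-10", "E"), ("I", "I30-I52"), ("I", "I60-I69"),
         ("I30-I52", "I47"), ("I47", "I47.1"), ("I60-I69", "I64"),
         ("E", "E65-E68"), ("E65-E68", "E66")]
mapping = {"Stroke": "I64", "Obesity": "E66", "Cardiac dysrhythmia": "I47"}

dist = ontology_distances(build_graph(edges), "I47.1")
print(choose_link(["Stroke", "Obesity", "Cardiac dysrhythmia"], mapping, dist))
```

The estimate only ranks the links of the current article, so the chosen link is a local best guess and not necessarily the next step on the shortest path in the Wikipedia network.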

In the rest of the Materials and Methods section, we describe the ontologies used to inform the simulator (Section 3.2), the Wikipedia articles and how we obtained them (Section 3.3), our method of Ontology-based Decentralized Search (Section 3.4), our navigation scenarios (Section 3.5), the user study (Section 3.6) and finally the simulator implementation (Section 3.7).

3.2. Biomedical ontologies

We used the following three ontologies and terminologies (all from the biomedical domain) as background knowledge. With these ontologies, we were able to i) extract articles from Wikipedia and ii) guide the next-step selection in the simulator.

The International Classification of Diseases, 10th revision (ICD-10) is a classification of diseases, signs and symptoms first published in 1992 and maintained by the World Health Organization (WHO). ICD-10 had its origins in the classification of causes of death and is presently used by over 100 countries to report mortality statistics. It is also widely used for epidemiology, health management as well as clinical purposes and is available in 46 languages [1]. The version we used contained 12,417 concepts. ICD-10 consists of 22 top-level nodes termed chapters and assigns a code (or a range of codes) to every disease in its domain. In our experiments, we used Wikipedia articles mapping to concepts from all 22 chapters.

(a) ICD-10 (b) MeSH (c) SNOMED CT

Fig. 2. Structure of the four top levels of the ontologies used in our research. The figure shows the structure for ICD-10, MeSH and SNOMED CT. The root node is displayed in black and bold in the middle of each plot. The figures show all ontology concepts up to a distance of four from the root node. Color indicates distance, with red being close to the root and blue being farther away. SNOMED CT (depth 16) is clearly broader than MeSH (depth 14), which stems from the fact that the former contains roughly four times more concepts than the latter.

Medical Subject Headings (MeSH) is a controlled vocabulary thesaurus for journal articles in the medical domain. MeSH is maintained by the U.S. National Library of Medicine. The ontology forms a tree structure with 16 top-level concepts and contains 26,142 terms (dubbed descriptors) [2]. Descriptors are graph leaves and are attached to one or more tree nodes (which are not descriptors). As such, the complete graph we used contained 80,689 nodes. MeSH extends beyond biomedical concepts and comprises terms from other domains such as Geography, Technology or Publication Characteristics. In our experiments, 96% of the Wikipedia articles mapped to the subgraph represented by the Diseases concept, and the rest to the Psychiatry and Psychology subgraph.

The Systematized Nomenclature of Medicine–Clinical Terms (SNOMED CT) [26] is a clinical healthcare terminology used in electronic health record systems. The revision we used contained 295,482 concepts, which made it by far the largest ontology in our simulations. SNOMED CT consists of 19 top-level concepts. In our experiments, 98% of the Wikipedia articles mapped to the Clinical finding subtree.

Table 1 displays statistics about the data sets used for this paper. The row denoted density was calculated as

\[ D = \frac{|\text{relations}|}{|\text{concepts}|\,(|\text{concepts}| - 1)} \]

for the ontologies (which were regarded as undirected graphs) and as

\[ D = \frac{|\text{links}|}{|\text{articles}|\,(|\text{articles}| - 1)} \]

for the Wikipedia article network, which formed a directed graph. Figure 2 depicts the examined ontologies graphically for the first four hops from the root node.
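As a quick check of the density definition, the following Python sketch recomputes the density row from the concept, relation, article and link counts reported in Table 1. The function name density and the dictionary layout are our own choices for illustration.

```python
def density(n_edges, n_nodes):
    """Density as defined above: edges divided by nodes * (nodes - 1)."""
    return n_edges / (n_nodes * (n_nodes - 1))


# (edges, nodes) as reported in Table 1.
datasets = {
    "ICD-10": (12_416, 12_417),
    "MeSH": (112_463, 80_689),
    "SNOMED CT": (440_408, 295_482),
    "Wikipedia": (14_539, 1_593),
}

for name, (n_edges, n_nodes) in datasets.items():
    print(f"{name}: {density(n_edges, n_nodes):.2e}")
```

With these counts, the computed values round to the densities listed in the table.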

3.3. Wikipedia articles

We used a dump of the English Wikipedia from December 2011 to extract articles from the biomedical domain corresponding to ontology concepts. We then mapped the articles to the ontologies by parsing the articles' info boxes.

In disease articles, the Infobox disease template (http://en.wikipedia.org/wiki/Template:Infobox_disease) is commonly used. It offers several options to reference medical ontologies such as ICD-10 or MeSH (see Figure 3 for an example). We used template fields in the Infobox disease as well as two other infobox templates to map Wikipedia articles to their ontology counterparts in ICD-10 and MeSH.
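The infobox-based mapping can be illustrated with a small Python sketch that pulls ontology codes out of infobox wikitext using regular expressions. It is not the parser we used: the field names (ICD10, MeshID), the layout of the ICD10 sub-template and the sample values are assumptions made for this example, and a real dump contains variants that this pattern would miss.

```python
import re

# Assumed layout: {{ICD10|<letter>|<two digits>|<optional sub-digit>|...}}
ICD10_TEMPLATE = re.compile(
    r"\{\{\s*ICD10\s*\|\s*([A-Z])\s*\|\s*(\d{2})\s*\|\s*(\d*)\s*[|}]"
)
# Assumed field name for the MeSH identifier.
MESH_FIELD = re.compile(r"\|\s*MeshID\s*=\s*([A-Z]\d+)")


def extract_codes(wikitext):
    """Return the ICD-10 codes and the MeSH identifier found in an infobox."""
    icd10 = []
    for letter, major, minor in ICD10_TEMPLATE.findall(wikitext):
        code = f"{letter}{major}"
        if minor:
            code += f".{minor}"
        icd10.append(code)
    mesh = MESH_FIELD.search(wikitext)
    return {"icd10": icd10, "mesh": mesh.group(1) if mesh else None}


# Illustrative infobox snippet (values are examples, not verified data).
sample = """{{Infobox disease
 | Name   = Supraventricular tachycardia
 | ICD10  = {{ICD10|I|47|1|i|30}}
 | MeshID = D013617
}}"""

print(extract_codes(sample))
```

A pattern-based approach like this is brittle by design; it only serves to show how an article title can be tied to a concept identifier once the relevant template field has been located.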


Table 1. Characteristics of the data sets used for our work. The table displays statistics about the examined ontologies as well as the set of Wikipedia articles mapping to those ontologies.

                 ICD-10        MeSH            SNOMED CT
Ontology
  concepts       12,417        80,689          295,482
  top-level      22            16              19
  relations      12,416        112,463         440,408
  density        8.05×10^-5    1.73×10^-5      5.04×10^-6
  depth          4             14              16
  relation       is-a          is-a, part-of   is-a
Wikipedia
  articles       1,593
  links          14,539
  density        5.73×10^-3

Fig. 3. Example of an infobox template used in disease articles on Wikipedia. Disease articles commonly make use of an Infobox disease template, which offers fields for ontology codes. We used template fields in the infoboxes to map Wikipedia articles to their ontology counterparts.

SNOMED CT is proprietary and not present in Wikipedia info boxes. As a consequence, we could not directly relate Wikipedia articles to the ontology concepts. We therefore used semantic mappings from BioPortal [35] to map Wikipedia articles to SNOMED CT. We mapped a total of 1,593 Wikipedia articles from both ICD-10 and MeSH to SNOMED CT with this method.

3.4. Ontology-based Decentralized Search

Decentralized search is a method of solving a pathfinding problem in a network without a central control unit. Starting from an arbitrary start node within the network, the objective of decentralized search is to find a way to a given target node. The term decentralized stems from the fact that the search proceeds by forwarding the search problem from one node to the next, until the target is reached. In Stanley Milgram's small-world experiment [30], decentralized search was established through humans forwarding letters to acquaintances in order to find a target person. Each human along the chain of letters acted independently of all others and thus made the search decentralized, i.e., acting without a central control unit involved in the decisions at every step. Further examples of decentralized search include bug forwarding in a developer network, where software bugs are assigned to a starting person and then forwarded to other developers until they are fixed [19], or job recommendations in social networks [4].

In a social network, the decision of where to forward the problem is generally based on the expected knowledge and capability of that particular next node (person). For our simulations, we assumed that all nodes shared a common background knowledge expressed as an ontology. This assumption made our algorithm less "decentralized" in a certain sense, because all the decisions were now made by the same entity (our simulator). Just as in the original decentralized search, however, at each node the simulator could only access information about that particular node's local network neighborhood. The background knowledge represented additional knowledge about the network necessary to effectively find a short path to the target. When looking for an employee in a company, for example, this knowledge could represent the organizational hierarchy, with the restriction that the search can only be forwarded to acquainted employees, which would, e.g., be the case with personal recommendations.

In the theory of network navigability, Jon Kleinberg showed that networks that are formed according to a background hierarchy (i.e., a tree) are efficiently navigable [14], provided the search agent has access to that background hierarchy during the search. This method, called Hierarchical Decentralized Search, has been successfully applied in previous research [11][28]. This paper extends this application by using ontologies as the background knowledge.

In our simulation, the target article was directly known to the simulation. This was used to model the somewhat familiar article Alice was trying to reach. Alice did not know the exact name of her target, but she could roughly place it in a category, to which she then navigated using her own background knowledge. Our simulations modeled this by calculating the distance directly to the target node on the background knowledge to determine the best link to click (see Figure 1 for an example).

To avoid loops, the simulation explored each node in the network only once. However, the simulation could backtrack to the last visited nodes (up to the starting portal, if necessary), just as Alice would use her browser's back button. This was used in case of dead ends (articles with no unvisited outgoing links) or at articles providing only links leading further away from the target (according to the ontology information). At any given point, the simulation could also jump back to the starting portal directly, modeling a home button in an information system.
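The loop avoidance and back-button behavior can be sketched as follows. This is a reduced Python illustration under stated assumptions, not the simulator implementation: it backtracks only at dead ends, omits both the rule for links leading further away from the target and the home button, and the function name, toy link network and distance estimates are hypothetical.

```python
def decentralized_search(links, start, target, distance, max_clicks=1000):
    """Greedy search that visits each article once and backtracks at dead ends.

    links:    dict mapping an article to its list of outgoing articles
    distance: function returning the estimated distance of an article to
              the target (e.g., derived from background knowledge)
    Returns the full click path including back-button steps, or None.
    """
    visited = {start}
    trail = [start]   # articles on the current trail; last one is current
    clicks = [start]  # everything the simulated user looked at, in order
    while trail and len(clicks) <= max_clicks:
        current = trail[-1]
        if current == target:
            return clicks
        candidates = [a for a in links.get(current, ()) if a not in visited]
        if not candidates:
            trail.pop()                  # dead end: use the back button
            if trail:
                clicks.append(trail[-1])
            continue
        best = min(candidates, key=distance)
        visited.add(best)
        trail.append(best)
        clicks.append(best)
    return None


# Hypothetical link network and distance estimates (illustrative only).
toy_links = {
    "Portal": ["Obesity", "Cardiovascular disease"],
    "Obesity": [],
    "Cardiovascular disease": ["Stroke"],
    "Stroke": ["Cardiac dysrhythmia"],
    "Cardiac dysrhythmia": ["Supraventricular tachycardia"],
}
toy_estimate = {"Obesity": 1, "Cardiovascular disease": 3, "Stroke": 5,
                "Cardiac dysrhythmia": 1, "Supraventricular tachycardia": 0}

path = decentralized_search(
    toy_links, "Portal", "Supraventricular tachycardia",
    lambda article: toy_estimate.get(article, float("inf")),
)
print(path)
```

The estimate for Obesity is deliberately misleading here, so the search first runs into a dead end, returns to the portal and only then follows the trail that leads to the target.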

This use of existing ontologies represents a substantial change in the motivation of the background knowledge: As opposed to previous work in this area, the background knowledge is now exogenous to the network. What this implies is that the hierarchy is based on knowledge independent of the network that the agent navigates on. All ontologies used in the application of Ontology-based Decentralized Search in this paper play a key role for their corresponding domain in their research fields. They are hence representative of a good part of the knowledge in these domains. This provides OBDS with a foundation to more accurately represent the intuitions of human navigation behavior.

The use of ontologies and the associated semantic information opens up a range of new possibilities for the application of the background knowledge:

– Filtering by relations and properties: Ontologies are (in general) made up of different types of relations (such as is-a, part-of or regulates), which can be used to extract different varieties of background knowledge from one and the same ontology. For example, a hierarchical version of the ontology could be extracted by following only the is-a relations. Furthermore, ontologies may assign properties to their concepts. A background knowledge can hence also be restricted to ontology concepts with a certain property. An ontology could be filtered to contain only concepts stemming from a single domain, such as geography. This could then be compared with other filtered versions of the ontology.

– Modeling different user groups: Ontologies can also be used to model different types of users. A good example of this is the case of ICD-10, which provides a classification of diseases. In the ontology, the depth of a disease (i.e., its distance from the root node) corresponds to its specificity. This could be used to model the knowledge of different hospital personnel. For instance, a medical specialist could be modeled by the entire depth of knowledge of one section of the ontology, and a depth limitation in the other sections. A layperson could be modeled by having a certain depth limitation in all areas. This could be effectively used to simulate different user groups in medical information systems, without having to carry out actual human user studies.

– Inference: Ontologies permit inference on their entities. For hierarchical relations this could mean that subconcepts could be assigned the type of their superconcepts (e.g., the perhaps unfamiliar Supraventricular tachycardia is a subconcept of Heart Disease in ICD-10, which is more commonly known). In the case of the cut-off background knowledge, more specific ontology concepts could then be substituted by their inferred superconcepts and provide more information to the navigation process than a pure random guess.
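The three possibilities listed above can be illustrated together in a short Python sketch: relations are filtered by type, and concepts below a depth limit are replaced by their superconcept at that limit. None of this was used in the experiments reported here; the triples, the part-of example and the function names are hypothetical.

```python
def filter_relations(triples, keep=("is-a",)):
    """Keep only (concept, relation, superconcept) triples of given types."""
    return [(c, r, p) for c, r, p in triples if r in keep]


def depth(parent, concept):
    """Number of is-a steps from a concept up to the root."""
    steps = 0
    while concept in parent:
        concept = parent[concept]
        steps += 1
    return steps


def generalize(parent, concept, max_depth):
    """Replace a concept deeper than max_depth by its ancestor at max_depth."""
    while depth(parent, concept) > max_depth:
        concept = parent[concept]
    return concept


# Hypothetical triples; the part-of relation is added for illustration only.
triples = [
    ("I47.1", "is-a", "I47"),
    ("I47", "is-a", "I30-I52"),
    ("I30-I52", "is-a", "I"),
    ("I", "is-a", "ICD-10"),
    ("Heart", "part-of", "Cardiovascular system"),
]

# Hierarchical background knowledge: follow is-a relations only.
parent = {c: p for c, _, p in filter_relations(triples)}

# A layperson modeled with a depth limit of 2; a specialist keeps full depth.
print(generalize(parent, "I47.1", max_depth=2))
print(generalize(parent, "I47.1", max_depth=4))
```

In this toy setting, the depth-limited user reasons about the block I30-I52 instead of the specific code I47.1, which corresponds to substituting an unfamiliar concept by a better-known superconcept.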

In the experiments conducted for this paper, Ontology-based Decentralized Search was used with three different ontologies (ICD-10, MeSH and SNOMED-CT) that were not filtered. This meant that all concepts and relation types present in the ontologies (and mapping to the data sets) were used as the background knowledge. The results of the simulations were then compared on the same information network, which meant that the three ontologies effectively modeled different user groups on the same set of data. While all three ontologies represent expert knowledge, they still serve different purposes: ICD-10 is a disease classification widely used by insurance companies, physicians and hospitals. SNOMED-CT is a terminology of clinical terms, and MeSH is a terminology for the indexing of journal articles. We hence assume these ontologies to be representative of experts in their respective fields.

3.5. Navigation scenarios

We studied two different search scenarios, both of which started from a hypothetical Wikipedia portal.


Fig. 4. Starting portals used in navigation simulations. For ICD-10, MeSH and SNOMED CT we used a portal obtained by mapping navigation bar articles from WebMD.com to Wikipedia articles (subfigure a).

Starting Portal We started the navigation from a hypothetical Wikipedia portal featuring a selection of suitable articles: the 25 health conditions listed in the main navigation toolbar of WebMD.com (see Figure 4). We manually mapped these conditions to Wikipedia articles from our dataset and used the articles as the outgoing links from our artificial portal. In a way, the artificial portal thereby resembles the navigational structure of the front page of WebMD, a popular health information web site. Medical web sites such as WebMD are frequently used [5] to obtain information about diseases or as a first source of information before consulting a medical doctor.

Single-target search Our first scenario was analogous to Alice’s introductory example. In single-target search, the simulation started at the portal and proceeded to a single target article using Ontology-based Decentralized Search.

As discussed, single-target search modeled the scenario of having a concept on the tip of one’s tongue and navigating to rediscover it.

Multiple-target search For multiple-target search, the difference was in the targets, which consisted of target sets of two to ten articles (instead of only a single one). The rest of the simulation (starting portal, decentralized search, background knowledge) was conducted in the same way as in single-target search.

We used multiple-target search to model a scenario of exploratory search. In exploratory search, users explore a space of resources rather than trying to find one specific target [18].

We used clusters of semantically similar Wikipedia articles as our target sets and applied k-means clustering to arrange similar articles into clusters based on TF-IDF features (using scikit-learn [24]). In our simulations, we used the resulting clusters that contained two to fifteen articles. Examples of clusters are given in Table 2.
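The clustering step described above could look roughly like the following scikit-learn sketch. The article texts, the number of clusters and the random seed are placeholders chosen for illustration, and the size bounds simply mirror the range stated in the preceding paragraph; none of these values are the settings actually used for the paper.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for article texts (real input: Wikipedia article content).
articles = {
    "Nausea": "nausea vomiting stomach discomfort urge to vomit",
    "Vomiting": "vomiting nausea expulsion of stomach contents",
    "Bronchitis": "bronchitis cough sputum inflammation of the bronchi",
    "Cough": "cough reflex airways sputum bronchitis",
}

titles = list(articles)

# TF-IDF features: one sparse row vector per article.
features = TfidfVectorizer().fit_transform([articles[t] for t in titles])

# k-means with an assumed number of clusters; random_state fixes the initialization.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Group article titles by cluster label.
clusters = defaultdict(list)
for title, label in zip(titles, labels):
    clusters[int(label)].append(title)

# Keep only clusters within the size bounds as target sets.
MIN_SIZE, MAX_SIZE = 2, 15
target_sets = [c for c in clusters.values() if MIN_SIZE <= len(c) <= MAX_SIZE]
```

Filtering by cluster size after clustering keeps the target sets small enough to be navigated within the step limit, while discarding singleton clusters that would reduce to single-target search.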

3.6. User study

To evaluate our simulations, we carried out a user study on Wikipedia navigation. Eight participants without any particular background in medicine were asked to navigate Wikipedia, modeling the scenario of navigating to find diseases. All of them were graduate students in different fields (but not in medicine) at Stanford University at the time of the user study.

The study used the data set of ICD-10, SNOMED CT and MeSH, containing 1,593 Wikipedia articles. As a large share of these articles turned out to be too specialized for test subjects not particularly familiar with the medical domain (with article names such as Halitosis, Aniseikonia or Milroy’s disease, which left users puzzled in a pilot study), we manually selected 100 generally better known targets (such as Pneumonia, Stomach cancer or Asthma), out of which we also manually formed 20 clusters of four articles each. We then set up our testing environment containing the subset of Wikipedia and asked subjects to perform navigation tasks. As in our simulations, backtracking (using the back button in the browser) and jumping back to the portal by clicking a home link were enabled at all times.

Table 2. Examples for clusters of Wikipedia articles used in exploratory search. The table shows three examples of clusters used in our simulations. We used TF-IDF features and k-means clustering to automatically group Wikipedia articles into semantically related groups of two to ten articles.

| Nausea-related | Stomach-related | Cough-related |
|---|---|---|
| Vomiting | Linitis plastica | Bronchitis |
| Nausea | Stomach cancer | Chronic bronchitis |
| Motion sickness | Gastritis | Acute bronchitis |
| Morning sickness | Atrophic gastritis | Cough |
| Drooling | Ménétrier’s disease | Sputum |
| Hyperemesis gravidarum | Achlorhydria | |
| | Gastroparesis | |
| | Duodenal cancer | |
| | Gastric dumping syndrome | |
| | Stomach disease | |

The setup for the user study consisted of a web site similar to Wikipedia, which contained the articles used in the study as well as information about the current task. Each step of the user was logged. This setup is visible in Figure 4.

Each participant completed a total of 15 navigation tasks. A navigation task consisted of finding a given target node (or a set of target nodes) in the subset of the Wikipedia network. As in the simulator, the starting point for a task was always the portal, and participants could only click on links to articles within the data set. To deal with potential frustration, participants were given the possibility to abort the current task if they had not found the target(s) after half of the maximum number of steps (the maximum being 20 steps for single targets and 40 for multiple targets).

3.7. Implementation

The experiments presented in this paper were conducted on a decentralized search simulator. This simulator was an extension of previous work by Helic, Strohmaier et al. ([10], [28]) and was implemented in C++ based on the Stanford Network Analysis Project framework [3]. It permitted the simulation of decentralized search on a given network and used the provided background knowledge to calculate the distances. The simulator was used to run a total of 1794 simulations of decentralized search with two navigation scenarios.

4. Results

4.1. Evaluation metrics

Based on work by Krioukov and Papadopoulos [16], we used success ratio and stretch to evaluate navigation paths.

In accordance with Strohmaier and Helic [11], we define the success ratio s to be the fraction of target nodes found and the stretch τ to be the average ratio of found path lengths to shortest path lengths.

Let P be the set of target nodes and W be the set of target nodes that were successfully navigated to by our simulator. Then the success ratio s is

$$ s = \frac{|W|}{|P|}. $$

Thus, the success ratio measures the extent to which the simulator is successful in finding a target; e.g., a success ratio of 90% states that 90% of the targets have been found. Furthermore, let l(t) be the length of the shortest path from the portal to the target node t and let h(t) be the length of the path to the target found by the agent. The stretch τ is then defined as

$$ \tau = \frac{1}{|W|} \sum_{t \in W} \frac{h(t)}{l(t)}. $$

Stretch measures the efficiency of search. For example, a stretch of 1.2 states that the paths an agent was able to find are, on average, 20% longer than the shortest paths for these targets. As in work by Helic, Trattner et al. ([10], [29]), we report success ratio and stretch split by path length of the underlying node pairs. These metrics give us a means of analyzing which paths were found by the simulator and how much longer than the shortest paths they were.
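As a worked example of the two definitions, the following Python sketch computes the success ratio s and the stretch τ for a small, made-up set of targets. The function names, the dictionary layout and the numbers are illustrative assumptions and are not part of the simulator.

```python
def success_ratio(targets, found):
    """s = |W| / |P|: fraction of the target nodes P that were found (W)."""
    return len(found) / len(targets)


def stretch(found_lengths, shortest_lengths):
    """tau = (1/|W|) * sum over t in W of h(t) / l(t).

    found_lengths maps each found target t to h(t), the length of the found path;
    shortest_lengths maps each target t to l(t), the shortest path length.
    """
    ratios = [found_lengths[t] / shortest_lengths[t] for t in found_lengths]
    return sum(ratios) / len(ratios)


# Hypothetical data: four targets, three of which were found.
shortest = {"a": 2, "b": 2, "c": 3, "d": 4}   # l(t) for all targets in P
found = {"a": 2, "b": 3, "c": 6}              # h(t) for the found targets in W

s = success_ratio(shortest, found)   # 3 / 4
tau = stretch(found, shortest)       # (2/2 + 3/2 + 6/3) / 3
```

Here s is 0.75 and τ is 1.5, i.e., three of four targets were found and the found paths were on average 50% longer than the shortest paths. Note that τ only averages over found targets, so it should be read together with s.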

We further extend these metrics with the accumulated success ratio as(n), which we define as the fraction of target nodes found up until a certain number n of steps:

$$ \mathit{as}(n) = \frac{|W_n|}{|P|}, $$

where W_n is the set of target nodes reached by the simulation in n steps or fewer.

For all our evaluations, we assumed a maximum number of 20 clicks for the single-target scenario and 40 clicks for the multiple-target scenario.
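The accumulated success ratio can be computed from the step at which each target was reached. The sketch below is a minimal illustration with made-up step counts; the function name and data layout are assumptions.

```python
def accumulated_success_ratio(steps_to_target, num_targets, max_steps):
    """Return [as(0), as(1), ..., as(max_steps)].

    steps_to_target maps each found target to the step at which it was reached;
    targets that were never found are simply absent from the mapping.
    """
    return [
        sum(1 for step in steps_to_target.values() if step <= n) / num_targets
        for n in range(max_steps + 1)
    ]


# Hypothetical run: four targets, three found at steps 2, 5 and 5.
curve = accumulated_success_ratio({"a": 2, "b": 5, "c": 5}, num_targets=4, max_steps=6)
```

The resulting curve is non-decreasing in n, and its final value as(max_steps) equals the overall success ratio s for the same step limit.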

4.2. Comparison with random baselines and optimal solutions

We established comparisons with random and optimal solutions by including a random walk, randomly generated ontologies and a shortest-path solution.

Random Walk The random walk consisted of following a random link (or tracking back) at each step, not taking already visited nodes or potential targets among the neighboring nodes into account. The comparison with the random walk showed us how much more information the OBDS approach provided to the navigation in comparison to completely random behavior.

Randomly generated Ontologies For this comparison, we constructed a randomly generated counterpart to each ontology used in our simulations. To this end, we used the number of nodes and edges as input for the configuration model approach of generating a random graph with the same number of nodes and edges [21]. As the resulting graph was not necessarily connected, we subsequently randomly connected all graph components and then removed as many edges as had been added in this process from other parts of the resulting graph (without disconnecting it).
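The connect-and-remove procedure can be sketched in Python as follows. As a simplification, the sketch draws the initial random graph by sampling edges uniformly at random for a given number of nodes and edges, which stands in for the configuration-model step and is not the generator used in the paper; all function names are hypothetical.

```python
import random
from collections import deque


def components(n, edges):
    """Connected components of an undirected graph on nodes 0..n-1 (via BFS)."""
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        comp, queue = [], deque([start])
        while queue:
            u = queue.popleft()
            comp.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        comps.append(comp)
    return comps


def random_connected_graph(n, m, seed=0):
    """Random connected graph with n nodes and exactly m edges (needs m >= n - 1)."""
    if not n - 1 <= m <= n * (n - 1) // 2:
        raise ValueError("need n - 1 <= m <= n * (n - 1) / 2")
    rng = random.Random(seed)
    # Step 1 (stand-in for the configuration model): sample m distinct random edges.
    edges = set()
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)
        edges.add((min(u, v), max(u, v)))
    # Step 2: randomly connect consecutive components with one new edge each.
    comps = components(n, edges)
    added = set()
    for a, b in zip(comps, comps[1:]):
        u, v = rng.choice(a), rng.choice(b)
        added.add((min(u, v), max(u, v)))
    edges |= added
    # Step 3: remove as many other edges as were added, never disconnecting the graph.
    to_remove = len(added)
    candidates = sorted(edges - added)
    rng.shuffle(candidates)
    for edge in candidates:
        if to_remove == 0:
            break
        edges.remove(edge)
        if len(components(n, edges)) == 1:
            to_remove -= 1
        else:
            edges.add(edge)  # edge was a bridge, so it has to stay
    return edges


random_ontology = random_connected_graph(30, 40, seed=1)
```

Re-checking connectivity after every tentative removal is simple rather than fast; for graphs of the size of real ontologies, a bridge-finding algorithm would be the more efficient choice.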

This comparison showed us how much information the OBDS approach gained by taking the structure of the ontologies into account (but not yet the correct mappings). Furthermore, evaluating with randomly generated ontologies took the structured search behavior of decentralized search into account: decentralized search, in our implementation, did not re-explore already visited nodes, could backtrack and always recognized links leading to a target node among the current node’s neighbors. This gave this method a distinct advantage over the pure random walk.
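The search behavior described above (no re-exploration of visited nodes, backtracking, and immediate recognition of a target among the neighbors) can be sketched as a greedy procedure. This is a simplified Python illustration rather than the C++ simulator: the link structure, the distance table and the convention of counting a backtrack as one click are assumptions.

```python
def decentralized_search(links, distance, start, target, max_steps=20):
    """Greedy decentralized search; returns the click path or None if not found.

    links maps a node to its outgoing links; distance(node, target) is the
    background-knowledge distance (here a hypothetical ontology distance).
    """
    path = [start]      # every visited page in click order, including backtracks
    trail = [start]     # current trail from the start, used for backtracking
    visited = {start}
    for _ in range(max_steps):
        current = trail[-1]
        neighbors = links.get(current, [])
        if target in neighbors:
            # A link leading to the target is always recognized.
            path.append(target)
            return path
        unvisited = [n for n in neighbors if n not in visited]
        if unvisited:
            # Greedy step: follow the link that looks closest to the target.
            nxt = min(unvisited, key=lambda n: distance(n, target))
            visited.add(nxt)
            trail.append(nxt)
            path.append(nxt)
        elif len(trail) > 1:
            # Dead end: go back one page.
            trail.pop()
            path.append(trail[-1])
        else:
            return None  # nothing left to explore
    return None


# Hypothetical toy network and (deliberately misleading) ontology distances.
links = {
    "Portal": ["Heart Disease", "Lung Disease"],
    "Heart Disease": ["Tachycardia"],
    "Lung Disease": ["Pneumonia", "Asthma"],
}
table = {
    ("Heart Disease", "Pneumonia"): 1,
    ("Lung Disease", "Pneumonia"): 3,
    ("Tachycardia", "Pneumonia"): 2,
}


def distance(node, target):
    return table.get((node, target), float("inf"))


path = decentralized_search(links, distance, "Portal", "Pneumonia")
```

In this toy run the misleading distances first send the agent into a dead end, after which it backtracks twice and then reaches the target, giving a path of six clicks where the shortest path has two.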

Shortest-path solution Finally, for the optimal solution, we computed a shortest-path solution. In the single-target scenario, this meant that we always used the shortest possible path in the graph for connecting the portal to the target node. For the multiple-target scenario, an exact solution would have required solving an instance of the traveling-salesman problem, which is computationally expensive. To circumvent this issue, we approximated the perfect solution with a nearest-neighbor approach that always took the shortest possible path to the nearest neighbor. This allowed us to compare to the (approximately) optimal solution. It is important to note that this was only possible with global knowledge of the graph topology, which users do not possess in a decentralized search scenario.
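A nearest-neighbor approximation of this kind can be sketched with repeated breadth-first searches. The Python sketch below assumes directed links and assumes that the tour continues from the most recently found target; both are illustrative choices, and the toy graph is made up.

```python
from collections import deque


def bfs_distances(links, source):
    """Shortest path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in links.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def nearest_neighbor_tour_length(links, start, targets):
    """Approximate the shortest tour through all targets with a greedy heuristic."""
    current, remaining, total = start, set(targets), 0
    while remaining:
        dist = bfs_distances(links, current)
        reachable = [t for t in remaining if t in dist]
        if not reachable:
            return None  # some target cannot be reached at all
        # Always walk to the closest remaining target (ties broken by name).
        nxt = min(reachable, key=lambda t: (dist[t], t))
        total += dist[nxt]
        remaining.remove(nxt)
        current = nxt
    return total


# Hypothetical directed link graph.
links = {"Portal": ["A", "B"], "A": ["C"], "B": ["A"], "C": ["B"]}
tour_length = nearest_neighbor_tour_length(links, "Portal", {"B", "C"})
```

The heuristic is not guaranteed to be optimal, which is why the resulting baseline is only approximately optimal in the multiple-target scenario.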

4.3. Evaluation

To compare the performance of OBDS with different ontologies as background knowledge, we evaluated multiple ontologies on the same set of Wikipedia articles. This allowed us to inspect multiple ontologies side by side, facilitating comparison.

The results (Figure 5) show that the success ratios were well above those of both the random walk and the randomly generated ontologies. When comparing OBDS with different ontologies, the results show that OBDS with ICD-10 performed best in terms of success ratio, followed by MeSH and SNOMED CT. For the stretch, SNOMED CT fared slightly better than MeSH (with an average stretch of 2.45 and 2.49, respectively).

4.4. User study

For the user study, we compared the results of human navigators with OBDS. The targets were 100 manually selected targets and 20 manually selected clusters (which were the same for both the users and the simulator). This limitation of targets also meant that targets were at most three hops away from the portal. The evaluations hence do not include any data points for longer shortest paths.

Figure 5 shows that the success ratios for the user study were fairly close to those of the simulator. For the single-target scenario, the overall success ratio was 92% for the user study and ranged from 79% to 91% for the ontologies. For the multiple-target scenario, the accumulated success ratio shows that the user study fell within or just below the range of the three ontologies. It is worth noting that after 20 steps, the users in our study did not find any more targets. This coincides with the point from which users were given the possibility to abort a search task if they could not find the target. For the single-target stretch, the user study, with an overall stretch of 1.74, performed slightly better than the simulator, which displayed stretches between 1.78 and 1.84.

[Figure 5, panels: (a) Single-target success ratio for ICD-10/MeSH/SNOMED CT (x-axis: shortest path in hops, 1 to 6; y-axis: success ratio in %); (b) single-target success ratio for the user study (y-axis: success ratio in %); (c) single-target stretch for ICD-10/MeSH/SNOMED CT; (d) single-target stretch for the user study; (e) multiple-target accumulated success ratio for ICD-10/MeSH/SNOMED CT and (f) multiple-target accumulated success ratio for the user study (x-axis: step, 0 to 40; y-axis: found targets in %). Legend of (a), with overall success ratios: ICD-10 (0.55), MeSH (0.53), SNOMED CT (0.45), Optimal (1.00), Rand. Walk (0.01), Rand. ICD-10 (0.11), Rand. MeSH (0.10), Rand. SNOMED CT (0.11). Legend of (b), with overall success ratios: ICD-10 (0.89), MeSH (0.91), SNOMED-CT (0.79), User Study (0.92), Optimal (1.00), Rand. Walk (0.05), Rand. ICD-10 (0.45), Rand. MeSH (0.44), Rand. SNOMED CT (0.46).]

Fig. 5. Success ratio, stretch and accumulated success ratio for ICD-10/MeSH/SNOMED CT and the user study. The first column shows the results for ICD-10, MeSH and SNOMED CT, the second column the results for the user study. The rows show success ratio, stretch and accumulated success ratio, respectively. The legends in the first row are valid for the entire columns. The numbers in parentheses display the overall values for the success ratio. Note that the stretch plots do not include the random baselines, as this measure can only be usefully applied to compare simulations with a similar number of found paths. The figures show that the results produced by Ontology-based Decentralized Search are noticeably better than the results for randomly generated ontologies. The figures also show that the results of Ontology-based Decentralized Search on a limited data set are in the range of the results produced by human test subjects.

[Figure 6, panels: (a) Single-target path lengths (x-axis: path length, 0 to 20; y-axis: produced paths in %); (b) multiple-target path lengths (x-axis: path length, 0 to 40; y-axis: produced paths in %). Legend of both panels: ICD-10, MeSH, SNOMED-CT, User Study, Optimal.]

Fig. 6. Path lengths produced by the user study and the simulator. The figures show the resulting path lengths for the single-target (a) and multiple-target (b) search scenarios. Navigation was limited to 20 and 40 steps, respectively, hence the high number of paths for these lengths (i.e., not all targets were found). The path distributions for the random walk and the randomly generated ontologies were left out for reasons of clarity.

To obtain qualitative insight into the navigation process, we compared the path lengths produced by the user study and the simulator. To this end, we examined the distribution of path lengths produced by both the user study and the simulations. This distribution can be seen in Figure 6. We then computed the Kullback-Leibler divergence from the user study distribution to the other distributions. The Kullback-Leibler divergence measures the number of additional bits needed to encode the path length distribution if the other distribution is used in place of the original (user study) path length distribution. The resulting values can be seen in Table 3. For the single-target search scenario, it is clearly visible that only OBDS with the three ontologies produced path length distributions close to the user study: all three ontology path length distributions had a very small KL divergence (0.08 to 0.18 bits) to the user study. This means that it is justifiable to replace human navigation data with data produced by OBDS and a fitting ontology (as far as produced path lengths are concerned). The same cannot be said about randomly generated ontologies (nor the random walk or the optimal solution), which cannot easily be used in lieu of the ontologies to yield similar results.
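For reference, the Kullback-Leibler divergence in bits between two path length distributions can be computed as in the following Python sketch. The small smoothing constant that avoids division by zero for empty bins is an assumption made for this illustration (the paper does not state how empty bins were handled), and the path lengths are made up.

```python
import math
from collections import Counter


def length_distribution(lengths, max_length):
    """Relative frequency of each path length 0..max_length."""
    counts = Counter(lengths)
    return [counts[k] / len(lengths) for k in range(max_length + 1)]


def kl_divergence_bits(p, q, epsilon=1e-12):
    """D_KL(P || Q) in bits, with additive smoothing (an assumption) for empty bins."""
    p_s = [x + epsilon for x in p]
    q_s = [x + epsilon for x in q]
    zp, zq = sum(p_s), sum(q_s)
    return sum((a / zp) * math.log2((a / zp) / (b / zq)) for a, b in zip(p_s, q_s))


# Hypothetical path lengths: P is the user study, Q a simulation.
user_study = length_distribution([2, 2, 3, 4], max_length=4)
simulation = length_distribution([2, 3, 3, 4], max_length=4)

divergence = kl_divergence_bits(user_study, simulation)
```

For these made-up distributions the divergence is approximately 0.25 bits. Note that the measure is asymmetric: the user study distribution takes the role of the original distribution P, as in the text above.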

For the multiple-target search scenario, this assertion cannot be made as clearly. However, the path length distribution for the multiple-target scenario was rather sparse, as there were merely twenty search scenarios, all of which were very likely to produce a path of a different total length. This meant that a single path accounted for five percent of the path lengths, which is also reflected in Figure 6(b).

In addition, we analyzed several further aspects of the user study in comparison with the ontologies, displayed in Table 4. First, we looked into the first visited nodes and the found targets. To compare these, we arranged the nodes into vectors and computed cosine similarities.

For the found targets, all three ontologies displayed high cosine similarity values. This reflects the results from Figure 5 and is caused by the high success ratios for the limited target set used in our user study, which lead to the majority of the vectors containing ones at the same positions.

Table 3. Kullback-Leibler divergence for the path length distributions produced by the simulator and the user study. The table shows the KL divergence from the user study to the ontologies and the optimal and random solutions. The KL divergence measures the number of additional bits required to encode the original distribution if another distribution is used in its place. The Randomly Generated Ontology column was computed using an average over the three randomly generated ontologies considered. The table shows that the user study was more similar to the ontologies than to the baselines for the single-target scenario. The results closest to the user study are displayed in bold.

| User Study | ICD-10 | MeSH | SNOMED CT | Optimal | Randomly Generated Ontology | Random Walk |
|---|---|---|---|---|---|---|
| single-target | 0.12 | 0.08 | 0.18 | 0.46 | 0.97 | 2.56 |
| multiple-target | 1.01 | 0.74 | 0.84 | 1.63 | 0.55 | 1.29 |

For the first hops (i.e., the very first clicks in the search), the clicks were distributed rather evenly. A truly random distribution would see each link clicked 3.7% of the time. Our results showed distributions ranging from 1 to 17% and were thus fairly evenly distributed, explaining the values of the cosine similarity being close together. For the first hops, ICD-10 displayed the most similar values to the user study.
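The cosine similarity used for the found targets and the first hops can be computed as in this short Python sketch. The two indicator vectors are made-up examples (one entry per target, 1 if the target was found), not data from the study.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical found-target indicators for four targets.
users_found = [1, 1, 1, 0]
simulator_found = [1, 1, 0, 0]

similarity = cosine_similarity(users_found, simulator_found)  # 2 / sqrt(6)
```

With high success ratios on both sides, most entries of both vectors are ones, which pushes the similarity towards 1 regardless of which method produced the vector; this is the effect noted for the found targets.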

In addition to calculating similarities, we also inspected the average per-step probability of backtracking or clicking the home button.

Both the simulation and the users had access to a back button (leading to the previously visited page) and a home button (leading back to the portal) at all times. The simulations used the home button only immediately after having found a target in multiple-target search. In all other cases, the best strategy given our simulation constraints turned out to be backtracking. The user study showed different behavior from the simulator in several aspects: for single-target search, users backtracked less frequently (9% of clicks were back button clicks, versus 11-13% for the simulations) but used the home button in 2% of clicks. For multiple-target search, users backtracked more frequently (27% versus 17-18% for the simulator) and used the home button less frequently (1% versus 2-3%).

In conclusion, backtracking was the most widely applied strategy for navigating out of dead ends and retreating from less promising areas of the network. This was especially true for the user study.

5. Discussion

In this work, we studied simulated user navigation behavior via decentralized search. We introduced Ontology-based Decentralized Search (OBDS), a novel navigation simulation method based on decentralized search which uses ontologies as background knowledge. We showed that our method can be successfully applied to navigation in information networks and demonstrated that it can be applied to Wikipedia, supported by biomedical ontologies.

In the following, we focus our discussion on the research questions raised in this work:

RQ1 Can ontologies contribute useful information to modeling navigation in information networks? And how does OBDS perform in comparison to randomly generated ontologies and random walks?

We found that ontologies can indeed inform navigation in information networks. OBDS with medical ontologies as background knowledge was able to outperform the random baseline approaches significantly.

RQ2 Does Ontology-based Decentralized Search (OBDS) produce valid results, i.e., are the simulated navigation paths similar to those produced by human navigation?

We addressed this question by comparing certain properties of the simulated navigation paths with properties produced by humans in a study. We found that the click paths produced by OBDS matched certain properties of human paths better than pure random walks and randomly generated ontologies.

RQ3 When using OBDS, what ontology is best suited to produce human-like navigation results?

From our results, ICD-10 and MeSH seem to perform best. However, the overall differences between the ontologies were not very strong, and it is the subject of ongoing research to further identify differences in the performance of OBDS with different ontologies.

5.1. Further comments

We have limited our work to ontologies and Wikipedia articles from the biomedical domain in this paper. In this domain, ontologies have been adopted more frequently than in other disciplines [22], play an important role in biomedical research [6] and are used for a range of purposes. Another important aspect was the ready availability of infobox templates on Wikipedia articles, which facilitated the mappings to the ontologies. However, the principles of our method apply to other domains as well.

Table 4. Details of the user study and the compared data sets. The table displays statistics about the user study and the ontologies. The most similar values to the user study are displayed in bold face. For the first three measures, we viewed the information about found targets, visited Wikipedia pages and first hops as a vector of values, for which we calculated the angle to the vector containing the information for the user study (i.e., the cosine similarity). For the random walk, we averaged over 1000 random walks for each portal-target pair. The last two measures display the average per-step usage of the back and home buttons for the different scenarios. In summary, the results confirm what appeared somewhat apparent from the success ratios and the stretch, i.e., that ICD-10 and MeSH displayed the most similar behavior to the user study. The Random Ontology column was computed using an average over the three randomly generated ontologies considered.

| Measure | Scenario | User Study | ICD-10 | MeSH | SNOMED-CT | Optimal | Random Ontology | Random Walk |
|---|---|---|---|---|---|---|---|---|
| Found targets (cosine similarity) | Single | 1.00 | 0.93 | 0.95 | 0.89 | 0.95 | 0.78 | 0.72 |
| Found targets (cosine similarity) | Multiple | 1.00 | 0.94 | 0.94 | 0.91 | 0.95 | 0.90 | 0.67 |
| First hops (cosine similarity) | Single | 1.00 | 0.89 | 0.85 | 0.69 | 0.88 | 0.77 | 0.80 |
| First hops (cosine similarity) | Multiple | 1.00 | 0.64 | 0.62 | 0.56 | 0.68 | 0.64 | 0.71 |
| Back button uses (average per step) | Single | 0.09 | 0.13 | 0.11 | 0.13 | 0.00 | 0.26 | 0.07 |
| Back button uses (average per step) | Multiple | 0.27 | 0.17 | 0.18 | 0.18 | 0.01 | 0.21 | 0.09 |
| Home button uses (average per step) | Single | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 |
| Home button uses (average per step) | Multiple | 0.01 | 0.03 | 0.02 | 0.03 | 0.00 | 0.05 | 0.00 |

Influence of ICD-10 The International Classification of Diseases (ICD-10) has found widespread use and probably influences and inspires Wikipedia editors. On Wikipedia, disease articles are almost always indexed by ICD-10 as the first entry in the article infoboxes. Furthermore, the category system for the disease articles of the English Wikipedia is organized according to ICD-10. These two facts and the wide use of ICD-10 have quite possibly also influenced the link creation behavior on the encyclopedia as well as the general knowledge of the test subjects. This might explain why ICD-10 seems to be best suited to model human navigation behavior in our case study.

User Study In comparison to the simulator’s performance, participants in the user study performed better for single-target search and worse for multiple-target search. This is also influenced by the fact that users aborted 30% of their multiple-target navigation tasks before having found all of the targets, while the simulations ran for the full number of possible steps.

Building User Models By using different ontologies as background knowledge, our results could help researchers and engineers build and evaluate user interfaces with different user types. The ontologies compared in the results were rather similar and mostly shared the same domain. In future work, it will be interesting to compare ontologies that do not cover the entire domain, modeling specialist users, or to combine ontologies to form a more complete coverage of the domain. Another idea might be to prune the ontologies at a certain depth, modeling broad generalist knowledge that does not extend beyond a certain depth.

Action Selection The simulations in the form we presented followed a deterministic greedy action selection model, in that they always selected the most profitable link according to the background knowledge. Related research has shown that users might be better modeled using epsilon-greedy action selection mechanisms with dynamically changing epsilons [11]. In follow-up research, our work could be extended with stochastic action selection mechanisms such as epsilon-greedy. This would also lead to another potentially crucial aspect of the present simulations, namely the need to evaluate games multiple times with potentially varying results. One could expect that these adaptations would help to fine-tune and validate any future attempts at modeling human navigation. However, we leave this task to future research.
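An epsilon-greedy variant of the link selection step could be sketched as follows. This is a hypothetical Python illustration with a fixed epsilon; it is neither part of the presented simulator nor the exact mechanism of [11], and the candidate links and distances are made up.

```python
import random


def epsilon_greedy_choice(candidates, distance_to_target, epsilon, rng):
    """With probability epsilon pick a random link, otherwise the greedy best link."""
    if rng.random() < epsilon:
        return rng.choice(candidates)                 # exploration
    return min(candidates, key=distance_to_target)    # exploitation (greedy)


# Hypothetical candidate links with background-knowledge distances to the target.
distances = {"A": 3, "B": 1, "C": 2}
candidates = list(distances)
rng = random.Random(0)

greedy_pick = epsilon_greedy_choice(candidates, distances.get, epsilon=0.0, rng=rng)
random_pick = epsilon_greedy_choice(candidates, distances.get, epsilon=1.0, rng=rng)
```

With epsilon set to 0 the rule reduces to the deterministic greedy selection described above; any larger value makes runs stochastic, so each navigation task would have to be simulated several times and the results aggregated.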

5.2. Future work

The user study we presented was limited in that it was restricted to a subset of target nodes because of the requirement that the target should be familiar to test subjects without a medical background. Since the simulation behavior for these targets was very close to that of the test subjects, it can be hypothesized that the user behavior for the whole set of targets is likewise similar. It is up to future research to show more details of the comparison of human users and decentralized search.

Another aspect was the limitation of the user study to a subset of the target articles. The user study tried to approximate non-medical students with expert biomedical ontologies. While this worked to a certain extent, it will be interesting to see further user studies with medical experts and compare their results on the entire data.

The chosen portal, based on WebMD.com, undoubtedly influenced the navigation results. It is up to future work to compare different portals and shed light on possible differences.

The idea of navigating to one single predefined target might seem somewhat artificial in the case of user behavior concerning explorative tasks. However, one idea to improve on this might be to calculate the TF-IDF features of the target node beforehand and subsequently navigate until a page (or a number of pages) similar enough to the TF-IDF features has been found (which does not need to be the predefined target page). This could model the case of users exploring areas of the network.

Other potential research questions might include the limitation of visible links to links in the upper part of Wikipedia articles and comparing the results on non-English editions of the encyclopedia. Past research [12] has already compared different methods of extracting background knowledge from the actual network used for navigation. These types of background knowledge were based on network features such as centrality or degree. It would be interesting to directly compare this extracted background knowledge with ontologies and analyze the differences.

6. Conclusions

In this work, we have presented a novel, ontology-based method (Ontology-based Decentralized Search) for simulating human navigation in information networks such as Wikipedia. Our results provide technical answers to several questions regarding the use of ontologies in decentralized search: we have not only presented a method to integrate ontological background knowledge into decentralized search, but also found that ontologies can serve as efficient background knowledge. With appropriate ontologies and Wikipedia link networks, our simulations using OBDS (i) found targets more efficiently than two baseline approaches (random walks and randomly generated ontologies) and (ii) produced navigational paths that were more similar to actual human navigational paths than those of the baseline approaches. While our human subject study was limited, for example in terms of size, the results reported in this paper are encouraging in several ways. First, our method opens up ways to explore the effects of assuming different kinds of background knowledge of users in a navigation task. For example, swapping different kinds of ontologies in future work would allow us to explore their impact on the efficiency of decentralized search in information networks. Second, our results can be seen as additional corroboration that ontologies indeed capture useful knowledge about a domain. In some of our experiments, OBDS with medical ontologies as background knowledge was able to outperform baseline approaches significantly.

In summary, our findings are relevant for researchers interested in new applications for ontologies or in modeling navigation in information networks using ontologies as background knowledge.

References

[1] International classification of diseases, revision 10.http://www.who.int/classifications/icd/en, 2012. last accessed22-April-2013.

[2] Medical subject headings. http://www.nlm.nih.gov/mesh/,2012. last accessed 22-April-2013.

[3] Stanford network analysis project. http://snap.stanford.edu,2012. last accessed 22-April-2013.

[4] L. A. Adamic and E. Adar. How to search a social network,November 2004.

[5] L. Baker, T. H. Wagner, S. Singer, and M. Bundorf. Use of theinternet and e-mail for health care information: Results from anational survey. JAMA, 289(18):2400–2406, 2003.

[6] O. Bodenreider et al. Biomedical ontologies in action: role inknowledge management, data integration and decision support.Yearb Med Inform, 67:79, 2008.

[7] J. Fang and J. Evermann. Evaluating ontologies: Towards acognitive measure of quality. In Proceedings of the 16th IEEEInternational Enterprise Distributed Object Computing Con-ference Workshops, EDOC’07, pages 109–116. IEEE Com-puter Society, 2007.

[8] D. F. Gleich, P. G. Constantine, A. D. Flaxman, and A. Gu-nawardana. Tracking the random surfer: empirically measuredteleportation parameters in pagerank. In Proceedings of the19th International Conference on World-wide Web, WWW ’10,pages 381–390, New York, NY, USA, 2010. ACM.

[9] D. Helic, C. Körner, M. Granitzer, M. Strohmaier, and C. Trat-tner. Navigational efficiency of broad vs. narrow folksonomies.In Proceedings of the 23nd ACM Conference on Hypertext andHypermedia, HT’12, 2012.

[10] D. Helic and M. Strohmaier. Building directories for socialtagging systems. In Proceedings of the 20th ACM Conferenceon Information and Knowledge Management, CIKM’11, 2011.

[11] D. Helic, M. Strohmaier, M. Granitzer, and R. Scherer. Modelsof human navigation in information networks based on decen-


tralized search. In Proceedings of the 24th ACM Conferenceon Hypertext and Social Media, HT’13, 2013.

[12] D. Helic, M. Strohmaier, C. Trattner, M. Muhr, and K. Lerman.Pragmatic evaluation of folksonomies. In Proceedings of the20th international Conference on World-Wide Web, WWW’11,pages 417–426. ACM, 2011.

[13] J. M. Kleinberg. The small-world phenomenon: an algorithm perspective. In Proceedings of the 32nd Annual ACM Symposium on Theory of Computing, STOC ’00, pages 163–170, New York, NY, USA, 2000. ACM.

[14] J. M. Kleinberg. Small-world phenomena and the dynamics of information. In Proceedings of the 20th Conference on Neural Information Processing Systems, NIPS’12, pages 431–438, 2001.

[15] J. M. Kleinberg. Complex networks and decentralized search algorithms. In Proceedings of the International Congress of Mathematicians, ICM’06, pages 1019–1044, 2006.

[16] D. Krioukov, F. Papadopoulos, M. Kitsak, A. Vahdat, and M. Boguñá. Hyperbolic Geometry of Complex Networks. Physical Review E, 82(036106), Oct 2010.

[17] Z. Li, X. Zhao, D. Huang, and J. Huang. An improved network broadcasting method based on gnutella network. In M. Li, X.-H. Sun, Q. Deng, and J. Ni, editors, Proceedings of the 2nd Grid and Cooperative Computing, volume 3033 of GCC’03, pages 404–407. Springer, 2003.

[18] G. Marchionini. Exploratory search: from finding to understanding. Commun. ACM, 49(4):41–46, Apr. 2006.

[19] G. Miao, S. Tao, W. Cheng, R. Moulic, L. E. Moser, D. Lo, and X. Yan. Understanding task-driven information flow in collaborative networks. In Proceedings of the 21st International Conference on World-Wide Web, WWW ’12, pages 849–858, New York, NY, USA, 2012. ACM.

[20] S. Milgram. The small world problem. Psychology Today, 1(1):61–67, 1967.

[21] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45(2):167–256, 2003.

[22] N. F. Noy and T. Tudorache. Collaborative ontology development on the (semantic) web. In Proceedings of the AAAI Spring Symposium on Semantic Web and Knowledge Engineering, 2008.

[23] M. Papazoglou and J. Hoppenbrouwers. Knowledge navigation in networked digital libraries. Knowledge Acquisition, Modeling and Management, pages 13–32, 1999.

[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[25] P. Pirolli. Information Foraging Theory: Adaptive Interaction with Information. Oxford University Press, 2007.

[26] C. Price and K. Spackman. Snomed clinical terms. BJHC&IM-British Journal of Healthcare Computing & Information Management, 17(3):27–31, 2000.

[27] M. Rajapakse, R. Kanagasabai, W. T. Ang, A. Veeramani, M. J. Schreiber, and C. J. Baker. Ontology-centric integration and navigation of the dengue literature. Journal of biomedical informatics, 41(5):806–815, 2008.

[28] M. Strohmaier, D. Helic, D. Benz, C. Körner, and R. Kern. Evaluation of folksonomy induction algorithms. ACM Transactions on Intelligent Systems and Technology, 3(4):74:1–74:22, Sept. 2012.

[29] C. Trattner, P. Singer, D. Helic, and M. Strohmaier. Exploring the differences and similarities between hierarchical decentralized search and human navigation in information networks. In Proceedings of the 12th International Conference on Knowledge Management and Knowledge Technologies, IKNOW’12, page 14, 2012.

[30] J. Travers and S. Milgram. An experimental study of the small world problem. Sociometry, 32:425–443, 1969.

[31] J. R. Villela Dantas and P. P. Muniz Farias. Conceptual navigation in knowledge management environments using navcon. Information processing & management, 46(4):413–425, 2010.

[32] D. J. Watts and S. H. Strogatz. Collective dynamics of small-world networks. Nature, 393(6684):440–442, June 1998.

[33] R. West and J. Leskovec. Human wayfinding in information networks. In Proceedings of the 21st international Conference on World-Wide Web, WWW’12, pages 619–628. ACM, 2012.

[34] R. West, J. Pineau, and D. Precup. Wikispeedia: An online game for inferring semantic distances between concepts. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI’09, 2009.

[35] P. L. Whetzel, N. F. Noy, N. H. Shah, P. R. Alexander, C. Nyulas, T. Tudorache, and M. A. Musen. Bioportal: Enhanced functionality via new web services from the national center for biomedical ontology to access and use ontologies in software applications. Nucleic Acids Research, 39(Web-Server-Issue):541–545, 2011.
