Source: ceur-ws.org/vol-1581/paper4.pdf

Linked Data Enabled Generalized Vector Space Model To Improve Document Retrieval

Jörg Waitelonis, Claudia Exeler, and Harald Sack

Hasso-Plattner-Institute for IT-Systems Engineering, Prof.-Dr.-Helmert Str. 2-3, 14482 Potsdam, Germany

[email protected], [email protected], [email protected]

Abstract. This paper presents two approaches to semantic search by incorporating Linked Data annotations of documents into a Generalized Vector Space Model. One model exploits taxonomic relationships among entities in documents and queries, while the other model computes term weights based on semantic relationships within a document. We publish an evaluation dataset with annotated documents and queries as well as user-rated relevance assessments. The evaluation on this dataset shows significant improvements of both models over traditional keyword-based search.

1 Introduction

Due to the increasing demands of information seekers, retrieving exactly the right information from document collections is still a challenge. Search engines try to overcome the drawbacks of traditional keyword-based search systems, such as ambiguities of natural language (vocabulary mismatch), by the use of knowledge bases, e.g. Google's Knowledge Graph¹. These knowledge bases enable the augmentation of search results with structured semantic information gathered from various sources. With these enhancements, users can more easily explore the result space and satisfy their information needs without having to navigate to other web sites [22, 16]. Linked Open Data (LOD) provides numerous structured knowledge bases, but there are even more ways in which LOD can improve search.

Many information needs go beyond the retrieval of facts. Full documents with comprehensive textual explanations have much greater power to provide an actual understanding than any structured information will ever have. On the web, there are relevant documents for almost any imaginable topic. Thus, to find the right documents is a matter of accurately specifying the query keywords. Typically, users start with a general query and refine it when the search results do not contain the expected result. In most cases this process works fine, because any query string matches at least some relevant documents. The challenge of web

1 http://googleblog.blogspot.co.uk/2012/05/introducing-knowledge-graph-things-not.html

search is thus to determine the highest quality results among a set of matching documents. In contrast to web search, query refinement can quickly lead to empty result sets in document collections of limited size, such as blogs, multimedia archives, or libraries, because the wrong choice of keywords may eliminate the only relevant document. One approach to cope with these shortcomings is to explicitly map the document content to entities of a formal knowledge base, and to exploit this information by taking into account semantic similarity as well as relatedness of documents and queries.

For this purpose, we have developed and comprehensively evaluated a Semantic Search system, which combines traditional keyword-based search with LOD knowledge bases, in particular DBpedia². Our approach shows that retrieval performance on less-than-web-scale search systems can be improved by the exploitation of graph information from LOD resources. The main contributions of this paper are two novel approaches to exploit LOD knowledge bases in order to improve document search and retrieval based on:

1. the adaptation of the generalized vector space model (GVSM) with taxonomic relationships,

2. measuring the level of connectedness of entities within documents instead of traditional term frequency weighting.

Furthermore, we have published a manually assembled and carefully verified evaluation dataset with semantically annotated documents, search queries, as well as relevance assessments at different relatedness levels³.

In the remainder of this paper, Section 2 provides a technical introduction and references related work. Section 3 describes the two proposed approaches and Section 4 addresses their evaluation. The final section summarizes and discusses the achieved results, and gives a brief outlook on future work.

2 Preliminaries and Related Work

Semantic Search makes use of explicit semantics to solve core search tasks, i.e. interpreting queries and data, matching query intent with data, and ranking search results according to their relevance for the query. In modern semantic retrieval systems the ranking also makes use of underlying knowledge bases to obtain the degree of semantic similarity between documents and queries [6]. One of the most popular retrieval models to determine similarity between documents and queries is the Vector Space Model (VSM). Basically, it assumes pairwise orthogonality among the vectors representing the index terms, which means that index terms are independent of each other. Hence, the VSM does not take into account that two index terms can be semantically related. Therefore, Wong et al. have introduced the Generalized Vector Space Model (GVSM), where the index terms are composed of smaller elements and term vectors are not considered pairwise orthogonal in general [23]. The similarity function which determines

2 http://dbpedia.org/
3 The ground truth dataset is published at: http://s16a.org/node/14

the similarity among a document d_k and the query q is extended with a term correlation t_i \cdot t_j:

sim_{cos}(d_k, q) = \frac{\sum_{j=1}^{n} \sum_{i=1}^{n} d_{k,i} \, q_j \, (t_i \cdot t_j)}{\sqrt{\sum_{i=1}^{n} d_{k,i}^2} \; \sqrt{\sum_{i=1}^{n} q_i^2}},    (1)

where d_{k,i} and q_i represent the weights in the document and query vectors, and n is the dimension of the new vectors. The term correlation t_i \cdot t_j can be implemented in different ways. Wong et al. have used co-occurrences [23], and the model in [17] uses WordNet, a large lexical database of words grouped into sets of synonyms (synsets), each expressing a distinct concept [11]. Our approach instead utilizes LOD resources and their underlying ontologies to determine a correlation between related index terms.
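Eq. 1 can be sketched in a few lines of Python. The vectors and the correlation matrix below are toy values chosen for illustration, not taken from the paper:

```python
import math

def gvsm_similarity(d, q, corr):
    """GVSM cosine similarity (Eq. 1): the numerator sums d_k,i * q_j
    weighted by the term correlation t_i . t_j; the denominator uses the
    plain Euclidean norms of the document and query vectors."""
    n = len(d)
    numer = sum(d[i] * q[j] * corr[i][j] for i in range(n) for j in range(n))
    denom = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(x * x for x in q))
    return numer / denom if denom else 0.0

# With the identity matrix as correlation, the model reduces to the classic VSM.
identity = [[1.0, 0.0], [0.0, 1.0]]
related  = [[1.0, 0.6], [0.6, 1.0]]   # hypothetical correlation of two related terms
d, q = [1.0, 0.0], [0.0, 1.0]
print(gvsm_similarity(d, q, identity))  # 0.0 -- no shared terms
print(gvsm_similarity(d, q, related))   # 0.6 -- the correlation bridges the terms
```

With an off-diagonal correlation, a document and a query that share no index term still receive a non-zero similarity, which is exactly the effect the GVSM introduces.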

Before exploiting the semantic relatedness, the document contents must be annotated via Named Entity Linking (NEL). NEL is the task of identifying mentions of named entities in a text and linking them to the corresponding entities in a knowledge base. A wide range of approaches for NEL exists, and most of them integrate natural language processing, such as named entity recognition, co-reference resolution, and word sense disambiguation (WSD), with statistical, graph-based, and machine learning techniques [19].

Our retrieval approach is inspired by the idea of concept-based document retrieval, which uses WSD to substitute ambiguous words with their intended unambiguous concepts and then applies traditional IR methods [5]. Several knowledge bases have been exploited to define concepts. One of the first concept-based IR approaches in [21] uses the WordNet taxonomy, whereas the more recent Explicit Semantic Analysis [3, 4] is based on concepts that have been automatically extracted from Wikipedia.

Some approaches have already attempted to include semantic relationships in a retrieval model. Lexical relationships on natural language words have been applied by [7] for query expansion and by [17] in a GVSM. However, their lack of disambiguation introduces a high risk of misinterpretation and errors. The latter model nevertheless shows small improvements, but is limited by the knowledge represented in WordNet. The fact that named entities play an important role in many search queries has been considered by [1, 12]. Their approach focuses on correctly interpreting and annotating the query and extending it with names of instances of the classes found.

Another approach is applied by [20, 2]. They use formal SPARQL⁴ queries to identify entities relevant to the user's information need and then retrieve documents annotated with these entities. Since this requires knowledge of a formal query language, it is not suited for the ordinary user.

None of the above approaches provides an all-round service for end-user centered semantic search, which simultaneously builds on a theoretically sound retrieval model and is proven to be practically useful. Neither do any of them take advantage of the relationships of concepts represented in a document. These

4 http://www.w3.org/TR/sparql11-overview/

are also the main points that the approaches presented in the following sections address.

3 Linked Data Enabled GVSM

With the goal of increasing search recall, the taxonomic approach uses taxonomic relationships within the knowledge base to determine documents containing entities that are not explicitly mentioned in, but strongly related to, the query. We go beyond any of the previous GVSM approaches by exploiting the semantic relationships to also identify documents that are not directly relevant, but related to the search query. Related documents serve as helpful recommendations if no or only few directly relevant documents exist, which is a frequent scenario when searching in limited document collections. Furthermore, taxonomies provide subclass relationships necessary for effectively answering class queries, a special kind of topical search, where any members of a class are considered relevant. For example, the class search query "Tennis Players" should also return documents about instances of the class "Tennis Players", such as "Andre Agassi", "Steffi Graf", etc., even if the term or entity "Tennis Players" does not explicitly occur in these documents.

In addition to the taxonomic approach, we also propose a connectedness approach, which aims at increasing search precision. In general, this approach computes an improved term weighting by analyzing semantic relationships between the entities within a document.

[Fig. 1 not reproduced in this text extraction. The diagram shows documents and queries passing through (1) textual preprocessing (stemming, stopwords), (2) Named Entity Linking with manual verification, (3a) taxonomic enrichment and (3b) connectedness scoring, which feed the taxonomic, connectedness, and traditional retrieval models (4a/4b), producing Rankings A, B, and C.]

Fig. 1. Evaluation architecture overview.

Fig. 1 shows the entire workflow of both proposed semantic search approaches in addition to traditional keyword-based search. The workflow consists of four processing steps: (1) traditional syntactic document and query processing, (2) semantic document and query annotation with LOD resources, (3) annotation enrichment with semantic information, and (4) query-to-document matching and result ranking. These steps have been implemented in the Apache SOLR/Lucene⁵ indexing and retrieval system. Their purpose, theoretical background, and implementation are explained in detail in the following.

The process starts with textual preprocessing (step 1), which applies stopword removal, basic synonym expansion, and word stemming to the document texts. The resulting textual index terms constitute one part of the index for both proposed approaches, so that textual and semantic index terms are treated equally. In parallel, step 2 performs semantic document annotation by NEL with DBpedia entities. Because NEL does not always guarantee correct results, the annotation of documents has been manually revised [16].
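Step 1 can be illustrated with a minimal, self-contained sketch. The stopword list and the crude suffix stripper below are toy stand-ins for the actual Lucene analyzers used in the system:

```python
# Toy textual preprocessing: lowercasing, stopword removal, and a naive
# suffix "stemmer". Real stemming (e.g. Porter) is far more elaborate.
STOPWORDS = {"the", "on", "a", "an", "and", "of"}

def preprocess(text):
    tokens = [t.lower().strip(".,") for t in text.split()]
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    # naive suffix stripping, for illustration only
    return [t[:-2] if t.endswith("ed") else t[:-1] if t.endswith("s") else t
            for t in tokens]

print(preprocess("Armstrong landed on the Moon"))  # ['armstrong', 'land', 'moon']
```

The resulting tokens would then be indexed alongside the entity annotations produced by step 2.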

3.1 Taxonomic Enrichment & Retrieval

The taxonomic approach is a variation of the GVSM. In a GVSM, term vectors are not necessarily orthogonal, to reflect the notion that some entities or words are more closely related or similar to each other. This section describes the construction of term vectors from a taxonomy (step 3a) and the derived retrieval model for matching documents to a query (step 4a).

For an index term associated with an entity, the term vector t_i is constructed from the entity vector e_i of the entity it represents and the set of its classes c(e_i):

t_i = \alpha_e e_i + \alpha_c \frac{v_i}{|v_i|}, \quad \text{with} \quad v_i = \sum_{c_j \in c(e_i)} w(c_j, e_i) \times c_j.    (2)

The e_i and c_j are pairwise orthogonal vectors with n dimensions, where each dimension stands for either an entity or a class. Accordingly, n is the sum of the number of entities and the number of classes in the corpus. Taking the c_j to be orthogonal suggests that the classes are mutually independent, which is not given, since membership in one class often implies membership in another. This model is thus not suited to calculate similarity between classes, but it does provide a vector space in which entities and documents can be represented in compliance with their semantic similarities.

The factors α_e and α_c determine how strongly exact matches of the query entities are favored over matches of similar entities. They are constant across the entire collection to make sure that the similarity of the term vectors corresponding to two entities with the same classes is uniform. To keep t_i a unit vector, they are calculated from a single value α ∈ [0, 1]:

\alpha_e = \frac{\alpha}{\sqrt{\alpha^2 + (1 - \alpha)^2}} \quad \text{and} \quad \alpha_c = \frac{1 - \alpha}{\sqrt{\alpha^2 + (1 - \alpha)^2}}.    (3)

With higher α, documents with few occurrences of the queried entity will be preferred over documents with many occurrences of related entities.
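Eqs. 2 and 3 translate directly into code. In the sketch below, vectors are represented as sparse dictionaries over entity and class dimensions; the entity, class names, and weights are illustrative only:

```python
import math

def alpha_factors(alpha):
    """Eq. 3: derive the entity weight alpha_e and the class weight alpha_c
    from a single alpha in [0, 1] so that the term vector stays unit length."""
    norm = math.sqrt(alpha ** 2 + (1 - alpha) ** 2)
    return alpha / norm, (1 - alpha) / norm

def term_vector(entity, classes, alpha=0.5):
    """Eq. 2: combine the orthogonal entity dimension with the normalized,
    weighted class dimensions. `classes` maps class name -> w(c_j, e_i)."""
    a_e, a_c = alpha_factors(alpha)
    v_norm = math.sqrt(sum(w * w for w in classes.values()))
    t = {entity: a_e}
    for c, w in classes.items():
        t[c] = a_c * w / v_norm
    return t

# Hypothetical entity with two weighted classes; the result has unit length.
t = term_vector("dbp:Neil_Armstrong", {"yago:Astronaut": 0.8, "yago:Person": 0.2})
print(sum(w * w for w in t.values()))  # ~1.0
```

Because α_e² + α_c² = 1 and the class part is normalized by |v_i|, the squared weights of any such term vector always sum to one, as Eq. 3 intends.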

5 http://lucene.apache.org/

Since not every shared class implies the same level of similarity between two entities, not all classes should contribute equally strongly to the similarity score. Assigning weights w(c_j, e_i) to the classes within a term vector achieves this effect. Without them, the cosine similarity of two document (or query) vectors solely depends on the number of classes shared by the entities they contain. The w(c_j, e_i) should express the relevance of the class c_j to the entity e_i. We found that Resnik's relatedness measure [13], i.e. w_Resnik = max_{c' \in S(c_j, e_i)} IC(c'), performed best in our evaluations (cf. Sec. 4). IC(c') expresses the specificity, or information content, of the class c', and can be calculated by measures like linear depth, non-linear depth [14], or Zhou's IC [24]. It is a valuable component for our approach because generic classes hold less information. For example, British Explorers is a more precise description of James Cook than the general class Person. Other similarities that include IC have been proposed in [9], [10], and [18].
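As an illustration of information content via linear depth, the following sketch uses a made-up three-class taxonomy; the class names and parent links are hypothetical and the depth normalization is a simplification:

```python
# Toy taxonomy as a child -> parent map; None marks the root.
PARENT = {
    "yago:BritishExplorers": "yago:Explorer",
    "yago:Explorer": "yago:Person",
    "yago:Person": None,
}

def depth(c):
    """Number of edges from class c up to the taxonomy root."""
    d = 0
    while PARENT[c] is not None:
        c, d = PARENT[c], d + 1
    return d

def ic_linear(c, max_depth=2):
    """Linear-depth IC: deeper, more specific classes carry more information."""
    return depth(c) / max_depth

print(ic_linear("yago:Person"))            # 0.0 -- generic class
print(ic_linear("yago:BritishExplorers"))  # 1.0 -- specific class
```

A generic class like Person therefore contributes little weight, while a specific class like British Explorers dominates the class part of the term vector.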

For our implementation, we have employed the classes from YAGO⁶, a large semantic knowledge base derived from Wikipedia categories and WordNet, which interlinks with DBpedia entities [15]. Its taxonomy, which we have extended with rdf:type statements to include the instances, is well suited for the taxonomic approach for two reasons. First, it is fine-grained and thereby also allows for a fine-grained determination of similar entities. Second, since the main taxonomic structure is based on WordNet, it has high quality and consistency. This is advantageous when using the taxonomy tree for similarity calculations. Other taxonomies, such as the DBpedia Ontology⁷ or Umbel⁸, also qualify.

The text index integrates into the model by appending the traditional document vector to the entity-based document vectors. Fig. 2 shows an example document d and query q annotated with entities (dbp:) and classes (yago:), as well as the corresponding document vector d and query vector q.

[Fig. 2 not reproduced in this text extraction. It depicts the vectors of an example document d ("Armstrong [dbp:Neil_Armstrong, yago:Astronaut, yago:Person] landed on the moon [dbp:Moon, yago:CelestialBody]") and a query q ("Yuri Gagarin [dbp:Yuri_Gagarin, yago:SovietCosmonauts, yago:Astronaut, yago:Person]") over entity, class, and word dimensions.]

Fig. 2. Example document and query vectors in the taxonomic model.

6 http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/
7 http://wiki.dbpedia.org/services-resources/ontology
8 http://www.umbel.org

3.2 Connectedness Approach

To determine the relevance of a term within a document, term frequency (tf) is not always the most appropriate indicator. Good writing style avoids word repetitions, and consequently, pronouns often replace occurrences of the actual term. However, the remaining number of occurrences of the referred-to word depends on the writer. In documents with annotated entities, term frequency is especially harmful when annotations are incomplete. Such incompleteness might result, for example, when only the first few occurrences of an entity are marked, annotations are provided as a duplicate-free set on document level, or not all surface forms of an entity are recognized.

We therefore suggest replacing the term frequency weight by a new measure, connectedness, which requires adaptations of indexing (step 3b in Fig. 1) and similarity scoring (step 4b). The connectedness of an entity within a document is a variation of the degree centrality based on the graph representation of the underlying knowledge base [8]. It describes how strongly the entity is connected within the subgraph of the knowledge base induced by the document (document subgraph). This approach has the desirable side effect that wrong annotations tend to receive lower weights due to the lack of connections to other entities in the text.

The document subgraph D, as illustrated by the example in Fig. 3 (A), includes all entities that are linked within the document (e_1 to e_8) as well as all entities from the knowledge base that connect at least two entities from the document (e_9 and e_10). As connectedness is defined on undirected graphs, the function rel(e_i, e_j) is applied to create an undirected document subgraph. It returns true if and only if there exists some relation from e_i to e_j or from e_j to e_i. Each entity e_i ∈ D has a set E_i of directly connected entities and a set F_i of indirectly connected entities:

E_i = \{e \in D \mid rel(e, e_i)\} \quad \text{and} \quad F_i = \{e \in D \mid \exists x : rel(e, x) \wedge rel(x, e_i)\}    (4)

Fig. 3 (B) illustrates E_i and F_i for the example document in part A.
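The sets of Eq. 4 can be computed from a plain edge list. In this sketch the entities and relations are invented for the example, rel() is realized as an undirected membership test, and self-connections are excluded:

```python
def neighbor_sets(D, edges):
    """Eq. 4: for each entity in the document subgraph D, collect the set E_i
    of directly connected entities and the set F_i of entities connected via
    one intermediate node. `edges` holds (a, b) pairs, treated as undirected."""
    def rel(a, b):
        return (a, b) in edges or (b, a) in edges
    E = {e: {x for x in D if x != e and rel(x, e)} for e in D}
    F = {e: {x for x in D
             if x != e and any(rel(x, m) and rel(m, e) for m in D)} for e in D}
    return E, F

# Hypothetical chain e1 -- e2 -- e3.
E, F = neighbor_sets({"e1", "e2", "e3"}, {("e1", "e2"), ("e2", "e3")})
print(E["e1"])  # {'e2'}
print(F["e1"])  # {'e3'} -- reachable via e2
```

In the chain example, e_2 is a direct neighbor of e_1, while e_3 enters F_1 through the intermediate node e_2.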


Fig. 3. A: Subgraph of a knowledge base spanned by a document. An arrow from entity e_i to e_j indicates that an RDF triple 〈e_i〉 rel 〈e_j〉 exists in the underlying knowledge base. B: Each entity is connected to the members of its E_i by a solid line, and to the members of its F_i by a dashed line.

Based on these sets, connectedness cn(e_i, d) is computed as follows:

cn(e_i, d) = 1 + (|E_i| + |F_i|) \times \frac{|D|}{n_d}, \quad \text{where} \quad n_d = \sum_{e_j \in D} |E_j| + |F_j|.    (5)

Entities may have no connections to any other entities in the document subgraph. Since they are nevertheless relevant to the document, we add 1 to all scores. The multiplication with |D|/n_d normalizes the score by the average number of connected entities over all e ∈ D. This normalization creates comparability between different documents, which is otherwise lacking for two reasons: On the one hand, entities are more likely to be connected to other entities in documents with more annotations. On the other hand, a single connection to another entity is more significant in a sparse document subgraph than in a dense one. The average score is preferable over the sum of scores because it keeps a larger variation in scores.

While calculating the term's weight within a document via connectedness, we keep the traditional inverse document frequency (idf) to calculate the term's distinctness. Whether or not a word or entity has a large power to distinguish relevant from non-relevant documents depends on the document corpus. In a corpus with articles about Nobel Prize winners, for example, 'Nobel Prize' is a common term, whereas in general collections, the same term is much less frequent, and its occurrence is more informative. Distinctness can thus only be accurately estimated by a corpus-dependent measure.

Since connectedness is independent of taxonomic classes, for this approach the term vectors consist of the entity vector (all 0 for unannotated terms) concatenated with the traditional term vector. While weights in the traditional term vector part remain tf-idf weights, the entity vectors' values are cn-idf values, i.e.

w(e_i, d) = cn(e_i, d) \times idf(e_i) = \left(1 + (|E_i| + |F_i|) \times \frac{|D|}{n_d}\right) \times \log \frac{|N|}{df(e_i)}.    (6)

On the other hand, connectedness is not suitable for weighting query entities. The main reason for this is that most of the time queries are too short to contain sufficiently many connections to convey any meaningful context. Furthermore, whether query weights are recommendable is application-dependent, and consequently left out of our considerations.

4 Evaluation

The evaluation shows how the proposed retrieval models improve search effectiveness. To perform an initial optimization, we have manually assembled a ground truth dataset with documents, queries, and relevance judgements. Since human relevance judgements are idiosyncratic, variable, and subjective, we have subsequently conducted a multi-user study with different judges to double-check whether the proposed method also performs well in a real user scenario.

An appropriate evaluation of the presented methods necessitates a correctly annotated dataset. This prohibits the use of traditional retrieval datasets, such as large-scale web-search datasets, e.g. as provided by the TREC⁹ community, because the semi-automatic creation of the necessary semantic annotations would have taken too long. Datasets for NEL evaluation are perfectly annotated, but do not provide user queries and relevance judgments [19]. Since no appropriate dataset was publicly available, we decided to compile a new dataset. It consists of 331 articles from the yovisto blog¹⁰ on history in science, tech, and art. The articles have an average length of 570 words, contain 3 to 255 annotations (average 83), and have been manually annotated with DBpedia entities [16]. Inspired by the blog's search log, we have assembled and also manually annotated a set of 35 queries. The blog authors, as domain experts, assisted in the creation of relevance judgements for each query.

On this initial dataset, we optimized some parameters of our approaches with respect to Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG). The Resnik similarity with Zhou's IC performed best for the class-based method, and connectedness worked best when average normalization (as described in Section 3.2) was applied. For text-only search on this dataset, length normalization was not beneficial, and the classical linear term frequency weighting performed better than Lucene's default, which takes the square root of the term frequency.

With this configuration, we have set up a user study. Since ranking all 331 documents for 35 queries was beyond our capacities, we have used pooling to identify potentially relevant documents. For every query, the top 10 ranked documents from text-, class- (Resnik-Zhou), and connectedness search have been collected and presented to the users in random order. The users were then asked to assign every document to one of the following five categories based on its relation to the query: document is relevant (5), parts are relevant (3), document is related (3), parts are related (1), irrelevant (0). The rounded arithmetic mean of the relevance scores (indicated in parentheses) determines the relevance scores in the ground truth. Furthermore, the participants directly compared the rankings produced by text search (baseline), class-based search, and connectedness-based search and identified the best (score 2) as well as the second-best ranking (score 1). In total, 64 users have participated in the relevance assessments. All queries have been assessed by at least 8 participants.

Table 1. Evaluation Results

Method                              MAP    NDCG   MAP@10  NDCG@10  RR     Prec@1
Text (baseline)                     0.696  0.848  0.555   0.743    0.960  0.943
Concept+Text (α = 1)                0.736  0.872  0.573   0.761    0.979  0.971
Connectedness (only)                0.711  0.862  0.567   0.752    0.981  0.971
Connectedness (with tf)             0.749  0.874  0.583   0.766    0.979  0.943
Taxonomic (no similarity, α = 1/2)  0.766  0.875  0.603   0.758    0.961  0.943
Taxonomic (Resnik-Zhou, α = 1/2)    0.768  0.877  0.605   0.762    0.961  0.943

9 http://trec.nist.gov/
10 http://blog.yovisto.com/

Tab. 1 shows that the inclusion of semantic annotations and similarities clearly improves retrieval performance compared to traditional text search. As expected, the taxonomic approach increases overall recall (97.5% as compared to 93.2% for text search) because it is able to retrieve a larger number of documents. However, it also improves the ranking, as measured by MAP and NDCG.

The connectedness approach performs better than text search, but worse than the other semantic methods, including the simple "Concept+Text" approach, where entities are treated as regular index terms within Lucene's default model. This poor performance is surprising, because the results from the users' direct comparison of the three rankings indicate that connectedness (with an average score of 1.09) provides better rankings than taxonomic (1.01) and text search (0.90). This seeming contradiction hints at a difference between information retrieval evaluation measures and user perception of ranking quality. The evaluators have judged mainly by the very top few documents. Connectedness outperforms the other approaches in this respect, as shown by the reciprocal rank (RR) and Prec@1 in Tab. 1. Also, connectedness performs best when related documents are not considered relevant.

Combining the connectedness measure with tf weights seems to be the best weighting. When considering only the first 10 documents, as users often do, this combined weighting's performance can be considered slightly better than the taxonomic approach due to its higher NDCG value, because NDCG takes different relevance levels into account, while MAP does not. However, connectedness is inferior to the taxonomic approach on the complete search results, because it simply does not retrieve certain documents. This harms the recall and negatively affects MAP and NDCG, while precision may still be higher.
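For reference, the rank-sensitive measures discussed here can be sketched concisely: average precision (which MAP aggregates over queries) and reciprocal rank. The ranking and relevance sets in the example are invented:

```python
def average_precision(ranked, relevant):
    """AP for one query: mean of precision@k over the ranks k that hold a
    relevant document; MAP is the mean of AP over all queries."""
    hits, score = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            score += hits / k
    return score / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    """RR rewards only the position of the first hit, matching the users'
    focus on the very top of the list."""
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1 / k
    return 0.0

ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(average_precision(ranked, relevant))  # (1/2 + 2/4) / 2 = 0.5
print(reciprocal_rank(ranked, relevant))    # 0.5 -- first hit at rank 2
```

Note that AP penalizes every missing relevant document across the whole list, while RR only looks at the first hit; this difference is one reason why connectedness scores well on RR and Prec@1 but worse on MAP.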

Table 2. Comparison of semantic similarities for the taxonomic model

Similarity   Uniform  Jiang-Conrath [9]                Lin [10]
Specificity  –        lin depth  log depth  Zhou's IC  lin depth  log depth  Zhou's IC
MAP          0.766    0.753      0.766      0.767      0.767      0.766      0.767
NDCG         0.875    0.868      0.875      0.875      0.876      0.875      0.877

Similarity   Resnik [13]                      Tversky Ratio [18]  Tversky Contrast [18]
Specificity  lin depth  log depth  Zhou's IC  –                   –
MAP          0.768      0.768      0.768      0.768               0.763
NDCG         0.877      0.876      0.877      0.876               0.873

Tab. 1 indicates that the use of the Resnik-Zhou similarity has only a very small positive impact on retrieval compared to the same approach with uniform weights. This is in line with the numbers from Tab. 2, which demonstrate that the choice of semantic similarity has only little impact. Apparently, the number of shared classes has a more significant influence in this setup than the class weights.

To determine the influence of NEL quality, we have also executed experiments with documents whose annotations have not been revised after automated NEL. The results still show an improvement over text search, but MAP is 2.5% to 4.5% and NDCG 0.1% to 1.6% lower than in the equivalent experiments on manually revised documents.

5 Conclusions and Future Work

This paper has shown how Linked Data can be exploited to improve document retrieval based on an adaptation of the GVSM. We have introduced two novel approaches utilizing taxonomic as well as connectedness features of Linked Data resources annotated within documents. Our evaluation has shown that both methods achieve a significant improvement compared to traditional text retrieval. The connectedness approach tends to increase precision, whereas the taxonomic approach raises recall. Thereby, the similarity measure that weights taxonomic classes of index terms has only little influence on the retrieval. The quality of annotations substantially impacts the retrieval results, but uncorrected state-of-the-art NEL still yields an improvement. In general, the quality of Linked Data itself is an obstacle. Only about 65% of entities in the dataset used have type statements, which are essential for the performance of taxonomy-aided retrieval. Since no annotated ground truth datasets with relevance judgements exist, we have created one via a pooling method.

There are still open questions to be answered in future work, including how well the models perform with other knowledge bases (e.g. Wikidata or national authority files) or languages, what other semantic relations between entities are valuable for document retrieval, and how the semantic similarity can obtain more influence. The annotation of queries with classes may improve the retrieval, and so could the combination of the two proposed methods. Fact retrieval would benefit from the more precise connectedness approach, whereas increasing the impact of the taxonomic approach would be preferable in exploratory search systems. Furthermore, the main ideas can be transferred to an adapted language or probabilistic retrieval model.

References

1. T. H. Cao, K. C. Le, and V. M. Ngo. Exploring combinations of ontological features and keywords for text retrieval. In T. B. Ho and Z.-H. Zhou, editors, PRICAI, volume 5351 of LNCS, pages 603–613. Springer, 2008.

2. P. Castells, M. Fernandez, and D. Vallet. An adaptation of the vector-space model for ontology-based information retrieval. IEEE Transactions on Knowledge and Data Engineering, 19(2):261–272, Feb 2007.

3. O. Egozi, E. Gabrilovich, and S. Markovitch. Concept-based feature generation and selection for information retrieval. In AAAI, volume 8, pages 1132–1137, 2008.

4. O. Egozi, S. Markovitch, and E. Gabrilovich. Concept-based information retrieval using explicit semantic analysis. ACM Transactions on Information Systems, 2011.

5. F. Giunchiglia, U. Kharkevich, and I. Zaihrayeu. Concept search. In Proc. of the ESWC 2009, pages 429–444, Berlin, Heidelberg, 2009. Springer-Verlag.

6. S. Harispe, S. Ranwez, S. Janaqi, and J. Montmain. Semantic measures for the comparison of units of language, concepts or entities from text and knowledge base analysis. CoRR, abs/1310.1285, 2013.

7. A. Hliaoutakis, G. Varelas, E. Voutsakis, E. G. M. Petrakis, and E. E. Milios. Information retrieval by semantic similarity. Int. Journal on Semantic Web and Information Systems, pages 55–73, 2006.

8. M. O. Jackson et al. Social and economic networks, volume 3. Princeton University Press, 2008.

9. J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. 1997.

10. D. Lin. Extracting collocations from text corpora. In First Workshop on Computational Terminology, pages 57–63, 1998.

11. G. A. Miller. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41, Nov. 1995.

12. V. M. Ngo and T. H. Cao. Ontology-based query expansion with latently related named entities for semantic text search. Advances in Intelligent Information and Database Systems, 2010.

13. P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int. Joint Conference on Artificial Intelligence, pages 448–453, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.

14. N. A. L. Seco. Computational models of similarity in lexical ontologies. Master's thesis, 2005.

15. F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web, pages 697–706. ACM, 2007.

16. T. Tietz, J. Waitelonis, J. Jager, and H. Sack. Smart media navigator: Visualizing recommendations based on linked data. In 13th International Semantic Web Conference, Industry Track, pages 48–51, 2014.

17. G. Tsatsaronis and V. Panagiotopoulou. A generalized vector space model for text retrieval based on semantic relatedness. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 70–78. Association for Computational Linguistics, 2009.

18. A. Tversky. Features of similarity. Psychological Review, 84(4):327, 1977.

19. R. Usbeck et al. GERBIL – general entity annotation benchmark framework. In 24th WWW Conference, 2015.

20. D. Vallet, M. Fernandez, and P. Castells. An ontology-based information retrieval model. The Semantic Web: Research and Applications, 2005.

21. E. M. Voorhees. Using WordNet to disambiguate word senses for text retrieval. In Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, pages 171–180. ACM, 1993.

22. J. Waitelonis and H. Sack. Towards exploratory video search using linked data. Multimedia Tools and Applications, 59(2):645–672, 2012.

23. S. K. M. Wong, W. Ziarko, and P. C. N. Wong. Generalized vector spaces model in information retrieval. In Proc. of the 8th SIGIR Conf. on Research and Development in Information Retrieval, pages 18–25, New York, NY, USA, 1985. ACM.

24. Z. Zhou, Y. Wang, and J. Gu. A new model of information content for semantic similarity in WordNet. In Future Generation Communication and Networking Symposia, 2008. FGCNS '08. 2nd Int. Conf. on, volume 3, pages 85–89, Dec 2008.