

Semantic Data Markets: Store, Reason and Query

Roberto De Virgilio #, Giorgio Orsi ∗, Letizia Tanca ∗, Riccardo Torlone #

# Dip. di Informatica e Automazione, Università Roma Tre, Rome, Italy
{devirgilio,torlone}@dia.uniroma.it

∗ Dip. di Elettronica e Informazione, Politecnico di Milano, Milan, Italy
{orsi,tanca}@elet.polimi.it

Abstract— This paper presents NYAYA, a flexible system for the management of Semantic-Web data which couples an efficient storage mechanism with advanced and up-to-date ontology reasoning capabilities.

NYAYA is capable of processing large Semantic-Web datasets, expressed in a variety of formalisms, by transforming them into a collection of Semantic Data Kiosks that expose the native meta-data in a uniform fashion using Datalog±, a very general rule-based language. The kiosks form a Semantic Data Market where the data in each kiosk can be uniformly accessed using conjunctive queries and where users can specify user-defined constraints over the data.

NYAYA is easily extensible and robust to updates of both data and meta-data in the kiosk. In particular, a new kiosk of the semantic data market can be easily built from a fresh Semantic-Web source, expressed in any format, by extracting its constraints and importing its data. In this way, the new content is promptly available to the users of the system. The approach has been validated experimentally using well-known benchmarks, with very promising results.

I. INTRODUCTION

Ever since Tim Berners-Lee presented, in 2006, the design principles for Linked Open Data1, the public availability of Semantic-Web data has grown rapidly. Today, practitioners, organizations and universities are all contributing to what is called the Web of Data by building RDF repositories, either from scratch or by transforming data stored in traditional formats [1]. In this trend, ontology languages such as RDFS and OWL (with all their relatives and versions) provide a means to annotate data with meta-data (constraints forming a TBox, in knowledge-base terminology), enabling different forms of reasoning, depending on the expressiveness of the adopted language.

However, this spontaneous growth of available data is not yet supported by sufficiently efficient and equally general management systems, since each dataset is usually processed within a language-dependent (and usually rigid) management system that implements the specific reasoning and query-processing algorithms of the language used to specify the constraints. For these reasons, the problem of storing, reasoning over, and querying large Semantic-Web datasets in a flexible and efficient way represents a challenging area of research and a profitable opportunity for industry [2].

1 http://linkeddata.org/

In this paper we present NYAYA2, a system for Semantic-Web data management which provides (i) an efficient and general storage policy for Semantic-Web datasets, along with (ii) advanced and language-independent ontology reasoning capabilities and (iii) the possibility to express conjunctive queries and user-defined constraints over the dataset, depending on the application needs. Reasoning and querying in NYAYA are based on Datalog± [3], a very general logic-based language that captures the most common tractable ontology languages and provides efficiently-checkable, syntactic conditions for decidability and tractability.

To show how NYAYA works, we refer to the conceptual architecture reported in Figure 1. This architecture is not meant to define an actual implementation strategy. Rather, it aims to illustrate the interactions among the main features of our system.

The whole framework relies on a collection of persistent repositories of Semantic-Web data that we have called kiosks. A kiosk is populated by importing an available RDF dataset, possibly coupled with meta-data that specify constraints on its content using a generic Semantic-Web language such as RDF(S) or OWL (or variants thereof). In order to allow inferencing, "native" vocabularies and meta-data are extracted and represented as Datalog± constraints. In addition, NYAYA adopts non-recursive Storage Programs, expressed in Datalog, in order to link the vocabulary entities of a kiosk to their actual representation in the storage. This also enables a form of separation of concerns between the reasoning and query-processing algorithms and the logical organization of the storage, which can be changed without affecting querying. On the other side, each kiosk exposes a uniform query interface to the stored data and allows access to the meta-data.

A collection of kiosks constitutes what we have called a Semantic Data Market, a place that exposes the content of all the kiosks in a uniform way and where users can issue queries and collect results, possibly by specifying further constraints over the available data using Datalog±, in addition to the original ones.

Users operate in the semantic data market using a high-level front-end (e.g., a SPARQL Endpoint3 or a visual Query-by-Example interface). The user query is translated into a Datalog± query and reformulated in terms of the Datalog± constraints. Since NYAYA focuses on FO-reducible constraints, the result of the translation process is a union of conjunctive queries (UCQ). In a subsequent step, the UCQ is reformulated in terms of the storage programs of each kiosk, obtaining a set of UCQs (one per kiosk) that takes into account the relationship between the entities occurring in the query and the structures in the persistent storage of each kiosk. Each UCQ is then easily rewritable into actual queries over the persistent storage system (SQL, in our current relational implementation) that retrieve the tuples answering the original query.

2 Nyaya, literally "recursion" in Sanskrit, is the name of the school of logic in Hindu philosophy.
3 http://semanticweb.org/wiki/SPARQL_endpoint

Fig. 1. The conceptual architecture of a Semantic Data Market [diagram: for each kiosk (RDF, RDFS(DL), OWL-QL), the user UCQ is reformulated against the kiosk's Datalog± program (possibly extended with a user-defined Datalog± program), rewritten through the storage program into a source UCQ, and finally translated into SQL queries over the stored data and meta-data; an offline vocabulary and meta-data extraction step builds each kiosk from its dataset, all within the (FO-reducible) Datalog± expressive power]

An important aspect of this framework is that it is easily extensible: when a fresh Semantic-Web source, expressed in any format, is made available, a new kiosk can be easily built from it by extracting its meta-data and importing its data. In this way, the content of the source is promptly available to the users of the system. Conversely, if a user wants to query the same kiosk by adopting a different set of constraints, the query is automatically reformulated accordingly and issued to the kiosk.

In sum, the main contributions of this work are the following:

• the definition of the Semantic Data Market, an extension of standard Knowledge Bases where the ontological constraints and the logical organization of persistent information are uniformly represented in the same language;
• an algorithm for rewriting-based access to the Kiosks of the semantic data market;
• an efficient and model-independent storage mechanism for Semantic-Web data, coupled with a rule-based inference and query-answering engine.

The paper is organized as follows: Section II discusses some related work, while Section III describes the architecture of NYAYA and the storage technique we use, based on meta-modelling. Section IV introduces the fundamental theoretical notions underlying the reasoning and query-processing algorithms of NYAYA, subsequently presented in Section V. Section VI describes the experimental setting and gives the performance figures. Finally, Section VII draws the conclusions and delineates our future work.

II. RELATED WORK

Managing Semantic-Web data represents an important and profitable area of research. We now briefly discuss some representative systems, focusing on the two major issues concerning Semantic-Web data management, namely storage and querying.

Considering storage, we identify two main trends: (i) the first focuses on developing native storage systems to exploit ad-hoc optimisations, while (ii) the second adapts mature data models (relational, object-oriented, graph [4]) to Semantic-Web data. Generally speaking, native storage systems (such as AllegroGraph4 or OWLIM5) are more efficient in terms of load and update time. On the other side, relying on an existing data-management system improves query efficiency due to the availability of mature and effective optimisations. In this respect, a drawback of native approaches consists exactly in the need for re-thinking query optimization and transaction-processing techniques.

Early approaches to Semantic-Web data management (e.g., RDFSuite [5], Sesame6, KAON7, TAP8, Jena9, Virtuoso10 and the semantic extensions implemented in Oracle Database 11g R2 [6]) focused on the triple (subject, predicate, object) or quad (graph, subject, predicate, object) data models, each of which can be easily implemented as one large relational table with three or four columns. Queries are formulated in an RDF(S)-specific query language (e.g., SPARQL11, SeRQL12, RQL13 or RDQL14), translated to SQL and sent to the RDBMS. However, during query processing the number of self-joins that must be executed makes this approach unfeasible in practice, and the optimisations introduced to overcome this problem (such as property tables [7] and GSPO/OGPS indexing [8]) have proven to be highly query-dependent [9] or to introduce significant computational overhead [10]. Abadi et al. [9] proposed the Vertical Partitioning approach, where the triple table is split into n two-column tables in such a way that it is possible to exploit fast merge-joins in order to reconstruct information about multiple properties for subsets of subjects. However, Sidirourgos et al. [11] analysed the approach, concluding that its hard-coded nature makes it difficult to generalise and use in practice.

4 http://www.franz.com/products/allegrograph/
5 http://www.ontotext.com/owlim/
6 http://www.openrdf.org/
7 http://kaon.semanticweb.org/
8 http://tap.stanford.edu/
9 http://jena.sourceforge.net/
10 http://virtuoso.openlinksw.com/
11 http://www.w3.org/TR/rdf-sparql-query/
12 http://www.openrdf.org/doc/sesame/users/ch06.html
13 http://139.91.183.30:9090/RDF/RQL/
14 http://www.w3.org/Submission/RDQL/
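The self-join problem and the Vertical Partitioning remedy can both be sketched on a toy triple store in SQLite (the table contents and names are ours, purely illustrative of the technique, not of any of the cited systems):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A single triple table, as in the early RDF stores.
cur.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
cur.executemany("INSERT INTO triples VALUES (?,?,?)", [
    ("i0", "name", "Priam"),
    ("i0", "father", "i1"),
    ("i1", "name", "Hector"),
])

# Retrieving two properties of the same subject already needs one self-join;
# a query touching k properties needs k-1 self-joins.
rows = cur.execute("""
    SELECT t1.s, t1.o AS name, t2.o AS child
    FROM triples t1 JOIN triples t2 ON t1.s = t2.s
    WHERE t1.p = 'name' AND t2.p = 'father'
""").fetchall()
print(rows)  # [('i0', 'Priam', 'i1')]

# Vertical Partitioning: one two-column table per property.
for prop in ("name", "father"):
    cur.execute(f"CREATE TABLE {prop} (s TEXT, o TEXT)")
    cur.execute(f"INSERT INTO {prop} SELECT s, o FROM triples WHERE p = ?",
                (prop,))

# The same query becomes a join of two small tables (a fast merge-join
# when both are kept sorted on the subject column).
rows = cur.execute(
    "SELECT name.s, name.o, father.o FROM name JOIN father ON name.s = father.s"
).fetchall()
print(rows)  # [('i0', 'Priam', 'i1')]
```

A query touching k properties of the same subject needs k−1 self-joins on the triple table, while under vertical partitioning it becomes a join of k small two-column tables.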

On the querying side, we analyse the problem w.r.t. two dimensions, namely reasoning support and inference materialisation. Earlier systems did not allow reasoning, either because efficient algorithms were not available or because the data language (for example, RDF) was too simple to allow inferencing. In such cases the efficiency of query processing depends only on the logical and physical organization of the data and on the adopted query language. On the other side, reasoning is a fundamental requirement in the Semantic Web, and efficient reasoning algorithms are now available [12], [13]. However, a system that supports reasoning must face the problem of inference materialisation.

A first family of systems materialises all the inferences: this enables ad-hoc indexing and dramatically improves query-processing performance. By contrast, loading time is inflated, because all the complexity of inferencing is deferred to this moment, and could potentially generate an exponential blow-up of space occupation. Moreover, update management is also critical, since the inferred data must be incrementally updated accordingly. Full materialisation is suitable for really stable datasets, especially when the queries are known in advance. The second class of systems, such as QuOnto [14] and REQUIEM [15], executes inferencing on-the-fly; thus the complexity of query processing depends on both the logical organisation of the data and the expressiveness of the adopted data language and of the queries. This approach has many advantages: in the first place, space occupation is reduced to the minimum necessary to store the dataset, and it is possible to define a suitable trade-off for inference materialisation depending on which queries are executed most frequently. Another advantage is robustness to updates: since inferences are computed every time, there is no need for incremental updates. The major drawback of on-the-fly inferencing is the impact of reasoning time on query processing; however, QuOnto and REQUIEM are built upon languages, such as DL-Lite [12] and ELHIO¬ [16], that trade expressiveness for tractability of query answering through intensional query reformulation [17] or database completion [18]; the latter, however, requires either write-access to the target database or enough main memory to store the produced inferences.

To the best of our knowledge, NYAYA is the first system that adopts a flexible combination of tractable reasoning with a highly performant storage system.

III. NYAYA REPRESENTATION OF DATA AND META-DATA

Following an approach for the uniform management of heterogeneous data models [19], data and meta-data are represented in NYAYA using a meta-model M made of a generic set of constructs, each of which represents a primitive used in known semantic models (e.g., RDF, RDFS and OWL) for describing the domain of interest. The adoption of a meta-model able to describe all the data models of interest has two main advantages. On the one hand, it provides a framework in which semantic models can be handled in a uniform way; on the other hand, it allows the definition of model-independent reasoning capabilities.

Constructs in M are chosen by factoring out common primitives of different models. For instance, M includes the construct CLASS, which is used in both RDFS and OWL to represent a set of resources. Each construct is associated with an object identifier, a name, a set of properties, and a set of references to other constructs. The approach is basically independent of the actual storage model (e.g., relational, object-oriented, or XML-based). However, since NYAYA relies on a relational implementation of the meta-model, in the following we assume that each construct corresponds to a table.

The core of M is constituted by the following constructs:

• CLASS(OID, Name)
• DATAPROPERTY(OID, Name, TypeClassOID, isKey, ClassOID)
• OBJECTPROPERTY(OID, Name, isKey, isNegative, isTransitive, isSymmetric, isFunctional, isInverseFunctional, SubjectClassOID, ObjectClassOID)
• I-CLASS(OID, URI, ClassOID)
• I-DATAPROPERTY(OID, Name, Value, i-ClassOID, PropertyOID)
• I-OBJECTPROPERTY(OID, Name, Subject-ClassOID, Object-ClassOID, PropertyOID)

The first three tables serve to store sets of resources, atomic properties of resources (i.e., properties whose objects are literals) and object properties of resources (i.e., properties whose objects are resources themselves), respectively. The DATAPROPERTY table has a reference to the CLASS it belongs to and has the range data type (e.g., integer or string) as an attribute. The OBJECTPROPERTY construct is used to represent RDF statements and has two references, to the subject class and the object class involved in the statement. In addition, it has attributes that specify notable features of the property. The I-CLASS construct is used to model resources that are instances of the class specified by its ClassOID property. Similarly for I-DATAPROPERTY and I-OBJECTPROPERTY.

Assume we have represented (by means of RDF triples) the bloodline relationships among the Aeneid's characters Priam, Hector, Paris and Astyanax, as shown in Figure 2. We use four individuals, i0, . . . , i3, having the names of the above characters as a property. The individuals are linked by two instances of the property father and one instance of the property brother, representing that Hector and Paris are brothers and are both sons of Priam. Moreover, we represent the fact that Astyanax is the first-born of Hector.

Fig. 2. Aeneid's Bloodlines (RDF) [graph: the individuals i_0, i_1, i_2 and i_3, carrying the name properties Priam, Hector, Paris and Astyanax respectively; i_1 has type Person; father edges link i_0 to i_1 and to i_2, and a firstBorn edge links i_1 to i_3]

The tables of the core meta-model storing the entities in Figure 2 are reported in Fig. 3. Note that data and meta-data are managed in a uniform way.

Fig. 3. A relational implementation of the meta-model of NYAYA
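The relational instance of Fig. 3 can be sketched in SQLite as follows (we keep only a subset of the columns listed above, and the OID values are ours):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Core meta-model tables (only the columns needed for the example).
cur.executescript("""
CREATE TABLE class            (oid TEXT PRIMARY KEY, name TEXT);
CREATE TABLE objectproperty   (oid TEXT PRIMARY KEY, name TEXT,
                               subjectclassoid TEXT, objectclassoid TEXT);
CREATE TABLE dataproperty     (oid TEXT PRIMARY KEY, name TEXT, classoid TEXT);
CREATE TABLE i_class          (oid TEXT PRIMARY KEY, uri TEXT, classoid TEXT);
CREATE TABLE i_objectproperty (oid TEXT, subject TEXT, object TEXT,
                               propertyoid TEXT);
CREATE TABLE i_dataproperty   (oid TEXT, value TEXT, i_classoid TEXT,
                               propertyoid TEXT);
""")

# Meta-data: the class Person and the properties of the example.
cur.execute("INSERT INTO class VALUES ('c1', 'Person')")
cur.executemany("INSERT INTO objectproperty VALUES (?,?,?,?)", [
    ("p1", "father", "c1", "c1"),
    ("p2", "brother", "c1", "c1"),
    ("p3", "firstBorn", "c1", "c1"),
])
cur.execute("INSERT INTO dataproperty VALUES ('p4', 'name', 'c1')")

# Data: the individuals i0..i3 and their links.
cur.executemany("INSERT INTO i_class VALUES (?,?,?)",
                [(f"i{k}", f"http://example.org/i{k}", "c1") for k in range(4)])
cur.executemany("INSERT INTO i_objectproperty VALUES (?,?,?,?)", [
    ("o1", "i0", "i1", "p1"), ("o2", "i0", "i2", "p1"),
    ("o3", "i1", "i2", "p2"), ("o4", "i1", "i3", "p3"),
])
cur.executemany("INSERT INTO i_dataproperty VALUES (?,?,?,?)", [
    ("d1", "Priam", "i0", "p4"), ("d2", "Hector", "i1", "p4"),
    ("d3", "Paris", "i2", "p4"), ("d4", "Astyanax", "i3", "p4"),
])

# Instance tables and meta-data tables are queried in the same join:
# the names of Priam's children.
rows = cur.execute("""
    SELECT d.value FROM i_objectproperty f
    JOIN objectproperty p ON f.propertyoid = p.oid AND p.name = 'father'
    JOIN i_dataproperty d ON d.i_classoid = f.object
    JOIN dataproperty dp ON d.propertyoid = dp.oid AND dp.name = 'name'
    ORDER BY d.value
""").fetchall()
print([r[0] for r in rows])  # ['Hector', 'Paris']
```

Note how the query mixes instance tables (I-OBJECTPROPERTY, I-DATAPROPERTY) and meta-data tables (OBJECTPROPERTY, DATAPROPERTY) in the same join: this is what the uniform management of data and meta-data amounts to at the storage level.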

RDF Schema can be included in this framework by simply adding to M the following further constructs, which capture the notions of subclass and subproperty (in its two forms), respectively:

• SUBCLASS(OID, classOID, subClassOID)
• SUBDATAPROPERTY(OID, propertyOID, subPropertyOID)
• SUBOBJECTPROPERTY(OID, propertyOID, subPropertyOID)

Indeed, one of the main advantages of this approach is that it is easily extensible. For example, containers can be represented by adding three further constructs: CONTAINER (representing sequences, alternatives or bags), SIMPLEELEMENT and CLASSELEMENT (representing literal and resource elements of a container, respectively).

In Fig. 4 we show a UML-like diagram representing a fragment of M involving the main constructs of OWL-QL. The rectangles represent constructs and the arrows references between them. Notice how the core of M (enclosed in the dashed box) can serve to represent the facts of both OWL-QL and RDFS(DL) ontologies; the full specification of the meta-model can be found at http://mais.dia.uniroma3.it/Nyaya.

In practical applications, the above tables could be very large, and therefore suitable tuning involving indexing and partitioning needs to be performed to guarantee good performance. This subject will be addressed in Section VI.

IV. REASONING AND QUERYING

Before going into the details of how reasoning and querying are actually done in NYAYA, we introduce some fundamental notions.

Fig. 4. The meta-model M

A. Ontological Query Answering

An ontological schema S (or simply schema) is a set of unary and binary first-order predicates. A term t in a predicate may be an individual name (or simply an individual), a constant (e.g., strings and integers), a null (i.e., an anonymous individual or an unknown constant) or a variable. An atomic formula (i.e., an atom) is either a formula p(t) or p(t, t′), where p is a predicate of S and t, t′ are terms. We call position πi the i-th argument of a predicate p ∈ S. A unary predicate is called a concept name (or simply a concept), while binary predicates are referred to as role names (or simply roles) when both terms are individuals, and attribute names (or simply attributes) when the first term is an individual and the second is a constant.

Similarly to databases, the semantics of an ontological schema is given in terms of its models. Consider three pairwise disjoint sets ∆i, ∆c and ∆z of symbols such that ∆i is a finite set of individuals, ∆c is a finite set of constants and ∆z is a (possibly infinite) set of, not necessarily distinct, labelled nulls. An ontological instance (or simply instance) I for a schema S is a (possibly infinite) set of atoms of the form p(t), r(t, t′) and u(t, t′′) (named facts), where p, r and u are respectively a concept, a role and an attribute in S, while t, t′ ∈ (∆i ∪ ∆z) and t′′ ∈ (∆c ∪ ∆z). The ontological instance can be seen as the extensional counterpart of the concept of interpretation of description logics (and logics in general) [20]. We say that an ontological instance I is a model for an ontological schema S if and only if I |= S.

The structure of an ontology is defined by means of an ontology-modelling language that is used to specify the relationships among the entities in the ontology. The expressiveness of such languages ranges from simple assertions over the set of individuals and constants, as in the RDF language15, to complex languages such as OWL2-DL16 and FLORA-217, whose semantics is given in terms of Description Logics [20] and F-Logic [21], respectively. The ontological structures defined by such languages amount to enforcing a set of first-order constraints that restrict the models of the ontological schema.

The query language we consider is that of unions of conjunctive queries (UCQs), whose expressive power is the same as that of select-project-join relational algebra queries.

A conjunctive query (CQ) q over an ontological schema S is a formula of the form q(X) ← ∃Y φ(X,Y), where φ(X,Y) is a conjunction of atoms over S, and X and Y are sequences of variables or constants in ∆c. The answer to a CQ q over an ontological instance I, denoted as ans(q, I), is the set of all the tuples t over ∆c for which there exists a homomorphism h : X ∪ Y → ∆c ∪ ∆z such that h(φ(X,Y)) ⊆ I and h(X) = t.
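The definition of ans(q, I) can be turned into a small, brute-force homomorphism search (an illustrative sketch; variables are uppercase, and the facts are those of the running example):

```python
from itertools import product

facts = {("father", "i0", "i1"), ("father", "i0", "i2"),
         ("brother", "i1", "i2"), ("firstBorn", "i1", "i3")}

def answers(head_vars, body, facts):
    """ans(q, I): all bindings of the head variables for which some
    homomorphism h maps every atom of the body into the set of facts."""
    results = set()
    facts = sorted(facts)               # fix an iteration order
    # Brute force: try every way of matching the query atoms against facts.
    for choice in product(facts, repeat=len(body)):
        h, ok = {}, True
        for (pred, *args), (fpred, *fargs) in zip(body, choice):
            if pred != fpred or len(args) != len(fargs):
                ok = False
                break
            for a, fa in zip(args, fargs):
                if a[0].isupper():                 # a variable of X or Y
                    if h.setdefault(a, fa) != fa:  # must map consistently
                        ok = False
                        break
                elif a != fa:                      # a constant: exact match
                    ok = False
                    break
            if not ok:
                break
        if ok:
            results.add(tuple(h[v] for v in head_vars))
    return results

# q(X) <- father(i0, X) ∧ brother(X, Y): children of i0 having a brother.
print(answers(["X"], [("father", "i0", "X"), ("brother", "X", "Y")], facts))
# {('i1',)}
```

Note that this computes plain answers over I only: no constraint is applied, so the symmetry of brother, for instance, is not taken into account.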

If we add a set of constraints Σ, the problem becomes that of ontological conjunctive query answering under constraints, i.e., the problem of deciding whether I ∪ Σ |= q, where q is a conjunctive query.

The semantics we adopt for query answering is the certain-answers semantics [22], requiring that an answer to a query be an answer in all the possible models of a given schema. Formally, let mods(Σ, I) be the set of all the (possibly infinite) sets of atoms D such that D |= I ∪ Σ. The answer to a CQ q over an instance I w.r.t. a set of constraints Σ (denoted by ansΣ(q, I)) is the set of all the tuples which belong to ans(q, D) for all D ∈ mods(Σ, I).

Given a set of constraints Σ, we say that query answering under Σ is first-order reducible (henceforth, FO-reducible) if and only if, given a query q, it is possible to construct a first-order query qrew (called the perfect rewriting) such that Σ ∪ I |= q if and only if I |= qrew, for every instance I. This class of constraints corresponds to the class of all the non-recursive Datalog programs [23], for which the data complexity of query answering (i.e., with a fixed set of rules) is in uniform AC0 [24].

B. Datalog±

Datalog± is a family of extensions of the Datalog language [23] where the rules are tuple-generating dependencies, equality-generating dependencies and negative constraints.

A tuple-generating dependency (TGD) is a formula σ over a schema S of the form φ(X,Y) → ∃Z ψ(X,Z), where φ and ψ are conjunctions of atoms respectively called the body and the head of the TGD. Whenever Z is empty, we say that the TGD is full, because it completely specifies the variables in the head. Given a TGD σ, we denote by Uσ the set of all the positions in head(σ) where a universally quantified variable occurs.

An equality-generating dependency (EGD) is a formula η over S of the form ∀X Φ(X) → Xi = Xj, where Φ(X) is a conjunction of atoms and Xi, Xj ∈ X.

15 http://www.w3.org/RDF/
16 http://www.w3.org/TR/owl2-syntax/
17 http://flora.sourceforge.net/
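To make the effect of an EGD concrete, consider a functionality constraint stating that each individual has at most one first-born (the EGD used later in Example 1): when its body matches two facts that disagree on the equated terms, a labelled null is unified with the other term, while two distinct constants yield a violation. A minimal sketch (the facts and the null-naming convention are our assumptions):

```python
# Facts for the EGD firstBorn(X,Y) ∧ firstBorn(X,Z) → Y = Z.
# Hypothetical convention (ours): labelled nulls are strings starting with "z".
facts = {("firstBorn", "i1", "i3"), ("firstBorn", "i1", "z1"),
         ("name", "z1", "Astyanax")}

def apply_egd(facts):
    """Enforce the EGD: unify a labelled null with the term it must equal,
    or fail when two distinct constants are equated."""
    for (p1, x1, y) in sorted(facts):
        for (p2, x2, z) in sorted(facts):
            if p1 == p2 == "firstBorn" and x1 == x2 and y != z:
                if not y.startswith("z"):      # y is a constant:
                    y, z = z, y                # make y the null, if there is one
                if not y.startswith("z"):
                    raise ValueError("violation: two distinct constants equated")
                # Replace the null y by z everywhere and re-check.
                return apply_egd({(p,) + tuple(z if t == y else t for t in ts)
                                  for (p, *ts) in facts})
    return facts

print(sorted(apply_egd(facts)))
# [('firstBorn', 'i1', 'i3'), ('name', 'i3', 'Astyanax')]
```

The null z1 is unified with the constant i3, and every fact mentioning it (here, the name fact) is updated accordingly.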

Given an EGD η and an atom a ∈ body(η), we denote as J^a_η (joined positions) the set of all the positions π_a in a such that the variable V occurring in π_a also occurs in another atom b ∈ body(η) in a position π_b ≠ π_a. We also denote as P^a_η(V) the set of all the positions of a variable V in an atom a ∈ body(η).

A negative constraint (NC) is a first-order sentence of the form ∀X φ(X) → ⊥, where ⊥ denotes the truth constant false. In other words, a negative constraint specifies that certain formulas must be false in every model of a given theory. NCs have been proven useful for modelling various forms of ontological structures [3], [25] as well as conceptual schemas such as Entity-Relationship (ER) diagrams [26].

It is known that query answering under TGDs is undecidable [27]. However, the interaction of any set of TGDs and NCs is known to be safe if query answering is decidable for the set of TGDs under consideration [28]. When a set of EGDs is considered together with a set of TGDs, the problem of query answering is undecidable in the general case, since EGDs generalise the well-known class of functional dependencies in databases. For this reason, the interaction between EGDs and TGDs must be controlled to retain decidability. This can be done by ensuring that the EGDs and TGDs are non-conflicting [29], [25].

A TGD σ is non-conflicting with an EGD η iff for each atom a ∈ body(η), if pred(a) ≠ r then one of the following conditions is satisfied: (i) Uσ is not a superset of J^a_η and each existential variable in σ occurs just once, or (ii) there exists a variable V such that |P^a_η(V)| ≥ 2, and there exist two positions π1, π2 ∈ P^a_η(V) such that in head(σ) at π1, π2 we have either two distinct existential variables, or one existential and one universal variable. It is known that query answering under non-conflicting EGDs and TGDs is decidable [3], [25].

Example 1: (non-conflicting dependencies) Consider the following dependencies:

η: firstBorn(X,Y) ∧ firstBorn(X,Z) → Y=Z
σ1: father(X,Y) → firstBorn(Y,X)
σ2: firstBorn(X,Y) → father(Y,X)

The EGD η states that a father may have only one first-born child, the TGD σ1 says that all the children of a father are first-born children (i.e., they are all twins), while the TGD σ2 states that all the first-born children are also children. Clearly, if we assume that η holds then, intuitively, σ1 conflicts with η, because they are both true exclusively in the case of only children; instead, σ2 is non-conflicting. This intuition is captured by the notion of non-conflicting dependencies of Datalog±: in fact, the set of joined positions for firstBorn is J^firstBorn_η = {firstBorn[1]} and the set of universal positions in σ1 is Uσ1 = {firstBorn[1], firstBorn[2]} (i.e., conflicting), while Uσ2 = ∅ (i.e., non-conflicting).
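The check of Example 1 can be mechanised as follows. This is a simplified sketch, not the full Datalog± definition: positions are encoded as (predicate, index) pairs, bodies and heads are single atoms, and the "each existential variable occurs just once" proviso of condition (i) is vacuous here because the example has no existential variables. Under this encoding, the universal positions of σ2 lie on father and therefore never cover J^firstBorn_η, so σ2 passes the check:

```python
def positions(atom):
    """Pair each argument of an atom with its position (predicate, index)."""
    pred, *args = atom
    return [((pred, i), v) for i, v in enumerate(args, start=1)]

def U(tgd):
    """U_sigma: positions of head(sigma) holding a universal (body) variable."""
    body, head = tgd
    body_vars = {v for _, v in positions(body)}
    return {pos for pos, v in positions(head) if v in body_vars}

def J(egd_body, atom):
    """J^a_eta: positions of atom whose variable occurs in another body atom."""
    others = {v for b in egd_body if b is not atom for _, v in positions(b)}
    return {pos for pos, v in positions(atom) if v in others}

def non_conflicting(tgd, egd_body):
    body, head = tgd
    existentials = ({v for _, v in positions(head)}
                    - {v for _, v in positions(body)})
    for a in egd_body:
        if not (U(tgd) >= J(egd_body, a)):   # condition (i) holds for this atom
            continue
        if not existentials:                 # condition (ii) needs existentials
            return False
    return True

egd_body = [("firstBorn", "X", "Y"), ("firstBorn", "X", "Z")]
s1 = (("father", "X", "Y"), ("firstBorn", "Y", "X"))   # sigma_1
s2 = (("firstBorn", "X", "Y"), ("father", "Y", "X"))   # sigma_2
print(non_conflicting(s1, egd_body), non_conflicting(s2, egd_body))
# False True
```

For σ1, Uσ1 covers the joined position firstBorn[1] and no existential variable can save condition (ii), so the pair conflicts; for σ2 the check succeeds immediately.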

NYAYA implements Linear Datalog±, a member of the Datalog± family where the TGDs have a single atom in the body and for which conjunctive query answering is FO-reducible. Reasoning is thus reduced to conjunctive query answering under constraints, where the data are considered as a traditional relational database and the meta-data are translated into first-order constraints. This reduction allows NYAYA to delegate both reasoning and querying to the relational database engine, thus obtaining the same efficiency as for traditional database queries. Table I shows a possible DL representation of the example of Figure 2, indicating which elements are representable in OWL2-QL, RDFS(DL) and Datalog±, respectively.

TABLE I
AENEID'S BLOOD-LINES (LANGUAGE COMPARISON)

DL | OWL2-QL | RDFS(DL) | Datalog±
Person ⊑ ∃name ⊓ ∃father−.Person | ✓ | - | Person(X) → ∃Y ∃Z name(X,Y) ∧ father(Z,X) ∧ Person(Z)
Dom(brother) ⊑ Person | ✓ | ✓ | brother(X,Y) → Person(X)
Ran(brother) ⊑ Person | ✓ | ✓ | brother(X,Y) → Person(Y)
Dom(father) ⊑ Person | ✓ | ✓ | father(X,Y) → Person(X)
Ran(father) ⊑ Person | ✓ | ✓ | father(X,Y) → Person(Y)
Dom(firstBorn) ⊑ Person | ✓ | ✓ | firstBorn(X,Y) → Person(X)
Ran(firstBorn) ⊑ Person | ✓ | ✓ | firstBorn(X,Y) → Person(Y)
firstBorn ⊑ father− | ✓ | - | firstBorn(X,Y) → father(Y,X)
brother ⊑ brother− | ✓ | - | brother(X,Y) → brother(Y,X)
funct(firstBorn) | - | - | firstBorn(X,Y) ∧ firstBorn(X,Z) → Y=Z (EGD)
name(i0, 'Priam') | ✓ | ✓ | name(i0, 'Priam')
name(i1, 'Hector') | ✓ | ✓ | name(i1, 'Hector')
name(i2, 'Paris') | ✓ | ✓ | name(i2, 'Paris')
name(i3, 'Astyanax') | ✓ | ✓ | name(i3, 'Astyanax')
Person(i1) | ✓ | ✓ | Person(i1)
father(i0,i1) | ✓ | ✓ | father(i0,i1)
father(i0,i2) | ✓ | ✓ | father(i0,i2)
brother(i1,i2) | ✓ | ✓ | brother(i1,i2)
firstBorn(i1,i3) | ✓ | ✓ | firstBorn(i1,i3)

Sometimes, depending on the application requirements, we would like to represent different sets of constraints over the same dataset. Someone might want to represent the fact that every person has a name and must have a father (e.g., for logical coherence), and others the fact that brother is a symmetric relationship. Differently from languages such as OWL2-QL and RDFS(DL), where decidability and tractability are achieved by posing restrictions on the constructors that can be adopted, Datalog± allows us to use combinations of constructs, usually forbidden in OWL2-QL and RDFS(DL), as long as they do not harm the decidability and efficiency of reasoning and query answering.

C. Semantic Data Kiosks

In NYAYA, reasoning and querying must also take into account the logical organization of the persistent storage. We now define the Semantic Data Kiosk as an extension of a traditional knowledge base that explicitly considers the structure of the storage.

Definition 1: (Semantic Data Kiosk) A Semantic Data Kiosk is a triple K = (ΣO, ΣS, D) where (i) ΣO is a set of Linear Datalog± rules, (ii) ΣS is a set of full TGDs whose rules have, as head, a (single) predicate of ΣO and, as body, a conjunction of atoms over predicates of the meta-model M presented in Section III, and (iii) D is a database over M.

Table II shows the typical structure of the rules in a semantic data kiosk. The rules of ΣO are LTGDs encoding the definitions of the concept Person, the role father and the attribute name, while the rules of ΣS always constitute a safe and non-recursive Datalog program, since the atoms in their bodies never occur in any of the heads and, for each rule, all the variables appearing in the head also appear in the body.

Knowing that in NYAYA query answering over D under ΣO is FO-reducible, a natural question is whether query answering remains FO-reducible also in the presence of ΣS.

First of all, observe that every derivation using the rules in ΣS has length 1, because no atom ever contributes to the production of another body atom. Note now that this particular structure of the rules in ΣS imposes a natural stratification of the rules in ΣT w.r.t. those in ΣS. Query answering can then be reduced to a two-step process where every conjunctive query q is first rewritten in terms of ΣT (by virtue of the FO-reducibility of Linear Datalog±), obtaining a UCQ QΣT that is a perfect rewriting of q w.r.t. ΣT. Every qi ∈ QΣT can then be rewritten in terms of ΣS, obtaining another UCQ QiΣS that is a perfect rewriting of qi w.r.t. ΣS. The union of all the QiΣS, for i ∈ {1, . . . , |QΣT|}, is thus a perfect rewriting of q w.r.t. ΣT ∪ ΣS.

Based on the above considerations, the following theorem establishes that, provided that query answering under ΣT is FO-reducible, the logical organization of the data in the persistent storage does not affect the theoretical complexity of query answering.

Theorem 1: Consider a set ΣT of Linear Datalog± rules and a set ΣS of positive, safe, non-recursive Datalog rules such that (i) for each σ ∈ ΣS, head(σ) appears only in the rules of ΣT and (ii) every predicate appearing in ΣT appears only in ΣT or in the head of a rule σS ∈ ΣS. Then conjunctive query answering under ΣT ∪ ΣS is FO-reducible.


TABLE II
RUNNING EXAMPLE: SEMANTIC DATA KIOSK

ΣO:
name(X,Y) :- Person(X).
father(X,Y), Person(Y) :- Person(X).

ΣS:
Person(X) :- I-CLASS(Z0,X,Z1), CLASS(Z1,'http://aeneid.gr/aeneid.owl#Person').
father(X,Y) :- I-OBJECTPROPERTY(Z0,Z1,Z2,Z3), I-CLASS(Z1,X,Z6), I-CLASS(Z2,Y,Z7),
               OBJECTPROPERTY(Z3,'http://aeneid.gr/aeneid.owl#father',Z4,Z5).
name(X,Y) :- I-DATAPROPERTY(Z0,Z1,Y,Z2), I-CLASS(Z1,X,Z5),
             DATAPROPERTY(Z2,'http://aeneid.gr/aeneid.owl#name',Z3,Z4).

D:
I-CLASS('06','http://aeneid.gr/aeneid.owl#genID_0','01').
I-DATAPROPERTY('10','06','Priam','03').
I-OBJECTPROPERTY('14','06','07','04').

The most important aspect is that the FO-reducibility of Linear Datalog± programs can be syntactically checked for each program and does not depend on the language, as is the case for other Semantic-Web languages such as DL-Lite and RDFS(DL). Consider the example of Table I: due to the restrictions imposed by RDFS(DL) and DL-Lite, neither of them is able to represent the example completely. Instead, Datalog± can capture the example entirely, because the specific program is FO-reducible.

V. REWRITING-BASED ACCESS TO SEMANTIC DATA KIOSKS

The meta-model presented in Section III imposes a strong separation between the data and the meta-data in a kiosk. This section shows how this logical organisation supports efficient reasoning and querying.

Given a query q, the actual computation of the rewriting is done by using the rules in ΣT and ΣS as rewriting rules, in the style of [12], and applying backward-chaining resolution.

Given two atoms a and b, we say that they unify if there exists a substitution γ, called a unifier for a and b, such that γ(a) = γ(b). A most general unifier (MGU) for a and b, denoted γa,b, is a unifier such that for every other unifier γ for a and b there exists a substitution γ′ with γ = γ′ ∘ γa,b. Notice that if two atoms unify then an MGU exists and it is unique up to variable renaming.
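As an illustration, an MGU of two function-free atoms can be computed with a standard substitution-building pass over the argument pairs. The representation below (atoms as (predicate, terms) pairs, variables as strings starting with an uppercase letter) is our own sketch, not NYAYA's code:

```python
# A small sketch of MGU computation for function-free atoms.
# Variables start with an uppercase letter; everything else is a constant.

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def resolve(t, subst):
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def mgu(a, b):
    """Return a most general unifier of atoms a and b as a dict
    (variable -> term), or None if the atoms do not unify."""
    if a[0] != b[0] or len(a[1]) != len(b[1]):
        return None
    subst = {}
    for x, y in zip(a[1], b[1]):
        x, y = resolve(x, subst), resolve(y, subst)
        if x == y:
            continue
        if is_var(x):
            subst[x] = y
        elif is_var(y):
            subst[y] = x
        else:
            return None          # two distinct constants: no unifier
    return subst
```

For instance, mgu(('father', ('A','B')), ('father', ('C','B'))) returns {'A': 'C'}, which is unique up to variable renaming.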

In order to describe how the perfect-rewriting is constructed, we now define the notion of applicability of a TGD to an atom of a query, assuming w.l.o.g. that the TGDs have a single atom in their head.

Definition 2: (Applicability) [26] Let σ be an LTGD over a schema S and q a CQ over S. Given an atom a ∈ body(q), we say that σ is applicable to a whenever a and head(σ) unify through an MGU γa,σ and the following conditions are satisfied:

1) if the term at position π in a is either a constant or a shared variable in q, then the variable at position π in head(σ) also occurs in body(σ);

2) if a shared variable in q occurs only in a, at positions π1, . . . , πm for m ≥ 2, then either the variable at position πi in head(σ), for each i ∈ {1, . . . , m}, also occurs in body(σ), or at positions π1, . . . , πm in head(σ) we have the same existential variable.

Whenever a TGD σ is applicable to an atom a of a query q, we substitute γa,σ(body(σ)) for the atom a in q, producing a so-called partial-rewriting qr of q such that qr ⊆ q. Roughly speaking, we use the TGD as a rewriting rule whose direction is from right to left, and the notion of applicability guarantees that every time an atom of a query q is rewritten using the TGD, what we obtain is always a sub-query of q. Conditions (1) and (2) above are used to ensure soundness, since a naive backward-resolution rewriting may produce unsound partial-rewritings, as shown by the following example.

Example 2: (Loss of soundness) Consider the LTGD σ: person(X) → ∃Y name(X,Y) over the schema S = {name, person} and the conjunctive query q(A) ← name(A,'Hector') asking for all the persons in the database named Hector. Without the restrictions imposed by the applicability condition, a naive rewriting would produce the partial-rewriting qr(A) ← person(A), where the information about the constant Hector gets lost. Consider now the database D = {person(i0), person(i1), name(i1,'Hector')}. The answer to qr is the set {i0, i1}, where i0, corresponding to Priam, is not a sound answer to q.
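The loss of soundness can be verified mechanically by evaluating both queries over D. The brute-force conjunctive-query evaluator below is only an illustrative device of ours (variables are single uppercase letters in this encoding):

```python
# Checking Example 2: evaluating q and the naive rewriting qr over D shows
# the unsound answer. Brute-force CQ evaluation; variables are single
# uppercase letters, all other terms are constants.

from itertools import product

D = {("person", ("i0",)), ("person", ("i1",)), ("name", ("i1", "Hector"))}

def answers(head_vars, body, db):
    """All bindings of head_vars making every atom of `body` true in db."""
    consts = sorted({t for _, args in db for t in args})
    variables = sorted({t for _, args in body for t in args
                        if len(t) == 1 and t.isupper()})
    out = set()
    for vals in product(consts, repeat=len(variables)):
        s = dict(zip(variables, vals))
        ground = [(p, tuple(s.get(t, t) for t in args)) for p, args in body]
        if all(atom in db for atom in ground):
            out.add(tuple(s[v] for v in head_vars))
    return out

# q(A)  <- name(A,'Hector')        (the original query)
q_ans = answers(["A"], [("name", ("A", "Hector"))], D)
# qr(A) <- person(A)               (the naive, unsound rewriting)
qr_ans = answers(["A"], [("person", ("A",))], D)
```

Here q_ans is {('i1',)} while qr_ans additionally contains ('i0',), the unsound answer discussed above.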

However, the applicability condition might prevent some of the partial-rewritings from being generated, thus losing completeness, as shown by the following example:

Example 3: (Loss of completeness) Consider the following set of LTGDs Σ over the schema S = {father, person}:

σ1: person(X) → ∃Y father(Y,X)
σ2: father(X,Y) → person(X)

and the conjunctive query q(B) ← father(A,B), person(A). The only viable strategy in this case is to apply σ2 to the atom person(A) in q, since the atom father(A,B) is blocked by the shared variable A. The obtained partial-rewriting is thus the query q1r(B) ← father(A,B), father(A,X1). Note that, in q1r, the variable A remains shared, thus it is not possible to apply σ1. However, the query q2r(B) ← person(B) is also a partial-rewriting, but its generation is prevented by the applicability condition.

In order to solve this problem and achieve completeness, approaches such as [12] resort to exhaustive factorisations that, however, produce a non-negligible number of redundant queries. We now define a restricted form of factorisation that generates only those queries that actually lead to the generation of a non-redundant partial-rewriting. This corresponds to identifying all the atoms in the query whose shared existential variables come from the same atom in all the derivations, and that can thus be unified with no loss of information.

Definition 3: (Factorisability) Given a query q and a set of TGDs Σ, we say that a position π in an atom a of q is existential if there exists a TGD σ such that a unifies with head(σ) and the term at position π in head(σ) is an existential variable. A set of atoms S ⊆ body(q) is factorisable w.r.t. a TGD σ iff the following conditions hold:

• a and b unify through an MGU γa,b, for all a, b ∈ S;
• if a variable X appears in an existential position in an atom of S, then X appears only in existential positions in all the other atoms of q.

Example 4: (Factorisability) Consider, as an example, the following CQs:

q1(A) ← father(A,B), father(C,B)
q2(A) ← father(A,B), father(B,C)

and the LTGD σ: person(X) → ∃Y father(Y,X). In q1, the atoms father(A,B) and father(C,B) unify through the MGU γ = {C → A} and they are also factorisable, since the variables C and A appear only in existential positions (i.e., father[1]) in q1. The factorisation results in the query q3(A) ← father(A,B); it is worth noting that, before the factorisation, σ was not applicable to q1, while it is applicable to q3. On the contrary, in q2, although the atoms father(A,B) and father(B,C) unify, the variable B appears both in the existential position father[1] (in father(B,C)) and in position father[2] (in father(A,B)), which is not existential w.r.t. σ; thus the atoms are not factorisable.
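The variable condition of Definition 3 can be sketched as follows. This Python fragment is illustrative only: it hard-wires the example's single LTGD and checks just the variable condition, since the atoms of Example 4 already unify.

```python
# A sketch of the factorisability test of Definition 3 on Example 4.
# Variables are single uppercase letters; a TGD is encoded as
# (body_atom, head_atom, existential_head_variables).

def existential_positions(atom, tgd):
    """0-based head positions of the TGD carrying an existential variable,
    when the atom shares the head's predicate."""
    _body, head, ex_vars = tgd
    if atom[0] != head[0]:
        return set()
    return {i for i, t in enumerate(head[1]) if t in ex_vars}

def factorisable(atoms, query_body, tgd):
    """Variable condition of Definition 3: every variable occurring at an
    existential position in some atom of `atoms` must occur only at
    existential positions in all atoms of the query."""
    ex_vars = set()
    for a in atoms:
        for i in existential_positions(a, tgd):
            ex_vars.add(a[1][i])
    for a in query_body:
        ex_pos = existential_positions(a, tgd)
        for i, t in enumerate(a[1]):
            if t in ex_vars and i not in ex_pos:
                return False
    return True

# sigma: person(X) -> exists Y . father(Y,X)
sigma = (("person", ("X",)), ("father", ("Y", "X")), {"Y"})

q1 = [("father", ("A", "B")), ("father", ("C", "B"))]   # factorisable
q2 = [("father", ("A", "B")), ("father", ("B", "C"))]   # not factorisable
```

On q1 the test succeeds (A and C occur only at the existential position father[1]); on q2 it fails because of B, as discussed above.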

We are now ready to describe the algorithm LTGD-rewrite, whose pseudo-code is presented as Algorithm 1.

The perfect-rewriting of a CQ q is computed by exhaustive expansion of the atoms in body(q) by means of the applicable TGDs in Σ. Each application of a TGD leads to a new query qn (i.e., a partial-rewriting) retrieving a subset of the sound answers to q; these queries are then factorised and stored in the set Qr. The process is repeated until no new partial-rewritings can be generated (up to variable renaming). The UCQ q(X) ← ⋁_{qr ∈ Qr} body(qr(X,Y)) constitutes a perfect-rewriting, i.e., a rewritten query that produces sound and complete answers to q under the constraints defined by Σ.

Algorithm 1 The algorithm LTGD-rewrite
Require: A schema S, a set of LTGDs Σ over S and a CQ q over S
Ensure: A set of partial rewritings Qr for q over S
  Qn := {q}
  Qr := ∅
  repeat
    Qr := Qr ∪ Qn
    Qtemp := Qn
    Qn := ∅
    for all q ∈ Qtemp do
      for all σ ∈ Σ do
        – factorisation –
        q′ := factorise(q)
        – rewriting –
        for all a ∈ body(q′) do
          γa,σ := ∅
          if isApplicable(σ, a, γa,σ) then
            qn := γa,σ(q′[a ← body(σ)])
            Qn := Qn ∪ {qn}
          end if
        end for
      end for
    end for
  until Qn = ∅
  return Qr

In the following, an example of application of LTGD-rewrite is given:

Example 5: (ΣO Rewriting) Consider the set of LTGDs Σ of Example 3 and the query q(B) ← father(A,B), person(A). LTGD-rewrite first applies σ2 to the atom person(A), since σ1 is not applicable. The produced partial-rewriting is the query q1r(B) ← father(A,B), father(A,V1), whose atoms are factorisable through the MGU γ = {V1 → B}. The factorised query is q2r(B) ← father(A,B). LTGD-rewrite can now apply σ1 to q2r, since the variable A in the existential position is now a non-shared variable. The obtained partial-rewriting is the query q3r(B) ← person(B). The set of partial rewritings produced by our algorithm is thus Qr = {q, q1r, q3r}.
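To make the interplay of applicability, factorisation and the main loop concrete, the following is a toy Python re-implementation of LTGD-rewrite, restricted to what Example 5 needs: TGDs with a single body atom, no constants, variables as uppercase strings; condition (2) of Definition 2 and the general factorisation of arbitrary atom sets are simplified. It is a sketch, not NYAYA's actual code, and being breadth-exhaustive it may also emit further sound partial-rewritings beyond the three listed above.

```python
from itertools import count

FRESH = count(1)

def mgu(a, b):
    """MGU of two atoms whose terms are all variables; maps variables of
    `a` to terms of `b` where they differ. Atoms are (pred, terms)."""
    if a[0] != b[0] or len(a[1]) != len(b[1]):
        return None
    s = {}
    def res(t):
        while t in s:
            t = s[t]
        return t
    for x, y in zip(a[1], b[1]):
        x, y = res(x), res(y)
        if x != y:
            s[x] = y
    return s

def subst(atom, s):
    def res(t):
        while t in s:
            t = s[t]
        return t
    return (atom[0], tuple(res(t) for t in atom[1]))

def rename(tgd):
    """Rename a TGD (body_atom, head_atom, ex_vars) apart with fresh vars."""
    body, head, ex = tgd
    m = {v: "V%d" % next(FRESH) for v in set(body[1]) | set(head[1])}
    return subst(body, m), subst(head, m), {m[v] for v in ex}

def shared_vars(head_vars, body):
    occ = list(head_vars) + [t for a in body for t in set(a[1])]
    return {t for t in occ if occ.count(t) > 1}

def applicable(tgd, atom, head_vars, body):
    """Condition (1) of Definition 2 (no constants in this setting)."""
    _b, h, ex = tgd
    g = mgu(h, atom)
    if g is None:
        return None
    sh = shared_vars(head_vars, body)
    for i, t in enumerate(atom[1]):
        if t in sh and h[1][i] in ex:
            return None
    return g

def factorise(body, tgd):
    """Merge pairs of atoms allowed by Definition 3 (pairwise version)."""
    _b, h, ex = tgd
    exp = {i for i, t in enumerate(h[1]) if t in ex}
    for i in range(len(body)):
        for j in range(i + 1, len(body)):
            a, c = body[i], body[j]
            if a[0] != h[0] or c[0] != h[0] or not exp:
                continue
            g = mgu(c, a)          # oriented to keep the first atom's vars
            if g is None:
                continue
            ev = {a[1][k] for k in exp} | {c[1][k] for k in exp}
            if all(t not in ev or k in exp
                   for atom in body for k, t in enumerate(atom[1])):
                new = [subst(x, g) for x in body]
                new = [x for k, x in enumerate(new) if x not in new[:k]]
                return factorise(new, tgd)
    return body

def canon(head_vars, body):
    """Canonical renaming for duplicate elimination (order-sensitive,
    sufficient for this example)."""
    m, names = {}, count(1)
    def r(t):
        if t not in m:
            m[t] = "X%d" % next(names)
        return m[t]
    h = tuple(r(t) for t in head_vars)
    return h, tuple(sorted((a[0], tuple(r(t) for t in a[1])) for a in body))

def ltgd_rewrite(query, sigma):
    """Algorithm 1, specialised to this toy setting."""
    qn, qr = [query], {}
    while qn:
        nxt = []
        for head_vars, body in qn:
            key = canon(head_vars, body)
            if key in qr:
                continue
            qr[key] = (head_vars, body)
            for tgd in sigma:
                t = rename(tgd)
                fbody = factorise(body, t)
                for idx, a in enumerate(fbody):
                    g = applicable(t, a, head_vars, fbody)
                    if g is None:
                        continue
                    nb = [subst(x, g)
                          for x in fbody[:idx] + [t[0]] + fbody[idx + 1:]]
                    nh = subst(("q", head_vars), g)[1]
                    nxt.append((nh, nb))
        qn = nxt
    return qr

# Example 5: sigma1: person(X) -> exists Y . father(Y,X)
#            sigma2: father(X,Y) -> person(X)
sigma = [(("person", ("X",)), ("father", ("Y", "X")), {"Y"}),
         (("father", ("X", "Y")), ("person", ("X",)), set())]
result = ltgd_rewrite((("B",), [("father", ("A", "B")),
                                ("person", ("A",))]), sigma)
```

Running it on Example 5's input yields, among its partial-rewritings, the three queries q, q1r and q3r discussed above.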

Each partial-rewriting produced by LTGD-rewrite is then rewritten in terms of the storage tables using the rules in ΣS. In this case the rewriting is generated using the same technique adopted in Algorithm 1; however, given a query q generated by LTGD-rewrite, we know that: (i) for each atom in the query there exists only one rule in ΣS capable of expanding it and (ii) if the head of a rule unifies with an atom of the query then the rule is always applicable, because there are no existentials in the heads of the rules in ΣS. The algorithm then starts from q and rewrites its atoms until all of them have been expanded. In addition, in NYAYA we also exploit the knowledge about the logical organisation of the storage: e.g., we know that (i) the foreign-key constraints in the storage tables are always satisfied by construction and (ii) the OIDs of the tuples in D are generated with a known hash algorithm. This allows us to replace the foreign keys with the corresponding target values (hashes), thus simplifying the queries.
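The text does not spell out the hash scheme; as a purely illustrative assumption, suppose the OID of a meta-data tuple is a short digest of its URI. Determinism is the point: the rewriter can then inline the OID of a known URI as a constant instead of performing a join. A minimal sketch under this assumption:

```python
# Hypothetical OID generation by hashing (the actual NYAYA scheme is not
# specified in the text): a deterministic digest lets the rewriter replace
# a join on a foreign key with a constant comparison.

import hashlib

def oid(uri):
    """Illustrative OID: first 8 hex digits of the SHA-1 digest of the URI."""
    return hashlib.sha1(uri.encode("utf-8")).hexdigest()[:8]

# Instead of joining I-OBJECTPROPERTY with OBJECTPROPERTY to look up the
# 'father' property by URI, a rewriter could emit the constant directly:
father_oid = oid("http://aeneid.gr/aeneid.owl#father")
```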

Example 6: (ΣS Rewriting) Consider the set ΣS of Table II and the query q of Example 5. The corresponding expansion is the following query:

QS(B) ← I-OBJECTPROPERTY(Z0,Z1,Z2,Z3),
        OBJECTPROPERTY(Z3,'father',Z4,Z5),
        I-CLASS(Z1,A,Z6),
        I-CLASS(Z2,B,Z7),
        CLASS(Z6,'Person').

The knowledge derived from the storage meta-model (see Figure 3) enables further optimisations of the above query, such as the replacement of foreign keys with the corresponding values, which produces the following optimized query:

QS(B) ← I-OBJECTPROPERTY(Z0,Z1,Z2,'04'),
        I-CLASS(Z1,A,'01'),
        I-CLASS(Z2,B,Z3).

Since the produced queries are always conjunctive queries, they can be easily rewritten in SQL and executed by a relational DBMS. The following is the SQL representation of the optimised query QS above:

SELECT C2.URI
FROM   I-CLASS AS C1,
       I-CLASS AS C2,
       I-OBJECTPROPERTY AS OP1
WHERE  C1.OID = OP1.SubjectI-ClassOID AND
       C2.OID = OP1.ObjectI-ClassOID AND
       C1.ClassOID = '01' AND
       OP1.PropertyOID = '04'

VI. IMPLEMENTATION AND EXPERIMENTAL SETTING

NYAYA is implemented in Java and its inference engine is based on the IRIS Datalog engine18. The persistent storage has been implemented in Oracle 11g R2, exploiting its native partitioning techniques. In particular, NYAYA adopts referential partitioning, which allows the partitioning of related tables based on referential constraints. In other words, the relations storing the data are partitioned with respect to their foreign keys to the relations storing the schema level (i.e., the meta-data). For instance, I-CLASS, I-DATAPROPERTY and I-OBJECTPROPERTY were partitioned with respect to the ClassOID and PropertyOID references, respectively. Each partition has a check constraint to redirect data inserted into the parent table, and query processing, to the children. Moreover, we defined clustered B+-tree indexes on the SubjectI-ClassOID attribute of I-OBJECTPROPERTY and the I-ClassOID attribute of I-DATAPROPERTY, and unclustered B+-tree indexes on the ObjectI-ClassOID of I-OBJECTPROPERTY, the Value of I-DATAPROPERTY and the URI of I-CLASS. In our implementation we generated OIDs and translated textual data (e.g., values of I-DATAPROPERTY) by hash coding to gain more performance.

18 http://www.iris-reasoner.org/

A prototype implementation of NYAYA is available online at http://pamir.dia.uniroma3.it:8080/nyaya.

In the remainder of this section we discuss the results of the experimental campaign conducted to evaluate the performance of our system.

A. Performance Evaluation

We compared NYAYA with two well-known systems: BigOWLim19 and IODT20. We used BigOWLim v. 3.3 on the file system (i.e., as a backend store), while IODT v. 1.5, including the Scalable Ontology Repository (SOR), was used over DB2 v. 9.1. All the experiments were performed on a dual quad-core 2.66 GHz Intel Xeon, running Linux RedHat, with 8 GB of memory, 6 MB cache, and a two-disk 1 TB striped RAID array.

In our experiments we used two widely-accepted benchmarks: LUBM21 and UOBM [30] (for both, we generated up to 50 universities, i.e., 12.8 million triples). Since the expressiveness of the meta-data in the above datasets exceeds that of Datalog± (OWL-Lite for LUBM and OWL-DL for UOBM), in our experiments we used the best FO-reducible approximation of the meta-data for all the systems.

Moreover, we used Wikipedia3, a conversion of the English Wikipedia22 into RDF. This is a monthly-updated dataset containing around 47 million triples. The dataset contains the two main classes Article and Category (where Category is a subclass of Article), four data properties (text, dc:title, dc:modified, dc:contributor) and seven object properties (skos:subject, skos:narrower, link with sub-properties internalLink, externalLink, interwikiLink, and redirectsTo). The expressiveness of its meta-data falls entirely within Datalog±, so no adaptations were made.

We conducted three groups of experiments: (i) data loading,(ii) querying and reasoning, and (iii) maintenance.

TABLE III
DATA LOADING

            | BigOWLim   | IODT        | NYAYA
LUBM(50)    | 8 (mins)   | 77 (mins)   | 9.18 (mins)
UOBM(50)    | 18 (hours) | 65 (hours)  | 0.5 (hours)
Wikipedia3  | 71 (hours) | 181 (hours) | 2.38 (hours)

Data loading. As shown in Table III, NYAYA significantly outperforms IODT and BigOWLim. The main advantage of our framework is that we execute pure data loading, while the other systems have to pre-compile the knowledge base, producing a (possibly exponential) blow-up of the tuples to be stored. The loading time of BigOWLim changes dramatically from LUBM to UOBM (135 times slower), due to the increasing complexity of the meta-data, and from LUBM to Wikipedia3 (532 times slower), due to the large set of facts. Moreover, our framework is supported by the meta-model

19 http://www.ontotext.com/owlim/big/
20 http://www.alphaworks.ibm.com/tech/semanticstk
21 http://swat.cse.lehigh.edu/projects/lubm/
22 http://labs.systemone.net/wikipedia3


Fig. 5. UOBM50: Experimental Results — (a) cold-cache, (b) warm-cache

presented in Section III, which allows a parallel execution of the data import into the DBMS. BigOWLim has to process all triples in memory, while IODT, exploiting the Multi-Dimensional Clustering (MDC) [31] organization, has to maintain complex block indexes pointing to groups of records.
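The slow-down factors quoted above follow directly from Table III once the loading times are expressed in the same unit:

```python
# Sanity check of the BigOWLim slow-down factors quoted above, from the
# loading times of Table III (converted to minutes).

lubm_minutes = 8          # LUBM(50): 8 minutes
uobm_minutes = 18 * 60    # UOBM(50): 18 hours
wiki_minutes = 71 * 60    # Wikipedia3: 71 hours

uobm_factor = uobm_minutes / lubm_minutes   # 135x slower than LUBM(50)
wiki_factor = wiki_minutes / lubm_minutes   # ~532x slower than LUBM(50)
```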

Querying and reasoning. In this group of experiments we employed the 14 and 13 benchmark queries from LUBM and UOBM, respectively. Due to space limitations we omit the query lists here. For Wikipedia3 we defined six representative queries, as follows:

Q1(X):- Article(X), contributor(X,’Adam Bishop’).

Q2(X):- Article(X), link(’wiki:List_of_cities_in_Brazil’,X).

Q3(X):- Article(X), link(X,Y), internalLink('wiki:United_States', Y).

Q4(X):- link(X,Y), link(Y,Z), link(Z,’wiki:E-Mail’).

Q5(X):- Article(X).

Q6(Y):- Article(X), internalLink('wiki:Algorithm',X), internalLink(X,Y).

For OWL querying and reasoning, we performed cold-cache experiments (i.e., by dropping all file-system caches before restarting the various systems and running the queries) and warm-cache experiments (i.e., without dropping the caches). We repeated all procedures three times and measured the average execution time.

UOBM50. We ran queries of increasing complexity. In particular, for each query we evaluated three components: preprocessing, execution and traversal. BigOWLim exploits the preprocessing to load statistics and indexes in memory, while NYAYA computes the rewriting tasks (i.e., the ΣO, ΣS and SQL rewritings). IODT does not perform any preprocessing. The query run-times are shown in Figure 5.

In general BigOWLim performs better than IODT, and NYAYA is the best among the three approaches, despite some


Fig. 6. Wikipedia3: Experimental Results — (a) cold-cache, (b) warm-cache

queries where the generated ΣO rewriting produces a large number of CQs to be executed. However, we are currently investigating optimisation techniques to significantly reduce the number of CQs produced by query-rewriting. We obtained average cold-cache times of 1257, compared with 1783 for BigOWLim and 1993 for IODT. The warm-cache times of NYAYA (geometric mean of 39) are comparable with BigOWLim (geometric mean of 42) and better than IODT (geometric mean of 168). NYAYA spends most of its computation on the ΣO rewriting and then benefits from generating UCQs where each CQ can be executed in parallel and is efficiently answerable. Moreover, as for data loading, the entire procedure is well supported by the persistent organization of the data. BigOWLim and IODT mainly focus on the execution step; therefore the amount of data to query (reduced in our system) plays a relevant role. The warm-cache times demonstrate how BigOWLim is well optimized to reduce memory consumption for triple storage and query processing. We omit the LUBM results because the systems present a similar behaviour.

Wikipedia3. Figure 6 shows the query run-times. NYAYA is the best (on average) among the three approaches, but in this case IODT performs better than BigOWLim. This is due to the significant amount of data to process, although the ΣO is simpler. All queries involve hierarchy inferencing on Article and link. In particular, Q3 requires traversing all articles linking to other articles linked by http://en.wikipedia.org/wiki/United_States, while Q4 requires articles that are the subject of a complex chain of links. Both queries stress both the reasoning mechanism and the storage organization of all systems. In this case NYAYA outperforms all competitors by a large margin: we obtained cold-cache times with a geometric mean of 8618, compared with 21506 for BigOWLim and 16519 for IODT. The warm-cache times of NYAYA (geometric mean of 760) are comparable with IODT (geometric mean of 785) and better than BigOWLim (geometric mean of 1338). In this scenario IODT demonstrates high performance and scalability in querying and reasoning with respect to BigOWLim.
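The aggregate figures above are geometric means over the per-query run-times; for reference, a minimal helper (the numbers below are illustrative, not the benchmark figures):

```python
# Geometric mean, the aggregate used for the warm- and cold-cache figures.

import math

def geometric_mean(xs):
    """n-th root of the product, computed stably via logarithms."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

sample_times = [10.0, 40.0, 160.0]      # illustrative per-query times
sample_gmean = geometric_mean(sample_times)
```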

Maintenance. In this last group of experiments we tested maintenance operations, in terms of insertion (deletion) of new (old) terms or assertions and updates of existing terms or assertions in our datasets.

Fig. 7. Maintenance of a kiosk

Figure 7 shows the performance of each system with respect to maintenance operations (i.e., insert, delete and update) of data and meta-data w.r.t. the LUBM, UOBM and Wikipedia datasets. We consider the average response time (msec) to insert, delete and update a predicate (i.e., a concept, role or attribute) among the meta-data, or a fact in the database. The results highlight how NYAYA outperforms the other systems when dealing with insertions and deletions of both data and meta-data. In other words, in both BigOWLim and IODT, inserting a new predicate or a fact entails re-compiling a (possibly huge) set of tuples matching or referring to the new predicate (e.g., subclasses, property restrictions, and so on), while in NYAYA this is done by inserting or removing a single tuple in the storage. On the other hand, in order to update an existing predicate, in all the systems the meta-data (e.g., foreign keys, integrity constraints and so on) of all the facts referring to that predicate must be updated. As a consequence, the complexity of this task also depends on the amount of data stored in the database. In this respect, Figure 7 shows the behaviour of NYAYA w.r.t. meta-data updates both when pure-SQL operations are used (i.e., DBMS-independent) and when the partition-maintenance functions provided by the DBMS (i.e., Oracle) are adopted.

VII. CONCLUSION AND FUTURE WORK

In this paper we have presented the development of a system, called NYAYA, for the uniform management of different repositories of Semantic-Web data. NYAYA provides advanced reasoning capabilities over collections of Semantic-Web data sources, taking advantage, when available, of meta-data expressed in diverse ontology languages and at varying levels of complexity. In addition, it provides the ability to (i) query the sources with a uniform interface, (ii) efficiently update both data and meta-data, and (iii) easily add new sources. The reasoning and querying engine of NYAYA relies on Datalog±, a very general logic-based language that captures the most common tractable ontology languages. The effectiveness and the efficiency of the system have been tested over available data repositories and benchmarks.

This work opens a number of interesting and challenging directions for further research. First, we are planning to extend the approach to other FO-rewritable members of the Datalog± family, such as Sticky TGDs [25]; the next natural step is to further extend NYAYA to deal with non-FO-reducible languages like Guarded and Weakly-Guarded Datalog±. Moreover, on a different front, it would be interesting to exploit NYAYA to support also the reconciliation and the integration of the data of the Semantic Data Market.

REFERENCES

[1] C. Bizer and R. Cyganiak, "D2R Server: Publishing relational databases on the Semantic Web," in Proc. of the 5th Intl Semantic Web Conf. (ISWC) - Poster Session, 2006.

[2] R. De Virgilio, F. Giunchiglia, and L. Tanca, Eds., Semantic Web Information Management - A Model-Based Perspective. Springer Verlag, 2010.

[3] A. Calì, G. Gottlob, and T. Lukasiewicz, "A general datalog-based framework for tractable query answering over ontologies," in Proc. of the 28th Symp. on Principles of Database Systems (PODS), 2009, pp. 77-86.

[4] R. Angles and C. Gutierrez, "Querying RDF data from a graph database perspective," in Proc. of the 2nd European Semantic Web Conf. (ESWC), 2005, pp. 346-360.

[5] S. Alexaki, V. Christophides, G. Karvounarakis, D. Plexousakis, and K. Tolle, "The ICS-FORTH RDFSuite: Managing voluminous RDF description bases," in Proc. of the 2nd Intl Workshop on the Semantic Web (SemWeb), 2001, pp. 109-113.

[6] E. Chong, S. Das, G. Eadon, and J. Srinivasan, "An efficient SQL-based RDF querying scheme," in Proc. of the 31st Intl Conf. on Very Large Data Bases (VLDB), 2005, pp. 1216-1227.

[7] K. Wilkinson, "Jena property table implementation," in Proc. of the Intl Workshop on Scalable Semantic Web Knowledge Bases (SSWS), 2006.

[8] O. Erling and I. Mikhailov, "RDF support in the Virtuoso DBMS," in Proc. of the Conf. on Social Semantic Web (CSSW), 2007, pp. 7-24.

[9] D. J. Abadi, A. Marcus, S. Madden, and K. Hollenbach, "SW-Store: a vertically partitioned DBMS for Semantic Web data management," VLDB Journal, vol. 18, no. 2, pp. 385-406, 2009.

[10] J. Beckmann, A. Halverson, R. Krishnamurthy, and J. Naughton, "Extending RDBMSs to support sparse datasets using an interpreted attribute storage format," in Proc. of the 22nd Intl Conf. on Data Engineering (ICDE), 2006, pp. 58-68.

[11] L. Sidirourgos, R. Goncalves, M. L. Kersten, N. Nes, and S. Manegold, "Column-store support for RDF data management: not all swans are white," in Proc. of the Intl Conf. on Very Large Data Bases (VLDB), 2008, pp. 1553-1563.

[12] D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati, "Tractable reasoning and efficient query answering in description logics: The DL-Lite family," Journal of Automated Reasoning, vol. 39, no. 3, pp. 385-429, 2007.

[13] J. Perez, M. Arenas, and C. Gutierrez, "Semantics and complexity of SPARQL," ACM Transactions on Database Systems, vol. 34, no. 3, pp. 1-45, 2009.

[14] A. Acciarri, D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, M. Palmieri, and R. Rosati, "QuOnto: Querying ontologies," in Proc. of the National Conf. on Artificial Intelligence (AAAI), 2005, pp. 1670-1671.

[15] H. Perez-Urbina, B. Motik, and I. Horrocks, "Efficient query answering for OWL 2," in Proc. of the 8th Intl Semantic Web Conf. (ISWC), 2009, pp. 489-504.

[16] H. Perez-Urbina, B. Motik, and I. Horrocks, "Tractable query answering and rewriting under description logic constraints," Journal of Applied Logic, vol. 8, no. 2, pp. 151-232, 2009.

[17] C. Beeri, A. Levy, and M. Rousset, "Rewriting queries using views in description logics," in Proc. of the 16th Symp. on Principles of Database Systems (PODS), 1997, pp. 99-108.

[18] C. Lutz, D. Toman, and F. Wolter, "Conjunctive query answering in the description logic EL using a relational database system," in Proc. of the 21st Intl Joint Conf. on Artificial Intelligence (IJCAI), 2009, pp. 2070-2075.

[19] P. Atzeni, P. Cappellari, R. Torlone, P. Bernstein, and G. Gianforme, "Model-independent schema translation," VLDB Journal, vol. 17, no. 6, pp. 1347-1370, 2008.

[20] F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi, and P. F. Patel-Schneider, The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, 2003.

[21] M. Kifer and G. Lausen, "F-Logic: a higher-order language for reasoning about objects, inheritance, and scheme," ACM SIGMOD Record, vol. 18, no. 2, pp. 134-146, 1989.

[22] S. Abiteboul and O. M. Duschka, "Complexity of answering queries using materialized views," in Proc. of the 17th Symp. on Principles of Database Systems (PODS), 1998, pp. 254-263.

[23] S. Ceri, G. Gottlob, and L. Tanca, "What you always wanted to know about Datalog (and never dared to ask)," IEEE Transactions on Knowledge and Data Engineering, vol. 1, no. 1, pp. 146-166, 1989.

[24] T. Lee, "Arithmetical definability over finite structures," Mathematical Logic Quarterly, vol. 49, no. 4, pp. 385-393, 2003.

[25] A. Calì, G. Gottlob, and A. Pieris, "Advanced processing for ontological queries," in Proc. of the 36th Intl Conf. on Very Large Data Bases (VLDB), 2010, to appear.

[26] A. Calì, G. Gottlob, and A. Pieris, "Tractable query answering over conceptual schemata," in Proc. of the 28th Intl Conf. on Conceptual Modeling (ER), 2009, pp. 175-190.

[27] A. K. Chandra and M. Y. Vardi, "The implication problem for functional and inclusion dependencies is undecidable," SIAM Journal on Computing, vol. 14, no. 3, pp. 671-677, 1985.

[28] A. Calì, G. Gottlob, and M. Kifer, "Taming the infinite chase: Query answering under expressive relational constraints," in Proc. of the 11th Intl Conf. on Principles of Knowledge Representation and Reasoning (KR), 2008, pp. 70-80.

[29] A. Calì, D. Lembo, and R. Rosati, "On the decidability and complexity of query answering over inconsistent and incomplete databases," in Proc. of the 22nd Symp. on Principles of Database Systems (PODS), 2003, pp. 260-271.

[30] L. Ma, Y. Yang, Z. Qiu, G. T. Xie, Y. Pan, and S. Liu, "Towards a complete OWL ontology benchmark," in Proc. of the 3rd European Semantic Web Conf. (ESWC), 2006, pp. 125-139.

[31] S. Padmanabhan, B. Bhattacharjee, T. Malkemus, L. Cranston, and M. Huras, "Multi-dimensional clustering: A new data layout scheme in DB2," in Proc. of the 22nd Intl Conf. on Management of Data (SIGMOD), 2003, pp. 637-641.