
HAL Id: hal-01925496
https://hal.inria.fr/hal-01925496

Submitted on 16 Nov 2018

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Summarizing Semantic Graphs: A Survey

Sejla Cebiric, François Goasdoué, Haridimos Kondylakis, Dimitris Kotzinos, Ioana Manolescu, Georgia Troullinou, Mussab Zneika

To cite this version:

Sejla Cebiric, François Goasdoué, Haridimos Kondylakis, Dimitris Kotzinos, Ioana Manolescu, et al. Summarizing Semantic Graphs: A Survey. The VLDB Journal, Springer, 2019, 28 (3). ⟨hal-01925496⟩


Summarizing Semantic Graphs: A Survey

Sejla Cebiric¹, François Goasdoué², Haridimos Kondylakis³, Dimitris Kotzinos⁴, Ioana Manolescu¹, Georgia Troullinou³, Mussab Zneika⁴

¹ Inria and LIX (UMR 7161, CNRS and Ecole polytechnique), France
² U. Rennes, Inria, CNRS, IRISA, France
³ ICS-FORTH, Greece
⁴ ETIS, UMR 8051, U. Cergy-Pontoise, ENSEA, CNRS, France

November 15, 2018

Abstract

The explosion in the amount of available RDF data has led to the need to explore, query and understand such data sources. Due to the complex structure of RDF graphs and their heterogeneity, the exploration and understanding tasks are significantly harder than in relational databases, where the schema can serve as a first step toward understanding the structure. Summarization has been applied to RDF data to facilitate these tasks. Its purpose is to extract concise and meaningful information from RDF knowledge bases, representing their content as faithfully as possible. There is no single concept of RDF summary, and not a single but many approaches to build such summaries; each is better suited for some uses, and each presents specific challenges with respect to its construction. This survey is the first to provide a comprehensive overview of summarization methods for semantic RDF graphs. We propose a taxonomy of existing works in this area, including also some closely related works developed prior to the adoption of RDF in the data management community; we present the concepts at the core of each approach and outline their main technical aspects and implementation. We hope the survey will help readers understand this scientifically rich area, and identify the most pertinent summarization method for a variety of usage scenarios.

1 Introduction

Semantic languages and models are increasingly used in order to describe, represent and exchange data in multiple domains and forms. In particular, given the prominence of the World Wide Web Consortium (W3C)¹ in the international technological arena, its standard model for representing semantic graphs, namely RDF, has been widely adopted. Many RDF Knowledge Bases (KBs, in short) of millions or even billions of triples are now shared through the Web, also thanks to the development of the Open Data movement, which has evolved jointly with the data linking best practices based on RDF. A famous repository of open RDF graphs is the Linked Open Data cloud, currently referencing more than 62 billion RDF triples, organized in large and complex RDF data graphs [88]. Further, several RDF graphs are conceptually linked together into one, as soon as a node identifier appears in several graphs. This enables querying KBs together, and increases the need to understand the basic properties of each data source before figuring out how they can be exploited together.

The fundamental difficulty toward understanding an RDF graph is its lack of a standard structure (or schema), as RDF graphs can be very heterogeneous and the basic RDF standard does not give means to constrain graph structure in any way. Ontologies can (but do not have to) be used in conjunction with RDF data graphs, in order to give them more meaning, notably by describing the possible classes resources may have, their properties, as well as relationships between these classes and properties. On one hand, ontologies do provide an extra entry point into the data, as they allow one to grasp its conceptual structure. On the other hand, they are sometimes absent, and when present, they can be themselves quite complex, growing up to hundreds or thousands of concepts; SNOMED-CT², a large medical ontology, comprises millions of terms.

¹ http://www.w3.org

To cope with these layers of complexity, RDF graph summarization has aimed at extracting concise but meaningful overviews from RDF KBs, representing as closely as possible the actual contents of the KB. RDF summarization has been used in multiple application scenarios, such as: identifying the most important nodes, query answering and optimization, schema discovery from the data, source selection, and graph visualization to get a quick understanding of the data. It should be noted that indexing, query optimization and query evaluation were studied as standalone problems in the data management area, before the focus went to semantic RDF graphs; therefore, several summarization methods initially studied for data graphs were later adapted to RDF. Among the currently known RDF summarization approaches, some consider only the graph data without the ontology, others consider only the ontology, and some use a mix of the two. Summarization methods rely on a large variety of concepts and tools, comprising structural graph characteristics, statistics, pattern mining or a mix thereof. Summarization methods also differ in their usage scope. Some summarize an RDF graph into a smaller one, allowing some RDF processing (e.g., query answering) to be applied on the summary as well. The output of other summarization methods is a set of rules, a set of frequent patterns, an ontology etc.

Summarizing semantic graphs is a multifaceted problem with many dimensions, and thus many algorithms, methods and approaches have been developed to cope with it. As a result, there is now a confusion in the research community about the terminology in the area, further increased by the fact that certain terms are often used with different meanings in the relevant literature, denoting similar, but not identical research directions or concepts. We believe that this lack of a common terminology and classification hinders scientific development in this area.

² https://www.snomed.org/snomed-ct

To improve understanding of this field and to help students, researchers or practitioners seeking to identify the summarization algorithm, method or tool best suited for a specific problem, this survey attempts to provide a first systematic organization of knowledge in this area. We propose a taxonomy of RDF (and most representative, prior graph) summarization approaches. Then, we classify existing works according to the main class of algorithmic notions they are based on; further, for each work, we specify its accepted inputs and outputs, and when a tool is publicly available, we provide the reference to it. We place each of the works in the space defined by the dimensions of our classification; we summarize their main concepts and compare them when appropriate.

Since our focus is on RDF graph summarization techniques, we leave out of our scope graph summarization techniques tailored for other classes of graphs, e.g., biological data graphs [90], social networks [57] etc. We focus on techniques that have either been specifically devised for RDF, or adapted to the task of summarizing RDF graphs. The literature comprises surveys on generic (non-RDF) graph summarization, and/or partial surveys related to our area of study. The authors of [113] present generic graph summarization approaches, with a main focus on grouping-based methods. A recent survey [59] has a larger focus than ours. It considers static graphs as well as graphs changing over time; graphs which are just connection networks (node and edge labels are non-existent or ignored), but also labeled directed or undirected graphs, which can be seen as simple subsets of RDF. Also, a recent tutorial [40] covers a similar set of topics. However, given their broad scope, these works describe areas of work we are not concerned with, such as social (network) graph summarization, and ignore many of the proposals specifically tailored for RDF graphs, which are labeled, oriented, heterogeneous, and may be endowed with type information and semantics. In contrast, our survey seeks to answer a need we encountered among many Semantic Web practitioners, for a comprehensive review of summarization techniques tailored exactly to such graphs. Another recent work [80] focuses on metrics used for ontology summarization only, whereas we consider both RDF graphs and their ontologies.

Our survey is structured as follows: Section 2 recalls the foundations of the RDF data model, and RDF Schema (RDFS, in short), the simplest ontology language which can be used in conjunction with RDF to specify semantics for its data. Section 3 describes RDF summarization scope, applications and dimensions of analysis for this survey. In Section 4, we classify along these dimensions a selection of the main graph summarization works which preceded works on RDF summarization. We include a short discussion of these works here, as they were the first to introduce a set of concepts crucial for summarization, and on which RDF-specific summaries have built. In the sequel of the survey, Sections 5, 6, 7 and 8 analyze the related works in each category. Finally, Section 9 concludes this paper and identifies fields of future exploration.

2 Preliminaries: RDF graphs

We recall here the core concepts and notations related to RDF graphs. At a first glance, these can be considered particular cases of labeled, oriented graphs, and indeed classical graph summarization techniques have been directly adapted to RDF; we recall them in Section 2.1. Then, we present RDF graphs in Section 2.2, where we introduce the terminology and specific constraints which make up the RDF standard, established by the W3C; we also introduce here ontologies, which play a central role in most RDF applications, with a focus on the simple RDF Schema ontology language. From a database perspective, the most common usage of RDF graphs is through queries; therefore, we recall the Basic Graph Pattern (BGP) dialect at the core of the SPARQL RDF query language in Section 2.3. Finally, a brief discussion of more expressive ontology languages, sometimes used in conjunction with RDF graphs, is provided in Section 2.4.

2.1 Labeled directed graphs: core concepts

Labeled directed graphs are the core concept used to model RDF datasets. Further, most (not all) proposals for summarizing an RDF graph also model the summary as a directed graph. Thus, without loss of generality, we will base our discussion on this model. Note that it can be easily generalized to more complex graphs, e.g., those with (multi-)labeled nodes.

Given a set A of labels, we denote by G = (V, E) an A-edge labeled directed graph whose vertices are V, and whose edges are E ⊆ V × A × V.

Figure 1 displays two such graphs; A edge labels are attached to edges. Node labels will be used/explained shortly in our discussion.

Figure 1: Sample edge-labeled directed graphs

In addition, the notions of graph homomorphism and graph isomorphism frequently appear in graph summary proposals:

Definition 1 (Homomorphism and isomorphism) Let G = (V, E) and G′ = (V′, E′) be two A-edge labeled directed graphs. A function φ : V → V′ is a homomorphism from G to G′ iff for every edge (v1, l, v2) ∈ E there is an edge (φ(v1), l, φ(v2)) ∈ E′. If, moreover, φ is a bijection, and its inverse φ⁻¹ is also a homomorphism from G′ into G, then φ is an isomorphism.
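Definition 1 can be illustrated with a minimal Python sketch. It assumes that a graph is stored as a set of (source, label, target) triples without isolated nodes, and that φ is given as a dictionary; the function names and the two sample graphs are hypothetical and are not those of Figure 1.

```python
def is_homomorphism(phi, edges_g, edges_h):
    """True iff phi maps every edge (v1, l, v2) of G to an edge of G'."""
    return all((phi[v1], label, phi[v2]) in edges_h
               for (v1, label, v2) in edges_g)


def is_isomorphism(phi, edges_g, edges_h):
    """phi must be a bijection whose inverse is also a homomorphism."""
    # Assumption of this sketch: every node appears in at least one edge.
    nodes_h = {v for (s, _, t) in edges_h for v in (s, t)}
    images = list(phi.values())
    bijective = len(set(images)) == len(images) and set(images) == nodes_h
    if not bijective:
        return False
    inverse = {image: node for node, image in phi.items()}
    return (is_homomorphism(phi, edges_g, edges_h)
            and is_homomorphism(inverse, edges_h, edges_g))


# Hypothetical graphs: G has four nodes, G_prime has two.
G = {(1, "a", 2), (3, "a", 4), (1, "b", 3)}
G_prime = {("x", "a", "y"), ("x", "b", "x")}
phi = {1: "x", 3: "x", 2: "y", 4: "y"}

print(is_homomorphism(phi, G, G_prime))  # phi preserves all three edges
print(is_isomorphism(phi, G, G_prime))   # phi is not injective
```

In this sketch G_prime plays the role of a smaller graph onto which G is mapped, which is the situation of interest for summaries.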

A homomorphism from G to G′ ensures that the graph structure present in G has an "image" into G′. For our discussion, this is interesting in three different settings:

1. If G is a data graph and G′ is a summary graph representing G, a homomorphism from G to G′ ensures that every subgraph of G has an image in G′.

2. Conversely, a homomorphism from a summary graph G′ into the data graph G means that all the graph structures present in the summary also appear in the data graph.

3. If Q is a graph query, e.g., expressed in SPARQL, and G is a data graph, e.g., an RDF graph, the answer to Q on G, denoted Q(G), is exactly defined through the set of homomorphisms which may be established from Q to G. Together with the two items above, this leads to several interesting relationships between queries, data graphs and their summaries, in particular allowing one to use the summary to gain some knowledge about Q(G) without actually evaluating it.

Observe that while homomorphisms between a graph and its summary have useful properties, an isomorphism would defeat the purpose of summarization, as two isomorphic graphs would have the same size.

In Figure 1, there is a homomorphism from the graph on the left to the graph on the right: it maps each node of the left graph to the node of the right graph whose label contains its number.

Notations: node and edge counts. Throughout this survey, unless otherwise specified, N denotes the number of nodes and M the number of edges of a directed graph input to some summarization approach.

2.2 The Resource Description Framework (RDF)

Our study of graph summarization techniques is centrally motivated by their interest when summarizing RDF graphs. RDF is the standard data model promoted by the W3C for Semantic Web applications.

RDF graph. An RDF graph (in short a graph) is a set of triples of the form (s, p, o). A triple states that a subject s has the property p, and the value of that property is the object o. We consider only well-formed triples, as per the RDF specification [107], belonging to (U ∪ B) × U × (U ∪ B ∪ L) where U is a set of Uniform Resource Identifiers (URIs), L a set of typed or untyped literals (constants), and B a set of blank nodes (unknown URIs or literals); U, B, L are pairwise disjoint. Blank nodes are essential features of RDF, used to represent unknown URI/literal tokens. These are conceptually similar to the labeled nulls or variables used in incomplete relational databases [1], as shown in [27]. Given this definition, any RDF graph is a labeled graph in the sense of Section 2.1. However, as we explain below, RDF graphs may contain an ontology, that is, a set of graph edges to which standard ontology languages attach a special interpretation. The presence of ontologies raises specific challenges when summarizing RDF graphs, which do not occur when only plain data graphs are considered.

Assertion | Triple | Relational notation
Class | (s, rdf:type, o) | o(s)
Property | (s, p, o) | p(s, o)

Constraint | Triple | OWA interpretation
Subclass | (s, ≺sc, o) | s ⊆ o
Subproperty | (s, ≺sp, o) | s ⊆ o
Domain typing | (p, ↩d, o) | Πdomain(p) ⊆ o
Range typing | (p, ↪r, o) | Πrange(p) ⊆ o

Figure 2: RDF (top) & RDFS (bottom) statements.

Notations. We use s, p, and o as placeholders for subjects, properties and objects, respectively. Literals are shown as strings between quotes, e.g., “string”. Fig. 2 (top) shows how to use triples to describe resources, that is, to express class (unary relation) and property (binary relation) assertions. The RDF standard [107] has a set of built-in classes and properties, as part of the rdf: and rdfs: pre-defined namespaces. We use these namespaces exactly for these classes and properties, e.g., rdf:type specifies the class(es) to which a resource belongs. For brevity, we will sometimes use τ to denote rdf:type.

Example 1 (RDF graph) The following RDF graph G describes a book, identified by doi1, its author (a blank node _:b1 whose name is known), title and date of publication:

G = {(doi1, rdf:type, Book),
(doi1, writtenBy, _:b1),
(doi1, hasTitle, “Le Port des Brumes”),
(_:b1, hasName, “G. Simenon”),
(doi1, publishedIn, “1932”)}
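As an illustration, the graph of Example 1 and the well-formedness condition (U ∪ B) × U × (U ∪ B ∪ L) can be sketched in Python. The encoding is an assumption made only for this sketch: blank nodes are strings starting with "_:", literals are wrapped in a small hypothetical Literal class, and every other string stands for a URI.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """A literal (constant); wrapping it keeps literals distinct from URIs."""
    value: str


def is_blank(term):
    # Assumption of this sketch: blank nodes are strings starting with "_:".
    return isinstance(term, str) and term.startswith("_:")


def is_uri(term):
    # Assumption of this sketch: every other plain string stands for a URI.
    return isinstance(term, str) and not term.startswith("_:")


def is_well_formed(triple):
    """Check membership in (U ∪ B) × U × (U ∪ B ∪ L)."""
    s, p, o = triple
    subject_ok = is_uri(s) or is_blank(s)
    object_ok = is_uri(o) or is_blank(o) or isinstance(o, Literal)
    return subject_ok and is_uri(p) and object_ok


# The five triples of Example 1.
G = {
    ("doi1", "rdf:type", "Book"),
    ("doi1", "writtenBy", "_:b1"),
    ("doi1", "hasTitle", Literal("Le Port des Brumes")),
    ("_:b1", "hasName", Literal("G. Simenon")),
    ("doi1", "publishedIn", Literal("1932")),
}

print(all(is_well_formed(t) for t in G))
```

A triple with a literal subject or a blank-node property is rejected by is_well_formed, matching the positions in which each kind of term may occur.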

RDF Schema (RDFS). RDFS allows enhancing the assertions made in an RDF graph with the use of an ontology, i.e., by declaring semantic constraints between the classes and the properties they use. Fig. 2 (bottom) shows the four main kinds of RDFS constraints, and how to express them through triples, hence particular graph edges. For concision, we denote the properties expressing subclass, subproperty, domain and range constraints by the symbols ≺sc, ≺sp, ↩d and ↪r, respectively. Here, “domain” denotes the first, and “range” the second attribute of every property.

The RDFS constraints depicted in Fig. 2 are interpreted under the open-world assumption (OWA) [1], i.e., as deductive constraints. For instance, if the triple (hasFriend, ↩d, Person) and the triple (Anne, hasFriend, Marie) hold in the graph, then so does the triple (Anne, τ, Person). The latter is due to the domain constraint in Fig. 2.
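The deductive reading of the domain constraint can be sketched as follows. This is a minimal illustration, assuming triples are plain Python tuples and that the strings "rdfs:domain" and "rdf:type" stand for the domain property and for τ; the function name is hypothetical.

```python
DOMAIN = "rdfs:domain"   # stands for the domain constraint property
TYPE = "rdf:type"        # the property written τ in the text


def apply_domain_rule(graph):
    """From (p, domain, c) and (s, p, o), derive (s, type, c)."""
    domains = {(p, c) for (p, d, c) in graph if d == DOMAIN}
    return {(s, TYPE, c)
            for (s, p, _) in graph
            for (q, c) in domains if p == q}


graph = {
    ("hasFriend", DOMAIN, "Person"),
    ("Anne", "hasFriend", "Marie"),
}
implicit = apply_domain_rule(graph)
print(implicit)  # the single triple (Anne, rdf:type, Person)
```

Under the open-world assumption the derived triple is considered to hold in the graph, rather than its absence being treated as a constraint violation.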

Example 2 (RDF graph with an RDFS ontology) Assume that the RDF graph G in the preceding example is extended with the RDFS ontological constraints: (Book, ≺sc, Publication), (writtenBy, ≺sp, hasAuthor), (writtenBy, ↩d, Book) and (writtenBy, ↪r, Person). The resulting graph is depicted in Fig. 3. Its implicit triples are those represented by dashed-line edges.

RDF entailment. An important feature of RDF graphs is the presence of implicit triples. Crucially, these are considered part of the RDF graph even though they are not explicitly present in it, e.g., the dashed-line G edges in Fig. 3; hence they require attention for RDF graph summarization.

W3C names RDF entailment the mechanism through which, based on a set of explicit triples and some entailment rules, implicit RDF triples are derived. We denote by ⊢^i_RDF immediate entailment, i.e., the process of deriving new triples through a single application of an entailment rule. More generally, a triple (s, p, o) is entailed by a graph G, denoted G ⊢_RDF (s, p, o), if and only if there is a sequence of applications of immediate entailment rules that leads from G to (s, p, o) (where at each step, triples previously entailed are also taken into account).

Figure 3: RDF graph and its implicit triples.

Saturation. The immediate entailment rules allow defining the finite saturation (a.k.a. closure) of an RDF graph G, which is the RDF graph G∞ defined as the fixed-point obtained by repeatedly applying ⊢^i_RDF rules on G.

The saturation of an RDF graph is unique (up to blank node renaming), and does not contain implicit triples (they have all been made explicit by saturation). An obvious connection holds between the triples entailed by a graph G and its saturation: G ⊢_RDF (s, p, o) if and only if (s, p, o) ∈ G∞.

RDF entailment is part of the RDF standard itself; in particular, the answers to a query posed on G must take into account all triples in G∞ [110], since in the presence of RDF Schema constraints, the semantics of an RDF graph is its saturation [107]. As a result, the summarization of an RDF graph should reflect its saturation, e.g., by summarizing the saturation of the graph instead of the graph itself.

Example 3 (RDF entailment and saturation) The saturation of the RDF graph comprising RDFS constraints G, displayed in Fig. 3, is the graph G∞ obtained by adding to G all its implicit triples that can be derived through RDF entailment, i.e., the graph G in which the implicit/dashed edges are made explicit/solid ones.
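Saturation as a fixed point can be sketched in Python on the graph of Example 2. The survey does not list the entailment rules at this point, so the rule set below (transitivity of subclass and subproperty, propagation of types, properties, domains and ranges) is an assumption of this sketch, as are the function names and the encoding of triples as tuples of strings.

```python
TYPE, SC, SP = "rdf:type", "rdfs:subClassOf", "rdfs:subPropertyOf"
DOM, RNG = "rdfs:domain", "rdfs:range"


def immediate_entailment(graph):
    """New triples derivable by a single application of one rule."""
    derived = set()
    for (s, p, o) in graph:
        for (s2, p2, o2) in graph:
            if p == SC and p2 == SC and o == s2:        # subclass transitivity
                derived.add((s, SC, o2))
            if p == SP and p2 == SP and o == s2:        # subproperty transitivity
                derived.add((s, SP, o2))
            if p in (DOM, RNG) and p2 == SC and o == s2:  # widen domain/range
                derived.add((s, p, o2))
            if p == SP and p2 in (DOM, RNG) and o == s2:  # inherit domain/range
                derived.add((s, p2, o2))
            if p == TYPE and p2 == SC and o == s2:      # type propagation
                derived.add((s, TYPE, o2))
            if p2 == SP and p == s2:                    # subproperty propagation
                derived.add((s, o2, o))
            if p2 == DOM and p == s2:                   # domain typing
                derived.add((s, TYPE, o2))
            if p2 == RNG and p == s2:                   # range typing
                derived.add((o, TYPE, o2))
    return derived - graph


def saturate(graph):
    """Fixed point of immediate entailment; finite since no new terms appear."""
    saturated = set(graph)
    while True:
        new = immediate_entailment(saturated)
        if not new:
            return saturated
        saturated |= new


# Graph of Example 2 (literals written as plain strings in this sketch).
G = {
    ("doi1", TYPE, "Book"), ("doi1", "writtenBy", "_:b1"),
    ("doi1", "hasTitle", "Le Port des Brumes"),
    ("_:b1", "hasName", "G. Simenon"), ("doi1", "publishedIn", "1932"),
    ("Book", SC, "Publication"), ("writtenBy", SP, "hasAuthor"),
    ("writtenBy", DOM, "Book"), ("writtenBy", RNG, "Person"),
}
G_inf = saturate(G)
print(sorted(G_inf - G))  # the implicit triples made explicit
```

Saturating an already saturated graph returns it unchanged, which is the fixed-point property used in the definition above.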

We introduce below a few more notions we will need in order to describe existing RDF summarization proposals.

Instance and schema graph. An RDF instance graph is made of assertions only (recall Fig. 2), while an RDF schema graph is made of constraints only (i.e., it is an ontology). Further, an RDF graph can be partitioned into its (disjoint) instance and schema subgraphs.

Properties and attributes of an RDF graph. While this is not part of the W3C standard, some authors use attribute to denote a property (other than those built in the RDF and RDFS standards, such as τ, ↩d etc.) of an RDF resource such that the property value is a literal. In these works, the term property is reserved for those RDF properties whose value is a URI.

Example 4 (Instance, schema, properties and attributes of an RDF graph) The RDF graph G shown in Fig. 3 consists of the RDF schema graph comprising the blue triples, and of the RDF instance graph comprising the black triples. Further, within this G instance subgraph, the properties considered attributes are the following: publishedIn, hasTitle and hasName.

2.3 BGP queries

SPARQL³ is the standard W3C query language used to query RDF graphs. We consider its popular conjunctive fragment consisting of Basic Graph Pattern (BGP) queries. Subject of several recent works [27, 95, 26, 78, 9], BGP queries are also the most widely used in real-world applications [78, 53]. A BGP is a generalization of an RDF graph in which variables may also appear as subject, property and object of triples.

³ https://www.w3.org/TR/rdf-sparql-query/

Notations. In the following we use the conjunctive query notation Q(x) :- t1, ..., tα, where {t1, ..., tα} is a BGP. The head of Q is Q(x), and the body of Q is t1, ..., tα. The query head variables x are called distinguished variables, and are a subset of the variables occurring in t1, ..., tα; for boolean queries x is empty. We denote by VarBl(Q) the set of variables and blank nodes occurring in the query Q. In the sequel, we will use x, y, z, etc. to denote variables in queries.

Query evaluation. Given a query Q(x) :- t1, ..., tα and an RDF graph G, the evaluation of Q against G is:

Q(G) = {Φ(x) | Φ : VarBl(Q) → Val(G) is a Q to G homomorphism such that {Φ(t1), ..., Φ(tα)} ⊆ G}

where we denote by Φ(t) (resp. Φ(x)) the result of replacing every occurrence of a variable or blank node e ∈ VarBl(Q) in the triple t (resp. the distinguished variables x), by the value Φ(e) ∈ Val(G).

Query answering. The evaluation of Q against G uses only G’s explicit triples, thus may lead to an incomplete answer set. The (complete) answer set of Q against G is obtained by the evaluation of Q against G∞, denoted by Q(G∞).

Example 5 (Query evaluation versus answering) The query below asks for the author’s name of “Le Port des Brumes”:

Q(x3) :- (x1, hasAuthor, x2), (x2, hasName, x3), (x1, hasTitle, “Le Port des Brumes”)

Its answer against the explicit and implicit triples of our sample graph is: Q(G∞) = {〈“G. Simenon”〉}.

Note that evaluating Q only against G leads to the empty answer, which is obviously incomplete: the only hasAuthor triple of the graph is implicit, derived from the writtenBy triple through the subproperty constraint.
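The difference between evaluation and answering in Example 5 can be reproduced with a small backtracking matcher. This is a sketch under stated assumptions: variables are strings starting with "?", blank nodes in queries are not handled, literals are plain strings, and the implicit triples of G∞ are listed by hand rather than computed; all function names are hypothetical.

```python
def is_var(term):
    # Assumption of this sketch: variables are strings starting with "?".
    return isinstance(term, str) and term.startswith("?")


def evaluate(head, body, graph):
    """Set of tuples Φ(head), over all homomorphisms Φ from body into graph."""
    def match(term, value, bindings):
        if is_var(term):
            # Bind the variable if it is free, then compare with its value.
            return bindings.setdefault(term, value) == value
        return term == value

    def extend(bindings, patterns):
        if not patterns:
            yield bindings
            return
        first, rest = patterns[0], patterns[1:]
        for triple in graph:
            candidate = dict(bindings)
            if all(match(t, v, candidate) for t, v in zip(first, triple)):
                yield from extend(candidate, rest)

    return {tuple(b[v] for v in head) for b in extend({}, list(body))}


# Explicit triples of the sample graph (instance part and RDFS constraints).
G = {
    ("doi1", "rdf:type", "Book"), ("doi1", "writtenBy", "_:b1"),
    ("doi1", "hasTitle", "Le Port des Brumes"),
    ("_:b1", "hasName", "G. Simenon"), ("doi1", "publishedIn", "1932"),
    ("Book", "rdfs:subClassOf", "Publication"),
    ("writtenBy", "rdfs:subPropertyOf", "hasAuthor"),
    ("writtenBy", "rdfs:domain", "Book"), ("writtenBy", "rdfs:range", "Person"),
}
# Saturation, with some implicit triples added by hand for this sketch.
G_inf = G | {
    ("doi1", "hasAuthor", "_:b1"), ("doi1", "rdf:type", "Publication"),
    ("_:b1", "rdf:type", "Person"),
}

head = ("?x3",)
body = [("?x1", "hasAuthor", "?x2"), ("?x2", "hasName", "?x3"),
        ("?x1", "hasTitle", "Le Port des Brumes")]

print(evaluate(head, body, G))      # evaluation: empty set
print(evaluate(head, body, G_inf))  # answering: the author's name
```

The matcher enumerates all assignments of query variables to graph values that send every body triple onto a graph triple, which is the homomorphism condition of the definition of Q(G).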

2.4 OWL

Semantic graphs considered in the literature for summarization sometimes go beyond the expressiveness of RDF, which comes with the simple RDF Schema ontology language. The standard by W3C for semantic graphs is the OWL [108, 109] family of dialects that builds on Description Logics (DLs) [4].

DLs are first-order logic languages that allow describing a certain application domain by means of concepts, denoting sets of objects, and roles, denoting binary relations between concept instances. DL dialects differ in the ontological constraints they allow expressing on complex concepts and roles, i.e., those defined by DL formulas. One of the most important issues in DLs is the trade-off between expressive power and computational complexity of reasoning with the constraints (consistency checking, query answering, etc.).

The first flavour of OWL [108] consists of three dialects of increasing complexity: OWL-Lite, OWL-DL and OWL-Full. Unfortunately, very basic reasoning (concept satisfiability) in these dialects is highly intractable: ExpTime-complete in OWL-Lite, which corresponds to the SHIF(D) DL, NExpTime-complete in OWL-DL, which corresponds to the SHOIN(D) DL, and even undecidable in OWL-Full. A second flavour of OWL [109], a.k.a. OWL2, defines three new dialects: OWL2 EL based on the EL DL, OWL2 QL based on the DL-Lite_R DL, and OWL2 RL which can be expressed using logical rules. These new dialects come with PTIME complexity for most of the reasoning tasks. In particular, data management tasks (consistency checking, query answering, etc.) under OWL2 QL/DL-Lite_R ontologies have the same complexity as their counterparts in the relational database model [10].

3 RDF summarization: scope, applications and dimensions of analysis for this survey

As we shall see, RDF summarization has been given many different meanings in the literature, and research is still ongoing. Therefore, we start with delimiting the scope of RDF summarization as considered in this survey (Section 3.1), before describing the RDF summary applications most frequently encountered within this scope (Section 3.2), and finally presenting several dimensions along which the corresponding RDF summarization techniques can be classified (Section 3.3).

3.1 Scope

Our goal in this survey is to study summarization notions and tools which are useful to concrete RDF data management applications. We will thus discuss a broad set of techniques, some of which are also used outside our target RDF data management contexts. However, to keep the survey focused, self-contained, and useful to RDF practitioners, we do not cover graph summarization or clustering techniques designed for very specific classes of graphs. For instance, while social network graphs can be modeled in RDF, such graphs have a very specific semantics, for instance, to reflect the important role of “user” nodes. Instead, we aim to cover summarization of general RDF graphs (without making assumptions on their application domain), without ontologies (in which case they basically coincide with labeled oriented graphs) or with ontologies (that are a specific, crucial feature of RDF data graphs).

Our review of the literature leads us to the following generic definition. An RDF summary is one or both among the following:

1. A compact piece of information, extracted from the original RDF graph; intuitively, summarization is a way to extract meaning from data, while reducing its size;

2. A graph, which some applications can exploit instead of the original RDF graph, to perform some tasks more efficiently; in this vision, a summary represents (or stands for) the graph in specific settings.

Clearly, these notions intersect, e.g., many graph summaries extracted from the RDF graphs are compact and can be used for instance to make some query optimization decisions; these fit into both categories. However, some RDF summaries are not graphs; some (graph or non-graph) summaries are not always very compact, yet they can be very useful etc.


3.2 Applications

We illustrate the above generic definition of an RDF summary through a (non-exhaustive) list of uses and applications.

Indexing. Most (RDF) summarization methods from the literature build summaries which are smaller graphs; each summary node represents several nodes of the original graph G. This smaller graph, then, serves as an index as follows. The identifiers of all the G nodes represented by each summary node v are associated with the node v. To process a query on G, we first identify the summary nodes which may match the query; then, based on the index, we identify the graph nodes corresponding to these summary nodes, as a first step toward answering the query.
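The two-step lookup described above can be sketched as follows. The grouping criterion used here (nodes with the same set of outgoing properties share a summary node) is only an assumption chosen to keep the sketch short; the graph and the function names are hypothetical.

```python
from collections import defaultdict


def build_index(graph):
    """Group subjects by their set of outgoing properties.

    Each distinct property set acts as one summary node; its value is the
    set of graph nodes it represents.
    """
    outgoing = defaultdict(set)
    for s, p, _ in graph:
        outgoing[s].add(p)
    index = defaultdict(set)
    for node, props in outgoing.items():
        index[frozenset(props)].add(node)
    return index


def candidates(index, required_props):
    """Graph nodes represented by summary nodes offering all required properties."""
    required = set(required_props)
    matching = (extent for props, extent in index.items() if required <= props)
    return set().union(*matching)


# Hypothetical data graph.
graph = {
    ("b1", "title", "T1"), ("b1", "author", "a1"),
    ("b2", "title", "T2"), ("a1", "name", "N"),
}
index = build_index(graph)
print(candidates(index, ["title", "author"]))
```

Only the nodes returned by candidates need to be inspected in G, which is the pruning effect expected from an index.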

Estimating the size of query results. Consider a summary defined as a set of statistics about property (edge label) co-occurrence in G, that is: the summary stores, for any two properties a, b appearing in G, the number of nodes which have at least one outgoing a edge and at least one outgoing b edge. If a query searches, e.g., for resources having both a “description” and an “endorsement” in an RDF graph storing product information, and the summary indicates that there are no such resources, we can return an empty query answer without consulting G. Further, assume that a BGP query requires a resource with properties p1, p2, somehow connected to another resource with properties p3, p4. If the summary shows that the former property combination is much rarer than the latter, a query optimizer can exploit this to start evaluating the query from the most selective conditions p1, p2.
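Such a co-occurrence summary can be sketched in a few lines of Python. The product graph and the function names are hypothetical; the counts follow the definition given in the text (number of nodes with at least one outgoing edge of each of the two properties).

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence(graph):
    """For each pair of properties, count nodes having both as outgoing edges."""
    outgoing = defaultdict(set)
    for s, p, _ in graph:
        outgoing[s].add(p)
    counts = Counter()
    for props in outgoing.values():
        for a, b in combinations(sorted(props), 2):
            counts[(a, b)] += 1
    return counts


def may_have_answers(counts, prop_a, prop_b):
    """False means no node has both properties: the answer is certainly empty."""
    return counts[tuple(sorted((prop_a, prop_b)))] > 0


# Hypothetical product graph.
graph = {
    ("p1", "description", "d1"), ("p1", "price", "10"),
    ("p2", "endorsement", "e1"), ("p2", "price", "12"),
    ("p3", "description", "d3"), ("p3", "price", "7"),
}
counts = cooccurrence(graph)
print(may_have_answers(counts, "description", "endorsement"))
print(counts[("description", "price")])
```

The same counts can serve as selectivity estimates: a pair with a low count is a good starting point for query evaluation.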

Making BGPs more specific. BGP queries may comprise path expressions with wildcards; these are hard to evaluate, as they require traversing a potentially large part of G. A graph summary may help understand, e.g., that a path specified as “any number of a edges followed by one or more b edges” corresponds to exactly two data paths in G, namely: a b edge; and an a edge followed by a b edge. These two short and simple path queries are typically evaluated very efficiently.
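The path example above can be mimicked on a toy summary graph. The sketch assumes single-character edge labels, a bound on path length, and a hypothetical summary; whether the paths found in a summary are exactly those of G depends on the guarantees of the summarization method, which this sketch does not model.

```python
import re


def label_words(edges, max_len):
    """Label sequences of all paths with 1..max_len edges in the graph."""
    frontier = {((label,), target) for (_, label, target) in edges}
    words = {word for (word, _) in frontier}
    for _ in range(max_len - 1):
        # Extend every path of the current length by one more edge.
        frontier = {(word + (label,), target)
                    for (word, node) in frontier
                    for (source, label, target) in edges if source == node}
        words |= {word for (word, _) in frontier}
    return words


# Hypothetical summary graph with single-character edge labels.
summary = {("r", "a", "n1"), ("n1", "b", "n2"), ("r", "b", "n3")}

# "Any number of a edges followed by one or more b edges".
pattern = re.compile(r"a*b+")
matching = {w for w in label_words(summary, max_len=3)
            if pattern.fullmatch("".join(w))}
print(sorted(matching))
```

On this toy summary the wildcard expression reduces to two concrete label sequences, each of which can be evaluated as a simple path query.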

Source selection. One can detect based on a summary whether a graph is likely to have a certain kind of information that the user is looking for, without actually consulting the graph. In a distributed query processing setting, this can be used to know which data partition(s) are helpful for a query; in a LOD cloud querying context, when answering queries over a large set of initially unknown data sources, this problem is typically referred to as source selection.

Graph visualization. A graph-shaped summary may be used to support the users’ discovery and exploration of an RDF graph, helping them get acquainted with the data and/or as a support for visual querying.

Vocabulary usage analysis. RDF is often used as a means to standardize the description of data from a certain application domain, e.g., life sciences, Web content metadata etc. A standardization committee typically works to design a vocabulary (or set of standard data properties) and/or an ontology; application designers learn the vocabulary and ontology and describe their data based on them. A few years down the road, the standard designers are interested to know which properties and ontology features were used, and which were not; this can inform decisions about future versions of the standard⁴.

Schema (or ontology) discovery. When an ontology is not present in an RDF graph, some works aim at extracting it from the graph. In this case, the summary is meant to be used as a schema, which is considered to have been missing from the initial data graph.

⁴ Thanks to William van Voensel from schema.org for sharing this application with us.


3.3 Classification of RDF summarization methods

From a scientific viewpoint, existing summarization proposals are most meaningfully classified according to the main algorithmic idea behind the summarization method:

1. Structural methods are those which consider first and foremost the graph structure, that is, the paths and subgraphs one encounters in the RDF graph. Given the prominence of applications and graph uses where structural conditions are paramount, graph structure is widely used in summarization techniques.

• Quotient: A particularly natural concept when building summaries is that of quotient graphs (Definition 2). They allow characterizing some graph nodes as "equivalent" in a certain way, and then summarizing a graph by assigning a representative to each equivalence class of the nodes in the original graph. A particular feature of structural quotient methods is that each graph node is represented by exactly one summary node, given that one node can only belong to one equivalence class.

• Non-quotient: Other methods for structurally summarizing RDF graphs are based on other measures, such as centrality, to identify the most important nodes, and interconnect them in the summary. Such methods aim at building an overview of the graph, even if (unlike quotient summaries) some graph nodes may not be represented at all.

2. Pattern mining methods: These methods employ mining techniques for discovering patterns in the data; the summary is then built out of the patterns identified by mining.

3. Statistical methods: These methods summarize the contents of a graph quantitatively. The focus is on counting occurrences, such as counting class instances or building value histograms per class, property and value type; other quantitative measures are frequency of usage of certain properties, vocabularies, average length of string literals etc. Statistical approaches may also explore (typically small) graph patterns, but always from a quantitative, frequency-based perspective.

4. Hybrid methods: To this category belong works that combine structural, statistical and pattern-mining techniques.
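To make the statistical category concrete, the following minimal sketch counts class instances and property usage over a small set of triples. It is an illustration only, not the method of any surveyed work; the function name `statistical_summary`, the `rdf:type` string constant and the sample triples are assumptions introduced here.

```python
from collections import Counter

RDF_TYPE = "rdf:type"  # assumed shorthand for the RDF typing property


def statistical_summary(triples):
    """Count class instances and property usage in (subject, property, object) triples."""
    class_counts = Counter()
    property_counts = Counter()
    for s, p, o in triples:
        if p == RDF_TYPE:
            class_counts[o] += 1      # one more instance of class o
        else:
            property_counts[p] += 1   # one more use of property p
    return class_counts, property_counts


# Hypothetical sample data
triples = [
    ("ex:a", RDF_TYPE, "ex:Person"),
    ("ex:b", RDF_TYPE, "ex:Person"),
    ("ex:a", "ex:name", "Alice"),
    ("ex:b", "ex:name", "Bob"),
    ("ex:a", "ex:knows", "ex:b"),
]
class_counts, property_counts = statistical_summary(triples)
```

Such counters are the simplest form of quantitative summary; histograms per class or value type would be built along the same lines.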

Another interesting dimension of analysis is the input required by each summarization method, in terms of the actual dataset and of other user inputs which some methods need:

1. Input parameters: Many works in the area require user parameters to be defined, e.g. user-specified equivalence relations, maximum summary size, weights assigned to some graph elements etc., whereas others are completely user independent. While parameterized methods are able to produce better results in specific scenarios, they require some understanding of the methodology, which limits their use to experts.

2. Input Dataset: Different works have different requirements on the dataset they get as input. RDF data graphs are most frequently accepted; usually RDF/S and/or OWL are used for specifying graph semantics, whereas only very few works consider DL models. In addition, some works consider or require only ontologies (semantic schema), whereas other works exploit only instances. Hybrid approaches exploit both instance and schema information. For instance, the instance and schema information can be used to compute the summary of the saturated graph, even if the instance graph is not saturated.

As concerns the summarization output, we identify the following dimensions:

1. Type: This dimension differentiates techniques according to the nature of the final result (summary) that is produced. The summary is sometimes a graph, while in other cases it may be just a selection of frequent structures such as nodes, paths, rules or queries.

2. Nature: Along this dimension, we distinguish summaries which only output instance representatives, from those that output some form of a posteriori schema, and from those that output both.

Availability: Last but not least, from a practical perspective it is interesting to know the availability of a given summarization service. This will allow a direct comparison with future similar tools:

1. System/Tool: Several summarization approaches are made available by their authors as a tool or system shared with the public; in our survey, we signal when this is the case. In addition, some of the summarization tools can be readily tested from an online deployment provided by the authors.

2. Open source: The implementation of some summarization methods is provided in open source by the authors, facilitating comparison and reuse.

We also consider the quality characteristics of each individual algorithm. Quality has to do with: completeness in terms of coverage, precision and recall of the results if an “ideal” summary is available as a gold standard to compare with, the connectivity of the computed summary and, finally, computational complexity. Given the variety of RDF summarization approaches, it is not easy to define and evaluate a single meaningful notion of quality. A more comprehensive effort to establish a generic framework for computing quality metrics on summaries is proposed in [122], where the authors discuss summarization quality at both the schema and instance levels. However, difficulties remain, e.g. identifying the complexity of a summarization algorithm is not always possible when the available description of the algorithm does not provide sufficient information.

The main categorization we retain for the different RDF summarization approaches is based on their measurable/identifiable characteristics and not on their intended use. This is because the boundaries among the different usages are not very clear and there are types of methods that can be used in diverse cases/applications. Thus, the advantages and/or disadvantages for each category of methods cannot be identified in a generic way; however, we inserted a discussion wherever pertinent.

Fig. 4 depicts a high-level taxonomy of the RDF summarization works, based on the aforementioned dimensions. Note that many of the dimensions are orthogonal, thus a work may be classified in multiple categories. In the sequel, we classify the works and describe their main ideas and the implemented algorithms. Then we identify the specific dimensions of analysis captured in Fig. 4 for each of these works.

4 Generic graph (non-RDF) summarization approaches

In this Section we review generic graph summarization approaches. While these have not been specifically devised for RDF, they have either been applied to RDF subsequently, or served as inspiration for similar RDF-specific proposals. An overview of the generic graph summarization works is provided in Table 1 and Table 2. More precisely, Section 4.1 presents structural graph summarization methods, Section 4.2 describes works based on mining and statistics, while Section 4.3 considers summaries based on statistical and hybrid (structural and statistical) methods.

4.1 Structural graph summarization

The structural complexity and heterogeneity of graphs make query processing a challenging problem, due to the fact that nodes which may be part of the result of a given query can be found anywhere in the input graph. To tame this complexity and direct query evaluation to the right subsets of the data, graph summaries have been proposed as a basis for indexing, by storing next to each summary node the IDs of the original graph nodes summarized by this node; this set is typically called the extent. Given a query, evaluating the query on the summary and then using the summary node extents allows obtaining the final query results.

Figure 4: A taxonomy of the works in the area.

4.1.1 Quotient graph summaries

Many proposals for indexing graph data are based on establishing some notion of equivalence among graph nodes, and storing the IDs of all equivalent nodes as the extent of their summary node. Formally, these correspond to quotient graphs, whose definition we recall below:

Definition 2 (Quotient graph) Let $G = (V, E)$ be an $A$-edge-labeled directed graph and $\equiv\ \subseteq V \times V$ be an equivalence relation over the nodes of $V$. The quotient graph of $G$ using $\equiv$, denoted $G/_{\equiv}$, is an $A$-edge-labeled directed graph having:

• a node $v_S$ for each set $S$ of $\equiv$-equivalent $V$ nodes;

• an edge $(v_{S_1}, l, v_{S_2})$ iff there exists an $E$ edge $(v_1, l, v_2)$ such that $v_{S_1}$ (resp. $v_{S_2}$) represents the set of $V$ nodes $\equiv$-equivalent to $v_1$ (resp. $v_2$).
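As an illustration of Definition 2, the sketch below builds a quotient graph from a list of labeled edges and a given assignment of nodes to equivalence classes. The function name `quotient_graph`, the dictionary-based encoding of the equivalence relation and the sample graph are assumptions made for this example; how the equivalence is obtained is left open.

```python
def quotient_graph(edges, class_of):
    """Build the quotient of an edge-labeled directed graph.

    edges:    iterable of (source, label, target) triples
    class_of: dict mapping every node to the id of its equivalence class
    Returns (extents, summary_edges), where extents maps each class id
    to the set of graph nodes it represents.
    """
    extents = {}
    for node, cls in class_of.items():
        extents.setdefault(cls, set()).add(node)
    # A summary edge exists iff some graph edge connects members of the two classes.
    summary_edges = {(class_of[u], label, class_of[v]) for u, label, v in edges}
    return extents, summary_edges


# Hypothetical example: nodes 2 and 3 are equivalent, as are 4 and 5.
edges = [(1, "a", 2), (1, "a", 3), (2, "b", 4), (3, "b", 5)]
class_of = {1: "A", 2: "B", 3: "B", 4: "C", 5: "C"}
extents, summary_edges = quotient_graph(edges, class_of)
```

Because `class_of` is a function, each graph node lands in exactly one extent, which is the defining feature of quotient summaries.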

Prominent node equivalence relations used for graph summarization build on backward, forward, or backward-and-forward bisimulation [32]:

Definition 3 (Backward bisimulation) In an edge-labeled directed graph, a relation $\approx_b$ between the graph nodes is a backward bisimulation if and only if for any $u, v, u', v' \in V$:

1. If $v \approx_b v'$ and $v$ has no incoming edge, then $v'$ has no incoming edge;

2. If $v \approx_b v'$ and $v'$ has no incoming edge, then $v$ has no incoming edge;

3. If $v \approx_b v'$, then for any edge $u \xrightarrow{a} v$ there exists an edge $u' \xrightarrow{a} v'$ such that $u \approx_b u'$;

4. If $v \approx_b v'$, then for any edge $u' \xrightarrow{a} v'$ there exists an edge $u \xrightarrow{a} v$ such that $u \approx_b u'$.

Forward bisimulation, noted $\approx_f$, is defined similarly to backward bisimulation, but considers the outgoing edges of $v$ and $v'$ instead of the incoming ones. Forward and backward bisimulation, noted $\approx_{fb}$, is both a backward and a forward bisimulation.
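The equivalence classes of the coarsest backward (or forward) bisimulation can be computed by iteratively refining a partition of the nodes. The sketch below uses a naive signature-based refinement, which is simpler but slower than the algorithms cited in this survey; the function name `bisimulation_classes` and the sample graph are assumptions introduced for illustration.

```python
def bisimulation_classes(nodes, edges, direction="backward"):
    """Classes of the coarsest backward or forward bisimulation (naive refinement)."""
    # For each node, the (label, neighbor) pairs that matter for this direction.
    neighbors = {n: set() for n in nodes}
    for u, a, v in edges:
        if direction == "backward":
            neighbors[v].add((a, u))   # incoming edges of v
        else:
            neighbors[u].add((a, v))   # outgoing edges of u
    block = {n: 0 for n in nodes}      # start with a single block
    while True:
        ids, refined = {}, {}
        for n in nodes:
            # Signature: own block plus the labeled blocks of the relevant neighbors.
            signature = (block[n],
                         frozenset((a, block[m]) for a, m in neighbors[n]))
            refined[n] = ids.setdefault(signature, len(ids))
        if len(ids) == len(set(block.values())):
            break                      # no block was split: fixpoint reached
        block = refined
    classes = {}
    for n, b in block.items():
        classes.setdefault(b, set()).add(n)
    return {frozenset(c) for c in classes.values()}


# Hypothetical sample graph
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, "a", 2), (1, "a", 3), (2, "b", 4), (3, "b", 5), (3, "c", 6)]
backward = bisimulation_classes(nodes, edges, "backward")
forward = bisimulation_classes(nodes, edges, "forward")
```

On this sample graph, nodes 2 and 3 are backward bisimilar (both are reached only by an `a` edge from node 1) but not forward bisimilar, since only node 3 has an outgoing `c` edge.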

We already pointed out that, in Figure 1, the graph on the right is homomorphic to that on the left. The former is actually the quotient graph of the latter using $\approx_{fb}$. In particular, the classes of $\approx_b$-equivalent nodes are {1}, {2, 3, 4, 5}, {6, 8, 9}, {7}, those of $\approx_f$-equivalent nodes are {1}, {2}, {3, 4}, {5, 6, 7, 8, 9}, and those of $\approx_{fb}$-equivalent nodes are {1}, {2}, {3, 4}, {5}, {6}, {7}, {8, 9}.

Finally, we remark that it easily follows from the bisimulation definitions that if $v \approx v'$, for instance for forward bisimulation, then any label path that can be followed from $v$ in the graph $G$ can also be followed from $v'$ in $G$, and the other way around. In other words, the same paths start (respectively, end) in two $\approx$-equivalent nodes. This condition is hard to meet in graphs that exhibit some structural heterogeneity: in such cases, every node is $\approx$-equivalent to very few (if any) other nodes.

The Template Index (or T-index) [64] summary considers that two graph nodes are equivalent if they are backward bisimilar. In particular, in a T-index, nodes represented together need to be reachable by the exact same set of incoming paths. The goal of the T-index is to speed up the evaluation of complex queries of a certain form (or template), such as $P.v$ (all nodes $v$ reachable by a path matching the regular expression $P$) or $v.P.u$ (all $v, u$ node pairs connected by a path matching $P$); the proposal generalizes to arbitrary arity queries, although the authors note it is likely to be most useful in the above two simplest forms. The simulation relation between the nodes of a graph $G$ is known to be computable in $O(M * \log(M))$ [71] or $O(N * M)$ [32]; the cost drops to linear for acyclic graphs. All these algorithms assume the graph fits in memory.

To support efficient processing of graph queries that navigate both forward (in the direction of the graph edges) and backward, [37] describes the Forward and Backward Index (F&B), which considers two nodes equivalent if they are indistinguishable by any navigation path composed only of forward and backward steps (see Figure 5 for an example). While this equivalence condition is very powerful, it is rarely satisfied by two nodes, thus the F&B index is likely to have a large number of nodes (close to the number of nodes in the original graph), making the manipulation of the F&B index structure inefficient. To address the problem, the authors note that in practice not every path query is frequently asked by applications, therefore it suffices to consider F&B equivalence of nodes as being indistinguishable by forward and backward navigation along the paths from a certain set only.

Another method proposed in [38, 83], in order to make the F&B index smaller and more manageable, is to consider bisimilarity restricted only to paths of a certain length around the graph nodes (see Figure 5 for an example). This increases the chances that two nodes are considered equivalent, thus reducing the size of the bisimilarity-based summary. Limited bisimulation summaries like this can be computed by evaluating structural group-by queries (for k-bounded bisimilarity) or, respectively, queries derived from the workload of interest (for workload-driven F&B summarization). While the theoretical complexity of such queries is $O(M^k)$, with $M$ the number of edges and $k$ the size of the most complex query involved, efficient graph query processors, with the help of good indexing, achieve much better performance in practice.
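The effect of bounding bisimilarity can be sketched by stopping a signature-based refinement after k rounds, where each round looks one more step along incoming and outgoing edges. This is a simplified illustration, not the group-by formulation used in the cited works; the function name `k_bisimulation` and the chain-shaped sample graph are assumptions.

```python
def k_bisimulation(nodes, edges, k):
    """Block id of each node after k rounds of forward-and-backward refinement."""
    incoming = {n: set() for n in nodes}
    outgoing = {n: set() for n in nodes}
    for u, a, v in edges:
        outgoing[u].add((a, v))
        incoming[v].add((a, u))
    block = {n: 0 for n in nodes}      # k = 0: all nodes are equivalent
    for _ in range(k):
        ids, refined = {}, {}
        for n in nodes:
            signature = (
                block[n],
                frozenset((a, block[m]) for a, m in incoming[n]),
                frozenset((a, block[m]) for a, m in outgoing[n]),
            )
            refined[n] = ids.setdefault(signature, len(ids))
        block = refined
    return block


# Hypothetical chain 1 -a-> 2 -a-> 3 -a-> 4 -a-> 5
nodes = [1, 2, 3, 4, 5]
edges = [(1, "a", 2), (2, "a", 3), (3, "a", 4), (4, "a", 5)]
sizes = [len(set(k_bisimulation(nodes, edges, k).values())) for k in range(4)]
```

On the chain, the summary has 1, 3 and then 5 nodes for k = 0, 1 and 2: a smaller k yields a more compact but less discriminating summary.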

[18] considers the summarization of large Web document collections. While the main structure in this case consists of trees, they may also feature reference edges which turn the dataset into a global graph. The authors build a summary as a collection of regular expression queries, such that the sets of results to these queries, together, make up a partition over the set of nodes in all the documents of the input collection. To each such regular expression is associated a set of cardinality statistics, to help application designers choose meaningful queries and inform them about the expected performance which may be reached on those queries. The dominant-cost operation required by this approach is computing simulations among $N$ nodes; summaries can then be refined based on user-specified path queries.

[62] provides an I/O-efficient external-memory algorithm for constructing the k-bisimulation summary of a disk-resident graph on a single machine, based on several passes of sorting and gradually refining partitions of nodes on disk. The I/O complexity of the algorithm is $O(k * \mathrm{sort}(M_p) + k * \mathrm{scan}(N_p) + \mathrm{sort}(N_p))$, where $M_p$, respectively $N_p$, are the numbers of disk pages required to store the graph edges, respectively the graph nodes, while $\mathrm{sort}(\cdot)$ and $\mathrm{scan}(\cdot)$ quantify the cost of an external sort, respectively the cost to scan a certain number of pages.

4.1.2 Non-quotient graph summaries

There are also many methods that construct non-quotient graph summaries. They distinguish nodes according to several criteria/measures and create summaries in which summary nodes represent multiple nodes out of the original graph.

A Dataguide [28] (see Figure 5 for an example) is a summary of a directed acyclic graph, having one node for each data path in the original graph. A graph node reachable by a set of paths belongs to the extents of all the respective Dataguide nodes. Thus, given a path query which may also contain wildcards, and a Dataguide, it is easy to identify the Dataguide nodes corresponding to the query, and from there their extents. The authors show how to integrate the Dataguide path index with more conventional indexes, e.g., a value index which gives access to all the nodes containing a certain constant value etc. A Dataguide is not a quotient: an input node may be represented by two Dataguide nodes, if it is reachable by two distinct paths in the input graph. Building a Dataguide out of a data graph amounts to determinizing a nondeterministic finite automaton; the worst-case complexity of this is known to be exponential in the size of the input graph, yet only linear when the database is tree-structured.
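The determinization view of Dataguide construction can be sketched with the classical subset construction: each Dataguide node is identified here with the set of graph nodes reached by some label path from the root, which is a simplification made for this example. The function name `build_dataguide` and the sample graph are assumptions.

```python
from collections import deque


def build_dataguide(root, edges):
    """Subset construction over a rooted edge-labeled graph.

    Returns (states, guide_edges): each state is a frozenset of graph nodes
    (the extent of one Dataguide node), and guide_edges maps
    (state, label) to the target state.
    """
    out = {}
    for u, a, v in edges:
        out.setdefault(u, {}).setdefault(a, set()).add(v)
    start = frozenset([root])
    states, guide_edges = {start}, {}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        # Collect, per label, all nodes reachable in one step from this state.
        by_label = {}
        for n in state:
            for a, targets in out.get(n, {}).items():
                by_label.setdefault(a, set()).update(targets)
        for a, targets in by_label.items():
            target_state = frozenset(targets)
            guide_edges[(state, a)] = target_state
            if target_state not in states:
                states.add(target_state)
                queue.append(target_state)
    return states, guide_edges


# Hypothetical graph: node 4 is reachable both by path a.c and by path b.c.
edges = [(1, "a", 2), (1, "b", 3), (2, "c", 4), (3, "c", 4), (3, "c", 5)]
states, guide_edges = build_dataguide(1, edges)
```

In this example node 4 belongs to two extents, {4} for path a.c and {4, 5} for path b.c, which illustrates why a Dataguide is not a quotient.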

Summary-based query answering with an acceptable level of error is considered in [41, 69]. The focus is on graph compression while preserving bounded-error query answering and/or the ability to fully reconstruct the graph from the summary with the help of so-called “corrections” (i.e. edges to add to or remove from the “expansion” of the summary into the regular part of the graph it derives from). The nodes of the resulting structural summary represent partitions of similar nodes from $G$, while a summary edge exists between two summary nodes $u$ and $v$ only if the nodes from $G$ represented by $u$ and $v$ are densely connected. Similarly, [55] aims to compress $G$, but without considering corrections, while their edge labels represent the number of edges within each partition set, and the number of edges between every two such sets. To determine similar nodes, [41] relies on locality-sensitive hashing, while [69, 55] use a clustering method where pairs of nodes to merge are chosen based on the optimal value of an objective function. In [84], the authors build on the concepts of [55] to show how to obtain, in polynomial time, summaries which are close to the optimal one (in terms of corrections needed). However, these works focus on node connectivity, and ignore the node and edge labels which crucially encode the data content of an RDF graph.
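The role of corrections can be illustrated by the reconstruction step alone: the summary is expanded into all-to-all edges between the members of connected summary nodes, and the corrections are then applied. The sketch below assumes an undirected, unlabeled graph; the function name `expand` and the sample summary are assumptions, and the choice of which nodes to merge (the hard part of the cited methods) is not shown.

```python
from itertools import product


def expand(supernodes, superedges, additions, removals):
    """Rebuild an undirected edge set from a summary plus corrections.

    supernodes: dict summary node -> set of graph nodes it represents
    superedges: iterable of (summary node, summary node) pairs
    additions:  graph edges missing from the expansion
    removals:   graph edges wrongly implied by the expansion
    """
    edges = set()
    for a, b in superedges:
        for u, v in product(supernodes[a], supernodes[b]):
            if u != v:
                edges.add(frozenset((u, v)))   # all-to-all expansion
    edges |= {frozenset(e) for e in additions}
    edges -= {frozenset(e) for e in removals}
    return edges


# Hypothetical summary of a 4-edge graph: one summary edge and two corrections.
supernodes = {"X": {1, 2}, "Y": {3, 4}}
graph = expand(supernodes, [("X", "Y")], additions=[(1, 2)], removals=[(2, 4)])
```

Here four graph edges are encoded by one summary edge and two corrections; compression pays off when groups of nodes are densely connected.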

[97] proposes the SNAP (Summarization on Grouping Nodes on Attributes and Pairwise Relationships) technique, whose purpose is to construct, with some user input, a summary graph that can be used for visualization. A SNAP summary represents all the input graph nodes and edges: its nodes form a partition of the input graph nodes, and there is an edge of type $t$ between two summary nodes $A$ and $B$ if and only if some input graph node represented by $A$ is connected through an edge of type $t$ to an input graph node represented by $B$. Further, a SNAP summary has a minimal number of nodes such that (i) all input graph nodes represented by a summary node have the same values for some user-selected attributes and (ii) every input graph node represented by some summary node is connected to some input graph nodes through edges of some user-selected types.

[22] considers reachability and graph pattern queries on labeled graphs, and builds answer-preserving summaries for such queries, that is: for a given graph $G$, summary $S(G)$ and query $Q$, there exists a query $Q'$ which can be computed from $Q$, and a post-processing procedure $P$ such that $P(Q'(S(G))) = Q(G)$. In other words, evaluating $Q'$ on the summary and then applying the post-processing $P$ leads to the result $Q(G)$. This property is rather strong; however, it is attained not under the usual query semantics based on graph homomorphism, underlying SPARQL, but under a bounded graph simulation one. Under these semantics, answering a query is in P (instead of NP), at the price of not preserving the query structure (i.e., joins). The authors propose two parallel graph compression strategies, targeting different kinds of queries: (i) reachability queries, where we seek to know if one node is reachable from another; (ii) graph pattern queries, for which they attain a significant compression ratio. It should be stressed again that the authors consider these queries under non-standard, more lenient semantics than the ones used for RDF querying. The complexity of building their summaries is $O(N * M)$ for reachability queries, and $O(N * \log(M))$ for graph pattern queries.

Figure 5: Structural graph summarization examples.

4.2 Mining-based graph summarization

OLAP and data mining techniques applied to data graphs have considered them through global, aggregated views, looking for statistics and/or trends in the graph structure and content.

[112] leverages pattern mining techniques to build graph indices in order to help process graph queries. It first applies a frequent pattern mining algorithm to identify all the frequent patterns with the size support constraint. Once the frequent patterns are extracted, they are organized in a prefix tree structure, where each pattern is associated with a list of IDs of the graphs containing it. This prefix tree is then used to answer queries asking for the respective part of the graph. This approach by design does not reflect all data and is based on numeric information. [118] uses the same idea, but considers trees instead of graphs.

The Vocabulary-based summarization of Graphs (VoG) [51] aims at summarizing a graph by its characteristic subgraphs of some fixed types, which have been observed to encode meaningful information in real graphs: cliques, bi-partite cores, stars, chains, and approximations thereof. VoG first decomposes the input graph, using any clustering method, into possibly overlapping subgraphs; the (approximate) type of each subgraph is then identified using the Minimum Description Length principle [85]. Finally, the input graph summary is composed of some non-redundant subgraphs, picked by some heuristic like top k, hence it may not reflect all of the input graph. Importantly, the VoG method has been shown to scale well; it is near-linear in the number of edges. The VoG code is available for download (footnote 5).

An aggregation framework for OLAP operations on labeled graphs is introduced in [16]. The authors assume as available an OLAP-style set of dimensions with their hierarchies and measures; in particular, graph topological information is used as aggregation dimensions. Based on these, they define a “graph cube” and investigate efficient methods for computing it. With a different perspective, [15] focuses on building, out of node- and edge-labeled graphs, a set of randomized summaries, so that one can apply data mining techniques on the summary set instead of the original graph. Using the summary set leads to better performance, while guaranteeing upper bounds on the information loss incurred.

Footnote 5: https://github.com/yikeliu/VoG-Overlap

A graph summarization approach based solely on the graph structure is reported in [58]. It produces a summary graph that describes the underlying topology characteristics of the original one. Every summary node, or super-node, comprises a set of nodes from the original graph; every summary edge, or super-edge, represents all-to-all connections between the nodes in the corresponding super-nodes. The goal of this work is to generate a summary that minimizes the false positives and negatives introduced by this summarization. The authors investigate different distributed graph summarization methods, which proceed in an incremental fashion, gradually merging nodes into super-nodes; the methods differ in the way they choose the pairs of nodes to be merged, and strike different trade-offs between efficiency (running time) and effectiveness (keeping the false positives and negatives under control). The method termed Dist-LSH selects node pairs with a high probability to be merged; the probability is estimated based on locality-sensitive hashing (LSH) of the nodes. The algorithms are implemented on top of the Apache Giraph framework.
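The false positives and negatives mentioned above can be measured by comparing the original edges with the all-to-all connections implied by the super-edges. The following sketch does so for a directed, unlabeled graph; it only evaluates a given summary and does not implement the merging strategies of [58]. The function name `summary_errors` and the sample data are assumptions.

```python
def summary_errors(edges, supernode_of, superedges):
    """Return (false_positives, false_negatives) of a super-node summary.

    edges:        set of directed (u, v) graph edges
    supernode_of: dict graph node -> super-node id
    superedges:   set of (super-node, super-node) pairs
    """
    members = {}
    for node, s in supernode_of.items():
        members.setdefault(s, set()).add(node)
    # Edges implied by interpreting every super-edge as all-to-all connections.
    implied = {
        (u, v)
        for a, b in superedges
        for u in members[a]
        for v in members[b]
        if u != v
    }
    original = set(edges)
    return implied - original, original - implied


# Hypothetical example: merging {1, 2} and {3, 4} implies one spurious edge.
edges = {(1, 3), (1, 4), (2, 3)}
supernode_of = {1: "A", 2: "A", 3: "B", 4: "B"}
false_pos, false_neg = summary_errors(edges, supernode_of, {("A", "B")})
```

Dropping the super-edge instead would trade the single false positive for three false negatives, which is the kind of trade-off a merging strategy has to weigh.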

[57] surveys many other quantitative, mining-oriented graph sampling and summarization methods.

4.3 Statistical and hybrid graph summarization

Several follow-ups on the SNAP summarization approach (Section 4.1.2) have been proposed in the literature.

The k-SNAP summarization approach [97, 98] is an approximation of the SNAP one. It allows setting the desired number $k$ of summary nodes, so that a whole graph can be visualized at different granularity levels, similarly to roll-up and drill-down OLAP operations. A k-SNAP summary is a graph of $k$ summary nodes which satisfy the above condition (i) of a SNAP summary, but relax condition (ii) so that only some (not every) input graph node represented by some summary node satisfies it. Further, as many such summaries may exist, a k-SNAP summary is defined as one that best satisfies condition (ii) of SNAP summaries. Finding such a summary is NP-complete [97], hence tractable heuristic-based algorithms are proposed to compute approximations of k-SNAP summaries.

The k-SNAP summarization approach has been further extended [115] to handle numerical attributes, while k-SNAP as well as SNAP before it have only considered categorical attributes whose domains are made of a limited number of values. The proposed CANAL approach allows bucketizing the values of some numerical attribute into the desired number of categories, hence reducing the summarization of graphs with categorical and numerical attributes to that of graphs with categorical attributes only. For a given numerical attribute $a$, each of the obtained categories represents a range of $a$ values that nodes with similar edge structure have. Computing such categories is in $O(N \log N + k_a^4)$, where $k_a$ is the number of distinct $a$ values that the input graph contains. Also, to ease the use of k-SNAP to inspect a graph in a roll-up and drill-down OLAP fashion, [115] provides a solution to automatically recommend $k$ values for visualizing this graph. It consists in ranking k-SNAP summaries of varying $k$ according to a so-called interestingness measure, defined in terms of conciseness, coverage and diversity criteria.

k-SNAP has also strongly inspired the summarization approach in [60], which similarly aims at computing graph summaries w.r.t. a user-selected number of summary nodes, attributes and edge types. The summaries are computed using a variant of one of the above-mentioned tractable k-SNAP heuristics, which keeps the SNAP condition (i) but changes the SNAP condition (ii) that k-SNAP tries to best satisfy, so that a summary best reflects the organization in social communities of the input graph nodes w.r.t. the selected attributes and edge types.

From a different perspective, [86] sketches SAP HANA’s approach for large graph analytics through summarization. It consists in defining rules to summarize part of an analyzed graph. Rules are made of two components: a graph pattern to be matched on the graph, and how the matched data should be grouped and aggregated into a result graph.

| Work | Input requirements | Purpose | Output type | Output nature | System / Theory |
|---|---|---|---|---|---|
| SNAP/k-SNAP [97, 98] | Required user parameters | Visualization | Graph | Instance | Theory |
| Zhang et al. [115] | | | | | |
| Louati et al. [60] | | | | | |
| Fan et al. [22] | Required user-selected queries | Query answering | Graph | Node-labeled directed graph | Theory |
| DescribeX [18] | None | Query answering | Single root node-labeled graphs | Instance | System |
| Luo et al. [62] | None | Indexing, query answering | Graph | Instance | Theory |
| T-index [64] | Parametrized user input | Query answering | Multi-roots edge-labeled graphs | Instance | Theory |
| D(K)-index [83] | Parametrized user input | Query answering | | Instance | Theory |
| F&B-index [37] | Parametrized user input | Query answering | | Instance | Theory |

Table 1: Graph summaries based on structural quotients.

5 Structural RDF summarization

Structural summarization of RDF graphs aims at producing a summary graph, typically much smaller than the original graph, such that certain interesting properties of the original graph (connectivity, paths, certain graph patterns, frequent nodes etc.) are preserved in the summary graph. Moreover, these properties are taken into consideration to construct a summary. The methods for structural summarization fall into two categories: the quotient summarization methods, discussed in Section 5.1, and the remaining structural summarization methods, described in Section 5.2.

5.1 Structural quotient RDF summaries

We begin with summarization techniques that are based on quotient methods. Intuitively, each summary node corresponds to (represents) multiple nodes from the input graph, while an edge between two summary nodes represents the relationships between the nodes from the input graph represented by the two adjacent summary nodes. Often, the nodes formed in summaries like these are called super-nodes, while their edges are called super-edges.

An interesting property, which directly follows from the notion of quotient graph, relates query answers on an RDF graph $G$ to query answers on its quotient summary:

Definition 4 (Representativeness) Given an RDF query language (dialect) $\mathcal{Q}$, an RDF graph $G$ and a summary $Sum$ of it, $Sum$ is $\mathcal{Q}$-representative of $G$ if and only if for any query $Q \in \mathcal{Q}$ such that $Q(G^\infty) \neq \emptyset$, we have $Q(Sum^\infty) \neq \emptyset$.

Informally, representativeness guarantees that queries having answers on $G$ should also have answers on the summary. This is desirable in order for the summary to help users formulate queries: the summary should reflect all graph patterns that occur in the data.

An overview of the structural quotient summaries is shown in Table 3. Section 5.1.1 introduces summaries defined based on bisimulation graph quotients (recall Definition 2), while Section 5.1.2 discusses other quotient summaries.

5.1.1 (Bi)simulation RDF summaries

The classical notion of bisimulation (Section 4.1) has been used to define many RDF structural quotient summaries.

Thus, [78] presents SAINT-DB, a native RDF management system based on structural indexes. This index is an RDF simulation quotient, based on triple (not node) equivalence. The summary is not an RDF graph: its nodes group triples from the input, while edge labels indicate positions in which triples in adjacent nodes join. Thus, the index is tailored for reducing the query join effort, by pruning any dangling triples which do not participate in the join. Since the index contains only information on joins, and nothing of the values present in the input graph, the query language is restricted to BGPs comprising variables in all positions; further, these BGPs must be acyclic. For compactness, the authors bound the simulation to small $k$ values, e.g., 2; this enables compression factors of about $10^4$. Semantic information or ontologies are not considered. The time complexity of the algorithm comes from the corresponding algorithms for computing graph simulation. This is $O(N^2 * M)$, where $N$ is the number of nodes (equivalent types) and $M$ is the number of edge labels in the result graph. In practice, different query processing strategies aimed at join pruning are implemented by integrating the structural index with the RDF-3X [70] engine.

A structure-based index is proposed in [99], defined as a bisimulation quotient; the authors show that the summary is representative of only tree-shaped queries over non-type and non-schema triples, comprising a single distinguished variable which corresponds to the root node. Further, the authors study limited versions of the bisimulation quotient by considering: (i) only forward bisimulation, (ii) only backward bisimulation or (iii) only neighborhoods of a certain length for tree structures of the input graph. The proposed applications of the structure index are twofold: (i) for data partitioning, by creating a table for each node of the structure index, thereby physically grouping triples with subjects that share the same structure, and (ii) for query answering, where the query may be run first on the structure index, to obtain the set of candidate answers, thus achieving pruning of the (larger) original graph. The authors do not consider graph semantics, nor answering queries over type and schema triples. The complexity of the corresponding algorithm for generating the index is $O((N_1 \cup N_2) * M * \log N)$, where $N_1$ and $N_2$ are the nodes selected for backward and forward bisimulation respectively, $M$ is the number of edges and $N$ the number of nodes of the input graph.

RDF summaries defined in [17] are quotients based on FW bisimulation. The authors do not consider graph semantics or ontologies. They show how to use the summary as a support for query evaluation: incoming navigational SPARQL queries are evaluated on the summary, then the results on the summary are transformed into results on the original graph by exploring the extents of summary nodes. They propose in [45] an implementation based on GraphChi [52], a single-machine multi-core processing framework, to construct the summary in roughly the amount of time required to load the input KB plus write the summary. GraphChi supports the Bulk Synchronous Parallel (BSP) [106] iterative, node-centric processing model, by which nodes in the current iteration execute an update function in parallel, depending on the values from the previous iteration. Their summarization approach is based on the parallel, hash-based approach of [6], which iteratively updates each node’s block identifier by computing a hash value from the node’s signature, defined by the node’s neighbors from the previous iteration. The main idea is that two bisimilar nodes will have the same signature, the same hash value, and thus the same block identifier. Due to the large size of the resulting bisimulation summary, the authors propose a so-called singleton optimization, which involves removing summary nodes representing only one node from $G$; the reduced summary is therefore no longer a quotient of $G$.

ExpLOD [42, 43] produces summaries of RDF graphs by first transforming the original RDF dataset into an unlabeled-edge ExpLOD graph, where a node is created for each triple in the original RDF graph, labeled with the triple property; unlabeled edges go from the original triple’s subject and object to the newly constructed property node. Then, the ExpLOD graph is summarized by a forward bisimulation quotient, grouping together nodes having the same RDF usage. RDF usage can be statistical, e.g., the number of instances of a particular class, or the number of times a property is used to describe resources in the graph. RDF usage can also be structural, e.g., the set of classes to which an instance belongs, the sets of properties describing an instance, or sets of resources connected by the owl:sameAs property. As such, the authors do not propose one summary, but rather a framework where one can select the summary according to one’s “usage” preferences. Finally, the bisimulation quotient is applied without taking into account either schema or type triples, thus the summary is not representative. There are two sequential implementations of ExpLOD. The first implementation computes the relational coarsest partition of a graph using a partition refinement algorithm [72], and requires datasets to fit in main memory. The second approach uses SPARQL queries against an RDF triple store; although in principle this is more scalable, as datasets need not be stored in main memory, it is slower due to the query answering time. To overcome the limitation of the centralized approach, the authors extend ExpLOD, proposing a novel, scalable mechanism to generate usage summaries of billions of Linked Data triples based on a parallel Hadoop implementation [44].

Figure 6: RDF vocabulary for the data graph summary [11].

[87] considers the problem of efficiently building quotient summaries of RDF graphs based on the FW bisimulation node equivalence relation. The authors do not reserve any special treatment for RDF type and schema triples, which prevents the resulting RDF summaries from being representative. Two implementations of the algorithm for computing graph bisimulations, first introduced in [3], are presented: one for sequential single-machine execution using SQL, and the other for distributed execution, taking advantage of MapReduce parallelization to reduce running time. They both have a worst-case time complexity of $O(M \cdot N + N^2)$.

5.1.2 Other structural quotient summaries

To assist users whose task is query formulation, [11] creates the summary graph, the so-called node collection layer, by grouping nodes having the exact same types or, in the absence of types, the same outgoing properties, into entity nodes; further, nodes from the input with no outgoing properties, and having the same incoming properties from subjects with the same set of types, are represented by blank nodes. An edge exists between two summary nodes $v_1$ and $v_2$, labeled by a property p, if there exist two nodes $v'_1$ and $v'_2$ in G, such that $v'_1$ is represented by $v_1$, $v'_2$ is represented by $v_2$, and there exists an edge labeled by p in G from $v'_1$ to $v'_2$. The number of represented nodes from the input is attached to each summary node, and the number of represented edges from the input to each summary edge. This summary graph may group resources from multiple datasets. The proposed dataset layer groups together nodes of the node collection layer which belong to the same dataset. Schema triples are not considered. The approach bears similarities with ExpLOD, since nodes in the first layer are partitioned by types, and partitions are represented by distinct summary nodes. However, unlike ExpLOD, the G nodes having a type are not further distinguished by their data properties, i.e., two nodes of the same type A, one having the data properties a, b and c and the other having the properties a and d, will be represented by the same summary node. Unlike ExpLOD, their summary graph is an RDF graph.
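The grouping rule of the node collection layer can be sketched as follows. This is a minimal Python illustration under our own assumptions, not the implementation of [11]: triples are plain tuples, `rdf:type` is written as a string constant, summary nodes are represented by the grouping key itself, and the dataset layer is omitted.

```python
from collections import Counter, defaultdict

RDF_TYPE = "rdf:type"  # assumed spelling of the type property


def node_collection_layer(triples):
    """Group nodes by types, else by outgoing properties, else as blank nodes."""
    types = defaultdict(set)      # node -> set of classes
    out_props = defaultdict(set)  # node -> outgoing data properties
    data_edges = []
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)
        else:
            out_props[s].add(p)
            data_edges.append((s, p, o))

    # For sink nodes: incoming property together with the subject's type set.
    incoming = defaultdict(set)
    for s, p, o in data_edges:
        incoming[o].add((p, frozenset(types[s])))

    def key(n):
        if types[n]:
            return ("types", frozenset(types[n]))
        if out_props[n]:
            return ("props", frozenset(out_props[n]))
        return ("blank", frozenset(incoming[n]))

    nodes = {s for s, _, _ in triples} | {o for _, _, o in data_edges}
    rep = {n: key(n) for n in nodes}
    node_counts = Counter(rep.values())  # represented nodes per summary node
    edge_counts = Counter((rep[s], p, rep[o]) for s, p, o in data_edges)
    return rep, node_counts, edge_counts


# Hypothetical input graph.
sample = [
    ("a1", RDF_TYPE, "A"), ("a2", RDF_TYPE, "A"),
    ("a1", "name", "n1"), ("a2", "name", "n2"),
    ("a2", "knows", "u"), ("u", "age", "x"),
]
rep, node_counts, edge_counts = node_collection_layer(sample)
```

Here a1 and a2 share a summary node although only a2 has the property knows, which mirrors the remark that typed nodes are not further distinguished by their data properties.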

RDF summaries are defined in a rather restricted setting in [34]. The authors assume that all subjects and objects are typed, and that each has exactly one type; class and property URIs are not allowed in subject and object position, and no use is made of possible schema triples. Under this hypothesis, they construct from the RDF graph a typed object graph (TOG) comprising (s, p, o) triples and assigning an RDF type to each such s and o. Two methods are proposed for summarizing the TOG, namely equivalent compression and dependent compression. Equivalent compression produces a quotient of the TOG by grouping together nodes having the same type and the same set of labels on the edges adjacent to the node. In dependent compression, two nodes $v_1$ and $v_2$ of the TOG are grouped together if $v_1$ is adjacent only to $v_2$, or vice versa. As application scenarios of this approach, the authors indicate mining semantic associations, usually defined as graph or path structures representing group relationships among several instances.

Based on query-preserving graph compression [22] (Section 4.1.2), an Adaptive Structural Summary for RDF graphs (ASSG, in short) is introduced in [114]. ASSG aims at building compressed summaries of the part of an RDF graph which is relevant to a certain set of queries. The authors compute a structural rank of nodes, which is 0 for leaves, and grows with the shortest distance between the node and a leaf; then, nodes having the same label and the same rank are considered equivalent, and are all compressed together in a single ASSG node. Partitioning the N nodes of a graph G into equivalence classes by their label and rank costs $O(N + M)$, where M is the number of edges of the graph.
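The label-and-rank partitioning can be illustrated with the short Python sketch below. It assumes that a leaf is a node without outgoing edges and that the rank is the length of the shortest directed path to a leaf, computed by a multi-source breadth-first search on reversed edges, which stays within the $O(N + M)$ bound. The function names and sample data are hypothetical, and the query-driven part of ASSG is not modeled.

```python
from collections import defaultdict, deque


def structural_ranks(nodes, edges):
    """Rank 0 for leaves; otherwise the shortest distance to a leaf."""
    succ, pred = defaultdict(list), defaultdict(list)
    for s, o in edges:
        succ[s].append(o)
        pred[o].append(s)
    rank = {n: 0 for n in nodes if not succ[n]}
    queue = deque(rank)
    while queue:  # breadth-first search from all leaves, on reversed edges
        n = queue.popleft()
        for m in pred[n]:
            if m not in rank:
                rank[m] = rank[n] + 1
                queue.append(m)
    return rank  # nodes that cannot reach any leaf are absent


def assg_blocks(labels, edges):
    """Group nodes having the same (label, rank) pair."""
    rank = structural_ranks(labels, edges)
    blocks = defaultdict(set)
    for n, label in labels.items():
        blocks[(label, rank.get(n))].add(n)
    return dict(blocks)


# Hypothetical labeled graph.
labels = {"p1": "Person", "p2": "Person", "c1": "City", "c2": "City", "k": "Country"}
edges = [("p1", "c1"), ("p2", "c2"), ("c1", "k"), ("c2", "k")]
blocks = assg_blocks(labels, edges)
```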

RDF summaries defined in [13, 14, 12] adapt the idea of quotient summaries to two characteristic features of RDF graphs: (i) the presence of type triples (zero, one, or any number of types for a given resource), and (ii) the presence of schema triples. As we explained in Section 2.1, RDF Schema information is also expressed by means of triples, which are part of G. [12] shows that quotient summarization of schema triples does more harm than good, as it destroys the semantics of the original graph. To address this, the authors introduce a notion of RDF node equivalence which ensures that class and property nodes (part of schema triples) are not equivalent to any other G nodes, and define a summary as the quotient of G through one such RDF node equivalence. Such summaries are shown to preserve the RDF Schema triples intact, and to enjoy representativeness (Definition 4) for BGP queries having variables in all subject and object positions. The authors show how bisimulation summaries can be cast in this framework, and introduce four novel summaries based on property cliques, which generalize property co-occurrence as follows. A clique $c_G$ is a set of data properties from G such that for any $p_1, p_2 \in c_G$, a resource of G is the source and/or target of both $p_1$ and $p_2$. For instance, if resource $r_1$ has properties a and b while $r_2$ has b and c, then a, b, c are part of the same source clique; if, instead, $r_1$ and $r_2$ are targets of these properties, then a, b, c are part of the same target clique. The so-called weak summary groups together nodes having the same source or target clique, while the strong summary requires the same source and the same target clique; their variants, the typed-weak and typed-strong summaries, first group nodes according to their types, and then according to their cliques. All these summaries can be computed in linear time in the size of the input graph. A benefit of this specific approach is that clique summaries are orders of magnitude more compact than bisimulation summaries. The authors also study how to obtain the summary of G’s semantics, which is the saturation of G. They provide a sufficient condition on the RDF equivalence relation, ensuring that the summary of $G^\infty$ can be computed from G without saturating it, and show that this may be many times faster than the direct procedure of first saturating G, then summarizing $G^\infty$. The software tool implementing this approach is available as open source (footnote 6).
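As an illustration of property cliques, the Python sketch below computes source and target cliques with a union-find structure, following the example in which a, b, c fall into one source clique, and then groups nodes as in the strong summary. It is a simplified sketch under our own assumptions: only data triples are given (no type or schema triples), the class and function names are hypothetical, and the weak and typed variants are not covered. It does not reproduce the open-source tool mentioned above.

```python
from collections import defaultdict


class DisjointSet:
    """Minimal union-find over property names."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def property_cliques(triples):
    out_props, in_props = defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        out_props[s].add(p)
        in_props[o].add(p)
    src, tgt = DisjointSet(), DisjointSet()
    for ds, per_node in ((src, out_props), (tgt, in_props)):
        for props in per_node.values():
            first, *rest = sorted(props)
            ds.find(first)  # register properties that co-occur with nothing
            for p in rest:
                ds.union(first, p)  # properties of one resource share a clique
    return src, tgt, out_props, in_props


def strong_summary_blocks(triples):
    """Group nodes having the same source clique and the same target clique."""
    src, tgt, out_props, in_props = property_cliques(triples)

    def clique_id(ds, props):
        return ds.find(min(props)) if props else None

    blocks = defaultdict(set)
    for n in set(out_props) | set(in_props):
        key = (clique_id(src, out_props.get(n, ())),
               clique_id(tgt, in_props.get(n, ())))
        blocks[key].add(n)
    return list(blocks.values())


# Hypothetical data triples.
sample = [("r1", "a", "x1"), ("r1", "b", "x2"), ("r2", "b", "x3"),
          ("r2", "c", "x4"), ("r3", "d", "x5")]
src, tgt, _, _ = property_cliques(sample)
blocks = strong_summary_blocks(sample)
```

In the sample, r1 and r2 are grouped because their outgoing properties belong to the source clique {a, b, c}, even though they have only b in common.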

5.2 Structural non-quotient RDF summaries

Several structural RDF summaries have been based on techniques different from structural quotients. Our presentation below attempts to identify families of proposals based on the summarization techniques and/or, as appropriate, on the usage for which the summaries are built. An overview of all structural, non-quotient summaries is shown in Table 4, depicting their individual characteristics as well. Section 5.2.1 presents proposals based on text summarization and information retrieval; Section 5.2.2 describes summaries built around concepts of centrality (or rank, importance) of nodes in a graph; Section 5.2.3 considers structural RDF summarization based on an index or other structures which aim at selective data access; finally, Section 5.2.4 discusses RDF summaries whose goal is to facilitate the extraction of a schema, understood as a compact structural description, of the input RDF graph.

5.2.1 Inspired by text summarization and information retrieval

One way of summarizing data, especially when the summary is meant for human users, is to select a most significant subset thereof. Such summarization is very useful, considering that the human ability to process information does not change as the available data volumes grow. We describe here RDF knowledge base summarization efforts inspired by information retrieval and text summarization.

In [96], the authors study the problem of selecting the most important part of an RDF graph which is to be shown to a user interested in a certain entity (resource). A fixed space (triple) budget is to be used; beyond labels, the authors also allow the edges of the RDF graph to carry weights, with higher-weight edges being more important to show. The authors provide algorithms which select triples favoring closeness to the target entity and weight; then, they extend this with criteria based on diversity (include edges with different labels in the selection) and popularity (favor frequently occurring edge labels). It is worth noting that similar techniques have recently been included in Google’s search engine: when the user searches for an entity present in Google’s Knowledge Graph, the user is presented with a small selection of this entity’s properties.

6 https://gitlab.inria.fr/cedar/quotientSummary

Besides this, text summarization principles, where a text can be seen as a collection of terms or a bag of sentences, have been applied to summarize ontologies. An ontology summarization method along these lines is introduced in [117], based on RDF Sentence Graphs. An RDF Sentence Graph is a weighted, directed graph where each vertex represents an RDF sentence, which is a set of RDF Schema statements as illustrated in Fig. 7. A link between two sentences exists if an object of one sentence belongs to another sentence as well. The creation of a sentence graph is customized by domain experts, who provide as input the desired summary size and their navigation preferences, i.e., weights on the links they are most interested in. Then, the importance of each RDF sentence is assessed by determining its centrality in the sentence graph. The authors compare different centrality measures (based on node degree, betweenness, and the PageRank and HITS scores), and show that weighted in-degree centrality and some eigenvector-based centralities produce better results. Finally, the most important RDF sentences are re-ranked considering the coherence of the summary and the coverage of the original ontology, and the constructed result is returned to the user. This method does not handle implicit information (thus, it should be applied to a saturated ontology); also, it does not consider the instance graph.

Subsequently, the authors extended their technique from one ontology to the global set of ontologies harvested from the semantic web [116]. Specifically, to decide the importance (salience) of an RDF sentence, they extend it with neighboring information, for example counting how often the terms of the sentence are linked or instantiated in the global semantic web. Two salience measures are proposed: structural and pragmatic importance. Structural importance measures the number of entities in the web that have a reference to the local RDF sentences with regard to subjects, predicates or objects. Pragmatic importance takes into account the statistical frequency of terms instantiated across the global semantic web. The two measures are combined in order to produce an integrated importance value for each RDF sentence, which again is passed to a re-ranking step to ensure coverage over the whole ontology. Although statistical information over the instances is considered in this second approach, it still does not consider implicit information.

Figure 7: Sample RDF sentences [117].

KCE [76], [66], on the other hand, attempts to automatically identify the key concepts in an ontology. To achieve this, it combines cognitive principles with lexical and topological measures (the density and the coverage). The goal is to identify a number of concepts that would be selected by human experts. To this end, a number of criteria are defined:

• The notion of natural category is used for identifying concepts that are information-rich in a psycho-linguistic sense. This is approximated by two measures: (i) name simplicity promotes concepts labeled with simple names and penalizes compounds; (ii) basic level measures how central a concept is in the taxonomy of the ontology. This is combined with the density, favouring concepts which have many properties and taxonomic relationships.

• The coverage tries to ensure that no part of the ontology is neglected.

• Lastly, the notion of popularity is based on lexical statistics, and tries to identify concepts commonly used in natural language.

Each ontology concept is assigned a score, which is a weighted sum of the scores assigned for each individual criterion; then the key concepts of the ontology are taken to be those with the highest scores. This approach extracts only schema elements, based on both schema and instance information. Implicit information in the ontology is also taken into account in the process. To fine-tune result quality, the method requires the user to specify values for a set of parameters. The tool implementing this approach is available online and in open source (footnote 7).

5.2.2 Focused on centrality measures

In this section, we present approaches that focus on exploiting centrality measures (in some cases in combination with other parameters) in order to produce summaries.

RDFDigest [102], [100] produces summaries of RDF schemas, consisting of the most representative concepts of the schema, seen also from the angle of their frequency in a given instance (RDF graph). Thus, the input of the process includes both an ontology and a data graph. The tool starts by saturating the knowledge base with all implicit data and schema information, thus taking them into account for the rest of the process. The goal of the work is to identify the most important nodes in the ontology, and to link them in order to produce a valid sub-graph of the input schema.

In its first version [101], node relevance is defined based on the relative cardinality and the in/out degree centrality of the node. Then the most relevant nodes are retained as being part of the answer. To find out how to connect these nodes in a sub-schema, two algorithms are proposed, trying to maximize the importance of the selected ontology edges either globally or locally. In the first case, a spanning tree is calculated maximizing the importance of the selected edges, and then the most important nodes are connected using paths from this tree. In the second case, the edges linking the most important nodes are selected based on the notion of coverage, trying to maximize the most representative edges out of the whole schema graph. The authors report that, according to their experiments, linking the most important nodes based on maximum-cost spanning trees produces better summaries; both methods have a worst-case time complexity of $O(N_O^3)$, where $N_O$ is the number of nodes in the ontology. However, this method does not guarantee that the total weight of the selected subgraph is maximized, and when picking a connecting path, it may introduce many additional nodes in the result, some of which may not be important at all. The size and quality of the resulting summary can be fine-tuned by specifying values for a set of parameters; the system is available online (footnote 8) and is mostly targeted at visually presenting the ontology summaries.

7 http://www.essepuntato.it/kce

The more recent version [75] tries to identify the most important schema nodes using six well-known measures from graph theory (i.e., degree, betweenness, bridging centrality, harmonic centrality, radiality, and ego centrality [8]), adapting them for RDF/S KBs in order to consider instance information as well. The authors model the problem of linking those nodes as a graph Steiner Tree selection problem [30], trying to minimize the total number of additional nodes introduced, and employing approximations and heuristics to speed up the execution of the respective algorithms. According to the authors, the optimal selection of importance measure and approximation for linking the most important nodes yields a worst-case time complexity of $O(N^2 \cdot M) + O(S \cdot (N + M))$, where N is the number of nodes, M is the number of edges in the ontology, and S is the number of most important nodes to be selected.

8 http://www.ics.forth.gr/isl/rdf-digest/

An overview of the summarization process is shown in Fig. 8. Also in this version, users can parameterize the process by specifying values for different parameters, such as the number of summary nodes to be selected. In the latest version of the system [103, 104], the authors propose exploration operations based on summaries, such as zoom and extend. Extend focuses on a specific subgraph of the initial ontology, whereas zoom operates on the whole graph, providing more or less detailed information for the selected nodes.

[82], on the other hand, tries to combine user preferences with centrality measures in order to calculate the importance of a node. Then paths that include the most important nodes are identified to produce the final graph. Thus, the result is a subgraph of the original graph. The main steps of this summarization method are shown in Fig. 9: (i) select the parameters (e.g., the size of the summary and importance thresholds) and possibly nodes that are important according to the user’s opinion; (ii) compute the relevance of the concepts in the ontology as the weighted sum of the degree centrality and the closeness centrality; (iii) identify the paths linking the selected nodes using the Broaden Relevant Paths algorithm. This algorithm tries to find paths of greatest quality within the summarized graph by considering the relevance of the nodes included in the path. The approach supports RDF or OWL ontologies and mainly aims to help ontology understanding through visualization.
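The relevance computation of step (ii) can be sketched in a few lines of Python. The sketch assumes an undirected, unweighted concept graph with at least two nodes, the standard normalized definitions of degree and closeness centrality, and equal weights of 0.5; none of these choices is taken from [82], and the function names and the sample graph are hypothetical.

```python
from collections import deque


def degree_centrality(adj):
    """Fraction of the other nodes each node is directly linked to."""
    n = len(adj)
    return {v: len(neighbors) / (n - 1) for v, neighbors in adj.items()}


def closeness_centrality(adj):
    """Reachable nodes divided by the sum of shortest-path distances."""
    result = {}
    for v in adj:
        dist = {v: 0}
        queue = deque([v])
        while queue:  # breadth-first search from v
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[v] = (len(dist) - 1) / total if total else 0.0
    return result


def relevance(adj, w_degree=0.5, w_closeness=0.5):
    """Weighted sum of degree and closeness centrality (weights assumed)."""
    deg, clo = degree_centrality(adj), closeness_centrality(adj)
    return {v: w_degree * deg[v] + w_closeness * clo[v] for v in adj}


# Hypothetical star-shaped concept graph centered on "c".
adj = {"c": {"a", "b", "d"}, "a": {"c"}, "b": {"c"}, "d": {"c"}}
scores = relevance(adj)
most_relevant = max(scores, key=scores.get)
```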

5.2.3 Index-driven RDF summaries

Besides summaries focusing on centrality measures, other approaches try to exploit summaries for indexing.

GRIN [105], for example, is an index for RDF graphs that has been designed for efficient query answering. The semantics of the indexed RDF graph is taken into account by assuming that the input graph is saturated before being indexed; however, the RDF Schema is not part of the index. A GRIN index is a hierarchical clustering of the resources of an RDF graph, modelled as a balanced binary tree. The set of leaf nodes in the tree forms a partition of the set of triples in the RDF graph; each leaf node represents the resources of the triples it holds. Interior nodes are constructed by finding a center triple and a radius value R. An interior node in the tree implicitly represents the set of all vertices in the graph that are within R units of distance (i.e., less than or equal to R links) from that center. Inner nodes at the same level of the index form a partition of the input RDF graph; each inner node reflects the resources of the triples of the nodes it is an ancestor of. The worst-case complexity for building the index is O(R4 ∗ log2R). Then, at query time, GRIN derives a set of inequality constraints from the query, evaluated against the nodes of the GRIN index in order to identify the smallest sub-graph that contains answers to the input query.

Figure 8: Creating an ontology schema graph using [75].

Figure 9: Flowchart of the ontology summarization method [82].

[29] studies efficient query processing in the TriAD distributed RDF data management system, by relying on a summary of the RDF graph stored within the system. This summary, which is an RDF graph, follows from a standard partitioning of the input RDF graph, computed with METIS (footnote 9), which minimizes the number of edges across the partition’s blocks. The summary is made of triples of the form $B_1 \xrightarrow{p} B_2$, reflecting that there exists a triple (s, p, o) in the input graph with s a resource of the partition block $B_1$ and o a resource of the block $B_2$. The proposed technique does not consider implicit triples of the input RDF graph and basically reflects how (some resources of) sets of strongly connected triples, i.e., those reflected by the partition blocks, are loosely connected. Then, at query time, the index is used to prune dangling triples by identifying the bindings for the join variables in the query, which are later used to generate results via the relational joins of the underlying system.

9 http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
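The construction of such a partition-based summary can be sketched as follows. In this Python illustration the block assignment is given by hand rather than computed with METIS, edges inside a block are kept as self-loops because the description above does not exclude them, and the pruning helper is a deliberately simplified stand-in for the join-ahead pruning of TriAD; all names are hypothetical.

```python
def partition_summary(triples, block_of):
    """One summary triple (B1, p, B2) per property linking two blocks."""
    return {(block_of[s], p, block_of[o]) for s, p, o in triples}


def candidate_block_pairs(summary, p):
    """Block pairs that may contain matches for a pattern (?s, p, ?o)."""
    return {(b1, b2) for b1, q, b2 in summary if q == p}


# Hypothetical graph and hand-made partition (assumed, not METIS output).
triples = [("a", "p", "b"), ("b", "q", "c"), ("c", "p", "d")]
block_of = {"a": "B1", "b": "B1", "c": "B2", "d": "B2"}
summary = partition_summary(triples, block_of)
```

A triple pattern with property q can only be matched across the pair (B1, B2) here, so other block pairs need not be inspected for it.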

[54] builds RDF graph summaries meant to help answer keyword queries in RDF graphs. As is common in this setting, a match is a tree of interconnected RDF triples, such that each keyword from the query is present in one of the triples; the score of a match for a query consisting of several keywords decreases with the number of edges (triples) in this tree, that is, the closer the nodes (thus, the smaller the tree), the better; and the answer to the query is the set of the highest-scoring matches. To enable efficient query answering, the authors build a summary as a collection of tree structures (see below), each representing a subset of the graph triples; the summary trees, together, represent all graph triples. Given a summary tree, the authors estimate the length of a path which may connect the tree with the nodes matching different query keywords. If the result tree has a length larger than one already found, there is no reason to visit the nodes of this tree. To summarize an RDF graph (the authors consider a restricted setting where each resource has at most one type), they proceed as follows. (i) For each type T and each resource r of this type, given a user-specified integer parameter α, a partition block is created comprising the triples forming paths of length at most α from r, except the triples included in previously built partition blocks. (ii) Next, they search for a partition block (sub-graph of G) which can be embedded via homomorphism into another, and when this happens, they discard the former (as it can be considered to be “sufficiently well represented” by the latter). More specifically, each partition block is represented by its graph core, and homomorphisms are identified between cores; for efficiency, covering trees of sub-graphs are used instead of the sub-graphs themselves. Overall, this technique considers neither RDF Schema nor implicit information.

5.2.4 Oriented towards schema extraction

In this subsection we focus on methods that try to create schema-like structures of the available data sources.

One of the challenges when working with Linked Open Data is the lack of a concise schema, or a clear description of the data that can be found in the data source. SchemEX [49, 50], an indexing and schema extraction tool for the LOD cloud, attempts to solve this problem. Out of an RDF graph, it produces a three-layered index, based on resource types. Each layer groups input data sources of the LOD cloud into nodes, as follows: (i) in the first layer, each node is a single class c from the input, to which the data sources containing triples whose subject is of type c are associated; (ii) in the second layer, each node, now called an RDF type cluster, is a set of classes C mapped to those data sources having instances whose exact set of types is C; (iii) in the third layer, each node is an equivalence class, where two nodes u and v from the input belong to the same equivalence class if and only if they have the exact same set of types, they are both subjects of the same data property p, and the objects of that property p belong to the same RDF type cluster. Further, each equivalence class is mapped to all data sources comprising triples (s, p, o) from an input RDF graph, such that s belongs to the equivalence class of the node. To build the index, a stream-based computation approach is proposed, depicted in Fig. 10. The restriction to a certain window size of the data stream typically leads to incomplete results, thus the choice of the appropriate window size is essential for the quality of the extracted index. This approach considers neither implicit triples nor untyped resources. In addition, the resulting index is not a quotient, since in each layer data sources may be mapped to multiple index nodes (while a quotient partitions the graph nodes). Finally, the time complexity of the approach is $O(N \cdot \log N)$ or $O(M \cdot \log M)$, where N is the number of RDF classes available and M is the number of properties.
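The three layers can be illustrated by the following Python sketch, which processes the whole input at once instead of using a stream window. It reflects one possible reading of the third layer, in which a typed subject is characterized by its type cluster together with all its (property, object type cluster) pairs; this reading, the quad representation (s, p, o, source) and all names are assumptions, not the SchemEX implementation.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # assumed spelling of the type property


def schemex_index(quads):
    """Map classes, type clusters and equivalence classes to data sources."""
    types = defaultdict(set)
    for s, p, o, _ in quads:
        if p == RDF_TYPE:
            types[s].add(o)
    cluster = {s: frozenset(ts) for s, ts in types.items()}  # exact type sets

    features = defaultdict(set)  # subject -> {(property, object type cluster)}
    sources = defaultdict(set)   # subject -> data sources describing it
    for s, p, o, source in quads:
        if s not in cluster:
            continue  # untyped resources are not considered
        sources[s].add(source)
        if p != RDF_TYPE:
            features[s].add((p, cluster.get(o, frozenset())))

    layer1, layer2, layer3 = defaultdict(set), defaultdict(set), defaultdict(set)
    for s, type_cluster in cluster.items():
        for c in type_cluster:
            layer1[c] |= sources[s]        # class -> sources
        layer2[type_cluster] |= sources[s]  # type cluster -> sources
        layer3[(type_cluster, frozenset(features[s]))] |= sources[s]
    return layer1, layer2, layer3


# Hypothetical quads coming from three data sources.
quads = [
    ("a", RDF_TYPE, "Person", "ds1"), ("a", "knows", "b", "ds1"),
    ("b", RDF_TYPE, "Person", "ds2"), ("b", "knows", "a", "ds2"),
    ("c", RDF_TYPE, "Person", "ds3"), ("c", RDF_TYPE, "Author", "ds3"),
]
layer1, layer2, layer3 = schemex_index(quads)
```

Note that ds3 appears under both Person and Author in the first layer, which illustrates why the index is not a quotient.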

Towards the same goal of building a compact representation of an RDF graph to be used as a schema, [39] proposes an approach to extract a schema-like directed graph, as follows. A density-based clustering algorithm is used on the input RDF graph to identify the summary nodes: each such node, called a (derived) type, corresponds to a set of resources with sufficient structural similarity. Further, each of these types is described by a profile, i.e., a set of (property, probability) pairs of the form $(\overrightarrow{p}, \alpha)$ or $(\overleftarrow{p}, \alpha)$, meaning that a resource of that type has the outgoing or incoming property p with probability α. These profiles are used to define output schema edges of two kinds: (i) There is a p-labeled edge from a type node $T_1$ to a type node $T_2$, i.e., $T_1 \xrightarrow{p} T_2$, whenever p is an outgoing property of $T_1$’s profile and an incoming property of $T_2$’s profile. (ii) There is an rdfs:subClassOf-labeled edge from a type node $T_1$ to a type node $T_2$, i.e., $T_1 \xrightarrow{\text{rdfs:subClassOf}} T_2$, whenever $T_1$ is found more specific than $T_2$ by an ascending hierarchical clustering algorithm applied to their profiles. The time complexity of the corresponding algorithm is $O(N^2)$, where N is the number of entities in the dataset. The output directed graph can be seen as a summary describing, first, types which correspond to structurally similar resources, and second, how properties relate resources of various types.

Figure 10: Graph compression technique for SchemEX.

6 Pattern-based RDF summarization

In this section we review summarization methods based on data mining techniques, which extract the frequent patterns from the RDF graph, and use these patterns to represent the original RDF graph. A frequent pattern, usually referred to as a knowledge pattern in the RDF/OWL KB context (or simply pattern from now on), characterizes a set of instances in an RDF data graph that share a common set of types and a common set of properties. It is usually modeled as a star BGP of the form $\{(x, \tau, c_1), \ldots, (x, \tau, c_n), (x, Pr_1, ?b_1), \ldots, (x, Pr_m, ?b_m)\}$ denoting some resource x having types $c_1, \ldots, c_n$ and properties $Pr_1, \ldots, Pr_m$. Given an RDF graph G, a pattern KP identifies all the G resources that match x in the embeddings of KP into G; the number of such embeddings is called the support of KP in G. Patterns identified in such a manner become representative nodes (supernodes).

Example 6 (Knowledge pattern) Consider again our sample RDF graph G presented in Figure 1. The knowledge pattern {(x, τ, Publication), (x, hasTitle, y), (x, hasAuthor, z)} has no support in G and a support of 1 in $G^\infty$ (when the embedding matches x to doi1).
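The support of a star-shaped knowledge pattern can be computed as in the Python sketch below. The triples are hypothetical and only mimic the situation of Example 6 (they are not the graph of Figure 1): the type Publication is assumed to be implicit in G and explicit in its saturation. Since the variables of distinct properties are independent, the number of embeddings for a matching resource is taken to be the product of the numbers of values of each property.

```python
from collections import defaultdict
from math import prod

RDF_TYPE = "rdf:type"  # assumed spelling of the type property (tau)


def pattern_support(triples, classes, properties):
    """Return the resources matching x and the number of embeddings."""
    types = defaultdict(set)
    values = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)
        else:
            values[s][p].add(o)
    matches, support = set(), 0
    for x in set(types) | set(values):
        if set(classes) <= types[x] and all(values[x][p] for p in properties):
            matches.add(x)
            support += prod(len(values[x][p]) for p in properties)
    return matches, support


# Hypothetical graph G and its saturation (one implicit type made explicit).
G = [("doi1", RDF_TYPE, "Book"),
     ("doi1", "hasTitle", "title1"),
     ("doi1", "hasAuthor", "author1")]
G_saturated = G + [("doi1", RDF_TYPE, "Publication")]

pattern = (["Publication"], ["hasTitle", "hasAuthor"])
in_G = pattern_support(G, *pattern)
in_G_saturated = pattern_support(G_saturated, *pattern)
```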

An overview of the works in this category is shown in Table 5. Section 6.1 discusses methods that exploit mining graph patterns, while Section 6.2 is concerned with methods which summarize the RDF graph based on rules derived from mining techniques.

6.1 Mining graph patterns

In this section we describe methods that exploit patterns appearing in the RDF graph and construct the summary based on these patterns.

25

Figure 11: Zneika et al. approach. [121]

[121, 120] present an approach for RDF graph summarization based on mining a set of approximate graph patterns that “best” represent the input graph; the summary is an RDF graph itself, which makes it possible to use SPARQL to evaluate (simplified) queries on the summary instead of the original graph. The approach proceeds in three steps, as shown in Fig. 11. First, the RDF graph is transformed into a binary matrix, in which the rows represent the subjects and the columns represent the predicates. The semantics of the information is preserved by capturing the available distinct types and all attributes and properties (capturing property participation both as subject and object for an instance). Second, the matrix created in the previous step is used in a calibrated version of the PaNDa+ [61] algorithm, which retrieves the best approximate RDF graph patterns, based on different cost functions. Each extracted pattern identifies a set of subjects (rows), all having approximately the same properties (columns). The patterns are extracted so as to minimize errors and to maximize the coverage (i.e., provide a richer description) of the input data. A pattern thus encompasses a set of concepts (type, property, attribute) of the RDF dataset, holding at the same time information about the number of instances that support this set of concepts. Third, based on the extracted patterns and the binary matrix, the summary is reconstructed as an RDF graph, enriched with the computed statistical information; this enables SPARQL query evaluation on the summary to return approximate answers to queries asked on the original graph. The time complexity of the approach is $O(M \cdot N + K \cdot M^2 \cdot N)$, where K is the maximum number of patterns to be extracted, and M and N are respectively the number of distinct properties and subjects/resources in the original KB. The authors note that the algorithm works equally well on homogeneous and heterogeneous RDF graphs.
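The first step, the transformation of the RDF graph into a binary matrix, can be sketched in Python as follows. The sketch is an assumption-laden simplification: columns are built from types and outgoing properties only (participation as object is not encoded), and the second function merely counts identical rows as a crude, exact stand-in for the approximate patterns that PaNDa+ would extract. All names and sample triples are hypothetical.

```python
from collections import Counter

RDF_TYPE = "rdf:type"  # assumed spelling of the type property


def binary_matrix(triples):
    """Rows are subjects; columns are types and outgoing properties."""
    def column(p, o):
        return ("type", o) if p == RDF_TYPE else ("property", p)

    subjects = sorted({s for s, _, _ in triples})
    columns = sorted({column(p, o) for _, p, o in triples})
    row_index = {s: i for i, s in enumerate(subjects)}
    col_index = {c: j for j, c in enumerate(columns)}
    matrix = [[0] * len(columns) for _ in subjects]
    for s, p, o in triples:
        matrix[row_index[s]][col_index[column(p, o)]] = 1
    return subjects, columns, matrix


def exact_row_patterns(columns, matrix):
    """Count identical rows: a naive stand-in for approximate pattern mining."""
    counts = Counter(tuple(row) for row in matrix)
    return {frozenset(c for c, bit in zip(columns, row) if bit): n
            for row, n in counts.items()}


# Hypothetical input triples.
triples = [
    ("s1", RDF_TYPE, "Person"), ("s1", "name", "A"),
    ("s2", RDF_TYPE, "Person"), ("s2", "name", "B"),
    ("s3", "name", "C"),
]
subjects, columns, matrix = binary_matrix(triples)
patterns = exact_row_patterns(columns, matrix)
```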

In [91], the goal is to discover the k patterns which maximize an informativeness measure (an informativeness score function is provided as input). The algorithm takes as input an integer distance d, which is used to control the neighborhoods in which similar entities are sought, and a bound k on the number of desired patterns. The algorithm discovers the k d-summaries/patterns that maximize the informativeness score function.

The authors use d-similarity to capture similarity between entities in terms of their labels and neighborhood information up to the distance d. Compared to other graph patterns like frequent graph patterns, (bi)simulation-based, dual-simulation-based and neighborhood-based summaries, d-similarity offers greater flexibility in matching, while it takes into account the extended neighborhood, which provides better summaries especially for schema-less knowledge graphs, where similar entities are not equivalent in a strict pairwise manner. A node v of the original graph G is attributed to the base graph of the d-summary P if and only if there is a node u of P which has the same label as v and, for every parent/child $u_1$ of u in P, there exists a parent/child $v_1$ of v in G such that the edges $(u_1, u)$ and $(v_1, v)$ have the same edge label. The d-summaries are then used, e.g., to facilitate query answering.

A d-summary P is said to dominate another d-summary P′ if and only if supp(P) ≥ supp(P′); a maximal d-summary P is one that dominates any summary P′ that may be obtained from P by adding one more edge. The algorithm starts by discovering all maximal d-summaries by mining and verifying all k-subsets of summaries for the input graph G, then greedily adds a summary pair $(P, P_1)$ that brings the greatest increase to the informativeness score of the summary. The time complexity of the approach is $O(S \cdot (b + N) \cdot (b + M) + \frac{K}{2} \cdot S^2)$, where N and M are respectively the total number of nodes (subjects and objects) and edges (triples) of the original RDF graph, and S is the number of possible d-summaries whose size is bounded by b.

6.2 Mining rules

Methods described here use rule mining techniques in order to extract rules for summarizing the RDF graph. A common limitation of such methods is that, by design, the summary is not an RDF graph, thus it cannot be exploited using the common set of RDF tools (e.g., SPARQL querying, reasoning, etc.).

[36, 35] propose compressing RDF datasets by removing triples that can be inferred using logical and inference rules. Graph decompression then infers such triples again, to retrieve the original graph.

This approach, which is depicted in Fig. 12, generates, from a given RDF graph G, an active graph $G_A$ containing the triples that adhere to certain logical rules, and a dormant graph $G_D$, which contains the set of triples of the original graph which none of the identified rules can infer. This leads to viewing an RDF graph G as being $R(G_A) \cup G_D$, where R represents the set of rules to be applied to the active graph $G_A$, while $(G_A, G_D)$ together represent the compressed graph. An association rule mining algorithm is employed to automatically identify the set of logical rules.

The authors leverage the Apriori [2] or FP-Growth [31] frequent pattern mining algorithms to identify sets of association rules. First, for each property p, a “transaction” (in classical data mining terms) is a list of objects which are the values of property p for a given subject. Each rule is thus defined by: a property p, an object item k, and a frequent itemset x associated with k. One sample rule is: $\forall x, (x, p, k) \rightarrow \bigwedge_{i=1}^{n} (x, p, v_i)$, stating that the subjects that carry the value k for property p also carry the values $v_i$ for the same property. Based on such a rule, the triple (x, p, k) is encoded in the summary while the inferred triples $\bigwedge_{i=1}^{n} (x, p, v_i)$ can be removed. Further, the authors extend the approach to use as a transaction the list of all (p, o) pairs for a given subject, and similarly mine for frequent itemsets in this context, each of which will be interpreted as a logical compression rule.
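The split of a graph into an active and a dormant part can be illustrated with a minimal sketch. It is not the implementation of [36, 35]: the rules are assumed to be given (no Apriori or FP-Growth mining is performed), each (p, k) pair is assumed to carry at most one rule, and a rule is applied to a subject only when all the triples it implies are actually present, so that decompression is lossless. The names `compress`, `decompress` and `Rules` are hypothetical.

```python
# Hypothetical sketch of rule-based RDF compression; all names are illustrative.
Triple = tuple[str, str, str]
# Maps (p, k) to the values {v_1, ..., v_n} implied for the same subject and property.
Rules = dict[tuple[str, str], frozenset[str]]


def compress(graph: set[Triple], rules: Rules) -> tuple[set[Triple], set[Triple]]:
    """Split `graph` into (active, dormant) triples."""
    active: set[Triple] = set()
    inferable: set[Triple] = set()
    for s, p, o in graph:
        implied = rules.get((p, o))
        if implied is None:
            continue
        derived = {(s, p, v) for v in implied}
        # Assumption: use a rule only where it holds exactly, so nothing is lost.
        if derived <= graph:
            active.add((s, p, o))
            inferable |= derived
    # Dormant triples are those that no rule application can re-derive.
    dormant = graph - active - inferable
    return active, dormant


def decompress(active: set[Triple], dormant: set[Triple], rules: Rules) -> set[Triple]:
    """Rebuild the original graph by re-applying the rules to the active triples."""
    restored = set(active) | set(dormant)
    for s, p, o in active:
        restored |= {(s, p, v) for v in rules.get((p, o), frozenset())}
    return restored
```

With a rule stating that every subject of type Student is also of type Person and Agent, only the Student triple needs to be stored for each subject that satisfies the rule; the other two type triples are recovered at decompression time.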

This approach works well when the original graph contains many different nodes sharing many of the same “neighbors”, but it is not effective when the contrary is true. To deal with this issue, the authors of [74] extend the previous approach by exploiting a graph pattern with two variables instead of one, which makes it applicable to more generic graph structures and reduces the size of the summary graph. This is because the number of triples in the summary graph is halved (a rule can now represent more triples). The time complexity of the two previous approaches is $O(M \cdot R + N_p \cdot O_v^2 \cdot N_s)$, where M is the total number of triples, R is the number of generated logical rules, $N_p$ and $N_s$ are respectively the numbers of distinct properties and subjects/resources in the graph, and $O_v$ is the average number of different objects/values that are assigned to a property p (recall that we are looking for common neighbours, and thus for common values of the properties in the object/value part, (s, p, o) or (s, p, v) in the triple notation). Thus, the lower the average number of different values, the more common the neighbourhood and the better the algorithm behaves, as stated earlier.


Figure 12: Graph compression technique for Joshi et al. [36].

Figure 13: Graph compression framework following [74].

7 Statistical summarization

The works we discuss here focus on quantitative summarization of the contents of an RDF graph. An overview is shown in Table 6.

A first motivation for statistical summarization works comes from the source selection problem. In general, statistical methods are interested in providing quantitative statistics on the content of KBs in order to decide whether it is pertinent to use a given KB or not. In that respect, compared to the pattern mining category (which is conceptually close) and the other categories, they have the advantage of not being much concerned with the structural completeness of the summary, which reduces computational costs. Early works in this area [5, 89] use SPARQL ASK queries to identify whether a triple pattern exists in a source node or not, and then query only those sources in a subsequent step. The main problem of this solution is that many sources might contain the same facts, meaning that we will have many duplicate results and therefore many unnecessary requests. The authors of [33] propose to extend ASK queries so that they return not a boolean answer, but a concise summary of the result, in the form of Bloom filters [7]. Based on these summaries, a corresponding algorithm estimates the benefit of retrieving results for a triple pattern from a specific source, ignoring sources with low or zero benefit. The summaries produced are called sketches, and include statistical information on the instances. In this approach, input is not required from the users.
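To make the role of Bloom filters concrete, the following minimal sketch estimates how many results of a source are new with respect to results already retrieved. It is a simplification and not the algorithm of [33]: the class `BloomFilter`, the function `estimated_benefit`, the filter size and the use of SHA-256 as hash family are all assumptions made for illustration. Since Bloom filters may yield false positives, the overlap can be overestimated and the benefit underestimated.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter backed by a Python integer used as a bit array."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item: str):
        # Derive num_hashes positions by salting SHA-256 with a seed (an assumption).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        # No false negatives; false positives are possible.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def estimated_benefit(known_results, source_filter: BloomFilter,
                      source_result_count: int) -> int:
    """Estimate how many results of a source are not already known."""
    overlap = sum(1 for r in known_results if r in source_filter)
    return max(source_result_count - overlap, 0)
```

A source whose estimated benefit is zero or very low would be skipped, which avoids the duplicate requests mentioned above.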

Another approach, which seeks to identify the most important resources in an ontology, is [111]. The proposed algorithm, named Concept-And-Relation-Ranking, does not consider instances and tries to identify the most important schema concepts and relations in an iterative manner. The importance of a concept is a combination of the number of relations starting from it, its relations to more important concepts, and the weights of these relations. The more important the concept at the source of a relation, the higher the weight of the relation. The importance of nodes and the weights of the relations reinforce one another in an iterative process. The approach considers implicit and semantic information as well; it is based on an ontology graph model to which RDF, DAML+OIL and OWL ontologies can be mapped.

[79] proposes a method to automatically summarize local ontologies that are used as schemas of peers participating in a peer-to-peer system. The goal is to help peer clustering, where an incoming peer must search for semantically similar peers in order to join. To do that, a schema summary of the new node is compared with the schema summaries of the existing peers in order to decide where to join.

In order to determine the relevance of a concept, two measures are combined: centrality and frequency. Centrality is an adaptation of degree centrality; different weights can be assigned to user-defined properties, on the one hand, and to the special properties isA (RDFS subClassOf), partOf and sameAs, on the other. Frequency is the ratio between the number of concept correspondences and the number of distinct local ontologies. The algorithm starts by computing the relevance, then selects the top-k nodes, and subsequently groups adjacent relevant concepts. Finally, in order to link non-adjacent groups, the first k paths connecting them are examined in order to select the ones that have the best F-measure and average relevance. The approach does not consider the data triples of the input graph, nor the implicit triples. However, it supports OWL ontologies. Using it requires setting the values of a set of parameters and weights, in particular to determine the importance of different properties, to compute the relevance, to determine the summary size, etc. The tool is available online for download (footnote 10).

LODSight [21] is an RDF dataset summary visualization tool that displays typical combinations of types and predicates. It relies solely on SPARQL queries and, as such, given a SPARQL endpoint, it can theoretically summarize all accessible data without requiring any user input. Through those SPARQL queries, it collects statistical information on the available combinations of types and predicates, and visualizes them in a labelled graph. Implicit RDF data is only accounted for to the extent that the endpoint returns full answers based on reasoning. The tool provides dynamic means of changing the level of detail, and is able to summarize very large datasets. The system is available online (footnote 11).

LODSight was extended in [68] in order to further improve the understanding of a dataset, by instantiating the summary patterns identified by LODSight. To do that, the authors propose an approach to select instances through three methods, namely random, distinct and representative. In random selection, random examples of each RDF summary path are selected; this runs the risk of returning duplicates. The distinct selection method aims to select data paths as distinct from one another as possible; to this effect, distance measures are used to find how similar two paths are, and a greedy heuristic is employed to construct a sufficiently diverse set of paths. The representative selection method combines diversity and representativeness criteria in order to select a set of paths achieving a comprehensive result. The selection method is available as a web service (footnote 12).
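The distinct selection idea can be sketched with a greedy farthest-first heuristic. This is only an illustration under stated assumptions, not the method of [68]: paths are modeled as tuples of strings, the distance is a Jaccard distance over the elements of two paths, and the function names are hypothetical.

```python
def jaccard_distance(a, b) -> float:
    """Distance between two paths, seen as sets of their elements (an assumption)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def select_distinct(paths, k, distance=jaccard_distance):
    """Greedily pick up to k paths, each as far as possible from those already picked."""
    if k <= 0 or not paths:
        return []
    selected = [paths[0]]
    remaining = list(paths[1:])
    while remaining and len(selected) < k:
        # Choose the candidate whose closest selected path is the farthest away.
        best = max(remaining, key=lambda p: min(distance(p, s) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because a duplicate of an already selected path has distance zero to it, duplicates are picked last, which addresses the main weakness of random selection.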

[81] proposes a dataset analysis method based on recognizing and discovering patterns. The aim of this method is to support query answering. To this end, the authors initially create an ontology that depicts the organization of the dataset and identifies its main features, i.e., information about triples, paths, and the types and properties occurring in the paths. In addition, it includes statistics about these elements, such as the number of occurrences of each path. Using this ontology, the core types and properties can be distinguished based on their frequencies and their position in paths. According to these observations, central knowledge patterns (containing a central type and properties) are extracted in order to define prototypical queries. Implicit information is not taken into account.

10 http://www.cin.ufpe.br/~speed/OWLSummarizer/

11 http://lod2-dev.vse.cz/lodsight/about.html

12 https://github.com/jindrichmynarz/rdf-path-examples

8 Other summarization methods

In this section, we present approaches that combine methods from the structural, statistical and pattern-mining categories in order to get better results. In addition, there are methods going beyond RDF graph summaries, for example summarizing DL ontologies. An overview of the works in this category is shown in Table 7.

[3] proposes a hybrid structural summarization technique for RDF graphs, the purpose of which is to reduce their size while retaining their structure as much as possible. It consists of a graph quotient step followed by a graph clustering step. The first step adopts bounded forward bisimulation since, according to the authors, previous studies showed that, in general, unbounded bisimulation is not amenable to significant graph size reduction. This intermediate graph being a quotient, it represents all the N nodes and M edges of the input graph, and is computed in $O(M \cdot N + N^2)$. It is then further compressed by hierarchical clustering, which fuses root nodes of similar depth-bounded tree subgraphs. This is achieved with the so-called Complete-Link Clustering algorithm in $O(N^2)$ [67], where N is the number of nodes of the aforementioned intermediate graph. This technique produces a standard graph out of an RDF one, without user input, which summarizes the whole input graph (nodes and edges). However, it does not apply any particular treatment to RDF Schema triples, and hence does not capture the implicit triples they entail, unless the input RDF graph is saturated prior to summarization.
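The quotient step can be illustrated by a small partition-refinement sketch of bounded forward bisimulation. It is an assumption-laden illustration rather than the algorithm of [3]: nodes are unlabeled, all nodes start in one block, and each of the k rounds refines the blocks according to the labels of outgoing edges and the blocks of their targets. The function names are hypothetical, and the clustering step is not covered.

```python
def bounded_forward_bisimulation(nodes, edges, k):
    """Return a dict node -> block id after at most k refinement rounds.

    `nodes` is a list containing every node; `edges` holds (source, label, target).
    """
    out = {n: set() for n in nodes}
    for s, label, t in edges:
        out[s].add((label, t))
    block = {n: 0 for n in nodes}  # depth 0: all nodes are equivalent
    for _ in range(k):
        ids = {}
        refined = {}
        for n in nodes:
            # A node is described by its previous block and its outgoing (label, block) pairs.
            signature = (block[n], frozenset((label, block[t]) for label, t in out[n]))
            refined[n] = ids.setdefault(signature, len(ids))
        stable = len(ids) == len(set(block.values()))
        block = refined
        if stable:  # refinement can only split blocks, so an equal count means a fixpoint
            break
    return block


def quotient(edges, block):
    """Quotient graph: one node per block, one edge per distinct (block, label, block)."""
    return {(block[s], label, block[t]) for s, label, t in edges}
```

Small values of k merge more nodes and yield smaller quotients, which matches the observation that unbounded bisimulation rarely reduces the graph size significantly.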

ABSTAT [73, 92, 93] is a summarization method for RDF graphs (and OWL KBs), which aims at reflecting how class instances are related through properties. A summary is not a graph but a set of abstract knowledge patterns (AKPs) of the form (c1, p, c2) representing the (s, p, o) graph triples with c1 (resp. c2) one of the most specific types of s (resp. o); there may be several such c1, c2 pairs for a given property p. An ABSTAT summary is built in time polynomial in the size of the input RDF graph, by first computing, for every value present in the graph, all its types, from which the redundant ones are pruned. Then, for each property assertion (s, p, o), an AKP (c1, p, c2) is built if c1 (resp. c2) is a most specific type for value s (resp. o). ABSTAT is available online (footnote 13).
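The construction of AKPs can be sketched as follows. This is a minimal illustration and not the ABSTAT code: the subclass hierarchy is assumed to be acyclic and given as a dictionary from a class to its direct superclasses, values without any type are simply skipped, and all function names are hypothetical.

```python
def superclasses(cls, subclass_of):
    """All strict superclasses of `cls`, following `subclass_of` transitively."""
    seen, stack = set(), list(subclass_of.get(cls, ()))
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(subclass_of.get(c, ()))
    return seen


def most_specific(types, subclass_of):
    """Prune the redundant types, i.e. those that are superclasses of another type."""
    redundant = set()
    for t in types:
        redundant |= superclasses(t, subclass_of)
    return set(types) - redundant


def abstract_knowledge_patterns(assertions, types_of, subclass_of):
    """Build the set of AKPs (c1, p, c2) for the property assertions (s, p, o)."""
    patterns = set()
    for s, p, o in assertions:
        for c1 in most_specific(types_of.get(s, ()), subclass_of):
            for c2 in most_specific(types_of.get(o, ()), subclass_of):
                patterns.add((c1, p, c2))
    return patterns
```

For a subject typed Student, Person and Agent, with Student below Person below Agent, only Student survives the pruning, so the pattern reflects the most specific relationship between classes.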

[94] proposes to use structural summaries of RDF graphs for estimating the cardinality of conjunctive queries. The authors build a graph summary of an RDF graph by grouping together nodes having exactly the same set of types, the same outgoing and the same incoming properties. A summary edge is labeled with the number of edges of G that have been collapsed due to merging (thus, the summary is not a quotient). The classes (appearing in the object position of type triples) and the properties appearing in the property position are preserved, i.e., they are represented in the summary by themselves. However, properties appearing in the subject/object positions are not preserved; moreover, possible RDFS properties are summarized just as any other data property, thus the schema is not preserved either. These summaries, built in time linear in M, the number of graph edges, may be too large; therefore, the authors propose an algorithm to reduce the summary to a target size specified by the user, by merging nodes having similar incoming and outgoing properties. The similarity is determined by a Jaccard index, approximated by MinHashing [56]; to efficiently compute the similarity between all pairs of summary nodes, locality-sensitive hashing [56] is used. The approach takes as input only instances and optional parameters for summary refinement, and returns an instance graph. This graph is further used to estimate cardinalities, easing query answering and evaluation.
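The MinHash approximation of the Jaccard index between the property sets of two summary nodes can be sketched as below. The hash family (SHA-256 salted with a seed), the signature length and the function names are assumptions made for illustration; they are not taken from [94], and the locality-sensitive hashing step is not shown.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """One member of an assumed hash family, obtained by salting SHA-256."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes: int = 128):
    """Signature of a non-empty set: the minimum hash value under each hash function."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b) -> float:
    """Fraction of signature positions on which the two sets agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b) -> float:
    """Exact Jaccard index, for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```

The estimate converges to the exact index as the number of hash functions grows, while signatures stay small enough to compare many pairs of summary nodes.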

13 http://abstat.disco.unimib.it/search


[77] combines structural non-quotient and statistical methods to create a summary of an RDF graph, which the authors call a relational schema. The initial summary is generated, in time linear in the RDF graph size (average complexity), by computing sets of properties joining on the subject, the so-called characteristic sets, denoted CS. A summary node is created for each CS, thus representing the nodes of G having the same outgoing properties. An edge labeled by a property p exists between two summary nodes if there exist two G nodes in their respective extents such that there exists an edge labeled by p in G between the two G nodes. This structural aspect therefore considers only the instance component of an RDF graph. In the second step, a short human-friendly label is computed for summary nodes and edges, by relying on type triples or, in their absence, on ontology information. To reduce the summary size, summary nodes are subsequently merged. In semantic merging, two summary nodes can be merged in two ways: (i) if they have the same label, taken from an ontology, or (ii) if their labels, originating from different ontologies, have a common superclass, and the generality score of this superclass is lower than a certain threshold. The generality score of an ontology class v is computed as the ratio between the number of instances of v’s subclasses and the total number of instances covered by the ontology to which v belongs. In structural merging, two summary nodes can be merged in two ways: (i) if they both have an incoming edge with the same property from another summary node, or (ii) if the TF/IDF similarity score of the properties in their respective CSs is higher than a given threshold. The merging order of the nodes affects the resulting summary.

The summary is modeled relationally: a table is created per summary node, with a column for each property in the CS represented by the summary node; the relationships between the nodes are stored as foreign keys. The chosen relational model proves challenging for storing the possibly highly heterogeneous graph structure inherent to RDF graphs, and it drives the modifications to the summary. First, the CS of each summary node may be the result of merging multiple summary nodes and thus their CSs. Therefore, a G node in the extent of a summary node may not have a value for each property in the CS of the summary node, and will have a NULL value in the table for each such property. Properties having few non-NULL values are deleted. Second, a single property in RDF may have multiple literal values, possibly of different types. In such cases, [77] chooses to add a column for each distinct literal value type per property, which, besides incurring space costs, may lead to more NULL values. Therefore, for each literal value type below the infrequency threshold, all triples are moved to a separate single PSO table. Finally, infrequent edges are deleted from the summary, while a summary node is deleted if the number of nodes it represents is below a threshold, with the exception of summary nodes which are referred to many times from other tables. The approach relies on end-users for choosing the right parameters, although the authors also propose an auto-tuning algorithm for determining the best value of the similarity threshold.
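The first step, building one summary node per characteristic set, can be sketched as follows. This is a simplified illustration and not the implementation of [77]: labeling, merging and the relational mapping are omitted, objects without outgoing properties (e.g., literals) get no summary node, and the function names are hypothetical.

```python
from collections import defaultdict


def characteristic_sets(triples):
    """Map each characteristic set (frozenset of properties) to its extent of subjects."""
    props = defaultdict(set)
    for s, p, _ in triples:
        props[s].add(p)
    extents = defaultdict(set)
    for s, ps in props.items():
        extents[frozenset(ps)].add(s)
    return dict(extents)


def summary_edges(triples, extents):
    """Edges between summary nodes, induced by the edges between nodes in their extents."""
    node_of = {s: cs for cs, subjects in extents.items() for s in subjects}
    # Objects that never occur as subjects have no characteristic set and are skipped.
    return {(node_of[s], p, node_of[o]) for s, p, o in triples if o in node_of}
```

Each key of the returned dictionary would correspond to one table of the relational schema, with one column per property of the characteristic set.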

[119] proposes a framework for mining equivalent structure patterns with equivalent semantic meaning. As it is common in RDF KBs to have different graph structures sharing the same meaning, the authors’ aim is to ease the end-user’s querying task. Instead of requiring users to have complete knowledge of the schema (enumerating in the query all possible semantically equivalent graph structures), the authors propose an approach that performs query rewriting, automatically exploiting other possible graph structures with the same meaning. To achieve that, they define the notion of semantic graph edit distance and present a framework that first tries to rewrite the input query into one considering semantic equivalences, and then finds the subgraphs minimizing the semantic graph edit distance. For efficiency, they build offline a semantic summary graph, over which they perform a two-level pruning at query time in order to finally provide answers. The semantic summary graph is a multi-layer graph whose first layer consists of the linked types of the instances (which they call semantic facts). They then abstract this graph in the layers above, replacing in each layer classes with their superclasses. Producing the summary does not require user input; however, the method can only be applied to fully typed RDF KBs.

Finally, the next works consider structural methods for the summarization of ABoxes (facts) in description logic KBs.

[25] proposes a method for compressing (hence summarizing) the ABox of a Horn-ALCHOI description logic KB, using the notion of ABox abstraction. Given an ABox A, for each A value v, a type pattern of the form $tp(v) = (tp_{\downarrow}, tp_{\rightarrow}, tp_{\leftarrow})$ is computed, where $tp_{\downarrow}$ denotes the explicit types v has in A (C’s such that C(v) ∈ A), $tp_{\rightarrow}$ the outgoing properties v has in A (R’s such that R(v, v′) ∈ A) and $tp_{\leftarrow}$ the incoming properties v has in A (S’s such that S(v′, v) ∈ A). These type patterns are then used to build the abstraction B of the ABox, which is an ABox itself; each such type pattern is used to represent all the ABox values that match it: for every type pattern $tp = (\{C_1, \ldots, C_m\}, \{R_1, \ldots, R_n\}, \{S_1, \ldots, S_l\})$, B contains $\{C_1(x_{tp}), \ldots, C_m(x_{tp}), R_1(x_{tp}, y^{R_1}_{tp}), \ldots, R_n(x_{tp}, y^{R_n}_{tp}), S_1(y^{S_1}_{tp}, x_{tp}), \ldots, S_l(y^{S_l}_{tp}, x_{tp})\}$. We remark that, given an ABox, its abstraction is simply obtained by traversing its facts. It is further shown in [25] how the abstraction of the ABox of an input Horn-ALCHOI KB can be gradually refined, using reasoning steps, to obtain the abstraction of the ABox of the saturation of the input KB.
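The construction of the abstraction from type patterns can be sketched as follows. This is an illustration under assumptions and not the implementation of [25]: concept assertions are modeled as pairs (C, v), role assertions as triples (R, v, v′), the fresh individuals are named by hypothetical string conventions, and distinct fresh individuals are used for outgoing and incoming roles. The refinement through reasoning steps is not covered.

```python
from collections import defaultdict


def type_patterns(concepts, roles):
    """Map each ABox value to (explicit types, outgoing roles, incoming roles)."""
    types, out, inc = defaultdict(set), defaultdict(set), defaultdict(set)
    for c, v in concepts:
        types[v].add(c)
    for r, v, w in roles:
        out[v].add(r)
        inc[w].add(r)
    values = set(types) | set(out) | set(inc)
    return {v: (frozenset(types[v]), frozenset(out[v]), frozenset(inc[v])) for v in values}


def abstraction(patterns):
    """Build the abstract ABox: one representative individual per distinct type pattern."""
    distinct = sorted(set(patterns.values()),
                      key=lambda tp: tuple(sorted(part) for part in tp))
    concept_facts, role_facts = set(), set()
    for i, (types, out, inc) in enumerate(distinct):
        x = f"x{i}"  # representative of all the values matching this pattern
        concept_facts |= {(c, x) for c in types}
        role_facts |= {(r, x, f"y{i}_out_{r}") for r in out}
        role_facts |= {(s, f"y{i}_in_{s}", x) for s in inc}
    return concept_facts, role_facts
```

As stated above, a single traversal of the facts suffices to compute the patterns, and the abstraction has one group of facts per distinct pattern regardless of how many values share it.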

ABox summaries have also been considered for the data management tasks of consistency checking [24, 23] and query answering [19, 20] in SHIN KBs. In these works, the notion of a summary ABox is very general: an ABox A′ is a summary of another ABox A w.r.t. some function f that maps A values to A′ ones whenever f defines a homomorphism from A to A′. Based on this property of ABox summaries, the purpose of [24, 23, 19, 20] is to study how SHIN consistency and query answering reasoning techniques can be performed correctly and more efficiently than when handling the input ABox.

9 Conclusions and Future Work

In this survey, we present a comprehensive state of the art in semantic graph summarization. To this end, we introduce a taxonomy of the works in the area (based on different properties/criteria that the works adhere to), which can help practitioners and researchers determine the method most suitable for their data and goal. In this taxonomy, we group the main methods of the algorithms presented into four main categories: structural, statistical, pattern-mining and hybrid, identifying subcategories whenever possible. In addition, we classify works in the field according to their input, output, availability on the internet and purpose, showing the rapidly evolving dynamics in the area.

In general, RDF graph summaries can be useful in various application scenarios ranging from data understanding to query answering and from RDF graph data indexing to RDF graph visualization. Structural quotient summaries are most applicable to indexing and query answering through graph reduction; this holds especially for quotients built through equivalence relations such as bisimilarity (possibly bounded). Non-quotient summaries mostly target visualization, schema discovery and data understanding. Pattern mining summaries provide in many cases logical rules besides the summary graph as part of the final result, so they could be more useful in RDF graph compression scenarios. Summaries could also be useful in data integration scenarios [48], where instead of generating mappings [63], [65] between data source schemas, summaries could be used to drive the definition of the mapping. Extending this to a scenario where the sources can also evolve [47], [46], summaries can play a key role in schema understanding and mapping redefinition. Different RDF summarization scenarios each bring their very specific requirements (e.g., whether the summary size is bounded or not, whether a schema is present or not, whether to summarize the data or the schema, whether the summary needs to reflect all the structure or not, etc.); in many cases more than one algorithm or family of algorithms will provide suitable results. The goal of our survey was to provide enough information to the users of these algorithms (i.e., application developers or researchers) to be able to easily refer to the characteristics of each approach, and evaluate its suitability to their application requirements.

Despite the considerable amount of work in the area of semantic graph summarization, there are still many important open problems in the field. Below we mention two of the most notable ones.

The first one is the quality of the produced RDF summary. Since the resulting summary varies, across the different algorithms, among a selection of nodes, a quotient, some other frequency structure or a complete graph with various types of nodes and links, identifying a single gold standard is a complex task. Moreover, even domain experts in many cases disagree on which specific elements should be selected in a semantic summary. However, having in mind our proposed taxonomy, we believe that the next step in the area is the establishment of different gold standards specific to each subcategory, focusing on a specific purpose, input and output.

Another open issue we perceive as really important is the dynamic nature of all these datasets. As new information becomes available due to new experimental evidence or observations, and erroneous past conceptualizations are constantly updated, many datasets are rapidly changing. However, summarization in most cases, and especially for big data sources, is a time-consuming process, and its result should be constantly updated to facilitate data exploration. As such, novel ideas should focus on this dynamicity, augmenting the exploration experience of end users.

The work in RDF graph summarization gains more importance as RDF knowledge bases become larger and more connected, and thus we expect to see additional advances in the field in the near future.

Acknowledgments This research is implemented through the IKY scholarships programme and co-financed by the European Union and Greek national funds through the action entitled “Reinforcement of Postdoctoral Researchers”, in the framework of the Operational Programme “Human Resources Development Program, Education and Life-long Learning” of the National Strategic Reference Framework (NSRF) 2014-2020.

References

[1] Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley (1995)

[2] Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, DC, USA, May 26-28, 1993, pp. 207–216 (1993)

[3] Alzogbi, A., Lausen, G.: Similar structures inside rdf-graphs. In: Proceedings of the WWW2013 Workshop on Linked Data on the Web, Rio de Janeiro, Brazil, 14 May, 2013 (2013)

[4] Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press (2003)

[5] Basca, C., Bernstein, A.: Avalanche: Putting the spirit of the web back into semantic web querying. In: Proceedings of the ISWC 2010 Posters & Demonstrations Track: Collected Abstracts, Shanghai, China, November 9, 2010 (2010)

[6] Blom, S., Orzan, S.: A distributed algorithm for strong bisimulation reduction of state spaces. STTT 7(1), 74–86 (2005)

[7] Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)

[8] Boldi, P., Vigna, S.: Axioms for centrality. Internet Mathematics 10(3-4), 222–262 (2014)

[9] Bursztyn, D., Goasdoue, F., Manolescu, I.: Efficient query answering in dl-lite through FOL reformulation (extended abstract). In: Proceedings of the 28th International Workshop on Description Logics, Athens, Greece, June 7-10, 2015 (2015)

[10] Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning and efficient query answering in description logics: The DL-Lite family. J. Autom. Reasoning 39(3), 385–429 (2007)

[11] Campinas, S., Perry, T., Ceccarelli, D., Delbru, R., Tummarello, G.: Introducing RDF graph summary with application to assisted SPARQL formulation. In: 23rd International Workshop on Database and Expert Systems Applications, DEXA 2012, Vienna, Austria, September 3-7, 2012, pp. 261–266 (2012)

[12] Cebiric, S., Goasdoue, F., Guzewicz, P., Manolescu, I.: Compact Summaries of Rich Heterogeneous Graphs. Research Report RR-8920, INRIA Saclay ; Universite Rennes 1 (2017). URL https://hal.inria.fr/hal-01325900


[13] Cebiric, S., Goasdoue, F., Manolescu, I.: Query-oriented summarization of RDF graphs. PVLDB 8(12), 2012–2015 (2015)

[14] Cebiric, S., Goasdoue, F., Manolescu, I.: Query-oriented summarization of RDF graphs. In: Data Science - 30th British International Conference on Databases, BICOD 2015, Edinburgh, UK, July 6-8, 2015, Proceedings, pp. 87–91 (2015)

[15] Chen, C., Lin, C.X., Fredrikson, M., Christodorescu, M., Yan, X., Han, J.: Mining graph patterns efficiently via randomized summaries. PVLDB 2(1) (2009)

[16] Chen, C., Yan, X., Zhu, F., Han, J., Yu, P.S.: Graph OLAP: towards online analytical processing on graphs. In: Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008), December 15-19, 2008, Pisa, Italy (2008)

[17] Consens, M.P., Fionda, V., Khatchadourian, S., Pirro, G.: S+epps: Construct and explore bisimulation summaries, plus optimize navigational queries; all on existing SPARQL systems. PVLDB 8(12), 2028–2031 (2015)

[18] Consens, M.P., Miller, R.J., Rizzolo, F., Vaisman, A.A.: Exploring XML web collections with DescribeX. TWEB 4(3), 11:1–11:46 (2010)

[19] Dolby, J., Fokoue, A., Kalyanpur, A., Kershenbaum, A., Schonberg, E., Srinivas, K., Ma, L.: Scalable semantic retrieval through summarization and refinement. In: Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, July 22-26, 2007, Vancouver, British Columbia, Canada, pp. 299–304 (2007)

[20] Dolby, J., Fokoue, A., Kalyanpur, A., Schonberg, E., Srinivas, K.: Scalable highly expressive reasoner (SHER). J. Web Sem. 7(4), 357–361 (2009)

[21] Dudas, M., Svatek, V., Mynarz, J.: Dataset summary visualization with lodsight. In: The Semantic Web: ESWC 2015 Satellite Events - ESWC 2015 Satellite Events Portoroz, Slovenia, May 31 - June 4, 2015, Revised Selected Papers, pp. 36–40 (2015)

[22] Fan, W., Li, J., Wang, X., Wu, Y.: Query preserving graph compression. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24, 2012, pp. 157–168 (2012)

[23] Fokoue, A., Kershenbaum, A., Ma, L.: SHIN abox reduction. In: Proceedings of the 2006 International Workshop on Description Logics (DL2006), Windermere, Lake District, UK, May 30 - June 1, 2006 (2006)

[24] Fokoue, A., Kershenbaum, A., Ma, L., Schonberg, E., Srinivas, K.: The summary abox: Cutting ontologies down to size. In: The Semantic Web - ISWC 2006, 5th International Semantic Web Conference, ISWC 2006, Athens, GA, USA, November 5-9, 2006, Proceedings, pp. 343–356 (2006)

[25] Glimm, B., Kazakov, Y., Liebig, T., Tran, T., Vialard, V.: Abstraction refinement for ontology materialization. In: The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part II, pp. 180–195 (2014)

[26] Goasdoue, F., Karanasos, K., Leblay, J., Manolescu, I.: View selection in semantic web databases. PVLDB 5(2), 97–108 (2011)

[27] Goasdoue, F., Manolescu, I., Roatis, A.: Efficient query answering against dynamic RDF databases. In: Joint 2013 EDBT/ICDT Conferences, EDBT ’13 Proceedings, Genoa, Italy, March 18-22, 2013, pp. 299–310 (2013)

[28] Goldman, R., Widom, J.: Dataguides: Enabling query formulation and optimization in semistructured databases. In: VLDB’97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, pp. 436–445 (1997)

[29] Gurajada, S., Seufert, S., Miliaraki, I., Theobald, M.: Using graph summarization for join-ahead pruning in a distributed RDF engine. In: Proceedings of the Sixth Workshop on Semantic Web Information Management, SWIM 2014, Snowbird, UT, USA, June 22-27, 2014 (2014)

[30] Hakimi, S.L.: Steiner’s problem in graphs and its implications. Networks 1(2), 113–133 (1971)

[31] Han, J., Pei, J., Yin, Y., Mao, R.: Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Min. Knowl. Discov. 8(1), 53–87 (2004)

[32] Henzinger, M.R., Henzinger, T.A., Kopke, P.W.: Computing simulations on finite and infinite graphs. In: FOCS (1995)

[33] Hose, K., Schenkel, R.: Towards benefit-based RDF source selection for SPARQL queries. In: Proceedings of the 4th International Workshop on Semantic Web Information Management, SWIM 2012, Scottsdale, AZ, USA, May 20, 2012, p. 2 (2012)

[34] Jiang, X., Zhang, X., Gao, F., Pu, C., Wang, P.: Graph compression strategies for instance-focused semantic mining. In: Linked Data and Knowledge Graph - 7th Chinese Semantic Web Symposium and 2nd Chinese Web Science Conference, CSWS 2013, Shanghai, China, August 12-16, 2013. Revised Selected Papers, pp. 50–61 (2013)

[35] Joshi, A.K., Hitzler, P., Dong, G.: Towards logical linked data compression. In: Proceedings of the Joint Workshop on Large and Heterogeneous Data and Quantitative Formalization in the Semantic Web, LHD+SemQuant2012, at the 11th International Semantic Web Conference, ISWC2012. Citeseer (2012)

[36] Joshi, A.K., Hitzler, P., Dong, G.: Logical linked data compression. In: The Semantic Web: Semantics and Big Data, 10th International Conference, ESWC 2013, Montpellier, France, May 26-30, 2013. Proceedings, pp. 170–184 (2013)


[37] Kaushik, R., Bohannon, P., Naughton, J.F., Korth, H.F.: Covering indexes for branching path queries. In: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, June 3-6, 2002, pp. 133–144 (2002)

[38] Kaushik, R., Shenoy, P., Bohannon, P., Gudes, E.: Exploiting local similarity for indexing paths in graph-structured data. In: Proceedings of the 18th International Conference on Data Engineering, San Jose, CA, USA, February 26 - March 1, 2002, pp. 129–140 (2002)

[39] Kellou-Menouer, K., Kedad, Z.: Schema discovery in RDF data sources. In: Conceptual Modeling - 34th International Conference, ER 2015, Stockholm, Sweden, October 19-22, 2015, Proceedings (2015)

[40] Khan, A., Bhowmick, S.S., Bonchi, F.: Summarizing static and dynamic big graphs. PVLDB 10(12), 1981–1984 (2017)

[41] Khan, K., Nawaz, W., Lee, Y.: Set-based approximate approach for lossless graph summarization. Computing 97(12), 1185–1207 (2015)

[42] Khatchadourian, S., Consens, M.P.: Explod: Summary-based exploration of interlinking and RDF usage in the linked open data cloud. In: The Semantic Web: Research and Applications, 7th Extended Semantic Web Conference, ESWC 2010, Heraklion, Crete, Greece, May 30 - June 3, 2010, Proceedings, Part II, pp. 272–287 (2010)

[43] Khatchadourian, S., Consens, M.P.: Exploring RDF usage and interlinking in the Linked Open Data Cloud using ExpLOD. In: WWW2010 Workshop on Linked Data on the Web (LDOW) (2010)

[44] Khatchadourian, S., Consens, M.P.: Understanding billions of triples with usage summaries. Semantic Web Challenge (2011)

[45] Khatchadourian, S., Consens, M.P.: Constructing bisimulation summaries on a multi-core graph processing framework. In: Proceedings of the Third International Workshop on Graph Data Management Experiences and Systems, GRADES 2015, Melbourne, VIC, Australia, May 31 - June 4, 2015 (2015)

[46] Kondylakis, H., Plexousakis, D.: Ontology evolution in data integration: Query rewriting to the rescue. In: Conceptual Modeling - ER 2011, 30th International Conference, ER2011, Brussels, Belgium, October 31 - November 3, 2011. Proceedings, pp. 393–401 (2011)

[47] Kondylakis, H., Plexousakis, D.: Ontology evolution: Assisting query migration. In: Conceptual Modeling - 31st International Conference ER 2012, Florence, Italy, October 15-18, 2012. Proceedings, pp. 331–344 (2012)

[48] Kondylakis, H., Plexousakis, D.: Ontology evolution without tears. J. Web Sem. 19, 42–58 (2013)

[49] Konrath, M., Gottron, T., Scherp, A.: Schemex–web-scale indexed schema extraction of Linked Open Data. Semantic Web Challenge, Submission to the Billion Triple Track pp. 52–58 (2011)

[50] Konrath, M., Gottron, T., Staab, S., Scherp, A.: Schemex - efficient construction of a data catalogue by stream-based indexing of linked data. J. Web Sem. 16, 52–58 (2012)

[51] Koutra, D., Kang, U., Vreeken, J., Faloutsos, C.: Summarizing and understanding large graphs. Statistical Analysis and Data Mining 8(3), 183–202 (2015)

[52] Kyrola, A., Blelloch, G.E., Guestrin, C.: Graphchi: Large-scale graph computation on just a PC. In: 10th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2012, Hollywood, CA, USA, October 8-10, 2012, pp. 31–46 (2012)

[53] Lanti, D., Rezk, M., Xiao, G., Calvanese, D.: The NPD benchmark: Reality check for OBDA systems. In: Proceedings of the 18th International Conference on Extending Database Technology, EDBT 2015, Brussels, Belgium, March 23-27, 2015, pp. 617–628 (2015)

[54] Le, W., Li, F., Kementsietsidis, A., Duan, S.: Scalable keyword search on large RDF data. IEEE TKDE 26(11) (2014)

[55] LeFevre, K., Terzi, E.: Grass: Graph structure summarization. In: Proceedings of the SIAM International Conference on Data Mining, SDM 2010, April 29 - May 1, 2010, Columbus, Ohio, USA, pp. 454–465 (2010)

[56] Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets, 2nd Ed. Cambridge University Press (2014)

[57] Lin, S.D., Yeh, M.Y., Li, C.T.: Sampling and summarization for social networks (tutorial) (2013)

[58] Liu, X., Tian, Y., He, Q., Lee, W., McPherson, J.: Distributed graph summarization. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM 2014, Shanghai, China, November 3-7, 2014, pp. 799–808 (2014)

[59] Liu, Y., Safavi, T., Dighe, A., Koutra, D.: Graph summarization methods and applications: A survey. ACM Comput. Surv. 51(3), 62:1–62:34 (2018)

[60] Louati, A., Aufaure, M., Lechevallier, Y.: Graph aggregation: Application to social networks. In: Advances in Theory and Applications of High Dimensional and Symbolic Data Analysis, HDSDA 2011, October 27-30, 2011, Beihang University, Beijing, China, pp. 157–177 (2011)

[61] Lucchese, C., Orlando, S., Perego, R.: A unifying framework for mining approximate top-k binary patterns. IEEE Trans. Knowl. Data Eng. 26(12), 2900–2913 (2014)

[62] Luo, Y., Fletcher, G.H.L., Hidders, J., Wu, Y., Bra, P.D.: External memory k-bisimulation reduction of big graphs. In: 22nd ACM International Conference on Information and Knowledge Management, CIKM’13, San Francisco, CA, USA, October 27 - November 1, 2013, pp. 919–928 (2013)

[63] Marketakis, Y., Minadakis, N., Kondylakis, H., Konsolaki, K., Samaritakis, G., Theodoridou, M., Flouris, G., Doerr, M.: X3ML mapping framework for information integration in cultural heritage and beyond. Int. J. on Digital Libraries 18(4), 301–319 (2017)

[64] Milo, T., Suciu, D.: Index structures for path expressions. In: Database Theory - ICDT ’99, 7th International Conference, Jerusalem, Israel, January 10-12, 1999, Proceedings, pp. 277–295 (1999)

[65] Minadakis, N., Marketakis, Y., Kondylakis, H., Flouris, G., Theodoridou, M., de Jong, G., Doerr, M.: X3ML framework: An effective suite for supporting data mappings. In: Proceedings of the Workshop on Extending, Mapping and Focusing the CRM co-located with 19th International Conference on Theory and Practice of Digital Libraries (2015), Poznan, Poland, September 17, 2015, pp. 1–12 (2015)

[66] Motta, E., Mulholland, P., Peroni, S., d’Aquin, M., Gomez-Perez, J.M., Mendez, V., Zablith, F.: A novel approach to visualizing and navigating ontologies. In: The Semantic Web - ISWC 2011 - 10th International Semantic Web Conference, Bonn, Germany, October 23-27, 2011, Proceedings, Part I, pp. 470–486 (2011)

[67] Murtagh, F.: A survey of recent advances in hierarchical clustering algorithms. Comput. J. 26(4), 354–359 (1983)

[68] Mynarz, J., Dudas, M., Tomeo, P., Svatek, V.: Generating examples of paths summarizing RDF datasets. In: Joint Proceedings of the Posters and Demos Track of the 12th International Conference on Semantic Systems - SEMANTiCS2016 and the 1st International Workshop on Semantic Change & Evolving Semantics (SuCCESS’16) co-located with the 12th International Conference on Semantic Systems (SEMANTiCS 2016), Leipzig, Germany, September 12-15, 2016 (2016)

[69] Navlakha, S., Rastogi, R., Shrivastava, N.: Graph summarization with bounded error. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008 (2008)

[70] Neumann, T., Weikum, G.: The RDF-3X engine for scalable management of RDF data. VLDB J. 19(1), 91–113 (2010)

[71] Paige, R., Tarjan, R.E.: Three partition refinement algorithms. SIAM J. Comput. 16(6), 973–989 (1987)

[72] Paige, R., Tarjan, R.E.: Three partition refinement algorithms. SIAM J. Comput. 16(6), 973–989 (1987)

[73] Palmonari, M., Rula, A., Porrini, R., Maurino, A., Spahiu, B., Ferme, V.: ABSTAT: linked data summaries with abstraction and statistics. In: The Semantic Web: ESWC 2015 Satellite Events - ESWC 2015 Satellite Events Portoroz, Slovenia, May 31 - June 4, 2015, Revised Selected Papers, pp. 128–132 (2015)

[74] Pan, J.Z., Gomez-Perez, J.M., Ren, Y., Wu, H., Wang, H., Zhu, M.: Graph pattern based RDF data compression. In: Semantic Technology - 4th Joint International Conference, JIST 2014, Chiang Mai, Thailand, November 9-11, 2014. Revised Selected Papers, pp. 239–256 (2014)

[75] Pappas, A., Troullinou, G., Roussakis, G., Kondylakis, H., Plexousakis, D.: Exploring importance measures for summarizing RDF/S kbs. In: The Semantic Web - 14th International Conference, ESWC 2017, Portoroz, Slovenia, May 28 - June 1, 2017, Proceedings, Part I, pp. 387–403 (2017)

[76] Peroni, S., Motta, E., d’Aquin, M.: Identifying key concepts in an ontology, through the integration of cognitive principles with statistical and topological measures. In: The Semantic Web, 3rd Asian Semantic Web Conference, ASWC 2008, Bangkok, Thailand, December 8-11, 2008. Proceedings, pp. 242–256 (2008)

[77] Pham, M., Passing, L., Erling, O., Boncz, P.A.: Deriving an emergent relational schema from RDF data. In: Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, May 18-22, 2015, pp. 864–874 (2015)

[78] Picalausa, F., Luo, Y., Fletcher, G.H.L., Hidders, J., Vansummeren, S.: A structural approach to indexing triples. In: The Semantic Web: Research and Applications - 9th Extended Semantic Web Conference, ESWC 2012, Heraklion, Crete, Greece, May 27-31, 2012. Proceedings (2012)

[79] Pires, C.E.S., Queiroz-Sousa, P.O., Kedad, Z., Salgado, A.C.: Summarizing ontology-based schemas in PDMS. In: Workshops Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA, pp. 239–244 (2010)

[80] Pouriyeh, S.A., Allahyari, M., Kochut, K., Arabnia, H.R.: A comprehensive survey of ontology summarization: Measures and methods. CoRR abs/1801.01937 (2018)

[81] Presutti, V., Aroyo, L., Adamou, A., Schopman, B.A.C., Gangemi, A., Schreiber, G.: Extracting core knowledge from linked data. In: Proceedings of the Second International Workshop on Consuming Linked Data (COLD2011), Bonn, Germany, October 23, 2011 (2011)

[82] Queiroz-Sousa, P.O., Salgado, A.C., Pires, C.E.S.: A method for building personalized ontology summaries. JIDM 4(3), 236–250 (2013)

[83] Qun, C., Lim, A., Ong, K.W.: D(k)-index: An adaptive structural summary for graph-structured data. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003, pp. 134–144 (2003)

[84] Riondato, M., García-Soriano, D., Bonchi, F.: Graph summarization with quality guarantees. Data Mining and Knowledge Discovery 31(2), 314–349 (2017)


[85] Rissanen, J.: Modeling by shortest data description. Automatica 14(5), 465–471 (1978)

[86] Rudolf, M., Paradies, M., Bornhovd, C., Lehner, W.: Synopsys: large graph analytics in the SAP HANA database through summarization. In: First International Workshop on Graph Data Management Experiences and Systems, GRADES 2013, co-located with SIGMOD/PODS 2013, New York, NY, USA, June 24, 2013, p. 16 (2013)

[87] Schatzle, A., Neu, A., Lausen, G., Przyjaciel-Zablocki, M.: Large-scale bisimulation of RDF graphs. In: Proceedings of the Fifth Workshop on Semantic Web Information Management, SWIM@SIGMOD Conference 2013, New York, NY, USA, June 23, 2013, pp. 1:1–1:8 (2013)

[88] Schmachtenberg, M., Bizer, C., Paulheim, H.: State of the LOD Cloud 2014. http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/. Accessed: 2017-03-30

[89] Schwarte, A., Haase, P., Hose, K., Schenkel, R., Schmidt, M.: Fedx: Optimization techniques for federated query processing on linked data. In: The Semantic Web - ISWC 2011 - 10th International Semantic Web Conference, Bonn, Germany, October 23-27, 2011, Proceedings, Part I, pp. 601–616 (2011)

[90] Seah, B., Bhowmick, S.S., Jr., C.F.D., Yu, H.: FUSE: a profit maximization approach for functional summarization of biological networks. BMC Bioinformatics 13(S-3), S10 (2012)

[91] Song, Q., Wu, Y., Dong, X.L.: Mining summaries for knowledge graph search. In: IEEE 16th International Conference on Data Mining, ICDM 2016, December 12-15, 2016, Barcelona, Spain, pp. 1215–1220 (2016)

[92] Spahiu, B., Porrini, R., Palmonari, M., Rula, A., Maurino, A.: ABSTAT: ontology-driven linked data summaries with pattern minimalization. In: SumPre (2016)

[93] Spahiu, B., Porrini, R., Palmonari, M., Rula, A., Maurino, A.: ABSTAT: ontology-driven linked data summaries with pattern minimalization. In: The Semantic Web - ESWC 2016 Satellite Events, Heraklion, Crete, Greece, May 29 - June 2, 2016, Revised Selected Papers, pp. 381–395 (2016)

[94] Stefanoni, G., Motik, B., Kostylev, E.V.: Estimating the Cardinality of Conjunctive Queries over RDF Data Using Graph Summarisation. Research report, University of Oxford (2017). URL https://www.cs.ox.ac.uk/isg/tools/SumRDF/paper-tr.pdf

[95] Stocker, M., Seaborne, A., Bernstein, A., Kiefer, C., Reynolds, D.: SPARQL basic graph pattern optimization using selectivity estimation. In: Proceedings of the 17th International Conference on World Wide Web, WWW 2008, Beijing, China, April 21-25, 2008, pp. 595–604 (2008)

[96] Sydow, M., Pikula, M., Schenkel, R.: The notion of diversity in graphical entity summarisation on semantic knowledge graphs. J. Intell. Inf. Syst. 41(2), 109–149 (2013)

[97] Tian, Y., Hankins, R.A., Patel, J.M.: Efficient aggregation for graph summarization. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, pp. 567–580 (2008)

[98] Tian, Y., Patel, J.M.: Interactive graph summarization. In: Link Mining: Models, Algorithms, and Applications, pp. 389–409. Springer (2010)

[99] Tran, T., Ladwig, G., Rudolph, S.: Managing structured and semistructured RDF data using structure indexes. IEEE TKDE 25(9) (2013)

[100] Troullinou, G., Kondylakis, H., Daskalaki, E., Plexousakis, D.: RDF digest: Efficient summarization of RDF/S kbs. In: The Semantic Web. Latest Advances and New Domains - 12th European Semantic Web Conference, ESWC 2015, Portoroz, Slovenia, May 31 - June 4, 2015. Proceedings, pp. 119–134 (2015)

[101] Troullinou, G., Kondylakis, H., Daskalaki, E., Plexousakis, D.: RDF digest: Ontology exploration using summaries. In: Proceedings of the ISWC 2015 Posters & Demonstrations Track co-located with the 14th International Semantic Web Conference (ISWC-2015), Bethlehem, PA, USA, October 11, 2015 (2015)

[102] Troullinou, G., Kondylakis, H., Daskalaki, E., Plexousakis, D.: Ontology understanding without tears: The summarization approach. Semantic Web 8(6), 797–815 (2017)

[103] Troullinou, G., Kondylakis, H., Stefanidis, K., Plexousakis, D.: Exploring RDFS kbs using summaries. In: The Semantic Web - ISWC 2018 - 17th International Semantic Web Conference, Monterey, CA, USA, October 8-12, 2018, Proceedings, Part I, pp. 268–284 (2018)

[104] Troullinou, G., Kondylakis, H., Stefanidis, K., Plexousakis, D.: Rdfdigest+: A summary-driven system for kbs exploration. In: Proceedings of the ISWC 2018 Posters & Demonstrations, Industry and Blue Sky Ideas Tracks co-located with 17th International Semantic Web Conference (ISWC 2018), Monterey, USA, October 8th - to - 12th, 2018 (2018)

[105] Udrea, O., Pugliese, A., Subrahmanian, V.S.: GRIN: A graph based RDF index. In: Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, July 22-26, 2007, Vancouver, British Columbia, Canada, pp. 1465–1470 (2007)

[106] Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33(8), 103–111 (1990)

[107] W3C: Resource description framework. http://www.w3.org/RDF/

[108] W3C: Owl 1 web ontology language. https://www.w3.org/TR/owl-features/ (2012)


[109] W3C: Owl 2 web ontology language. https://www.w3.org/TR/owl2-overview/ (2012)

[110] W3C: SPARQL 1.1 query language. http://www.w3.org/TR/sparql11-query/ (2013)

[111] Wu, G., Li, J., Feng, L., Wang, K.: Identifying potentially important concepts and relations in an ontology. In: The Semantic Web - ISWC 2008, 7th International Semantic Web Conference, ISWC 2008, Karlsruhe, Germany, October 26-30, 2008. Proceedings, pp. 33–49 (2008)

[112] Yan, X., Yu, P.S., Han, J.: Graph indexing: A frequent structure-based approach. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, June 13-18, 2004, pp. 335–346 (2004)

[113] You, J., Pan, Q., Shi, W., Zhang, Z., Hu, J.: Towards graph summary and aggregation: A survey. In: Social Media Retrieval and Mining, pp. 3–12. Springer (2013)

[114] Zhang, H., Duan, Y., Yuan, X., Zhang, Y.: ASSG: adaptive structural summary for RDF graph data. In: Proceedings of the ISWC 2014 Posters & Demonstrations Track a track within the 13th International Semantic Web Conference, ISWC 2014, Riva del Garda, Italy, October 21, 2014, pp. 233–236 (2014)

[115] Zhang, N., Tian, Y., Patel, J.M.: Discovery-driven graph summarization. In: Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA, pp. 880–891 (2010)

[116] Zhang, X., Cheng, G., Ge, W., Qu, Y.: Summarizing vocabularies in the global semantic web. J. Comput. Sci. Technol. 24(1), 165–174 (2009)

[117] Zhang, X., Cheng, G., Qu, Y.: Ontology summarization based on rdf sentence graph. In: Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007, pp. 707–716 (2007)

[118] Zhao, P., Yu, J.X., Yu, P.S.: Graph indexing: Tree + delta >= graph. In: Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007 (2007)

[119] Zheng, W., Zou, L., Peng, W., Yan, X., Song, S., Zhao, D.: Semantic SPARQL similarity search over RDF knowledge graphs. PVLDB 9(11), 840–851 (2016)

[120] Zneika, M., Lucchese, C., Vodislav, D., Kotzinos, D.: RDF graph summarization based on approximate patterns. In: Information Search, Integration, and Personalization - 10th International Workshop, ISIP 2015, Grand Forks, ND, USA, October 1-2, 2015, Revised Selected Papers, pp. 69–87 (2015)

[121] Zneika, M., Lucchese, C., Vodislav, D., Kotzinos, D.: Summarizing linked data RDF graphs using approximate graph pattern mining. In: Proceedings of the 19th International Conference on Extending Database Technology, EDBT 2016, Bordeaux, France, March 15-16, 2016, pp. 684–685 (2016)

[122] Zneika, M., Vodislav, D., Kotzinos, D.: Quality Metrics For RDF Graph Summarization. Semantic Web Journal (SWJ) accepted, to appear (2018)


| Work | Method | Input requirements | Purpose | Output type | Output nature | System - Theory |
|---|---|---|---|---|---|---|
| Dataguide [28] | Structural non-quotient | None | Indexing, query answering | Single root, node- and edge-labeled graphs | Instance | System |
| Rudolf et al. [86] | | | Visualization | Property graph | Instance | System |
| Chen et al. Graph OLAP [16] | | None | OLAP | Multiple non-RDF graphs | Multiple Instances | Theory |
| Zhao et al. [118] | Pattern mining | None | Indexing, graph containment queries | Set of frequent trees | Instance | Theory |
| Yan et al. [112] | Pattern mining | Parameterized user input | Indexing, query answering | Tree | Instance | Theory |
| Koutra et al. [51] | Clustering, pattern mining | None | Visualization | Graph | Instance | Theory |
| Khan et al. [41] | Structural non-quotient, data mining | Required degree threshold, overlap ratio | | | | |
| Navlakha et al. [69] | | Optional user-specified bounded error parameter | | | | |
| LeFevre et al. [55], Riondato et al. [84] | | Required number of summary nodes, size of the extent of summary nodes | Answering adjacency, degree and centrality queries | | | |
| Chen et al. Randomized summaries [15] | Structural non-quotient, pattern mining | Required support threshold | | | | |

Table 2: Other graph summary proposals.


| Work | RDF input component | Input requirements | Purpose | Output type | Output nature | System - Theory |
|---|---|---|---|---|---|---|
| ExpLOD [42] [43] | Instance | None | Data exploration, visualization | Graph | Instance and schema | System |
| Campinas et al. [11] | Instance | None | Query formulation | RDF graph | Instance and schema | Theory |
| Consens et al. [17, 45] | Instance | None | Query answering | RDF graph | Instance and schema | System |
| Khatchadourian et al. [44] | Instance | None | Data exploration, visualization | | | Theory |
| Schatzle et al. [87] | Instance | None | Graph reduction | | | Theory |
| ASSG [114] | Instance | Required user-selected queries | Query answering | Graph | Compressed graph | Theory |
| Cebiric et al. [12] | Instance and schema | None | Query optimization, query formulation, visualization | RDF graph | Instance and schema | System |
| Jiang et al. [34] | Instance | None | Semantic mining | Labeled graph | Instance | Theory |
| Picalausa et al. [78] | Instance | None | Indexing, query answering | | | System |
| Tran et al. [99] | Instance | Optional parameters FW/BW/FB and neighborhood size | Indexing, data partitioning, query processing | | | Theory |

Table 3: Structural quotient RDF summaries.



| Work | RDF input component | Input requirements | Purpose | Output type | Output nature | System - Theory |
|---|---|---|---|---|---|---|
| SchemEX [49, 50] | Instance | Data stream window size | Indexing | RDF graph | Instance | Theory |
| RDF Sentence Graph [117], [116] | Schema | Required schema, Parameterized user input, RDF/OWL | Visualization | Labeled graph | Schema | System |
| KCE [76], [66] | Instance and schema | Required schema, Parameterized user input, RDF/OWL | Visualization | Isolated nodes | Schema nodes | System |
| RDFDigest [102], [100], [75] | Instance and schema | Required schema, Parameterized user input, RDF/OWL, Semantics-aware, Handle implicit data | Visualization, query answering tasks | Labeled graph | Schema | System |
| Queiroz et al. [82] | Schema | Required schema, Parameterized user input, RDF/OWL | Visualization | Labeled graph | Schema | System |
| Sydow et al. [96] | Instance | Entity of interest | Visualization | RDF graph | Instance | Theory |
| Gurajada et al. [29] | Instance | None | Query answering | RDF graph | Instance | System |
| Kellou et al. [39] | Instance and Schema | None | Schema discovery | Graph | Schema | Theory |
| Le et al. [54] | Instance | Required neighborhood size | Indexing, keyword queries | Partitioned RDF graph | Instance | Theory |
| Udrea et al. [105] | Instance | Assumes saturated input graph, Required size k of the input graph partition | Indexing, visual querying | Balanced binary tree (leaves = partition over resources) | Instance | System |

Table 4: Structural non-quotient RDF summaries.


| Work | RDF input component | Input requirements | Purpose | Output type | Output nature | System - Theory |
|---|---|---|---|---|---|---|
| Zneika et al. [120, 121] | Instance | Optional user parameters | Query answering | RDF graph | Instance and schema | Theory |
| Joshi et al. [36, 35] | Instance | None | Compression | Graph and logical rules | Instance | Theory |
| Pan et al. [74] | Instance | None | Compression | Graph and logical rules | Instance | Theory |
| Song et al. [91] | Instance | Bounded hop neighbors d, maximum size of patterns K | Query answering | Graph patterns | Schema | Theory |

Table 5: Pattern mining RDF summaries.



| Work | RDF input component | Input requirements | Purpose | Output type | Output nature | System - Theory |
|---|---|---|---|---|---|---|
| Hose et al. [33] | Instance | None | Source selection | Statistical information | Instance | Theory |
| Wu et al. [111] | Schema | Requires schema, RDF/OWL, minor user input | Visualization | Isolated schema nodes | Schema | Theory |
| Pires et al. [79] | Schema | Requires schema, OWL | Query answering tasks | Labeled graph | Schema | System |
| LODSight [21] | Instance and schema | None | Compression, Visualization | Labeled graph | Instance | System |
| Mynarz et al. [68] | Instance and schema | Required schema summary patterns, the number (k) of selected examples | Understanding of dataset, Visualization | Labeled graphs | Instance | System |
| Presutti et al. [81] | Instance and schema | None | Querying Dataset | Labeled graphs | Construct an ontology and the corresponding instances that summarize key features of dataset and identify the core KPs | Theory |

Table 6: Statistical RDF summaries.


| Work | Method | RDF input component | Input requirements | Purpose | Output type | Output nature | System - Theory |
|---|---|---|---|---|---|---|---|
| Alzogbi et al. [3] | Structural quotient, clustering | Instance | None | Compression | Graph | Instance and schema | Theory |
| Stefanoni et al. [94] | | Instance | Optional parameters for summary refinement | Conjunctive query cardinality estimation | | | |
| ABSTAT [73] [92] | Statistical, pattern mining | Instance and schema | RDF/OWL, Semantic-aware, Handles implicit data | Visualization, schema discovery | Labeled graph (patterns) | Schema graph (Abstract Knowledge Patterns) | System |
| Pham et al. [77] | Structural non-quotient, statistical | Instance and schema | Required max. number of summary nodes, min. number of nodes represented by a summary node, infrequency threshold, similarity threshold | Query optimization | Relational tables and foreign keys | Instance | Theory |
| Zheng et al. [119] | Structural quotient, pattern-mining | Instance and schema | Each instance should have a type, the list of meaning-equivalent instances should be provided | Query optimization | Multi-layer Graph | Schema | Theory |
| Glimm et al. [25] | Pattern mining | Instance and schema | Description Logics, Semantic-aware, Handles implicit data | Compression | ABox (facts) | Instance | Theory |
| Fokoue et al. [24, 23, 19, 20] | Structural non quotient | Instance and schema | Description Logics, Semantic-aware, Handles implicit data | Consistency-checking and query answering | ABox (facts) | Instance | System |

Table 7: Hybrid RDF summaries.
