

Mandolin: A Knowledge Discovery Framework for the Web of DataTommaso Soru

University of Leipzig, [email protected]

Diego EstevesUniversity of Bonn, Germany

[email protected]

Edgard MarxLeipzig University of Applied Sciences, Germany

[email protected]

Axel-Cyrille Ngonga NgomoUniversity of Paderborn, Germany

[email protected]

Abstract

Markov Logic Networks join probabilistic modeling with first-order logic and have been shown to integrate well with the Semantic Web foundations. While several approaches have been devised to tackle the subproblems of rule mining, grounding, and inference, no comprehensive workflow has been proposed so far. In this paper, we fill this gap by introducing a framework called MANDOLIN, which implements a workflow for knowledge discovery specifically on RDF datasets. Our framework imports knowledge from referenced graphs, creates similarity relationships among similar literals, and relies on state-of-the-art techniques for rule mining, grounding, and inference computation. We show that our best configuration scales well and achieves at least comparable results with respect to other statistical-relational-learning algorithms on link prediction.

Introduction

The Linked Data cloud has grown considerably since its inception. To date, the total number of facts exceeds 130 billion, spread over 2,500 available datasets.1 This massive quantity of data has thus become an object of interest for disciplines as diverse as Machine Learning (Spohr, Hollink, and Cimiano 2011; Nikolov, d'Aquin, and Motta 2012; Rowe, Angeletou, and Alani 2011), Evolutionary Algorithms (Wang, Ding, and Jiang 2006; Ngonga Ngomo and Lyko 2012), Generative Models (Bhattacharya and Getoor 2006), and Statistical Relational Learning (SRL) (Singla and Domingos 2006). One of the main objectives of applying such algorithms is to address the fourth Linked Data principle, which states "include links to other URIs, so that they [the visitors] can discover more things" (Berners-Lee 2006). Two years later, Domingos et al. (2008) proposed Markov Logic Networks (MLNs) – a well-known approach to Knowledge Discovery in knowledge bases (Richardson and Domingos 2006) – as a promising framework for the Semantic Web. Bringing the power of probabilistic modeling to first-order logic, MLNs associate a weight with each formula (i.e., first-order logic rule) and can natively perform probabilistic inference. Several tools based on MLNs have been designed so far (Kok et al. 2009; Niu et al. 2011;

Copyright © 2017.
1 Retrieved on February 15, 2017, from http://lodstats.aksw.org/.

Noessner, Niepert, and Stuckenschmidt 2013; Bodart et al. 2014). Yet, none of the existing MLN frameworks covers the entire pipeline from the generation of rules to the discovery of new relationships in a dataset. Moreover, the size of the Web of Data today represents an enormous challenge for such learning algorithms, which often have to be re-engineered in order to scale to larger datasets. In recent years, this problem has been tackled by proposing algorithms that benefit from massive parallelism. Approximate results with some degree of confidence have been preferred over exact ones, as they often require less computational power while still yielding acceptable performance.

In this paper, we propose a new workflow for probabilistic knowledge discovery on RDF datasets and implement it in a framework called MANDOLIN (Markov Logic Networks for the Discovery of Links). To the best of our knowledge, our framework is the first to implement the entire pipeline for link prediction on RDF datasets. Making use of RDFS/OWL semantics, MANDOLIN can (i) import knowledge from referenced graphs, (ii) compute the forward chaining, and (iii) create similarity relationships among similar literals. We show that this additional information allows the discovery of links even between different knowledge bases. We evaluate the framework on two benchmark datasets for link prediction and show that it can achieve comparable results w.r.t. other SRL algorithms and outperform them on two accuracy indices.

Related Work

Machine-learning techniques have been successfully applied to ontology and instance matching, where the aim is to match classes, properties, and instances belonging to different ontologies or knowledge bases (Ngonga Ngomo et al. 2011; Ngonga Ngomo and Lyko 2012; Shvaiko et al. 2016). Evolutionary algorithms have also been used for the same purpose (Martinez-Gil, Alba, and Montes 2008). For instance, genetic programming has been shown to find good link specifications (i.e., similarity-based decision trees) in both a semi-supervised and an unsupervised fashion (Ngonga Ngomo and Lyko 2012). Generative models are statistical approaches which belong to neither the ML nor the SRL branch. Latent Dirichlet allocation is an example of their application to entity resolution (Bhattacharya and Getoor 2006) and topic modeling (Röder et al. 2016).

arXiv:1711.01283v1 [cs.DB] 3 Nov 2017


SRL techniques such as Markov logic (Richardson and Domingos 2006) and tensor-factorization models (Nickel, Jiang, and Tresp 2014) have been proposed for link prediction and triple classification; the former have also been applied to problems like entity resolution (Singla and Domingos 2006). Among the frameworks which operate on MLNs, we can mention NetKit-SRL (Macskassy and Provost 2005), Alchemy (Kok et al. 2009), Tuffy (Niu et al. 2011), ArThUR (Bodart et al. 2014), and RockIt (Noessner, Niepert, and Stuckenschmidt 2013). Several approaches which rely on translations have been devised to perform link prediction via the generation of embeddings (Bordes et al. 2013; Lin et al. 2015; Wang, Wang, and Guo 2015; Xiao et al. 2016). The Google Knowledge Vault is a huge structured knowledge repository backed by a probabilistic inference system (i.e., ER-MLP) that computes calibrated probabilities of fact correctness (Dong et al. 2014).

This work is also related to link prediction in social networks (Liben-Nowell and Kleinberg 2007; Scellato, Noulas, and Mascolo 2011). Since social networks are representations of social interactions, they can be seen as RDF graphs having only one property. Recently, approaches such as DeepWalk (Perozzi, Al-Rfou, and Skiena 2014) and node2vec (Grover and Leskovec 2016) have shown impressive scalability to large graphs.

The link discovery frameworks Silk (Volz et al. 2009) and LIMES (Ngonga Ngomo and Auer 2011) present a variety of methods for the discovery of links among different knowledge bases (Jentzsch, Isele, and Bizer 2010; Ngonga Ngomo and Lyko 2012; Sherif and Ngonga Ngomo 2015). As presented in the next section, instance matching is a sub-problem of link discovery where the sought property is an equivalence linking instances. Most instance matching tools (Li et al. 2009; Jimenez-Ruiz and Grau 2011) take or have taken part in the Ontology Alignment Evaluation Initiative (OAEI).

Preliminaries

First-order knowledge bases are composed of statements and formulas expressed in first-order logic (Genesereth and Nilsson 1987). In probabilistic knowledge bases, every statement (i.e., edge) has a weight associated with it (Wuthrich 1995). The weighting function can be represented as ω : E → [0, 1]. This means that a relationship might exist with some confidence or probability degree. The current Semantic Web vision does not foresee weights for a given triple. However, a probabilistic interpretation of RDF graphs with weights partly lower than 1 has been shown to help solve many problems, such as instance matching, question answering, and class expression learning (Leitao, Calado, and Weis 2007; Shekarpour et al. 2014; Buhmann et al. 2014).

As mentioned in the introduction, MLNs join first-order logic with a probabilistic model by assigning a weight to each formula. Formally, a Markov Logic Network can be described as a set of pairs (Fi, wi), where each formula Fi is expressed in first-order logic and wi ∈ R is its corresponding weight. The weight wi associated with each formula Fi softens the crisp behavior of Boolean logic as follows. Along with a set

of constants C, an MLN can be viewed as a template for building a Markov Network. Given C, a so-called Ground Markov Network is constructed, assigning to each grounding the same weight as its respective formula. In a ground network, each ground node corresponds to a statement. Once such a network is built, it is possible to compute a probability value ∈ [0, 1] for each statement (Richardson and Domingos 2006).
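Concretely, the joint distribution that an MLN defines over possible worlds x is given by Richardson and Domingos (2006) as:

```latex
P(X = x) = \frac{1}{Z} \exp\!\left( \sum_i w_i \, n_i(x) \right)
```

where n_i(x) is the number of true groundings of formula F_i in world x and Z is the partition function that normalizes the distribution. A higher weight w_i makes worlds violating F_i less probable without ruling them out entirely, which is precisely the softening of Boolean logic described above.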

The Workflow

Our workflow comprises five modules: RDFS/OWL enrichment, Rule mining, Interpretation, Grounding, and Inference. As can be seen in Figure 1, the modules are aligned in a sequential manner. Taking a union of RDF graphs G = ⋃i Gi as input, the process ends with the generation of an enriched graph G′ = (V′, E′) = lp(G), where lp is the link prediction algorithm modeled as a function.

RDFS/OWL enrichment

The RDFS/OWL enrichment module is optionally activated and features three different operations: Similarity join, Ontology import, and Forward chaining. Its function is to add a layer of relationships to the input graph G.

Similarity join. A node in an RDF graph may represent either a URI, a literal, or a blank node. While URIs and blank nodes have no restrictions w.r.t. their position in a triple (i.e., they can be both subjects and objects), a literal can only occur as an object (i.e., it has only incoming edges). Literals can be of different datatypes (e.g., strings, integers, floats). In order to generate the similarity relationships, we first collect all literals in the graph into as many buckets as there are datatype properties. We chose the Jaccard similarity on q-grams (Gravano et al. 2001) to compare strings. To tackle the quadratic time complexity of the extraction of similar candidate pairs, we apply positional filtering on prefixes and suffixes (Xiao et al. 2008), as implemented in the LIMES framework (Ngonga Ngomo 2012), within a similarity threshold θ. Once the candidate pairs (i.e., datatype values) are extracted, we create a new similarity property for each datatype property and for θ = 0.1, ..., 1.0 to connect the respective subjects (e.g., :foaf_name_0.6 to link two persons having names with similarity greater than 0.6). Intuitively, these similarity properties form a hierarchy where properties with a higher threshold are sub-properties of the ones with a lower threshold.

The procedure above is repeated for each datatype property. Numerical and time values are sorted by value and linked via a similarity predicate whenever their difference is less than the threshold θ. The rationale behind the use of similarity joins is that (i) they can foster the discovery of equivalence relationships and (ii) similarity properties can be included in inference rules.
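The string-similarity case can be sketched as follows. This is an illustrative sketch rather than the MANDOLIN/LIMES implementation (which adds positional filtering to avoid the quadratic comparison); the helper names `qgrams`, `jaccard`, and `similarity_triples` are ours, and we assume a `foaf_name`-style property name for the example.

```python
# Hypothetical sketch of the similarity-join step: Jaccard similarity on
# q-grams, emitting one similarity triple per satisfied threshold θ.

def qgrams(s, q=3):
    """Return the set of q-grams of a string padded with q-1 '#' characters."""
    s = f"{'#' * (q - 1)}{s}{'#' * (q - 1)}"
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(a, b, q=3):
    """Jaccard similarity |A ∩ B| / |A ∪ B| over the q-gram sets of a and b."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)

def similarity_triples(subj1, subj2, lit1, lit2, prop="foaf_name"):
    """Emit a triple (subj1, :prop_θ, subj2) for every θ = 0.1, ..., 1.0
    that the literal pair exceeds; higher-θ properties are implicitly
    sub-properties of lower-θ ones."""
    sim = jaccard(lit1, lit2)
    return [(subj1, f":{prop}_{theta / 10:.1f}", subj2)
            for theta in range(1, 11) if sim > theta / 10]

triples = similarity_triples(":alice", ":alice2",
                             "Alice Liddell", "Alice Liddel")
```

Here the near-duplicate names yield a chain of similarity predicates up to their actual Jaccard score, which is exactly the hierarchy of thresholded properties described above.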

Ontology import and Forward chaining. RDF datasets on the Web are published so that their content can be accessible from everywhere. The vision of the Semantic Web expects URIs to be referenced from different knowledge bases. In any knowledge-representation application, one option to


Figure 1: The knowledge discovery workflow as implemented in MANDOLIN.

process the semantics associated with a URI is to import the ontology (or the available RDF data) which defines such an entity. To accomplish this, MANDOLIN dereferences external URIs, imports the data into its graph G, and performs forward chaining (i.e., computes the semantic closure) on the whole graph. This additional information can be useful for the Markov logic, since it fosters connectivity in G.

Rule mining and Interpretation

The mining of rules in a knowledge base is not a task strictly tied to MLN systems. Instead, the set of MLN rules is usually given as input to the MLN system. MANDOLIN integrates the rule mining phase into the workflow, exploiting a state-of-the-art algorithm.

The rule mining module takes an RDF graph as input and yields rules expressed as first-order Horn clauses. A Horn clause is a logic clause having at most one positive literal when written in disjunctive normal form (DNF). Any DNF clause ¬a(x, y) ∨ c(x, y) can be rewritten as a(x, y) ⇒ c(x, y), thus featuring an implication. The part on the left of the implication is called the body, whereas the part on the right is called the head. In MANDOLIN, a rule can have a body size of 1 or 2, belonging to one of 6 different classes: a(x, y) ⇒ c(x, y); a(y, x) ⇒ c(x, y); a(z, x) ∧ b(z, y) ⇒ c(x, y); a(x, z) ∧ b(z, y) ⇒ c(x, y); a(z, x) ∧ b(y, z) ⇒ c(x, y); and a(x, z) ∧ b(y, z) ⇒ c(x, y). Intuitively, considering only a subset of Horn clauses decreases expressivity, but it also reduces the search space. In large-scale knowledge bases, this strategy is preferred since it allows the approach to scale. For the search of rules in the graph, we rely on the AMIE+ algorithm described in (Galarraga et al. 2015).
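As an illustration of the rule classes above, the following sketch (our own toy code, not part of MANDOLIN or AMIE+) applies one of the two-atom patterns, a(z, x) ∧ b(z, y) ⇒ c(x, y), to a small triple set; the property names are invented for the example.

```python
# Hypothetical sketch: applying the Horn-rule pattern
# a(z, x) ∧ b(z, y) ⇒ c(x, y) over (subject, property, object) triples.

def apply_rule(triples, a, b, c):
    """Yield head triples (x, c, y) whose body a(z, x) ∧ b(z, y) holds.
    An atom a(z, x) corresponds to the triple (z, a, x)."""
    inferred = set()
    for (z1, p1, x) in triples:
        if p1 != a:
            continue
        for (z2, p2, y) in triples:
            # Require a shared join variable z and drop reflexive results,
            # which a rule miner would discard as trivial.
            if p2 == b and z1 == z2 and x != y:
                inferred.add((x, c, y))
    return inferred

triples = {
    (":Bob", ":fatherOf", ":Ann"),
    (":Bob", ":fatherOf", ":Tom"),
}
# fatherOf(z, x) ∧ fatherOf(z, y) ⇒ siblingOf(x, y)
new = apply_rule(triples, ":fatherOf", ":fatherOf", ":siblingOf")
```

The other five classes differ only in which argument positions the variables x, y, z occupy, so restricting the miner to these patterns bounds the join work per rule.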

In the interpretation module, the set of rules returned by the rule miner is collected, filtered, and translated for the next phase, i.e., the grounding. At the end of the mining phase, we perform a selection of rules based on their head coverage, i.e., F′ = {F ∈ F : η(F) ≥ η}. We preferred to use PCA confidence over head coverage because previous literature showed its greater effectiveness (Dong et al. 2014; Galarraga et al. 2015).

Grounding

Grounding is the phase in which the ground Markov network (factor graph) is built starting from the graph and a set of

MLN rules. A factor graph is a graph consisting of two types of nodes, factors and variables, where the factors connect all the variables in their scope. Given a set of factors φ = {φ1, . . . , φN}, each φi is a function over a random vector Xi

which indicates the value of a variable in a formula. As the computational complexity of grounding is NP-complete, the problem of scalability has been addressed by using relational databases. However, frameworks such as Tuffy or Alchemy have shown that they are not able to scale, even on datasets with a few thousand statements (Chen and Wang 2013). Tuffy, for example, stores the ground network data into a DBMS loaded on a RAM-disk for best performance (Niu et al. 2011); however, as the ground network grows exponentially, the RAM cannot contain it, resulting in the program going out of memory. For this reason, in the MANDOLIN grounding module we integrated ProbKB, a state-of-the-art algorithm for the computation of factors. The main strength of this approach is the exploitation of the simple structure of Horn clauses (Chen and Wang 2014), differently from other frameworks where any first-order logic formula is allowed. It consists of a two-step method: (1) new statements are inferred until convergence and (2) the factor network is built. Each statement is read in-memory at most 3 times, differently from Tuffy, where it is read every time it appears in the knowledge base (Chen and Wang 2013).

Inference

MAP inference in Markov networks is a #P-complete problem (Roth 1996). However, the final probability values can be approximated using techniques such as Gibbs sampling – which has been shown to perform best (Noessner, Niepert, and Stuckenschmidt 2013) – and belief propagation (Richardson and Domingos 2006).

Every statement a(x, y) is associated with a node in the factor graph. Therefore, its probability is proportional to the product of all potential functions φk(x{k}) applied to the state of the cliques touching that node. Once we compute the probabilities of all sampled candidates, we normalize them so that the minimum and maximum values are 0 and 1, respectively. The final set P of predicted links is then defined as those statements whose probability is greater than a threshold τ ∈ [0, 1]. Typically, the number of iterations for the Gibbs sampler is γ = 100 · |E| (Noessner, Niepert, and Stuckenschmidt 2013).

Experiments

Any directed labeled graph can be easily transformed into an RDF graph by simply creating a namespace and prepending it to the entities and properties in statements. We thus created an RDF version of a benchmark for knowledge discovery used in (Bordes et al. 2013; Lin et al. 2015; Wang, Wang, and Guo 2015; Xiao et al. 2016; Nickel, Rosasco, and Poggio 2015). The benchmark consists of two datasets: WN18, built upon the WordNet glossary, and FB15k, a subset of the Freebase collaborative knowledge base. Using these datasets, we evaluated MANDOLIN on link prediction. Finally, we employed the DBLP–ACM (Kopcke, Thor, and Rahm 2010) dataset to test cross-dataset linking, as well as the large-scale dataset DBpedia 3.8 to evaluate the scalability of our approach. All experiments were carried out on a 64-core server with 256 GB RAM running Ubuntu 16.04. The MANDOLIN framework was mainly developed in Java 1.8 and its source code is available online2 along with all datasets used for the evaluation.

Dataset       # triples    # nodes      # prop.
WN18          146,442      40,943       18
FB15k         533,144      14,951       1,345
DBLP–ACM      20,759       5,003        34
DBpedia 3.8   11,024,066   ∼2,200,000   650

Table 1: Datasets used in the evaluation.

We evaluated the link prediction task on two measures, Mean Reciprocal Rank (MRR) and Hits@k. The benchmark datasets are divided into training, validation, and test sets. We used the training set to build the models and the validation set to find the hyperparameters, which are introduced later. Afterwards, we used the union of the training and validation sets to build the final model. For each test triple (s, p, o), we generated as many corrupted triples (s, p, o′) as there are nodes in the graph, such that o′ ≠ o ∈ V. We computed the probability for all these triples when this value was available; when not, we assumed it to be 0. Then, we ranked the triples in descending order and checked the position of (s, p, o) in the ranking. The Hits@k index is the ratio (%) of test triples that have been ranked among the top k. We computed Hits@1,3,10 with a filtered setting, i.e., all corrupted triples ranked above (s, p, o) which are present in the training sets were discarded before computing the rank.
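The filtered protocol can be summarized in a few lines of code. This is our own sketch of the standard evaluation procedure, not MANDOLIN code; `score` stands in for whatever scoring function (here, a probability lookup) the model provides.

```python
# Illustrative sketch of filtered link-prediction evaluation: corrupt the
# object of each test triple, skip corruptions that are known true, and
# rank the true object by score.

def filtered_rank(score, s, p, o, nodes, known_triples):
    """Rank of the true object o among corrupted objects o' (filtered)."""
    true_score = score(s, p, o)
    rank = 1
    for o2 in nodes:
        if o2 == o or (s, p, o2) in known_triples:
            continue  # filtered setting: discard known-true corruptions
        if score(s, p, o2) > true_score:
            rank += 1
    return rank

def mrr_and_hits(ranks, k=10):
    """Mean Reciprocal Rank and Hits@k (%) over a list of ranks."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = 100.0 * sum(r <= k for r in ranks) / len(ranks)
    return mrr, hits

# Toy usage: a scoring function that prefers the correct object "o1".
score = lambda s, p, o: {"o1": 0.9, "o2": 0.5, "o3": 0.7}[o]
rank = filtered_rank(score, "s", "p", "o1", ["o1", "o2", "o3"], set())
```

Passing the union of training, validation, and test triples as `known_triples` reproduces the filtered setting used in the tables below.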

The results for link prediction on the WN18–FB15k benchmark are shown in Table 2. We compare MANDOLIN with other SRL techniques based on embeddings and tensor factorization. On WN18, we outperform all other approaches w.r.t. the Hits@10 index (96.0%). However, HOLE (Nickel, Rosasco, and Poggio 2015) recorded the highest performance on MRR and Hits@1; the two approaches achieved almost the same value on Hits@3. Since the two datasets above contain no datatype values and no statements using

2 https://github.com/AKSW/Mandolin

the RDF Schema,3 we did not activate the RDF-specific settings introduced in the previous section.

Our framework depends on the following hyperparameters:

• minimum head coverage (η), used to filter rules;

• Gibbs sampling iterations (γ).

To compute the optimal configuration on the trade-off between computational needs and overall performance, we performed further experiments on the link prediction benchmark. We investigated the relationship between the number of Gibbs sampling iterations, runtime, and accuracy by running our approach using the following values: γ = {1, 2, 3, 5, 10, 50, 100} · 10^6. The runtime is, excluding an overhead, linear w.r.t. the number of iterations. As can be seen in Figure 2, the Hits@10 index tends to stabilize at around γ = 5 · 10^6; however, higher accuracy can be obtained by increasing this value.

[Plot: Hits@10 accuracy vs. Gibbs sampling iterations (millions).]

Figure 2: Hits@10 on FB15k with η = 0.9.

We performed tests on the DBLP–ACM dataset for the discovery of equivalence relationships and on the large-scale dataset DBpedia.4 We compared our approach with other MLN frameworks, i.e., NetKit-SRL, ProbCog, Alchemy, and Tuffy. As these frameworks can learn rule weights but not the rules themselves, we fed them with the rules found by our rule miner. We set η = 0.9 and γ = 10^7. The results (see Table 3) showed that, in all cases, MANDOLIN is the only framework that was able to terminate the computation. In the DBLP–ACM dataset, we were able to discover equivalence links among articles and authors that had not been linked in the original datasets. After dividing the mapping M into two folds (90%–10%), we used the larger as training set. We were able to predict 71.0% of the correct owl:sameAs links in the remaining test set.

Discussion

We have witnessed a different behavior of our algorithm when evaluated on the two datasets for link prediction. This might be explained by the different structure of the graphs: relying on first-order Horn clauses, new relationships can only be discovered if they belong to a 3-vertex clique where

3 https://www.w3.org/TR/rdf-schema/
4 Version 3.8 from http://www.mpi-inf.mpg.de/departments/ontologies/projects/amie/.


Table 2: Results for link prediction on the WordNet (WN18) and Freebase (FB15k) datasets.

                    WN18                                  FB15k
          MRR    Hits@1  Hits@3  Hits@10    MRR    Hits@1  Hits@3  Hits@10
TRANSE    0.495  11.3    88.8    94.3       0.463  29.7    57.8    74.9
TRANSR    0.605  33.5    87.6    94.0       0.346  21.8    40.4    58.2
ER-MLP    0.712  62.6    77.5    86.3       0.288  17.3    31.7    50.1
RESCAL    0.890  84.2    90.4    92.8       0.354  23.5    40.9    58.7
HOLE      0.938  93.0    94.5    94.9       0.524  40.2    61.3    73.9
MANDOLIN  0.892  89.2    94.3    96.0       0.404  40.4    48.4    52.6

Table 3: Runtime, number of rules after the filtering, and number of predicted links for η = 0.9 and γ = 10^7.

Dataset     Runtime (s)   |F′|    |P|
DBLP–ACM    2,460         1,500   4,730
DBpedia     85,334        1,500   179,201

the other two edges are already in the graph. Therefore, rule-based algorithms might need one more step, such as a longer rule body or more iterations, to discover them in a less connected graph such as FB15k. A more detailed view of the learned rules is provided in the project repository. The reason why approaches like RESCAL and ER-MLP performed worse than others is probably overfitting. Embedding methods have been shown to achieve excellent results; however, no method significantly outperformed the others. We thus believe that heterogeneity in Linked Data sets is still an open challenge and that structure plays a key role in the choice of the algorithm. Although our MLN framework proved to be more scalable and able to provide users with justifications for adding triples through the rules it generates, we recognize that this aspect can be further investigated by replacing one or more of its components to decrease the overall runtime.

Conclusion

In this paper, we proposed a workflow for probabilistic knowledge discovery as implemented in MANDOLIN, a framework specifically designed for the Web of Data. To the best of our knowledge, it is the first complete framework for RDF link prediction based on Markov Logic Networks which features the entire pipeline necessary to achieve this task. We showed that it is able to achieve results beyond the state of the art for some measures on a well-known link prediction benchmark. Moreover, it can predict equivalence links across datasets and scale to large graphs. We plan to extend this work in order to refine domain and range in rules, build functionals using OWL rules, and evaluate their effectiveness on the predicted links.

References

Berners-Lee, T. 2006. Linked data – design issues. Published online.

Bhattacharya, I., and Getoor, L. 2006. A latent Dirichlet model for unsupervised entity resolution. In SDM, volume 5, 7. SIAM.
Bodart, A.; Evrard, K.; Ortiz, J.; and Schobbens, P.-Y. 2014. ArThUR: A tool for Markov logic network. In OTM Confederated International Conferences. Springer.
Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; and Yakhnenko, O. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems.
Buhmann, L.; Fleischhacker, D.; Lehmann, J.; Melo, A.; and Volker, J. 2014. Inductive lexical learning of class expressions. In Knowledge Engineering and Knowledge Management, volume 8876, 42–53.
Chen, D. Y., and Wang, D. Z. 2013. Web-scale knowledge inference using Markov logic networks. In ICML Workshop on Structured Learning: Inferring Graphs from Structured and Unstructured Inputs. Association for Computational Linguistics.
Chen, Y., and Wang, D. Z. 2014. Knowledge expansion over probabilistic knowledge bases. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM.
Domingos, P.; Lowd, D.; Kok, S.; Poon, H.; Richardson, M.; and Singla, P. 2008. Just add weights: Markov logic for the Semantic Web. In Uncertainty Reasoning for the Semantic Web I. Springer.
Dong, X.; Gabrilovich, E.; Heitz, G.; Horn, W.; Lao, N.; Murphy, K.; Strohmann, T.; Sun, S.; and Zhang, W. 2014. Knowledge Vault: A web-scale approach to probabilistic knowledge fusion. In SIGKDD. ACM.
Galarraga, L.; Teflioudi, C.; Hose, K.; and Suchanek, F. M. 2015. Fast rule mining in ontological knowledge bases with AMIE+. The VLDB Journal 24(6).
Genesereth, M. R., and Nilsson, N. J. 1987. Logical Foundations of Artificial Intelligence. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Gravano, L.; Ipeirotis, P. G.; Jagadish, H. V.; Koudas, N.; Muthukrishnan, S.; and Srivastava, D. 2001. Approximate string joins in a database (almost) for free. In VLDB '01. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Grover, A., and Leskovec, J. 2016. node2vec: Scalable feature learning for networks. In KDD 2016.


Jentzsch, A.; Isele, R.; and Bizer, C. 2010. Silk – generating RDF links while publishing or consuming linked data. In Proceedings of the 2010 International Conference on Posters & Demonstrations Track, volume 658. CEUR-WS.org.
Jimenez-Ruiz, E., and Grau, B. C. 2011. LogMap: Logic-based and scalable ontology matching. In International Semantic Web Conference. Springer.
Kok, S.; Sumner, M.; Richardson, M.; Singla, P.; Poon, H.; Lowd, D.; Wang, J.; and Domingos, P. 2009. The Alchemy system for statistical relational AI. Technical report, Department of Computer Science and Engineering, University of Washington.
Kopcke, H.; Thor, A.; and Rahm, E. 2010. Evaluation of entity resolution approaches on real-world match problems. PVLDB 3(1).
Leitao, L.; Calado, P.; and Weis, M. 2007. Structure-based inference of XML similarity for fuzzy duplicate detection. In CIKM. ACM.
Li, J.; Tang, J.; Li, Y.; and Luo, Q. 2009. RiMOM: A dynamic multistrategy ontology alignment framework. IEEE Transactions on Knowledge and Data Engineering 21(8).
Liben-Nowell, D., and Kleinberg, J. 2007. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology 58(7).
Lin, Y.; Liu, Z.; Sun, M.; Liu, Y.; and Zhu, X. 2015. Learning entity and relation embeddings for knowledge graph completion. In AAAI.
Macskassy, S. A., and Provost, F. 2005. NetKit-SRL: A toolkit for network learning and inference. In Proceedings of the NAACSOS Conference.
Martinez-Gil, J.; Alba, E.; and Montes, J. F. A. 2008. Optimizing ontology alignments by using genetic algorithms. In NatuReS Workshop. CEUR-WS.org.
Ngonga Ngomo, A.-C., and Auer, S. 2011. LIMES – a time-efficient approach for large-scale link discovery on the Web of Data. In Proceedings of IJCAI.
Ngonga Ngomo, A.-C., and Lyko, K. 2012. EAGLE: Efficient active learning of link specifications using genetic programming. In Proceedings of ESWC.
Ngonga Ngomo, A.-C.; Heino, N.; Lyko, K.; Speck, R.; and Kaltenbock, M. 2011. SCMS – semantifying content management systems. In ISWC 2011.
Ngonga Ngomo, A.-C. 2012. Link discovery with guaranteed reduction ratio in affine spaces with Minkowski measures. In Proceedings of ISWC.
Nickel, M.; Jiang, X.; and Tresp, V. 2014. Reducing the rank in relational factorization models by including observable patterns. In Advances in Neural Information Processing Systems.
Nickel, M.; Rosasco, L.; and Poggio, T. 2015. Holographic embeddings of knowledge graphs. AAAI.
Nikolov, A.; d'Aquin, M.; and Motta, E. 2012. Unsupervised learning of link discovery configuration. In Extended Semantic Web Conference. Springer.

Niu, F.; Ré, C.; Doan, A.; and Shavlik, J. 2011. Tuffy: Scaling up statistical inference in Markov logic networks using an RDBMS. VLDB 4(6).
Noessner, J.; Niepert, M.; and Stuckenschmidt, H. 2013. RockIt: Exploiting parallelism and symmetry for MAP inference in statistical relational models. AAAI.
Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. DeepWalk: Online learning of social representations. In SIGKDD. ACM.
Richardson, M., and Domingos, P. 2006. Markov logic networks. Machine Learning 62(1-2).
Röder, M.; Ngonga Ngomo, A.-C.; Ermilov, I.; and Both, A. 2016. Detecting similar linked datasets using topic modelling. In ESWC, 3-19.
Roth, D. 1996. On the hardness of approximate reasoning. Artificial Intelligence 82(1-2).
Rowe, M.; Angeletou, S.; and Alani, H. 2011. Predicting discussions on the social semantic web. In Extended Semantic Web Conference. Springer.
Scellato, S.; Noulas, A.; and Mascolo, C. 2011. Exploiting place features in link prediction on location-based social networks. In SIGKDD.
Shekarpour, S.; Marx, E.; Ngonga Ngomo, A.-C.; and Auer, S. 2014. SINA: Semantic interpretation of user queries for question answering on interlinked data. Web Semantics.
Sherif, M. A., and Ngonga Ngomo, A.-C. 2015. An optimization approach for load balancing in parallel link discovery. In SEMANTiCS 2015.
Shvaiko, P.; Euzenat, J.; Jiménez-Ruiz, E.; Cheatham, M.; and Hassanzadeh, O., eds. 2016. Proceedings of the 10th International Workshop on Ontology Matching.
Singla, P., and Domingos, P. 2006. Entity resolution with Markov logic. In Sixth International Conference on Data Mining (ICDM'06). IEEE.
Spohr, D.; Hollink, L.; and Cimiano, P. 2011. A machine learning approach to multilingual and cross-lingual ontology matching. In ISWC. Springer.
Volz, J.; Bizer, C.; Gaedke, M.; and Kobilarov, G. 2009. Silk - a link discovery framework for the web of data. LDOW 538.
Wang, J.; Ding, Z.; and Jiang, C. 2006. GAOM: Genetic algorithm based ontology matching. In APSCC. IEEE.
Wang, Q.; Wang, B.; and Guo, L. 2015. Knowledge base completion using embeddings and rules. In IJCAI.
Wüthrich, B. 1995. Probabilistic knowledge bases. IEEE Transactions on Knowledge and Data Engineering 7(5).
Xiao, C.; Wang, W.; Lin, X.; and Yu, J. X. 2008. Efficient similarity joins for near duplicate detection. In WWW '08. New York, NY, USA: ACM.
Xiao, H.; Huang, M.; Hao, Y.; and Zhu, X. 2016. TransG: A generative mixture model for knowledge graph embedding. ACL.