Using Substitutive Itemset Mining Framework for Finding Synonymous Properties in Linked Data
Mikołaj Morzy, Agnieszka Ławrynowicz, Mateusz Zozuliński
Poznan University of Technology, Poland
August 3rd, 2015, RuleML 2015
Outline
Motivating Scenario
Substitutive Sets Mining
Finding Synonymous Properties with Substitutive Sets Mining
Summary
Motivating scenario 1/3
Mappings wiki (Wikipedia infoboxes -> DBpedia ontology)
Norah_Jones dbpedia-prop:origin Denton,_Texas
Mark_Knopfler dbpedia-owl:hometown Gosforth
Peter_Gabriel dbpedia-owl:hometown Godalming
Motivating scenario 2/3
DBpedia 2014 ontology has 1310 object and 1725 data properties
Many large Linked Data datasets use relatively lightweight schemas with a high number of object properties
Motivating scenario 3/3
Mappings wiki (Wikipedia infoboxes -> DBpedia ontology)
Norah_Jones (dbpedia-owl:MusicalArtist) dbpedia-prop:origin Denton,_Texas (dbpedia-owl:PopulatedPlace)
Mark_Knopfler (dbpedia-owl:MusicalArtist) dbpedia-owl:hometown Gosforth (dbpedia-owl:PopulatedPlace)
Peter_Gabriel (dbpedia-owl:MusicalArtist) dbpedia-owl:hometown Godalming (dbpedia-owl:PopulatedPlace)
Substitutive Sets Mining Framework
Transaction DB -> Frequent Itemset Mining -> Substitutive Set Generation
Frequent Itemsets
$I = \{i_1, i_2, \ldots, i_m\}$: a set of items
$D_T = \{t_1, t_2, \ldots, t_n\}$, where $\forall i\; t_i \subseteq I$: a database of transactions
$support(X) = \frac{|\{t \in D_T : X \subseteq t\}|}{|D_T|}$

ID  Items
1   Nachos, Pepsi, Salsa
2   Nachos, Coca-Cola, Salsa
3   Nachos, Coca-Cola
4   Nachos, Pepsi, Salsa
5   Milk, Bread

Frequent Itemset     Support
{Nachos}             80%
{Salsa}              60%
{Coca-Cola}          40%
{Pepsi}              40%
{Nachos, Salsa}      60%
{Nachos, Coca-Cola}  40%
{Nachos, Pepsi}      40%
{Salsa, Pepsi}       40%
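The support values in the table can be reproduced with a short brute-force sketch. The minimum support of 40% and the cap of two items per itemset are assumptions chosen to match the itemsets listed on the slide; the names `support` and `frequent_itemsets` are illustrative and not part of the authors' framework.

```python
from itertools import combinations

# Transactions from the example table.
transactions = [
    {"Nachos", "Pepsi", "Salsa"},
    {"Nachos", "Coca-Cola", "Salsa"},
    {"Nachos", "Coca-Cola"},
    {"Nachos", "Pepsi", "Salsa"},
    {"Milk", "Bread"},
]


def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in db if itemset <= t) / len(db)


def frequent_itemsets(db, min_support, max_size):
    """Brute-force enumeration for toy data (not FP-Growth).

    min_support and max_size are illustrative assumptions.
    """
    items = sorted(set().union(*db))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            s = support(candidate, db)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result


frequent = frequent_itemsets(transactions, min_support=0.4, max_size=2)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"{s:.0%}")
```

Exhaustive enumeration is only workable at this toy scale; the experiments described later rely on FP-Growth for the frequent itemset step.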
Covering Set
$CS(i \mid L) = \{X \in L : \{i\} \cup X \in L\}$
$coverage(i \mid L) = |CS(i \mid L)|$

Frequent Itemset
{Nachos}
{Salsa}
{Coca-Cola}
{Pepsi}
{Nachos, Salsa}
{Nachos, Coca-Cola}
{Nachos, Pepsi}
{Salsa, Pepsi}

i            CS(i)                            coverage
{Nachos}     {{Salsa}, {Coca-Cola}, {Pepsi}}  3
{Salsa}      {{Nachos}}                       1
{Coca-Cola}  {{Nachos}}                       1
{Pepsi}      {{Nachos}, {Salsa}}              2
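A minimal sketch of the covering set computation over the listed frequent itemsets follows. It assumes that X must not already contain the item i, which the formula does not state explicitly but the table implies; the function names are illustrative.

```python
# Frequent itemsets L as listed on the slide.
L = {
    frozenset({"Nachos"}),
    frozenset({"Salsa"}),
    frozenset({"Coca-Cola"}),
    frozenset({"Pepsi"}),
    frozenset({"Nachos", "Salsa"}),
    frozenset({"Nachos", "Coca-Cola"}),
    frozenset({"Nachos", "Pepsi"}),
    frozenset({"Salsa", "Pepsi"}),
}


def covering_set(item, frequent):
    """CS(item | L): itemsets X in L such that {item} united with X is also in L.

    Assumption: X must not contain item itself (implied by the table,
    not spelled out in the formula).
    """
    return {x for x in frequent if item not in x and (x | {item}) in frequent}


def coverage(item, frequent):
    """coverage(item | L) = |CS(item | L)|."""
    return len(covering_set(item, frequent))


for item in ("Nachos", "Pepsi", "Coca-Cola"):
    members = sorted(sorted(x) for x in covering_set(item, L))
    print(item, members, coverage(item, L))
```

Applied to L exactly as listed, the same function also returns {Pepsi} for Salsa, since {Salsa, Pepsi} is in L; the table shows only {Nachos} for that row.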
Substitutive Sets
A two-element itemset $\{x, y\}$ is a substitutive itemset if:
$x \in L_1$,
$y \in L_1$,
$support(\{x\} \cup \{y\}) < \varepsilon$, where $\varepsilon$ is a user-defined threshold representing the highest amount of noise allowed in the data,
$\frac{|CS(x \mid L) \cap CS(y \mid L)|}{\max\{|CS(x \mid L)|, |CS(y \mid L)|\}} \geq mincommon$.
i            CS(i)                            coverage
{Nachos}     {{Salsa}, {Coca-Cola}, {Pepsi}}  3
{Salsa}      {{Nachos}}                       1
{Coca-Cola}  {{Nachos}}                       1
{Pepsi}      {{Nachos}, {Salsa}}              2

$\frac{|CS(Pepsi) \cap CS(Coca\text{-}Cola)|}{\max\{|CS(Pepsi)|, |CS(Coca\text{-}Cola)|\}} = \frac{1}{2} = 0.5$
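The four conditions can be checked mechanically. The sketch below reuses the toy data and reproduces the 0.5 ratio for Pepsi and Coca-Cola. The threshold values passed in the calls (`epsilon=0.1`, `min_common=0.5`) are illustrative assumptions, as are the function names and the reading of $L_1$ as the frequent single-item sets.

```python
transactions = [
    {"Nachos", "Pepsi", "Salsa"},
    {"Nachos", "Coca-Cola", "Salsa"},
    {"Nachos", "Coca-Cola"},
    {"Nachos", "Pepsi", "Salsa"},
    {"Milk", "Bread"},
]

# Frequent itemsets L as listed on the slides.
L = {
    frozenset({"Nachos"}), frozenset({"Salsa"}),
    frozenset({"Coca-Cola"}), frozenset({"Pepsi"}),
    frozenset({"Nachos", "Salsa"}), frozenset({"Nachos", "Coca-Cola"}),
    frozenset({"Nachos", "Pepsi"}), frozenset({"Salsa", "Pepsi"}),
}


def support(itemset, db):
    """Fraction of transactions containing every item of itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in db if itemset <= t) / len(db)


def covering_set(item, frequent):
    """CS(item | L); assumes X must not contain item itself."""
    return {x for x in frequent if item not in x and (x | {item}) in frequent}


def common_ratio(x, y, frequent):
    """|CS(x) & CS(y)| / max(|CS(x)|, |CS(y)|), or 0.0 if both are empty."""
    cs_x, cs_y = covering_set(x, frequent), covering_set(y, frequent)
    denominator = max(len(cs_x), len(cs_y))
    return len(cs_x & cs_y) / denominator if denominator else 0.0


def is_substitutive(x, y, frequent, db, epsilon, min_common):
    """Check the four conditions of the definition for the pair {x, y}."""
    # Assumption: L_1 is read as the single-item sets contained in L.
    both_frequent = frozenset({x}) in frequent and frozenset({y}) in frequent
    rarely_together = support({x, y}, db) < epsilon
    similar_context = common_ratio(x, y, frequent) >= min_common
    return both_frequent and rarely_together and similar_context


print(common_ratio("Pepsi", "Coca-Cola", L))
print(is_substitutive("Pepsi", "Coca-Cola", L, transactions, epsilon=0.1, min_common=0.5))
print(is_substitutive("Nachos", "Salsa", L, transactions, epsilon=0.1, min_common=0.5))
```

With these thresholds the Pepsi and Coca-Cola pair qualifies, while Nachos and Salsa do not because they co-occur in 60% of the transactions. Raising `min_common` above 0.5 would reject the Pepsi and Coca-Cola pair as well.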
Create Substitutive Sets RapidMiner operator
Use Case: DBpedia
DBpedia knowledge base version 2014
Sets of 3-item transactions {c1, p, c2}, where c1 and c2 are the classes of the subject s and the object o of an RDF triple, and p is the property connecting s and o
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?c1 ?p ?c2
WHERE {
  ?s rdf:type dbpedia-owl:Organization .
  ?s ?p ?o .
  ?s rdf:type ?c1 .
  ?o rdf:type ?c2 .
  FILTER(?p != dbpedia-owl:wikiPageWikiLink) .
  FILTER(?p != rdf:type) .
  FILTER(?p != dbpedia-owl:wikiPageExternalLink) .
  FILTER(?p != dbpedia-owl:wikiPageID) .
  FILTER(?p != dbpedia-owl:wikiPageInterLanguageLink) .
  FILTER(?p != dbpedia-owl:wikiPageLength) .
  FILTER(?p != dbpedia-owl:wikiPageOutDegree) .
  FILTER(?p != dbpedia-owl:wikiPageRedirects) .
  FILTER(?p != dbpedia-owl:wikiPageRevisionID)
}
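Since mining is run one class at a time, the same query shape can be generated for other classes. The helper below is a hypothetical sketch, not part of the authors' tooling: `build_query` and `EXCLUDED_PROPERTIES` are illustrative names, and the result is a plain string that would still have to be sent to a SPARQL endpoint on which the `rdf:` and `dbpedia-owl:` prefixes are defined.

```python
# Properties filtered out in the query on the slide, in the same order.
EXCLUDED_PROPERTIES = [
    "dbpedia-owl:wikiPageWikiLink",
    "rdf:type",
    "dbpedia-owl:wikiPageExternalLink",
    "dbpedia-owl:wikiPageID",
    "dbpedia-owl:wikiPageInterLanguageLink",
    "dbpedia-owl:wikiPageLength",
    "dbpedia-owl:wikiPageOutDegree",
    "dbpedia-owl:wikiPageRedirects",
    "dbpedia-owl:wikiPageRevisionID",
]


def build_query(class_name, excluded=EXCLUDED_PROPERTIES):
    """Return the (c1, p, c2) extraction query for one subject class.

    Hypothetical helper: it only assembles text and does not contact an endpoint.
    """
    filters = "\n".join(f"  FILTER(?p != {prop}) ." for prop in excluded)
    return (
        "SELECT ?c1 ?p ?c2\n"
        "WHERE {\n"
        f"  ?s rdf:type {class_name} .\n"
        "  ?s ?p ?o .\n"
        "  ?s rdf:type ?c1 .\n"
        "  ?o rdf:type ?c2 .\n"
        f"{filters}\n"
        "}"
    )


query = build_query("dbpedia-owl:Organization")
print(query)
```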
Transaction generation
Figure: each RDF triple (s, p, o) is annotated with the class c1 of its subject and the class c2 of its object.
s = Norah_Jones, p = dbpedia-prop:origin, o = Denton,_Texas, c1 = dbpedia-owl:MusicalArtist, c2 = dbpedia-owl:PopulatedPlace
s = Mark_Knopfler, p = dbpedia-owl:hometown, o = Gosforth, c1 = dbpedia-owl:MusicalArtist, c2 = dbpedia-owl:PopulatedPlace
Transactions
{c1 dbpedia-owl:MusicalArtist, dbpedia-owl:hometown, c2 dbpedia-owl:PopulatedPlace}
{c1 dbpedia-owl:MusicalArtist, dbpedia-prop:origin, c2 dbpedia-owl:PopulatedPlace}
...
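A small sketch of this transaction generation step, assuming the query results arrive as (c1, p, c2) tuples. The role prefixes `c1` and `c2` follow the transactions shown above; the names `to_transaction` and `context` are illustrative.

```python
# Example (c1, p, c2) rows, as the SPARQL query would return them.
rows = [
    ("dbpedia-owl:MusicalArtist", "dbpedia-owl:hometown", "dbpedia-owl:PopulatedPlace"),
    ("dbpedia-owl:MusicalArtist", "dbpedia-prop:origin", "dbpedia-owl:PopulatedPlace"),
]


def to_transaction(c1, p, c2):
    """Build a 3-item transaction; the role prefix keeps a class used as
    subject type distinct from the same class used as object type."""
    return frozenset({f"c1 {c1}", p, f"c2 {c2}"})


def context(transaction, prop):
    """Items that co-occur with the property in a transaction."""
    return transaction - {prop}


transactions = [to_transaction(*row) for row in rows]
for transaction in transactions:
    print(sorted(transaction))

# Both properties appear with the same subject and object classes.
same_context = context(transactions[0], rows[0][1]) == context(transactions[1], rows[1][1])
print(same_context)
```

The two properties never share a transaction, since each transaction holds exactly one property, yet they occur with identical class items. That combination is what the covering set comparison is designed to pick up.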
Experimental Setup
FP-Growth: min number of itemsets = 500, max number of retries = 15, min support = 1.0E-4,
Create Substitutive Sets: min support = 1.0E-4, min common = 0.7, epsilon = 1.0E-5,
a sample of 100k results per query,
desktop computer with 12 GB RAM and an Intel(R) Core(TM) i5-4570 CPU at 3.20 GHz,
a single run of mining substitutive sets (for a single class and 100k transactions) took several seconds on average (ranging from 2 s to 12 s)
Sample substitutive properties for the class Organisation
Item X                      Item Y                      Common
dbpprop:parentOrganization  dbo:parentOrganisation      1.000
dbpprop:owner               dbo:owner                   1.000
dbpprop:origin              dbo:hometown                1.000
dbpprop:headquarters        dbpprop:parentOrganization  1.000
dbpprop:formerAffiliations  dbo:formerBroadcastNetwork  1.000
dbo:product                 dbpprop:products            1.000
dbpprop:keyPeople           dbo:keyPerson               0.910
dbpprop:commandStructure    dbpprop:branch              0.857
dbo:schoolPatron            dbo:foundedBy               0.835
dbpprop:notableCommanders   dbo:notableCommander        0.824
dbo:recordLabel             dbpprop:label               0.803
dbo:headquarter             dbo:locationCountry         0.803
dbpprop:country             dbo:state                   0.753
Summary
Introduced a model for substitutive itemsets mining
Preliminary tests of this model on the task of deduplicating object properties in an RDF dataset (DBpedia)
Acknowledgements
Foundation for Polish Science under the POMOST programme, cofinanced from European Union, Regional Development Fund (No POMOST/2013-7/8) (2013-2015)
EU FP7 ICT-2007.4.4 (No 231519) "e-LICO: An e-Laboratory for Interdisciplinary Collaborative Research in Data Mining and Data-Intensive Science" (2009-2012)
Thanks to Ewa Kowalczuk for debugging the RapidMiner plugin