Ilaria Tiddi, Mathieu d’Aquin, Enrico Motta
Learning to Assess Linked Data Relationships Using Genetic Programming
@IlaTiddi
20.10.2016
15th International Semantic Web Conference (ISWC 2016)
Research Problem
Automatically discover what makes a strong relationship between two entities in (the Web of) Linked Data.
• relationship: a semantic path between two entities
• automatically: through graph search techniques
[Figure: example graph connecting ASongOfIceAndFire (novel) and GoT via the nodes GeorgeRRMartin, UnitedStates, ASongOfIceAndFire (topic) and Fantasy, with edges labelled :author, :born, :airedIn and dc:subject]
Research Problem
Problem
• Entities/properties in a path might come from a number of different, unknown data sources
Solution (the easy one)
• indexing & preprocessing of a portion of Linked Data
• a priori knowledge, computational resources
Research Problem
Solution
• Find paths between entities through Link Traversal
• Incremental and agnostic graph exploration
• Perform uninformed (or blind) search over Linked Data
[Figure: link traversal starting from the two entities ASongOfIceAndFire (novel) and GoT]
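A minimal sketch of such a blind search, written as a breadth-first enumeration of paths over a tiny in-memory graph. The triples are a hypothetical reading of the example figure (edge directions are assumed), and `dereference` is an illustrative stand-in for looking up an entity's description on the Web; neither comes from the authors' implementation.

```python
from collections import deque

# Hypothetical triples, loosely based on the example figure (directions assumed).
TRIPLES = [
    ("ASongOfIceAndFire(novel)", ":author", "GeorgeRRMartin"),
    ("GeorgeRRMartin", ":born", "UnitedStates"),
    ("GoT", ":airedIn", "UnitedStates"),
    ("ASongOfIceAndFire(novel)", "dc:subject", "ASongOfIceAndFire(topic)"),
    ("GoT", "dc:subject", "ASongOfIceAndFire(topic)"),
    ("ASongOfIceAndFire(novel)", "dc:subject", "Fantasy"),
    ("GoT", "dc:subject", "Fantasy"),
]


def dereference(entity):
    """Stand-in for an HTTP lookup: (property, neighbour) pairs, ignoring direction."""
    links = []
    for s, p, o in TRIPLES:
        if s == entity:
            links.append((p, o))
        elif o == entity:
            links.append((p, s))
    return links


def find_paths(source, target, max_hops=3):
    """Breadth-first enumeration of simple paths from source to target."""
    paths = []
    queue = deque([[source]])  # a path alternates entity, property, entity, ...
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            paths.append(path)
            continue
        if len(path) // 2 >= max_hops:  # number of hops already taken
            continue
        for prop, neighbour in dereference(node):
            if neighbour not in path[::2]:  # do not revisit entities
                queue.append(path + [prop, neighbour])
    return paths


paths = find_paths("ASongOfIceAndFire(novel)", "GoT")
for p in paths:
    print(" ".join(p))
```

The search knows nothing about the graph in advance and expands every link it meets, which is why a cost-function is needed to decide which paths deserve to be followed first.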
Research Hypothesis
Problem
Uninformed searches require a cost-function to explore the graph following the most promising paths
Hypothesis
Linked Data information can drive a cost-function that detects strong relationships between entities
Research Questions
What makes a path strong?
• Which topological or semantic features of nodes/edges?
✗ e.g. length of a path? Entities of different datasets are connected by many paths of similar length
How can we use Linked Data to assess strong relationships?
• Which information do we need?
• Can we use structural features of the graph?
Challenges
• find topological/semantic features to detect strong relationships
• combine these features in a cost-function
• perform an effective blind search
Proposed Approach
• A set of topological/semantic characteristics of the Linked Data graph
• a benchmark of human-evaluated relationship paths
Identify the cost-function for a blind search that best performs in ranking sets of alternative relationship paths
Automatically learn a cost-function to detect strong relationships between Linked Data entities using a supervised method (Genetic Programming)
Proposed Approach
Genetic Programming: why?
• Flexible learning process
• Suitable for wide search spaces (such as Linked Data)
• Results assessed with a fitness (scores vs. functions)
• Human-understandable results
• Easy to integrate in a graph search
Genetic Programming
Programs (solutions for a problem)
• trees of primitives
• functions: internal nodes (mathematical or logical operations)
• terminals: leaf nodes (constants or variables)
Fitness function (evaluation)
• how well the program solves the problem
Genetic operations (evolution)
• reproduction
• crossover from two parents
• mutation from one parent
Termination condition
• maximum number of evolutions
• a desired fitness
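A minimal sketch of a program as a tree of primitives, using nested tuples for internal nodes and numbers or strings for terminals. The tuple encoding, the `FUNCTIONS` table and the "protected" division and logarithm (which keep every random tree evaluable) are assumptions made for illustration; the slides do not say how programs were encoded.

```python
import math


def protected_div(a, b):
    """Division that returns 1.0 instead of failing on a zero denominator (assumed convention)."""
    return a / b if b != 0 else 1.0


def protected_log(a):
    """Natural log that returns 0.0 for non-positive input (assumed convention)."""
    return math.log(a) if a > 0 else 0.0


# Function primitives: name -> (arity, implementation).
FUNCTIONS = {
    "add": (2, lambda a, b: a + b),
    "mul": (2, lambda a, b: a * b),
    "div": (2, protected_div),
    "log": (1, protected_log),
}


def evaluate(program, variables):
    """Recursively evaluate a program tree.

    A tree is a number (constant terminal), a string (variable terminal),
    or a tuple (function_name, child, ...).
    """
    if isinstance(program, (int, float)):
        return float(program)
    if isinstance(program, str):
        return float(variables[program])
    name, *children = program
    arity, fn = FUNCTIONS[name]
    if len(children) != arity:
        raise ValueError(f"{name} expects {arity} argument(s)")
    return fn(*(evaluate(child, variables) for child in children))


# The tree for log(min.cd) / (avg.cd + 87), with hypothetical variable values.
program = ("div", ("log", "min.cd"), ("add", "avg.cd", 87))
score = evaluate(program, {"min.cd": math.e, "avg.cd": 13.0})
print(score)
```

Crossover and mutation then amount to swapping or regenerating subtrees of such tuples, which is what makes the evolved programs readable as ordinary formulas.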
Genetic Programming
Procedure
• Create random population of programs based on the primitives
• Evolve population until an ideal situation is met
Genetic Programming
Given
• a starting population of randomly generated cost-functions
• sets of alternative paths between two Linked Data entities, ranked by humans
Determine how good each cost-function is in ranking paths compared to the human evaluators
Genetic Programming
Primitives
Constant terminals
• Z = {0, 1000}
Aggregated terminals
• Topological edge weights: indegree, outdegree, constant weight
• Semantic edge weights: usage of namespaces, taxonomies, vocabularies
• Aggregators along the path: sum, avg, min, max
Functions (combining different information)
• Math operations: addition, multiplication, division, log
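A minimal sketch of how aggregated terminals could be derived from per-step weights along one path. The `aggregator.weight` naming follows the pattern seen in the learnt programs (for example `max.in`), but the function name, the input layout and the numbers are hypothetical.

```python
from statistics import mean

# The four aggregators listed on the slide.
AGGREGATORS = {"sum": sum, "avg": mean, "min": min, "max": max}


def aggregated_terminals(path_weights):
    """Turn {weight_name: [value per step of the path]} into {'agg.weight': value}."""
    terminals = {}
    for weight, values in path_weights.items():
        for agg_name, agg in AGGREGATORS.items():
            terminals[f"{agg_name}.{weight}"] = agg(values)
    return terminals


# Hypothetical indegree ("in") and outdegree ("ou") values for a three-step path.
weights = {"in": [12, 340, 5], "ou": [30, 8, 2]}
terminals = aggregated_terminals(weights)
print(terminals["min.in"], terminals["max.in"], terminals["sum.ou"])
```

Each path thus yields one fixed-size dictionary of numbers, whatever its length, and those numbers are the variables a program can read.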
Genetic Programming
Fitness
Normalised Discounted Cumulative Gain (nDCG)
• (IR) quality of rankings provided by search engines based on the graded relevance of the returned documents
• how good a program is at ranking paths, based on human ranks
• avg(nDCG) across the dataset
• length penalty
Genetic operations
• Reproduction
• Crossover
• Mutation
Learning
• Training set + test set
• Keep the fittest program of each run on the training set
• Test them (discard inconsistent)
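A minimal sketch of the nDCG-based fitness, assuming the common log2 discount and assuming that a higher program score means a stronger path. The slides give neither the exact nDCG variant nor the form of the length penalty, so the penalty is left out here; all function names are illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance grades listed in ranked order."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(scores, human_relevance):
    """nDCG of the ranking induced by `scores` (higher score is ranked first)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [human_relevance[i] for i in order]
    ideal = sorted(human_relevance, reverse=True)
    best = dcg(ideal)
    return dcg(ranked) / best if best > 0 else 0.0


def fitness(scores_per_pair, relevance_per_pair):
    """Average nDCG over all entity pairs (length penalty omitted)."""
    values = [ndcg(s, r) for s, r in zip(scores_per_pair, relevance_per_pair)]
    return sum(values) / len(values)


# Three alternative paths for one pair, graded 2 / 0 / 1 by the judges (made-up data).
relevance = [2, 0, 1]
perfect = ndcg([0.9, 0.1, 0.5], relevance)  # program agrees with the judges
poor = ndcg([0.1, 0.9, 0.5], relevance)     # program ranks the worst path first
print(perfect, round(poor, 3))
```

A program that reproduces the human ordering for every pair reaches a fitness of 1; any disagreement near the top of a ranking lowers the score more than one near the bottom.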
Experiments
Dataset
Entities (random types from different sources)
• 12,630 events from Yago
• 8,185 people from the VIAF dataset
• 999 movies from the LMDB
• 1,174 countries/capitals from Geonames / the UNESCO dataset
Paths (a set of possible paths between them)
• select a random pair
• bidirectional breadth-first search
Assessment
• 100 pairs (~10 possible paths per pair)
• 8 judges
• from (2) highly relevant to (0) not relevant
[Figure: example path involving the nodes db:Dina-Korzun, viaf:Dina-Korzun, gn:Europe, gn:United-Kingdom and lmdb:TheSkinGame, with the properties owl:sameAs, dbo:citizenship, gno:parentFeature and foaf:based_near]
Experiments
Results
Different runs (fitness on training set / test set)
(T) Topological primitives only
(S) Topological + semantic primitives
(N) Topological + namespaces primitives

Run  Best program                                  Fitness TR  Fitness TS
T1   log(log(min.cd × min.cd))/max.cd              0.79        0.79
T2   log(min.cd)/(avg.cd + 87)                     0.77        0.78
T3   min.cd × (min.cd/max.cd)                      0.78        0.72
N1   (log((max.ns/max.cd))/avg.ns) + min.ns        0.82        0.81
N2   ((min.dg/sum.cd)/sum.ou) + min.ns             0.79        0.77
N3   min.ns/(log(max.cd)/avg.ns)                   0.83        0.75
S1   min.ns + (sum.ns/log(log(sum.si)))            0.88        0.83
S2   min.ns + (min.cd/log(log(sum.si)))            0.88        0.86
S3   min.ns + (log(max.in)/log(log(sum.si)))       0.87        0.86
Experiments
Results
Lower performance for T-runs and N-runs than for S-runs
Recurrent terminals
• conditional degree (node degree depending on the RDF triple)
• namespace variety
• number of topic properties (dc:subject/skos:broader/foaf:primaryTopic)
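A minimal sketch of applying one learnt program, run S2 from the results table, to rank two alternative paths. The terminal values are invented, the logarithm is assumed to be natural, and `sum.si` is read here as a summed count of topic categories, which the slides do not state explicitly. The formula is only defined when `sum.si` is greater than 1.

```python
import math


def s2(t):
    """Run S2: min.ns + (min.cd / log(log(sum.si))), natural log assumed."""
    return t["min.ns"] + t["min.cd"] / math.log(math.log(t["sum.si"]))


# Hypothetical terminal values for two alternative paths between the same pair.
specific_path = {"min.ns": 4, "min.cd": 20, "sum.si": 10}   # rich descriptions, few topics
generic_path = {"min.ns": 2, "min.cd": 20, "sum.si": 500}   # poorer descriptions, many topics

ranked = sorted(
    [("specific", s2(specific_path)), ("generic", s2(generic_path))],
    key=lambda item: item[1],
    reverse=True,
)
for name, value in ranked:
    print(name, round(value, 2))
```

With these made-up values the path with more namespaces and fewer topic categories scores higher, because `min.ns` is added directly and `sum.si` only appears in the denominator.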
Experiments
Comparative evaluation
Best programs
• automatically learnt
vs. literature functions
• RECAP, RelFinder, Everything Is Connected Engine, Moore et al.
• ad-hoc / handcrafted information-theoretical measures
Experiments
Which cost-function?
Interpretation
• pass through nodes with rich node descriptions: higher min_namespaces = higher path score
• not high-level entities / few topic categories: few incoming topic categories = higher path score
• more specific entities (not hubs) for paths with few topic categories: ratio conditional_degree / inTopicCategories
Specific paths are privileged over general paths
Conclusions
Contributions
A measure to detect strong relationships in Linked Data
• can be integrated in uninformed searches over Linked Data (vs. indexing/pre-processing techniques)
• derived empirically through Genetic Programming (vs. domain-specific / handcrafted measures)
• what is important in Linked Data: topological features + little knowledge about the edge vocabulary
Future work
• Integrate the measure in the blind-search process
• Explore more characteristics
• Improve the measure