From Linked Data to Tightly Integrated Data
DESCRIPTION
Invited Talk at the 3rd Workshop on Linked Data in Linguistics: Multilingual Knowledge Resources and Natural Language Processing. Reykjavik, Iceland, 27th May 2014.
The ideas behind the Web of Linked Data have great allure. Apart from the prospect of large amounts of freely available data, we are also promised nearly effortless interoperability. Common data formats and protocols have indeed made it easier than ever to obtain and work with information from different sources simultaneously, opening up new opportunities in linguistics, library science, and many other areas. In this talk, however, I argue that the true potential of Linked Data can only be appreciated when extensive cross-linkage and integration engender an even higher degree of interconnectedness. This can take the form of shared identifiers, e.g. those based on Wikipedia and WordNet, which can be used to describe numerous forms of linguistic and commonsense knowledge. An alternative is to rely on sameAs and similarity links, which can be discovered automatically using scalable approaches like the LINDA algorithm but need to be interpreted with great care, as we have observed in experimental studies. A closer level of linkage is achieved when resources are also connected at the taxonomic level, as exemplified by the MENTA approach to taxonomic data integration. Such integration means that one can buy into ecosystems already carrying a range of valuable pre-existing assets. Even more tightly integrated resources like Lexvo.org combine triples from multiple sources into unified, coherent knowledge bases. Finally, I also comment on how to address some remaining challenges that are still impeding a more widespread adoption of Linked Data on the Web. In the long run, I believe that such steps will lead us to significantly more tightly integrated Linked Data.
TRANSCRIPT
From Linked Data to Tightly Integrated Data
May 2014
Gerard de Melo
Tsinghua University, Beijing
25 Years of the World Wide Web: 1989−2014
http://geekcom.wordpress.com/2009/03/19/
Tim Berners-Lee
Documents for human viewing
From Text to Structured Data
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…
IE (information extraction)
Source: Marko Grobelnik, Dunja Mladenic. KDD 2007.
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
The Semantic Web
http://geekcom.wordpress.com/2009/03/19/
Tim Berners-Lee
colleague
born in Frankfurt
described by
created by
Publish data in the right form right from the start
The Semantic Web
Assign URIs not just to documents, but also to people, etc.
http://www.demelo.org/gdm/#GDM
http://dblp.l3s.de/d2r/page/publications/conf/cikm/MeloW09
Assign URIs to Predicates (Edge Types)
created by
http://purl.org/dc/elements/1.1/creator
Challenge: Simplify Publishing
http://www.gauson.com/blog/2007/12/09/minimal-template-for-blogspot/
Freebase: better UI, but not universal
Big Knowledge Graphs
YAGO2. Hoffart et al. WWW 2011.
Lexical Knowledge Bases
Etymological Wordnet
LREC 2014 Poster Session P17, 16:45-18:05
Also Christian Chiarcos today
Lexical Intensity Orderings
okay < good < great < superb (weak to strong)
de Melo & Bansal. Transactions of the ACL, 2013.
Metaphors: ICSI MetaNet Project
WebChild: Common-Sense
Common-sense relations, properties, comparisons
Tandon et al. WSDM 2014.
Tandon et al. AAAI 2014.
Tandon et al. AAAI 2011.
Linked Data in Use
Input: Keywords, the World's Data
Output: Address User's Needs
used in IBM's Jeopardy!-winning Watson system
The Plan
Linked Data
Really Linked Data
Integrated Data
Tightly Integrated Data
Really Linked Data
Just converting to RDF is trivial
Use entities instead of literals where possible
Book 23 --author--> "Franz Kafka"
Book 23 --author--> Author 14
Author 14 --name--> "Franz Kafka"
Author 14 --born in--> Prague
Performance 1 --language--> "en"
Performance 2 --language--> "English"
Performance 3 --language--> "engl."
Performance 1 --language--> English
Performance 2 --language--> English
Performance 3 --language--> English
English: http://lexvo.org/id/iso639-3/eng
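The step from language literals to a shared language entity can be sketched in a few lines of Python. This is only an illustration: the lookup table and the function name `language_uri` are hypothetical, and a real converter would need a much larger mapping. The URI prefix is taken from the Lexvo.org identifier shown on the slide.

```python
# Hypothetical lookup table: free-text language labels mapped to ISO 639-3 codes.
LABEL_TO_ISO639_3 = {
    "en": "eng",
    "english": "eng",
    "engl.": "eng",
}

# Prefix taken from the identifier on the slide.
LEXVO_ISO639_3_PREFIX = "http://lexvo.org/id/iso639-3/"


def language_uri(label):
    """Return the Lexvo.org language URI for a label, or None if unknown."""
    code = LABEL_TO_ISO639_3.get(label.strip().lower())
    return LEXVO_ISO639_3_PREFIX + code if code else None


# The three performances from the slide, each with a different literal.
performances = {
    "Performance 1": "en",
    "Performance 2": "English",
    "Performance 3": "engl.",
}

# After normalization all three point to the same entity.
linked = {name: language_uri(label) for name, label in performances.items()}
```

Unknown labels return None rather than a guessed URI, so they can be reviewed by hand instead of being linked incorrectly.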
Vocabulary / Ontology Re-Use
http://lov.okfn.org/
Linked Data Cloud
Identifiers and Cross-Linkage
Arguably more important than RDF as a format
Example: Google Knowledge Graph
Buy into rich existing ecosystems
Focal Point: WordNet
UWN (CIKM 2009): over 1,000,000 words in over 100 languages
UWN/MENTA: Universal WordNet
Focal Point: Lexvo.org
Lexvo.org: Ukrainian, Cyrillic (Script), Ukraine
GeoNames: Ukraine
Ukraine (Lexvo.org) owl:sameAs Ukraine (GeoNames)
Focal Point: Lexvo.org
Lexvo.org: Ukrainian, Cyrillic (Script), Ukraine
My Resource: Ukrainian
Lexvo.org API identifiers: .getLanguageURIforISO639P1("uk")
Focal Point: Lexvo.org
Lexvo.org API identifiers: .getTermURI("car", "eng")
RDF "car"@en l:means sumo:Automobile
lexvo:term/eng/car l:means sumo:Automobile
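RDF does not allow a literal such as "car"@en in the subject position of a triple, which is why a term needs a URI of its own. A minimal Python sketch of how such a term identifier could be built follows. The helper names `term_uri` and `means_triple` are hypothetical and are not the Lexvo.org API, and the full URI prefix is an assumed expansion of the `lexvo:term/` prefix shown above.

```python
from urllib.parse import quote

# Assumed expansion of the "lexvo:term/" prefix used on the slide.
LEXVO_TERM_PREFIX = "http://lexvo.org/id/term/"


def term_uri(term, iso639_3):
    """Build a term URI from a word and an ISO 639-3 language code."""
    # Percent-encode the term so spaces and non-ASCII characters are URI-safe.
    return f"{LEXVO_TERM_PREFIX}{iso639_3}/{quote(term, safe='')}"


def means_triple(term, iso639_3, concept):
    """Return a (subject, predicate, object) triple with the term URI as subject."""
    return (term_uri(term, iso639_3), "l:means", concept)


triple = means_triple("car", "eng", "sumo:Automobile")
```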
Focal Point: Lexvo.org
Semantic Web Journal 2014
Lexvo.org
Roget's Thesaurus
WordNet Evocation Links
Etymological WordNet
PropBank lexicon
NomBank lexicon
MPQA Subjectivity Lexicon
AFINN Affective Lexicon
CMU Pronunciation Dictionary
Linked Entities
Source: Gerhard Weikum. For a few Triples more.
LINDA: Creating Links
LINDA: Böhm et al. CIKM 2012
Scale to Billion Triples Challenge Dataset despite dependencies
SameAs Links
Ukraine (Lexvo.org) owl:sameAs Ukraine (GeoNames)
Leibnizian Identity
For all x: x = x
For all x, y, p: x = y => p(x) = p(y)
Identity vs. Near-Identity
Official standard & Leibniz
Automatic linkers & sameas.org
Einstein's Miracle Year owl:sameAs Einstein
Merging Lexical Resources
ACL 2010, AAAI 2013
Identity Constraints
Idea: Exploit dataset-specific Unique Names Assumptions
dbpedia: Paul
dbpedia: Paulie (redirect)
musicbrainz: Paulie
dblp: Paula
dbpedia: Paula
freebase: Paul
Use set-based formalism to account for exceptions and to avoid a quadratic number of pairwise constraints
Add edge weights (2, 1, 1, 1, 1 in the example graph)
Goal: Consistency, minimizing weighted edge deletions
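The detection side of this idea can be sketched in Python. The sketch below is not the algorithm from the talk: it only finds connected components of sameAs links that violate a set-based distinctness constraint and does not choose which edges to delete. The edge list and its weights are hypothetical, since the transcript gives the nodes but not the links between them.

```python
from collections import defaultdict


def components(nodes, edges):
    """Group nodes into connected components of the sameAs graph (union-find)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b, _weight in edges:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())


def violations(nodes, edges, constraints):
    """Report components that merge nodes from different groups of one constraint."""
    found = []
    for comp in components(nodes, edges):
        for name, groups in constraints.items():
            hit = [g & comp for g in groups if g & comp]
            if len(hit) > 1:  # two groups that must stay distinct were merged
                found.append((name, hit))
    return found


nodes = ["dbpedia:Paul", "dbpedia:Paulie", "musicbrainz:Paulie",
         "dblp:Paula", "dbpedia:Paula", "freebase:Paul"]

# Hypothetical sameAs edges with weights (the slide only shows 2, 1, 1, 1, 1).
edges = [
    ("dbpedia:Paul", "dbpedia:Paulie", 2),
    ("dbpedia:Paulie", "musicbrainz:Paulie", 1),
    ("musicbrainz:Paulie", "dblp:Paula", 1),
    ("dblp:Paula", "dbpedia:Paula", 1),
    ("dbpedia:Paul", "freebase:Paul", 1),
]

# Set-based constraint: DBpedia entries are distinct, except that a redirect
# shares a group with its target (the exception mentioned on the slide).
constraints = {
    "dbpedia": [{"dbpedia:Paul", "dbpedia:Paulie"}, {"dbpedia:Paula"}],
}

before = violations(nodes, edges, constraints)
# Deleting one weight-1 edge separates the two DBpedia groups.
repaired = [e for e in edges if e[:2] != ("musicbrainz:Paulie", "dblp:Paula")]
after = violations(nodes, repaired, constraints)
```

Grouping a redirect with its target needs one set per group rather than one constraint per node pair, which is the point of the set-based formalism.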
Algorithm
Capture separation between nodes, which requires edge deletions along all paths
See paper for details, incl. relationship to Hungarian Algorithm and Graph Cuts
Leighton & Rao style region growing
Experiments
BTC: Large Linked Data Web crawl, 20GB gzipped
sameas.org: Most well-known collection of sameAs links, aggregated from various Linked Data sources
>500,000 node pairs, but algorithm removes only 280,000 edges
Identity Links
Must distinguish identity from near-identity
Can automatically identify 500,000 inconsistent URI pairs
Fix using LP Graph Algorithm
Use more specific properties!
lvont:strictlySameAs (Lexvo.org)
skos:closeMatch
etc.
Questions?
Image: Question Answering over Linked Data Workshop
The Plan
Linked Data
Really Linked Data
Integrated Data
Tightly Integrated Data
Taxonomic Links
A user wants a list of "Art Schools in Europe"
Multilingual Taxonomies
A Swedish user wants a list of "Konstskolor i Europa" ("Art Schools in Europe")
MENTA
200+ Wikipedia editions, WordNet, etc.
Predict individual identity links: WordNet-Wikipedia, Article-Redirect, Article-Category, etc.
Predict individual taxonomic links: Article → Category, Category → WordNet
Taxonomic Links: MENTA
Use Identity Constraint Algorithm to form equivalence classes
Markov chain random walk with restarts to rank parents
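A random walk with restarts can be sketched as a simple power iteration in Python. This is a generic version of the technique, not MENTA's actual model: the graph, the node names, the restart probability of 0.15 and the handling of dead ends are all assumptions made for illustration.

```python
def random_walk_with_restart(graph, start, restart=0.15, iterations=100):
    """Approximate the stationary visit probabilities of a walk that jumps
    back to `start` with probability `restart` at every step.

    graph: dict mapping node -> dict of neighbor -> edge weight.
    """
    nodes = set(graph)
    for neighbors in graph.values():
        nodes.update(neighbors)
    scores = {n: 0.0 for n in nodes}
    scores[start] = 1.0
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for node, mass in scores.items():
            neighbors = graph.get(node, {})
            total = sum(neighbors.values())
            if total == 0:
                nxt[start] += mass  # dead end: all mass returns to the start
                continue
            nxt[start] += restart * mass
            for neighbor, weight in neighbors.items():
                nxt[neighbor] += (1 - restart) * mass * weight / total
        scores = nxt
    return scores


# Hypothetical taxonomic links: an article, its categories, candidate parents.
graph = {
    "Article": {"Category A": 1.0, "Category B": 1.0},
    "Category A": {"parent 1": 1.0},
    "Category B": {"parent 1": 1.0, "parent 2": 1.0},
}

scores = random_walk_with_restart(graph, "Article")
ranked = sorted(["parent 1", "parent 2"], key=scores.get, reverse=True)
```

In this toy graph "parent 1" is reachable along more paths than "parent 2" and therefore receives the higher score.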
UWN/MENTA
CIKM 2010 Best Paper Award
MENTA: Multilingual Entity Taxonomy
UWN/MENTA (de Melo & Weikum 2010)
● multilingual extension of WordNet, with 800,000 words in 250 languages
● 4.8 million instances/classes from multilingual Wikipedia editions
UWN/MENTA
multilingual extension of WordNet for word senses and taxonomical information over 200 languages
Questions?
Image: Question Answering over Linked Data Workshop
The Plan
Linked Data
Really Linked Data
Integrated Data
Tightly Integrated Data
Challenge: Locked Away Data
Hard to run advanced algorithms over a SPARQL interface
Many sites don't provide downloads.
Challenge: Lost Data
http://sparqles.okfn.org/
Servers offline, poor archiving
Dumps need to be archived and integrated.
Challenge: Updates
Need to be able to update when data changes
Need algorithmic solutions, not one-time process.
YAGO2s: Biega et al. 2013
Requirement: Integration Algorithm Pipelines
Input: Various Data
Output: Tightly Integrated Data
Lexvo.org
Semantic Web Journal 2014
Knowledge Graphs
Most large-scale knowledge bases have ground facts only
bornIn(Einstein, Ulm)
acquired(Microsoft, Powerset)
But language is much more expressive
● All humans are mortal.
● At least three but not more than 10 people know this secret.
● Three years ago, most people believed that Microsoft would buy Yahoo within months.
Challenge: Time
Temporal scope missing
Source: Gerhard Weikum. For a few Triples more.
OWL, RDFS, Description Logics
WebProtégé http://protege.stanford.edu/
Limit expressivity to get decidability.
Focus on class hierarchies and property axioms.
Cannot create new rules, e.g. to model "grandparent", "uncle", "legal adult"!
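To make the "grandparent" case concrete, here is a minimal Python sketch of the kind of rule meant, grandparent(x, z) if parent(x, y) and parent(y, z), evaluated directly over a set of facts. The function name and the facts are hypothetical, and the sketch stands in for a rule engine rather than for any particular rule language.

```python
def grandparents(parent_facts):
    """Derive grandparent(x, z) from parent(x, y) and parent(y, z)."""
    children = {}
    for parent, child in parent_facts:
        children.setdefault(parent, set()).add(child)
    return {
        (parent, grandchild)
        for parent, kids in children.items()
        for kid in kids
        for grandchild in children.get(kid, ())
    }


# Hypothetical ground facts: parent(Ann, Bob), parent(Bob, Cid), parent(Bob, Dee).
facts = {("Ann", "Bob"), ("Bob", "Cid"), ("Bob", "Dee")}
derived = grandparents(facts)
```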
Reasoning
Humans cannot act before being born (or, actually, before being conceived)
(=>
  (and
    (human ?HUMAN)
    (birthdate ?HUMAN ?T)
    (agent ?PROCESS ?HUMAN))
  (beforeOrEqual
    (daysBefore (BeginFn ?T) 365)
    (BeginFn (WhenFn ?PROCESS))))
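Read as a consistency check, the rule says that a process with a human agent must not start more than 365 days before that human's birthdate. A minimal Python sketch of that check follows. The function name is hypothetical, and the dates in the usage lines are illustrative apart from Einstein's birthdate.

```python
from datetime import date, timedelta


def violates_birth_rule(birthdate, process_start, slack_days=365):
    """True if the process starts earlier than `slack_days` before the birthdate.

    Mirrors the rule's consequent: (birthdate - 365 days) <= process start.
    """
    return process_start < birthdate - timedelta(days=slack_days)


einstein_born = date(1879, 3, 14)

# A process in 1905 is consistent with the rule.
ok = violates_birth_rule(einstein_born, date(1905, 6, 30))
# A process dated 1800 would be flagged as inconsistent.
flagged = violates_birth_rule(einstein_born, date(1800, 1, 1))
```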
Reasoning: SPASS-XDB
Search Interfaces
"Which companies were created during the last century in Silicon Valley?"
YAGO2: WWW 2011 Best Demo Award
Common-Sense Inference
I found the following restaurant near your current location: La Dolce Vita Pizza. 2318 Columbus Ave.
I'd rather have something healthier
Tandon et al. AAAI 2014
Conclusion
Really Linked Data
► Shared Identifiers
► Proper Interlinking
Integrated Data
► Taxonomical Integration
Tightly Integrated Data
► Processing Pipelines
► Towards Common-Sense Inference