KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
INSTITUTE FOR APPLIED INFORMATICS AND FORMAL DESCRIPTION METHODS
www.kit.edu
Deriving Human-Readable Labels from SPARQL Queries
Basil Ell, Denny Vrandečić, and Elena Simperl
7th International Conference on Semantic Systems, Graz
7 September 2011
KIT – Karlsruhe Institute of Technology Institute for Applied Informatics and Formal Description Methods
2 31.03.2014 Basil Ell – Deriving Human-Readable Labels from SPARQL queries
Outline
Motivation
Human-readability of the LOD cloud
Method
Evaluation
Conclusions
Introduction
Entities are identified by URIs, such as
http://de.dbpedia.org/resource/Graz
http://rdf.freebase.com/ns/m.043j22x
Human-readable names can be provided, e.g. using the property rdfs:label:
dbpedia:Austria rdfs:label "Österreich"@de
Motivation – Why are labels necessary? Scenario: linked data browsing
[SIGMA]
Is this meaningful to human users?
Human-Readability of the LOD Cloud
BTC2010 Corpus [BTC2010]
3,167,799,445 ntriples
159,177,123 distinct subjects
137,156,213 (86.17%) have no value for any of the properties rdfs:label, rdfs:comment, dc:title, and foaf:name.
61.8% of the analyzed non-information resources have no label (regarding 36 labeling properties) [Ell et al. 2011]
Main Idea
Can we automatically derive labels for entities by analyzing SPARQL queries?
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?station WHERE {
  ?station rdf:type dbo:RadioStation
}
The variable name "station" can be used as a label for
http://dbpedia.org/ontology/RadioStation
Analyzed data set
USEWOD2011 corpus [USEWOD2011]
Contains log files from DBpedia and Semantic Web Dog Food (SWDF)
Distinct parsable SPARQL SELECT queries:
1,212,932 (DBpedia)
195,641 (SWDF)
Classification of variable names
Class    Description
short    String length of up to 2 characters. Common: s, p, o, x.
stop     Known non-short strings that cannot be used as labels, e.g. subject, instance, uri.
lang     A non-stop string that belongs to a natural language or that consists of separated words of a natural language, e.g. Artist and RadioStation. Checked for the languages {de, en, es, fr, it} using the [Corpex] web service. (The Corpex dataset consists of all words and their frequencies as extracted and counted from instances of Wikipedia in multiple languages [Vrandecic et al. 2011].)
nolang   Variable names that are neither short, nor stop, nor lang.
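The four classes can be sketched as a small classifier. This is only an illustration under stated assumptions, not the authors' implementation: the stop list contains just the three examples from the table, the tiny `VOCABULARY` set is a hypothetical stand-in for the Corpex lookup, and all function names are made up for this sketch.

```python
import re

# Assumption: only the stop words named on the slide; the real list is not given.
STOP_WORDS = {"subject", "instance", "uri"}
# Assumption: a toy vocabulary replacing the Corpex web service lookup.
VOCABULARY = {"artist", "radio", "station", "place", "population"}


def split_words(name):
    """Split a variable name on separators and camel-case boundaries."""
    spaced = re.sub(r"[_\-]+", " ", name)
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)
    return spaced.lower().split()


def classify_variable(name, vocabulary=VOCABULARY, stop_words=STOP_WORDS):
    """Return one of 'short', 'stop', 'lang', 'nolang' for a variable name."""
    if len(name) <= 2:
        return "short"
    if name.lower() in stop_words:
        return "stop"
    words = split_words(name)
    if words and all(word in vocabulary for word in words):
        return "lang"
    return "nolang"
```

The checks run in the order of the table: a name is tested for lang only if it is neither short nor stop, and nolang is whatever remains.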
Classification of triple patterns
Triple pattern classes P = {RRV, RVR, VRL, ...}
R is a resource, V is a variable, L is a literal
Ignoring features such as UNION, OPTIONAL etc.
Example of an RRV pattern:
SELECT ... WHERE {
  ...
  dbpedia:Karlsruhe dbo:populationTotal ?population .
  ...
}
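A triple pattern can be mapped to its class by classifying each of its three terms. The sketch below assumes the pattern has already been tokenized into three non-empty strings; the helper names are hypothetical, and treating everything that is neither a variable nor a literal (including blank nodes) as a resource is a simplification made for this example.

```python
def term_class(term):
    """Map one term to 'V' (variable), 'L' (literal) or 'R' (resource)."""
    if term.startswith(("?", "$")):
        return "V"
    is_quoted = term.startswith(('"', "'"))
    is_numeric = term[0].isdigit() or term[0] in "+-."
    if is_quoted or is_numeric or term in ("true", "false"):
        return "L"
    # Assumption: IRIs, prefixed names and anything else count as resources.
    return "R"


def triple_pattern_class(subject, predicate, obj):
    """Return the three-letter class of a triple pattern, e.g. 'RRV'."""
    return "".join(term_class(term) for term in (subject, predicate, obj))


print(triple_pattern_class("dbpedia:Karlsruhe", "dbo:populationTotal", "?population"))
```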
Classification of triple patterns (2)
[Figure: triple pattern classes (DBpedia)]
DBpedia – top query patterns (pruned, n >= 5000)
[Figure: graph pattern classes visualized as a hypergraph; n = number of instances, TP = name of triple pattern]
Example: 8,312 queries consist of one VVL triple and three VRV triples
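Counting how many queries share the same combination of triple pattern classes can be sketched as follows. The sketch assumes each query has already been reduced to a list of triple pattern class names; representing a graph pattern class as a sorted tuple of (class, count) pairs is a choice made for this example, not something stated on the slides.

```python
from collections import Counter


def graph_pattern_class(triple_classes):
    """Describe a query by how often each triple pattern class occurs in it."""
    counts = Counter(triple_classes)
    return tuple(sorted(counts.items()))


def top_graph_patterns(queries, min_count):
    """Count queries per graph pattern class and prune classes below min_count."""
    totals = Counter(graph_pattern_class(query) for query in queries)
    return {pattern: n for pattern, n in totals.items() if n >= min_count}


# Toy input: two queries with one VVL and three VRV triples, one single-RRV query.
queries = [
    ["VVL", "VRV", "VRV", "VRV"],
    ["VRV", "VVL", "VRV", "VRV"],
    ["RRV"],
]
result = top_graph_patterns(queries, min_count=2)
```

With `min_count=2` only the class made of three VRV triples and one VVL triple survives the pruning, mirroring the `n >= 5000` threshold used for the figure.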
SWDF – top query patterns (pruned, n >= 1000)
[Figure: graph pattern classes visualized as a hypergraph; n = number of instances, TP = name of triple pattern]
Derivation pattern 1: 1 x RRV (31.75% of all DBpedia queries)
Assumption: V' is a human-readable label for property R2 iff local_name(R2) = V and lang(V).
V' can be derived from V by substituting separators and splitting camel-cased words into constituents.
Example:
<http://dbpedia.org/page/NASA> (R1)
<http://dbpedia.org/property/agencyName> (R2)
?agencyName (V)
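Derivation pattern 1 can be sketched in a few lines. The function names are hypothetical, the `is_lang` argument is a placeholder for the lang(V) check (its default accepts every name, which the real check would not), and lowercasing the derived label is an assumption of this sketch.

```python
import re


def local_name(uri):
    """Return the part of a URI after the last '/' or '#'."""
    return re.split(r"[#/]", uri.strip("<>"))[-1]


def to_label(variable):
    """Derive V' from V: substitute separators and split camel-cased words."""
    spaced = re.sub(r"[_\-.]+", " ", variable)
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)
    return " ".join(spaced.lower().split())


def derive_property_label(property_uri, variable, is_lang=lambda name: True):
    """Return a label for the property if local_name(R2) = V and lang(V) hold."""
    # Assumption: is_lang stands in for the Corpex-based language check.
    if local_name(property_uri) == variable and is_lang(variable):
        return to_label(variable)
    return None


label = derive_property_label("http://dbpedia.org/property/agencyName", "agencyName")
```

For the example above, `label` is "agency name"; a variable such as `population` queried with `dbo:populationTotal` yields no label because the local name and the variable differ.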
Derivation pattern 2: Any graph with VRR (22.32% of all DBpedia queries)
Assumption: V' is a human-readable label for class R2 iff lang(V) and R1 = rdf:type
Example: ?place rdf:type dbo:Location
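Derivation pattern 2 can be sketched as a pass over the triple patterns of a query that collects candidate class labels. This sketch assumes prefixed names (so rdf:type appears literally as `rdf:type`), takes the lang(V) check as a caller-supplied function, and returns the raw variable names; turning V into V' is left out here.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # assumption: patterns use the prefixed form


def candidate_class_labels(triple_patterns, is_lang):
    """Collect variable names used as subjects of 'rdf:type <class>' patterns."""
    candidates = defaultdict(set)
    for subject, predicate, obj in triple_patterns:
        # Simplification: any non-variable predicate and object count as resources.
        is_vrr = (
            subject.startswith("?")
            and not predicate.startswith("?")
            and not obj.startswith("?")
        )
        name = subject[1:]
        if is_vrr and predicate == RDF_TYPE and is_lang(name):
            candidates[obj].add(name)
    return dict(candidates)


patterns = [
    ("?place", "rdf:type", "dbo:Location"),
    ("?paper", "swc:isPartOf", "swdf:proceedings"),  # VRR, but R1 is not rdf:type
    ("?x", "rdf:type", "dbo:Location"),              # rejected by the stand-in check
]
# Hypothetical stand-in for lang(V): accept names longer than two characters.
found = candidate_class_labels(patterns, is_lang=lambda name: len(name) > 2)
```

Only the first pattern contributes a candidate: `found` maps `dbo:Location` to the set containing "place".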
VRR triple pattern:
?paper (V)
<http://data.semanticweb.org/ns/swc/ontology#isPartOf> (R1)
<http://data.semanticweb.org/conference/www/2009/proceedings> (R2)
Evaluation – 1 x RRV
1,366,363 triples of class RRV
549,093 cases: local_name(R2) = V
817,269 cases: local_name(R2) ≠ V
226 pairs (URI, guessed label)
54.5% correct: sufficiently similar to existing labels
14% correct: manual evaluation
9.1% correct within a given context (e.g. "location" for dbo:residence)
22.4% wrong (e.g. "contained" for dbprop:creator)
68% correct (similar to existing labels or confirmed by manual evaluation)
Evaluation – Any graph with VRR
80,455 triples of class VRR
549,093 cases: local_name(R2) = V
60 distinct URIs, 36 labels
25% correct: sufficiently similar to existing labels
39.975% correct: manual evaluation
35.025% wrong (e.g. "scientist" for dbo:SoccerPlayer)
64.975% correct (similar to existing labels or confirmed by manual evaluation)
Conclusions
Approach for automatically deriving labels
Acceptable precision: most derived labels matched the already existing labels (atypical datasets)
Labels derived from variable names are less specific
Labels were derived for terminological entities (properties and classes), not for instances.
References & Acknowledgements
[BTC2010] http://km.aifb.kit.edu/projects/btc-2010/
[Ell et al. 2011] Labels in the Web of Data, ISWC2011, to appear.
[SIGMA] http://sig.ma/search?q=Sidney+Bechet
[USEWOD2011] http://data.semanticweb.org/usewod/2011/challenge.html
[Corpex] http://km.aifb.kit.edu/sites/corpex/
[Vrandecic et al. 2011]
Part of this work has been carried out in the framework of the German Research Foundation (DFG) project entitled: "Entwicklung einer Virtuellen Forschungsumgebung für die Historische Bildungsforschung mit Semantischer Wiki-Technologie - Semantic MediaWiki for Collaborative Corpora Analysis" (INST 5580/1-1), in the domain of "Scientific Library Services and Information Systems" (LIS).
THANK YOU FOR YOUR ATTENTION
BACKUP SLIDES
Triple pattern classes (SWDF)