Transcript
Page 1: Deriving human readable labels from sparql queries

KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association

INSTITUTE FOR APPLIED INFORMATICS AND FORMAL DESCRIPTION METHODS

www.kit.edu

Deriving Human-Readable Labels from SPARQL Queries

Basil Ell, Denny Vrandečić, and Elena Simperl

7th International Conference on Semantic Systems, Graz

7 September 2011

Page 2: Deriving human readable labels from sparql queries

KIT – Karlsruhe Institute of Technology Institute for Applied Informatics and Formal Description Methods

2 31.03.2014 Basil Ell – Deriving Human-Readable Labels from SPARQL queries

Outline

Motivation

Human-readability of the LOD cloud

Method

Evaluation

Conclusions

Page 3: Deriving human readable labels from sparql queries


Introduction

Entities are identified by URIs, such as

http://de.dbpedia.org/resource/Graz

http://rdf.freebase.com/ns/m.043j22x

Human-readable names can be provided, e.g. using the property rdfs:label:

dbpedia:Austria rdfs:label "Österreich"@de


Page 4: Deriving human readable labels from sparql queries


Motivation – Why are labels necessary? Scenario: linked data browsing


[SIGMA]

Is this meaningful to human users?

Page 5: Deriving human readable labels from sparql queries


Human-Readability of the LOD Cloud

BTC2010 Corpus [BTC2010]

3,167,799,445 triples (N-Triples)

159,177,123 distinct subjects

137,156,213 (86.17%) have no value for any of the properties rdfs:label, rdfs:comment, dc:title, and foaf:name.

61.8% of the analyzed non-information resources have no label (regarding 36 labeling properties) [Ell et al. 2011]
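A statistic of this kind could be computed roughly as in the following minimal sketch. It is not the tooling used for the BTC2010 analysis. The function name unlabeled_share and the tuple-based input are assumptions made for illustration, and dc:title is assumed to refer to the Dublin Core elements 1.1 namespace.

```python
# The four labeling properties named on the slide, as full URIs.
# Assumption: "dc:" stands for the Dublin Core elements 1.1 namespace.
LABEL_PROPERTIES = {
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://www.w3.org/2000/01/rdf-schema#comment",
    "http://purl.org/dc/elements/1.1/title",
    "http://xmlns.com/foaf/0.1/name",
}


def unlabeled_share(triples):
    """Return the fraction of distinct subjects without any labeling property.

    `triples` is an iterable of (subject, predicate, object) string tuples;
    this input format is hypothetical and chosen to keep the sketch short.
    """
    subjects = set()
    labeled = set()
    for subject, predicate, _ in triples:
        subjects.add(subject)
        if predicate in LABEL_PROPERTIES:
            labeled.add(subject)
    if not subjects:
        return 0.0
    return (len(subjects) - len(labeled)) / len(subjects)
```

Collecting subjects in sets keeps the count independent of how often a subject occurs in the corpus.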


Page 6: Deriving human readable labels from sparql queries


Main Idea

Can we automatically derive labels for entities by analyzing SPARQL queries?

The variable name station can be used as a label for http://dbpedia.org/ontology/RadioStation


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?station WHERE {

?station rdf:type dbo:RadioStation

}

Page 7: Deriving human readable labels from sparql queries


Analyzed data set

USEWOD2011 corpus [USEWOD2011]

Contains log files from DBpedia and SWDF

Distinct parsable SPARQL SELECT queries:

1,212,932 (DBpedia)

195,641 (SWDF)


SWDF: Semantic Web Dog Food

Page 8: Deriving human readable labels from sparql queries


Classification of variable names

Class: Description

short: String length up to 2 characters. Common: s, p, o, x.

stop: Known non-short strings that cannot be used as labels, e.g. subject, instance, uri.

lang: A non-stop string that belongs to a natural language or that consists of separated words of a natural language, e.g. Artist and RadioStation. Checked for the languages {de, en, es, fr, it} using the [Corpex] web service. (The Corpex dataset consists of all words and their frequencies as extracted and counted from instances of Wikipedia in multiple languages. [Vrandecic et al. 2011])

nolang: Variable names that are neither short, nor stop, nor lang.
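The four classes can be illustrated with a minimal sketch. It does not call the Corpex web service: the small KNOWN_WORDS set is a hypothetical stand-in for the Corpex lookup, STOP_WORDS contains only the three examples named on the slide, and the case-insensitive comparison is an assumption.

```python
import re

# Only the three stop strings named on the slide; the real list is longer.
STOP_WORDS = {"subject", "instance", "uri"}

# Hypothetical stand-in vocabulary replacing the Corpex lookup.
KNOWN_WORDS = {"artist", "radio", "station", "population", "place"}


def split_words(name):
    """Split a variable name at separators and camel-case boundaries."""
    spaced = re.sub(r"[_\-]+", " ", name)
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)
    return spaced.lower().split()


def classify_variable(name):
    """Return 'short', 'stop', 'lang', or 'nolang' for a variable name."""
    if len(name) <= 2:
        return "short"
    if name.lower() in STOP_WORDS:
        return "stop"
    words = split_words(name)
    if words and all(word in KNOWN_WORDS for word in words):
        return "lang"
    return "nolang"
```

The checks run in the order of the table, so each class only sees names that the earlier classes rejected.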


Page 9: Deriving human readable labels from sparql queries


Classification of triple patterns

Triple pattern classes P = {RRV, RVR, VRL, ...}

R is a resource, V is a variable, L is a literal

Ignoring features such as UNION, OPTIONAL etc.


SELECT ... WHERE {

...

dbpedia:Karlsruhe dbo:populationTotal ?population .

...

}

RRV pattern
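Assigning a triple pattern to its class can be sketched as follows. This is a simplified illustration, not a SPARQL parser: the helper names term_kind and pattern_class are hypothetical, terms are assumed to be non-empty strings, and a literal is assumed to start with a quote or a digit.

```python
def term_kind(term):
    """Map one term of a triple pattern to V, L, or R.

    Simplifying assumptions: variables start with '?' or '$', literals
    start with a quote or a digit, and everything else is a resource.
    """
    if term.startswith(("?", "$")):
        return "V"
    if term.startswith(('"', "'")) or term[0].isdigit():
        return "L"
    return "R"


def pattern_class(subject, predicate, obj):
    """Return the three-letter class of a triple pattern, e.g. 'RRV'."""
    return term_kind(subject) + term_kind(predicate) + term_kind(obj)
```

Applied to the pattern above, pattern_class("dbpedia:Karlsruhe", "dbo:populationTotal", "?population") yields RRV.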

Page 10: Deriving human readable labels from sparql queries


Classification of triple patterns (2)


[Figure: triple pattern classes (DBpedia)]

Page 11: Deriving human readable labels from sparql queries


DBpedia – top query patterns (pruned n >= 5000)


Graph pattern classes visualized as hypergraph (n: number of instances, TP: name of triple pattern)

Example: 8312 queries consist of one VVL triple and three VRV triples

Page 12: Deriving human readable labels from sparql queries


SWDF – top query patterns (pruned n >= 1000)


Graph pattern classes visualized as hypergraph (n: number of instances, TP: name of triple pattern)

Page 13: Deriving human readable labels from sparql queries


Derivation pattern 1: 1 x RRV (31.75% of all DBpedia queries)

Assumption: V' is a human-readable label for property R2 iff local_name(R2) = V and lang(V).

V' can be derived from V by substituting separators and splitting camel-cased words into constituents.


R1: <http://dbpedia.org/page/NASA>

R2: <http://dbpedia.org/property/agencyName>

V: ?agencyName
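Derivation pattern 1 can be sketched as below, under stated assumptions. The lang(V) check is passed in as a caller-supplied predicate (is_lang) because the Corpex lookup is not reproduced here. The function names are hypothetical, and lowercasing the result is an added assumption, since the slide only mentions substituting separators and splitting camel case.

```python
import re


def local_name(uri):
    """Return the part of a URI after the last '/' or '#'."""
    trimmed = uri.rstrip("/#")
    cut = max(trimmed.rfind("/"), trimmed.rfind("#"))
    return trimmed[cut + 1:]


def to_label(variable):
    """Derive V' from V: replace separators and split camel case.

    Lowercasing is an assumption, not something the slide states.
    """
    spaced = re.sub(r"[_\-]+", " ", variable)
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)
    return spaced.lower().strip()


def derive_property_label(predicate_uri, variable, is_lang):
    """Return a label for property R2 if local_name(R2) = V and lang(V)."""
    if local_name(predicate_uri) == variable and is_lang(variable):
        return to_label(variable)
    return None
```

For the NASA example, the property http://dbpedia.org/property/agencyName with the variable agencyName gives the label "agency name", provided the lang check accepts the variable.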

Page 14: Deriving human readable labels from sparql queries


Derivation pattern 2: Any graph with VRR (22.32% of all DBpedia queries)

Assumption: V' is a human-readable label for class R2 iff lang(V) and R1 = rdf:type

Example: ?place rdf:type dbo:Location


V: ?paper

R1: <http://data.semanticweb.org/ns/swc/ontology#isPartOf>

R2: <http://data.semanticweb.org/conference/www/2009/proceedings>
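Derivation pattern 2 can be sketched in the same spirit. The function names and the is_lang predicate are hypothetical, and lowercasing the label is an assumption. Under the stated assumption, a VRR triple only yields a class label when R1 is rdf:type, so the ?paper triple with isPartOf produces none.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def to_label(variable):
    """Replace separators and split camel case; lowercasing is assumed."""
    spaced = re.sub(r"[_\-]+", " ", variable)
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)
    return spaced.lower().strip()


def derive_class_label(variable, predicate_uri, object_uri, is_lang):
    """Return (class URI, label) if R1 = rdf:type and lang(V), else None."""
    if predicate_uri == RDF_TYPE and is_lang(variable):
        return (object_uri, to_label(variable))
    return None
```

With ?place rdf:type dbo:Location, the sketch pairs http://dbpedia.org/ontology/Location with the label "place".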

Page 15: Deriving human readable labels from sparql queries


Evaluation – 1 x RRV


1,366,363 triples of class RRV

549,093 cases: local_name(R2) = V

817,269 cases: local_name(R2) ≠ V

226 pairs (URI, guessed label)

54.5% correct: sufficiently similar to existing labels

14% correct: manual evaluation

9.1% correct within a given context (location for dbo:residence)

22.4% wrong (contained for dbprop:creator)

68% correct in total (similar to existing labels or confirmed by manual evaluation)

Page 16: Deriving human readable labels from sparql queries


Evaluation – Any graph with VRR


80,455 triples of class VRR

549,093 cases: local_name(R2) = V

60 distinct URIs, 36 labels

25% correct: sufficiently similar to existing labels

39.975% correct: manual evaluation

35.025% wrong (scientist for dbo:SoccerPlayer)

64.975% correct in total (similar to existing labels or confirmed by manual evaluation)

Page 17: Deriving human readable labels from sparql queries


Conclusions

Approach for automatically deriving labels

Acceptable precision: most derived labels matched the already existing labels (atypical datasets)

Derived variable names less specific

Derived labels for terminological entities (properties and classes), not for instances.


Page 18: Deriving human readable labels from sparql queries


References & Acknowledgements

[BTC2010] http://km.aifb.kit.edu/projects/btc-2010/

[Ell et al. 2011] Labels in the Web of Data, ISWC2011, to appear.

[SIGMA] http://sig.ma/search?q=Sidney+Bechet

[USEWOD2011] http://data.semanticweb.org/usewod/2011/challenge.html

[Corpex] http://km.aifb.kit.edu/sites/corpex/

[Vrandecic et al. 2011]


Part of this work has been carried out in the framework of the German Research Foundation (DFG) project entitled "Entwicklung einer Virtuellen Forschungsumgebung für die Historische Bildungsforschung mit Semantischer Wiki-Technologie - Semantic MediaWiki for Collaborative Corpora Analysis" (INST 5580/1-1), in the domain of "Scientific Library Services and Information Systems" (LIS).

Page 19: Deriving human readable labels from sparql queries


THANK YOU FOR YOUR ATTENTION


Page 20: Deriving human readable labels from sparql queries


BACKUP SLIDES


Page 21: Deriving human readable labels from sparql queries


Triple pattern classes (SWDF)

