Turning Data into Knowledge (KESW2014 Keynote)

Upload: stefan-dietze

Post on 05-Dec-2014

DESCRIPTION

Keynote at KESW2014 (http://2014.kesw.ru/) about Dataset Characterisation and Linking.

TRANSCRIPT

Page 1: Turning Data into Knowledge (KESW2014 Keynote)

Turning Data into Knowledge –

profiling and interlinking Web datasets

Stefan Dietze

L3S Research Center

- KESW2014 -

30/09/14 1 Stefan Dietze KESW2014

Page 2: Turning Data into Knowledge (KESW2014 Keynote)

Introduction

Recent work on Linked Data exploration/discovery/search

Entity interlinking & dataset interlinking recommendation

Dataset profiling

Data consistency & conflicts

Research areas

Web science, Information Retrieval, Semantic Web & Linked Data, data & knowledge integration (mapping, classification, interlinking)

Application domains: education/TEL, Web archiving, …

Some projects

http://www.l3s.de/

See also: http://purl.org/dietze

Page 3: Turning Data into Knowledge (KESW2014 Keynote)

Linked Data is awesome, but...

„HTTP-accessibility“ (SPARQL, URI-dereferencing)

„Structure“ & „Semantics“ (=> shared/linked vocabularies)

„Interlinked“

„Persistent“

Hm, really?

…why are there so few datasets actually used?

Data reuse and in-links focused on trusted „reference graphs“ such as DBpedia, Freebase etc.

Long tail of LD datasets which are neither reused nor linked to (LOD Cloud alone: 300+ datasets, 50 bn triples)

Explanations?

Page 4: Turning Data into Knowledge (KESW2014 Keynote)

Linked data is more diverse (and messy) than we think

SPARQL endpoint availability over time [Buil-Aranda et al 2013]

Accessibility of datasets?

Fewer than 50% of all SPARQL endpoints are actually responsive at any given point in time [Buil-Aranda2013]

“THE” SPARQL protocol? No, but many variants & subsets
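One way to make "responsive" concrete is to send each endpoint a trivial query and check whether it answers in time. The following is a minimal sketch of that idea, not the measurement setup of [Buil-Aranda2013]; the function names, the `ASK` probe query, the timeout and the `Accept` header are all assumptions chosen for illustration.

```python
from http.client import HTTPException
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def probe_url(endpoint: str) -> str:
    """Build a GET URL that sends a trivial ASK query to a SPARQL endpoint."""
    separator = "&" if "?" in endpoint else "?"
    return endpoint + separator + urlencode({"query": "ASK { ?s ?p ?o }"})


def http_ok(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    request = Request(url, headers={"Accept": "application/sparql-results+json"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError, HTTPException):
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def responsive_share(endpoints, check=http_ok) -> float:
    """Fraction of endpoints for which check(probe_url(endpoint)) is True."""
    endpoints = list(endpoints)
    if not endpoints:
        return 0.0
    return sum(1 for e in endpoints if check(probe_url(e))) / len(endpoints)
```

Passing a different `check` function makes it possible to try the logic without network access; a single probe only gives a snapshot, so availability over time would need repeated runs.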

“Semantics”, links, quality?

…data accuracy (e.g. DBpedia)? [Paulheim2013]

…vocabulary reuse? [D’AquinWebSci13]

…schema compliance (RDFS, schemas) [HoganJWS2012]

SPARQL Web-Querying Infrastructure: Ready for Action?, Carlos Buil-Aranda, Aidan Hogan, Jürgen Umbrich, Pierre-Yves Vandenbussche, International Semantic Web Conference 2013 (ISWC2013).

Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.

Type Inference on Noisy RDF Data, Paulheim, H., Bizer, C., Semantic Web – ISWC 2013, Lecture Notes in Computer Science Volume 8218, 2013, pp 510-525.

An empirical survey of Linked Data conformance, Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S., Journal of Web Semantics 14, 2012.

Page 5: Turning Data into Knowledge (KESW2014 Keynote)

What about data consistency?

Analyzing Relative Incompleteness of Movie Descriptions in the Web of Data: A Case Study, Yuan, W., Demidova, E., Dietze, S., Zhu, X., International Semantic Web Conference 2014 (ISWC2014)

Page 6: Turning Data into Knowledge (KESW2014 Keynote)

Too many/diverse datasets, too little knowledge

Topics? Which datasets are useful & trustworthy for case XY (e.g. „learning about the solar system“)? Which topics are covered?

Types? Which datasets describe statistics, videos, slides, publications etc.?

Quality? Currentness, dynamics, accessibility/reliability, data quantity & quality?

Page 7: Turning Data into Knowledge (KESW2014 Keynote)

Data mapping, linking and profiling

Schema mappings [WebSci13]

Entity & dataset disambiguation & linking [ESWC13]

Topic profile extraction [WWW13, ESWC14]

[Diagram labels: Dataset Catalog/Registry; Dataset Metadata; vocabularies BIBO, AAISO, FOAF; types po:Programme, yov:Video, bibo:Film; topics db:Astronomy, db:Astro. Objects; relation "contains"]

BBC Programme

<po:Programme …>

<po:Series>Wonders of the Solar System</.>

<po:Actor>Brian Cox</…>

</po:Programme…>

Yovisto Video

<yo:Video …>

<dc:title>Pluto & the Dwarf Planets</dc:title>

</yo:Video…>

Page 8: Turning Data into Knowledge (KESW2014 Keynote)

Schemas/vocabularies on the Web: XKCD 927

https://xkcd.com/927/ schemas & vocabularies

Page 9: Turning Data into Knowledge (KESW2014 Keynote)

Schema assessment and mapping

Co-occurrence of data types (in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties)

[Figure: co-occurrence of data types; highlighted types: po:Programme, sioc:Item, yov:Video, ?]

Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.

Page 10: Turning Data into Knowledge (KESW2014 Keynote)

Schema assessment and mapping

Co-occurrence of data types (in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties)

Co-occurrence after mapping into most frequent schemas (201 frequent types mapped into 79 classes)
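The effect of such a mapping on co-occurrence counts can be sketched in a few lines. This is an illustration only: it assumes that two types "co-occur" when the same dataset uses both, which the slide does not spell out, and the toy datasets and the mapping onto bibo:Film are hypothetical.

```python
from collections import Counter
from itertools import combinations


def type_cooccurrence(dataset_types, mapping=None):
    """Count, for each unordered pair of types, the datasets that use both.

    If a mapping is given, every type is first replaced by its mapped type.
    """
    mapping = mapping or {}
    counts = Counter()
    for types in dataset_types.values():
        mapped = sorted({mapping.get(t, t) for t in types})
        counts.update(combinations(mapped, 2))
    return counts


# Toy data (hypothetical, not taken from the 146 surveyed datasets).
dataset_types = {
    "bbc": {"po:Programme", "foaf:Person"},
    "yovisto": {"yov:Video", "foaf:Person"},
}
# Hypothetical mapping of dataset-specific types onto one shared type.
mapping = {"po:Programme": "bibo:Film", "yov:Video": "bibo:Film"}

before = type_cooccurrence(dataset_types)
after = type_cooccurrence(dataset_types, mapping)
```

Before mapping, each pair occurs in only one dataset; after mapping, both datasets contribute to the same (bibo:Film, foaf:Person) pair, which is the kind of consolidation the mapping aims for.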

[Figure: co-occurrence of data types after mapping; highlighted types: bibo:Film, bibo:Document, foaf:Document, po:Programme, sioc:Item, yov:Video, typeX]

Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.

Page 11: Turning Data into Knowledge (KESW2014 Keynote)

Application: LinkedUp Data Catalog in a nutshell

http://datahub.io/group/linked-education

RDF (VoID) dataset catalog: browse & query distributed datasets

Federated queries using type mappings

Live information about endpoint accessibility
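A query that uses type mappings could, for instance, be expanded so that it matches a target type together with every type mapped to it. The sketch below only covers this rewriting step and is not the catalog's implementation; the function name, the mapping format (source type to target type) and the example.org IRIs are assumptions.

```python
def expand_type_query(target_type, type_mappings, limit=10):
    """Build a SPARQL query matching the target type and all types mapped to it."""
    types = [target_type] + sorted(
        source for source, target in type_mappings.items() if target == target_type
    )
    values = " ".join(f"<{iri}>" for iri in types)
    return (
        "SELECT ?resource ?type WHERE {\n"
        f"  VALUES ?type {{ {values} }}\n"
        "  ?resource a ?type .\n"
        f"}} LIMIT {limit}"
    )


# Hypothetical mappings: source type IRI -> target type IRI.
type_mappings = {
    "http://example.org/Programme": "http://example.org/Film",
    "http://example.org/Video": "http://example.org/Film",
    "http://example.org/Person": "http://example.org/Agent",
}
query = expand_type_query("http://example.org/Film", type_mappings)
```

The resulting query string would then be sent to each catalogued endpoint in turn (or wrapped in SERVICE clauses), which is left out here.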

http://data.linkededucation.org/linkedup/catalog/

DBpedia categories

Page 12: Turning Data into Knowledge (KESW2014 Keynote)

Towards profiling: dataset disambiguation/linking

Relatedness of entities, meaningfulness of paths? [ESWC13]

Extraction of “topics” & relatedness of datasets [ESWC14]

[Diagram: the BBC Programme (po:Programme) and Yovisto Video (yov:Video) records from page 7, with candidate links to db:Astro. Objects and db:CartoonCharacters, each marked “?”]

Page 13: Turning Data into Knowledge (KESW2014 Keynote)

Dataset disambiguation/linking

[Diagram: the BBC Programme (po:Programme) and Yovisto Video (yov:Video) records from page 7, linked to db:Pluto (Dwarf Planet), db:Astronomical Objects, db:Sun and db:Astronomy; example scores SCS = 0.32, CBM = 0.24]

Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl, ESWC 2013 - 10th Extended Semantic Web Conference, (May 2013).

Computation of connectivity scores between entities

Combination of (i) a semantic (graph-based) connectivity score (SCS) with (ii) a Web co-occurrence-based measure (CBM) (similar to Normalized Google Distance, NGD)

For (i): adaptation of the Katz index from SNA for (linked) data graphs (considering the number of paths and the path lengths of transversal properties)
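The two ingredients named on this slide can be illustrated with textbook versions of the underlying measures. The sketch below is not the SCS or CBM formula from [ESWC13]: it shows a truncated Katz index (walks of length l weighted by beta to the power l) and the standard Normalized Google Distance. The damping factor, the maximum length, the undirected treatment of links and all example values are assumptions.

```python
import math
from collections import defaultdict


def katz_score(edges, source, target, beta=0.5, max_length=4):
    """Truncated Katz index: sum over l of beta**l * (number of walks of length l)."""
    neighbours = defaultdict(set)
    for a, b in edges:  # links are treated as undirected here
        neighbours[a].add(b)
        neighbours[b].add(a)
    walks = {source: 1}  # walks[n] = number of walks of the current length ending at n
    score = 0.0
    for length in range(1, max_length + 1):
        next_walks = defaultdict(int)
        for node, count in walks.items():
            for neighbour in neighbours[node]:
                next_walks[neighbour] += count
        walks = next_walks
        score += beta ** length * walks.get(target, 0)
    return score


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts f(x), f(y), f(x, y) and index size n.

    Assumes n is larger than both fx and fy. Lower values mean closer terms.
    """
    if min(fx, fy, fxy) <= 0:
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Hypothetical toy graph around db:Pluto.
edges = [
    ("db:Pluto", "db:Dwarf_planet"),
    ("db:Dwarf_planet", "db:Astronomical_object"),
    ("db:Pluto", "db:Astronomical_object"),
]
score = katz_score(edges, "db:Pluto", "db:Astronomical_object", beta=0.5, max_length=2)
```

Note that NGD is a distance while the slide describes CBM as a relatedness measure; how the two scores are normalised and combined is described in the cited paper and is not reproduced here.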

Page 14: Turning Data into Knowledge (KESW2014 Keynote)

Entity linking: evaluation

Evaluation based on USA Today news items (80,000 entity pairs)

Manually created gold standard (1,000 entity pairs)

Baseline: Explicit Semantic Analysis (ESA)

=> CBM/SCS: „relatedness“; ESA: „similarity“

Precision/Recall/F1 for SCS, CBM, ESA.
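For reference, precision, recall and F1 against a gold standard of entity pairs can be computed as in the sketch below. The function name and the toy pairs are hypothetical; the actual figures for SCS, CBM and ESA are those reported in the cited paper.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F1 of predicted entity pairs against a gold standard."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy pairs: two of three predictions appear in the gold standard.
predicted = {("a", "b"), ("a", "c"), ("b", "d")}
gold = {("a", "b"), ("b", "d"), ("c", "d"), ("e", "f")}
p, r, f1 = precision_recall_f1(predicted, gold)
```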

Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl, ESWC 2013 - 10th Extended Semantic Web Conference, (May 2013).

Page 15: Turning Data into Knowledge (KESW2014 Keynote)

„SCS Connector“ demo

http://lod2.inf.puc-rio.br/scs/SemConnectivities

SCS Connector – Quantifying and Visualising Semantic Paths between Entity Pairs, Nunes, B. P., Herrera, J. E. T., Taibi, D., Lopes, G. R., Casanova, M. A., Dietze, S., Demo Paper at 11th Extended Semantic Web Conference (ESWC2014), Heraklion, Crete, Greece, (2014).

*BEST ESWC2014 DEMO AWARD*

Page 16: Turning Data into Knowledge (KESW2014 Keynote)

Dataset profiling: what’s the data about?

Extracting representative (DBpedia) categories („topic profile“) & entities for arbitrary datasets

Sounds easy? But how to do that for 300+ datasets with < 50 bn triples?

Scalability vs representativeness: sampling & ranking for good scalability/accuracy balance [ESWC2014] (applied to all responsive LOD datasets)

[Diagram: Dataset Catalog/Registry with Dataset Metadata; the Yovisto Video record (yov:Video, "Pluto & the Dwarf Planets") annotated with db:Pluto (Dwarf Planet), db:Astronomy and db:Astro. Objects]

A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).

Page 17: Turning Data into Knowledge (KESW2014 Keynote)

Efficient dataset profiling: method

1. Sampling of resource instances (random sampling, weighted sampling, resource centrality sampling)

2. Entity and topic extraction (NER via DBpedia Spotlight, category mapping and expansion)

3. Normalisation and ranking (using graphical models such as PageRank with Priors, HITS with Priors and K-Step Markov)

Result: weighted dataset-topic profile graph
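The three steps can be sketched end to end on toy data. This is a simplified illustration, not the pipeline of [ESWC2014]: entity extraction is replaced by dictionary lookups instead of DBpedia Spotlight, only the random sampling variant is shown, and the ranking is a basic power iteration in the spirit of PageRank with priors. All function names, parameters and data are assumptions.

```python
import random
from collections import defaultdict


def sample_resources(resources, size, seed=0):
    """Step 1 (random sampling variant): draw up to `size` resources."""
    resources = list(resources)
    return random.Random(seed).sample(resources, min(size, len(resources)))


def build_graph(dataset, resources, entity_of, categories_of):
    """Step 2: connect dataset, extracted entities and their categories."""
    graph = defaultdict(set)
    for resource in resources:
        entity = entity_of.get(resource)  # stand-in for named entity recognition
        if entity is None:
            continue
        graph[dataset].add(entity)
        graph[entity].add(dataset)
        for category in categories_of.get(entity, ()):
            graph[entity].add(category)
            graph[category].add(entity)
    return graph


def pagerank_with_priors(graph, prior_node, back_prob=0.15, iterations=50):
    """Step 3: power iteration that jumps back to prior_node with probability back_prob."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (back_prob if n == prior_node else 0.0) for n in nodes}
        for node in nodes:
            share = (1.0 - back_prob) * rank[node] / len(graph[node])
            for neighbour in graph[node]:
                new_rank[neighbour] += share
        rank = new_rank
    return rank


# Hypothetical toy data for one dataset.
entity_of = {"r1": "db:Pluto", "r2": "db:Sun", "r3": "db:Pluto"}
categories_of = {
    "db:Pluto": ["cat:Dwarf_planets", "cat:Astronomical_objects"],
    "db:Sun": ["cat:Astronomical_objects"],
}
sampled = sample_resources(entity_of, size=3)
graph = build_graph("yovisto", sampled, entity_of, categories_of)
rank = pagerank_with_priors(graph, prior_node="yovisto")
```

In this toy graph the category reached through both entities ends up with the higher weight, which is the intuition behind a weighted dataset-topic profile.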

A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).

Page 18: Turning Data into Knowledge (KESW2014 Keynote)

Dataset profiling: exploring LOD datasets/topics in a nutshell

http://data-observatory.org/lod-profiles/

Automatic extraction of dataset “topics” [ESWC2014] => RDF/VoID dataset profiles

Visualisation & exploration of dataset-topic graph (datasets, topics, relationships)

Includes all (responsive) datasets of LOD Cloud

Page 19: Turning Data into Knowledge (KESW2014 Keynote)

Dataset profiling: evaluation

[Figure: NDCG (averaged over all datasets)]

Datasets & Ground Truth

Yovisto, Oxpoints, LAK Dataset, Semantic Web Dogfood

Crowd-sourced topic indicators from datasets (keywords, tags)

Manual mapping to entities & category extraction (ranking according to frequency)

Baselines

1) LDA, 2) tf/idf (applied to entire datasets)

Topic extraction according to our approach, weighting/ranking based on term weight

Measure

NDCG @ rank l

Performance (time/NDCG) for different sampling strategies/sizes etc
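As a reminder of how NDCG at a given rank works, the sketch below uses the common variant with linear gain and a log2 discount. Which DCG variant the evaluation used is not stated on the slide, so this choice, the function names and the toy relevance values are assumptions.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg_at(ranked_topics, relevance, cutoff):
    """NDCG of the top `cutoff` ranked topics against graded ground-truth relevance."""
    gains = [relevance.get(topic, 0.0) for topic in ranked_topics[:cutoff]]
    ideal = sorted(relevance.values(), reverse=True)[:cutoff]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical ground truth: topic -> graded relevance.
relevance = {"cat:Astronomy": 3, "cat:Physics": 2, "cat:Education": 1}
perfect = ndcg_at(["cat:Astronomy", "cat:Physics", "cat:Education"], relevance, 3)
reversed_order = ndcg_at(["cat:Education", "cat:Physics", "cat:Astronomy"], relevance, 3)
```

A ranking in ideal order scores 1.0; any other order of the same topics scores lower, and topics missing from the ground truth contribute nothing.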

Page 20: Turning Data into Knowledge (KESW2014 Keynote)

What (dataset) do these categories have in common?

dbp:Category:1955_births

dbp:Category:People_from_London

dbp:Category:Buzzwords

dbp:Category:Semantic_Web

dbp:Category:Web_Services

dbp:Category:HTTP

dbp:Category:Unitarian_Universalists

dbp:Category:World_Wide_Web

dbp:Category:Royal_Medal_winners

Page 21: Turning Data into Knowledge (KESW2014 Keynote)

Diversity of category profile for a single publication

Berners-Lee, Tim; Hendler, James; Lassila, Ora (2001). "The Semantic Web". Scientific American Magazine.

foaf:Person foaf:Document

dbp:Tim_Berners-Lee

dbp:Category:1955_births

dbp:Category:People_from_London

dbp:Category:Buzzwords

dbp:Semantic_Web

dbp:Category:Semantic_Web

dbp:Category:Web_Services

dbp:Category:HTTP

dbp:Category:Unitarian_Universalists

first-level categories (dcterms:subject)

dbp:Category:World_Wide_Web

dbp:Category:Royal_Medal_winners

DBLP

Page 22: Turning Data into Knowledge (KESW2014 Keynote)

Type-specific exploration of dataset categories

http://data-observatory.org/led-explorer/

Type-specific views on datasets/categories

“Document” (foaf:Document)

“Person” (foaf:Person)

“Course” (aaiso:course)

Currently applied to datasets in the LinkedUp Catalog only (as schema mappings are already available there)

Exploring type-specific topic profiles of datasets: a demo for educational linked data, Taibi, D., Dietze, S., Fetahu, B., Fulantelli, G., Demo at International Semantic Web Conference 2014 (ISWC2014)

Page 23: Turning Data into Knowledge (KESW2014 Keynote)

data.l3s.de – the L3S DataHub

Page 24: Turning Data into Knowledge (KESW2014 Keynote)

KEYSTONE & PROFILES 2014

KEYSTONE: semantic keyword-based search on structured data sources (2013-2017)

http://www.keystone-cost.eu/

Research network focused on distributed search and dataset profiling, related to Semantic Web, databases, etc.

Open to new members (beyond Europe)

PROFILES2014 - Dataset PROFIling & fEderated Search for Linked Data

http://www.keystone-cost.eu/profiles

Workshop co-located with ESWC2014

IJSWIS Special Issue on … LD search & profiling

http://www.ijswis.org/?q=node/51/

Deadline 8 December 2014

Page 25: Turning Data into Knowledge (KESW2014 Keynote)

Summing up

Summary

Increasing amounts of data => require knowledge about nature and relationships of datasets

Profiling: scalable methods for extracting dataset metadata

Interlinking: connectivity of entities or datasets

What about LD evolution?

In RDF graphs (e.g. LOD Cloud), „all“ nodes are connected

Impact of evolution on preservation, linking and enrichment?

Which parts of datasets to preserve (entity „neighbourhood“)? => semantic relatedness/relevance/entity retrieval

Link correctness in evolving LD?

….

Page 26: Turning Data into Knowledge (KESW2014 Keynote)

Спасибо! Thank You!

WWW See also (general)

http://purl.org/dietze

http://linkedup-project.eu

http://duraark.eu

http://data.l3s.de

See also (data)

http://data.l3s.de

http://data.linkededucation.org

http://lak.linkededucation.org

Acknowledgements

Besnik Fetahu (L3S)

Elena Demidova (L3S)

Bernardo Pereira Nunes (PUC Rio)

Marco Casanova (PUC Rio)

Luiz Andre Paes Leme (PUC Rio)

Giseli Lopes (PUC Rio)

Davide Taibi (CNR, IT)

Mathieu d’Aquin (Open University, UK)

and many more…