Page 1:

Franck Michel

Université Côte d’Azur, CNRS, Inria, I3S laboratory, France

Défi MASTODONS - Les Big Data en recherche, 13 June 2019

Aggregation and querying of heterogeneous data distributed at Web scale using semantic alignment techniques

Page 2:

More data sources => more data integration opportunities

Page 3:

Hortus Sanitatis. First Natural History encyclopaedia, 1485.

Page 4:

Data Integration ex. in Digital Humanities

Archaeological excavation

Conservation biology*

*http://www.lynxeds.com/hmw/plate/family-delphinidae-ocean-dolphins

Hortus Sanitatis, 1485.

Page 5:

Data Integration ex. in Digital Humanities

Archaeological excavation

Conservation biology*

*http://www.lynxeds.com/hmw/plate/family-delphinidae-ocean-dolphins

First Natural History Encyclopaedia, 1485.

Knowledge formalization

Controlled vocabularies, taxonomies, domain ontologies…

Page 6:

fédération de données et de ConnaissancEs Distribuées en Imagerie BiomédicaLE (federation of distributed data and knowledge in biomedical imaging)

Scientific annual workshops: 2012, 2013, 2014, 2015

Issues:

• High heterogeneity

• Increasing amount/number of sources

• Need for cross-factor analysis

• Sensitive data (privacy, access policies)

Methods:

• Knowledge formalization

• Semantic alignment

• Mediation towards common formats

• Distributed querying

Page 7:

How to enable RDF-based integration of heterogeneous data sources?

Page 8:

RDF-based Data Integration

[Diagram: heterogeneous data sources (e.g. a table with columns ID and NAME) are queried with SPARQL in two ways: graph materialization (ETL-like), or a virtual graph with query rewriting.]
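To make the contrast between graph materialization and a virtual graph concrete, here is a minimal sketch. It is only an illustration: the `ROWS` table, the `http://example.org/` namespace, the use of `foaf:name` and all function names are assumptions, and a real system would evaluate SPARQL rather than a hard-coded lookup.

```python
# Toy contrast between graph materialization and a virtual graph.
# ROWS, EX and all function names are hypothetical illustrations.

EX = "http://example.org/"
NAME = "http://xmlns.com/foaf/0.1/name"

# A tiny data source: a table with columns ID and NAME.
ROWS = [{"ID": 1, "NAME": "Alice"}, {"ID": 2, "NAME": "Bob"}]


def materialize(rows):
    """ETL-like: translate every row into triples ahead of query time."""
    return [(f"{EX}person/{r['ID']}", NAME, r["NAME"]) for r in rows]


def query_materialized(graph, name):
    """Answer 'who has this name?' by scanning the stored triples."""
    return [s for (s, p, o) in graph if p == NAME and o == name]


def query_virtual(rows, name):
    """Query rewriting: the question becomes a filter on the source itself,
    and only the matching rows are turned into RDF terms."""
    return [f"{EX}person/{r['ID']}" for r in rows if r["NAME"] == name]


graph = materialize(ROWS)
print(query_materialized(graph, "Bob"))
print(query_virtual(ROWS, "Bob"))
```

Both calls return the same subject IRI. The difference lies in where the work happens: materialization pays the translation cost once for all rows, whereas the virtual approach leaves the data in place and translates each query.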

Page 9:

Many methods for many types of data sources

XML: AstroGrid-D, SPARQL2XQuery, XSPARQL

CSV/TSV/Spreadsheets: XLWrap, Linked CSV, CSVW, RML

Relational Databases: D2RQ, R2O, Ultrawrap, Triplify, SM; R2RML: Morph-RDB, ontop, Virtuoso

Multiple formats: RML, TARQL, Apache Any23, DataLift, SPARQL-Generate

HTML: RDFa, Microformats, JSON-LD

JSON: TARQL, JSON-LD, RML

NoSQL: xR2RML (MongoDB), ontop (MongoDB), [Mugnier et al., 2016] (key-value stores)

Web APIs: SPARQL Micro-services, Linked REST APIs

M.L. Mugnier, M.C. Rousset, and F. Ulliana. “Ontology-Mediated Queries for NOSQL Databases.” In Proc. AAAI. 2016.

Page 10:

Agenda

xR2RML: Generic translation of heterogeneous data sources into RDF

SPARQL micro-services: Bridging Web APIs and the Web of Data

Applications in the biodiversity domain

Page 11:

Agenda

xR2RML: Generic translation of heterogeneous data sources into RDF

SPARQL micro-services: Bridging Web APIs and the Web of Data

Applications in the biodiversity domain

Page 12:

The generic translation of heterogeneous data sources into RDF requires a generic mapping description.

Page 13:

Table TEACHERS:

ID   FNAME       TEACHES
7    Catherine   Semantic Web
8    Philippe    Software Engineering
…    …           …

Mapping description, applied to row 7, yields:

<http://example.org/teacher/7> foaf:name "Catherine"

<http://example.org/teacher/7> ex:teaches <https://www.wikidata.org/entity/Q54837>
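The role of a mapping description can be sketched in a few lines. This is a simplified stand-in, not xR2RML syntax: the `MAPPING` structure, the `COURSE_URIS` lookup, the function names and the assumption that `ex:` expands to `http://example.org/` are all hypothetical. Only the course alignment shown on the slide is included.

```python
# Minimal sketch of a mapping description applied to the TEACHERS table.
# MAPPING, COURSE_URIS and the function names are hypothetical.

FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
EX_TEACHES = "http://example.org/teaches"  # assumed expansion of ex:teaches

TEACHERS = [
    {"ID": 7, "FNAME": "Catherine", "TEACHES": "Semantic Web"},
    {"ID": 8, "FNAME": "Philippe", "TEACHES": "Software Engineering"},
]

# Only the alignment shown on the slide; other courses stay unaligned.
COURSE_URIS = {"Semantic Web": "https://www.wikidata.org/entity/Q54837"}


def name_object(row):
    return ("literal", row["FNAME"])


def course_object(row):
    uri = COURSE_URIS.get(row["TEACHES"])
    return ("iri", uri) if uri else None


MAPPING = {
    "subject_template": "http://example.org/teacher/{ID}",
    "predicate_objects": [(FOAF_NAME, name_object), (EX_TEACHES, course_object)],
}


def apply_mapping(mapping, rows):
    """Produce one subject per row and one triple per predicate-object rule."""
    triples = []
    for row in rows:
        subject = mapping["subject_template"].format(**row)
        for predicate, make_object in mapping["predicate_objects"]:
            obj = make_object(row)
            if obj is not None:
                triples.append((subject, predicate, obj))
    return triples


def to_ntriples(triples):
    """Serialize as N-Triples-like lines (no escaping of special characters)."""
    lines = []
    for s, p, (kind, value) in triples:
        o = f"<{value}>" if kind == "iri" else f'"{value}"'
        lines.append(f"<{s}> <{p}> {o} .")
    return "\n".join(lines)


triples = apply_mapping(MAPPING, TEACHERS)
print(to_ntriples(triples))
```

Keeping the mapping as data, separate from the code that applies it, is what lets the same description drive either materialization or query rewriting.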

Page 14:

The xR2RML mapping language

Uniform description of mappings from most common types of DB to RDF

Extends R2RML, the W3C recommendation for RDBs, and RML

Rich iteration model to accommodate nested, hierarchical documents

Flexibility:

• Allow any query language

• Allow any syntax to reference data elements from query results

http://i3s.unice.fr/~fmichel/xr2rml_specification_v5.html
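As a rough illustration of why an iteration model matters for nested documents, the sketch below walks JSON-like documents with a tiny dotted-path syntax in which `*` iterates over array items. The path syntax, the `DOCS` data, the `ex:teaches` expansion and the function names are assumptions made for this example; they stand in for whatever reference syntax (e.g. JSONPath) a real mapping would use.

```python
# Sketch: iterating over nested documents to produce one triple per nested value.
# DOCS, the dotted-path syntax and the function names are hypothetical.

EX_TEACHES = "http://example.org/teaches"

DOCS = [
    {"id": 7, "name": "Catherine",
     "courses": [{"title": "Semantic Web"}, {"title": "Web Science"}]},
    {"id": 8, "name": "Philippe", "courses": []},
]


def resolve(node, path):
    """Return all values reached by a dotted path; '*' iterates over list items."""
    nodes = [node]
    for step in path.split("."):
        next_nodes = []
        for n in nodes:
            if step == "*" and isinstance(n, list):
                next_nodes.extend(n)
            elif isinstance(n, dict) and step in n:
                next_nodes.append(n[step])
        nodes = next_nodes
    return nodes


def translate(docs, subject_template, predicate, reference):
    triples = []
    for doc in docs:                           # iterate over query results
        subject = subject_template.format(**doc)
        for value in resolve(doc, reference):  # iterate inside each document
            triples.append((subject, predicate, value))
    return triples


triples = translate(DOCS, "http://example.org/teacher/{id}",
                    EX_TEACHES, "courses.*.title")
for t in triples:
    print(t)
```

A flat, row-oriented mapping would need the nested array to be unrolled beforehand; here the reference expression itself carries the iteration.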

Page 15:

How to query a data source with SPARQL using such a mapping description?

Page 16:

SPARQL rewriting techniques for SQL and XQuery

Semantics-preserving 1-to-1 rewriting

Closely coupled with the target QL capabilities: support of joins, unions, nested queries, filtering, string functions, etc.

Optimization: enforced on the target query, or delegated to the DB query-processing engine

SQL: Bizer & Cyganiak, 2006; Unbehauen et al., 2013a; Priyatna et al., 2014; Rodríguez-Muro & Rezk, 2015

XQuery: Bikakis et al., 2015

Optimization: Unbehauen et al., 2013b; Rodríguez-Muro & Rezk, 2015; Elliott et al., 2009; Sequeda & Miranker, 2013

Page 17:

How much of the SPARQL rewriting process can be done in a DB-agnostic yet optimized manner?

Page 18:

Abstract Query Language (AQL)

Carries enough information for translation towards “any” DB QL.

Early optimizations: self-join elimination, self-union elimination, filter propagation

Pipeline: SPARQL query + xR2RML mappings → abstract query → concrete DB query
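One of the early optimizations named above, self-join elimination, can be illustrated on a toy abstract query. This is not the AQL defined by the author: the `Atomic` and `Join` classes, the `unique_refs` parameter and the merging rule are simplifying assumptions chosen to show the idea that two sub-queries over the same source, joined on a unique reference, can be merged before any database-specific translation.

```python
# Toy self-join elimination on a database-agnostic abstract query.
# Atomic, Join and unique_refs are hypothetical simplifications.
from dataclasses import dataclass


@dataclass(frozen=True)
class Atomic:
    source: str                      # logical source, e.g. a table or collection
    project: frozenset               # (reference, SPARQL variable) pairs
    where: frozenset = frozenset()   # conditions kept as plain strings


@dataclass(frozen=True)
class Join:
    left: object
    right: object
    on: str                          # shared SPARQL variable


def eliminate_self_join(query, unique_refs):
    """Merge joined atomic queries over the same source when the join
    variable is bound to the same unique reference on both sides."""
    if not isinstance(query, Join):
        return query
    left = eliminate_self_join(query.left, unique_refs)
    right = eliminate_self_join(query.right, unique_refs)
    if (isinstance(left, Atomic) and isinstance(right, Atomic)
            and left.source == right.source):
        l_refs = {ref for ref, var in left.project if var == query.on}
        r_refs = {ref for ref, var in right.project if var == query.on}
        if any((left.source, ref) in unique_refs for ref in l_refs & r_refs):
            return Atomic(left.source,
                          left.project | right.project,
                          left.where | right.where)
    return Join(left, right, query.on)


# Abstract form of: ?t foaf:name ?n . ?t ex:teaches ?c
name_part = Atomic("TEACHERS", frozenset({("ID", "?t"), ("FNAME", "?n")}),
                   frozenset({"FNAME IS NOT NULL"}))
course_part = Atomic("TEACHERS", frozenset({("ID", "?t"), ("TEACHES", "?c")}),
                     frozenset({"TEACHES IS NOT NULL"}))
query = Join(name_part, course_part, on="?t")

optimized = eliminate_self_join(query, unique_refs={("TEACHERS", "ID")})
print(optimized)
```

Because the rewrite works on the abstract query only, it applies whatever the target database is, which matters most for targets that cannot execute joins themselves.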

Page 19:

Application to MongoDB

AQL-to-MongoDB rewriting is challenging:

Expressiveness gap between SPARQL, AQL and MongoDB: joins not supported, nested queries hardly supported, limited filter expressions

Semantic ambiguity

Page 20:

Filling the gap between the two worlds is not straightforward

Yet, NoSQL DBs are a huge, quickly increasing source of data. Potential for RDF-based data integration and publication in the Web of Data.

Semantic Web vs. NoSQL

Semantic Web                  NoSQL
highly connected graphs       isolated documents, joins hardly supported
rich query expressiveness     low expressiveness
reasoning                     _
?                             high throughput, high availability
_                             horizontal elasticity

Page 21:

The generic approach (graph materialization or query rewriting, queried with SPARQL) is suitable when there is direct access to the data source.

What if we access the data source via an API?

Page 22:

Agenda

xR2RML: Generic translation of heterogeneous data sources into RDF

SPARQL micro-services: Bridging Web APIs and the Web of Data

Applications in the biodiversity domain

Page 23:

Web APIs: APIs all over the web

21,700+ Web APIs are registered on ProgrammableWeb.com (Jun. 2019)

Limitations:

• Standard formats (e.g. JSON, XML) but proprietary vocabularies

• Documented in web pages but not machine-processable, no explicit semantics

• Internal resource identifiers, no hyperlinks to resources

• Partial view over the database by means of predefined services

Page 24:

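The SPARQL Micro-Service Architecture

Lightweight method to query a Web API with SPARQL

[Diagram: SPARQL client, SPARQL micro-service, Web API]

(1) The SPARQL client sends a SPARQL query to the SPARQL micro-service

(2) The micro-service sends a Web API query

(3) The Web API returns its response

(4) The micro-service returns the SPARQL response to the client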
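The four-step exchange can be sketched as follows. This is an assumption-laden illustration, not the actual SPARQL micro-service implementation: the Web API is replaced by a local stub (`call_web_api`), the catalogue data is invented, and a single triple pattern with `None` as a variable stands in for a full SPARQL query.

```python
# Sketch of the SPARQL micro-service flow with a stubbed Web API.
# call_web_api, its catalogue and the pattern matcher are hypothetical.
import json

SCHEMA_NAME = "http://schema.org/name"


def call_web_api(tag):
    """Steps (2)-(3): stand-in for an HTTP call returning proprietary JSON."""
    catalogue = [
        {"id": "1", "title": "Bridge at dusk", "tags": ["bridge"]},
        {"id": "2", "title": "City skyline", "tags": ["city"]},
    ]
    hits = [p for p in catalogue if tag in p["tags"]]
    return json.dumps({"photos": hits})


def json_to_triples(payload):
    """Translate the API's own vocabulary into RDF-style triples."""
    doc = json.loads(payload)
    return [(f"http://example.org/photo/{p['id']}", SCHEMA_NAME, p["title"])
            for p in doc["photos"]]


def match(triples, pattern):
    """Evaluate one triple pattern; None plays the role of a variable."""
    return [t for t in triples
            if all(want is None or want == got for want, got in zip(pattern, t))]


def micro_service(tag, pattern):
    """(1) receive the query and its argument, (2)-(3) call the Web API,
    (4) answer the query against the temporary set of triples."""
    triples = json_to_triples(call_web_api(tag))
    return match(triples, pattern)


print(micro_service("bridge", (None, SCHEMA_NAME, None)))
```

The client never sees the API's JSON or its vocabulary; it only sees triples, which is what lets a micro-service be combined with other SPARQL endpoints.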

Page 25:

Bridging Web APIs and the Web of Data

Assign dereferenceable URIs to Web API resources

[Diagram: a SPARQL µ-service unlocks a photo held by a Web API and exposes it as <http://example.org/photo/53735656>, with schema:name "Brooklyn Bridge sunset" and a schema:contentUrl link.]

Expose in the Web of Data resources locked in a silo
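A minimal sketch of what assigning dereferenceable URIs could look like, assuming a URI pattern built on the Web API's internal identifier. The `API_RECORDS` store, the file URL and the function names are hypothetical; the photo identifier and title are the ones shown on the slide.

```python
# Sketch: minting and dereferencing URIs for resources held by a Web API.
# API_RECORDS, the file URL and the function names are hypothetical.
import re

SCHEMA_NAME = "http://schema.org/name"
SCHEMA_CONTENT_URL = "http://schema.org/contentUrl"
URI_PATTERN = re.compile(r"^http://example\.org/photo/(\d+)$")

# Stand-in for the Web API's records, keyed by internal identifier.
API_RECORDS = {
    "53735656": {"title": "Brooklyn Bridge sunset",
                 "file": "http://example.org/files/53735656.jpg"},
}


def mint_uri(internal_id):
    """Turn an internal identifier into a Web-wide identifier."""
    return f"http://example.org/photo/{internal_id}"


def dereference(uri):
    """Look the URI up: recover the internal id, fetch the record,
    and describe the resource with triples."""
    found = URI_PATTERN.match(uri)
    if found is None:
        return []
    record = API_RECORDS.get(found.group(1))
    if record is None:
        return []
    return [(uri, SCHEMA_NAME, record["title"]),
            (uri, SCHEMA_CONTENT_URL, record["file"])]


for triple in dereference(mint_uri("53735656")):
    print(triple)
```

The point of the round trip is that any other dataset can now link to the photo by URI, and following that link yields a description rather than a silo-specific identifier.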

Page 26:

Agenda

xR2RML: Generic translation of heterogeneous data sources into RDF

SPARQL micro-services: Bridging Web APIs and the Web of Data

Applications in the biodiversity domain

Page 27:

Use case

TAXREF

French TAXonomic REFerence for fauna, flora and fungi, maintained by the Muséum National d’Histoire Naturelle.

570,000+ scientific names, 260,000+ taxa

Covers mainland France and overseas territories

Available as a Web site, a Web service and a downloadable text file

Page 28:

Biodiversity studies (e.g. impact of global warming on species distributions) require mashing up data from multiple stakeholders

How to make biodiversity data FAIR?

Page 29:

TAXREF-LD

Linking Open Data cloud diagram, 2019. J.P. McCrae, A. Abele, P. Buitelaar, R. Cyganiak, A. Jentzsch, V. Andryushechkin and J. Debattista. http://lod-cloud.net/

http://taxref.mnhn.fr/sparql

Several steps involved…

• Modelling of taxonomic information as Linked Data

• Write and enact xR2RML mappings (JSON → MongoDB → RDF)

• Publish on the Web of Data

Page 30:

[Diagram labels: Web app. (SPARQL, HTML); SPARQL Micro-services; TAXREF-LD; NCBI; TaxonConcept; Agrovoc]

Page 31:

SPARQL micro-services to compare TAXREF information with 7 biodiversity sources:

• FishBase

• Global Biodiversity Information Facility

• World Register of Marine Species

• Pan-European Species directories Infrastructure

• Index Fungorum

• Tropicos

• Sandre: Service d’Administration Nationale des Données et Référentiels sur l’Eau

Page 32:

http://sms.i3s.unice.fr/demo-sms?param=Delphinapterus+leucas

Page 33:

Take-aways

More data sources => new data integration scenarios

Need for explicit, machine-processable data semantics

The Semantic Web provides tools to do that: RDF, SPARQL, ontologies…

Various methods to translate heterogeneous data sources to RDF: mapping language-based, wrapper-based

More research needed to:

• Allow automatic discovery of data sources, e.g. data portals, search engines…

• Automatically generate federated queries

• Automate semantic alignment of data sources represented in RDF

These techniques are a way to achieve Open Data, Open Science, FAIRness

Page 34:

Related publications

Generic translation to RDF

Michel F., Djimenou L., Faron-Zucker C. & Montagnat J. (2015). Translation of Relational and Non-Relational Databases into RDF with xR2RML. In Proceedings of WebIST, pp. 443–454. Lisbon, Portugal.

Michel F., Faron-Zucker C. & Montagnat J. (2016). A Generic Mapping-Based Query Translation from SPARQL to Various Target Database Query Languages. In Proceedings of WebIST vol. 2, pp. 147–158. Rome, Italy.

Michel F., Faron-Zucker C. & Montagnat J. (2016). A Mapping-based Method to Query MongoDB Documents with SPARQL. In Proceedings of DEXA vol. 9828, LNCS, pp. 52–67. Porto, Portugal.

Michel F., Faron-Zucker C. & Montagnat J. (2018). Bridging the Semantic Web and NoSQL Worlds: Generic SPARQL Query Translation and Application to MongoDB. Transactions on Large-Scale Data- and Knowledge-Centered Systems (LNCS 11360):125–165.

Biodiversity

Michel F., Gargominy O., Tercerie S. & Faron-Zucker C. (2017). A Model to Represent Nomenclatural and Taxonomic Information as Linked Data. Application to the French Taxonomic Register, TAXREF. In Proceedings of the ISWC2017 workshop on Semantics for Biodiversity (S4BioDiv) vol. 1933. Vienna, Austria.

Michel F., Faron-Zucker C., Tercerie S. & Gargominy O. (2018). Modelling Biodiversity Linked Data: Pragmatism May Narrow Future Opportunities. In Biodiversity Information Science and Standards, TDWG 2018 Proceedings vol. 2, p. e26235. Dunedin, New Zealand.

SPARQL micro-services

Michel F., Faron-Zucker C. & Gandon F. (2018). SPARQL Micro-Services: Lightweight Integration of Web APIs and Linked Data. In Proceedings of the Linked Data on the Web Workshop (LDOW2018). Lyon, France.

Michel F., Zucker C., Gargominy O. & Gandon F. (2018). Integration of Web APIs and Linked Data Using SPARQL Micro-Services—Application to Biodiversity Use Cases. Information 9(12):310.

Michel F., Faron-Zucker C., Corby O. & Gandon F. (2019). Enabling Automatic Discovery and Querying of Web APIs at Web Scale using Linked Data Standards. In Companion Proceedings of the 2019 World Wide Web Conference (WWW ’19 Companion). San Francisco, CA, USA.