Franck MICHEL - Université Côte d’Azur, CNRS, Inria, I3S laboratory, France
Défi MASTODONS - Big Data in Research, 13 June 2019
Aggregation and querying of heterogeneous, distributed data at Web scale using semantic alignment techniques
More data sources, more data integration opportunities
Hortus Sanitatis. First Natural History encyclopaedia, 1485.
Data integration examples in Digital Humanities
Archaeological excavation
Conservation biology*
*http://www.lynxeds.com/hmw/plate/family-delphinidae-ocean-dolphins
Hortus Sanitatis, 1485.
Knowledge formalization: controlled vocabularies, taxonomies, domain ontologies…
Federation of distributed data and knowledge in biomedical imaging (fédération de données et de ConnaissancEs Distribuées en Imagerie BiomédicaLE)
Scientific annual workshops 2012, 2013, 2014, 2015
Issues:
• High heterogeneity
• Increasing amount/number of sources
• Need for cross-factor analysis
• Sensitive data (privacy, access policies)
Methods:
• Knowledge formalization
• Semantic alignment
• Mediation towards common formats
• Distributed querying
How to enable RDF-based integration of heterogeneous data sources?
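RDF-based Data Integration
• Graph materialization (ETL-like): heterogeneous data sources are translated into an RDF graph that is queried with SPARQL
• Virtual graph: SPARQL queries are rewritten into queries over the heterogeneous data sources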
Many methods for many types of data sources
• XML: AstroGrid-D, SPARQL2XQuery, XSPARQL
• CSV/TSV/Spreadsheets: XLWrap, Linked CSV, CSVW, RML
• Relational databases: D2RQ, R2O, Ultrawrap, Triplify, SM; R2RML (Morph-RDB, ontop, Virtuoso)
• Multiple formats: RML, TARQL, Apache Any23, DataLift, SPARQL-Generate
• HTML: RDFa, Microformats, JSON-LD
• JSON: TARQL, JSON-LD, RML
• NoSQL: xR2RML (MongoDB), ontop (MongoDB), [Mugnier et al., 2016] (key-value stores)
• Web APIs: SPARQL Micro-services, Linked REST APIs
M.L. Mugnier, M.C. Rousset, and F. Ulliana. “Ontology-Mediated Queries for NOSQL Databases.” In Proc. AAAI. 2016.
Agenda
xR2RML: Generic translation of heterogeneous data sources into RDF
SPARQL micro-services: Bridging Web APIs and the Web of Data
Applications in the biodiversity domain
The generic translation of heterogeneous data sources into RDF
requires a generic mapping description.
TEACHERS
ID | FNAME | TEACHES
7 | Catherine | Semantic Web
8 | Philippe | Software Engineering
… | … | …
Mapping description
Resulting RDF graph:
http://example.org/teacher/7 foaf:name "Catherine"
http://example.org/teacher/7 ex:teaches https://www.wikidata.org/entity/Q54837
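The row-to-graph translation illustrated above can be sketched in a few lines of Python. This is only an illustration of what a mapping description expresses, not the xR2RML syntax: the `ex:` namespace expansion and the `COURSE_URIS` lookup table are assumptions made for the example.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.org/"  # assumed expansion of the ex: prefix

# Hypothetical lookup from a course label to an entity URI (not in the slides).
COURSE_URIS = {"Semantic Web": "https://www.wikidata.org/entity/Q54837"}


def map_teacher(row):
    """Translate one TEACHERS row into (subject, predicate, object) triples."""
    subject = f"{EX}teacher/{row['ID']}"
    triples = [(subject, FOAF + "name", row["FNAME"])]
    course_uri = COURSE_URIS.get(row["TEACHES"])
    if course_uri is not None:
        triples.append((subject, EX + "teaches", course_uri))
    return triples


rows = [
    {"ID": 7, "FNAME": "Catherine", "TEACHES": "Semantic Web"},
    {"ID": 8, "FNAME": "Philippe", "TEACHES": "Software Engineering"},
]
triples = [t for row in rows for t in map_teacher(row)]
for triple in triples:
    print(triple)
```

A mapping language states the same rules declaratively (subject template, predicates, references to columns), so that a generic engine can apply them to any table.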
The xR2RML mapping language
Uniform description of mappings from most common types of DB to RDF
Extends R2RML, the W3C recommendation for RDBs, and RML
Rich iteration model to accommodate nested, hierarchical documents
Flexibility:
• Allow any query language
• Allow any syntax to reference data elements from query results
http://i3s.unice.fr/~fmichel/xr2rml_specification_v5.html
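To give an intuition of what an iteration model over nested documents involves, here is a minimal Python sketch. The dictionary-based mapping format, the dotted-path references and the sample document are all hypothetical; they stand in for the idea only and do not reproduce the xR2RML vocabulary.

```python
def resolve(value, path):
    """Follow a dotted path such as 'course.title' through nested dicts."""
    for key in path.split("."):
        value = value[key]
    return value


def apply_mapping(doc, mapping):
    """Generate triples from one document using a declarative mapping."""
    subject = mapping["subject_template"].format(**doc)
    triples = []
    for predicate, spec in mapping["predicate_objects"]:
        # 'iterate' names a nested array; each element yields one triple.
        items = resolve(doc, spec["iterate"]) if "iterate" in spec else [doc]
        for item in items:
            triples.append((subject, predicate, resolve(item, spec["reference"])))
    return triples


# Hypothetical nested document, e.g. as returned by a document store.
doc = {
    "id": 7,
    "fname": "Catherine",
    "teaches": [
        {"course": {"title": "Semantic Web"}},
        {"course": {"title": "Software Engineering"}},
    ],
}
mapping = {
    "subject_template": "http://example.org/teacher/{id}",
    "predicate_objects": [
        ("foaf:name", {"reference": "fname"}),
        ("ex:teaches", {"iterate": "teaches", "reference": "course.title"}),
    ],
}
triples = apply_mapping(doc, mapping)
print(triples)
```

The point of the sketch is that the mapping, not the code, decides which nested elements are iterated over and which syntax is used to reference them.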
How to query a data source with SPARQL using such a mapping description?
SPARQL rewriting techniques for SQL and XQuery
Semantics-preserving 1-to-1 rewriting
Closely coupled with the target QL capabilities:
Support of joins, unions, nested queries, filtering, string functions, etc.
Optimization: enforced on the target query, or delegated to the DB query-processing engine
SQL: Bizer & Cyganiak, 2006; Unbehauen et al., 2013a; Priyatna et al., 2014; Rodríguez-Muro & Rezk, 2015
XQuery: Bikakis et al., 2015
Optimization: Unbehauen et al., 2013b; Rodríguez-Muro & Rezk, 2015; Elliott et al., 2009; Sequeda & Miranker, 2013
How much of the SPARQL rewriting process can be done in a DB-agnostic yet optimized manner?
Abstract Query Language (AQL)
Carries enough information to allow translation towards “any” DB query language.
Early optimizations: self-join elimination, self-union elimination, filter propagation
Pipeline: SPARQL query + xR2RML mappings => abstract query => concrete DB query
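Two of the early optimizations named above can be illustrated on a toy abstract query. The sketch below is a simplified stand-in, not the AQL defined in the related publications: the `Atom` class, its fields and both functions are hypothetical, and self-join elimination is only valid here under the stated assumption that the subject is built from a unique key.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    """One atomic abstract query: fetch fields from a single logical source."""
    source: str                                      # table or collection name
    subject: str                                     # SPARQL variable for the subject
    projections: dict = field(default_factory=dict)  # SPARQL variable -> data field
    conditions: list = field(default_factory=list)   # (data field, operator, value)


def eliminate_self_joins(atoms):
    """Merge atoms that read the same source for the same subject variable.

    Assumes the subject is built from a unique key of the source; otherwise
    merging would not preserve the semantics of the join.
    """
    merged = {}
    for atom in atoms:
        key = (atom.source, atom.subject)
        if key not in merged:
            merged[key] = Atom(atom.source, atom.subject)
        merged[key].projections.update(atom.projections)
        merged[key].conditions.extend(atom.conditions)
    return list(merged.values())


def propagate_filters(atoms, filters):
    """Push (variable, operator, value) filters into the atoms binding the variable.

    Returns the filters that could not be pushed down.
    """
    remaining = []
    for variable, operator, value in filters:
        targets = [a for a in atoms if variable in a.projections]
        if not targets:
            remaining.append((variable, operator, value))
        for atom in targets:
            atom.conditions.append((atom.projections[variable], operator, value))
    return remaining


# ?t foaf:name ?n . ?t ex:teaches ?c . FILTER(?n = "Catherine")
atoms = [
    Atom("TEACHERS", "?t", {"?n": "FNAME"}),
    Atom("TEACHERS", "?t", {"?c": "TEACHES"}),
]
optimized = eliminate_self_joins(atoms)
leftover = propagate_filters(optimized, [("?n", "=", "Catherine")])
print(optimized, leftover)
```

Both steps work on the abstract query only, which is why they can be applied before any database-specific translation takes place.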
Application to MongoDB
AQL-to-MongoDB rewriting is challenging:
Expressiveness gap between SPARQL, AQL and MongoDB: joins not supported, nested queries hardly supported, limited filter expressions
Semantic ambiguity
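As a rough illustration of the last translation step, the sketch below turns simple comparison conditions into a MongoDB filter document. The `(field, operator, value)` condition format and the function name are hypothetical; only the MongoDB comparison operators (`$eq`, `$ne`, `$lt`, `$lte`, `$gt`, `$gte`) and `$and` are real. Anything without a direct equivalent is rejected here, which hints at the expressiveness gap mentioned above.

```python
import json

# SPARQL-style comparison operators and their MongoDB query operators.
MONGO_OPERATORS = {
    "=": "$eq", "!=": "$ne",
    "<": "$lt", "<=": "$lte",
    ">": "$gt", ">=": "$gte",
}


def to_mongo_filter(conditions):
    """Build a MongoDB filter document from (field, operator, value) conditions."""
    clauses = []
    for field_name, operator, value in conditions:
        if operator not in MONGO_OPERATORS:
            # No direct equivalent: a real rewriter would need another strategy.
            raise ValueError(f"no direct MongoDB equivalent for {operator!r}")
        clauses.append({field_name: {MONGO_OPERATORS[operator]: value}})
    if not clauses:
        return {}
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


query_filter = to_mongo_filter([("fname", "=", "Catherine"), ("id", ">", 5)])
print(json.dumps(query_filter))
```

The resulting dictionary has the shape expected by a MongoDB `find` call; joins between collections have no counterpart in such a filter and would have to be handled outside the database.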
Semantic Web vs. NoSQL
Semantic Web | NoSQL
highly connected graphs | isolated documents, joins hardly supported
rich query expressiveness | low expressiveness
reasoning | _
? | high throughput, high availability
_ | horizontal elasticity
Filling the gap between the two worlds is not straightforward.
Yet, NoSQL DBs are a huge, quickly increasing source of data.
Potential for RDF-based data integration and publication in the Web of Data.
The generic approach (graph materialization or query rewriting) is suitable when there is direct access to the data source.
What if we access the data source via an API?
Agenda
xR2RML: Generic translation of heterogeneous data sources into RDF
SPARQL micro-services: Bridging Web APIs and the Web of Data
Applications in the biodiversity domain
Web APIs: APIs all over the web
21,700+ Web APIs are registered on ProgrammableWeb.com (Jun. 2019)
Limitations:
• Standard formats (e.g. JSON, XML) but proprietary vocabularies
• Documented in web pages but not machine-processable, no explicit semantics
• Internal resource identifiers, no hyperlinks to resources
• Partial view over the database by means of predefined services
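The SPARQL Micro-Service Architecture
Lightweight method to query a Web API with SPARQL
(1) SPARQL query: SPARQL client => SPARQL micro-service
(2) Web API query: SPARQL micro-service => Web API
(3) Web API response: Web API => SPARQL micro-service
(4) SPARQL response: SPARQL micro-service => SPARQL client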
Bridging Web APIs and the Web of Data
Assign dereferenceable URIs to Web API resources
Example: http://example.org/photo/53735656, described with schema:name "Brooklyn Bridge sunset" and schema:contentUrl
The SPARQL µ-service unlocks resources locked in a silo and exposes them in the Web of Data
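The four-step exchange and the URI assignment described above can be mimicked with a small, self-contained Python sketch. It makes several assumptions: the Web API is replaced by a stub function (`fake_photo_api`), the content URL is invented, and a wildcard triple-pattern match stands in for a real SPARQL engine.

```python
def fake_photo_api(tag):
    """Stand-in for steps (2) and (3): a real service would send an HTTP request."""
    return [
        {
            "id": "53735656",
            "title": "Brooklyn Bridge sunset",
            "url": "http://example.org/files/53735656.jpg",  # made-up content URL
        }
    ]


def to_triples(photos):
    """Assign a URI to each API resource and describe it with schema.org terms."""
    triples = []
    for photo in photos:
        subject = f"http://example.org/photo/{photo['id']}"
        triples.append((subject, "schema:name", photo["title"]))
        triples.append((subject, "schema:contentUrl", photo["url"]))
    return triples


def match(triples, pattern):
    """Evaluate one triple pattern; None plays the role of a variable."""
    return [
        t for t in triples
        if all(want is None or want == got for want, got in zip(pattern, t))
    ]


def sparql_micro_service(tag, pattern, api=fake_photo_api):
    """(1) receive the query, (2)-(3) call the Web API, (4) answer from the triples."""
    response = api(tag)
    graph = to_triples(response)
    return match(graph, pattern)


results = sparql_micro_service("bridge", (None, "schema:name", None))
print(results)
```

The internal photo identifier becomes part of a URI that other datasets can link to, which is what exposes the resource in the Web of Data.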
Agenda
xR2RML: Generic translation of heterogeneous data sources into RDF
SPARQL micro-services: Bridging Web APIs and the Web of Data
Applications in the biodiversity domain
Use case
TAXREF
French TAXonomic REFerence for fauna, flora and fungi, maintained by the Muséum National d’Histoire Naturelle.
570,000+ scientific names, 260,000+ taxa
Mainland France and overseas territories
Web site, Web service, downloadable text file
Biodiversity studies (e.g. impact of global warming on species distributions) require mashing up data from multiple stakeholders
How to make biodiversity data FAIR?
TAXREF-LD
Linking Open Data cloud diagram, 2019. J.P. McCrae, A. Abele, P. Buitelaar, R. Cyganiak, A. Jentzsch, V. Andryushechkin and J. Debattista. http://lod-cloud.net/
http://taxref.mnhn.fr/sparql
Several steps involved…
• Modelling of taxonomic information as Linked Data
• Write and enact xR2RML mappings (JSON => MongoDB => RDF)
• Publish on the Web of Data
[Figure: Web app. (SPARQL, HTML), SPARQL Micro-services, TAXREF-LD, NCBI, TaxonConcept, Agrovoc]
SPARQL micro-services to compare TAXREF information with 7 biodiversity sources:
• FishBase
• Global Biodiversity Information Facility
• World Register of Marine Species
• Pan-European Species directories Infrastructure
• Index Fungorum
• Tropicos
• Sandre – Service d’Administration Nationale des Données et Référentiels sur l’Eau
http://sms.i3s.unice.fr/demo-sms?param=Delphinapterus+leucas
Take-aways
More data sources => new data integration scenarios
Need for explicit, machine-processable data semantics
The Semantic Web provides tools to do that: RDF, SPARQL, ontologies…
Various methods to translate heterogeneous data sources to RDF: mapping language-based, wrapper-based
More research needed to:
• Allow automatic discovery of data sources, e.g. data portals, search engines…
• Automatically generate federated queries
• Automate semantic alignment of data sources represented in RDF
These techniques are a way to achieve Open Data, Open Science, FAIRness
Related publications
Generic translation to RDF
Michel F., Djimenou L., Faron-Zucker C. & Montagnat J. (2015). Translation of Relational and Non-Relational Databases into RDF with xR2RML. In Proceedings of WebIST, pp. 443–454. Lisbon, Portugal.
Michel F., Faron-Zucker C. & Montagnat J. (2016). A Generic Mapping-Based Query Translation from SPARQL to Various Target Database Query Languages. In Proceedings of WebIST vol. 2, pp. 147–158. Rome, Italy.
Michel F., Faron-Zucker C. & Montagnat J. (2016). A Mapping-based Method to Query MongoDB Documents with SPARQL. In Proceedings of DEXA vol. 9828, LNCS, pp. 52–67. Porto, Portugal.
Michel F., Faron-Zucker C. & Montagnat J. (2018). Bridging the Semantic Web and NoSQL Worlds: Generic SPARQL Query Translation and Application to MongoDB. Transactions on Large-Scale Data- and Knowledge-Centered Systems (LNCS 11360):125–165.
Biodiversity
Michel F., Gargominy O., Tercerie S. & Faron-Zucker C. (2017). A Model to Represent Nomenclatural and Taxonomic Information as Linked Data. Application to the French Taxonomic Register, TAXREF. In Proceedings of the ISWC2017 workshop on Semantics for Biodiversity (S4BioDiv) vol. 1933. Vienna, Austria.
Michel F., Faron-Zucker C., Tercerie S. & Gargominy O. (2018). Modelling Biodiversity Linked Data: Pragmatism May Narrow Future Opportunities. In Biodiversity Information Science and Standards, TDWG 2018 Proceedings vol. 2, p. e26235. Dunedin, New Zealand.
SPARQL micro-services
Michel F., Faron-Zucker C. & Gandon F. (2018). SPARQL Micro-Services: Lightweight Integration of Web APIs and Linked Data. In Proceedings of the Linked Data on the Web Workshop (LDOW2018). Lyon, France.
Michel F., Faron-Zucker C., Gargominy O. & Gandon F. (2018). Integration of Web APIs and Linked Data Using SPARQL Micro-Services—Application to Biodiversity Use Cases. Information 9(12):310.
Michel F., Faron-Zucker C., Corby O. & Gandon F. (2019). Enabling Automatic Discovery and Querying of Web APIs at Web Scale using Linked Data Standards. In Companion Proceedings of the 2019 World Wide Web Conference (WWW ’19 Companion). San Francisco, CA, USA.