Strategies for Processing and Explaining Distributed Queries on Linked Data

Rakebul Hasan
Wimmics, Inria Sophia Antipolis

Research Theme

• Distributed Query Processing
  – Optimization techniques for querying Linked Data
• Distributed Query Explanation
  – Query plan explanation
    • How query solving works
  – Query result explanation
    • Why
    • Where

DISTRIBUTED QUERY PROCESSING

Querying Linked Data

• Bottom-up strategies
  – Discover sources during query processing by following links between sources
• Top-down strategies
  – Sources are known

Querying Linked Data

FedBench Query CD5: Find the director and the genre of movies directed by Italians

SELECT ?film ?director ?genre WHERE {
  ?film dbpedia-owl:director ?director .
  ?director dbpedia-owl:nationality dbpedia:Italy .
  ?x owl:sameAs ?film .
  ?x linkedMDB:genre ?genre
}

Querying Linked Data

FedBench Query CD5: Find the director and the genre of movies directed by Italians

SELECT ?film ?director ?genre WHERE {
  SERVICE <http://dbpedia.org/sparql> {
    ?film dbpedia-owl:director ?director .
    ?director dbpedia-owl:nationality dbpedia:Italy .
    ?x owl:sameAs ?film .
  }
  SERVICE <http://data.linkedmdb.org/sparql> {
    ?x linkedMDB:genre ?genre
  }
}

SPARQL 1.1 SERVICE clause

Need knowledge of which part of the query should be solved by which endpoint

• Top-down approaches
  – Data warehousing approach
  – Virtual integration approach

• Data warehousing approach
  – Collect the data in a central triple store
  – Process queries on that triple store
• Disadvantages
  – Expensive preprocessing (data collection + integration) and maintenance
  – Data not up to date

• Virtual integration approach
  – A query federation middleware
    • Parse and split into subqueries
    • Source selection for subqueries
    • Evaluate the subqueries directly against the corresponding sources
    • Merge the results
  – Advantages
    • No preprocessing and maintenance
    • Up-to-date data

Running Example

SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}

http://www4.wiwiss.fu-berlin.de/drugbank/sparql

drugbank

http://cu.bioportal.bio2rdf.org/sparql

http://data.linkedmdb.org/sparql

DBpedia

Query LS6: Find KEGG drug names of all drugs in Drugbank belonging to category Micronutrient

Parsing and Source Selection

?drug drugbank:drugCategory drugbank-category:micronutrient
?drug drugbank:casRegistryNumber ?id
?keggDrug rdf:type kegg:Drug
?keggDrug bio2rdf:xRef ?id
?keggDrug purl:title ?title

Send ASK queries to all the sources

ASK { ?drug drugbank:drugCategory drugbank-category:micronutrient }
ASK { ?drug drugbank:casRegistryNumber ?id }
ASK { ?keggDrug rdf:type kegg:Drug }
ASK { ?keggDrug bio2rdf:xRef ?id }
ASK { ?keggDrug purl:title ?title }
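The ASK-based source selection above can be sketched in a few lines. This is only an illustration under stated assumptions: the endpoints are replaced by in-memory sets of triples, the sample data and the names `SOURCES`, `ask`, and `select_sources` are hypothetical, and a real middleware would send each ASK query over HTTP instead.

```python
# Toy stand-ins for remote endpoints: each source is a set of (s, p, o) triples.
# Source names and data are invented for illustration only.
SOURCES = {
    "drugbank": {
        ("db:D1", "drugbank:drugCategory", "drugbank-category:micronutrient"),
        ("db:D1", "drugbank:casRegistryNumber", "50-81-7"),
    },
    "kegg": {
        ("kegg:K1", "rdf:type", "kegg:Drug"),
        ("kegg:K1", "bio2rdf:xRef", "50-81-7"),
        ("kegg:K1", "purl:title", "Ascorbic acid"),
    },
    "dbpedia": {
        ("dbpedia:Rome", "rdf:type", "dbpedia-owl:City"),
    },
}

PATTERNS = [
    ("?drug", "drugbank:drugCategory", "drugbank-category:micronutrient"),
    ("?drug", "drugbank:casRegistryNumber", "?id"),
    ("?keggDrug", "rdf:type", "kegg:Drug"),
    ("?keggDrug", "bio2rdf:xRef", "?id"),
    ("?keggDrug", "purl:title", "?title"),
]


def is_var(term):
    return term.startswith("?")


def ask(triples, pattern):
    """Emulate ASK { pattern }: True if at least one triple matches."""
    return any(
        all(is_var(p) or p == t for p, t in zip(pattern, triple))
        for triple in triples
    )


def select_sources(patterns, sources):
    """Map every triple pattern to the sources that answered True."""
    return {
        pattern: sorted(name for name, triples in sources.items() if ask(triples, pattern))
        for pattern in patterns
    }


selected = select_sources(PATTERNS, SOURCES)
for pattern, names in selected.items():
    print(pattern, "->", names)
```

Each pattern is then only sent to the sources that can contribute answers, so the third source in this toy setup is never contacted for query LS6.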

Evaluating subqueries

• Two options
  – All triple patterns are individually evaluated
  – Nested loop join (NLJ): evaluate iteratively, pattern by pattern

Evaluating subqueries

• Example of NLJ
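As a rough illustration of the nested loop join, the sketch below evaluates the running example pattern by pattern, feeding every intermediate binding into the next pattern. The in-memory `SOURCES`, the sample triples, and the function names are assumptions made for this example; in a real federation each call to `evaluate` would be a remote subquery.

```python
# Invented sample data standing in for two remote endpoints.
SOURCES = {
    "drugbank": [
        ("db:D1", "drugbank:drugCategory", "drugbank-category:micronutrient"),
        ("db:D1", "drugbank:casRegistryNumber", "50-81-7"),
        ("db:D2", "drugbank:drugCategory", "drugbank-category:micronutrient"),
        ("db:D2", "drugbank:casRegistryNumber", "59-30-3"),
    ],
    "kegg": [
        ("kegg:K1", "rdf:type", "kegg:Drug"),
        ("kegg:K1", "bio2rdf:xRef", "50-81-7"),
        ("kegg:K1", "purl:title", "Ascorbic acid"),
    ],
}

# Each step of the plan: (triple pattern, source selected for it).
PLAN = [
    (("?drug", "drugbank:drugCategory", "drugbank-category:micronutrient"), "drugbank"),
    (("?drug", "drugbank:casRegistryNumber", "?id"), "drugbank"),
    (("?keggDrug", "bio2rdf:xRef", "?id"), "kegg"),
    (("?keggDrug", "rdf:type", "kegg:Drug"), "kegg"),
    (("?keggDrug", "purl:title", "?title"), "kegg"),
]


def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    extended = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # setdefault keeps an existing value; a conflict means no match.
            if extended.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return extended


def evaluate(pattern, triples, binding):
    """Evaluate one pattern against one source under the given binding."""
    for triple in triples:
        extended = match(pattern, triple, binding)
        if extended is not None:
            yield extended


def nested_loop_join(plan, sources):
    solutions = [{}]
    for pattern, source in plan:
        solutions = [
            extended
            for binding in solutions
            for extended in evaluate(pattern, sources[source], binding)
        ]
    return solutions


results = nested_loop_join(PLAN, SOURCES)
print(results)
```

The number of `evaluate` calls grows with the number of intermediate bindings, which is why the join techniques and join order discussed later matter.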

Optimization techniques

• Source Selection
  – Indexing: characterization of RDF graphs
  – Statistics-based catalogue
  – Caching

• Exclusive grouping

• Bound join
  – Effective for NLJ
  – Mappings of variable values from the intermediate result to the next subquery
  – SPARQL 1.1 VALUES clause

• Hash join
  – Hash table for intermediate mappings

• FILTER optimization
  – Evaluating the subqueries with corresponding FILTERs
  – Reduces the number of intermediate results

• Parallelization
  – Effective for the individual subquery evaluation approach

• Join order
  – Selectivity estimation: join order heuristic based on the number of bound and free variables [1]
  – Statistics: based on cardinalities of the triple patterns [2]

[1] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL basic graph pattern optimization using selectivity estimation. In Proceedings of the 17th international conference on World Wide Web (WWW '08)
[2] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K. Sattler, and J. Umbrich. Data summaries for on-demand queries over linked data. In Proceedings of the 19th international conference on World Wide Web (WWW '10)
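Two of the join techniques listed above, bound join and hash join, can be sketched together. This is a minimal illustration, not the implementation of any particular federation engine: `bound_join_query` and `hash_join` are hypothetical helper names, the bindings are invented, and values are treated as plain string literals without escaping.

```python
from collections import defaultdict


def bound_join_query(pattern, var, values):
    """Build one subquery that ships all intermediate values of `var` at once,
    using a SPARQL 1.1 VALUES clause, instead of one subquery per value."""
    rows = " ".join(f'"{v}"' for v in values)
    return f"SELECT * WHERE {{ VALUES {var} {{ {rows} }} {pattern} }}"


def hash_join(left, right, join_vars):
    """Join two lists of bindings on `join_vars` with a hash table built on `left`."""
    table = defaultdict(list)
    for binding in left:
        table[tuple(binding[v] for v in join_vars)].append(binding)
    joined = []
    for binding in right:
        key = tuple(binding[v] for v in join_vars)
        for partner in table.get(key, []):
            joined.append({**partner, **binding})
    return joined


# Intermediate result of the Drugbank subquery (invented bindings).
left = [
    {"?drug": "db:D1", "?id": "50-81-7"},
    {"?drug": "db:D2", "?id": "59-30-3"},
]
query = bound_join_query("?keggDrug bio2rdf:xRef ?id .", "?id", [b["?id"] for b in left])

# Pretend this is what the second endpoint returned for `query`.
right = [{"?keggDrug": "kegg:K1", "?id": "50-81-7"}]
joined = hash_join(left, right, ["?id"])

print(query)
print(joined)
```

The bound join reduces the number of remote requests, while the hash join avoids comparing every pair of bindings when the two partial results are merged locally.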

Selectivity Estimation

• Ideal for the Linked Data scenario
  – No need for statistics about the underlying data
  – Estimations can be wrong, however

BGP / Number of results / Selectivity estimation

?v1 p1 ?v2 . ?v1 p2 ?v3
?v4 p3 c1 . ?v4 p4 ?v1
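A very simple form of the variable-counting idea can be sketched as follows. It is an assumption-laden simplification of the heuristic cited as [1]: patterns are ranked only by their number of free variables, and the names `free_variables` and `order_by_selectivity` are hypothetical.

```python
def free_variables(pattern):
    """Count the unbound terms (variables) in a triple pattern."""
    return sum(term.startswith("?") for term in pattern)


def order_by_selectivity(bgp):
    """Put patterns with fewer free variables first (assumed more selective).
    sorted() is stable, so ties keep their original order."""
    return sorted(bgp, key=free_variables)


bgp = [("?v4", "p4", "?v1"), ("?v4", "p3", "c1")]
ordered = order_by_selectivity(bgp)
print(ordered)
```

No statistics about the data are needed, but the estimate can be wrong: two patterns with the same number of free variables, such as `?v1 p1 ?v2` and `?v1 p2 ?v3`, are indistinguishable to this heuristic even if their result sizes differ widely.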

Query Performance Prediction

• Learn query performance from already executed queries

• Statistics about the underlying data not required

• Applications
  – Query optimization: join order
  – Workload management/scheduling

Regression

f(X) = y

X = feature vector, a vector representation of the SPARQL query
y = performance metric (e.g. latency, number of results)

Learn a mapping function

Support vector machine with nu-SVR
k-nearest neighbors regression

Feature Extraction

• How can we represent SPARQL queries as vectors?

SPARQL

• SPARQL algebra features
• Graph pattern features

SPARQL Algebra

http://www.w3.org/TR/sparql11-query/#sparqlQuery

SPARQL Algebra Features
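One plausible way to turn a SPARQL algebra expression into a vector is sketched below. The exact feature set is an assumption for illustration: the vector holds the frequency of a few algebra operators plus the depth of the algebra tree, and the tree is written by hand as nested tuples rather than produced by a real SPARQL parser.

```python
# Operators counted in the feature vector (an illustrative subset).
OPERATORS = ["bgp", "join", "leftjoin", "union", "filter", "project"]


def walk(node, depth=1):
    """Yield (operator, depth) for every node of a nested-tuple algebra tree."""
    op, *children = node
    yield op, depth
    for child in children:
        yield from walk(child, depth + 1)


def algebra_features(tree):
    """Return operator frequencies followed by the depth of the tree."""
    visited = list(walk(tree))
    counts = [sum(op == name for op, _ in visited) for name in OPERATORS]
    return counts + [max(depth for _, depth in visited)]


# Hand-written algebra tree for a query of the form
# SELECT ... WHERE { { BGP } OPTIONAL { BGP } FILTER (...) }
TREE = ("project", ("filter", ("leftjoin", ("bgp",), ("bgp",))))
features = algebra_features(TREE)
print(dict(zip(OPERATORS + ["depth"], features)))
```

Every query then maps to a vector of the same length, which is what the regression models need as input X.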

Graph Pattern Features

• Clustering Training Queries
  – K-medoids clustering algorithm with approximated edit distance as distance function
    • Selects data points as cluster centers
    • Arbitrary distance function
• Graph Edit Distance
  – Minimum amount of distortion needed to transform one graph into another
  – Compute similarity by inverting the distance
• Graph Edit Distance
  – Usually computed using A* search
    • Exponential running time
  – Approximated graph edit distance based on bipartite matching
    • Previous research shows accurate results with classification problems
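The clustering step and the resulting graph pattern features can be sketched as below. Several things here are assumptions: a toy set-based distance stands in for the approximated graph edit distance, queries are reduced to sets of labels, the initial medoids are given by hand, and the similarity is taken as 1 / (1 + distance), which is one possible way of inverting a distance.

```python
def toy_distance(a, b):
    """Stand-in for approximated graph edit distance: size of the symmetric difference."""
    return len(a ^ b)


def k_medoids(points, medoids, distance, max_iter=20):
    """Alternate assignment and medoid update, starting from the given medoids.
    The initial medoids must be distinct members of `points`."""
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for p in points:
            clusters[min(medoids, key=lambda m: distance(p, m))].append(p)
        # The new medoid of a cluster is the member closest to all other members.
        updated = [
            min(members, key=lambda c: sum(distance(c, q) for q in members))
            for members in clusters.values()
        ]
        if updated == medoids:
            break
        medoids = updated
    return medoids


def pattern_features(query, medoids, distance):
    """One feature per cluster center: similarity of the query to that center."""
    return [1.0 / (1.0 + distance(query, m)) for m in medoids]


# Invented training "queries", each reduced to a set of labels.
points = [
    frozenset({"a", "b"}), frozenset({"a", "b", "c"}), frozenset({"a"}),
    frozenset({"x", "y"}), frozenset({"x", "y", "z"}), frozenset({"x"}),
]
medoids = k_medoids(points, [points[1], points[4]], toy_distance)
features = pattern_features(frozenset({"a", "c"}), medoids, toy_distance)
print(medoids)
print(features)
```

Because k-medoids only needs pairwise distances and always picks actual data points as centers, it works with a graph distance where a mean of the data points would not be defined.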

Experiment Setup

• Triple store and hardware
  – Jena TDB 1.0.0
  – 16 GB memory
  – 2.53 GHz CPU
  – 48 GB system RAM
  – Linux 2.6.32 operating system

Experiment Setup

• Datasets
  – Training, validation, test datasets generated from 25 DBPSB query templates
    • 1260 training queries
    • 420 validation queries
    • 420 test queries
  – RDF data: DBpedia 3.5.1 with 100% scaling factor from DBPSB framework

Prediction Models

• Support Vector Machine (SVM) regression with nu-SVR
• k-nearest neighbors (k-NN) regression
  – Euclidean distance as the distance function
  – k-dimensional tree (k-d tree) data structure to compute the nearest neighbors
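The k-NN regression model can be illustrated with a few lines of plain Python. This sketch deliberately swaps the k-d tree for a brute-force scan to stay short, and the feature vectors and execution times are invented numbers, not measurements from the experiments.

```python
import math


def knn_predict(train_X, train_y, x, k=3):
    """Predict y for x as the mean y of its k nearest training vectors
    (Euclidean distance, brute-force search instead of a k-d tree)."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Invented feature vectors and execution times in milliseconds.
train_X = [[1, 0, 1], [2, 1, 1], [2, 1, 2], [6, 4, 3]]
train_y = [10.0, 20.0, 30.0, 400.0]

prediction = knn_predict(train_X, train_y, [2, 1, 1], k=2)
print(prediction)
```

A k-d tree returns the same neighbors as the brute-force scan; it only speeds up the search when the training set is large.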

Evaluation Measures

• Coefficient of determination

• Root mean squared error (RMSE)
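Both measures follow their standard definitions and can be computed as sketched below; the actual and predicted values are invented for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Invented execution times (actual vs. predicted).
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(rmse(actual, predicted))
print(r_squared(actual, predicted))
```

RMSE is expressed in the unit of the predicted metric, while the coefficient of determination is unit-free and equals 1 for a perfect prediction.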

Predicting Query Execution Time

• SPARQL algebra features

Predicting Query Execution Time

• SPARQL algebra and graph pattern features

What’s Next

• Apply QPP in join order optimization

• Benchmarking
  – FedBench: a benchmarking framework for federated query processing
• Systematic generation of training queries
  – Bootstrapping
  – Refining training queries from query logs

Statistical Analysis of Query Logs

• Approach to Systematic generation of training queries

Mario Arias, Javier D. Fernández, Miguel A. Martínez-Prieto, Pablo de la Fuente: An Empirical Study of Real-World SPARQL Queries, 1st International Workshop on Usage Analysis and the Web of Data, co-located with the 20th International World Wide Web Conference (WWW2011)

Summary

• Distributed query processing
• Optimization techniques
• Query performance prediction
  – Join order optimization

DISTRIBUTED QUERY EXPLANATION

Query Explanation

• Query plan explanation
• Query result explanation
• Motivation
  – Understanding
  – Transparency
  – Trust

Query Explanation in the Semantic Web

Query plan explanation Query result explanation

Sesame

Virtuoso *

BigOWLIM +

* Explanation of SPARQL to SQL translation
+ A debugging feature on the query engine side

Distributed Query Explanation in the Semantic Web

Query plan explanation Query result explanation

SemWIQ

ADERIS

Anapsid

ADERIS Query Plan Explanation

Related Work

• Database: why, how, where provenance
• Semantic Web:
  – Inference explanation
    • Provenance
    • Generating justifications

Distributed Query Explanation: What We Provide

• Query plan explanation
  – Work in progress
  – Prior to query execution
    • Includes predicted performance metrics
  – After query execution with performance metrics
• Query result explanation
  – Why provenance
  – Where provenance

Query Result Explanation

Result Explainer Plug-in

(RDF model, Query, Result)

Can be any RDF model (RDF graph, SPARQL endpoint)

We generate result explanations by querying this model for why provenance

Why Provenance

• Triples in virtual model from which the result is derived

<http://example.org/book/book8>
    <http://purl.org/dc/elements/1.1/creator> <http://www-sop.inria.fr/members/Alice> ;
    <http://purl.org/dc/elements/1.1/title> "Distributed Query Processing for Linked Data" .
<http://www-sop.inria.fr/members/Charlie>
    <http://xmlns.com/foaf/0.1/name> "Charlie" .
<http://www-sop.inria.fr/members/Alice>
    <http://xmlns.com/foaf/0.1/knows> <http://www-sop.inria.fr/members/Charlie> ;
    <http://xmlns.com/foaf/0.1/name> "Alice" .
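One way to obtain such triples is to substitute a solution's bindings into the triple patterns of the query and keep the instantiated triples that exist in the model. The sketch below assumes this approach; the query patterns are hypothetical (the original query is not shown here), prefixed names abbreviate the IRIs above, and `why_provenance` is an illustrative name.

```python
# The virtual model from the example, with abbreviated IRIs.
MODEL = {
    ("ex:book8", "dc:creator", "inria:Alice"),
    ("ex:book8", "dc:title", "Distributed Query Processing for Linked Data"),
    ("inria:Charlie", "foaf:name", "Charlie"),
    ("inria:Alice", "foaf:knows", "inria:Charlie"),
    ("inria:Alice", "foaf:name", "Alice"),
}

# Hypothetical query patterns that would produce the solution below.
PATTERNS = [
    ("?book", "dc:creator", "?author"),
    ("?book", "dc:title", "?title"),
    ("?author", "foaf:name", "?authorName"),
    ("?author", "foaf:knows", "?friend"),
    ("?friend", "foaf:name", "?friendName"),
]

SOLUTION = {
    "?book": "ex:book8",
    "?author": "inria:Alice",
    "?title": "Distributed Query Processing for Linked Data",
    "?authorName": "Alice",
    "?friend": "inria:Charlie",
    "?friendName": "Charlie",
}


def why_provenance(patterns, solution, model):
    """Instantiate each pattern with the solution and keep triples found in the model."""
    instantiated = [tuple(solution.get(term, term) for term in pattern) for pattern in patterns]
    return [triple for triple in instantiated if triple in model]


explanation = why_provenance(PATTERNS, SOLUTION, MODEL)
for triple in explanation:
    print(triple)
```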

Where Provenance

• Keep the provenance of sources in the virtual model

fedqld:source1 {
    <http://example.org/book/book8>
        <http://purl.org/dc/elements/1.1/creator> <http://www-sop.inria.fr/members/Alice> ;
        <http://purl.org/dc/elements/1.1/title> "Distributed Query Processing for Linked Data" .
}

fedqld:source2 {
    <http://www-sop.inria.fr/members/Charlie>
        <http://xmlns.com/foaf/0.1/name> "Charlie" .
    <http://www-sop.inria.fr/members/Alice>
        <http://xmlns.com/foaf/0.1/knows> <http://www-sop.inria.fr/members/Charlie> ;
        <http://xmlns.com/foaf/0.1/name> "Alice" .
}

fedqld:prov {
    fedqld:source1 void:sparqlEndpoint <http://localhost:3030/books/query> .
    fedqld:source2 void:sparqlEndpoint <http://localhost:3031/person/query> .
}

Where the triples in this graph come from
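With the sources kept as named graphs, where provenance becomes a lookup: find the graph that contains a triple, then read that graph's endpoint from the provenance graph. The sketch below mirrors the example data with abbreviated IRIs; the dictionaries and the function name `where_provenance` are assumptions for illustration.

```python
# Named graphs of the virtual model (abbreviated IRIs).
GRAPHS = {
    "fedqld:source1": {
        ("ex:book8", "dc:creator", "inria:Alice"),
        ("ex:book8", "dc:title", "Distributed Query Processing for Linked Data"),
    },
    "fedqld:source2": {
        ("inria:Charlie", "foaf:name", "Charlie"),
        ("inria:Alice", "foaf:knows", "inria:Charlie"),
        ("inria:Alice", "foaf:name", "Alice"),
    },
}

# Content of the fedqld:prov graph: named graph -> void:sparqlEndpoint.
ENDPOINTS = {
    "fedqld:source1": "http://localhost:3030/books/query",
    "fedqld:source2": "http://localhost:3031/person/query",
}


def where_provenance(triple, graphs, endpoints):
    """Return the endpoints of all named graphs that contain the triple."""
    return sorted(endpoints[name] for name, triples in graphs.items() if triple in triples)


origin = where_provenance(("inria:Alice", "foaf:knows", "inria:Charlie"), GRAPHS, ENDPOINTS)
print(origin)
```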

What’s next

• Explanation user interfaces
• Evaluating the impacts of our explanations

References

• Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, and Michael Schmidt. FedX: A federation layer for distributed query processing on linked open data. In ESWC, 2011
• Shinji Kikuchi, Shelly Sachdeva, Subhash Bhalla, Steven Lynden, Isao Kojima, Akiyoshi Matono, and Yusuke Tanimura. Adaptive integration of distributed semantic web data. Databases in Networked Information Systems
• Michael Schmidt, Olaf Görlitz, Peter Haase, Günter Ladwig, Andreas Schwarte, and Thanh Tran. 2011. FedBench: a benchmark suite for federated semantic data query processing. In Proceedings of the 10th international conference on The semantic web - Volume Part I (ISWC'11)
