Strategies for Processing and Explaining Distributed Queries on Linked Data
Rakebul Hasan
Wimmics, Inria Sophia Antipolis
Research Theme
• Distributed Query Processing
  – Optimization techniques for querying Linked Data
• Distributed Query Explanation
  – Query plan explanation
    • How query solving works
  – Query result explanation
    • Why
    • Where
DISTRIBUTED QUERY PROCESSING
Querying Linked Data
• Bottom-up strategies
  – Discover sources during query processing by following links between sources
• Top-down strategies
  – Sources are known
Querying Linked Data
FedBench Query CD5: Find the director and the genre of movies directed by Italians

SELECT ?film ?director ?genre WHERE {
  ?film dbpedia-owl:director ?director .
  ?director dbpedia-owl:nationality dbpedia:Italy .
  ?x owl:sameAs ?film .
  ?x linkedMDB:genre ?genre
}
Querying Linked Data
FedBench Query CD5: Find the director and the genre of movies directed by Italians

SELECT ?film ?director ?genre WHERE {
  SERVICE <http://dbpedia.org/sparql> {
    ?film dbpedia-owl:director ?director .
    ?director dbpedia-owl:nationality dbpedia:Italy .
    ?x owl:sameAs ?film .
  }
  SERVICE <http://data.linkedmdb.org/sparql> {
    ?x linkedMDB:genre ?genre
  }
}
SPARQL 1.1 SERVICE clause: requires knowing which part of the query should be solved by which endpoint
Querying Linked Data
• Top-down approaches
  – Data warehousing approach
  – Virtual integration approach
Querying Linked Data
• Data warehousing approach
  – Collect the data in a central triple store
  – Process queries on that triple store
• Disadvantages
  – Expensive preprocessing (data collection and integration) and maintenance
  – Data not up to date
Querying Linked Data
• Virtual integration approach
  – A query federation middleware
    • Parse the query and split it into subqueries
    • Source selection for subqueries
    • Evaluate the subqueries directly against the corresponding sources
    • Merge the results
  – Advantages
    • No preprocessing and maintenance
    • Up-to-date data
Running Example
Query LS6: Find KEGG drug names of all drugs in Drugbank belonging to the category Micronutrient

SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}

Sources:
• drugbank: http://www4.wiwiss.fu-berlin.de/drugbank/sparql
• KEGG: http://cu.bioportal.bio2rdf.org/sparql
• DBpedia: http://data.linkedmdb.org/sparql
Parsing and Source Selection

The query is decomposed into its triple patterns:
  ?drug drugbank:drugCategory drugbank-category:micronutrient
  ?drug drugbank:casRegistryNumber ?id
  ?keggDrug rdf:type kegg:Drug
  ?keggDrug bio2rdf:xRef ?id
  ?keggDrug purl:title ?title

ASK queries are sent to all the sources:
  ASK { ?drug drugbank:drugCategory drugbank-category:micronutrient }
  ASK { ?drug drugbank:casRegistryNumber ?id }
  ASK { ?keggDrug rdf:type kegg:Drug }
  ASK { ?keggDrug bio2rdf:xRef ?id }
  ASK { ?keggDrug purl:title ?title }
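This ASK-based source selection step can be sketched in Python. This is a minimal illustration, not the actual middleware: the `fake_ask` stub and its string-matching rules stand in for sending real ASK requests to the SPARQL endpoints over HTTP.

```python
# Sketch of ASK-based source selection. The injectable `ask` callable is a
# stand-in for an HTTP request to a SPARQL endpoint.

def ask_query(triple_pattern: str) -> str:
    """Wrap a single triple pattern in a SPARQL ASK query."""
    return "ASK { %s }" % triple_pattern

def select_sources(patterns, endpoints, ask):
    """For each triple pattern, keep the endpoints that answer ASK with true."""
    relevant = {}
    for p in patterns:
        q = ask_query(p)
        relevant[p] = [e for e in endpoints if ask(e, q)]
    return relevant

# Toy stand-in for sending the ASK query over the wire.
def fake_ask(endpoint, query):
    if "drugbank" in endpoint:
        return "drugbank:" in query
    return "kegg:" in query or "bio2rdf:" in query or "purl:title" in query

patterns = [
    "?drug drugbank:drugCategory drugbank-category:micronutrient",
    "?keggDrug rdf:type kegg:Drug",
]
endpoints = [
    "http://www4.wiwiss.fu-berlin.de/drugbank/sparql",
    "http://cu.bioportal.bio2rdf.org/sparql",
]
sources = select_sources(patterns, endpoints, fake_ask)
# each pattern maps to the single endpoint that can answer it
```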
Evaluating subqueries
• Two options
  – All triple patterns are individually evaluated
  – Nested loop join (NLJ): evaluate the patterns iteratively, pattern by pattern
Evaluating subqueries
• Example of NLJ

SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
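A minimal sketch of the nested loop join over SPARQL solution mappings (dicts from variable to value); the toy bindings are hypothetical stand-ins for results fetched from the Drugbank and KEGG endpoints.

```python
# Nested loop join over SPARQL-style solution mappings.

def compatible(mu1, mu2):
    """Two mappings are join-compatible if their shared variables agree."""
    return all(mu2[v] == val for v, val in mu1.items() if v in mu2)

def nested_loop_join(left, right):
    """Join two lists of solution mappings with a nested loop."""
    out = []
    for mu1 in left:
        for mu2 in right:
            if compatible(mu1, mu2):
                merged = dict(mu1)
                merged.update(mu2)
                out.append(merged)
    return out

# ?drug drugbank:casRegistryNumber ?id  (toy results from Drugbank)
left = [{"drug": "d1", "id": "cas-1"}, {"drug": "d2", "id": "cas-2"}]
# ?keggDrug bio2rdf:xRef ?id           (toy results from KEGG)
right = [{"keggDrug": "k1", "id": "cas-1"}]

joined = nested_loop_join(left, right)
# joined == [{"drug": "d1", "id": "cas-1", "keggDrug": "k1"}]
```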
Optimization techniques
• Source selection
  – Indexing: characterization of RDF graphs
  – Statistics-based catalogue
  – Caching
Optimization techniques
• Exclusive grouping
  – Triple patterns whose only relevant source is the same endpoint are grouped into a single subquery for that endpoint

SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
Optimization techniques
• Bound join
  – Effective for NLJ
  – Mappings of variable values from the intermediate result are passed to the next subquery
  – Uses the SPARQL 1.1 VALUES clause
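The bound join idea can be illustrated with a small helper that ships intermediate bindings to the next subquery in a VALUES block, replacing one remote request per binding with a single bulk request. The function and subquery text are illustrative, not the engine's actual API.

```python
# Sketch of a bound join: intermediate bindings are attached to the next
# subquery via a SPARQL 1.1 VALUES clause (here as plain string literals).

def bound_join_query(subquery_body, var, bindings):
    """Attach a VALUES block binding `var` to the given literal values."""
    values = " ".join('"%s"' % b for b in bindings)
    return "SELECT * WHERE { %s VALUES ?%s { %s } }" % (subquery_body, var, values)

q = bound_join_query("?keggDrug bio2rdf:xRef ?id .", "id", ["cas-1", "cas-2"])
# one request now carries both intermediate ?id values
```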
Optimization techniques
• Hash join
  – Hash table for intermediate mappings
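A hash join sketch over the same kind of solution mappings: build a hash table on the join variable over one input, then probe it with the other. The toy mappings are hypothetical.

```python
from collections import defaultdict

def hash_join(left, right, var):
    """Join two lists of solution mappings on a single shared variable."""
    table = defaultdict(list)
    for mu in left:
        table[mu[var]].append(mu)            # build phase
    out = []
    for mu2 in right:
        for mu1 in table.get(mu2[var], []):  # probe phase
            merged = dict(mu1)
            merged.update(mu2)
            out.append(merged)
    return out

left = [{"drug": "d1", "id": "cas-1"}, {"drug": "d2", "id": "cas-2"}]
right = [{"keggDrug": "k1", "id": "cas-2"}]
joined = hash_join(left, right, "id")
```

Compared with the nested loop join, the probe phase touches only the matching bucket instead of every left mapping.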
Optimization techniques
• FILTER optimization
  – Evaluate the subqueries with their corresponding FILTERs
  – Reduces the number of intermediate results
Optimization techniques
• Parallelization
  – Effective for the individual subquery evaluation approach
Optimization techniques
• Join order
  – Selectivity estimation: a join order heuristic based on the number of bound and free variables [1]
  – Statistics: based on the cardinalities of the triple patterns [2]

[1] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL Basic Graph Pattern Optimization Using Selectivity Estimation. In Proceedings of the 17th International Conference on World Wide Web (WWW '08)
[2] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K. Sattler, and J. Umbrich. Data Summaries for On-Demand Queries over Linked Data. In Proceedings of the 19th International Conference on World Wide Web (WWW '10)
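The variable-counting idea behind [1] can be sketched as follows. The simplification that a pattern's selectivity is just its number of unbound variables is ours; the full heuristic also weighs which positions (subject, predicate, object) are bound.

```python
# Join order heuristic sketch: triple patterns with fewer unbound variables
# are assumed more selective and are evaluated first.

def unbound_count(pattern):
    """Count the variables (terms starting with '?') in an (s, p, o) triple."""
    return sum(1 for term in pattern if term.startswith("?"))

def order_by_selectivity(patterns):
    """Evaluate patterns with fewer free variables first."""
    return sorted(patterns, key=unbound_count)

patterns = [
    ("?drug", "drugbank:casRegistryNumber", "?id"),                          # 2 vars
    ("?drug", "drugbank:drugCategory", "drugbank-category:micronutrient"),   # 1 var
    ("?keggDrug", "purl:title", "?title"),                                   # 2 vars
]
ordered = order_by_selectivity(patterns)
# the pattern with the constant object comes first
```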
Selectivity Estimation
• Ideal for the Linked Data scenario
  – No need for statistics about the underlying data
  – Estimations can, however, be wrong

BGP                        Number of results   Selectivity estimation
?v1 p1 ?v2 . ?v1 p2 ?v3    5                   2
?v4 p3 c1 . ?v4 p4 ?v1     10                  1
Query Performance Prediction
• Learn query performance from already executed queries
• Statistics about the underlying data not required
• Applications
  – Query optimization: join order
  – Workload management / scheduling
Regression
Learn a mapping function f(X) = y, where
• X = feature vector, a vector representation of the SPARQL query
• y = performance metric (e.g. latency, number of results)

Models: support vector machine with nu-SVR, and k-nearest neighbors regression
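A minimal pure-Python version of the second model, k-nearest neighbors regression: predict a query's metric as the mean metric of the k training queries whose feature vectors are closest in Euclidean distance. The feature vectors and latency values below are made up.

```python
import math

def knn_predict(train_X, train_y, x, k=3):
    """Average the targets of the k training points nearest to x."""
    dists = sorted((math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y))
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical query feature vectors and their measured latencies (ms).
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [10.0, 12.0, 14.0, 100.0]

pred = knn_predict(train_X, train_y, (0.2, 0.2), k=3)
# pred == (10.0 + 12.0 + 14.0) / 3 == 12.0
```

A production setup would use a k-d tree for the neighbor search, as on the Prediction Models slide, rather than sorting all distances.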
Feature Extraction
• How can we represent SPARQL queries as vectors?
SPARQL
• SPARQL algebra features
• Graph pattern features
SPARQL Algebra
http://www.w3.org/TR/sparql11-query/#sparqlQuery
SPARQL Algebra Features
Graph Pattern Features
• Clustering training queries
  – K-medoids clustering algorithm with the approximated graph edit distance as the distance function
    • Selects data points as cluster centers
    • Arbitrary distance function
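A compact k-medoids sketch, illustrating the two properties the slide names: cluster centers are actual data points, and any distance function can be plugged in. For brevity the "queries" here are plain numbers and the distance is absolute difference, standing in for the approximated graph edit distance on query graphs; the initialization is naive.

```python
def kmedoids(points, dist, k, iters=10):
    """Alternate assignment and medoid-update steps until stable."""
    medoids = list(points[:k])                     # naive initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: attach each point to its nearest medoid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: dist(p, medoids[i]))
            clusters[i].append(p)
        # update step: the new medoid of a cluster is the member that
        # minimizes the total distance to the rest of the cluster
        new_medoids = []
        for cluster in clusters:
            if not cluster:
                new_medoids.append(medoids[len(new_medoids)])
                continue
            best = min(cluster, key=lambda c: sum(dist(c, p) for p in cluster))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters

points = [1, 2, 3, 20, 21, 22]
medoids, clusters = kmedoids(points, lambda a, b: abs(a - b), k=2)
# the medoids settle on 2 and 21, one per obvious group
```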
Graph Pattern Features
• Graph edit distance
  – Minimum amount of distortion needed to transform one graph into another
  – Compute similarity by inverting the distance
Graph Pattern Features
• Graph edit distance
  – Usually computed using A* search
    • Exponential running time
  – Bipartite-matching-based approximated graph edit distance
    • Previous research shows accurate results on classification problems
Experiment Setup
• Triple store and hardware
  – Jena TDB 1.0.0
  – 16 GB memory
  – 2.53 GHz CPU
  – 48 GB system RAM
  – Linux 2.6.32 operating system
Experiment Setup
• Datasets
  – Training, validation, and test datasets generated from 25 DBPSB query templates
    • 1260 training queries
    • 420 validation queries
    • 420 test queries
  – RDF data: DBpedia 3.5.1 with 100% scaling factor from the DBPSB framework
Prediction Models
• Support vector machine (SVM) with nu-SVR for regression
• k-nearest neighbors (k-NN) regression
  – Euclidean distance as the distance function
  – k-dimensional tree (k-d tree) data structure to compute the nearest neighbors
Evaluation Measures
• Coefficient of determination
• Root mean squared error (RMSE)
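Both measures written out explicitly (a pure-Python sketch; the data points are made up):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 8.0]
# r_squared(actual, predicted) == 1 - 2/8 == 0.75
# rmse(actual, predicted) == sqrt((1 + 0 + 1) / 3)
```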
Predicting Query Execution Time
• SPARQL algebra features
Predicting Query Execution Time
• SPARQL algebra and graph pattern features
What’s Next
• Apply QPP in join order optimization
• Benchmarking
  – FedBench: a benchmarking framework for federated query processing
• Systematic generation of training queries
  – Bootstrapping
  – Refining training queries from query logs
Statistical Analysis of Query Logs
• An approach to the systematic generation of training queries
Mario Arias, Javier D. Fernández, Miguel A. Martínez-Prieto, and Pablo de la Fuente. An Empirical Study of Real-World SPARQL Queries. In the 1st International Workshop on Usage Analysis and the Web of Data, co-located with the 20th International World Wide Web Conference (WWW 2011)
Summary
• Distributed query processing
• Optimization techniques
• Query performance prediction
  – Join order optimization
DISTRIBUTED QUERY EXPLANATION
Query Explanation
• Query plan explanation
• Query result explanation
• Motivation
  – Understanding
  – Transparency
  – Trust
Query Explanation in the Semantic Web
Existing systems, compared on query plan explanation and query result explanation:
• Jena
• Sesame
• Virtuoso *
• BigOWLIM +

* Explanation of the SPARQL-to-SQL translation
+ A debugging feature on the query engine side
Distributed Query Explanation in the Semantic Web
Existing federation engines, compared on query plan explanation and query result explanation:
• FedX
• DARQ
• SemWIQ
• ADERIS
• Anapsid
ADERIS Query Plan Explanation
Related Work
• Databases: why, how, and where provenance
• Semantic Web
  – Inference explanation
    • Provenance
    • Generating justifications
Distributed Query Explanation: What We Provide
• Query plan explanation (work in progress)
  – Prior to query execution
    • Includes predicted performance metrics
  – After query execution, with performance metrics
• Query result explanation
  – Why provenance
  – Where provenance
Query Result Explanation
• Result Explainer plug-in: takes (RDF model, query, result)
• The model can be any RDF model (an RDF graph, a SPARQL endpoint)
• We generate result explanations by querying this model for why provenance
Why Provenance
• The triples in the virtual model from which the result is derived:

<http://example.org/book/book8>
    <http://purl.org/dc/elements/1.1/creator> <http://www-sop.inria.fr/members/Alice> ;
    <http://purl.org/dc/elements/1.1/title> "Distributed Query Processing for Linked Data" .
<http://www-sop.inria.fr/members/Charlie>
    <http://xmlns.com/foaf/0.1/name> "Charlie" .
<http://www-sop.inria.fr/members/Alice>
    <http://xmlns.com/foaf/0.1/knows> <http://www-sop.inria.fr/members/Charlie> ;
    <http://xmlns.com/foaf/0.1/name> "Alice" .
Where Provenance
• Keep the provenance of sources in the virtual model

fedqld:source1 {
    <http://example.org/book/book8>
        <http://purl.org/dc/elements/1.1/creator> <http://www-sop.inria.fr/members/Alice> ;
        <http://purl.org/dc/elements/1.1/title> "Distributed Query Processing for Linked Data" .
}
fedqld:source2 {
    <http://www-sop.inria.fr/members/Charlie>
        <http://xmlns.com/foaf/0.1/name> "Charlie" .
    <http://www-sop.inria.fr/members/Alice>
        <http://xmlns.com/foaf/0.1/knows> <http://www-sop.inria.fr/members/Charlie> ;
        <http://xmlns.com/foaf/0.1/name> "Alice" .
}
fedqld:prov {
    fedqld:source1 void:sparqlEndpoint <http://localhost:3030/books/query> .
    fedqld:source2 void:sparqlEndpoint <http://localhost:3031/person/query> .
}

The fedqld:prov graph records where the triples in each source graph come from.
What's Next
• Explanation user interfaces
• Evaluating the impact of our explanations
References
• Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, and Michael Schmidt. FedX: A Federation Layer for Distributed Query Processing on Linked Open Data. In ESWC, 2011
• Shinji Kikuchi, Shelly Sachdeva, Subhash Bhalla, Steven Lynden, Isao Kojima, Akiyoshi Matono, and Yusuke Tanimura. Adaptive Integration of Distributed Semantic Web Data. In Databases in Networked Information Systems
• Michael Schmidt, Olaf Görlitz, Peter Haase, Günter Ladwig, Andreas Schwarte, and Thanh Tran. FedBench: A Benchmark Suite for Federated Semantic Data Query Processing. In Proceedings of the 10th International Semantic Web Conference (ISWC'11)