Strategies for Processing and Explaining Distributed Queries on Linked Data
Rakebul Hasan
Wimmics, Inria Sophia Antipolis
Research Theme
• Distributed Query Processing
  – Optimization techniques for querying Linked Data
• Distributed Query Explanation
  – Query plan explanation
    • How query solving works
  – Query result explanation
    • Why
    • Where
DISTRIBUTED QUERY PROCESSING
Querying Linked Data
• Bottom-up strategies
  – Discover sources during query processing by following links between sources
• Top-down strategies
  – Sources are known
Querying Linked Data
FedBench Query CD5: Find the director and the genre of movies directed by Italians

SELECT ?film ?director ?genre WHERE {
  ?film dbpedia-owl:director ?director .
  ?director dbpedia-owl:nationality dbpedia:Italy .
  ?x owl:sameAs ?film .
  ?x linkedMDB:genre ?genre
}
Querying Linked Data
FedBench Query CD5: Find the director and the genre of movies directed by Italians

SELECT ?film ?director ?genre WHERE {
  SERVICE <http://dbpedia.org/sparql> {
    ?film dbpedia-owl:director ?director .
    ?director dbpedia-owl:nationality dbpedia:Italy .
    ?x owl:sameAs ?film .
  }
  SERVICE <http://data.linkedmdb.org/sparql> {
    ?x linkedMDB:genre ?genre
  }
}
SPARQL 1.1 SERVICE clause: requires knowing which part of the query should be solved by which endpoint
Querying Linked Data
• Top-down approaches
  – Data warehousing approach
  – Virtual integration approach
Querying Linked Data
• Data warehousing approach
  – Collect the data in a central triple store
  – Process queries on that triple store
• Disadvantages
  – Expensive preprocessing (data collection and integration) and maintenance
  – Data not up to date
Querying Linked Data
• Virtual integration approach
  – A query federation middleware
    • Parse the query and split it into subqueries
    • Source selection for subqueries
    • Evaluate the subqueries directly against the corresponding sources
    • Merge the results
  – Advantages
    • No preprocessing and maintenance
    • Up-to-date data
Running Example
Query LS6: Find KEGG drug names of all drugs in Drugbank belonging to the category Micronutrient

SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}

Sources:
• drugbank: http://www4.wiwiss.fu-berlin.de/drugbank/sparql
• KEGG: http://cu.bioportal.bio2rdf.org/sparql
• DBpedia: http://data.linkedmdb.org/sparql
Parsing and Source Selection

The query is decomposed into its triple patterns:
  ?drug drugbank:drugCategory drugbank-category:micronutrient
  ?drug drugbank:casRegistryNumber ?id
  ?keggDrug rdf:type kegg:Drug
  ?keggDrug bio2rdf:xRef ?id
  ?keggDrug purl:title ?title

ASK queries are sent to all the sources:
  ASK { ?drug drugbank:drugCategory drugbank-category:micronutrient }
  ASK { ?drug drugbank:casRegistryNumber ?id }
  ASK { ?keggDrug rdf:type kegg:Drug }
  ASK { ?keggDrug bio2rdf:xRef ?id }
  ASK { ?keggDrug purl:title ?title }
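This ASK-based source selection step can be sketched in Python. This is a minimal illustration, not the actual middleware: the `fake_ask` stub and its string-matching rules stand in for sending real ASK requests to the SPARQL endpoints over HTTP.

```python
# Sketch of ASK-based source selection. The injectable `ask` callable is a
# stand-in for an HTTP request to a SPARQL endpoint.

def ask_query(triple_pattern: str) -> str:
    """Wrap a single triple pattern in a SPARQL ASK query."""
    return "ASK { %s }" % triple_pattern

def select_sources(patterns, endpoints, ask):
    """For each triple pattern, keep the endpoints that answer ASK with true."""
    relevant = {}
    for p in patterns:
        q = ask_query(p)
        relevant[p] = [e for e in endpoints if ask(e, q)]
    return relevant

# Toy stand-in for sending the ASK query over the wire.
def fake_ask(endpoint, query):
    if "drugbank" in endpoint:
        return "drugbank:" in query
    return "kegg:" in query or "bio2rdf:" in query or "purl:title" in query

patterns = [
    "?drug drugbank:drugCategory drugbank-category:micronutrient",
    "?keggDrug rdf:type kegg:Drug",
]
endpoints = [
    "http://www4.wiwiss.fu-berlin.de/drugbank/sparql",
    "http://cu.bioportal.bio2rdf.org/sparql",
]
sources = select_sources(patterns, endpoints, fake_ask)
# each pattern maps to the single endpoint that can answer it
```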
Evaluating subqueries
• Two options
  – All triple patterns are individually evaluated
  – Nested loop join (NLJ): evaluate the patterns iteratively, pattern by pattern
Evaluating subqueries
• Example of NLJ

SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
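A minimal sketch of the nested loop join over SPARQL solution mappings (dicts from variable to value); the toy bindings are hypothetical stand-ins for results fetched from the Drugbank and KEGG endpoints.

```python
# Nested loop join over SPARQL-style solution mappings.

def compatible(mu1, mu2):
    """Two mappings are join-compatible if their shared variables agree."""
    return all(mu2[v] == val for v, val in mu1.items() if v in mu2)

def nested_loop_join(left, right):
    """Join two lists of solution mappings with a nested loop."""
    out = []
    for mu1 in left:
        for mu2 in right:
            if compatible(mu1, mu2):
                merged = dict(mu1)
                merged.update(mu2)
                out.append(merged)
    return out

# ?drug drugbank:casRegistryNumber ?id  (toy results from Drugbank)
left = [{"drug": "d1", "id": "cas-1"}, {"drug": "d2", "id": "cas-2"}]
# ?keggDrug bio2rdf:xRef ?id           (toy results from KEGG)
right = [{"keggDrug": "k1", "id": "cas-1"}]

joined = nested_loop_join(left, right)
# joined == [{"drug": "d1", "id": "cas-1", "keggDrug": "k1"}]
```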
Optimization techniques
• Source selection
  – Indexing: characterization of RDF graphs
  – Statistics-based catalogue
  – Caching
Optimization techniques
• Exclusive grouping
  – Triple patterns whose only relevant source is the same endpoint are grouped into a single subquery for that endpoint

SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
Optimization techniques
• Bound join
  – Effective for NLJ
  – Mappings of variable values from the intermediate result are passed to the next subquery
  – Uses the SPARQL 1.1 VALUES clause
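The bound join idea can be illustrated with a small helper that ships intermediate bindings to the next subquery in a VALUES block, replacing one remote request per binding with a single bulk request. The function and subquery text are illustrative, not the engine's actual API.

```python
# Sketch of a bound join: intermediate bindings are attached to the next
# subquery via a SPARQL 1.1 VALUES clause (here as plain string literals).

def bound_join_query(subquery_body, var, bindings):
    """Attach a VALUES block binding `var` to the given literal values."""
    values = " ".join('"%s"' % b for b in bindings)
    return "SELECT * WHERE { %s VALUES ?%s { %s } }" % (subquery_body, var, values)

q = bound_join_query("?keggDrug bio2rdf:xRef ?id .", "id", ["cas-1", "cas-2"])
# one request now carries both intermediate ?id values
```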
Optimization techniques
• Hash join
  – Hash table for intermediate mappings
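A hash join sketch over the same kind of solution mappings: build a hash table on the join variable over one input, then probe it with the other. The toy mappings are hypothetical.

```python
from collections import defaultdict

def hash_join(left, right, var):
    """Join two lists of solution mappings on a single shared variable."""
    table = defaultdict(list)
    for mu in left:
        table[mu[var]].append(mu)            # build phase
    out = []
    for mu2 in right:
        for mu1 in table.get(mu2[var], []):  # probe phase
            merged = dict(mu1)
            merged.update(mu2)
            out.append(merged)
    return out

left = [{"drug": "d1", "id": "cas-1"}, {"drug": "d2", "id": "cas-2"}]
right = [{"keggDrug": "k1", "id": "cas-2"}]
joined = hash_join(left, right, "id")
```

Compared with the nested loop join, the probe phase touches only the matching bucket instead of every left mapping.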
Optimization techniques
• FILTER optimization
  – Evaluate the subqueries with their corresponding FILTERs
  – Reduces the number of intermediate results
Optimization techniques
• Parallelization
  – Effective for the individual subquery evaluation approach
Optimization techniques
• Join order
  – Selectivity estimation: a join order heuristic based on the number of bound and free variables [1]
  – Statistics: based on the cardinalities of the triple patterns [2]

[1] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL Basic Graph Pattern Optimization Using Selectivity Estimation. In Proceedings of the 17th International Conference on World Wide Web (WWW '08)
[2] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K. Sattler, and J. Umbrich. Data Summaries for On-Demand Queries over Linked Data. In Proceedings of the 19th International Conference on World Wide Web (WWW '10)
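The variable-counting idea behind [1] can be sketched as follows. The simplification that a pattern's selectivity is just its number of unbound variables is ours; the full heuristic also weighs which positions (subject, predicate, object) are bound.

```python
# Join order heuristic sketch: triple patterns with fewer unbound variables
# are assumed more selective and are evaluated first.

def unbound_count(pattern):
    """Count the variables (terms starting with '?') in an (s, p, o) triple."""
    return sum(1 for term in pattern if term.startswith("?"))

def order_by_selectivity(patterns):
    """Evaluate patterns with fewer free variables first."""
    return sorted(patterns, key=unbound_count)

patterns = [
    ("?drug", "drugbank:casRegistryNumber", "?id"),                          # 2 vars
    ("?drug", "drugbank:drugCategory", "drugbank-category:micronutrient"),   # 1 var
    ("?keggDrug", "purl:title", "?title"),                                   # 2 vars
]
ordered = order_by_selectivity(patterns)
# the pattern with the constant object comes first
```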
Selectivity Estimation
• Ideal for the Linked Data scenario
  – No need for statistics about the underlying data
  – Estimations can, however, be wrong

BGP                        Number of results   Selectivity estimation
?v1 p1 ?v2 . ?v1 p2 ?v3    5                   2
?v4 p3 c1 . ?v4 p4 ?v1     10                  1
Query Performance Prediction
• Learn query performance from already executed queries
• Statistics about the underlying data not required
• Applications
  – Query optimization: join order
  – Workload management / scheduling
Regression
Learn a mapping function f(X) = y, where
• X = feature vector, a vector representation of the SPARQL query
• y = performance metric (e.g. latency, number of results)

Models: support vector machine with nu-SVR, and k-nearest neighbors regression
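A minimal pure-Python version of the second model, k-nearest neighbors regression: predict a query's metric as the mean metric of the k training queries whose feature vectors are closest in Euclidean distance. The feature vectors and latency values below are made up.

```python
import math

def knn_predict(train_X, train_y, x, k=3):
    """Average the targets of the k training points nearest to x."""
    dists = sorted((math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y))
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical query feature vectors and their measured latencies (ms).
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [10.0, 12.0, 14.0, 100.0]

pred = knn_predict(train_X, train_y, (0.2, 0.2), k=3)
# pred == (10.0 + 12.0 + 14.0) / 3 == 12.0
```

A production setup would use a k-d tree for the neighbor search, as on the Prediction Models slide, rather than sorting all distances.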
Feature Extraction
• How can we represent SPARQL queries as vectors?
SPARQL
• SPARQL algebra features
• Graph pattern features
SPARQL Algebra
http://www.w3.org/TR/sparql11-query/#sparqlQuery
SPARQL Algebra Features
Graph Pattern Features
• Clustering training queries
  – K-medoids clustering algorithm with the approximated graph edit distance as the distance function
    • Selects data points as cluster centers
    • Arbitrary distance function
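A compact k-medoids sketch, illustrating the two properties the slide names: cluster centers are actual data points, and any distance function can be plugged in. For brevity the "queries" here are plain numbers and the distance is absolute difference, standing in for the approximated graph edit distance on query graphs; the initialization is naive.

```python
def kmedoids(points, dist, k, iters=10):
    """Alternate assignment and medoid-update steps until stable."""
    medoids = list(points[:k])                     # naive initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: attach each point to its nearest medoid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: dist(p, medoids[i]))
            clusters[i].append(p)
        # update step: the new medoid of a cluster is the member that
        # minimizes the total distance to the rest of the cluster
        new_medoids = []
        for cluster in clusters:
            if not cluster:
                new_medoids.append(medoids[len(new_medoids)])
                continue
            best = min(cluster, key=lambda c: sum(dist(c, p) for p in cluster))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters

points = [1, 2, 3, 20, 21, 22]
medoids, clusters = kmedoids(points, lambda a, b: abs(a - b), k=2)
# the medoids settle on 2 and 21, one per obvious group
```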
Graph Pattern Features
• Graph edit distance
  – Minimum amount of distortion needed to transform one graph into another
  – Compute similarity by inverting the distance
Graph Pattern Features
• Graph edit distance
  – Usually computed using A* search
    • Exponential running time
  – Bipartite-matching-based approximated graph edit distance
    • Previous research shows accurate results on classification problems
Experiment Setup
• Triple store and hardware
  – Jena TDB 1.0.0
  – 16 GB memory
  – 2.53 GHz CPU
  – 48 GB system RAM
  – Linux 2.6.32 operating system
Experiment Setup
• Datasets
  – Training, validation, and test datasets generated from 25 DBPSB query templates
    • 1260 training queries
    • 420 validation queries
    • 420 test queries
  – RDF data: DBpedia 3.5.1 with 100% scaling factor from the DBPSB framework
Prediction Models
• Support vector machine (SVM) with nu-SVR for regression
• k-nearest neighbors (k-NN) regression
  – Euclidean distance as the distance function
  – k-dimensional tree (k-d tree) data structure to compute the nearest neighbors
Evaluation Measures
• Coefficient of determination
• Root mean squared error (RMSE)
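Both measures written out explicitly (a pure-Python sketch; the data points are made up):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 8.0]
# r_squared(actual, predicted) == 1 - 2/8 == 0.75
# rmse(actual, predicted) == sqrt((1 + 0 + 1) / 3)
```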
Predicting Query Execution Time
• SPARQL algebra features
Predicting Query Execution Time
• SPARQL algebra and graph pattern features
What’s Next
• Apply QPP in join order optimization
• Benchmarking
  – FedBench: a benchmarking framework for federated query processing
• Systematic generation of training queries
  – Bootstrapping
  – Refining training queries from query logs
Statistical Analysis of Query Logs
• An approach to the systematic generation of training queries
Mario Arias, Javier D. Fernández, Miguel A. Martínez-Prieto, and Pablo de la Fuente. An Empirical Study of Real-World SPARQL Queries. In the 1st International Workshop on Usage Analysis and the Web of Data, co-located with the 20th International World Wide Web Conference (WWW 2011)
Summary
• Distributed query processing
• Optimization techniques
• Query performance prediction
  – Join order optimization
DISTRIBUTED QUERY EXPLANATION
Query Explanation
• Query plan explanation
• Query result explanation
• Motivation
  – Understanding
  – Transparency
  – Trust
Query Explanation in the Semantic Web
Existing systems, compared on query plan explanation and query result explanation:
• Jena
• Sesame
• Virtuoso *
• BigOWLIM +

* Explanation of the SPARQL-to-SQL translation
+ A debugging feature on the query engine side
Distributed Query Explanation in the Semantic Web
Existing federation engines, compared on query plan explanation and query result explanation:
• FedX
• DARQ
• SemWIQ
• ADERIS
• Anapsid
ADERIS Query Plan Explanation
Related Work
• Databases: why, how, and where provenance
• Semantic Web
  – Inference explanation
    • Provenance
    • Generating justifications
Distributed Query Explanation: What We Provide
• Query plan explanation (work in progress)
  – Prior to query execution
    • Includes predicted performance metrics
  – After query execution, with performance metrics
• Query result explanation
  – Why provenance
  – Where provenance
Query Result Explanation
• Result Explainer plug-in: takes (RDF model, query, result)
• The model can be any RDF model (an RDF graph, a SPARQL endpoint)
• We generate result explanations by querying this model for why provenance
Why Provenance
• The triples in the virtual model from which the result is derived:

<http://example.org/book/book8>
    <http://purl.org/dc/elements/1.1/creator> <http://www-sop.inria.fr/members/Alice> ;
    <http://purl.org/dc/elements/1.1/title> "Distributed Query Processing for Linked Data" .
<http://www-sop.inria.fr/members/Charlie>
    <http://xmlns.com/foaf/0.1/name> "Charlie" .
<http://www-sop.inria.fr/members/Alice>
    <http://xmlns.com/foaf/0.1/knows> <http://www-sop.inria.fr/members/Charlie> ;
    <http://xmlns.com/foaf/0.1/name> "Alice" .
Where Provenance
• Keep the provenance of sources in the virtual model

fedqld:source1 {
    <http://example.org/book/book8>
        <http://purl.org/dc/elements/1.1/creator> <http://www-sop.inria.fr/members/Alice> ;
        <http://purl.org/dc/elements/1.1/title> "Distributed Query Processing for Linked Data" .
}
fedqld:source2 {
    <http://www-sop.inria.fr/members/Charlie>
        <http://xmlns.com/foaf/0.1/name> "Charlie" .
    <http://www-sop.inria.fr/members/Alice>
        <http://xmlns.com/foaf/0.1/knows> <http://www-sop.inria.fr/members/Charlie> ;
        <http://xmlns.com/foaf/0.1/name> "Alice" .
}
fedqld:prov {
    fedqld:source1 void:sparqlEndpoint <http://localhost:3030/books/query> .
    fedqld:source2 void:sparqlEndpoint <http://localhost:3031/person/query> .
}

The fedqld:prov graph records where the triples in each source graph come from.
What's Next
• Explanation user interfaces
• Evaluating the impact of our explanations
References
• Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, and Michael Schmidt. FedX: A Federation Layer for Distributed Query Processing on Linked Open Data. In ESWC, 2011
• Shinji Kikuchi, Shelly Sachdeva, Subhash Bhalla, Steven Lynden, Isao Kojima, Akiyoshi Matono, and Yusuke Tanimura. Adaptive Integration of Distributed Semantic Web Data. In Databases in Networked Information Systems
• Michael Schmidt, Olaf Görlitz, Peter Haase, Günter Ladwig, Andreas Schwarte, and Thanh Tran. FedBench: A Benchmark Suite for Federated Semantic Data Query Processing. In Proceedings of the 10th International Semantic Web Conference (ISWC'11)