Strategies for Processing and Explaining Distributed Queries on Linked Data

Rakebul Hasan
Wimmics, Inria Sophia Antipolis


Post on 11-May-2015


Research Theme

• Distributed Query Processing
  – Optimization techniques for querying Linked Data
• Distributed Query Explanation
  – Query plan explanation
    • How query solving works
  – Query result explanation
    • Why
    • Where

DISTRIBUTED QUERY PROCESSING

Querying Linked Data

• Bottom-up strategies
  – Discover sources during query processing by following links between sources
• Top-down strategies
  – Sources are known

Querying Linked Data

FedBench Query CD5: Find the director and the genre of movies directed by Italians

SELECT ?film ?director ?genre WHERE {
  ?film dbpedia-owl:director ?director .
  ?director dbpedia-owl:nationality dbpedia:Italy .
  ?x owl:sameAs ?film .
  ?x linkedMDB:genre ?genre .
}

Querying Linked Data

FedBench Query CD5: Find the director and the genre of movies directed by Italians

SELECT ?film ?director ?genre WHERE {
  SERVICE <http://dbpedia.org/sparql> {
    ?film dbpedia-owl:director ?director .
    ?director dbpedia-owl:nationality dbpedia:Italy .
    ?x owl:sameAs ?film .
  }
  SERVICE <http://data.linkedmdb.org/sparql> {
    ?x linkedMDB:genre ?genre .
  }
}

Uses the SPARQL 1.1 SERVICE clause

Requires knowledge of which part of the query should be solved by which endpoint

• Top-down approaches
  – Data warehousing approach
  – Virtual integration approach

Querying Linked Data

• Data warehousing approach
  – Collect the data in a central triple store
  – Process queries on that triple store
  – Disadvantages
    • Expensive preprocessing (data collection + integration) and maintenance
    • Data not up to date

Querying Linked Data

• Virtual integration approach
  – A query federation middleware
    • Parse and split into subqueries
    • Source selection for subqueries
    • Evaluate the subqueries against corresponding sources directly
    • Merge the results
  – Advantages
    • No preprocessing and maintenance
    • Up-to-date data

Running Example

Query LS6: Find KEGG drug names of all drugs in Drugbank belonging to category Micronutrient

SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}

Sources:
• drugbank: http://www4.wiwiss.fu-berlin.de/drugbank/sparql
• KEGG: http://cu.bioportal.bio2rdf.org/sparql
• DBpedia: http://data.linkedmdb.org/sparql

Parsing and Source Selection

Query LS6 is parsed and split into its triple patterns:

?drug drugbank:drugCategory drugbank-category:micronutrient
?drug drugbank:casRegistryNumber ?id
?keggDrug rdf:type kegg:Drug
?keggDrug bio2rdf:xRef ?id
?keggDrug purl:title ?title

Send ASK queries to all the sources:

ASK { ?drug drugbank:drugCategory drugbank-category:micronutrient }
ASK { ?drug drugbank:casRegistryNumber ?id }
ASK { ?keggDrug rdf:type kegg:Drug }
ASK { ?keggDrug bio2rdf:xRef ?id }
ASK { ?keggDrug purl:title ?title }
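The ASK-based source selection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three sources are in-memory sets of triples standing in for real SPARQL endpoints, all triples and identifiers such as `db:drug1` or `cas-001` are made up, and `ask` and `select_sources` are hypothetical helper names, not part of any federation engine.

```python
# Toy stand-ins for SPARQL endpoints: each source is a set of
# (subject, predicate, object) triples. All data here is invented.
SOURCES = {
    "drugbank": {
        ("db:drug1", "drugbank:drugCategory", "drugbank-category:micronutrient"),
        ("db:drug1", "drugbank:casRegistryNumber", "cas-001"),
    },
    "kegg": {
        ("kegg:drugA", "rdf:type", "kegg:Drug"),
        ("kegg:drugA", "bio2rdf:xRef", "cas-001"),
        ("kegg:drugA", "purl:title", "Title A"),
    },
    "dbpedia": {
        ("dbpedia:Rome", "rdf:type", "dbpedia-owl:City"),
    },
}

def is_var(term):
    """SPARQL variables start with '?'."""
    return term.startswith("?")

def ask(triples, pattern):
    """Emulate ASK { pattern }: True if at least one triple matches.

    Simplification: a variable repeated inside one pattern is not checked
    for consistency, which is enough for the patterns used here.
    """
    return any(
        all(is_var(p) or p == t for p, t in zip(pattern, triple))
        for triple in triples
    )

def select_sources(patterns, sources):
    """Map each triple pattern to the names of the sources that answer true."""
    return {
        pattern: sorted(name for name, triples in sources.items()
                        if ask(triples, pattern))
        for pattern in patterns
    }

LS6_PATTERNS = [
    ("?drug", "drugbank:drugCategory", "drugbank-category:micronutrient"),
    ("?drug", "drugbank:casRegistryNumber", "?id"),
    ("?keggDrug", "rdf:type", "kegg:Drug"),
    ("?keggDrug", "bio2rdf:xRef", "?id"),
    ("?keggDrug", "purl:title", "?title"),
]

relevant = select_sources(LS6_PATTERNS, SOURCES)
for pattern, names in relevant.items():
    print(" ".join(pattern), "->", names)
```

With this toy data the first two patterns are answered only by the drugbank stand-in and the last three only by the KEGG stand-in; a real federation engine would issue one ASK request per pattern and endpoint over HTTP instead.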

Evaluating subqueries

• Two options
  – All triple patterns are individually evaluated
  – Nested loop join (NLJ): evaluate iteratively pattern by pattern

Evaluating subqueries

• Example of NLJ

SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
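A nested loop join over this query can be sketched as follows. The sketch assumes each triple pattern has already been assigned to one source, represents sources as in-memory lists of made-up triples, and uses hypothetical helper names (`match`, `nested_loop_join`); a real engine would send each bound pattern to the remote endpoint instead of scanning a list.

```python
# Invented toy data standing in for the two endpoints.
DRUGBANK = [
    ("db:drug1", "drugbank:drugCategory", "drugbank-category:micronutrient"),
    ("db:drug1", "drugbank:casRegistryNumber", "cas-001"),
    ("db:drug2", "drugbank:drugCategory", "drugbank-category:antibiotic"),
    ("db:drug2", "drugbank:casRegistryNumber", "cas-002"),
]
KEGG = [
    ("kegg:drugA", "rdf:type", "kegg:Drug"),
    ("kegg:drugA", "bio2rdf:xRef", "cas-001"),
    ("kegg:drugA", "purl:title", "Title A"),
    ("kegg:drugB", "rdf:type", "kegg:Drug"),
    ("kegg:drugB", "bio2rdf:xRef", "cas-002"),
    ("kegg:drugB", "purl:title", "Title B"),
]

def is_var(term):
    return term.startswith("?")

def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # Bind the variable, or reject if it is already bound differently.
            if extended.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return extended

def nested_loop_join(plan):
    """Evaluate (pattern, source) pairs one by one, in the given order.

    Every binding produced so far is pushed into the next pattern, so each
    step only keeps combinations that are consistent with earlier steps.
    """
    bindings = [{}]
    for pattern, source in plan:
        bindings = [
            extended
            for binding in bindings
            for triple in source
            if (extended := match(pattern, triple, binding)) is not None
        ]
    return bindings

PLAN = [
    (("?drug", "drugbank:drugCategory", "drugbank-category:micronutrient"), DRUGBANK),
    (("?drug", "drugbank:casRegistryNumber", "?id"), DRUGBANK),
    (("?keggDrug", "rdf:type", "kegg:Drug"), KEGG),
    (("?keggDrug", "bio2rdf:xRef", "?id"), KEGG),
    (("?keggDrug", "purl:title", "?title"), KEGG),
]

results = nested_loop_join(PLAN)
print([(r["?drug"], r["?title"]) for r in results])
```

The order of `PLAN` matters: the third pattern has no variable in common with the first two, so it temporarily multiplies the intermediate bindings before the fourth pattern prunes them again.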

Optimization techniques

• Source Selection
  – Indexing: characterization of RDF graphs
  – Statistics-based catalogue
  – Caching

Optimization techniques

• Exclusive grouping

SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
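Exclusive grouping, as used in FedX, puts triple patterns whose only relevant source is the same endpoint into one group that can be sent as a single subquery. The sketch below illustrates the idea; the source selection result in `RELEVANT` is an assumption made for illustration (in particular, `purl:title` is assumed to match two sources), and `exclusive_groups` is a hypothetical helper name.

```python
from collections import defaultdict

# Assumed outcome of source selection: pattern -> relevant sources.
RELEVANT = {
    "?drug drugbank:drugCategory drugbank-category:micronutrient": ["drugbank"],
    "?drug drugbank:casRegistryNumber ?id": ["drugbank"],
    "?keggDrug rdf:type kegg:Drug": ["kegg"],
    "?keggDrug bio2rdf:xRef ?id": ["kegg"],
    "?keggDrug purl:title ?title": ["dbpedia", "kegg"],
}

def exclusive_groups(relevant_sources):
    """Group patterns that have exactly one relevant source by that source.

    Returns (groups, remaining): `groups` maps a source to the patterns only
    it can answer; `remaining` lists patterns with several candidate sources,
    which still have to be evaluated individually.
    """
    groups = defaultdict(list)
    remaining = []
    for pattern, sources in relevant_sources.items():
        if len(sources) == 1:
            groups[sources[0]].append(pattern)
        else:
            remaining.append(pattern)
    return dict(groups), remaining

groups, remaining = exclusive_groups(RELEVANT)
for source, patterns in groups.items():
    # Each exclusive group becomes one subquery for its endpoint.
    print(source, ": SELECT * WHERE {", " . ".join(patterns), "}")
print("evaluated individually:", remaining)
```

Sending a group as one subquery lets the endpoint compute the join locally, which reduces both the number of requests and the size of the intermediate results shipped to the federation layer.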

Optimization techniques

• Bound join
  – Effective for NLJ
  – Mappings of variable values from the intermediate result to the next subquery
  – SPARQL 1.1 VALUES clause
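To make the bound join concrete, the following sketch builds subqueries that ship a block of intermediate values to the next endpoint with a SPARQL 1.1 VALUES clause. The block size, the made-up `cas-...` values, and the helper names `escape_literal` and `bound_join_queries` are assumptions for illustration; prefix declarations are omitted and all values are treated as plain string literals.

```python
def escape_literal(value):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')

def bound_join_queries(pattern, var, values, block_size=2):
    """Yield one subquery per block of intermediate values bound to `var`.

    Instead of one request per intermediate binding (plain NLJ), a whole
    block of bindings travels in a single VALUES clause.
    """
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        literals = " ".join('"%s"' % escape_literal(v) for v in block)
        yield "SELECT * WHERE { VALUES %s { %s } %s }" % (var, literals, pattern)

# Hypothetical ?id values produced by the drugbank part of the query.
intermediate_ids = ["cas-001", "cas-002", "cas-003"]
queries = list(bound_join_queries("?keggDrug bio2rdf:xRef ?id .", "?id",
                                  intermediate_ids))
for query in queries:
    print(query)
```

Three intermediate bindings with a block size of two yield two requests instead of three; larger blocks trade fewer round trips against longer query strings.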

Optimization techniques

• Hash join
  – Hash table for intermediate mappings
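A hash join over intermediate mappings can be sketched as below. It assumes both inputs are lists of variable bindings (dictionaries) where all rows on one side bind the same variables; the data is made up and `hash_join` is a hypothetical helper name.

```python
from collections import defaultdict

def hash_join(left, right):
    """Join two lists of variable bindings on the variables they share."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    # Build phase: index the left-hand mappings by their join key.
    table = defaultdict(list)
    for row in left:
        table[tuple(row[v] for v in shared)].append(row)
    # Probe phase: look up each right-hand mapping in the hash table.
    return [
        {**l_row, **r_row}
        for r_row in right
        for l_row in table.get(tuple(r_row[v] for v in shared), [])
    ]

# Invented intermediate results of two independently evaluated subqueries.
drugbank_rows = [
    {"?drug": "db:drug1", "?id": "cas-001"},
    {"?drug": "db:drug3", "?id": "cas-003"},
]
kegg_rows = [
    {"?keggDrug": "kegg:drugA", "?id": "cas-001"},
    {"?keggDrug": "kegg:drugB", "?id": "cas-002"},
]

joined = hash_join(drugbank_rows, kegg_rows)
print(joined)
```

Unlike the nested loop join, both subqueries are evaluated independently and each input is scanned once, at the cost of keeping one side in memory.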

Optimization techniques

• FILTER optimization
  – Evaluate the subqueries together with their corresponding FILTERs
  – Reduces the number of intermediate results

Optimization techniques

• Parallelization
  – Effective for the individual subquery evaluation approach
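When subqueries are evaluated individually, they do not depend on each other and can be sent concurrently. A minimal sketch with Python's standard `concurrent.futures` module follows; the `evaluate` function is a stand-in that filters made-up in-memory triples by predicate where a real system would perform an HTTP request to an endpoint.

```python
from concurrent.futures import ThreadPoolExecutor

# Invented toy data standing in for remote endpoints.
TOY_SOURCES = {
    "drugbank": [("db:drug1", "drugbank:casRegistryNumber", "cas-001")],
    "kegg": [
        ("kegg:drugA", "bio2rdf:xRef", "cas-001"),
        ("kegg:drugA", "purl:title", "Title A"),
    ],
}

def evaluate(subquery):
    """Stand-in for a remote call: return the triples with the given predicate."""
    source, predicate = subquery
    return [t for t in TOY_SOURCES[source] if t[1] == predicate]

def evaluate_in_parallel(subqueries, max_workers=4):
    """Evaluate independent subqueries concurrently.

    Threads fit here because real subquery evaluation is dominated by
    waiting on the network. `map` returns results in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate, subqueries))

subqueries = [
    ("drugbank", "drugbank:casRegistryNumber"),
    ("kegg", "bio2rdf:xRef"),
    ("kegg", "purl:title"),
]
results = evaluate_in_parallel(subqueries)
for subquery, rows in zip(subqueries, results):
    print(subquery, "->", rows)
```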

Optimization techniques

• Join order
  – Selectivity estimation: join order heuristic based on the number of bound and free variables [1]
  – Statistics: based on cardinalities of the triple patterns [2]

[1] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL basic graph pattern optimization using selectivity estimation. In Proceedings of the 17th International Conference on World Wide Web (WWW '08).
[2] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K. Sattler, and J. Umbrich. Data summaries for on-demand queries over linked data. In Proceedings of the 19th International Conference on World Wide Web (WWW '10).

Selectivity Estimation

• Ideal for the Linked Data scenario
  – No need for statistics about the underlying data
  – However, estimations can be wrong

BGP                           Number of results   Selectivity estimation
?v1 p1 ?v2 . ?v1 p2 ?v3       5                   2
?v4 p3 c1 . ?v4 p4 ?v1        10                  1
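The mismatch in the table can be reproduced with a very small variable-counting heuristic. This sketch is a simplification, not the full heuristic of Stocker et al.: it assumes that a BGP with fewer free variables is estimated to be more selective, and it reads the "Selectivity estimation" column as a rank (1 = estimated most selective), which is an interpretation of the slide.

```python
def free_variables(pattern):
    """Count the variable positions ('?name') in one triple pattern."""
    return sum(term.startswith("?") for term in pattern)

def estimated_cost(bgp):
    """Fewer free variables means the BGP is assumed to be more selective."""
    return sum(free_variables(pattern) for pattern in bgp)

BGPS = {
    "A": [("?v1", "p1", "?v2"), ("?v1", "p2", "?v3")],
    "B": [("?v4", "p3", "c1"), ("?v4", "p4", "?v1")],
}
ACTUAL_RESULTS = {"A": 5, "B": 10}  # numbers of results from the table

# Rank BGPs from estimated most selective to estimated least selective.
ranking = sorted(BGPS, key=lambda name: estimated_cost(BGPS[name]))
for rank, name in enumerate(ranking, start=1):
    print("rank", rank, "BGP", name,
          "free variables:", estimated_cost(BGPS[name]),
          "actual results:", ACTUAL_RESULTS[name])
```

BGP B contains a bound object (`c1`) and is therefore ranked first, although it actually returns twice as many results as BGP A. This is the kind of wrong estimation the slide points out.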

Query Performance Prediction

• Learn query performance from already executed queries
• Statistics about the underlying data not required
• Applications
  – Query optimization: join order
  – Workload management/scheduling

Regression

Learn a mapping function

f(X) = y

X = feature vector, vector representation of a SPARQL query
y = performance metric (e.g. latency, number of results)

• Support vector machine with nu-SVR
• k-nearest neighbors regression
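As a rough illustration of learning f(X) = y, the sketch below implements k-nearest neighbors regression in plain Python. The two-dimensional feature vectors and the execution times are invented, `knn_predict` is a hypothetical helper, and a linear scan replaces the k-d tree used in the experiments.

```python
import math

def knn_predict(train_X, train_y, x, k=2):
    """Predict y for x as the mean y of the k nearest training vectors.

    Nearness is measured with the Euclidean distance.
    """
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical training set: feature vectors of already executed queries
# and their observed execution times in milliseconds.
train_X = [(1, 0), (2, 1), (5, 4), (6, 5)]
train_y = [10.0, 20.0, 200.0, 300.0]

predicted = knn_predict(train_X, train_y, (2, 0), k=2)
print(predicted)
```

The query vector `(2, 0)` is closest to the first two training vectors, so the prediction is the mean of their execution times.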

Feature Extraction

• How can we represent SPARQL queries as vectors?

SPARQL

• SPARQL algebra features
• Graph pattern features

SPARQL Algebra

http://www.w3.org/TR/sparql11-query/#sparqlQuery

SPARQL Algebra Features

Graph Pattern Features

Graph Pattern Features

• Clustering training queries
  – k-medoids clustering algorithm with approximated edit distance as distance function
    • Selects data points as cluster centers
    • Arbitrary distance function
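The two properties listed for k-medoids (cluster centers are data points, and any distance function works) are visible in a small sketch. This is a simple alternating variant with a deterministic initialization, not necessarily the exact algorithm used in the experiments; it assumes distinct, hashable data points, and the absolute difference between numbers stands in for the approximated graph edit distance.

```python
def k_medoids(points, k, distance, max_iter=100):
    """Cluster `points` around k medoids using an arbitrary distance function.

    Alternates between assigning every point to its nearest medoid and
    moving each medoid to the cluster member with the smallest total
    distance to the other members. Medoids are always actual data points.
    """
    medoids = list(points[:k])  # deterministic initialization for the sketch
    clusters = {}
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for p in points:
            nearest = min(medoids, key=lambda m: distance(p, m))
            clusters[nearest].append(p)
        new_medoids = [
            min(members, key=lambda c: sum(distance(c, q) for q in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break  # converged: no medoid moved
        medoids = new_medoids
    return medoids, clusters

# Toy data: numbers instead of query graphs, |a - b| instead of edit distance.
points = [1, 2, 3, 10, 11, 12]
medoids, clusters = k_medoids(points, 2, lambda a, b: abs(a - b))
print(medoids, clusters)
```

Because only pairwise distances are needed, the same function works unchanged when `points` are query graphs and `distance` is an approximated graph edit distance.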

Graph Pattern Features

• Graph Edit Distance
  – Minimum amount of distortion needed to transform one graph to another
  – Compute similarity by inverting distance
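One common way to invert a distance into a similarity is 1 / (1 + d), which maps distance 0 to similarity 1 and large distances to values near 0. The slide does not state the exact formula, so this choice is an assumption. The sketch also assumes that a query's graph pattern features are its similarities to the cluster centers, and it uses a crude stand-in distance (size of the symmetric difference of the triple pattern sets) that is not a graph edit distance.

```python
def similarity(distance):
    """Turn a distance in [0, inf) into a similarity in (0, 1]."""
    return 1.0 / (1.0 + distance)

def toy_distance(query_a, query_b):
    """Stand-in for graph edit distance: number of patterns not shared."""
    return len(set(query_a) ^ set(query_b))

def graph_pattern_features(query, centers, distance):
    """One feature per cluster center: similarity of the query to it."""
    return [similarity(distance(query, center)) for center in centers]

# Hypothetical cluster centers, each a list of triple patterns.
centers = [
    [("?s", "p1", "?o"), ("?s", "p2", "?x")],
    [("?a", "p3", "c1")],
]
query = [("?s", "p1", "?o"), ("?s", "p2", "?x")]

features = graph_pattern_features(query, centers, toy_distance)
print(features)
```

The query is identical to the first center (distance 0) and shares nothing with the second (distance 3), so the two features are 1.0 and 0.25.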

Graph Pattern Features

• Graph Edit Distance
  – Usually computed using A* search
    • Exponential running time
  – Bipartite matching based approximated graph edit distance
    • Previous research shows accurate results with classification problems

Experiment Setup

• Triple store and hardware
  – Jena TDB 1.0.0
  – 16 GB memory
  – 2.53 GHz CPU
  – 48 GB system RAM
  – Linux 2.6.32 operating system

Experiment Setup

• Datasets
  – Training, validation, test datasets generated from 25 DBPSB query templates
    • 1260 training queries
    • 420 validation queries
    • 420 test queries
  – RDF data: DBpedia 3.5.1 with 100% scaling factor from DBPSB framework

Prediction Models

• Support Vector Machine (SVM) with nu-SVR for regression
• k-nearest neighbors (k-NN) regression
  – Euclidean distance as the distance function
  – k-dimensional tree (k-d tree) data structure to compute the nearest neighbors

Evaluation Measures

• Coefficient of determination

• Root mean squared error (RMSE)
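Both measures follow their standard definitions and can be computed as below. The actual and predicted values are invented for illustration and the function names are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and predicted execution times.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]

print("RMSE:", rmse(actual, predicted))
print("R^2:", r_squared(actual, predicted))
```

RMSE is expressed in the unit of the predicted metric (lower is better), while the coefficient of determination is unit-free and equals 1 for perfect predictions.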

Predicting Query Execution Time

• SPARQL algebra features

Predicting Query Execution Time

• SPARQL algebra and graph pattern features

What’s Next

• Apply QPP in join order optimization
• Benchmarking
  – FedBench: a benchmarking framework for federated query processing
• Systematic generation of training queries
  – Bootstrapping
  – Refining training queries from query logs

Statistical Analysis of Query Logs

• Approach to systematic generation of training queries

Mario Arias, Javier D. Fernández, Miguel A. Martínez-Prieto, Pablo de la Fuente: An Empirical Study of Real-World SPARQL Queries, 1st International Workshop on Usage Analysis and the Web of Data, co-located with the 20th International World Wide Web Conference (WWW2011)

Summary

• Distributed query processing
• Optimization techniques
• Query performance prediction
  – Join order optimization

DISTRIBUTED QUERY EXPLANATION

Query Explanation

• Query plan explanation
• Query result explanation
• Motivation
  – Understanding
  – Transparency
  – Trust

Query Explanation in the Semantic Web

[Table. Columns: Query plan explanation, Query result explanation. Rows: Jena, Sesame, Virtuoso *, BigOWLIM +]

* Explanation of SPARQL to SQL translation
+ A debugging feature on the query engine side

Distributed Query Explanation in the Semantic Web

[Table. Columns: Query plan explanation, Query result explanation. Rows: FedX, DARQ, SemWIQ, ADERIS, Anapsid]

ADERIS Query Plan Explanation

Related Work

• Database: Why, how, where provenance
• Semantic Web:
  – Inference explanation
    • Provenance
    • Generating justifications

Distributed Query Explanation: What We Provide

• Query plan explanation
  – Work in progress
  – Prior to query execution
    • Includes predicted performance metrics
  – After query execution with performance metrics
• Query result explanation
  – Why provenance
  – Where provenance

Query Result Explanation

Result Explainer Plug-in

(RDF model, Query, Result)

Can be any RDF model (RDF graph, SPARQL endpoint)

We generate result explanations by querying this model for why provenance

Why Provenance

• Triples in the virtual model from which the result is derived

<http://example.org/book/book8>
    <http://purl.org/dc/elements/1.1/creator> <http://www-sop.inria.fr/members/Alice> ;
    <http://purl.org/dc/elements/1.1/title> "Distributed Query Processing for Linked Data" .
<http://www-sop.inria.fr/members/Charlie>
    <http://xmlns.com/foaf/0.1/name> "Charlie" .
<http://www-sop.inria.fr/members/Alice>
    <http://xmlns.com/foaf/0.1/knows> <http://www-sop.inria.fr/members/Charlie> ;
    <http://xmlns.com/foaf/0.1/name> "Alice" .

Where Provenance

• Keep the provenance of sources in the virtual model

fedqld:source1 {
    <http://example.org/book/book8>
        <http://purl.org/dc/elements/1.1/creator> <http://www-sop.inria.fr/members/Alice> ;
        <http://purl.org/dc/elements/1.1/title> "Distributed Query Processing for Linked Data" .
}

fedqld:source2 {
    <http://www-sop.inria.fr/members/Charlie>
        <http://xmlns.com/foaf/0.1/name> "Charlie" .
    <http://www-sop.inria.fr/members/Alice>
        <http://xmlns.com/foaf/0.1/knows> <http://www-sop.inria.fr/members/Charlie> ;
        <http://xmlns.com/foaf/0.1/name> "Alice" .
}

fedqld:prov {
    fedqld:source1 void:sparqlEndpoint <http://localhost:3030/books/query> .
    fedqld:source2 void:sparqlEndpoint <http://localhost:3031/person/query> .
}

The fedqld:prov graph records where the triples in each source graph come from.
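The lookup from a why-provenance triple to its originating endpoint can be sketched with plain dictionaries. The triples and endpoint URLs are taken from the example; representing the virtual model as Python dictionaries and the helper name `where_provenance` are assumptions for illustration, since a real implementation would query the named graphs with SPARQL.

```python
BOOK8 = "http://example.org/book/book8"
ALICE = "http://www-sop.inria.fr/members/Alice"
CHARLIE = "http://www-sop.inria.fr/members/Charlie"
DC = "http://purl.org/dc/elements/1.1/"
FOAF = "http://xmlns.com/foaf/0.1/"

# Virtual model: one named graph per source.
VIRTUAL_MODEL = {
    "fedqld:source1": {
        (BOOK8, DC + "creator", ALICE),
        (BOOK8, DC + "title", "Distributed Query Processing for Linked Data"),
    },
    "fedqld:source2": {
        (CHARLIE, FOAF + "name", "Charlie"),
        (ALICE, FOAF + "knows", CHARLIE),
        (ALICE, FOAF + "name", "Alice"),
    },
}

# Content of the fedqld:prov graph: named graph -> void:sparqlEndpoint.
PROV = {
    "fedqld:source1": "http://localhost:3030/books/query",
    "fedqld:source2": "http://localhost:3031/person/query",
}

def where_provenance(why_triples):
    """Map each why-provenance triple to the endpoints it was fetched from."""
    return {
        triple: [PROV[graph] for graph, triples in VIRTUAL_MODEL.items()
                 if triple in triples]
        for triple in why_triples
    }

why = [(BOOK8, DC + "creator", ALICE), (ALICE, FOAF + "name", "Alice")]
for triple, endpoints in where_provenance(why).items():
    print(triple, "<-", endpoints)
```

Keeping each source in its own named graph means the where provenance comes for free once the why-provenance triples are known.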

What’s next

• Explanation user interfaces
• Evaluating the impacts of our explanations

References

• Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, and Michael Schmidt. FedX: A federation layer for distributed query processing on linked open data. In ESWC, 2011.
• Shinji Kikuchi, Shelly Sachdeva, Subhash Bhalla, Steven Lynden, Isao Kojima, Akiyoshi Matono, and Yusuke Tanimura. Adaptive integration of distributed semantic web data. Databases in Networked Information Systems.
• Michael Schmidt, Olaf Görlitz, Peter Haase, Günter Ladwig, Andreas Schwarte, and Thanh Tran. 2011. FedBench: a benchmark suite for federated semantic data query processing. In Proceedings of the 10th International Conference on The Semantic Web - Volume Part I (ISWC '11).