scalable multi-query optimization for sparql - cs.utah.edulifeifei/papers/mqoslides.pdf · scalable...

Scalable Multi-Query Optimization for SPARQL

Wangchao Le1 Anastasios Kementsietsidis2 Songyun Duan2 Feifei Li1

1University of Utah 2IBM Research

April 2, 2012

Outline

1 Introduction

2 Preliminary

3 Our approach

4 Experiments

5 Conclusions

2 / 113

Outline

1 Introduction

2 Preliminary

3 Our approach

4 Experiments

5 Conclusions

3 / 113

Introduction

We are inundated with a large collection of RDF (ResourceDescription Framework) data.

DBpedia, Uniprot, Freebase etcA large graph and encode rich semantics

Available engines to manage RDF data?

RDBMS: Migrate RDF, e.g., Sesame, JenaSDB etc.Generic RDF stores: e.g., RDF3X, JenaTDB etc.

[VP07] D.J. Abadi, et al. Scalable semantic web data management using vertical partitioning. In VLDB, 2007.

[HEX08] C. Weiss, et al. Hexastore: sextuple indexing for semantic web data management. In VLDB, 2008.

[RDF3X] T. Neumann, G. Weikum. RDF-3X: a RISC-style engine for RDF. In VLDB, 2008.

[SJP09] T. Neumann, G. Weikum. Scalable Join Processing on Very Large RDF Graphs. In SIGMOD, 2009.

[BM10] M. Atre, et al. Matrix ”Bit” loaded: A Scalable Lightweight Join Query Processor for RDF Data. In WWW, 2010.

[SSQ11] J. Huang, et al. Scalable SPARQL Querying of Large RDF Graphs. In VLDB, 2011.

4 / 113

Introduction

DBpedia, Uniprot, Freebase etc

<rdf:RDFxmlns:rdf=http://www.w3.org/1999/02/22-rdf-syntax-ns#xmlns:dcterms=”http://purl.org/dc/terms/”>

<rdf:Description rdf:about=”urn:x-states:New York”><dcterms:alternative>NY</dcterms:alternative>

</rdf:Description></rdf:RDF>

Internally ...

A large graph and encode rich semantics

5 / 113

Introduction

<http://. . ./New York> <http://purl.org/dc/terms/alternative> ”NY” .

Triple format:

Internally ...

subject predicate object

6 / 113

Introduction

Triple format:

Internally ...

subject predicate object

7 / 113

Introduction

Triple format:

Internally ...

Query language: SPARQLsubject predicate object

8 / 113

Introduction

9 / 113

Introduction

10 / 113

Introduction

RDF store

Qn−1QnSPARQL queries

11 / 113

Introduction

RDF store

Qn−1Qn

• Observation: queries share common parts• Multi-query optimization

SPARQL queries

12 / 113

Introduction

A tempting choice: turn to MQO in relational databases[MQO88][MQO90][MQO00]

SPARQL↔relational algebra [EPS08][FSR07].Exist quite a few relational solutions for RDF store.

[MQO90] T. Sellis, et al. On the Multiple-Query Optimization Problem. In TKDE, 1990.

[MQO88] T. Sellis, et al. Multiple-query optimization. In TODS, 1988.

[MQO00] P. Roy, et al. Efficient and extensible algorithms for multi query optimization. In SIGMOD, 2000.

[EPS08] R. Angles, et al. The Expressive Power of SPARQL. In ISWC, 2008.

[FSR07] A. Polleres, et al. From SPARQL to rules (and back). In WWW, 2007.

For SPARQL and RDF, new issues arise in practice.

Convert SPARQL to SQL: not all engines use RDBMSConversion to SQL → a large number of joinsStore dependent solution

13 / 113

Introduction

14 / 113

Introduction

Convert SPARQL to SQL: not all engines use RDBMSConversion to SQL → a large number of joins

Store dependent solution

15 / 113

Introduction

16 / 113

Outline

1 Introduction

2 Preliminary

3 Our approach

4 Experiments

5 Conclusions

17 / 113

Preliminary

We foucs on two types of queries

Problem statement.

Input: a set Q of Type 1 queries and a data graph GOutput: a set of rewritten queries, QOPT of Type 1 and Type 2 queriesRequirements:

soundness and completeness: QOPT(G)≡Q (G).

cost:Tr (Q)+Te (Qopt)

Te (Q)≤ 1

18 / 113

Preliminary

Type 1: Q := SELECT RD WHERE GP

Type 2: QOPT := SELECT RD WHERE GP (OPTIONAL GPOPT)+

Problem statement.

Te (Q)≤ 1

19 / 113

Preliminary

subj pred objp1 name ”Alice”p1 zip 10001p1 mbox alice@homep1 mbox alice@workp1 www http://home/alicep2 name ”Bob”p2 zip 10001p3 name ”Ella”p3 zip 10001p3 www http://work/ellap4 name ”Tim”p4 zip ”11234”

(a) triple table D

SELECT ?nameWHERE { ?x name ?name, ?x zip 10001,

(b) Example query QOPT

name”Alice””Bob””Ella”

Problem statement.

Te (Q)≤ 1

20 / 113

Preliminary

(a) triple table D

SELECT ?name , ?mail , ?hpageWHERE { ?x name ?name, ?x zip 10001,

OPTIONAL {?x mbox ?mail }OPTIONAL {?x www ?hpage }}

name”Alice””Bob””Ella”

Problem statement.

Te (Q)≤ 1

21 / 113

Preliminary

(a) triple table D

SELECT ?name , ?mail , ?hpageWHERE { ?x name ?name, ?x zip 10001,

OPTIONAL {?x mbox ?mail }OPTIONAL {?x www ?hpage }}

name mail hpage”Alice” alice@home http://home/alice”Alice” alice@work http://home/alice”Bob””Ella” http://work/ella

(c) Output QOPT(D)

Problem statement.

Te (Q)≤ 1

22 / 113

Preliminary

Problem statement.

Te (Q)≤ 1

23 / 113

Preliminary

Problem statement.

Input: a set Q of Type 1 queries and a data graph G

Output: a set of rewritten queries, QOPT of Type 1 and Type 2 queriesRequirements:

Te (Q)≤ 1

24 / 113

Preliminary

Problem statement.

Input: a set Q of Type 1 queries and a data graph GOutput: a set of rewritten queries, QOPT of Type 1 and Type 2 queries

Requirements:

Te (Q)≤ 1

25 / 113

Preliminary

Problem statement.

Te (Q)≤ 1

26 / 113

Our approach

1 Introduction

2 Preliminary

3 Our approach

4 Experiments

5 Conclusions

27 / 113

Motivating example

?z2?x2

?z1?x1

(a) Query Q1 (b) Query Q2

28 / 113

Motivating example

?z2?x2

?z1?x1

: constant : variable

Q1: SELECT * WHERE{?x1 P1 ?z1 . ?y1 P2 ?z1 . ?y1 P3 ?w1 . ?w1 P4 v1 . }

29 / 113

Motivating example

?z2?x2

?z1?x1

30 / 113

Motivating example

?z2?x2

?z1?x1

SELECT *WHERE { ?x P1 ?z , ?y P2 ?z ,

(I) Structure only QOPT

31 / 113

Motivating example

?z2?x2

?z1?x1

OPTIONAL {?y P3 ?w , ?w P4 v1 }

v1 ?wP4

32 / 113

Motivating example

?z2?x2

?z1?x1

OPTIONAL {?y P3 ?w , ?w P4 v1 }OPTIONAL {?t P3 ?x , ?t P5 v1, ?w P4 v1 }

v1 ?wP4

33 / 113

Motivating example

?z2?x2

?z1?x1

OPTIONAL {?y P3 ?w , ?w P4 v1 }OPTIONAL {?t P3 ?x , ?t P5 v1, ?w P4 v1 }

v1 ?wP4

Evaluated once→potential saving

OPTIONALs are evaluated on top of the common substructures

(intermediate results cached by engine).

34 / 113

Motivating example

?z2?x2

?z1?x1

pattern p α(p)

?x P1 ?z 30%?y P2 ?z 20%?y P3 ?w 18%?w P4 v1 1%?t P5 v1 2%

*Max common subquery is not selective

(II) Using cost in optimization

35 / 113

Motivating example

?z2?x2

?z1?x1

pattern p α(p)

*Max common subquery is not selective

36 / 113

Motivating example

?z2?x2

?z1?x1

SELECT *WHERE { ?w P4 v1,

OPTIONAL {?x1 P1 ?z1, ?y1 P2 ?z1, ?y1 P3 ?w }OPTIONAL {?x2 P1 ?z2, ?y2 P2 ?z2, ?t2 P3 ?x2, ?t2 P5 v1 }

37 / 113

Our approach

Q={q1, q2, . . . , qn}

38 / 113

Our approach

Q={q1, q2, . . . , qn} They often do not share one common subquery

39 / 113

Our approach

Q={q1, q2, . . . , qn}

Paritition input queries

Group 1 Group 2 Group k

• Similar queries can be optimized together

40 / 113

Our approach

Q={q1, q2, . . . , qn}

• Similar queries can be optimized together• finding structure similarity is expensive• group by predicates• distance: Jaccard similarity of predicate sets

41 / 113

Our approach

Q={q1, q2, . . . , qn}

Rewriting Rewriting Rewriting

42 / 113

Our approach

Q={q1, q2, . . . , qn}

Rewriting Rewriting Rewriting • Recursively rewrite a subset of type 1 queries(hierarchically)→ a set of type 2 queries

43 / 113

Our approach

Q={q1, q2, . . . , qn}

Rewriting Rewriting Rewriting • Recursively rewrite a subset of type 1 queries

• finding common edge subgraphs• optimizations to avoid bad efficiency• cost: guard against bad rewritings• approx. by the min selectivity in common subquery

(hierarchically)→ a set of type 2 queries

44 / 113

Our approach

Q={q1, q2, . . . , qn}

v1 ?wP4

pattern p α(p)

45 / 113

Our approach

Q={q1, q2, . . . , qn}

Execution

46 / 113

Our approach

Q={q1, q2, . . . , qn}

Execution

Result distribution

r(q1) r(q2) r(qn)

47 / 113

Our approach

Related issues

Distributing results, i.e., Type 2 query→Type 1 queriesRD of a Type 1 query: e.g., X and Z

↑↓columns from results of the Type 2 rewriting

Soundness and completeness

Extensibility of the solution: more general queries

handle variable predicatesOPTIONAL queries

48 / 113

Our approach

Related issues

Distributing results, i.e., Type 2 query→Type 1 queries

ab cd ef g

RD of a Type 1 query: e.g., X and Z↑↓

columns from results of the Type 2 rewriting

49 / 113

Our approach

Related issues

ab cd ef g

50 / 113

Our approach

Related issues

ab cd ef g

51 / 113

Experiments

Implementation highlights

C++64-bit Linux, 2GHz Xeon(R) CPU, 4GB memory

Dataset

Extend LUBM benchmark generator:randomness in structure, variances of sel.

RDF stores: Jena TDB 0.85 etc

Queries

Rewriting w/ structure: MQO-S; rewriting w/ structure and cost: MQO

52 / 113

Experiments

Dataset

1 5 10 15 20 25 30 35 40 45 50

10−1

lectivity (

Predicate ID

Selective Non−selective

Queries

53 / 113

Experiments

Dataset

Queries

54 / 113

Experiments

Dataset

QueriesEnsure randomness in structure, e.g., star, chain and circle

55 / 113

Experiments

Dataset

QueriesParameter Symbol Default RangeDataset size D 4M 3M to 9MNumber of queries |Q| 100 60 to 160Size of query (num of triple patterns) |Q| 6 5 to 9Number of seed queries κ 6 5 to 10Size of seed queries |qcmn| ∼ |Q|/2 1 to 5Max selectivity of patterns in Q αmax (Q) random 0.1% to 4%Min selectivity of patterns in Q αmin(Q) 1% 0.1% to 4%

56 / 113

Experiments

Dataset

QueriesParameter Symbol Default RangeDataset size D 4M 3M to 9MNumber of queries |Q| 100 60 to 160Size of query (num of triple patterns) |Q| 6 5 to 9Number of seed queries κ 6 5 to 10Size of seed queries |qcmn| ∼ |Q|/2 1 to 5Max selectivity of patterns in Q αmax (Q) random 0.1% to 4%Min selectivity of patterns in Q αmin(Q) 1% 0.1% to 4%

57 / 113

Experiments

Time on rewritingMQO-S-C: structure based rewriting

MQO-C: rewriting integrating with cost

58 / 113

Experiments

Time on rewritingMQO-S-C: structure based rewriting

MQO-C: rewriting integrating with cost

60 80 100 120 140 1600

MQO−S−C MQO−C

*Costly/bad rewritings are rejected→more rounds of comparisons.

59 / 113

Experiments

Time on distributing resultsMQO-S-P: parsing results from MQO-S

MQO-P: parsing results with MQO

60 / 113

Experiments

60 80 100 120 140 1600

MQO−S−P MQO−P

*Non-selective common subqueriesincrease the set of results.

61 / 113

Experiments

60 80 100 120 140 1600

MQO−S−P MQO−P

*Non-selective common subqueriesincrease the set of results.

*Both rewriting and parsingare efficiently doable

62 / 113

Experiments

Varying num of queries in a batchNo-MQO: no optimization

MQO-S: optimization based on structural rewriting

MQO: integrating cost

63 / 113

Experiments

60 80 100 120 140 1600

No−MQO MQO−S MQO

*Both reduce the num ofqueries to be exectued

64 / 113

Experiments

60 80 100 120 140 1600

|Q|65 / 113

Experiments

Varying min. selectivity in seed queriesNo-MQO: no optimization

66 / 113

Experiments

0.1 0.5 1 2 40

αmin(qcmn) (%)

*MQO: reject more bad rewritings;MQO-S: not sensitive

67 / 113

Experiments

0.1 0.5 1 2 40

αmin(qcmn) (%)68 / 113

Experiments

Varying seed sizeMQO-S: optimization based on structural rewriting

percentage=Te (common subquery)Te (Qopt )

× 100%

69 / 113

Experiments

Varying seed sizeMQO-S: optimization based on structural rewriting

percentage=Te (common subquery)Te (Qopt )

× 100%

1 2 3 4 5

MQO−S MQO

|qcmn|

No−OPTIONAL− − −

*MQO-S: up to 25% time on optional

70 / 113

Conclusions

In dealing RDF data on the Web, store independency is important

Combining SPARQL language and graph algorithms can achieveMQO, i.e., by rewriting queries

Cost must be taken in consideration during rewriting

71 / 113

The End

Thank You

Q and A

72 / 113

Our approach

Partition input queries

Object: similar queries can be optimized together in rewriting

Represent each query as a set of predicates.

Measure the similarity of a pair of queries by set similarity

Grouping: k-means

73 / 113

Our approach

Grouping: k-means

74 / 113

Our approach

Grouping: k-means

75 / 113

Our approach

p3p2 p4

p2p2 p1

q1 q2 q3 q4 q5 q6

q12q11q10q9q8q7

Grouping: k-means

76 / 113

Our approach

p3p2 p4

p2p2 p1

q1 q2 q3 q4 q5 q6

q12q11q10q9q8q7

Distance: Jaccard similarity on predicates

Grouping: k-means

77 / 113

Our approach

p3p2 p4

p2p2 p1

{p1,p2,p3} {p1,p2,p3} {p1,p2,p3} {p7,p8,p9} {p7,p8,p9} {p7,p8,p9,p10}

{p1,p2,p3,p4} {p2,p3,p4} {p1,p2,p3,p4} {p1,p2,p4} {p7,p8,p9} {p5,p6,p7,p9}

q1 q2 q3 q4 q5 q6

q12q11q10q9q8q7

Distance: Jaccard similarity on predicatesRepresent each query as a set of predicates.

Grouping: k-means

78 / 113

Our approach

p3p2 p4

p2p2 p1

q1 q2 q3 q4 q5 q6

q12q11q10q9q8q7

Grouping: k-means

79 / 113

Our approach

p3p2 p4

p2p2 p1

q1 q2 q3 q4 q5 q6

q12q11q10q9q8q7

Grouping: k-means

80 / 113

Our approach

Hierarchical rewriting and clustering (inside a group)

Rewriting → finding maximal common triple patterns

In the language of graph . . .

[CLQ01] I. Koch. Enumerating all connected maximal common subgraphs in two graphs. In Theoretical Computer Science, 2001.

maximal common connected edge subgraphs

81 / 113

Our approach

Rewrite pairs of queries bottom up

qab qcd

qa qb qc qd . . . . . . . . . . . . qr qs

82 / 113

Our approach

qa qb qc qd . . . . . . . . . . . . qr qs

qab qcd

83 / 113

Our approach

Pair up queries with max Jaccard similarity

qa qb qc qd . . . . . . . . . . . . qr qs

qab qcd

84 / 113

Our approach

qa qb qc qd . . . . . . . . . . . . qr qs

qab qcd

qabcdrewrite

85 / 113

Our approach

qa qb qc qd . . . . . . . . . . . . qr qs

qab qcd

qabcdrewrite

86 / 113

Our approach

qa qb qc qd . . . . . . . . . . . . qr qs

qab qcd

qabcdrewrite

In the language of graph . . .[CLQ01] I. Koch. Enumerating all connected maximal common subgraphs in two graphs. In Theoretical Computer Science, 2001.

87 / 113

Our approach

qa qb qc qd . . . . . . . . . . . . qr qs

qab qcd

qabcdrewrite

88 / 113

Our approach

qa qb qc qd . . . . . . . . . . . . qr qs

qab qcd

qabcdrewrite

maximal common connected edge subgraphs→ maximal common connected induced sugraphs in linegraphs

89 / 113

Our approach

qa qb qc qd . . . . . . . . . . . . qr qs

qab qcd

qabcdrewrite

maximal common connected edge subgraphs→ maximal common connected induced sugraphs in linegraphs→ maximal cliques in the product graph

90 / 113

Our approach

?z4?x4

?z3?x3

?z2?x2

?z1?x1

(a) Query Q1 (b) Query Q2 (c) Query Q3 (d) Query Q4

P4 P4?w2

91 / 113

Our approach

?z4?x4

?z3?x3

?z2?x2

?z1?x1

P4 P4?w2

• Linegraph: invert vertices and edges

92 / 113

Our approach

?z4?x4

?z3?x3

?z2?x2

?z1?x1

P4 P4?w2

• sub−sub:`0, sub−obj:`1, obj−sub:`2, obj−obj:`3

(a) L(Q1) (b) L(Q2) (c) L(Q3) (d) L(Q4)

`2 `1`0

`0 `0`3

93 / 113

Our approach

?z4?x4

?z3?x3

?z2?x2

?z1?x1

P4 P4?w2

(a) L(Q1) (b) L(Q2) (c) L(Q3) (d) L(Q4)

`2 `1`0

`0 `0`3

• product graph: simultaneous walk

94 / 113

Our approach

?z4?x4

?z3?x3

?z2?x2

?z1?x1

P4 P4?w2

(a) L(Q1) (b) L(Q2) (c) L(Q3) (d) L(Q4)

`2 `1`0

`0 `0`3

95 / 113

Our approach

?z4?x4

?z3?x3

?z2?x2

?z1?x1

P4 P4?w2

(a) L(Q1) (b) L(Q2) (c) L(Q3) (d) L(Q4)

`2 `1`0

`0 `0`3

graph query pattern 1 graph query pattern 2

The triangle (clique) highlights the common subgraph composed by

96 / 113

Our approach

?z4?x4

?z3?x3

?z2?x2

?z1?x1

P4 P4?w2

(a) L(Q1) (b) L(Q2) (c) L(Q3) (d) L(Q4)

`2 `1`0

`0 `0`3

• blowup in size, esp. > 2 queries

affect clique detection

97 / 113

Our approach

?z4?x4

?z3?x3

?z2?x2

?z1?x1

P4 P4?w2

(a) L(Q1) (b) L(Q2) (c) L(Q3) (d) L(Q4)

`2 `1`0

`0 `0`3

• optimize the product graph

98 / 113

Our approach

?z4?x4

?z3?x3

?z2?x2

?z1?x1

P4 P4?w2

(a) L(Q1) (b) L(Q2) (c) L(Q3) (d) L(Q4)

`2 `1`0

`0 `0`3

• prune non-common predicates• check the constants

99 / 113

Our approach

?z4?x4

?z3?x3

?z2?x2

?z1?x1

P4 P4?w2

(a) L(Q1) (b) L(Q2) (c) L(Q3) (d) L(Q4)

`2 `1`0

`0 `0`3

• prune non-common predicates• check the constants• prune vertices with non-common

neighborhoods

100 / 113

Our approach

?z4?x4

?z3?x3

?z2?x2

?z1?x1

P4 P4?w2

(a) L(Q1) (b) L(Q2) (c) L(Q3) (d) L(Q4)

`2 `1`0

`0 `0`3

• prune non-common predicates• check the constants• prune vertices with non-common

neighborhoods

L(GPp):

S: P3 P4

101 / 113

Our approach

Find maximal cliques in the product graph [CLQ02][CLQ03][CLQ02] Patric R.J. Ostergard. A fast algorithm for the maximum clique problem. In Discrete Applied Mathematics, 2002.

[CLQ03] E. Tomita et al. An efficient branch-and-bound algorithm for finding a maximum clique. In Discrete Mathematics andTheoretical Computer Science, LNCS, 2003.

Integrate cost into rewriting

Structure: maximize size of the common subquery in a rewritingEvaluation on cost: guard against bad rewritingsMeasure: min selectivity in the common subquery for approximationCost: discard bad rewritings, keep good ones in hierarchical rewriting

102 / 113

Our approach

103 / 113

Our approach

104 / 113

Our approach

Structure: maximize size of the common subquery in a rewritingEvaluation on cost: guard against bad rewritings

Measure: min selectivity in the common subquery for approximationCost: discard bad rewritings, keep good ones in hierarchical rewriting

105 / 113

Our approach

Structure: maximize size of the common subquery in a rewritingEvaluation on cost: guard against bad rewritingsMeasure: min selectivity in the common subquery for approximation

Cost: discard bad rewritings, keep good ones in hierarchical rewriting

106 / 113

Our approach

107 / 113

Our approach

108 / 113

Our approach

qa qb qc qd . . . . . . . . . . . . qr qs

qab qcd

Stop when the selectivitydrops

109 / 113

Our approach

Related issues

Distributing results, i.e., Type 2 query→Type 1 queriesRD of a Type 1 query

handle variable predicatesnested OPTIONALs

110 / 113

Our approach

Related issues

name mail hpage

”Alice” alice@home http://home/alice”Alice” alice@work http://home/alice”Bob””Ella” http://work/ella

RD of a Type 1 query

111 / 113

Our approach

Related issues

name mail hpage

112 / 113

Our approach

Related issues

name mail hpage

113 / 113

scalable multi-query optimization for sparql - cs.utah.edulifeifei/papers/mqoslides.pdf · scalable...

Documents

using sparql to query bioportal ontologies and metadata

the sparql query graph model for query optimization

parallel sparql query execution using apache spark

openhpi 3.1 - how to query rdf(s)? - sparql

new foundations of sparql query...

interactive sparql query processing on hadoop -...

sparql query languageai.fon.bg.ac.rs › wp-content ›...

sparql query languageai.fon.bg.ac.rs › wp-content ›...

sparql- a query language for rdf(s)

optimizing sparql query answering over owl ontologies

cs 848 - an introduction to...

1 sparql language overview - uppsala university · 1 sparql...

predicting sparql query execution time and suggesting...

sparql all slides are adapted from the w3c recommendation...

sparql and rdf query optimization

towards a top-k sparql query benchmark generator

gstore: a graph-based sparql query engine

t4 tue 0830 feigenbaum lee prudhommeaux...

detecting sparql query templates for data prefetching

05/01/2016 sparql sparql protocol and rdf query language s....