
WORQ: Workload-Driven RDF Query Processing

Department of Computer Science, Purdue University

Amgad Madkour (Purdue), Ahmed Aly (Google), Walid G. Aref (Purdue)

Introduction: RDF Data Is Everywhere

• RDF is an integral component in many systems: Semantic Search, Smart Governments (Data.gov), Medical Systems

• (Linked) RDF data contains very rich relations:
    Data.gov: 5 billion triples
    Linked Cancer Genome Atlas: 7.36 billion triples
    US Census Data: 1 billion triples

• Cloud-based systems are ideal for RDF data management (e.g., Storage, Query Processing)

Figure: Linked RDF Data Cloud containing thousands of datasets


Introduction: Processing RDF Queries

• Network shuffling overhead degrades query performance in a distributed environment

• Intermediate results represent the data that satisfies the binary join and contributes to the final result of the query

• Reducing network shuffling depends on how the data is partitioned across the nodes and on the size of the intermediate results


Example query:

SELECT ?x ?y WHERE { ?x :mention :Mary . ?x :tweet ?y . }

Original Data

mention
SUB      OBJ
:John    :Mary
:John    :Mike
:Alex    :Mary
:Sally   :Mike
:Mary    :John

tweet
SUB      OBJ
:John    :T1
:Mike    :T2
:Mike    :T3
:Alex    :T4

Reductions

Join(mention sub, tweet sub)
SUB      OBJ
:John    :Mary
:John    :Mike
:Alex    :Mary

Join(tweet sub, mention sub)
SUB      OBJ
:John    :T1
:Alex    :T4

Problem Statement

• Data partitioning incurs a preprocessing overhead as it needs to be performed over the whole data

• Intermediate results may contain redundant data triples that do not match all the query joins

• Caching the unique query results incurs significant memory storage overhead


Proposal

• We present an online method for computing reductions of RDF data using Bloom filters

• We present workload-driven partitioning of RDF triples that can join together in order to minimize the network shuffling overhead

• We show that caching the RDF join reductions can boost the query performance while keeping the cache size minimal

• We study an efficient technique for answering RDF queries with unbound properties using Bloom filters


Online Reduction of RDF Data: Join Patterns

• SPARQL queries consist of Basic Graph Patterns (BGP)

• Every BGP consists of a set of triples

• Join patterns represent correlations between triples in a SPARQL Basic Graph Pattern (BGP)

Example query (three triple patterns over tweet, mention, and likes):

SELECT ?x ?y ?w WHERE { ?x :tweet :T1 . ?x :mention ?y . ?y :likes ?w . }

Join Patterns

tweet_S_JOIN_mention_S
mention_S_JOIN_tweet_S
mention_O_JOIN_likes_S
likes_S_JOIN_mention_O
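A minimal sketch of how join patterns like the ones on this slide could be derived from a BGP. The function name `join_patterns` and the tuple representation of triple patterns are assumptions made for illustration; they are not taken from the WORQ implementation.

```python
from itertools import combinations

def join_patterns(bgp):
    """Name a join pattern (in both directions) for every shared variable.

    Each triple pattern is a (subject, property, object) tuple, and
    variables start with "?". This naming scheme mirrors the slide.
    """
    patterns = []
    for (s1, p1, o1), (s2, p2, o2) in combinations(bgp, 2):
        for col1, term1 in (("S", s1), ("O", o1)):
            for col2, term2 in (("S", s2), ("O", o2)):
                # Only a shared variable (not a constant) creates a join.
                if term1 == term2 and term1.startswith("?"):
                    patterns.append(f"{p1}_{col1}_JOIN_{p2}_{col2}")
                    patterns.append(f"{p2}_{col2}_JOIN_{p1}_{col1}")
    return patterns

bgp = [("?x", "tweet", ":T1"), ("?x", "mention", "?y"), ("?y", "likes", "?w")]
for name in join_patterns(bgp):
    print(name)
```

For the example query, the shared variable ?x links tweet and mention on their subjects, and ?y links the object of mention to the subject of likes.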

Online Reduction of RDF Data: Bloom Join

SPARQL Query

SELECT ?x ?y WHERE { ?x :mention :Mary . ?x :tweet ?y . }

Determine join patterns: join_x(:mention sub, :tweet sub)

mention
Subject  Object
:John    :Mary
:John    :Mike
:Alex    :Mary
:Sally   :Mike
:Mary    :John

tweet
Subject  Object
:John    :T1
:Mike    :T2
:Mike    :T3
:Alex    :T4

BloomFilter sub(mention): :John, :Alex, :Sally, :Mary
BloomFilter sub(tweet): :John, :Mike, :Alex

Probe: the subjects of each table are probed against the Bloom filter of the other table

Reduced Triples

mention
SUB      OBJ
:John    :Mary
:John    :Mike
:Alex    :Mary

tweet
SUB      OBJ
:John    :T1
:Alex    :T4

BGP Join over the reduced triples, with Selection(:Mary)

Result
?x       ?y
:John    :T1
:Alex    :T4
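The Bloom join on this slide can be sketched as follows. This is an illustrative sketch only: the `BloomFilter` class, its sizes (1024 bits, 3 hashes), and the helper names are assumptions, not the WORQ code. A Bloom filter may report false positives, so a reduction can keep a few extra rows; the final join is exact and discards them.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter; the sizes are illustrative, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def build_filter(table, column):
    """Build a Bloom filter over one column of a property table."""
    bloom = BloomFilter()
    for row in table:
        bloom.add(row[column])
    return bloom

def reduce_table(table, column, other_filter):
    """Keep only rows whose join value may occur in the other table."""
    return [row for row in table if row[column] in other_filter]

SUB = 0
mention = [(":John", ":Mary"), (":John", ":Mike"), (":Alex", ":Mary"),
           (":Sally", ":Mike"), (":Mary", ":John")]
tweet = [(":John", ":T1"), (":Mike", ":T2"), (":Mike", ":T3"), (":Alex", ":T4")]

mention_reduced = reduce_table(mention, SUB, build_filter(tweet, SUB))
tweet_reduced = reduce_table(tweet, SUB, build_filter(mention, SUB))

# Selection(:Mary) followed by an exact join on ?x over the reduced triples.
mentions_mary = {s for s, o in mention_reduced if o == ":Mary"}
result = [(s, t) for s, t in tweet_reduced if s in mentions_mary]
print(result)  # [(':John', ':T1'), (':Alex', ':T4')]
```

Because Bloom filters never produce false negatives, every row that truly joins survives the reduction, so the result matches the one on the slide.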

Online Reduction of RDF Data: N-ary Join

Query

SELECT ?x ?y ?z ?w WHERE { ?x :mention ?y . ?x :tweet ?z . ?x :likes ?w . }

mention
Subject  Object
:John    :Mary
:John    :Mike
:Alex    :Mary
:Sally   :Mike
:Mary    :John

tweet
Subject  Object
:John    :T1
:Mike    :T2
:Mike    :T3
:Alex    :T4

likes
Subject  Object
:John    :Alex
:Sally   :Mike

Bloom filter contents shown in the figure: (:John, :Mike, :Alex) for the tweet subjects and (:John, :Sally) for the likes subjects

Reduction of mention (computed from the Bloom filters)
Subject  Object
:John    :Mary
:John    :Mike

Reduction of tweet
Subject  Object
:John    :T1

Reduction of likes
Subject  Object
:John    :Alex

Result
?x       ?y       ?z     ?w
:John    :Mary    :T1    :Alex
:John    :Mike    :T1    :Alex
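A short sketch of the n-ary reduction on this slide. To keep it brief, exact Python sets stand in for the Bloom filters (an assumption of this sketch; real Bloom filters could let a few false positives through, which the final join removes). The function name `nary_reduce` is hypothetical.

```python
def nary_reduce(tables, column=0):
    """Keep, in each table, only rows whose join value occurs in every other table.

    Requires at least two tables. Sets stand in for Bloom filters here.
    """
    keys = {name: {row[column] for row in rows} for name, rows in tables.items()}
    reduced = {}
    for name, rows in tables.items():
        others = [keys[other] for other in tables if other != name]
        shared = set.intersection(*others)
        reduced[name] = [row for row in rows if row[column] in shared]
    return reduced

tables = {
    "mention": [(":John", ":Mary"), (":John", ":Mike"), (":Alex", ":Mary"),
                (":Sally", ":Mike"), (":Mary", ":John")],
    "tweet": [(":John", ":T1"), (":Mike", ":T2"), (":Mike", ":T3"), (":Alex", ":T4")],
    "likes": [(":John", ":Alex"), (":Sally", ":Mike")],
}
reduced = nary_reduce(tables)

# Subject-subject-subject join on ?x over the reduced tables.
result = [(x, y, z, w)
          for x, y in reduced["mention"]
          for x2, z in reduced["tweet"] if x2 == x
          for x3, w in reduced["likes"] if x3 == x]
print(result)
```

Only :John appears as a subject in all three tables, so each reduction keeps only the :John rows that can contribute to the result.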

Online Reduction of RDF Data: Caching

SELECT ?x ?y WHERE { ?x :mention :Mary . ?x :tweet ?y . }

Original Data

mention
SUB      OBJ
:John    :Mary
:John    :Mike
:Alex    :Mary
:Sally   :Mike
:Mary    :John

tweet
SUB      OBJ
:John    :T1
:Mike    :T2
:Mike    :T3
:Alex    :T4

Reductions (CACHED)

Join(mention sub, tweet sub)
SUB      OBJ
:John    :Mary
:John    :Mike
:Alex    :Mary

Join(tweet sub, mention sub)
SUB      OBJ
:John    :T1
:Alex    :T4

Selection(:Mary)
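One way to picture why caching reductions, rather than query results, keeps the cache small: the reduction depends only on the join pattern, so queries that differ only in their constant (:Mary, :Mike, ...) share the same cached entry. The sketch below is an assumption-laden illustration; `ReductionCache` and `tweets_of_people_mentioning` are hypothetical names, and exact sets replace the Bloom filters.

```python
class ReductionCache:
    """Caches reductions keyed by join pattern, not by full query."""

    def __init__(self, tables):
        self.tables = tables
        self.cache = {}
        self.misses = 0  # number of reductions actually computed

    def reduction(self, left, right, column=0):
        key = (left, right, column)
        if key not in self.cache:
            self.misses += 1
            right_keys = {row[column] for row in self.tables[right]}
            self.cache[key] = [row for row in self.tables[left]
                               if row[column] in right_keys]
        return self.cache[key]

def tweets_of_people_mentioning(cache, person):
    """Answer: SELECT ?x ?y WHERE { ?x :mention <person> . ?x :tweet ?y . }"""
    mention_r = cache.reduction("mention", "tweet")
    tweet_r = cache.reduction("tweet", "mention")
    subjects = {s for s, o in mention_r if o == person}  # the selection
    return [(s, t) for s, t in tweet_r if s in subjects]

cache = ReductionCache({
    "mention": [(":John", ":Mary"), (":John", ":Mike"), (":Alex", ":Mary"),
                (":Sally", ":Mike"), (":Mary", ":John")],
    "tweet": [(":John", ":T1"), (":Mike", ":T2"), (":Mike", ":T3"), (":Alex", ":T4")],
})
print(tweets_of_people_mentioning(cache, ":Mary"))
print(tweets_of_people_mentioning(cache, ":Mike"))
print(cache.misses)
```

The second query reuses both cached reductions and only applies a different selection on top of them.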

Workload-Driven Partitioning: Overview

mention
Subject  Object
:John    :Mary
:John    :Mike
:Alex    :Mary
:Sally   :Mike
:Mary    :John

tweet
Subject  Object
:John    :T1
:Mike    :T2
:Mike    :T3
:Alex    :T4

Partitions across machines

Machine 1
    mention: (:John, :Mary), (:John, :Mike)
    tweet: (:John, :T1)

Machine 2
    mention: (:Alex, :Mary), (:Mary, :John)
    tweet: (:Alex, :T4)

Machine 3
    mention: (:Sally, :Mike)
    tweet: (:Mike, :T2), (:Mike, :T3)

Workload-Driven Partitioning: Proposal

SELECT ?x ?y ?w WHERE { ?x :tweet :T1 . ?x :mention ?y . ?y :likes ?w . }

Possible Reductions

Reductions                        Reduction ID
Reduction(tweet sub, mention sub)   R1
Reduction(mention sub, tweet sub)   R2
Reduction(likes sub, mention obj)   R3

Reduction 1 (R1)
Subject  Object
:John    :T1
:Alex    :T4

Reduction 2 (R2)
Subject  Object
:John    :Mary
:John    :Mike
:Alex    :Mary

Reduction 3 (R3)
Subject  Object
:John    :Alex

Partitioning

The reduction rows, each tagged with its reduction ID, are distributed over Machine 1 and Machine 2:

:John    :T1      R1
:Alex    :T4      R1
:John    :Alex    R3
:John    :Mary    R2
:John    :Mike    R2
:Alex    :Mary    R2
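A hedged sketch of one plausible placement policy for reduction rows: hash each row on its join-column value so that rows that can join end up on the same machine and need no shuffling at query time. This policy, the function name `place_reductions`, and the use of `zlib.crc32` as the hash are assumptions for illustration; the slide does not state how WORQ assigns rows to machines. Only R1 and R2, which join on ?x, are shown.

```python
import zlib
from collections import defaultdict

def place_reductions(reductions, join_column, num_machines):
    """Assign each reduced row to a machine by hashing its join-column value."""
    machines = defaultdict(list)
    for rid, rows in reductions.items():
        col = join_column[rid]
        for row in rows:
            # crc32 is deterministic across runs, unlike Python's str hash.
            machine = zlib.crc32(row[col].encode("utf-8")) % num_machines
            machines[machine].append((rid, row))
    return machines

reductions = {
    "R1": [(":John", ":T1"), (":Alex", ":T4")],
    "R2": [(":John", ":Mary"), (":John", ":Mike"), (":Alex", ":Mary")],
}
join_column = {"R1": 0, "R2": 0}  # both reductions join on the subject (?x)

placement = place_reductions(reductions, join_column, num_machines=2)
for machine, tagged_rows in sorted(placement.items()):
    print(machine, tagged_rows)
```

With this policy, all R1 and R2 rows for a given subject share a machine, so the tweet-mention join on ?x can run locally. Rows that appear in no reduction are never placed.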

Queries with Unbound Properties: Overview

SELECT ?x ?z WHERE { ?x ?z :Mike . }

Scan All Tables

QUERY: Check all tables for Obj = ':Mike'

Queries with Unbound Properties: Proposal

SELECT ?x WHERE { :Mary ?x ?y . }

IDENTIFICATION: Probe all existing Bloom filters

BloomFilter sub(:tweet), containing :John, :Mike, :Alex    [DOES NOT MATCH]
BloomFilter sub(:mention)                                  [MATCH]
BloomFilter sub(:like), containing :John, :Sally           [FALSE POSITIVE MATCH]

VERIFICATION

:mention   Filter sub(:Mary)   1 Entry
:like      Filter sub(:Mary)   0 Entry

Result
?x
:mention
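The identification and verification steps can be sketched as below. The Bloom filter here is a deliberately small function-based one (256 bits, 3 hashes), and all names (`bit_positions`, `build_filter`, `might_contain`, `properties_of`) are assumptions for illustration, not WORQ's API. Whether :like actually yields a false positive depends on the hash values; the verification step returns the right answer either way.

```python
import hashlib

def bit_positions(item, num_bits=256, num_hashes=3):
    """Bit positions set for an item in a small illustrative Bloom filter."""
    return {
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % num_bits
        for i in range(num_hashes)
    }

def build_filter(values):
    bits = set()
    for value in values:
        bits |= bit_positions(value)
    return bits

def might_contain(bits, item):
    # True for every added item; occasionally True for others (false positive).
    return bit_positions(item) <= bits

def properties_of(subject, tables, filters):
    # Identification: probe every subject Bloom filter.
    candidates = [prop for prop, bits in filters.items() if might_contain(bits, subject)]
    # Verification: scan only the candidate tables to drop false positives.
    return [prop for prop in candidates
            if any(s == subject for s, _ in tables[prop])]

tables = {
    ":tweet": [(":John", ":T1"), (":Mike", ":T2"), (":Mike", ":T3"), (":Alex", ":T4")],
    ":mention": [(":John", ":Mary"), (":John", ":Mike"), (":Alex", ":Mary"),
                 (":Sally", ":Mike"), (":Mary", ":John")],
    ":like": [(":John", ":Alex"), (":Sally", ":Mike")],
}
filters = {prop: build_filter(s for s, _ in rows) for prop, rows in tables.items()}

print(properties_of(":Mary", tables, filters))  # [':mention']
```

The benefit over scanning all tables is that verification touches only the tables whose filter matched, which is typically a small subset.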

Experimental Setup

• Systems
    WORQ: Implemented inside Knowledge Cubes (KC)
    S2RDF: State-of-the-art Spark-based RDF engine

• Benchmarks
    WatDiv
        Dataset: 1 billion triples, Query Workload: 5K queries
        Patterns: Covers 100 diverse SPARQL patterns, each containing 50 variations
        Unbound Property Queries: 500 queries
    LUBM
        Dataset: 1 billion triples, Query Workload: 1K queries
        Patterns: Covers 20 diverse SPARQL patterns
    YAGO
        Dataset: 245 million triples

GitHub Homepage: https://github.com/amgadmadkour/knowledgecubes

Number of Files

Figure: Number of files (count, log scale) for the LUBM 1B, WatDiv 1B, and YAGO2s datasets, comparing VP, WORQ, and S2RDF

Data Size on HDFS

Figure: Storage size (GB, log scale) for the LUBM 1B, WatDiv 1B, and YAGO2s datasets, comparing VP, WORQ, and S2RDF

Preprocessing Time

Figure: Preprocessing time (sec) for the LUBM 1B, WatDiv 1B, and YAGO2s datasets, comparing VP, WORQ, and S2RDF

Query Execution Performance: Workload Generators

Figure: Two panels comparing WORQ and S2RDF on WatDiv and LUBM. Left: Mean Execution Time (sec). Right: Total Execution Time (hours).

5000 queries over WatDiv (1 billion triples) and 1000 queries over LUBM (1 billion triples)

Query Execution Performance: Query Patterns

Figure: Execution time (ms, log scale) per query pattern, comparing WORQ and S2RDF. Left: WatDiv 1 Billion dataset. Right: LUBM 1 Billion dataset.

Query Execution Performance: Query Patterns

Figure: Mean execution time (ms, log scale) over WatDiv 1 Billion, comparing WORQ and S2RDF. Left: by number of query triples (2 to 12). Right: by number of joins (2 to 9).

Query Execution Performance: Workload-Driven Partitioning

Figure: Execution time (ms) on WatDiv and LUBM, comparing workload-driven and static partitioning

Query Execution Performance: Caching

Figure: Memory usage (MB) over the query timeline, comparing Caching Results with Caching Reductions

Performance of Unbound-Property Queries

System      BSO-Mean   BSO-Sum     BS-Mean   BS-Sum      BO-Mean   BO-Sum
WORQ        1.25 ms    10.49 min   4.18 ms   34.84 min   3.52 ms   29.34 min
RDF-Table   5.3 ms     44.44 min   3.80 ms   31.67 min   4.35 ms   36.26 min

(BSO) Bound Subject and Object, (BS) Bound Subject, (BO) Bound Object

Conclusion

• WORQ is an online method for computing reductions of RDF data using Bloom filters

• WORQ is a method for workload-driven partitioning that minimizes the network shuffling overhead

• WORQ demonstrates how caching reductions can boost the query performance

• WORQ helps answer RDF queries with unbound properties efficiently


Thank You !
