worq: workload-driven rdf query processing · proposal •we present onlinemethod for...
TRANSCRIPT
WORQ: Workload-Driven RDF Query Processing
Department of Computer SciencePurdue University
Amgad Madkour (Purdue)Ahmed Aly (Google)Walid G. Aref (Purdue)
IntroductionRDF Data Is Everywhere
• RDF is an integral component in many systems: Semantic Search, Smart Governments (Data.gov), Medical Systems
• (Linked) RDF data contains very rich relations: Data.gov – 5 billion triples Linked Cancer Genome Atlas – 7.36 billion triples US Census Data – 1 billion triples
• Cloud-based systems are ideal for RDF data management (e.g., Storage, Query Processing)
Figure: Linked RDF Data Cloud containing thousands of datasets
2
IntroductionProcessing RDF Queries
• Network shuffling overhead degradesquery performance in a distributed environment
• Intermediate results represent the data that satisfies the binary join and contributes to the final result of the query
• Reducing the network shuffling relies on how the data is partitioned across the nodes and the intermediate results size
3
SELECT ?x ?y WHERE {?x :mention :Mary . ?x :tweet ?y . }
mention
SUB OBJ
:John :T1
:Mike :T2
Mike :T3
:Alex :T4
tweet
SUB OBJ
:John :Mary
:John :Mike
:Alex :Mary
:Sally :Mike
:Mary :John
Original Data
Reductions
Join(mention sub , tweet sub )
Join(tweet sub , mention sub )
SUB OBJ
:John :Mary
:John :Mike
:Alex :Mary
SUB OBJ
:John :T1
:Alex :T4
Problem Statement
• Data partitioning incurs a preprocessing overhead as it needs to be performed over the whole data
• Intermediate results may contain redundant data triples that do not match all the query joins
• Caching the unique query results incurs significant memory storage overhead
4
Proposal• We present online method for computing reductions of RDF data
using Bloom filters
• We present workload-driven partitioning of RDF triples that can join together in order to minimize the network shuffling overhead
• We show that caching the RDF join reductions can boost the query performance while keeping the cache size minimal
• We study an efficient technique for answering RDF queries with unbound properties using Bloom filters
5
Online Reduction of RDF DataJoin Patterns
• SPARQL queries consist of Basic Graph Patterns (BGP)
• Every BGP consists of a set of triples
• Join patterns represent correlations between triples in a SPARQL Basic Graph Pattern (BGP)
SELECT ?x ?y ?wWHERE ?x :tweet :T1?x :mention ?y?y :likes ?w
tweet_S_JOIN_mention_Smention_S_JOIN_tweet_Smention_O_JOIN_likes_Slikes_S_JOIN_mention_O
Join Patterns
mention
likes
tweet
6
Subject Object
:John :Mary
:John :Mike
:Alex :Mary
:Sally :Mike
:Mary :John
Online Reduction of RDF DataBloom Join
Subject Object
:John :T1
:Mike :T2
:Mike :T3
:Alex :T4
:John :Alex:Sally:Mary
:John :Mike :Alex
SELECT ?x ?y WHERE
?x :mention :Mary . ?x :tweet ?y .
joinx(:mentionsub,:tweetsub)
SUB OBJ
:John :Mary
:John :Mike
:Alex :MaryBloomFilter sub(tweet)
BloomFilter sub(mention)
SUB OBJ
:John :T1
:Alex :T4
BGPJoin
mention tweet
Reduced Triples
Probe
ProbeSelection(:Mary)
?x ?y
:John :T1
:Alex :T4
Result
JoinSPARQL Query
Determine join patterns
mention tweet
Online Reduction of RDF DataN-ary Join
8
SELECT ?x ?y ?z ?w WHERE ?x :mention ?y?x :tweet ?z?x :likes ?w
Subject Object
:John :Mary
:John :Mike
:Alex :Mary
:Sally :Mike
:Mary :John
Subject Object
:John :T1
:Mike :T2
:Mike :T3
:Alex :T4
Subject Object
:John :Alex
:Sally :Mike
mention tweet likes
:John :Alex:Sally:Mary
:John :Mike:Alex
:John:Sally
Subject Object
:John :Mary
:John :Mike
Subject Object
:John :T1
Subject Object
:John :Alex
?y ?y ?z ?w
:John :Mary :T1 :Alex
:John :Mike :T1 :Alex
Reduction
Reduction
Reduction
Result
Query
Computed from
Online Reduction of RDF DataCaching
9
SELECT ?x ?y WHERE {?x :mention :Mary . ?x :tweet ?y . }
mention
SUB OBJ
:John :T1
:Mike :T2
Mike :T3
:Alex :T4
tweet
SUB OBJ
:John :Mary
:John :Mike
:Alex :Mary
:Sally :Mike
:Mary :John
Original Data
Reductions
Join(mention sub , tweet sub )
Join(tweet sub , mention sub )
SUB OBJ
:John :Mary
:John :Mike
:Alex :Mary
SUB OBJ
:John :T1
:Alex :T4
Selection(:Mary)
CACHED
Workload-Driven PartitioningOverview
10
Subject Object
:John :Mary
:John :Mike
:Alex :Mary
:Sally :Mike
:Mary :John
Machine 1 Machine 2 Machine 3
Subject Object
:John :T1
:Mike :T2
:Mike :T3
:Alex :T4
:John :Mary:John :Mike
:John :T1
:Alex :Mary:Mary :John
:Alex :T4
:Sally :Mike
:Mike :T2:Mike :T3
mention tweet
mention tweet
Workload-Driven PartitioningProposal
SELECT ?x ?y ?wWHERE ?x :tweet :T1?x :mention ?y?y :likes ?w
Reductions Reduction ID
Reduction(tweetsub, mentionsub) R1
Reduction(mentionsub, tweetsub) R2
Reduction(likessub, mentionobj) R3
Subject Object
:John :T1
:Alex :T4
Reduction 1 (R1)
Subject Object
:John :Mary
:John :Mike
:Alex :Mary
Reduction 2 (R2)
Subject Object
:John :Alex
Reduction 3 (R3)
11
Machine 1
Machine 2
:John :T1
:Alex :T4
:John :Alex
:John :Mary:John :Mike
Part
ition
ing
R1
R1
R3
R2
Possible Reductions Redu
ctio
ns:Alex :Mary R2
Queries with Unbound PropertiesOverview
SELECT ?x ?zWHERE ?x ?z :Mike
Scan All Tables
QUERY: Check all tables for Obj = ‘:Mike’
12
Queries with Unbound PropertiesProposal
SELECT ?xWHERE {:Mary ?x ?y . }
13
:John :Alex:Sally:Mary
:John :Mike :Alex
BloomFilter sub (:tweet)
BloomFilter sub (:mention)
:John :Sally
BloomFilter sub (:like)
[MATCH]
[DOES NOT MATCH]
[FALSE POSITIVE MATCH]
Probe all existing Bloom Filters
IDENTIFICATION VERIFICATION
:mention :like
Filter sub
(:Mary)Filter sub
(:Mary)
1 Entry 0 Entry
?x
:mention
Result
Experimental Setup• Systems
WORQ: Implemented inside Knowledge Cubes (KC) S2RDF: State of the art Spark-based RDF engine
• Benchmarks WatDiv
Dataset: 1 Billion Triple, Query Workload: 5K queries Patterns: Covers 100 diverse SPARQL patterns, each containing 50 variations
Unbound Property Queries: 500 queries LUBM Dataset: 1 Billion Triple, Query Workload: 1K queries
Patterns: Covers 20 diverse SPARQL patterns YAGO Dataset: 245 million triples
14
GitHub Homepagehttps://github.com/amgadmadkour/knowledgecubes
Number of Files
15
1.E+00
1.E+01
1.E+02
1.E+03
1.E+04
LUBM 1B WatDiv 1B YAGO2s
Num
. File
s (C
ount
)
Datasets
VP WORQ S2RDF
Data Size on HDFS
16
1.E+00
1.E+01
1.E+02
1.E+03
LUBM 1B WatDiv 1B YAGO2s
Stor
age
Size
(G
B)
Datasets
VP WORQ S2RDF
Preprocessing Time
17
0.E+00
5.E+03
1.E+04
2.E+04
2.E+04
3.E+04
3.E+04
LUBM 1B WatDiv 1B YAGO2sPrep
roce
ssin
g Ti
me
(sec
)
Datasets
VP WORQ S2RDF
Query Execution PerformanceWorkload Generators
18
0
10
20
30
WatDiv LUBM
Exec
utio
n Ti
me
(sec
)
Datasets
WORQ S2RDF
0
5
10
15
20
WatDiv LUBM
Exec
utio
n Ti
me
(hou
rs)
Datasets
WORQ S2RDF
Mean Execution Time Total Execution Time
5000 queries over WatDiv (1 Billion triples) and1000 queries over LUBM (1 Billion triples)
Query Execution PerformanceQuery Patterns
19
1.E+02
1.E+03
1.E+04
1.E+05
1 3 5 7 9 11 13 15 17 19
Exec
utio
n Ti
me
(ms)
Query Patterns
WORQ S2RDF
5.E+02
5.E+03
5.E+04
1 3 5 7 9 11 13 15 17 19
Exec
utio
n Ti
me
(ms)
Query Patterns
WORQ S2RDF
WatDiv 1 Billion dataset LUBM 1 Billion dataset
Query Execution PerformanceQuery Patterns
20
400
4000
40000
2 3 4 5 6 7 8 9 10 11 12
Mea
n Ex
ecut
ion
Tim
e (m
s)
Number of query triples
WORQ S2RDF
500
5000
50000
2 3 4 5 6 7 8 9Mea
n Ex
ecut
ion
Tim
e (m
s)
Number of joins
WORQ S2RDF
Mean execution time over WatDiv 1 Billion
Query Execution PerformanceWorkload-Driven Partitioning
21
5000
5500
6000
6500
7000
WatDiv LUBMExec
utio
n Ti
me
(ms)
Datasets
Workload-driven Static
Query Execution PerformanceCaching
22
0.E+002.E+034.E+036.E+038.E+031.E+041.E+041.E+04
120
140
160
180
110
0112
0114
0116
0118
0120
0122
0124
0126
0128
0130
0132
0134
0136
0138
0140
0142
0144
0146
0148
01
Mem
ory
Usa
ge (
MB)
Timeline
Caching Results Caching Reductions
Performance of Unbound-Property Queries
23
System BSO-Mean BSO-Sum BS-Mean BS-Sum BO-Mean BO-Sum
WORQ 1.25 ms 10.49 min 4.18 ms 34.84 min 3.52 ms 29.34 min
RDF-Table 5.3 ms 44.44 min 3.80 ms 31.67 min 4.35 ms 36.26 min
(BSO) Bound Subject and Object(BS) Bound Subject(BO) Bound Object
Conclusion
• WORQ is an online method for computing reductions of RDF data using Bloom filters
• WORQ is a method for workload-driven partitioning that minimizes the network shuffling overhead
• WORQ demonstrates how caching reductions can boost the query performance
• WORQ helps answer RDF queries with unbound properties efficiently
24
25
Thank You !