Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data


Page 1: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Medha Atre¹, Vineet Chaoji², Mohammed J. Zaki¹, and James A. Hendler¹

¹ Dept. of Computer Science, Rensselaer Polytechnic Institute, Troy NY, USA
² Yahoo! Labs, Bangalore, India

April 29, 2010
WWW 2010, Raleigh NC, USA

Page 2: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Overview

• Introduction
• Challenges
• Motivation
• BitMat structure
  – Construction & operations
• Query processing algorithm
• Experimental evaluation
• Future roadmap

Page 3: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Introduction

• RDF (Resource Description Framework) can represent any information
  – Triple form: [<subject> <predicate> <object>]
  – Each triple is depicted as a directed edge
• RDF graphs ranging from hundreds of millions to a few billion triples are common nowadays
  – DBPedia (103 million triples)
  – UniProt (845 million triples)
  – US Census (1 billion triples)
  – Bio2RDF (2.3 billion triples)
  – Data.gov (5+ billion triples)

[Figure: a triple drawn as a directed edge; labels Subject, Predicate, Object]

Page 4: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Challenges – Storing RDF Data

• RDF graphs of more than a billion triples (400 GB+ on-disk size)
• Traditional DB-based efforts
  – Jena-TDB (custom indexes and storage)
  – C-store
  – MonetDB (open-source DB system)
• Exploit RDF data characteristics on top of DB storage
  – Vertical partitioning: create a separate table for each predicate
• Compression-based techniques
  – MonetDB and RDF-3X

Page 5: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Challenges – Querying RDF Data

• Limited main memory compared to disk space
• Large intermediate join tables
• Scans over a large percentage of the indexes
  – Even with aggressive indexing + compression, e.g., Hexastore, RDF-3X
• Optimizations
  – Selectivity estimation for multi-level joins, left-deep join trees
  – Sideways (parallel) information passing for several merge-joins
  – Semi-joins: reduce the database for a given join query

Page 6: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Motivation for this work

• SPARQL join queries can be broadly classified into 3 types:
  1) Queries having highly selective triple patterns, e.g., (?s :residesIn USA)(?s :hasSSN “123-45-6789”)
     • Existing techniques handle these queries very efficiently
  2) Queries with low-selectivity triple patterns but highly selective results, e.g., (?s :residesIn China)(?s :citizenOf India)
  3) Queries with low-selectivity triple patterns and low-selectivity results, e.g., (?s :residesIn USA)(?s :hasSSN ?y)
• Such queries involving multi-level joins can lead to large intermediate results

Page 7: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Our Contribution

• A compressed data structure, BitMat, to store the RDF data
• A join query algorithm that operates directly on the compressed data:
  – No intermediate join tables; instead, use a 2-phase query algorithm
    • First phase: prune the candidate RDF triples
    • Second phase: stitch the final results directly from the pruned triples
  – Can guarantee memory requirements at the beginning of the query
  – Online/streaming result generation

Page 8: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

BitMat Construction

• Conceptually construct a bit-cube with subject (S), predicate (P), and object (O) dimensions
• Mapping dictionary:
  – Vs: set of subjects, Vp: set of predicates, Vo: set of objects, Vso = Vs ∩ Vo
  – Common subject and object URIs are mapped to the same integer IDs, 1 to |Vso|
  – Subject-only URIs are mapped to integer IDs |Vso|+1 to |Vs|

[Figure: bit-cube with S-dimension, P-dimension, and O-dimension axes; axis marks 1, Vso, Vs, Vo]
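The ID assignment above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `build_dictionary`, the use of sorted order to pick IDs, the numbering of object-only terms from |Vso|+1 to |Vo| (by symmetry with the subject-only rule), and the numbering of predicates from 1 to |Vp| are all assumptions. The input is the set of six movie triples used as the construction example in these slides.

```python
def build_dictionary(triples):
    """Assign integer IDs so that terms shared by S and O get the same ID."""
    subjects = {s for s, _, _ in triples}
    predicates = {p for _, p, _ in triples}
    objects = {o for _, _, o in triples}
    shared = subjects & objects  # Vso = Vs ∩ Vo

    # Shared terms take IDs 1..|Vso| in both the subject and object mappings.
    shared_ids = {term: i for i, term in enumerate(sorted(shared), start=1)}
    s_ids = dict(shared_ids)
    o_ids = dict(shared_ids)
    # Subject-only terms take IDs |Vso|+1..|Vs|.
    for i, term in enumerate(sorted(subjects - shared), start=len(shared) + 1):
        s_ids[term] = i
    # Assumption: object-only terms are numbered the same way, |Vso|+1..|Vo|.
    for i, term in enumerate(sorted(objects - shared), start=len(shared) + 1):
        o_ids[term] = i
    # Assumption: predicates are simply numbered 1..|Vp|.
    p_ids = {term: i for i, term in enumerate(sorted(predicates), start=1)}
    return s_ids, p_ids, o_ids


triples = [
    (":the_matrix", ":releasedIn", '"1999"'),
    (":the_thirteenth_floor", ":releasedIn", '"1999"'),
    (":the_matrix", ":similar_to", ":the_matrix_reloaded"),
    (":the_thirteenth_floor", ":similar_to", ":the_matrix"),
    (":the_matrix", "rdf:type", ":movie"),
    (":the_thirteenth_floor", "rdf:type", ":movie"),
]
s_ids, p_ids, o_ids = build_dictionary(triples)
print(s_ids[":the_matrix"], o_ids[":the_matrix"])  # same ID on both axes
```

Because a shared term has the same ID on the S and O axes, a subject bit-array and an object bit-array can be compared position by position, which is presumably what makes subject-object joins cheap in this layout.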

Page 9: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

BitMat Construction (continued)

• Slice along the P dimension and store S-O and O-S BitMats
• Apply gap-encoding to each row of the BitMat before storing it
• Storage: 2|Vp| + |Vs| + |Vo| BitMats
• Additionally, store a condensed representation of the rows and columns, and the number of triples, for each of the 4 types of BitMats

Example triples:

Subject                  Predicate      Object
:the_matrix              :releasedIn    “1999”
:the_thirteenth_floor    :releasedIn    “1999”
:the_matrix              :similar_to    :the_matrix_reloaded
:the_thirteenth_floor    :similar_to    :the_matrix
:the_matrix              rdf:type       :movie
:the_thirteenth_floor    rdf:type       :movie

[Figure: bit-cube for the example triples with S-dimension, P-dimension, and O-dimension axes; labels SO1, SO2, O3, O4, P1, P2, P3; bit values not recoverable from the transcript]
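A minimal sketch of slicing along P and gap-encoding each row follows. It assumes one common form of gap encoding, storing the first bit followed by the lengths of the alternating runs; the slides do not spell out the exact on-disk format, and the names `gap_encode`, `gap_decode`, and `so_bitmats` are hypothetical.

```python
from collections import defaultdict


def gap_encode(bits):
    """Encode a list of 0/1 values as (first_bit, run_lengths)."""
    if not bits:
        return (0, [])
    runs = []
    current, length = bits[0], 0
    for b in bits:
        if b == current:
            length += 1
        else:
            runs.append(length)
            current, length = b, 1
    runs.append(length)
    return (bits[0], runs)


def gap_decode(first_bit, runs):
    """Inverse of gap_encode."""
    bits, value = [], first_bit
    for length in runs:
        bits.extend([value] * length)
        value ^= 1  # runs alternate between 0s and 1s
    return bits


def so_bitmats(id_triples, num_objects):
    """Slice along P: one S-O BitMat per predicate, each row gap-encoded."""
    rows = defaultdict(lambda: defaultdict(lambda: [0] * num_objects))
    for s, p, o in id_triples:
        rows[p][s][o - 1] = 1  # IDs are 1-based
    return {p: {s: gap_encode(bits) for s, bits in by_s.items()}
            for p, by_s in rows.items()}


# Hypothetical integer triples (subject ID, predicate ID, object ID).
id_triples = [(1, 1, 2), (2, 1, 2), (1, 2, 4), (2, 2, 1)]
bitmats = so_bitmats(id_triples, num_objects=4)
print(bitmats[1][1])  # row of subject 1 in the S-O BitMat of predicate 1
```

Only non-empty rows are kept in this sketch; the O-S BitMats would be built the same way with the roles of subject and object swapped.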

Page 10: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Operations on BitMat

• The join algorithm uses two basic operations: fold & unfold
• fold(BitMat, dimension) returns bitArray
  – Folds the input BitMat by retaining the given dimension
• unfold(BitMat, MaskBitArray, dimension)
  – Unfolds MaskBitArray onto the BitMat along the given dimension
• Fold & unfold operate by doing bitwise AND/OR operations on gap-compressed bit-vectors
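One way to read these two operations is sketched below on an uncompressed BitMat, stored as a list of Python integers used as row bitmasks. The semantics are an assumption based on the slide text: fold projects the BitMat onto one dimension (a bit is set if that row or column holds at least one 1), and unfold clears every row or column whose bit in the mask is 0. The real operations work on gap-compressed rows, which this sketch does not attempt.

```python
def fold(bitmat, dimension):
    """Project a BitMat (list of row bitmasks) onto 'row' or 'column'.

    Returns an int bitmask: bit i is set if row/column i holds at least one 1.
    """
    if dimension == "column":
        result = 0
        for row in bitmat:
            result |= row  # OR all rows together
        return result
    if dimension == "row":
        result = 0
        for i, row in enumerate(bitmat):
            if row:
                result |= 1 << i
        return result
    raise ValueError("dimension must be 'row' or 'column'")


def unfold(bitmat, mask, dimension):
    """Clear, in place, every row/column whose bit in mask is 0."""
    if dimension not in ("row", "column"):
        raise ValueError("dimension must be 'row' or 'column'")
    for i in range(len(bitmat)):
        if dimension == "column":
            bitmat[i] &= mask  # AND each row with the column mask
        elif not (mask >> i) & 1:
            bitmat[i] = 0


bitmat = [0b0110, 0b0000, 0b1000]  # bit j of a row stands for column j
columns = fold(bitmat, "column")
rows_before = fold(bitmat, "row")
unfold(bitmat, 0b0100, "column")   # keep only column 2
rows_after = fold(bitmat, "row")
print(bin(columns), bin(rows_before), bin(rows_after))
```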

Page 11: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Query Processing Algorithm

• Build a constraint graph
• E.g., the query (?m rdf:type :movie)(?n rdf:type :movie)(?m :similar_to ?n) has the constraint graph shown in the figure
• Each triple pattern has a BitMat containing only the triples matching that triple pattern
• Propagate the constraints on join variable bindings imposed by each triple pattern

[Figure: constraint graph with join-variable nodes ?m and ?n (Gjvar) and triple-pattern nodes (?m rdf:type :movie), (?m :similar_to ?n), (?n rdf:type :movie) (Gtp); edge labels SS and SO]
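The constraint graph for the example query can be sketched as follows. The construction rules here are an interpretation of the figure labels, not a statement of the authors' implementation: a join variable is one that occurs in more than one triple pattern, Gjvar links two join variables that occur in the same pattern, and Gtp links two patterns that share a join variable, labeled with the positions (S, P, O) at which it occurs. The helper names are hypothetical.

```python
from itertools import combinations

POSITIONS = "SPO"


def is_var(term):
    return term.startswith("?")


def constraint_graph(patterns):
    """Return (join_vars, gjvar_edges, gtp_edges) for a list of (s, p, o) patterns."""
    # Record where each variable occurs: (pattern index, position letter).
    occurrences = {}
    for idx, tp in enumerate(patterns):
        for pos, term in zip(POSITIONS, tp):
            if is_var(term):
                occurrences.setdefault(term, []).append((idx, pos))
    # A join variable occurs in more than one triple pattern.
    join_vars = {v for v, occ in occurrences.items()
                 if len({idx for idx, _ in occ}) > 1}

    # Gjvar: connect two join variables that appear in the same pattern.
    gjvar_edges = set()
    for tp in patterns:
        jv = sorted({t for t in tp if t in join_vars})
        gjvar_edges.update(combinations(jv, 2))

    # Gtp: connect two patterns sharing a join variable; label by positions.
    gtp_edges = {}
    for v in join_vars:
        for (i, pi), (j, pj) in combinations(occurrences[v], 2):
            if i != j:
                label = "".join(sorted((pi, pj), key=POSITIONS.index))
                gtp_edges.setdefault((i, j), set()).add(label)
    return join_vars, gjvar_edges, gtp_edges


patterns = [
    ("?m", "rdf:type", ":movie"),
    ("?m", ":similar_to", "?n"),
    ("?n", "rdf:type", ":movie"),
]
join_vars, gjvar_edges, gtp_edges = constraint_graph(patterns)
print(join_vars, gjvar_edges, gtp_edges)
```

For this query the sketch yields one Gjvar edge between ?m and ?n, and two Gtp edges labeled SS and SO, matching the labels that survive from the figure.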

Page 12: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Phase 1 -- Pruning phase

1. Embed a tree on the subgraph Gjvar
2. Walk over this tree from root to leaves and back in BFS order
3. At each node in the tree over Gjvar, collect all the variable bindings from the BitMats of the triple patterns containing that variable (fold operation)
4. Do a bitwise AND of all folded bit-arrays obtained
5. Relay the result of the bitwise AND back onto the BitMats (unfold operation)

• Simple optimizations:
  – Tree root selection: select the join variable with the fewest triples in its BitMats as the root of the tree over Gjvar
  – Early stopping: if at any point the bitwise AND of the folded bit-arrays is empty, stop, since the query can have no results
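The five steps above can be sketched on uncompressed data. To stay short, this sketch replaces each BitMat with a plain Python set of (subject ID, object ID) pairs, so fold becomes a projection, the bitwise AND becomes a set intersection, and unfold becomes a filter. The data layout, the function names, and the integer IDs are all hypothetical, and the BFS order over the tree on Gjvar is passed in rather than computed.

```python
def fold(pairs, index):
    """Bindings present at position index (0 = subject, 1 = object)."""
    return {pair[index] for pair in pairs}


def unfold(pairs, allowed, index):
    """Keep only the pairs whose value at position index is in allowed."""
    return {pair for pair in pairs if pair[index] in allowed}


def prune(patterns, bfs_order):
    """patterns: list of {"vars": (subj_var, obj_var), "pairs": set of (s, o)}.

    bfs_order: join variables in BFS order over the tree on Gjvar.
    Returns False as soon as a join variable has no bindings left.
    """
    # Root to leaves, then back toward the root.
    for var in bfs_order + bfs_order[-2::-1]:
        touching = [(tp, tp["vars"].index(var))
                    for tp in patterns if var in tp["vars"]]
        # Fold every BitMat containing var, then AND (intersect) the results.
        mask = None
        for tp, idx in touching:
            bindings = fold(tp["pairs"], idx)
            mask = bindings if mask is None else mask & bindings
        if not mask:
            return False  # early stopping: the query has no results
        # Unfold the mask back onto every BitMat containing var.
        for tp, idx in touching:
            tp["pairs"] = unfold(tp["pairs"], mask, idx)
    return True


# Hypothetical IDs: 1 and 2 are movies, 3 is another film, 9 stands for :movie.
patterns = [
    {"vars": ("?m", None), "pairs": {(1, 9), (2, 9)}},   # ?m rdf:type :movie
    {"vars": ("?m", "?n"), "pairs": {(1, 3), (2, 1)}},   # ?m :similar_to ?n
    {"vars": ("?n", None), "pairs": {(1, 9), (2, 9)}},   # ?n rdf:type :movie
]
ok = prune(patterns, ["?m", "?n"])
print(ok, [tp["pairs"] for tp in patterns])
```

The pass over ?n removes the pair (1, 3) because 3 is not typed as a movie, and the return pass over ?m then removes subject 1 from the first pattern, so only triples that can take part in a result remain.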

Page 13: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Pruning phase

[Figure: pruning example for the query (?m rdf:type :movie)(?m :similar_to ?n)(?n rdf:type :movie), showing the fold and unfold steps over ?m and ?n on the three BitMats; bit values not recoverable from the transcript]

In the reverse traversal, while propagating the effect of the join over “?n”, the fold of the 2nd BitMat yields the same bit-array as the earlier mask bit-array of ?m, so there is no need to fold/unfold the first BitMat again.

Page 14: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Phase 2 -- Final result generation

• Resembles a multi-way join

• Start with the triple pattern that has the fewest triples left in its BitMat

• Generate bindings for variables in that triple pattern

• Next, select another triple pattern which shares a join variable with any of the previously selected triple patterns

• Check if it can generate the same bindings for the shared join variable and generate bindings for its other variables

• Continue in this way; at the end of a round, when all triple patterns have been processed and all variables have consistent bindings, output the result
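A minimal sketch of this phase, written as a backtracking generator over already-pruned data, is shown below. As a simplification it uses sets of (subject ID, object ID) pairs in place of BitMats; the data layout, the function name `generate_results`, and the integer IDs are hypothetical. The patterns are those of the three-pattern sample query used in these slides (?a rdf:type :Person, ?a :worksFor ?b, ?c :departmentOf ?b).

```python
def generate_results(patterns):
    """Yield one dict of variable bindings per query result.

    patterns: list of {"vars": (subj_var, obj_var), "pairs": set of (s, o)};
    a position holding a constant has None in "vars".
    """
    # Start with the pattern that has the fewest triples left.
    remaining = sorted(patterns, key=lambda tp: len(tp["pairs"]))
    order = [remaining.pop(0)]
    while remaining:
        bound = {v for tp in order for v in tp["vars"] if v}
        # Prefer a pattern sharing a variable with those already selected.
        nxt = next((tp for tp in remaining if bound & set(tp["vars"])),
                   remaining[0])
        remaining.remove(nxt)
        order.append(nxt)

    def extend(depth, bindings):
        if depth == len(order):
            yield dict(bindings)  # every pattern agrees: output this result
            return
        tp = order[depth]
        for pair in sorted(tp["pairs"]):
            new = dict(bindings)
            consistent = True
            for var, value in zip(tp["vars"], pair):
                if var is None:
                    continue
                if new.setdefault(var, value) != value:
                    consistent = False  # clashes with an earlier binding
                    break
            if consistent:
                yield from extend(depth + 1, new)

    yield from extend(0, {})


patterns = [
    {"vars": ("?a", None), "pairs": {(1, 50)}},                    # ?a rdf:type :Person
    {"vars": ("?a", "?b"), "pairs": {(1, 7), (1, 8)}},             # ?a :worksFor ?b
    {"vars": ("?c", "?b"), "pairs": {(20, 7), (21, 7), (22, 8)}},  # ?c :departmentOf ?b
]
results = list(generate_results(patterns))
for row in results:
    print(row)
```

Because the function is a generator, each result is produced as soon as its bindings are consistent and no intermediate join table is materialized, which is in the spirit of the streaming result generation described earlier.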

Page 15: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Final result generation

Sample query:
?a rdf:type :Person
?a :worksFor ?b
?c :departmentOf ?b

[Figure: step-by-step binding of ?a, ?b, and ?c over the three pruned BitMats, with a Var/Val table; example values shown include :s1, :o2, :o3, :t3, :t4; a consistent set of bindings is marked “Output this result”; bit values not recoverable from the transcript]

Page 16: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Evaluation setup

• Competing RDF stores:
  – MonetDB
  – RDF-3X
• Datasets:
  – UniProt: protein dataset with ~845M triples, ~147M subjects, 95 predicates, and ~128M objects
  – LUBM: synthetic university dataset with ~1.33B triples, ~217M subjects, 18 predicates, and ~161M objects
• Queries:
  – UniProt: queries published by the UniProt dataset owners and RDF-3X
  – LUBM: queries published by OpenRDF
• Development environment:
  – Dell Optiplex 755 PC, 3.0 GHz Intel E6850 Core 2 Duo processor, 4 GB memory
  – 7 GB swap space on a 7200 rpm 1 TB disk
  – 64-bit 2.6.28-15 Linux kernel (Ubuntu 9.04 distribution)

Page 17: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Results

• For queries with low-selectivity triple patterns, BitMat outperformed MonetDB and RDF-3X by 2-3 orders of magnitude
• For highly selective triple patterns, RDF-3X gave superior performance, especially for queries that could benefit from sideways information passing (SIP)
• BitMat’s shortcomings in the case of highly selective queries:
  – The 2-phase query processing can create additional overhead for highly selective queries
  – No cache memory optimization
  – No memory mapping of disk files

Page 18: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

UniProt, 845 million triples (time in sec)

                  Q1(4)        Q2(7)        Q3(8)        Q4(4)        Q5(3)        Q6(7)        Q7(2)        Q8(12)
Cold cache
BitMat            451.365      269.526      173.324      9.396        78.35        1.34         9.33         13.06
MonetDB           548.21       303.213      124.356      9.63         97.28        11.28        9.91         15.93
RDF-3X            Aborted      525.125      224.58       1.38         4.636        0.902        0.892        1.353
Warm cache
BitMat            440.868      263.071      168.673      8.305        77.442       0.448        8.36         10.87
MonetDB           495.64       267.53       113.818      0.584        96.02        0.822        0.861        0.362
RDF-3X            Aborted      487.182      226.05       0.077        1.008        0.0064       0.003        0.03

#Results          160,198,689  90,981,843   50,192,929   0            179,316      0            0            19
#Initial triples  92,965,468   73,618,481   78,840,372   16,626,073   60,260,006   15,408,126   16,625,901   53,677,336

More results in the paper

Page 19: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

LUBM, 1.33 billion triples (time in sec)

                  Q1 (Circ)    Q2 (Star)    Q3 (Circ)    Q4 (Star)    Q5 (Star)    Q6
Cold cache
BitMat            51.21        2.71         6.56         2.45         0.503        3.81
MonetDB           548.21       27.17        455.23       34.12        18.89        14.6
RDF-3X            Aborted      34.868       2324.753     0.588        0.425        1.129
Warm cache
BitMat            48.57        2.11         1.94         0.686        0.27         2.85
MonetDB           96.65        6.56         398.46       3.209        0.566        0.542
RDF-3X            Aborted      29.033       2028.685     0.0024       0.0029       0.1814

#Results          2528         10,799,863   0            10           10           125
#Initial triples  165,397,764  224,805,759  219,416,877  438,912,513  3,000,966    9,100,649

Page 20: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Comparison of index storage space

            BitMat*     RDF-3X     MonetDB    Raw triples (uncompressed)
UniProt     51.2 GB     42 GB      16 GB      205 GB
LUBM        68.8 GB     70 GB      25 GB      451 GB

* Including the LZ77-compressed dictionary mapping
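As a quick worked check of what these sizes mean, the snippet below computes how many times smaller each index is than the raw, uncompressed triples. The figures are copied from the storage comparison above; the helper name `ratio_to_raw` is hypothetical.

```python
# Index sizes in GB, copied from the storage comparison.
sizes = {
    "UniProt": {"BitMat": 51.2, "RDF-3X": 42, "MonetDB": 16, "Raw": 205},
    "LUBM": {"BitMat": 68.8, "RDF-3X": 70, "MonetDB": 25, "Raw": 451},
}


def ratio_to_raw(dataset, system):
    """How many times smaller the index is than the raw triples."""
    row = sizes[dataset]
    return row["Raw"] / row[system]


for dataset in sizes:
    for system in ("BitMat", "RDF-3X", "MonetDB"):
        print(f"{dataset:8s} {system:8s} {ratio_to_raw(dataset, system):5.1f}x")
```

On these numbers the BitMat index, dictionary included, is roughly 4 times smaller than the raw UniProt triples and roughly 6.6 times smaller than the raw LUBM triples.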

Page 21: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Future Roadmap

• Does not allow a subset of variables to be specified in the SELECT clause
• Cannot process other classes of SPARQL queries, e.g., OPTIONAL, UNION, FILTER, etc.
• S-P or P-O dimensional joins are not handled
  – Rare in assertional RDF data
• Cannot perform addition/deletion/update of triples
• Incorporate lazy loading of BitMats to avoid overheads for highly selective queries

Page 22: Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Thank you!
