Scan-Sharing for Optimizing RDF Graph Pattern Matching on MapReduce

HyeongSik Kim, Padmashree Ravindra, Kemafor Anyanwu
{hkim22, pravind2, kogan}@ncsu.edu

COUL – Semantic COmpUting research Lab

Outline

Background
- RDF Graph Pattern Matching
- Graph Pattern Matching on MapReduce
- Queries with Repeated Properties (QRP)
- Nested TripleGroup Algebra (NTGA)

Challenges: Processing QRP with NTGA

Approach: TripleGroup Cloning
- Well-formed, Ambiguous, and Perfect TripleGroups
- TripleGroup Cloning in TG_GroupFilter

Evaluation

Related Work

The Growing Amount of RDF data

The amount of RDF on the web is rapidly growing.

Linked Data on the web:
- May 2007 - # of datasets: 12
- Sep 2011 - # of datasets: 295
- Growing #RDF triples: currently 31 billion

Example: DBPedia (http://dbpedia.org)
- A dataset extracted from Wikipedia.
- Contains 1 billion RDF triples.

RDF Data Model (Resource Description Framework)

How is knowledge represented in the Semantic Web? e.g., information on mobile device products.

Resource Description Framework (RDF) is used: the W3C standard data model for the Semantic Web.

Ex. “product1 has a name called iphone4” expressed as RDF.

Information is represented in the form of a triple:
- A subject: “product1”
- A property: “name”
- An object: “iphone4”

(:Product1, :name, :iphone4)

Data model is a directed labeled graph.
- Node: subject, object
- Labeled edge: property
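To make the triple and graph views concrete, here is a minimal Python sketch. The first triple is the slide's example; the second triple and the function name `to_labeled_graph` are assumptions added only for illustration.

```python
from collections import defaultdict

# Each RDF statement is a (subject, property, object) triple.
# The first triple is the slide's example; the second is a hypothetical
# addition used only to show a second labeled edge.
triples = [
    (":Product1", ":name", ":iphone4"),
    (":Producer1", ":design", ":Product1"),
]

def to_labeled_graph(triples):
    """Return {subject: [(property, object), ...]}: nodes with outgoing labeled edges."""
    graph = defaultdict(list)
    for subject, prop, obj in triples:
        graph[subject].append((prop, obj))
    return dict(graph)

graph = to_labeled_graph(triples)
```

Subjects and objects become nodes, and each property becomes the label of a directed edge from subject to object.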

[Figure: example RDF graph. Nodes: :Producer1, :Product1, :Product2, “iphone4”, “iphone5”, www.apple.com, $499. Edge labels: :name, :design, :homepage, :price, :date.]

Processing RDF Query (from the Viewpoint of Graph Pattern Matching)

Query variable is denoted with a question mark (e.g., ?product)

1. Example RDF Dataset: RDF graph on mobile devices

[Figure: example RDF graph. Oval: resources in the Web. Rectangle: literals. Visible nodes: :Producer1, :Product1, “iphone4”, “2011-10-14”, $499. Edge labels: :name, :design, :date, :price.]

2. Example RDF Query:

SELECT * WHERE {
  ?product :name ?productName .
  ?product :price ?productPrice .
  ?product :date ?productDate .
}

Graph Pattern: (Three) Triple Patterns

A star pattern whose subject variable is ?product
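As a rough illustration of what matching this star pattern means, the following Python sketch finds every subject that carries all three properties. The toy triples reuse values shown in the example tables; the function name `match_star` and the single-valued-property simplification are assumptions.

```python
from collections import defaultdict

# Toy data based on the example tables; ":Product2" has no :price triple here.
TRIPLES = [
    (":Product1", ":price", "$499"),
    (":Product1", ":name", "iphone 4"),
    (":Product1", ":date", "2011-10-14"),
    (":Product2", ":name", "iphone 5"),
    (":Product2", ":date", "2012-09-12"),
]

def match_star(triples, star_properties):
    """Return one binding dict per subject that has every property of the star."""
    by_subject = defaultdict(dict)
    for s, p, o in triples:
        by_subject[s][p] = o  # assumption: each property is single-valued
    matches = []
    for s, props in by_subject.items():
        if all(p in props for p in star_properties):
            binding = {"?product": s}
            binding.update({p: props[p] for p in star_properties})
            matches.append(binding)
    return matches

matches = match_star(TRIPLES, [":name", ":price", ":date"])
```

Only :Product1 matches in this toy data, because :Product2 lacks a :price triple.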

Processing RDF Query(based on Relational Algebra)

Implicit joins on ?product

1. Example RDF Dataset (relation R)

Subject     Property   Object
:Product1   :price     “$499”
:Product1   :name      “iphone 4”
:Product1   :date      “2011-10-14”
…           …          …
:Product2   …          …
:Product2   :date      “2012-09-12”
…           …          …

2. Example RDF Query:

SELECT * WHERE {
  ?product :name ?productName .
  ?product :price ?productPrice .
  ?product :date ?productDate .
}

3. Conceptual Execution Plan

(R ⋈ R) ⋈ R, each join on (Subject = Subject)
Each input is a separate scan: first scan of relation R, second scan of relation R, third scan of relation R.

4. (Intermediate) Result:

(:Product1, :name, “iphone 4”)
(:Product2, :name, “iphone 5”)
(:Product1, :name, “iphone 4”, :Product1, :price, “$499”)
(:Product1, :name, “iphone 4”, :Product1, :price, “$499”, :Product1, :date, “2011-10-14”)
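The conceptual plan can be mimicked in a few lines of Python to show where the three scans of R come from. This is only a sketch under assumptions: the helper names `scan_select` and `join_on_subject`, the scan counter, and the toy contents of R are illustrative and not part of any system described here.

```python
# Relation R as (subject, property, object) rows; values follow the example tables.
R = [
    (":Product1", ":price", "$499"),
    (":Product1", ":name", "iphone 4"),
    (":Product1", ":date", "2011-10-14"),
    (":Product2", ":name", "iphone 5"),
    (":Product2", ":date", "2012-09-12"),
]

stats = {"scans": 0}  # counts full passes over R

def scan_select(relation, prop):
    """Full scan of the relation, keeping only triples with the given property."""
    stats["scans"] += 1
    return [t for t in relation if t[1] == prop]

def join_on_subject(left, right):
    """Nested-loop equi-join on the first field (Subject = Subject)."""
    return [l + r for l in left for r in right if l[0] == r[0]]

step1 = scan_select(R, ":name")                            # first scan of R
step2 = join_on_subject(step1, scan_select(R, ":price"))   # second scan of R
result = join_on_subject(step2, scan_select(R, ":date"))   # third scan of R
```

Each triple pattern costs one pass over R, which is the redundancy the rest of the talk tries to avoid.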

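Overview of MapReduce

[Figure: MapReduce data flow. Mappers M1, M2, M3 read from HDFS and write sorted intermediate blocks (b1,1, b1,2, b2,1, b2,2, b3,1, b3,2) to local disk; reducers R1 and R2 merge them and write to HDFS. Legend: ↓ = sort, ☼ = merge.]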

MapReduce (MR): Large-scale data processing system running on a cluster of machines. [DEAN08]

Encode tasks in terms of low level code as map/reduce functions, which are executed in parallel across the cluster.

1. Map(k1, v1) → list(k2, v2): read the data; execute user’s map function; sort and write intermediate data.

2. Reduce(k2, list(v2)) → list(v3): sort and merge input; execute user’s reduce function; transfer result to HDFS.
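The two signatures can be exercised with a small in-memory simulation. The sketch below is an assumption-laden toy (the driver `run_mapreduce` and the property-counting job are hypothetical and single-process); it only illustrates the map, sort, group, reduce sequence, not Hadoop itself.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate one MR cycle in memory: map, sort by key, group, reduce."""
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))       # Map(k1, v1) -> list(k2, v2)
    intermediate.sort(key=itemgetter(0))          # stands in for sort/merge
    output = []
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        values = [v2 for _, v2 in group]
        output.extend(reduce_fn(k2, values))      # Reduce(k2, list(v2)) -> list(v3)
    return output

# Hypothetical job: count triples per property.
TRIPLES = [
    (":Product1", ":price", "$499"),
    (":Product1", ":name", "iphone 4"),
    (":Product1", ":date", "2011-10-14"),
    (":Product2", ":name", "iphone 5"),
    (":Product2", ":date", "2012-09-12"),
]

def map_fn(offset, triple):
    return [(triple[1], 1)]

def reduce_fn(prop, counts):
    return [(prop, sum(counts))]

counts = run_mapreduce(list(enumerate(TRIPLES)), map_fn, reduce_fn)
```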

[NYKIEL10]

Join Processing on MapReduce

Example: Equi-join operation with the first column of relation L and R [BLANAS10]

L: (k1, v5), (k2, v4)
R: (k2, v1), (k3, v6)

Map (M1, M2):
- Extract the join column
- Add a tag of either L or R
- Annotate tuples with join key

Map output:
(k1, (L: k1, v5))
(k2, (L: k2, v4))
(k2, (R: k2, v1))
(k3, (R: k3, v6))

Reduce (R1, R2, R3):
- Separate and buffer the input records into two sets according to the table tag (L or R)
- Perform a cross-product

Reduce input:
(k1, ((L: k1, v5)))
(k2, ((L: k2, v4), (R: k2, v1)))
(k3, ((R: k3, v6)))

Result (written to HDFS): (k2, v4, k2, v1)
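The tagged (repartition) join described on this slide can be sketched in plain Python. The relations L and R use the slide's values; the function names and the in-memory shuffle are assumptions made for illustration.

```python
from collections import defaultdict

L = [("k1", "v5"), ("k2", "v4")]
R = [("k2", "v1"), ("k3", "v6")]

def map_tag(table_tag, record):
    """Map: extract the join column (first field) and tag the record with L or R."""
    return record[0], (table_tag, record)

def reduce_join(key, tagged_records):
    """Reduce: split the records by tag, then emit the cross-product."""
    left = [rec for tag, rec in tagged_records if tag == "L"]
    right = [rec for tag, rec in tagged_records if tag == "R"]
    return [l + r for l in left for r in right]

def repartition_join(left_table, right_table):
    groups = defaultdict(list)  # stands in for the shuffle: join key -> tagged records
    for tag, table in (("L", left_table), ("R", right_table)):
        for record in table:
            key, value = map_tag(tag, record)
            groups[key].append(value)
    result = []
    for key in sorted(groups):
        result.extend(reduce_join(key, groups[key]))
    return result

joined = repartition_join(L, R)
```

Keys k1 and k3 reach a reducer with records from only one side, so their cross-product is empty and only the k2 pair survives.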

Processing Multi-Join Query on MapReduce

1. (Extended) Example Query

SELECT * WHERE {
  ?product :name ?productName .
  ?product :price ?productPrice .
  ?producer :design ?product .
  ?producer :type ?ProducerType .
}

2. Vertical Partitioning (VP): [ABADI07]

Partition relation R vertically based on the value of the property attribute.

E.g., property relations name, price, design, and type can be generated using selection or split operators:

name = σ(property = :name)(R), price = σ(property = :price)(R), design = σ(property = :design)(R), type = σ(property = :type)(R)

3. Corresponding Logical Plan based on VP

(name ⋈ (subject = subject) price) ⋈ (subject = object) (design ⋈ (subject = subject) type)

4. MapReduce Plans

MR Job: temp1 = name ⋈ (subject = subject) price
MR Job: temp2 = temp1 ⋈ (subject = object) design
MR Job: output = temp2 ⋈ (subject = subject) type

Cost =
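A small Python sketch of vertical partitioning followed by a chain of three joins, one per MR job, may help connect the steps. The toy rows, the helper `vertical_partition`, and the tuple layouts of temp1, temp2, and output are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical rows of relation R for the extended example query.
R = [
    (":Product1", ":name", "iphone 4"),
    (":Product1", ":price", "$499"),
    (":Producer1", ":design", ":Product1"),
    (":Producer1", ":type", ":Producer"),
]

def vertical_partition(relation):
    """Split R into one (subject, object) relation per property value."""
    parts = defaultdict(list)
    for s, p, o in relation:
        parts[p].append((s, o))
    return dict(parts)

parts = vertical_partition(R)

# MR job 1: name JOIN price on subject = subject
temp1 = [(s, n, pr) for s, n in parts[":name"]
         for s2, pr in parts[":price"] if s == s2]

# MR job 2: temp1 JOIN design on temp1.subject = design.object
temp2 = [(d_s,) + row for row in temp1
         for d_s, d_o in parts[":design"] if row[0] == d_o]

# MR job 3: temp2 JOIN type on subject = subject (the producer)
output = [row + (t,) for row in temp2
          for t_s, t in parts[":type"] if row[0] == t_s]
```

Each comprehension stands in for one MapReduce cycle whose result would be written to HDFS and read back by the next job.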

Query Optimization on MapReduce

Heuristic to group operations -> fewer MR jobs in a workflow.
- Group multiple join operations on the same key in the same MR cycle. (Pig)
- Finding the optimal grouping is NP-hard; more advanced techniques use a greedy approach that groups non-conflicting joins as much as possible. [HUSAIN11]

1. (Extended) Example Query

SELECT * WHERE {
  ?product :name ?productName .
  ?product :price ?productPrice .
  ?product :date ?productDate .
  ?producer :design ?product .
  ?producer :type ?ProducerType .
}

2. Corresponding Logical Plan based on VP

(name ⋈ (subject = subject) price ⋈ (subject = subject) date) ⋈ (subject = object) (design ⋈ (subject = subject) type)

3. MapReduce Plan

MR Job: temp1 = name ⋈ price ⋈ date (subject = subject)
MR Job: temp2 = design ⋈ type (subject = subject)
MR Job: output = temp1 ⋈ (subject = object) temp2

Queries with “Repeated” Properties

Query: We want to see the list of the products with detailed information, along with their producer information (e.g., the company name, the type of company, and its foundation date).

1. Example Query

SELECT * WHERE {
  ?product :name ?prodName .
  ?product :type ?prodType .
  ?product :date ?prodDate .
  ?product :price ?prodPrice .
  ?producer :design ?product .
  ?producer :name ?prcName .
  ?producer :type ?prcType .
  ?producer :date ?prcDate .
}

[Figure: VP-based MapReduce plan with four MR jobs reading from and writing to HDFS. J1: TS(R) followed by SPLIT into the property relations price, name, type, date, design. J2 and J3: one star JOIN each, scanning TS(name), TS(type), TS(date) together with TS(price) or TS(design). J4: JOIN of the two intermediate results (price, name, …) and (name, type, …). TS: TableScan (Load) operator.]

Issue: name, type, date are scanned repeatedly across MR jobs J2, J3

Possible Optimization Considerations:
- Minimize scan overhead using indexes. MapReduce does not support any indexes by default.
- Buffer such relations across multiple joins (memory intensive)

Another approach: Algebraic Optimization
- Rewrite queries into equivalent but less expensive ones.

General Intuition in NTGA

Nested TripleGroup Algebra (NTGA): Re-interpreting multiple star-joins as a grouping operation leads to “groups of Triples” (TripleGroups) instead of n-tuples. [RAVINDRA11]

1. Example Query

SELECT * WHERE {
  1: ?x :p1 ?o1 .
  2: ?x :p2 ?o2 .
  3: ?y :p3 ?o2 .
  4: ?y :p4 ?o3 .
}

2. Input Triples

Subject   Property   Object
:s1       :p1        :o1
:s1       :p2        :o2
:s2       :p3        :o3
:s2       :p4        :o4
…         …          …

VP: 1 MR job for each star pattern → 2 MR jobs! (one MR job for each of the star patterns with subject variables ?x and ?y)

p1 ⋈ (subject=subject) p2
p3 ⋈ (subject=subject) p4

t1 = (:s1, :p1, :o1, :s1, :p2, :o2)
t2 = (:s2, :p3, :o3, :s2, :p4, :o4)

NTGA: 1 MR job for all star patterns!

tg1 = {(:s1, :p1, :o1), (:s1, :p2, :o2)}
tg2 = {(:s2, :p3, :o3), (:s2, :p4, :o4)}

different structure BUT “content equivalent”
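The grouping idea can be sketched as a single pass that collects triples by subject, which is roughly what a TG_GroupBy-style operator produces. The function name `tg_group_by` and the dictionary representation are assumptions for illustration, not the actual operator implementation.

```python
from collections import defaultdict

INPUT_TRIPLES = [
    (":s1", ":p1", ":o1"),
    (":s1", ":p2", ":o2"),
    (":s2", ":p3", ":o3"),
    (":s2", ":p4", ":o4"),
]

def tg_group_by(triples):
    """Group triples by subject in one pass; each group is one TripleGroup."""
    groups = defaultdict(list)
    for triple in triples:
        groups[triple[0]].append(triple)
    return dict(groups)

triplegroups = tg_group_by(INPUT_TRIPLES)
tg1 = triplegroups[":s1"]
tg2 = triplegroups[":s2"]
```

Both star patterns are served by the same pass over the input, instead of one join job per star.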

Processing RDF Query with NTGA

1. Example Query

SELECT * WHERE {
  ?product :name ?prodName .
  ?product :type ?prodType .
  ?product :date ?prodDate .
  ?product :price ?prodPrice .
  …
}

[Figure: the VP plan and the NTGA plan side by side. VP: 4 MR jobs (4 HDFS reads): J1 (TS(R), SPLIT into price, name, type, date, design), J2 and J3 (star JOINs), J4 (JOIN). NTGA: 2 MR jobs (2 HDFS reads): J1 (TS(R), TG_GroupBy, TG_GroupFilter for the property sets {:name, :type, :date, :price} and {:design, :name, :type, :date}), J2 (TS(Rpltd), TS(Rltds), TG_JOIN). Further NTGA operators shown: TG_Flatten, TG_Unnest. TS: TableScan (Load) operator.]

A “Key” NTGA Operator: TG_GroupFilter

- Retain only TripleGroups that satisfy the required query substructure.
- Check for an “exact” match between the set of properties in a star pattern and the properties of a TripleGroup.

Example Query:

SELECT * WHERE {
  1: ?x :p1 :o1 .
  2: ?x :p2 ?y .
  3: ?y :p3 :o2 .
  4: ?y :p4 :o3 .
}

Star patterns (property sets): (:p1, :p2) and (:p3, :p4)

Input TripleGroups:

tg1 = {(:s1, :p1, :o1), (:s1, :p2, :o2)}
tg2 = {(:s2, :p2, :o2), (:s2, :p3, :o3)}

tg1: (:p1, :p2) = (:p1, :p2). Correct match. Therefore, tg1 passes.
tg2: (:p2, :p3) matches neither (:p1, :p2) nor (:p3, :p4). No matches. Therefore, tg2 is filtered out.
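The exact-match check can be sketched as a set comparison between a TripleGroup's properties and each star pattern's properties. This is a simplified illustration: the function name `tg_group_filter` is an assumption, and constant objects in the query (such as :o1) are deliberately ignored here.

```python
# Property sets of the two star patterns in the example query.
STAR_PATTERNS = [frozenset({":p1", ":p2"}), frozenset({":p3", ":p4"})]

tg1 = [(":s1", ":p1", ":o1"), (":s1", ":p2", ":o2")]
tg2 = [(":s2", ":p2", ":o2"), (":s2", ":p3", ":o3")]

def tg_group_filter(triplegroups, star_patterns):
    """Keep a TripleGroup only if its property set equals some star pattern's set."""
    kept = []
    for tg in triplegroups:
        props = frozenset(p for _, p, _ in tg)
        if props in star_patterns:
            kept.append(tg)
    return kept

kept = tg_group_filter([tg1, tg2], STAR_PATTERNS)
```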

TG_GroupFilter Semantics and Repeated Properties

TG_GroupFilter assumes a 1-1 correspondence between TripleGroups and star subpatterns. But with repeated properties there can be ambiguities.

1. Given triple patterns (star subpatterns stp1 and stp2)

SELECT * WHERE {
  ?product :name ?prodname .
  ?product :type ?prodType .
  ?product :date ?prodDate .
  ?product :price ?prodPrice .
  …
}

2. A TripleGroup from TG_GroupBy

tg0 = {s1 :type o1, s1 :name o2, s1 :date o3, s1 :price o4, s1 :design o5}

stp1? stp2? (Partial match with stp1 and stp2)

Overview of the Solution

Issue: Mappings between TripleGroups and star patterns become ambiguous if repeated properties exist across multiple star patterns.

Goal: Produce TripleGroups that are an exact match with a star pattern in a query.

Solution: Split the filtering process into two steps.

1. Remove incomplete TripleGroups that do not match any star pattern (i.e., eliminate non-well-formed TripleGroups).

2. Resolve the ambiguity of the remaining TripleGroups that may match multiple star patterns (Ambiguous TripleGroups) and generate TripleGroups that are an exact match with a star pattern (Perfect TripleGroups).

Well-formed TripleGroup

Well-formed TripleGroup: a TripleGroup consisting of triples that contain all the properties of some star subpattern.

1. Example Query

SELECT * WHERE {
  ?product :name ?prodname .        (stp1)
  ?product :date ?prodDate .
  ?product :price ?prodPrice .
  ?producer :design ?product .      (stp2)
  ?producer :name ?prcname .
  ?producer :date ?prcdate .
}

2. TripleGroups generated from TG_GroupBy

tg1 = {s1 :name :o1, s1 :date :o2, s1 :price :o3}
well-formed (contains the properties from stp1)

tg2 = {s1 :name :o1, s1 :date :o2, s1 :price :o3, s1 :design :o4}
well-formed (contains the properties from stp1 and stp2)

tg3 = {s1 :name :o4, s1 :design :o3}
NOT well-formed (does not contain all the properties from stp1 or stp2)
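The well-formedness test amounts to a subset check, sketched below in Python. The property sets follow the example query; the function name `is_well_formed` and the tuple representation of triples are assumptions.

```python
STP1 = {":name", ":date", ":price"}    # ?product star
STP2 = {":design", ":name", ":date"}   # ?producer star

tg1 = [("s1", ":name", ":o1"), ("s1", ":date", ":o2"), ("s1", ":price", ":o3")]
tg2 = tg1 + [("s1", ":design", ":o4")]
tg3 = [("s1", ":name", ":o4"), ("s1", ":design", ":o3")]

def is_well_formed(tg, star_patterns):
    """True if the TripleGroup contains all properties of at least one star pattern."""
    props = {p for _, p, _ in tg}
    return any(stp <= props for stp in star_patterns)

results = [is_well_formed(tg, [STP1, STP2]) for tg in (tg1, tg2, tg3)]
```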

Ambiguous & Perfect TripleGroup

Ambiguous TripleGroup: a well-formed TripleGroup that can be matched with multiple star subpatterns in a query, e.g. tg2.

Perfect TripleGroup: a well-formed TripleGroup which is an exact match for a single star pattern* (valid intermediate answers).

* a single star pattern “class”

1. Example Query (star subpatterns stp1 and stp2)

…
?producer :design ?product .
?producer :name ?prcname .
?producer :date ?prcdate .
}

2. TripleGroups generated from TG_GroupBy

tg1: Perfect TripleGroup (“exact” match with stp1)
tg2: Ambiguous TripleGroup (can be matched with stp1 and stp2)
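One way to read these definitions is as a small classification routine. The sketch below is an interpretation, not the authors' code: the function `classify` and its return labels are assumptions, and the case of a single matching star pattern with extra properties is not named on the slides, so it is reported simply as "well-formed".

```python
STP1 = frozenset({":name", ":date", ":price"})
STP2 = frozenset({":design", ":name", ":date"})

tg1 = [("s1", ":name", ":o1"), ("s1", ":date", ":o2"), ("s1", ":price", ":o3")]
tg2 = tg1 + [("s1", ":design", ":o4")]
tg3 = [("s1", ":name", ":o4"), ("s1", ":design", ":o3")]

def classify(tg, star_patterns):
    """Label a TripleGroup by how many star patterns its properties cover."""
    props = frozenset(p for _, p, _ in tg)
    covered = [stp for stp in star_patterns if stp <= props]
    if not covered:
        return "not well-formed"
    if len(covered) == 1 and props == covered[0]:
        return "perfect"
    if len(covered) > 1:
        return "ambiguous"
    # One covered pattern plus extra properties: not named on the slides.
    return "well-formed"

labels = [classify(tg, [STP1, STP2]) for tg in (tg1, tg2, tg3)]
```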

Dealing with Ambiguous TripleGroups

Perfect TripleGroups tg1 and tg2 are cloned from the ambiguous TripleGroup tg0, and the non-perfect TripleGroup tg3 is rejected.

Example Query (star subpatterns stp1, stp2, stp3)

…
?producer :design ?product .
?producer :name ?prcname .
?producer :date ?prcdate .
?seller :sell ?product .
?seller :name ?selName .
}

tg0 = Ambiguous TripleGroup

Clone(:name, :date, :price) → tg1 = {s1 :name :o1, s1 :date :o2, s1 :price :o3} (Perfect TripleGroup, stp1)
Clone(:design, :name, :date) → tg2 = {s1 :design :o4, s1 :name :o1, s1 :date :o2} (Perfect TripleGroup, stp2)
Clone(:sell, :name) → tg3 = {s1 :sell ??, s1 :name :o1} (rejected: tg0 has no :sell triple, stp3)
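Cloning can be pictured as projecting the ambiguous TripleGroup onto each star pattern it fully covers. In the sketch below the contents of tg0 are an assumption (the union of the triples shown for tg1 and tg2), and `clone_perfect` is a hypothetical helper name.

```python
STAR_PATTERNS = {
    "stp1": {":name", ":date", ":price"},
    "stp2": {":design", ":name", ":date"},
    "stp3": {":sell", ":name"},
}

# Assumed contents of the ambiguous TripleGroup tg0.
tg0 = [
    ("s1", ":name", ":o1"),
    ("s1", ":date", ":o2"),
    ("s1", ":price", ":o3"),
    ("s1", ":design", ":o4"),
]

def clone_perfect(tg, star_patterns):
    """For each star pattern fully covered by tg, emit a clone restricted to its properties."""
    props = {p for _, p, _ in tg}
    clones = {}
    for name, stp in star_patterns.items():
        if stp <= props:  # otherwise the clone would be incomplete and is rejected
            clones[name] = [t for t in tg if t[1] in stp]
    return clones

clones = clone_perfect(tg0, STAR_PATTERNS)
```

The :sell star pattern yields no clone because tg0 has no :sell triple, which mirrors the rejected tg3 above.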

Generated MR Plan

NTGA-based MapReduce Plan (m:op : Map-side Operator, r:op : Reduce-side Operator)

J1: Map: m:TG_GroupBy
J1: Reduce: r:TG_GroupBy, r:TG_GroupFilter* (Revised)
J2: Map: m:TG_JOIN (?o1 = ?o1)
J2: Reduce: r:TG_JOIN

Clone in TG_GroupFilter

Example Query

…
?producer :design ?product .
?producer :name ?prcname .
?producer :date ?prcdate .
?seller :sell ?product .
?seller :name ?selName .
}

[Figure: TG_GroupFilter takes the TripleGroup tg0 and emits the clones tg1 and tg2 = {s1 :design :o4, s1 :name :o1, s1 :date :o2}.]
Losslessness of Revised TG_GroupFilter

Example Dataset

Subject   Property   Object
:s1       :name      :o1
:s1       :date      :o2
:s1       :price     :o3
:s1       :design    :o4
…         …          …

1. Relational Algebra (VP)

1) name ⋈(subject=subject) date ⋈(subject=subject) price
t1 = (:s1, :name, :o1, :s1, :date, :o2, :s1, :price, :o3)

2) design ⋈(subject=subject) name ⋈(subject=subject) date
t2 = (:s1, :design, :o4, :s1, :name, :o1, :s1, :date, :o2)

2. NTGA

tg0 (clone) → tg1, tg2
tg1 = {s1 :name :o1, s1 :date :o2, s1 :price :o3}
tg2 = {s1 :design :o4, s1 :name :o1, s1 :date :o2}

tg1 and tg2 are content equivalent to t1 and t2: no valid intermediate results are destroyed, nor are spurious results introduced by cloning.

- Filter out non-well-formed TripleGroups. An incomplete TripleGroup that does not contain all the properties of any star pattern clearly does not match any star pattern in a query.
- Generate multiple Perfect TripleGroups from an ambiguous TripleGroup.

Setup and TestBed

Setup:
- Implement VP and NTGA on top of Apache Pig.
- 10-node Hadoop clusters on NCSU’s VCL*.

Three approaches were considered:
- 1-join-per-cycle (SHARD) [ROHLOFF10]
- 1-star-join-per-cycle (Pig-Def or VP)
- all-star-joins-1-cycle (NTGA)

Evaluation of the redundant scans during star-join computations:
- Task 1a – varying the ratio of repeated properties to fixed ones.
- Task 1b – varying the selectivity of repeated properties.
- Task 2 – scaling up sub patterns with repeated properties.
- Task 3 – scalability test with varying data size.

* https://vcl.ncsu.edu

Dataset

Dataset: Synthetic benchmark dataset generated using BSBM*
- From 22GB (250k Products, BSBM-250k ~86M triples)
- Up to 87GB (1M Products, BSBM-1000k ~350M triples)

7 repeated properties:
- Across all classes, e.g. type, publisher
- Only for a smaller subset of classes, e.g. name

The size and selectivity** of BSBM-250k:
- :publisher - 1.7GB, 0.091
- :type - 1.8GB, 0.105
- :name - 49MB, 0.003
- :date - 1.4GB, 0.091

* http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/
** selectivity of property P = |T_P| / |T|, where T_P denotes triples containing P and T denotes all triples
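Assuming selectivity here means the fraction of all triples that contain a given property, it can be computed as in the following sketch. The toy triples and the function name `selectivity` are illustrative assumptions, not BSBM data.

```python
# Toy triples for illustration only (not BSBM data).
TRIPLES = [
    (":Product1", ":price", "$499"),
    (":Product1", ":name", "iphone 4"),
    (":Product1", ":date", "2011-10-14"),
    (":Product2", ":name", "iphone 5"),
    (":Product2", ":date", "2012-09-12"),
]

def selectivity(triples, prop):
    """Fraction of triples whose property equals prop (0.0 for an empty input)."""
    if not triples:
        return 0.0
    matching = sum(1 for _, p, _ in triples if p == prop)
    return matching / len(triples)

name_selectivity = selectivity(TRIPLES, ":name")
```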

Task 1a: Varying the Ratio of Repeated Properties to Fixed ones.

Test Queries (dq0 to dq4)
- Two star patterns with a fixed subset of unique properties + varying #repeated properties in the second star pattern (from 0 to 4).
- Overall #triple patterns increases from 8 to 12.

dq0: 2 star patterns, 0 repeated properties.
dq1: 1 repeated property.
dq2: 2 repeated properties.
dq3: 3 repeated properties.
dq4: 2 star patterns, 4 repeated properties (:type, :publisher, :name, :date).

[Figure: query graphs for dq0 to dq4 with edges :publisher, :name, :type, :date. Black edge: arbitrary unique property. Red edge: repeated property.]

Pig-Def (4 MR cycles), NTGA (2 cycles), SHARD (13 cycles)

[Figure: three charts over dq0 to dq4 showing HDFS_READ (GB), Time (Seconds), and HDFS_WRITE (GB) for 1-star-join-per-cycle (Pig-Def), 1-join-per-cycle (SHARD), and all-star-joins-1-cycle (NTGA); plus a timeline of Map_Start times for Pig-Def (MR1 to MR4) and SHARD (MR1 to MR12).]

With increasing #repeated properties:
1. NTGA: constant HDFS reads and execution time; fewer HDFS writes due to the smaller number of required MR jobs.
2. SHARD: #scans of the whole relation increases.
3. Pig-Def or VP: #scans of the property relations increases.

Task 1b: Varying the Size of Repeated Props

Test Queries – rq1 and rq2

Identical queries with two star subpatterns, but each contains a different repeated property.
- rq1: two star patterns with repeated property :publisher (1.7GB, 9.1%)
- rq2: two star patterns with repeated property :name (49MB, 0.3%)

- NTGA has around 42% performance gain over Pig-Def for rq2 and increases to around 48% gain for rq1.
- With rq2, Pig-Def always uses additional 70 seconds than rq1.

Task 2: Scaling up Sub patterns with Repeated Properties

Four queries (mq1 ~ mq4)
- Two repeated properties (:publisher, :type) occur in each of the star subpatterns.
- Vary the number of star patterns (1 to 4).
- The total number of repeated properties increases across a graph pattern query: from 2 (in mq1) to 8 (in mq4).

mq1: a single star pattern
mq2: two star patterns
mq3: three star patterns

[Figure: Time (Second) and HDFS_READ (GB) for mq1 to mq4, comparing 1-star-join-per-cycle (Pig-Def), 1-join-per-cycle (SHARD), and all-star-joins-1-cycle (NTGA); annotated with ≈40G, ≈80G, ≈120G.]

mq1 → mq4: ↑ #star patterns → ↑ #repeated properties across star patterns (from 2 to 8), ↑ amount of scan-sharing across star patterns (from around 40G to 120G).

Execution time increases due to the join operations for connecting sub stars.

Task 3: Varying Size of Graphs

Increases #RDF triples for query dq4 used in Task 1a, from BSBM-250k (22GB) to BSBM-1000k (86GB).

[Figure: Execution Time (in seconds) of Pig-Def and NTGA for BSBM-250k (22GB), BSBM-500k (43GB), BSBM-750k (66GB), and BSBM-1000k (86GB); performance gain labels: 52.8%, 54.8%, 55%, 58%.]

NTGA approach scales well.
- Performance gain is observed from 52% to 58%.
- The size of the relations containing repeated properties does not increase linearly when increasing the size of the data.

Related Work

RDF Data Processing on MapReduce:
- SHARD [ROHLOFF10]: The clause-iteration algorithm (n + 1 jobs to process n triple patterns)
- HadoopDB [HUANG11]: A hybrid architecture of database (RDF-3X) and Hadoop with a graph partitioning scheme.
- HadoopRDF [HUSAIN11]: A customized storage format and plan generation based on a heuristic greedy approach.

Work Sharing on MapReduce:
- MRShare [NYKIEL10]: Inter-query sharing scheme customized into the MapReduce framework.
- NOVA [OLSTON11]: Share the initial load operation if multiple copies of workflow use the identical input.
- CoScan [WANG11]: Minimize redundant data loading by merging multiple Pig scripts.

Relevant Publications

Kim, H., Ravindra, P., Anyanwu, K.: Scan-Sharing for Optimizing RDF Graph Pattern Matching on MapReduce. In: Proc. CLOUD (2012)

Anyanwu, K., Kim, H., Ravindra, P.: Algebraic Optimization for Processing Graph Pattern Queries in the Cloud. IEEE Internet Computing (2012)

Kim, H., Ravindra, P., Anyanwu, K.: From SPARQL to MapReduce: The Journey using a Nested TripleGroup Algebra. In: Proc. International Conference on Very Large Data Bases (2011) – (Demonstration).

Ravindra, P., Kim, H., Anyanwu, K.: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Platforms. In: Proc. Extended Semantic Web Conference (2011)

References

[DEAN08] Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51 (2008) 107–113
[OLSTON08] Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: Proc. International Conference on Management of data. (2008)
[HUSAIN11] M. F. Husain, J. McGlothlin et al., “Heuristics-Based Query Processing for Large RDF Graphs Using Cloud Computing,” TKDE, vol. 23, pp. 1312–1327, 2011.
[HUANG11] J. Huang, D. J. Abadi et al., “Scalable SPARQL Querying of Large RDF Graphs,” Proc. VLDB, vol. 4, no. 11, 2011.
[NYKIEL10] T. Nykiel, M. Potamias et al., “MRShare: Sharing across Multiple Queries in MapReduce,” Proc. VLDB, vol. 3, pp. 494–505, 2010.
[OLSTON11] C. Olston, G. Chiou et al., “Nova: Continuous Pig/Hadoop Workflows,” in Proc. SIGMOD, 2011, pp. 1081–1090.
[WANG11] X. Wang, C. Olston et al., “CoScan: Cooperative Scan Sharing in the Cloud,” in Proc. SOCC, 2011, pp. 11:1–11:12.
[RAVINDRA11] P. Ravindra, H. Kim et al., “An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce,” in Proc. ESWC, 2011, vol. 6644, pp. 46–61.
[ABADI07] D. J. Abadi, A. Marcus et al., “Scalable Semantic Web data Management using Vertical Partitioning,” in Proc. VLDB, 2007.
[ROHLOFF10] K. Rohloff and R. E. Schantz, “High-performance, Massively Scalable Distributed Systems using the MapReduce Software Framework: the SHARD Triple-store,” in PSI EtA, 2010, pp. 4:1–4:5.
[NEUMANN10] T. Neumann and G. Weikum, “The RDF-3X engine for scalable management of RDF data,” The VLDB Journal, vol. 19, pp. 91–113, 2010.
[WEISS08] C. Weiss, P. Karras, and A. Bernstein, “Hexastore: Sextuple Indexing for Semantic Web Data Management,” Proc. VLDB, vol. 1, no. 1, 2008.
[HERODOTOU11] H. Herodotou and S. Babu, “Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs,” Proc. VLDB, vol. 4, 2011.
[BLANAS10] S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian, “A Comparison of Join Algorithms for Log Processing in MapReduce,” Proc. SIGMOD, 2010.

Thank You!

RDF Data Model (Resource Description Framework)

1. Statements (triples)

Subject      Property     Object
:Product1    :name        “iphone4”
:Product1    :color       :white
:Product1    :date        “2011-10-14”
:Product1    :publisher   :Producer1
…            …            …
:Producer1   :name        “Apple”
:Producer1   :type        :Producer
:Producer1   :date        “1976-04-01”
:Producer1   :homepage    apple.com

2. Graph Representation

[Figure: the same triples drawn as a directed labeled graph. Oval: resources, i.e. URIs. Rectangle: literals.]

Star subgraphs - set of edges with the same subject, e.g. :Product1 and :Producer1

Relationship between TripleGroups and n-tuples

1. TripleGroup in NTGA (TG_GroupBy and TG_GroupFilter)

tg1 = {(:Product1, :type, :Product), (:Product1, :date, “1976-04-01”), (:Product1, :name, “iphone 4”)}
(triples t1, t2, t3)

2. n-tuple in VP (SPLIT and JOIN)

(:Product1, :type, :Product, :Product1, :date, “1976-04-01”, :Product1, :name, “iphone 4”)

TripleGroups are not structurally equivalent to n-tuples but are “content equivalent”.

NTGA Quick Reference

Consider a set of TripleGroups TG = {tg1, tg2} such that

tg1 = {(:Prdct1, :name, “iphone4”), (:Prdct1, :publisher, :Prdcr1), (:Prdct1, :price, “100”)}
tg2 = {(:Prdcr1, :type, :Prdcr), (:Prdcr1, :date, “1976-04-01”), (:Prdcr1, :hpage, “apple.com”)}

# | NTGA Operator | Result
1 | TG_Flatten(tg1) | (:Prdct1, :name, “iphone4”, :Prdct1, :publisher, :Prdcr1, :Prdct1, :price, 100)
2 | TG_Join(?o :publisher ?v : TG{:name, :publisher, :price}, ?v :type ?t : TG{:type, :date, :hpage}) | ntg = {(:Prdct1, :name, “iphone4”), (:Prdct1, :publisher, {(:Prdcr1, :type, :Prdcr), (:Prdcr1, :date, “1976-04-01”), (:Prdcr1, :hpage, “apple.com”)}), (:Prdct1, :price, “100”)}
3 | TG_Unnest(ntg) | {(:Prdct1, :name, “iphone4”), (:Prdct1, :publisher, :Prdcr1), (:Prdcr1, :type, :Prdcr), (:Prdcr1, :date, “1976-04-01”), (:Prdcr1, :hpage, “apple.com”), (:Prdct1, :price, “100”)}
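The flatten and unnest operators in the quick reference can be approximated with list manipulation. This is a sketch under assumptions: a nested TripleGroup is modeled as a Python list placed in the object position, and `tg_flatten` and `tg_unnest` are hypothetical helper names, not the RAPID+ implementation.

```python
tg1 = [
    (":Prdct1", ":name", "iphone4"),
    (":Prdct1", ":publisher", ":Prdcr1"),
    (":Prdct1", ":price", "100"),
]
tg2 = [
    (":Prdcr1", ":type", ":Prdcr"),
    (":Prdcr1", ":date", "1976-04-01"),
    (":Prdcr1", ":hpage", "apple.com"),
]

# Assumed encoding of the nested TripleGroup produced by the join:
# tg2 sits in the object position of the :publisher triple.
ntg = [
    (":Prdct1", ":name", "iphone4"),
    (":Prdct1", ":publisher", tg2),
    (":Prdct1", ":price", "100"),
]

def tg_flatten(tg):
    """Concatenate the triples of a TripleGroup into one flat n-tuple."""
    return tuple(field for triple in tg for field in triple)

def tg_unnest(nested):
    """Replace each nested group by a link to its subject, followed by its triples."""
    flat = []
    for s, p, o in nested:
        if isinstance(o, list):
            flat.append((s, p, o[0][0]))  # subject of the nested group
            flat.extend(tg_unnest(o))
        else:
            flat.append((s, p, o))
    return flat

flattened = tg_flatten(tg1)
unnested = tg_unnest(ntg)
```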



Execution on MapReduce Platform

MapReduce (MR): Popular large-scale data processing system running on a cluster of commodity-grade machines [DEAN08]
- Encode tasks in terms of low-level code as map/reduce functions, which are executed in parallel across the cluster.
- Apache Hadoop* – open-source implementation

Extended systems provide high-level languages for specifying tasks, along with optimizing compilers for generating map/reduce code à la database systems.
- Pig Latin for Apache Pig**, HiveQL for Apache Hive***.

* http://hadoop.apache.org
** http://pig.apache.org
*** http://hive.apache.org

Architecture of RAPID+

[Figure: architecture of RAPID+. Components: Query; Query Analyzer; Parser Layer (SPARQL parser, Pig Latin parser); Logical Plan Generator/Optimizer (Pig Latin Plan Generator, NTGA Plan Generator); Logical-to-Physical Plan Translator; MapReduce Job Compiler; Hadoop Job Tracker. Example operator plans: LOAD, SPLIT, JOIN, JOIN, STORE (Pig Latin) and LOAD, TG_GroupBy, TG_GroupFilter, TG_Join, STORE (NTGA).]
