Scan-Sharing for Optimizing RDF Graph Pattern Matching on MapReduce
HyeongSik Kim, Padmashree Ravindra, Kemafor Anyanwu
{hkim22, pravind2, kogan}@ncsu.edu
COUL – Semantic COmpUting research Lab
Outline
- Background
  - RDF Graph Pattern Matching
  - Graph Pattern Matching on MapReduce
  - Queries with Repeated Properties (QRP)
  - Nested TripleGroup Algebra (NTGA)
- Challenges: Processing QRP with NTGA
- Approach: TripleGroup Cloning
  - Well-formed, Ambiguous, and Perfect TripleGroups
  - TripleGroup Cloning in TG_GroupFilter
- Evaluation
- Related Work
The Growing Amount of RDF data
The amount of RDF on the web is rapidly growing.
Linked Data on the web:
- May 2007: 12 datasets
- Sep 2011: 295 datasets
- Growing #RDF triples: currently 31 billion
Example: DBPedia (http://dbpedia.org)
- A dataset extracted from Wikipedia.
- Contains 1 billion RDF triples.
RDF Data Model(Resource Description Framework)
How is knowledge represented in the Semantic Web? e.g., information on mobile device products.
- The Resource Description Framework (RDF) is used: the W3C standard data model for the Semantic Web.
- Information is represented in the form of a triple. Ex. “product1 has a name called iphone4” as RDF:
  - A subject: “product1”
  - A property: “name”
  - An object: “iphone4”
  - (:Product1, :name, :iphone4)
- The data model is a directed labeled graph. Node: subject, object. Labeled edge: property.
[Figure: example RDF graph with nodes :Producer1, :Product1 (“iphone4”), :Product2 (“iphone5”), www.apple.com, and $499, connected by edges labeled :name, :design, :homepage, :price, and :date]
Processing RDF Query (from the Viewpoint of Graph Pattern Matching)
Query variables are denoted with a question mark (e.g., ?product).
1. Example RDF Dataset: RDF graph on mobile devices
[Figure: example RDF graph with nodes :Producer1, :Product1 (“iphone4”, “2011-10-14”), :Product2 (“iphone5”, “2012-09-12”), www.apple.com, and $499, connected by edges labeled :name, :design, :date, :homepage, and :price. Oval: Resources in the Web. Rectangle: Literals.]
2. Example RDF Query (a graph pattern made of three triple patterns):
SELECT * WHERE{
?product :name ?productName .
?product :price ?productPrice .
?product :date ?productDate .}
A star pattern whose subject variable is ?product
Processing RDF Query (based on Relational Algebra)
1. Example RDF Dataset (relation R):
Subject     Property   Object
:Product1   :price     “$499”
:Product1   :name      “iphone 4”
:Product1   :date      “2011-10-14”
…           …          …
:Product2   :name      “iphone 5”
:Product2   :date      “2012-09-12”
…           …          …
2. Example RDF Query:
SELECT * WHERE{
?product :name ?productName .
?product :price ?productPrice .
?product :date ?productDate .}
Implicit joins on ?product
3. Conceptual Execution Plan:
(R ⋈ (Subject = Subject) R) ⋈ (Subject = Subject) R
First scan of relation R, second scan of relation R, third scan of relation R.
4. (Intermediate) Result:
(:Product1, :name, “iphone 4”)
(:Product2, :name, “iphone 5”)
(:Product1, :name, “iphone 4”, :Product1, :price, “$499”)
(:Product1, :name, “iphone 4”, :Product1, :price, “$499”, :Product1, :date, “2011-10-14”)
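The three scans in the conceptual plan can be illustrated with a small in-memory sketch. This is only an illustration in Python, not the authors' implementation: the helper names `select` and `join_on_subject` and the `scans` counter are hypothetical, and the result rows keep only the subject plus the matched objects rather than the full n-tuples shown above.

```python
from collections import defaultdict

# Relation R(subject, property, object), copied from the example dataset.
R = [
    (":Product1", ":price", "$499"),
    (":Product1", ":name", "iphone 4"),
    (":Product1", ":date", "2011-10-14"),
    (":Product2", ":name", "iphone 5"),
    (":Product2", ":date", "2012-09-12"),
]

scans = 0  # number of full passes over R (hypothetical counter)


def select(prop):
    """Selection on the property column; every call scans R once."""
    global scans
    scans += 1
    return [(s, o) for s, p, o in R if p == prop]


def join_on_subject(left, right):
    """Hash join on the subject (first column) of both inputs."""
    index = defaultdict(list)
    for s, o in right:
        index[s].append(o)
    return [row + (o,) for row in left for o in index.get(row[0], [])]


# One selection (and therefore one scan) per triple pattern.
step1 = join_on_subject(select(":name"), select(":price"))
result = join_on_subject(step1, select(":date"))

print(result)  # rows of (subject, name, price, date)
print(scans)   # number of scans of R
```

Only :Product1 survives both joins because :Product2 has no :price triple in the sample data, and the counter shows that R was read once per triple pattern.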
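Overview of MapReduce
[Figure: map tasks M1, M2, M3 read input from HDFS and write sorted intermediate partitions (b1,1, b1,2, b2,1, b2,2, b3,1, b3,2) to local disk; reduce tasks R1 and R2 merge them and write results to HDFS. Legend: ↓ = sort, ☼ = merge]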
MapReduce (MR): a large-scale data processing system running on a cluster of machines. [DEAN08]
Tasks are encoded in low-level code as map/reduce functions, which are executed in parallel across the cluster.
1. Map(k1, v1) → list(k2, v2)
2. Reduce(k2, list(v2)) → list(v3)
Map phase: read the data; execute the user’s map function; sort and write intermediate data.
Reduce phase: sort and merge input; execute the user’s reduce function; transfer the result to HDFS.
[NYKIEL10]
Join Processing on MapReduce [BLANAS10]
Example: equi-join operation on the first column of relations L and R
Input (HDFS):
L: (k1, v5), (k2, v4)
R: (k2, v1), (k3, v6)
Map (M1, M2):
- Extract the join column
- Add a tag of either L or R
- Annotate tuples with the join key
Map output: (k1, (L: k1, v5)), (k2, (L: k2, v4)), (k2, (R: k2, v1)), (k3, (R: k3, v6))
Reduce (R1, R2, R3):
- Separate and buffer the input records into two sets according to the table tag (L or R)
- Perform a cross-product
Reduce input: (k1, ((L: k1, v5))), (k2, ((L: k2, v4), (R: k2, v1))), (k3, ((R: k3, v6)))
Result (HDFS): (k2, v4, k2, v1)
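The tag-and-cross-product join described on this slide can be simulated in a few lines of Python. This is a single-process sketch, not Hadoop code: `map_phase`, `shuffle`, and `reduce_phase` are hypothetical names, and the shuffle is modeled as a simple dictionary grouping.

```python
from collections import defaultdict


def map_phase(relation, tag):
    """Emit (join key, (tag, record)); the join key is the first column."""
    return [(record[0], (tag, record)) for record in relation]


def shuffle(pairs):
    """Group map output by key, standing in for the sort/merge step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Split the buffered records by table tag, then take the cross-product."""
    left = [record for tag, record in values if tag == "L"]
    right = [record for tag, record in values if tag == "R"]
    return [l + r for l in left for r in right]


L = [("k1", "v5"), ("k2", "v4")]
R = [("k2", "v1"), ("k3", "v6")]

groups = shuffle(map_phase(L, "L") + map_phase(R, "R"))
joined = [row for key in sorted(groups) for row in reduce_phase(groups[key])]
print(joined)
```

Keys k1 and k3 reach a reducer with records from only one relation, so their cross-product is empty and only the k2 row is produced.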
Processing Multi-Join Query on MapReduce
1. (Extended) Example Query
SELECT * WHERE{
?product :name ?productName .
?product :price ?productPrice .
?producer :design ?product .
?producer :type ?ProducerType .}
2. Vertical Partitioning (VP): [ABADI07]
Partition relation R vertically based on the value of the property attribute.
E.g., the property relations name, price, design, and type can be generated using selection or split operators:
name = σ(property = :name)(R), price = σ(property = :price)(R), design = σ(property = :design)(R), type = σ(property = :type)(R)
3. Corresponding Logical Plan based on VP
(name ⋈ (subject = subject) price) ⋈ (subject = object) (design ⋈ (subject = subject) type)
4. MapReduce Plans [ABADI07]
MR Job 1: temp1 = name ⋈ (subject = subject) price
MR Job 2: temp2 = temp1 ⋈ (subject = object) design
MR Job 3: output = temp2 ⋈ (subject = subject) type
Cost =
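Vertical partitioning amounts to splitting the triple relation by property. The following Python sketch illustrates the idea on a few sample triples; the function name `vertical_partition` and the sample data are assumptions made for illustration and are not taken from the VP implementation used in the talk.

```python
from collections import defaultdict


def vertical_partition(triples):
    """Split R(subject, property, object) into one (subject, object) relation per property."""
    partitions = defaultdict(list)
    for subject, prop, obj in triples:
        partitions[prop].append((subject, obj))
    return dict(partitions)


# Sample triples (illustrative only).
R = [
    (":Product1", ":name", "iphone 4"),
    (":Product1", ":price", "$499"),
    (":Producer1", ":design", ":Product1"),
    (":Producer1", ":type", ":Producer"),
]

parts = vertical_partition(R)
for prop in sorted(parts):
    print(prop, parts[prop])
```

Each property relation can then be scanned on its own, which is what makes joins over a property that appears in several star patterns read the same relation more than once.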
Query Optimization on MapReduce
- Heuristic: group operations so that a workflow needs fewer MR jobs.
- Group multiple join operations on the same key in the same MR cycle. (Pig)
- Finding the optimal grouping is NP-hard; more advanced techniques use a greedy approach that groups as many non-conflicting joins as possible. [HUSAIN11]
1. (Extended) Example Query
SELECT * WHERE{
?product :name ?productName .
?product :price ?productPrice .
?product :date ?productDate .
?producer :design ?product .
?producer :type ?ProducerType .}
2. Corresponding Logical Plan based on VP
(name ⋈ (subject = subject) price ⋈ (subject = subject) date) ⋈ (subject = object) (design ⋈ (subject = subject) type)
3. MapReduce Plans [HUSAIN11]
MR Job 1: temp1 = name ⋈ (subject = subject) price ⋈ (subject = subject) date
MR Job 2: temp2 = design ⋈ (subject = subject) type
MR Job 3: output = temp1 ⋈ (subject = object) temp2
TS: TableScan (Load) operator
Queries with “Repeated” Properties
Issue: name, type, and date are scanned repeatedly across MR jobs J2 and J3.
Possible Optimization Considerations:
- Minimize scan overhead using indexes. MapReduce does not support any indexes by default.
- Buffer such relations across multiple joins (memory intensive).
Another approach: Algebraic Optimization
- Rewrite queries into equivalent but less expensive ones.
1. Example Query
SELECT * WHERE{
?product :name ?prodName .
?product :type ?prodType .
?product :date ?prodDate .
?product :price ?prodPrice .
?producer :design ?product .
?producer :name ?prcName .
?producer :type ?prcType .
?producer :date ?prcDate .}
[Figure: VP-based MR plan with intermediate results written to HDFS between jobs. J1: TS(R), SPLIT into price, name, type, date, design. J2: TS(price), TS(name), TS(type), TS(date), JOIN. J3: TS(design), TS(name), TS(type), TS(date), JOIN. J4: TS(price, name, …), TS(name, type, …), JOIN.]
Query: list the products with their detailed information together with their producer’s information (e.g., the company name, the type of company, and its foundation date).
General Intuition in NTGA
Nested TripleGroup Algebra (NTGA): re-interpreting multiple star-joins as a grouping operation leads to “groups of triples” (TripleGroups) instead of n-tuples. [RAVINDRA11]
1. Example Query
SELECT * WHERE {
1: ?x :p1 ?o1 .
2: ?x :p2 ?o2 .
3: ?y :p3 ?o2 .
4: ?y :p4 ?o3 . }
2. Input Triples
Subject   Property   Object
:s1       :p1        :o1
:s1       :p2        :o2
:s2       :p3        :o3
:s2       :p4        :o4
…         …          …
VP: 1 MR job for each star pattern (one for subject variable ?x, one for ?y) → 2 MR jobs!
p1 ⋈ (subject=subject) p2: t1 = (:s1, :p1, :o1, :s1, :p2, :o2)
p3 ⋈ (subject=subject) p4: t2 = (:s2, :p3, :o3, :s2, :p4, :o4)
NTGA: 1 MR job for all star patterns!
tg1 = {(:s1, :p1, :o1), (:s1, :p2, :o2)}
tg2 = {(:s2, :p3, :o3), (:s2, :p4, :o4)}
n-tuples and TripleGroups have a different structure BUT are “content equivalent”.
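The grouping view of star-joins can be sketched in Python as a single pass that groups triples by subject. The function names `tg_group_by` and `flatten` are hypothetical stand-ins for the NTGA operators named in the talk, and the sketch ignores how the grouping is distributed across map and reduce tasks.

```python
from collections import defaultdict


def tg_group_by(triples):
    """One pass over the input: group triples that share a subject into a TripleGroup."""
    groups = defaultdict(list)
    for triple in triples:
        groups[triple[0]].append(triple)
    return dict(groups)


def flatten(triplegroup):
    """Concatenate the triples of a TripleGroup into one n-tuple."""
    return tuple(value for triple in triplegroup for value in triple)


triples = [
    (":s1", ":p1", ":o1"),
    (":s1", ":p2", ":o2"),
    (":s2", ":p3", ":o3"),
    (":s2", ":p4", ":o4"),
]

tgs = tg_group_by(triples)
print(tgs[":s1"])           # TripleGroup for subject :s1
print(flatten(tgs[":s1"]))  # the content-equivalent n-tuple
```

Both star patterns are answered from the same pass over the triples, which is the intuition behind needing one MR job instead of one per star pattern.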
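Processing RDF Query with NTGA
VP: 4 MR jobs (4 HDFS reads)
NTGA: 2 MR jobs (2 HDFS reads)
[Figure: NTGA plan with intermediate results written to HDFS between jobs. J1: TS(R), TG_GroupBy, TG_GroupFilter for the star patterns (:name, :type, :date, :price) and (:design, :name, :type, :date). J2: TS(Rpltd), TS(Rltds), TG_JOIN. Other operators shown: TG_Flatten, TG_Unnest. TS: TableScan (Load) operator.]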
A “Key” NTGA Operator: TG_GroupFilter
- Retain only TripleGroups that satisfy the required query substructure.
- Check for an “exact” match between the set of properties in a star pattern and a TripleGroup.
Example Query:
SELECT * WHERE {
1: ?x :p1 :o1 .
2: ?x :p2 ?y .
3: ?y :p3 :o2 .
4: ?y :p4 :o3 . }
Star patterns: (:p1, :p2) and (:p3, :p4)
Input TripleGroups:
tg1 = {(:s1, :p1, :o1), (:s1, :p2, :o2)}
tg2 = {(:s2, :p2, :o2), (:s2, :p3, :o3)}
tg1: (:p1, :p2) = (:p1, :p2). Correct match. Therefore, tg1 passes.
tg2: (:p2, :p3) matches neither (:p1, :p2) nor (:p3, :p4). No matches. Therefore, tg2 is filtered out.
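The exact-match check can be written as a set comparison between the properties of a TripleGroup and the properties of each star pattern. This Python sketch uses the hypothetical names `properties` and `tg_group_filter` and only mirrors the original (pre-revision) semantics described on this slide.

```python
def properties(triplegroup):
    """The set of properties that occur in a TripleGroup."""
    return frozenset(prop for _, prop, _ in triplegroup)


def tg_group_filter(triplegroups, star_patterns):
    """Keep a TripleGroup only if its property set equals that of some star pattern."""
    wanted = {frozenset(pattern) for pattern in star_patterns}
    return [tg for tg in triplegroups if properties(tg) in wanted]


star_patterns = [(":p1", ":p2"), (":p3", ":p4")]
tg1 = [(":s1", ":p1", ":o1"), (":s1", ":p2", ":o2")]
tg2 = [(":s2", ":p2", ":o2"), (":s2", ":p3", ":o3")]

kept = tg_group_filter([tg1, tg2], star_patterns)
print(kept)
```

With this strict equality test, a TripleGroup that carries the properties of two star patterns at once matches neither, which is the gap the repeated-properties discussion addresses.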
TG_GroupFilter Semantics and Repeated Properties
1. Given triple patterns
SELECT * WHERE{
stp1:
?product :name ?prodname .
?product :type ?prodType .
?product :date ?prodDate .
?product :price ?prodPrice .
stp2:
?producer :design ?product .
?producer :name ?prcName .
?producer :type ?prcType .
?producer :date ?prcDate .}
2. A TripleGroup from TG_GroupBy
tg0 = {(s1, :type, o1), (s1, :name, o2), (s1, :date, o3), (s1, :price, o4), (s1, :design, o5)}
TG_GroupFilter assumes a 1-1 correspondence between TripleGroups and star subpatterns, but with repeated properties there can be ambiguities: tg0 is only a partial match with stp1 and with stp2.
Overview of the Solution
Issue: mappings between TripleGroups and star patterns become ambiguous if repeated properties exist across multiple star patterns.
Goal: produce TripleGroups that are an exact match with a star pattern in the query.
Solution: split the filtering process into two steps.
1. Remove incomplete TripleGroups that do not match any star pattern (i.e., eliminate non-well-formed TripleGroups).
2. Resolve the ambiguity of the remaining TripleGroups that may match multiple star patterns (Ambiguous TripleGroups) and generate TripleGroups that are an exact match with a star pattern (Perfect TripleGroups).
Well-formed TripleGroup
Well-formed TripleGroup: a TripleGroup consisting of triples that contain all the properties of some star subpattern.
1. Example Query
SELECT * WHERE{
stp1:
?product :name ?prodname .
?product :date ?prodDate .
?product :price ?prodPrice .
stp2:
?producer :design ?product .
?producer :name ?prcname .
?producer :date ?prcdate .}
2. TripleGroups generated from TG_GroupBy
tg1 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3)}: well-formed (contains the properties from stp1)
tg2 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3), (s1, :design, :o4)}: well-formed (contains the properties from stp1 and stp2)
tg3 = {(s1, :name, :o4), (s1, :design, :o3)}: NOT well-formed (does not contain all the properties from stp1 or stp2)
Ambiguous & Perfect TripleGroup
Ambiguous TripleGroup: a well-formed TripleGroup that can be matched with multiple star subpatterns in a query, e.g. tg2.
Perfect TripleGroup: a well-formed TripleGroup which is an exact match for a single star pattern* (valid intermediate answers).
1. Example Query
SELECT * WHERE{
stp1:
?product :name ?prodname .
?product :date ?prodDate .
?product :price ?prodPrice .
stp2:
?producer :design ?product .
?producer :name ?prcname .
?producer :date ?prcdate .}
2. TripleGroups generated from TG_GroupBy
tg1 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3)}: Perfect TripleGroup (“exact” match with stp1)
tg2 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3), (s1, :design, :o4)}: Ambiguous TripleGroup (can be matched with stp1 and stp2)
* a single star pattern “class”
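The three definitions can be turned into a small classifier. The Python sketch below is one reading of the definitions on these slides rather than the authors' code: the function name `classify` and the extra label "well-formed" (for a TripleGroup that covers exactly one star pattern but carries additional properties) are assumptions.

```python
def classify(triplegroup, star_patterns):
    """Label a TripleGroup relative to the star patterns of a query."""
    have = {prop for _, prop, _ in triplegroup}
    covered = [name for name, props in star_patterns.items() if set(props) <= have]
    if not covered:
        return "not well-formed"   # contains all properties of no star pattern
    if len(covered) > 1:
        return "ambiguous"         # could be matched with several star patterns
    if have == set(star_patterns[covered[0]]):
        return "perfect"           # exact match with a single star pattern
    return "well-formed"           # covers one pattern but has extra properties


star_patterns = {
    "stp1": (":name", ":date", ":price"),
    "stp2": (":design", ":name", ":date"),
}
tg1 = [("s1", ":name", ":o1"), ("s1", ":date", ":o2"), ("s1", ":price", ":o3")]
tg2 = tg1 + [("s1", ":design", ":o4")]
tg3 = [("s1", ":name", ":o4"), ("s1", ":design", ":o3")]

for name, tg in (("tg1", tg1), ("tg2", tg2), ("tg3", tg3)):
    print(name, classify(tg, star_patterns))
```

The subset test (`<=`) implements "contains all the properties of some star subpattern", while the equality test separates an exact match from a superset.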
Dealing with Ambiguous TripleGroups
Perfect TripleGroups tg1 and tg2 are cloned from the ambiguous TripleGroup tg0, and the non-perfect TripleGroup tg3 is rejected.
Example Query:
SELECT * WHERE{
stp1:
?product :name ?prodname .
?product :date ?prodDate .
?product :price ?prodPrice .
stp2:
?producer :design ?product .
?producer :name ?prcname .
?producer :date ?prcdate .
stp3:
?seller :sell ?product .
?seller :name ?selName .}
Ambiguous TripleGroup:
tg0 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3), (s1, :design, :o4)}
Clone(:name, :date, :price): tg1 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3)} (Perfect TripleGroup for stp1)
Clone(:design, :name, :date): tg2 = {(s1, :design, :o4), (s1, :name, :o1), (s1, :date, :o2)} (Perfect TripleGroup for stp2)
Clone(:sell, :name): tg3 = {(s1, :sell, ??), (s1, :name, :o1)} (rejected: tg0 has no :sell triple)
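Cloning can be sketched as projecting the ambiguous TripleGroup onto each star pattern whose properties it fully contains. The Python function `clone_perfect` below is a hypothetical illustration of that idea, not the revised TG_GroupFilter itself; in particular it keeps the triples in the order in which they appear in tg0.

```python
def clone_perfect(triplegroup, star_patterns):
    """Return one projected clone per star pattern that the TripleGroup fully covers."""
    have = {prop for _, prop, _ in triplegroup}
    clones = {}
    for name, props in star_patterns.items():
        if set(props) <= have:  # every property of the pattern is present
            clones[name] = [t for t in triplegroup if t[1] in props]
        # otherwise the clone would be incomplete and is rejected
    return clones


star_patterns = {
    "stp1": (":name", ":date", ":price"),
    "stp2": (":design", ":name", ":date"),
    "stp3": (":sell", ":name"),
}
tg0 = [
    ("s1", ":name", ":o1"),
    ("s1", ":date", ":o2"),
    ("s1", ":price", ":o3"),
    ("s1", ":design", ":o4"),
]

clones = clone_perfect(tg0, star_patterns)
for name in sorted(clones):
    print(name, clones[name])
```

No clone is produced for stp3 because tg0 contains no :sell triple, which corresponds to the rejected tg3 in the example.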
Generated MR Plan
NTGA-based MapReduce Plan (m:op = map-side operator, r:op = reduce-side operator)
J1: Map: m:TG_GroupBy
J1: Reduce: r:TG_GroupBy, r:TG_GroupFilter* (Revised)
J2: Map: m:TG_JOIN(?o1 = ?o1)
J2: Reduce: r:TG_JOIN
Example Query: Clone in TG_GroupFilter
SELECT * WHERE{
?product :name ?prodname .
?product :date ?prodDate .
?product :price ?prodPrice .
?producer :design ?product .
?producer :name ?prcname .
?producer :date ?prcdate .
?seller :sell ?product .
?seller :name ?selName .}
tg0 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3), (s1, :design, :o4)}
(clone) → { tg1, tg2, (…) }
tg1 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3)}
tg2 = {(s1, :design, :o4), (s1, :name, :o1), (s1, :date, :o2)}
Losslessness of Revised TG_GroupFilter
Example Dataset
Subject   Property   Object
:s1       :name      :o1
:s1       :date      :o2
:s1       :price     :o3
:s1       :design    :o4
…         …          …
1. Relational Algebra (VP)
1) name ⋈ (subject=subject) date ⋈ (subject=subject) price:
t1 = (:s1, :name, :o1, :s1, :date, :o2, :s1, :price, :o3)
2) design ⋈ (subject=subject) name ⋈ (subject=subject) date:
t2 = (:s1, :design, :o4, :s1, :name, :o1, :s1, :date, :o2)
2. NTGA
tg0 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3), (s1, :design, :o4)}
(clone) → tg1 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3)}, tg2 = {(s1, :design, :o4), (s1, :name, :o1), (s1, :date, :o2)}
tg1 and tg2 are content equivalent to t1 and t2: no valid intermediate results are destroyed, nor are spurious results introduced by cloning.
- Filter out non-well-formed TripleGroups: an incomplete TripleGroup that does not contain all the properties of any star pattern clearly does not match any star pattern in the query.
- Generate multiple Perfect TripleGroups from an ambiguous TripleGroup.
Setup and TestBed
Setup:
- VP and NTGA implemented on top of Apache Pig.
- 10-node Hadoop clusters on NCSU’s VCL*.
Three approaches were considered:
- 1-join-per-cycle (SHARD) [ROHLOFF10]
- 1-star-join-per-cycle (Pig-Def or VP)
- all-star-joins-1-cycle (NTGA)
Evaluation of the redundant scans during star-join computations:
- Task 1a: varying the ratio of repeated properties to fixed ones.
- Task 1b: varying the selectivity of repeated properties.
- Task 2: scaling up sub patterns with repeated properties.
- Task 3: scalability test with varying data size.
* https://vcl.ncsu.edu
Dataset
Dataset: synthetic benchmark dataset generated using BSBM*
- From 22GB (250k Products, BSBM-250k, ~86M triples)
- Up to 87GB (1M Products, BSBM-1000k, ~350M triples)
7 repeated properties:
- Across all classes, e.g. type, publisher
- Only for a smaller subset of classes, e.g. name
The size and selectivity** of BSBM-250k:
- :publisher: 1.7GB, 0.091
- :type: 1.8GB, 0.105
- :name: 49MB, 0.003
- :date: 1.4GB, 0.091
* http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/
** Selectivity of a property P = |T_P| / |T|, where T_P denotes the triples containing P and T denotes all triples.
Task 1a: Varying the Ratio of Repeated Properties to Fixed ones.
Test Queries (dq0 to dq4)
- Two star patterns with a fixed subset of unique properties + a varying number of repeated properties in the second star pattern (from 0 to 4).
- The overall number of triple patterns increases from 8 to 12.
- dq0: 2 star patterns, 0 repeated properties.
- dq1: 1 repeated property. dq2: 2 repeated properties. dq3: 3 repeated properties.
- dq4: 2 star patterns, 4 repeated properties (:type, :publisher, :name, :date).
[Figure: query graphs for dq0 to dq4 with edges :publisher, :name, :type, :date. Black edge: arbitrary unique property. Red edge: repeated property.]
Task 1a: Results
Pig-Def (4 MR cycles), NTGA (2 cycles), SHARD (13 cycles)
[Figure: three charts over dq0 to dq4 comparing 1-star-join-per-cycle (Pig-Def), 1-join-per-cycle (SHARD), and all-star-joins-1-cycle (NTGA): HDFS_READ (GB), Time (Seconds), and HDFS_WRITE (GB); plus a timeline of Map_Start times (00:00 to 28:48) for Pig-Def (MR1 to MR4) and SHARD (MR1 to MR12)]
With an increasing number of repeated properties:
1. NTGA: constant HDFS reads and execution time; fewer HDFS writes due to the smaller number of required MR jobs.
2. SHARD: the number of scans of the whole relation increases.
3. Pig-Def or VP: the number of scans of the property relations increases.
Task 1b: Varying the Size of Repeated Props
Test Queries (rq1 and rq2): identical queries with two star subpatterns that contain a different repeated property.
- rq1: two star patterns with repeated property :publisher (1.7GB, 9.1%)
- rq2: two star patterns with repeated property :name (49MB, 0.3%)
Results:
- NTGA has around a 42% performance gain over Pig-Def for rq2, which increases to around a 48% gain for rq1.
- With rq2, Pig-Def always uses an additional 70 seconds compared to rq1.
Task 2: Scaling up Sub patterns with Repeated Properties
Four queries (mq1 ~ mq4)
- Two repeated properties occur in each of the star subpatterns.
- The number of star patterns varies (1 to 4). mq1: a single star pattern; mq2: two star patterns; mq3: three star patterns.
- The total number of repeated properties across a graph pattern query increases from 2 (in mq1) to 8 (in mq4).
[Figure: query graphs for mq1 to mq3 with repeated edges :publisher and :type in every star pattern]
[Figure: Time (Seconds) and HDFS_READ (GB) for mq1 to mq4 comparing 1-star-join-per-cycle (Pig-Def), 1-join-per-cycle (SHARD), and all-star-joins-1-cycle (NTGA); annotations ≈40G, ≈80G, ≈120G]
mq1 → mq4:
- ↑ #star patterns → ↑ #repeated properties across star patterns (from 2 to 8) and ↑ amount of scan-sharing across star patterns (from around 40G to 120G).
- Execution time increases due to the join operations for connecting sub stars.
Task 3: Varying Size of Graphs
Increases #RDF triples for query dq4 used in Task 1, from BSBM-250k (22GB) to BSBM-1000k (86GB).
[Figure: Execution Time (in seconds) of Pig-Def and NTGA for BSBM-250k (22GB), BSBM-500k (43GB), BSBM-750k (66GB), and BSBM-1000k (86GB); annotated gains: 52.8%, 54.8%, 55%, 58%]
The NTGA approach scales well.
- A performance gain from 52% to 58% is observed.
- The size of the relations containing repeated properties does not increase linearly with the size of the data.
Related Work
RDF Data Processing on MapReduce:
- SHARD [ROHLOFF10]: the clause-iteration algorithm (n + 1 jobs to process n triple patterns).
- HadoopDB [HUANG11]: a hybrid architecture of a database (RDF-3X) and Hadoop with a graph partitioning scheme.
- HadoopRDF [HUSAIN11]: a customized storage format and plan generation based on a heuristic greedy approach.
Work Sharing on MapReduce:
- MRShare [NYKIEL10]: inter-query sharing scheme customized to the MapReduce framework.
- NOVA [OLSTON11]: shares the initial load operation if multiple copies of a workflow use the identical input.
- CoScan [WANG11]: minimizes redundant data loading by merging multiple Pig scripts.
Relevant Publications
- Kim, H., Ravindra, P., Anyanwu, K.: Scan-Sharing for Optimizing RDF Graph Pattern Matching on MapReduce. In: Proc. CLOUD (2012)
- Anyanwu, K., Kim, H., Ravindra, P.: Algebraic Optimization for Processing Graph Pattern Queries in the Cloud. IEEE Internet Computing (2012)
- Kim, H., Ravindra, P., Anyanwu, K.: From SPARQL to MapReduce: The Journey using a Nested TripleGroup Algebra. In: Proc. International Conference on Very Large Data Bases (2011) (Demonstration)
- Ravindra, P., Kim, H., Anyanwu, K.: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Platforms. In: Proc. Extended Semantic Web Conference (2011)
References
[DEAN08] Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51 (2008) 107–113
[OLSTON08] Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: Proc. International Conference on Management of data. (2008)
[HUSAIN11] M. F. Husain, J. McGlothlin et al., “Heuristics-Based Query Processing for Large RDF Graphs Using Cloud Computing,” TKDE, vol. 23, pp. 1312–1327, 2011.
[HUANG11] J. Huang, D. J. Abadi et al., “Scalable SPARQL Querying of Large RDF Graphs,” Proc. VLDB, vol. 4, no. 11, 2011.
[NYKIEL10] T. Nykiel, M. Potamias et al., “MRShare: Sharing across Multiple Queries in MapReduce,” Proc. VLDB, vol. 3, pp. 494–505, 2010.
[OLSTON11] C. Olston, G. Chiou et al., “Nova: Continuous Pig/Hadoop Workflows,” in Proc. SIGMOD, 2011, pp. 1081–1090.
[WANG11] X. Wang, C. Olston et al., “CoScan: Cooperative Scan Sharing in the Cloud,” in Proc. SOCC, 2011, pp. 11:1–11:12.
[RAVINDRA11] P. Ravindra, H. Kim et al., “An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce,” in Proc. ESWC, 2011, vol. 6644, pp. 46–61.
[ABADI07] D. J. Abadi, A. Marcus et al., “Scalable Semantic Web data Management using Vertical Partitioning,” in Proc. VLDB, 2007.
[ROHLOFF10] K. Rohloff and R. E. Schantz, “High-performance, Massively Scalable Distributed Systems using the MapReduce Software Framework: the SHARD Triple-store,” in PSI EtA, 2010, pp. 4:1–4:5.
[NEUMANN10] T. Neumann and G. Weikum, “The RDF-3X engine for scalable management of RDF data,” The VLDB Journal, vol. 19, pp. 91–113, 2010.
[WEISS08] C. Weiss, P. Karras, and A. Bernstein, “Hexastore: Sextuple Indexing for Semantic Web Data Management,” Proc. VLDB, vol. 1, no. 1, 2008.
[HERODOTOU11] H. Herodotou and S. Babu, “Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs,” Proc. VLDB, vol. 4, 2011.
[BLANAS10] S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian, “A Comparison of Join Algorithms for Log Processing in MapReduce,” Proc. SIGMOD, 2010.
Thank You!
RDF Data Model (Resource Description Framework)
1. Statements (triples)
Subject      Property     Object
:Product1    :name        “iphone4”
:Product1    :color       :white
:Product1    :date        “2011-10-14”
:Product1    :publisher   :Producer1
…            …            …
:Producer1   :name        “Apple”
:Producer1   :type        :Producer
:Producer1   :date        “1976-04-01”
:Producer1   :homepage    apple.com
2. Graph Representation
[Figure: RDF graph of the statements above. Oval: resources, i.e. URIs. Rectangle: Literals.]
Star subgraphs: sets of edges with the same subject, e.g. the edges of :Product1 and of :Producer1.
Relationship between TripleGroups and n-tuples
1. TripleGroup in NTGA (TG_GroupBy and TG_GroupFilter)
tg1 = {(:Product1, :type, :Product), (:Product1, :date, “1976-04-01”), (:Product1, :name, “iphone 4”)}
2. n-tuple in VP (SPLIT and JOIN)
(:Product1, :type, :Product, :Product1, :date, “1976-04-01”, :Product1, :name, “iphone 4”), made of the triples t1, t2, t3
TripleGroups are not structurally equivalent to n-tuples but are “content equivalent”.
NTGA Quick Reference
Consider a set of TripleGroups TG = {tg1, tg2} such that
tg1 = {(:Prdct1, :name, “iphone4”), (:Prdct1, :publisher, :Prdcr1), (:Prdct1, :price, “100”)}
tg2 = {(:Prdcr1, :type, :Prdcr), (:Prdcr1, :date, “1976-04-01”), (:Prdcr1, :hpage, “apple.com”)}
#  NTGA Operator and Result
1  TG_Flatten(tg1)
   Result: (:Prdct1, :name, “iphone4”, :Prdct1, :publisher, :Prdcr1, :Prdct1, :price, 100)
2  TG_Join(?o :publisher ?v : TG{:name, :publisher, :price}, ?v :type ?t : TG{:type, :date, :hpage})
   Result: ntg = {(:Prdct1, :name, “iphone4”), (:Prdct1, :publisher, {(:Prdcr1, :type, :Prdcr), (:Prdcr1, :date, “1976-04-01”), (:Prdcr1, :hpage, “apple.com”)}), (:Prdct1, :price, “100”)}
3  TG_Unnest(ntg)
   Result: {(:Prdct1, :name, “iphone4”), (:Prdct1, :publisher, :Prdcr1), (:Prdcr1, :type, :Prdcr), (:Prdcr1, :date, “1976-04-01”), (:Prdcr1, :hpage, “apple.com”), (:Prdct1, :price, “100”)}
Execution on MapReduce Platform
MapReduce (MR): a popular large-scale data processing system running on a cluster of commodity-grade machines. [DEAN08]
- Tasks are encoded in low-level code as map/reduce functions, which are executed in parallel across the cluster.
- Apache Hadoop*: open-source implementation.
- Extended systems provide high-level languages for specifying tasks, along with optimizing compilers for generating map/reduce code à la database systems: Pig Latin for Apache Pig**, HiveQL for Apache Hive***.
* http://hadoop.apache.org
** http://pig.apache.org
*** http://hive.apache.org
Architecture of RAPID+
[Figure: architecture of RAPID+. Components: Query; Parser Layer (SPARQL parser, Pig Latin parser); Query Analyzer; Logical Plan Generator/Optimizer (Pig Latin Plan Generator, NTGA Plan Generator); Logical-to-Physical Plan Translator; MapReduce Job Compiler; Hadoop Job Tracker. Example plans: LOAD, SPLIT, JOIN, JOIN, STORE and LOAD, TG_GroupBy, TG_GroupFilter, TG_Join, STORE.]