![Page 1: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/1.jpg)
Cascading Map-Side Joins over HBase for Scalable Join Processing
Joint Workshop on Scalable and High-Performance Semantic Web Systems
(SSWS + HPCSW 2012)
Collocated with the 11th International Semantic Web Conference (ISWC 2012)
Alexander Schätzle, Martin Przyjaciel-Zablocki, Christopher Dorner, Thomas Hornung, Georg Lausen
University of Freiburg, Databases & Information Systems
11 November 2012
![Page 2: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/2.jpg)
Motivation
RDF datasets are growing constantly (e.g. LOD)
Querying RDF datasets at web-scale is challenging
Our Approach
◦ Distributed scalable RDF engine for processing very large datasets (RDF + SPARQL)
◦ Build on common & widely-used frameworks (Hadoop MapReduce, HBase, Pig, Cassandra, …)
![Page 3: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/3.jpg)
MapReduce
Automatic parallelization of computations
Distributed File System
◦ Commodity hardware → fault tolerance by replication
◦ Very large files / write-once, read-many pattern
Apache Hadoop
◦ Well-known open-source implementation
[Figure: MapReduce data flow. Input splits 0 to 5 (DFS), Map phase, Shuffle & Sort over intermediate results (local), Reduce phase, outputs 0 and 1 (DFS)]
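The three phases named on the MapReduce slide can be illustrated with a minimal single-process sketch. It is not Hadoop code: `run_mapreduce`, `count_predicates_map` and `sum_reduce` are hypothetical names chosen for this illustration, and the input triples are borrowed from the article example used later in the talk.

```python
from collections import defaultdict
from itertools import chain


def run_mapreduce(splits, map_fn, reduce_fn):
    """Apply map_fn to every record, group by key, then reduce each group."""
    # Map phase: every split is processed independently
    # (on a cluster, one map task per split, running in parallel).
    intermediate = chain.from_iterable(
        map_fn(record) for split in splits for record in split
    )
    # Shuffle & Sort: bring all values that share a key together.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reduce call per key, in sorted key order.
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}


def count_predicates_map(triple):
    """Emit (predicate, 1) for each RDF triple."""
    _subject, predicate, _object = triple
    yield (predicate, 1)


def sum_reduce(_key, values):
    """Sum all counts collected for one key."""
    return sum(values)


splits = [
    [("Article1", "title", "PigSPARQL"), ("Article1", "year", "2011")],
    [("Article2", "title", "RDFPath"), ("Article2", "year", "2011")],
]
result = run_mapreduce(splits, count_predicates_map, sum_reduce)
print(result)
```

On a real cluster the grouping step is the part that moves data between machines, which is why the later slides try to avoid it.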
![Page 4: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/4.jpg)
Previous Work – PigSPARQL [1]
[1] Alexander Schätzle, Martin Przyjaciel-Zablocki, Georg Lausen: PigSPARQL: Mapping SPARQL to Pig Latin, SWIM 2011.
SPARQL on top of Pig Latin
Advantages
◦ All operators of SPARQL 1.0
◦ Benefits from Pig optimizations
◦ Runs "out-of-the-box" on Hadoop
◦ Portable on other platforms
Performance
◦ Good scalability and performance for complex analytical queries
◦ Performance not satisfying for more selective queries
Reasons
◦ Reduce-Side Join (→ data shuffling)
◦ No built-in index structures
[Figure: architecture of the RDF Management System. Components: SPARQL 1.0 query, Query Processor, Query Engine (Pig), MapReduce, HDFS, Triple Loader, RDF Graph]
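To make the "data shuffling" cost of a reduce-side join concrete, here is a small single-process sketch. It is an illustration under stated assumptions, not PigSPARQL's implementation: `reduce_side_join` is a hypothetical helper, and the `Article3` record is invented to show a tuple that is shuffled although it never joins.

```python
from collections import defaultdict


def reduce_side_join(left, right):
    """Join two lists of (key, value) pairs the way a reduce-side join does."""
    # Map phase: tag each record with its origin so the reducer can tell them apart.
    tagged = [(k, ("L", v)) for k, v in left] + [(k, ("R", v)) for k, v in right]
    # Every tagged record has to be shuffled to a reducer, joining or not.
    shuffled_records = len(tagged)
    # Shuffle & Sort: group records of both inputs by join key.
    groups = defaultdict(lambda: {"L": [], "R": []})
    for key, (tag, value) in tagged:
        groups[key][tag].append(value)
    # Reduce phase: combine left and right values that share a key.
    joined = [
        (key, l, r)
        for key in sorted(groups)
        for l in groups[key]["L"]
        for r in groups[key]["R"]
    ]
    return joined, shuffled_records


titles = [("Article1", "PigSPARQL"), ("Article2", "RDFPath")]
# Article3 is a made-up record without a matching title.
years = [("Article1", "2011"), ("Article2", "2011"), ("Article3", "2012")]
joined, shuffled = reduce_side_join(titles, years)
print(joined)    # two joined rows
print(shuffled)  # five records were shuffled to produce them
```

The join itself only happens in the reduce step, so both inputs are moved in full even when the query is selective.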
![Page 5: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/5.jpg)
New Approach
Store input dataset in HBase instead of plain HDFS
Process the join in the Map phase to avoid unnecessary data shuffling
Expected benefits
◦ No costly Shuffle & Sort phase
◦ I/O reduction due to HBase indexes
Expected drawbacks
◦ Communication overhead
◦ Significantly higher RAM consumption
◦ Not ideal for high-output queries
[Figure: architecture of the RDF Management System. Components: SPARQL BGP query, Query Processor, Native Query Engine, MapReduce, HBase, HDFS, Triple Loader, RDF Graph]
![Page 6: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/6.jpg)
RDF Storage in HBase
Store RDF in a NoSQL data store
![Page 7: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/7.jpg)
What is HBase (Not)?
Clone of Google's Bigtable
◦ Column-oriented, semi-structured NoSQL data store
◦ Distributed over many machines
◦ Layered on top of HDFS (Hadoop Distributed File System): files are split into blocks (e.g. 64 MB) and replicated across machines, which makes it tolerant of machine failure
◦ Adds random data access to HDFS in "close to real-time"
◦ Strictly consistent!
Not a relational query engine
◦ Not designed for normalized schemas
◦ No join operators
◦ No expressive query language like SQL
![Page 8: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/8.jpg)
HBase Data Model
Sparse, distributed, sorted, multidimensional map
◦ Indexed by row key
◦ Values can have multiple versions, identified via timestamps
◦ Columns are grouped into column families
◦ Tables are dynamically split into regions
◦ Every region is assigned to exactly one Region Server
Access Pattern: (Table, RowKey, Family, Column, Timestamp) → Value
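The access pattern above can be pictured as a nested, sorted map. The following sketch is an in-memory stand-in for a single table, assuming nothing about the real HBase client API: `ToyTable` and its methods are hypothetical names used only for illustration.

```python
class ToyTable:
    """In-memory stand-in for one HBase table (not the HBase client API)."""

    def __init__(self):
        # rowkey -> {(family, column) -> {timestamp -> value}}
        # Sparse: a cell that was never written takes no space at all.
        self.rows = {}

    def put(self, rowkey, family, column, timestamp, value):
        """Store one version of a cell."""
        cell = self.rows.setdefault(rowkey, {}).setdefault((family, column), {})
        cell[timestamp] = value

    def get(self, rowkey, family, column, timestamp=None):
        """Return the requested version, or the newest one if no timestamp is given."""
        versions = self.rows.get(rowkey, {}).get((family, column), {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def scan(self, start=None, stop=None):
        """Yield row keys in sorted order within [start, stop)."""
        for rowkey in sorted(self.rows):
            if (start is None or rowkey >= start) and (stop is None or rowkey < stop):
                yield rowkey


table = ToyTable()
table.put("Article2", "p", "year", 1, "2011")
table.put("Article1", "p", "year", 1, "2010")
table.put("Article1", "p", "year", 2, "2011")  # newer version of the same cell

print(table.get("Article1", "p", "year"))      # newest version
print(table.get("Article1", "p", "year", 1))   # explicit older version
print(list(table.scan()))                      # rows come back sorted by key
```

Because rows are kept sorted by key, a contiguous key range can be handed to one region, which is the idea behind splitting tables into regions.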
![Page 9: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/9.jpg)
RDF Storage by Example (1)

Ts_po:

| rowkey | family:column → value |
|---|---|
| Article1 | p:title → {"PigSPARQL"}, p:year → {"2011"}, p:author → {Alex, Martin} |
| Article2 | p:title → {"RDFPath"}, p:year → {"2011"}, p:author → {Martin, Alex}, p:cite → {Article1} |

To_ps:

| rowkey | family:column → value |
|---|---|
| Alex | p:author → {Article1, Article2} |
| Article1 | p:cite → {Article2} |
| . . . | . . . |
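The two tables in this example can be derived mechanically from a list of triples: Ts_po uses the subject as row key, To_ps uses the object. The sketch below builds both as nested dictionaries; `build_tables` is a hypothetical helper, and plain Python sets stand in for the multi-valued cells.

```python
from collections import defaultdict


def build_tables(triples):
    """Build subject-keyed (Ts_po) and object-keyed (To_ps) indexes from triples."""
    ts_po = defaultdict(lambda: defaultdict(set))  # subject -> predicate -> objects
    to_ps = defaultdict(lambda: defaultdict(set))  # object  -> predicate -> subjects
    for s, p, o in triples:
        ts_po[s][p].add(o)
        to_ps[o][p].add(s)
    return ts_po, to_ps


triples = [
    ("Article1", "p:title", "PigSPARQL"),
    ("Article1", "p:year", "2011"),
    ("Article1", "p:author", "Alex"),
    ("Article1", "p:author", "Martin"),
    ("Article2", "p:title", "RDFPath"),
    ("Article2", "p:year", "2011"),
    ("Article2", "p:author", "Martin"),
    ("Article2", "p:author", "Alex"),
    ("Article2", "p:cite", "Article1"),
]
ts_po, to_ps = build_tables(triples)

print(sorted(ts_po["Article1"]["p:author"]))  # objects of Article1 for p:author
print(sorted(to_ps["Alex"]["p:author"]))      # subjects that have Alex as author
print(sorted(to_ps["Article1"]["p:cite"]))    # subjects that cite Article1
```

Storing every triple twice costs space, but it means a pattern with a bound subject or a bound object can be answered by a row-key lookup.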
![Page 10: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/10.jpg)
Triple Pattern Matching

| pattern | table | filter |
|---|---|---|
| (s, p, o) | Ts_po or To_ps | column & value |
| (?s, p, o) | To_ps | column |
| (s, ?p, o) | Ts_po or To_ps | value |
| (s, p, ?o) | Ts_po | column |
| (?s, ?p, o) | To_ps | |
| (?s, p, ?o) | Ts_po or To_ps (SCAN) | column |
| (s, ?p, ?o) | Ts_po | |
| (?s, ?p, ?o) | Ts_po or To_ps (SCAN) | |

The filters are server-side filters.
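The pattern table can be read as a small decision rule: a bound subject or object becomes the row key, a bound predicate becomes a column filter, and a value filter is needed only when subject and object are both bound. The sketch below encodes that reading. `plan_pattern` is a hypothetical name, the label "lookup" for non-scan access is an assumption of this sketch, and where the slide allows either table the sketch simply picks Ts_po.

```python
def is_variable(term):
    """Variables are written with a leading '?', as on the slide."""
    return term.startswith("?")


def plan_pattern(s, p, o):
    """Return (table, access, filters) for one triple pattern."""
    s_bound = not is_variable(s)
    p_bound = not is_variable(p)
    o_bound = not is_variable(o)

    if s_bound:
        # Subject is known: use it as row key in Ts_po
        # (To_ps would also work when the object is bound too).
        table, access = "Ts_po", "lookup"
    elif o_bound:
        # Only the object is known: use it as row key in To_ps.
        table, access = "To_ps", "lookup"
    else:
        # Neither row key is known: the whole table has to be scanned.
        table, access = "Ts_po", "SCAN"

    filters = []
    if p_bound:
        filters.append("column")  # restrict to the predicate's column
    if s_bound and o_bound:
        filters.append("value")   # check the term that is not the row key
    return table, access, filters


print(plan_pattern("Article1", "p:author", "?author"))
print(plan_pattern("?article", "p:author", "Alex"))
print(plan_pattern("?article", "p:author", "?author"))
```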
![Page 11: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/11.jpg)
MAPSIN Join
Map-Side Index Nested Loop Join
![Page 12: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/12.jpg)
Map-Side Joins in MapReduce
Map-Side (Merge) Join
◦ Input datasets must be:
1. Divided into the same number of partitions
2. Sorted by the same key (the join key)
3. Partitioned so that all records of a particular key reside in the same partition
◦ Problem: these requirements must also be fulfilled for subsequent iterations
Broadcast Join
◦ One dataset small enough to be distributed to each node
◦ Problem: Not feasible for big datasets
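As an illustration of the broadcast join mentioned on this slide, the following single-process sketch keeps the small dataset in memory and lets each split of the big dataset probe it. `broadcast_join` is a hypothetical helper, not a Hadoop API, and the data reuses the article example.

```python
def broadcast_join(big_splits, small):
    """Join each split of the big dataset against a small in-memory dataset."""
    # The small dataset is copied to every node and turned into a hash table.
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)

    joined = []
    for split in big_splits:        # each split would be handled by its own map task
        for key, value in split:
            for match in lookup.get(key, []):
                joined.append((key, value, match))
    return joined


authors = [
    [("Article1", "Alex"), ("Article1", "Martin")],
    [("Article2", "Martin"), ("Article2", "Alex")],
]
years = [("Article1", "2011"), ("Article2", "2011")]
result = broadcast_join(authors, years)
print(result)
```

No shuffle is needed, but the whole `small` input must fit into the memory of every node, which is the limitation the slide points out.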
![Page 13: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/13.jpg)
MAPSIN Join
SELECT *
WHERE {
?article title ?title .
?article author ?author .
?article year ?year
}
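The query above can be evaluated as a cascade of map-side index nested loop joins: the first pattern is matched by a scan, and every resulting mapping is extended inside the map function by an index lookup for the next pattern. The sketch below is a single-process illustration under stated assumptions. `TS_PO`, `index_get`, `scan_pattern` and `mapsin_join` are hypothetical names, and a dictionary lookup stands in for an HBase request.

```python
# Toy subject-keyed index (Ts_po) holding the article example.
TS_PO = {
    "Article1": {"title": {"PigSPARQL"}, "year": {"2011"}, "author": {"Alex", "Martin"}},
    "Article2": {"title": {"RDFPath"}, "year": {"2011"}, "author": {"Martin", "Alex"}},
}


def index_get(rowkey, column):
    """Stand-in for an index lookup on Ts_po with a column filter."""
    return TS_PO.get(rowkey, {}).get(column, set())


def scan_pattern(column, subject_var, object_var):
    """Match (?s, p, ?o) by scanning the table; the results are the map input."""
    for rowkey in sorted(TS_PO):
        for value in sorted(TS_PO[rowkey].get(column, ())):
            yield {subject_var: rowkey, object_var: value}


def mapsin_map(mapping, join_var, column, object_var):
    """Map function: probe the index with the join variable's binding."""
    for value in sorted(index_get(mapping[join_var], column)):
        # Merge the incoming mapping with the newly bound variable.
        yield {**mapping, object_var: value}


def mapsin_join(mappings, join_var, column, object_var):
    """One join iteration; no shuffle or reduce step is involved."""
    return [out for m in mappings for out in mapsin_map(m, join_var, column, object_var)]


step1 = list(scan_pattern("title", "?article", "?title"))
step2 = mapsin_join(step1, "?article", "author", "?author")
step3 = mapsin_join(step2, "?article", "year", "?year")
for row in step3:
    print(row)
```

Each iteration only needs the output of the previous one as map input, so joins can be chained without sorting or repartitioning in between.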
![Page 14: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/14.jpg)
Multiway Join Optimization
Query pattern:
?article title ?title
?article author ?author
?article year ?year
Corresponding HBase requests, each with a rowkey and a filter:
1. iteration: (Ts_po, article1, column=author) (Ts_po, article2, column=author)
2. iteration: (Ts_po, article1, column=year) (Ts_po, article2, column=year)
![Page 15: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/15.jpg)
![Page 16: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/16.jpg)
![Page 17: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/17.jpg)
Multiway Join Optimization
Query pattern:
?article title ?title
?article author ?author
?article year ?year
Corresponding HBase requests with two iterations:
1. iteration: (Ts_po, article1, column=author) (Ts_po, article2, column=author)
2. iteration: (Ts_po, article1, column=year) (Ts_po, article2, column=year)
Corresponding HBase requests with one iteration:
1. iteration: (Ts_po, article1, column=author) (Ts_po, article1, column=year) (Ts_po, article2, column=author) (Ts_po, article2, column=year)
4 requests!
![Page 18: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/18.jpg)
Multiway Join Optimization
Query pattern:
?article title ?title
?article author ?author
?article year ?year
Corresponding HBase requests with two iterations:
1. iteration: (Ts_po, article1, column=author) (Ts_po, article2, column=author)
2. iteration: (Ts_po, article1, column=year) (Ts_po, article2, column=year)
Corresponding HBase requests with one iteration:
1. iteration: (Ts_po, article1, column=author) (Ts_po, article1, column=year) (Ts_po, article2, column=author) (Ts_po, article2, column=year)
Corresponding HBase requests with one iteration and combined column filters:
1. iteration: (Ts_po, article1, column=author & column=year) (Ts_po, article2, column=author & column=year)
2 requests!
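The effect of the optimization can be checked by counting iterations and requests for the three variants shown on these slides. The sketch below only builds the request tuples and does not talk to HBase. The function names are hypothetical, and each request is modelled as (table, rowkey, columns).

```python
def cascade_requests(rowkeys, columns):
    """One iteration per pattern, one request per row key in each iteration."""
    return [[("Ts_po", rk, (col,)) for rk in rowkeys] for col in columns]


def multiway_requests(rowkeys, columns):
    """Single iteration, but still one request per (row key, column) pair."""
    return [[("Ts_po", rk, (col,)) for rk in rowkeys for col in columns]]


def multiway_row_requests(rowkeys, columns):
    """Single iteration, one request per row key with all column filters combined."""
    return [[("Ts_po", rk, tuple(columns)) for rk in rowkeys]]


def summarize(plan):
    """Return (number of iterations, total number of requests)."""
    return len(plan), sum(len(iteration) for iteration in plan)


rowkeys = ["article1", "article2"]   # bindings of ?article from the first pattern
columns = ["author", "year"]         # predicates of the remaining two patterns

print(summarize(cascade_requests(rowkeys, columns)))       # two iterations
print(summarize(multiway_requests(rowkeys, columns)))      # one iteration, 4 requests
print(summarize(multiway_row_requests(rowkeys, columns)))  # one iteration, 2 requests
```

Combining the filters works here because all patterns share the same join variable as row key, so one row holds everything the join needs.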
![Page 19: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/19.jpg)
Evaluation
Lehigh University Benchmark (LUBM)
![Page 20: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/20.jpg)
Evaluation Setup
Cluster Hardware
◦ 10 Dell PowerEdge R200 servers
◦ Dual Core 3.16 GHz CPU
◦ 8 GB RAM
◦ 3 TB hard disk
◦ Gigabit Network
Frameworks
◦ Hadoop 0.20.2 (CDH3)
◦ HBase 0.90.4
Datasets
◦ 1000 – 3000 LUBM universities
◦ ~ 210 – 630 million triples (after reasoning)
Master daemons (JobTracker, NameNode, HBase Master, ZooKeeper)
Slave daemons (TaskTracker, DataNode, HBase Region Server)
![Page 21: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/21.jpg)
LUBM Q1
Base Case (single join)
Linear scaling behavior for both approaches
MAPSIN performs 8 – 13 times faster than PigSPARQL
SELECT ?X
WHERE {
?X rdf:type ub:GraduateStudent .
?X ub:takesCourse <...GraduateCourse0>
}
[Figure: two plots of time in seconds against LUBM (# universities, 1000 to 3000) for PigSPARQL and MAPSIN, one with a logarithmic time axis (1 to 1000) and one with a linear time axis (0 to 1000)]
![Page 22: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/22.jpg)
LUBM Q4
General Case (sequence of joins), Multiway Join Optimization applicable
Linear scaling behavior for both approaches
MAPSIN performs up to 28 times faster than PigSPARQL
MAPSIN multiway join ~ 3 times faster than standard MAPSIN
SELECT ?X ?Y1 ?Y2 ?Y3
WHERE {
?X rdf:type ub:Professor .
?X ub:worksFor <...Department0.University0.edu> .
?X ub:name ?Y1 .
?X ub:emailAddress ?Y2 .
?X ub:telephone ?Y3
}
[Figure: two plots of time in seconds against LUBM (# universities, 1000 to 3000) for PigSPARQL, MAPSIN, PigSPARQL Multiway Join and MAPSIN Multiway Join, one with a logarithmic time axis (1 to 10000) and one with a linear time axis (0 to 3500)]
![Page 23: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/23.jpg)
Conclusion & Future Work
Conclusion
◦ MAPSIN joins are processed completely in Map phase
◦ MAPSIN joins are easily iterable in a sequence of joins (without auxiliary Shuffle & Reduce Phases)
◦ Multiway join optimization reduces the number of iterations and HBase requests
◦ Outperforms reduce-side joins (PigSPARQL) by an order of magnitude (depending on the query selectivity)
◦ Performance degrades with increasing query output
Future Work
◦ Improvements of the RDF storage schema
◦ Incorporate MAPSIN joins into PigSPARQL
![Page 24: Cascading Map-Side Joins over HBase for Scalable Join Processing](https://reader030.vdocument.in/reader030/viewer/2022032421/55a6946c1a28ab61148b456f/html5/thumbnails/24.jpg)
Thank you for your attention!