dataset skew resilience in main memory parallel hash joins · the rapid rise of big data and data...
TRANSCRIPT
Examples of Dataset Skew▪ Joins on non-primary key columns (duplicate
keys)
▪ Parallel multiway joins
▪ Compound queries (multiple joins and
aggregations)
▪ Many examples in real-world data
▪ The size of cities and the length and
frequency of words can be modeled with
Zipfian distributions
▪ Measurement errors and IQ scores often
follow Gaussian distributions
▪ Dataset skew can occur in attributes as well
as in the correlation between relations
Experiments▪ We generated large datasets (hundreds of millions of records) that
vary in terms of distribution, correlation, shuffling, and skew on the
build and probe tables
▪ We measured hash join execution time, with three different hash
table implementations
▪ Separate Chaining hash table (SC) is the original baseline
▪ Modified Separate Chaining hash table (MSC)
▪ Custom Cuckoo Hashing implementation (CCH)
The rapid rise of big data and data analytics are driving the need for greater algorithmic efficiency. Parallel main memory hash joins are prescribed to accelerate database
joins. However, the performance of these joins can be hindered by dataset skew. We tested four popular hash join algorithms on an extensive array of datasets. Our results
demonstrate that hash joins are acutely affected by dataset skew, and performance is further hampered if the data is unordered. To address these issues, we propose two
different hash tables. First, we use a separate chaining hash table that is based on an existing implementation that we have modified. This version outperforms the original
implementation on skewed datasets by up to three orders of magnitude. Second, we propose a novel hash table that can further improve performance by up to 17.3x.
Dataset Skew Resilience in Main Memory Parallel
Hash JoinsPuya Memarzia, Virendra Bhavsar, and Suprio Ray
Faculty of Computer Science, University of New Brunswick, Fredericton, New Brunswick, Canada
ABSTRACT
References: [1] Image from Wikimedia commons - https://commons.wikimedia.org/wiki/File:Zipf_distribution_CMF.png#/media/File:Zipf_distribution_CMF.png - CC-BY-SA 3.0 [2] By Inductiveload (self-made, Mathematica,
Inkscape) [Public domain], via Wikimedia Commons https://commons.wikimedia.org/w/index.php?curid=3817960
(a) Zipfian distribution [1]
MotivationIn-memory hash joins on datasets that are
skewed and/or shuffled are significantly
slower than ordered non-skewed datasets.
The performance penalty can be several
orders of magnitude.
▪ Throwing more hardware at the problem
is not always an option.
▪ The design and implementation of the
hash table is one of the main bottlenecks
▪ Join time deteriorates due to increased
memory lookups, lock contention, and
CPU cache and TLB misses
Hash Join ConfigurationsHash joins can leverage data parallelism to scale up on multicore processors. Hash join
configurations can be categorized based on their partitioning scheme (or lack thereof) and
parallel implementation.
1. No Partitioning (Nopart): one large lock-protected hash table shared by all threads
2. Shared Partitioning (Part-Share): multiple lock-protected partitions shared by all threads
3. Independent Partitioning (Part-Indep): private lock-less partitions for each thread
4. Radix Partitioning (Part-Radix): a hierarchy of partitions with dynamic load balancing
Figure 2. The performance impact of dataset skew and shuffling on
four hash join algorithms (note that y axis is log scale)
0
0
0
0
0
0
0
0
1
10
100
1,000
10,000
100,000
1,000,000
Nopart Part-Share Part-Indep Part-Radix
Ru
nti
me
(C
PU
Cyl
ces)
-lo
g 10
scal
e
Hash Join Configuration
Non-skewed Ordered Non-skewed Shuffled Skewed Shuffled1014
1013
1012
1011
1010
109
108
107
106
105
104
103
102
101
1
Figure 1. Example of a hash join between two corresponding tables or partitions from each table – using a separate chaining hash table
Figure 4. Original (SC) versus modified hash
table (MSC) on skewed dataset
(b) Gaussian distribution [2]
Figure 3. Cumulative distribution function of data distributions that
mimic some skewed datasets
0
20
40
60
80
100
120
140
160
Intel Skylake Intel Harpertown AMD K8
Ru
nti
me (
CP
U C
ycle
s)
Billio
ns
CPU Architecture
Nopart_CCH
Nopart_MSC
Part-Share_MSC
Part-Indep_MSC
Part-Radix_MSC
0
0
0
0
0
0
0
0
1
10
100
1,000
10,000
100,000
1,000,000
Nopart Part-Share Part-Indep Part-Radix
Ru
nti
me
(C
PU
Cyl
ces)
-lo
g 10
scal
e
Hash Join Configuration
SC MSC1014
1013
1012
1011
1010
109
108
107
106
105
104
103
102
101
1
Conclusion• Hash table skew resilience is an important feature to consider when
designing parallel hash join implementations
• Our approach improves hash join times across a variety of datasets
and architectures, allowing for more work with the same resources
• The dataset and CPU can help determine the most efficient method
• Applications such as machine learning, analytics, graph processing,
and data warehousing, could benefit from skew resilient algorithms
.
.
.
Financial
40
IT
20
HR
30
Hash FunctionBuild Relation Hash Table
dept_part0
desc deptno
Marketing 10
IT 20
HR 30
Financial 40
Probe Relation
emp_part0
ename deptno
Gary 10
Dulley 10
Reiter 30
Taylor 20
Prevatt 30
Marketing
10
NULL
Hash Function
Gary, Marketing
Dulley, Marketing
Output
.
.
....
Original Data
(optional)
0
50
100
150
200
250
300
350
400
Cycle
s p
er
ou
tpu
t tu
ple
Configuration
Part Build Probe
0
50
100
150
200
250C
ycle
s p
er
ou
tpu
t tu
ple
Configuration
Part Build Probe
(b) Shuffled dataset
Figure 6. Breakdown of hash join phases on Zipf-skewed datasets
(a) Ordered dataset
Figure 5. Evaluation of three different CPU
architectures on shuffled dataset
Sponsored by: