Exploiting Multithreaded Architectures to Improve Data Management Operations
Layali RashidThe Advanced Computer Architecture Group @ U of C
(ACAG)Department of Electrical and Computer Engineering
University of Calgary
2
Outline
The SMT and the CMP Architectures
Join (Hash Join): Motivation, Algorithm, Results
Sort (Radix and Quick Sorts): Motivation, Algorithms, Results
Index (CSB+-Tree): Motivation, Algorithm, Results
Conclusions
3
The SMT and the CMP Architectures
Simultaneous Multithreading (SMT): multiple threads run simultaneously on a single processor.
Chip Multiprocessor (CMP): more than one processor core is integrated on a single chip.
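As an aside not taken from the slides: software can query the number of logical processors that SMT and CMP hardware expose and size its worker pool accordingly. A minimal sketch using Python's standard library (`os.cpu_count` and `ThreadPoolExecutor` are stand-ins for whatever threading facility a DBMS would actually use):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, workers=None):
    """Split `data` into one chunk per logical processor and sum the chunks in parallel."""
    workers = workers or os.cpu_count() or 1
    chunk = max(1, len(data) // workers)
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, parts))

print(parallel_sum(list(range(1000))))  # 499500
```

On an SMT core the threads share one set of functional units; on a CMP each thread can run on its own core. The pool-sizing idiom is the same in both cases.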
4
Hash Join Motivation
[Chart: L2 cache load miss rate (0%–70%) vs. tuple size, 20–140 bytes]
Hash join is one of the most important operations commonly used in current commercial DBMSs.
The L2 cache load miss rate is a critical factor in main-memory hash join performance.
Our goal: increase the level of parallelism in hash join.
[Chart: L1 cache load miss rate (4.4%–5.4%) vs. tuple size, 20–140 bytes]
[Chart: trace cache miss rate (0.00%–0.16%) vs. tuple size, 20–140 bytes]
5
Architecture-Aware Hash Join (AA_HJ)
Build Index Partition Phase: tuples are divided equally between threads; each thread has its own set of L2-cache-sized clusters.
Build and Probe Index Partition Phase: one thread builds a hash table from each key range, while the other threads index-partition the probe relation as in the previous phase.
Probe Phase: see figure.
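The partition/build/probe flow above can be sketched as follows. This is a simplified illustration, not the thesis code: partition count, key extraction, and helper names are assumptions, and the cache-sized clustering and group prefetching of AA_HJ are omitted.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4  # AA_HJ sizes these clusters to the L2 cache

def partition(relation, key_of):
    """Hash-partition tuples so matching keys land in the same partition."""
    parts = defaultdict(list)
    for tup in relation:
        parts[hash(key_of(tup)) % NUM_PARTITIONS].append(tup)
    return parts

def join_partition(build_part, probe_part):
    """Build a hash table from one partition of R, then probe it with S."""
    table = defaultdict(list)
    for key, payload in build_part:
        table[key].append(payload)
    return [(key, b, p) for key, p in probe_part for b in table[key]]

def hash_join(r, s):
    rp = partition(r, lambda t: t[0])
    sp = partition(s, lambda t: t[0])
    # partitions are independent, so threads can join them concurrently
    with ThreadPoolExecutor() as pool:
        results = pool.map(
            lambda i: join_partition(rp.get(i, []), sp.get(i, [])),
            range(NUM_PARTITIONS))
    return [row for part in results for row in part]

r = [(1, "a"), (2, "b")]
s = [(1, "x"), (3, "y"), (1, "z")]
print(sorted(hash_join(r, s)))  # [(1, 'a', 'x'), (1, 'a', 'z')]
```

Because each thread works on its own partition pair, the probe phase needs no locking on the hash tables.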
6
AA_HJ Results
[Chart: execution time (seconds) vs. tuple size, 20–140 bytes, for PT, NPT, Index PT, and AA_HJ with 2–16 threads]
We achieve speedups ranging from 2 to 4.6 compared to PT on a quad Intel Xeon dual-core server.
Speedups on the Pentium 4 with HT range from 2.1 to 2.9 compared to PT.
7
Memory-Analysis for Multithreaded AA_HJ
[Chart: L2 load miss rate (0%–70%) vs. tuple size for NPT and AA_HJ with 2–16 threads]
The decrease in the L2 load miss rate is due to cache-sized index partitioning, constructive cache sharing, and group prefetching. There is a minor increase in the L1 data cache load miss rate, from 1.5% to 4%.
[Chart: L1 load miss rate (3%–10%) vs. tuple size for NPT and AA_HJ with 2–16 threads]
8
The Sort Motivation
Some researchers find that sort algorithms suffer from high L2 cache miss rates, whereas others point out that radix sort has high TLB miss rates.
In addition, the fact that most sort algorithms are sequential makes it hard to derive efficient parallel sort algorithms.
In our work we target radix sort (a distribution-based sort) and quick sort (a comparison-based sort).
9
Our Parallel Sorts
Radix Sort:
A hybrid of Partition Parallel Radix Sort and Cache-Conscious Radix Sort.
Repartition large destination buckets only when they are significantly larger than the L2 cache.
Quick Sort:
Based on Fast Parallel Quick Sort.
Dynamically balance the load across threads.
Improve thread parallelism during the sequential clean-up sorting.
Stop the recursive partitioning when the subarray size is approximately the largest cache size.
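The cache-size cutoff for quick sort can be sketched as below. This is an illustrative sketch, not the thesis implementation: the cutoff constant, the recursion depth limit, and the use of Python's built-in sort for small runs are all assumptions.

```python
import threading

CUTOFF = 1 << 15  # stop recursing once a subarray is near the largest cache size (illustrative)

def parallel_quicksort(a, depth=2):
    """Sort list `a`; spawn a helper thread for one side of the first `depth` splits."""
    if len(a) <= 1:
        return a
    if len(a) <= CUTOFF or depth == 0:
        a.sort()  # subarray fits in cache: finish sequentially
        return a
    pivot = a[len(a) // 2]
    lt = [x for x in a if x < pivot]
    eq = [x for x in a if x == pivot]
    gt = [x for x in a if x > pivot]
    # sort the two sides in parallel, then stitch the result back
    t = threading.Thread(target=parallel_quicksort, args=(lt, depth - 1))
    t.start()
    parallel_quicksort(gt, depth - 1)
    t.join()
    a[:] = lt + eq + gt
    return a
```

The depth limit plays the role of load balancing across a small number of hardware threads; a production version would hand idle subarrays to a work queue instead.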
10
The Sort Timing for the Random Datasets on the SMT Architecture
Radix sort and quick sort show low L1 and L2 cache miss rates on our machines, but radix sort has a DTLB store miss rate of up to 26%.
Radix sort achieves only a slight speedup on SMT architectures, not exceeding 3%, due to its CPU-intensive nature.
Improvements in execution time for quick sort are about 25% to 30%.
[Chart: Quick Sort time (seconds) vs. number of keys for 1 and 2 threads]
[Chart: Radix Sort time (seconds) vs. number of keys for LSB and 1–2 threads]
11
The Sort Timing for the Random Datasets on the CMP Architecture
[Chart: Radix Sort time (seconds) vs. number of keys for LSB and 1–16 threads]
[Chart: Quick Sort time (seconds) vs. number of keys for 1–16 threads]
Our speedups for radix sort range from 54% with two threads up to 300% as the thread count grows from 2 to 8. Our speedups for quick sort range from 34% to 417%.
12
The Index Motivation
Although the CSB+-tree achieves a significant speedup over the B+-tree, experiments show that a large fraction of its execution time is still spent waiting for data.
The L2 load miss rate for the single-threaded CSB+-tree is as high as 42%.
13
Dual-threaded CSB+-Tree
One CSB+-tree.
A single thread performs the bulkloading.
Two threads perform the probing.
Unlike inserts and deletes, search needs no synchronization since it involves only reads.
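The lock-free probing idea can be sketched as follows. This is a stand-in, not the thesis code: a flat sorted array searched with `bisect` takes the place of the CSB+-tree, and the interleaved query split between two threads is an assumption.

```python
import bisect
import threading

class FlatIndex:
    """Read-only sorted index standing in for a CSB+-tree (illustrative)."""
    def __init__(self, keys):
        self.keys = sorted(keys)  # "bulkloaded" once, by a single thread
    def search(self, key):
        i = bisect.bisect_left(self.keys, key)
        return i < len(self.keys) and self.keys[i] == key

def probe(index, queries, hits, tid, nthreads):
    # each thread takes an interleaved share of the queries;
    # pure reads on the shared index need no locks
    for q in queries[tid::nthreads]:
        if index.search(q):
            hits[tid] += 1

idx = FlatIndex(range(0, 1000, 2))      # even keys only
queries = list(range(1000))
hits = [0, 0]
threads = [threading.Thread(target=probe, args=(idx, queries, hits, t, 2))
           for t in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(sum(hits))  # 500
```

Each thread writes only its own slot of `hits`, so the probe phase is synchronization-free end to end, mirroring the read-only-search argument above.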
14
Index Results
[Chart: execution time (seconds) vs. number of keys for the single- and dual-threaded CSB+-tree]
[Chart: L2 load miss rate (0%–45%) vs. number of keys for the single- and dual-threaded CSB+-tree]
Speedups for the dual-threaded CSB+-tree range from 19% to 68% compared to the single-threaded CSB+-tree.
For memory-bound operations, two threads provide more opportunities to keep the functional units busy.
Sharing one CSB+-tree between the two threads results in constructive cache behaviour and a 6%–8% reduction in the L2 miss rate.
15
Conclusions
State-of-the-art parallel architectures (SMT and CMP) have opened opportunities to improve software so that it better utilizes the underlying hardware resources.
It is essential to have efficient implementations of database operations.
We propose architecture-aware multithreaded algorithms for the most important database operations (joins, sorts, and indexes).
We characterize the timing and memory behaviour of these database operations.
16
The End
17
Backup Slides
18
Figure 1‑1: The SMT Architecture
19
Figure 1‑2: Comparison between the SMT and the Dual Core Architectures
20
Figure 1‑3: Combining the SMT and the CMP Architectures
21
Figure 2‑1: The L1 Data Cache Load Miss Rate for Hash Join
22
Figure 2‑2: The L2 Cache Load Miss Rate for Hash Join
23
Figure 2‑3: The Trace Cache Miss Rate for Hash Join
24
Figure 2‑4: Typical Relational Table in RDBMS
25
Figure 2‑5: Database Join
26
Figure 2‑6: Hash Equi-join Process
27
Figure 2‑7: Hash Table Structure
28
Figure 2‑8: Hash Join Base Algorithm
partition R into R0, R1, …, Rn-1
partition S into S0, S1, …, Sn-1
for i = 0 to n-1:
    use Ri to build hash-table_i
for i = 0 to n-1:
    probe Si using hash-table_i
29
Figure 2‑9: AA_HJ Build Phase Executed by one Thread
30
Figure 2‑10: AA_HJ Probe Index Partitioning Phase Executed by one Thread
31
Figure 2‑11: AA_HJ S-Relation Partitioning and Probing Phases
32
Figure 2‑12: AA_HJ Multithreaded Probing Algorithm
33
Table 2‑1: Machines Specifications
34
Table 2‑2: Number of Tuples for Machine 1
35
Table 2‑3: Number of Tuples for Machine 2
36
Figure 2‑13: Timing for three Hash Join Partitioning Techniques
37
Figure 2‑14: Memory Usage for three Hash Join Partitioning Techniques
38
Figure 2‑15: Timing for Dual-threaded Hash Join
39
Figure 2‑16: Memory Usage for Dual-threaded Hash Join
40
Figure 2‑17: Timing Comparison of all Hash Join Algorithms
41
Figure 2‑18: Memory Usage Comparison of all Hash Join Algorithms
42
Figure 2‑19: Speedups due to the AA_HJ+SMT and the AA_HJ+GP+SMT Algorithms
43
Figure 2‑20: Varying Number of Clusters for the AA_HJ+GP+SMT
44
Figure 2‑21: Varying the Selectivity for Tuple Size = 100Bytes
45
Figure 2‑22: Time Breakdown Comparison for the Hash Join Algorithms for tuple sizes 20Bytes and 100Bytes
46
Figure 2‑23: Timing for the Multi-threaded Architecture-Aware Hash Join
47
Figure 2‑24: Speedups for the Multi-Threaded Architecture-Aware Hash Join
48
Figure 2‑25: Memory Usage for the Multi-Threaded Architecture-Aware Hash Join
49
Figure 2‑26: Time Breakdown Comparison for Hash Join Algorithms
50
Figure 2‑27: The L1 Data Cache Load Miss Rate for NPT and AA_HJ
51
Figure 2‑28: Number of Loads for NPT and AA_HJ
52
Figure 2‑29: The L2 Cache Load Miss Rate for NPT and AA_HJ
53
Figure 2‑30: The Trace Cache Miss Rate for NPT and AA_HJ
54
Figure 2‑31: The DTLB Load Miss Rate for NPT and AA_HJ
55
Figure 3‑1: The LSD Radix Sort
for (i = 0; i < number_of_digits; i++)
    sort source-array based on digit i;
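The LSD loop above can be made concrete with one counting pass per digit. A minimal sketch using byte-sized digits (the digit width, function name, and bucket-list implementation are illustrative, not the thesis code):

```python
def lsd_radix_sort(keys, key_bytes=4):
    """Sort non-negative integers with one stable bucket pass per byte,
    from least to most significant."""
    for shift in range(0, key_bytes * 8, 8):
        buckets = [[] for _ in range(256)]
        for k in keys:
            buckets[(k >> shift) & 0xFF].append(k)
        # stable concatenation preserves order established by earlier digits
        keys = [k for b in buckets for k in b]
    return keys

print(lsd_radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# [2, 24, 45, 66, 75, 90, 170, 802]
```

The parallel variants discussed earlier distribute the keys (or the destination buckets) across threads, but the per-digit stable pass is the same core step.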
56
Figure 3‑2: The Counting LSD Radix Sort Algorithm
57
Figure 3‑3: Parallel Radix Sort Algorithm
58
Table 3‑1: Memory Characterization for LSD Radix Sort with Different Datasets
59
Figure 3‑4: Radix Sort Timing for the Random Datasets on Machine 2
60
Figure 3‑5: Radix Sort Timing for the Gaussian Datasets on Machine 2
61
Figure 3‑6: Radix Sort Timing for Zero Datasets on Machine 2
62
Figure 3‑7: Radix Sort Timing for the Random Datasets on Machine 1
63
Figure 3‑8: Radix Sort Timing for the Gaussian Datasets on Machine 1
64
Figure 3‑9: Radix Sort Timing for the Zero Datasets on Machine 1
65
Figure 3‑10: The DTLB Stores Miss Rate for the Radix Sort on Machine 2 (Random Datasets)
66
Figure 3‑11: The L1 Data Cache Load Miss Rate for the Radix Sort on Machine 2 (Random Datasets)
67
Table 3‑2: Memory Characterization for Memory-Tuned Quick Sort with Different Datasets
68
Figure 3‑12: Quicksort Timing for the Random Datasets on Machine 2
69
Figure 3‑13: Quicksort Timing for the Random Dataset on Machine 1
70
Figure 3‑14: Quicksort Timing for the Gaussian Datasets on Machine 2
71
Figure 3‑15: Quicksort Timing for the Gaussian Dataset on Machine 1
72
Figure 3‑16: Quicksort Timing for the Zero Datasets on Machine 2
73
Figure 3‑17: Quicksort Timing for the Zero Dataset on Machine 1
74
Table 3‑3: The Sort Results for Machine 1
75
Table 3‑4: The Sort Results for Machine 2
76
Figure 4‑1: Search Operation on an Index Tree
77
Figure 4‑2: Differences between the B+-Tree and the CSB+-Tree
78
Figure 4‑3: Dual-Threaded CSB+-Tree for the SMT Architectures
79
Figure 4‑4: Timing for the Single and Dual-Threaded CSB+-Tree
80
Figure 4‑5: The L1 Data Cache Load Miss Rate for the Single and Dual-Threaded CSB+-Tree
81
Figure 4‑6: The Trace Cache Miss Rate for the Single and Dual-Threaded CSB+-Tree
82
Figure 4‑7: The L2 Load Miss Rate for the Single and Dual-Threaded CSB+-Tree
83
Figure 4‑8: The DTLB Load Miss Rate for the Single and Dual-Threaded CSB+-Tree
84
Figure 4‑9: The ITLB Load Miss Rate for the Single and Dual-Threaded CSB+-Tree