6/1/2015 UNIVERSITY OF WISCONSIN
Toward GPUs being mainstream in analytic processing
An initial argument using simple scan-aggregate queries
Jason Power || Yinan Li || Mark D. Hill || Jignesh M. Patel || David A. Wood
DaMoN 2015
Summary
▪ GPUs are energy efficient
▪ Discrete GPUs unpopular for DBMS
▪ New integrated GPUs solve the problems
▪ Scan-aggregate GPU implementation
  ▪ Wide bit-parallel scan
  ▪ Fine-grained aggregate GPU offload
▪ Up to 70% energy savings over multicore CPU
  ▪ Even more in the future
Analytic Data is Growing
▪ Data is growing rapidly
▪ Analytic DBs increasingly important
Want: High performance
Need: Low energy
Source: IDC’s Digital Universe Study. 2012.
GPUs to the Rescue?
▪ GPUs are becoming more general
  ▪ Easier to program
  ▪ Integrated GPUs are everywhere
▪ GPUs show great promise [Govindaraju ’04, He ’14, Kaldewey ’12, Satish ’10, and many others]
  ▪ Higher performance than CPUs
  ▪ Better energy efficiency
▪ Analytic DBs look like GPU workloads
GPU Microarchitecture
[Diagram: GPU microarchitecture — the GPU comprises many compute units (CUs) sharing an L2 cache; each CU contains an instruction fetch/scheduler, an array of stream processors (SPs), a register file, an L1 cache, and a scratchpad]
Discrete GPUs
[Diagram: a CPU chip with cores and its own memory bus, connected over the PCIe bus to a discrete GPU with a separate memory bus]
Discrete GPUs
➊ Copy data over PCIe
  ▪ Low bandwidth
  ▪ High latency
➋ Small working memory
➌ High latency user→kernel calls
➍ Repeated many times

98% of time spent not computing
Integrated GPUs
[Diagram: a single heterogeneous chip containing both CPU cores and GPU CUs, sharing one memory bus]
Heterogeneous System Arch.

▪ API for tightly-integrated accelerators
▪ No need for data copies
  ▪ Cache coherence and shared address space
▪ No OS kernel interaction
  ▪ User-mode queues
▪ Industry support
  ▪ Initial hardware support today
  ▪ HSA Foundation (AMD, ARM, Qualcomm, others)
Outline
▪ Background
▪ Algorithms
  ▪ Scan
  ▪ Aggregate
▪ Results
Analytic DBs
▪ Resident in main-memory
▪ Column-based layout
▪ WideTable & BitWeaving [Li and Patel ’13 & ’14]
  ▪ Convert queries to mostly scans by pre-joining tables
  ▪ Fast scans using sub-word parallelism
  ▪ Similar to industry proposals [SAP HANA, Oracle Exalytics, IBM DB2 BLU]
▪ Scan-aggregate queries
Running Example
ShirtColor  ShirtAmount
Green       1
Green       3
Blue        1
Green       5
Yellow      7
Red         2
Yellow      1
Blue        4
Yellow      2

Color   Code
Red     0
Blue    1
Green   2
Yellow  3

Encoded ShirtColor: 2 2 1 2 3 0 3 1 3
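The encoding step above can be sketched in a few lines of Python; the `CODES` mapping and `encode_column` helper are illustrative names, not from the paper:

```python
# Dictionary-encode a column of shirt colors into 2-bit integer codes,
# matching the running example (Red=0, Blue=1, Green=2, Yellow=3).
CODES = {"Red": 0, "Blue": 1, "Green": 2, "Yellow": 3}

def encode_column(values):
    """Map each column value to its dictionary code."""
    return [CODES[v] for v in values]

colors = ["Green", "Green", "Blue", "Green", "Yellow",
          "Red", "Yellow", "Blue", "Yellow"]
encoded = encode_column(colors)
print(encoded)  # [2, 2, 1, 2, 3, 0, 3, 1, 3]
```

With 2-bit codes, each value occupies far less space than the original string, which is what makes the bit-parallel scans on the following slides possible.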
Running Example
Count the number of green shirts in the inventory
➊ Scan the color column for green (code 2)
➋ Aggregate amount where there is a match
Traditional Scan Algorithm

ColumnData (ShirtColor): 2 (10), 2 (10), 1 (01), 2 (10), 3 (11), 0 (00), 3 (11), 1 (01), 3 (11)
CompareCode (Green): 10
ResultBitVector: 1 1 0 1 0 0 0 0 0
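The code-at-a-time scan above can be sketched as a minimal Python loop (an illustration of the idea, not the paper's implementation):

```python
# Naive scan: compare each 2-bit color code against the predicate
# constant and emit one result bit per row.
def scan_eq(column, code):
    """Return a bit per row: 1 where the row's code equals `code`."""
    return [1 if c == code else 0 for c in column]

column = [2, 2, 1, 2, 3, 0, 3, 1, 3]  # encoded ShirtColor
bits = scan_eq(column, 2)             # compare against Green (code 2)
print(bits)  # [1, 1, 0, 1, 0, 0, 0, 0, 0]
```

This processes one code per iteration; the next slides show how BitWeaving packs many codes into one machine word so a single instruction tests them all.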
Vertical Layout

Color codes: c0 2 (10), c1 2 (10), c2 1 (01), c3 2 (10), c4 3 (11), c5 0 (00), c6 3 (11), c7 1 (01), c8 3 (11), c9 0 (00)

word  c0 c1 c2 c3 c4 c5 c6 c7
w0     1  1  0  1  1  0  1  0
w1     0  0  1  0  1  0  1  1

word  c8 c9
w2     1  0
w3     1  0

In memory: 11011010 00101011 10000000 ...
CPU BitWeaving Scan

ColumnData:      11011010 00101011 10000000
CompareCode:     11111111 00000000 (the code 10, each bit splatted across a bit-plane word)
ResultBitVector: 11010000 0000...

CPU width: 64 bits, up to 256-bit SIMD
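The word-at-a-time equality scan over the bit planes can be sketched like this (an illustrative Python version of the idea, using 8-bit words and a hypothetical `bitweaving_eq` helper; a real scan uses 64-bit or SIMD words):

```python
WORD_MASK = 0xFF  # 8 one-bit lanes for readability

def bitweaving_eq(planes, code, code_width=2):
    """Equality scan: planes[i] is the i-th bit plane, MSB plane first."""
    result = WORD_MASK
    for i, plane in enumerate(planes):
        bit = (code >> (code_width - 1 - i)) & 1
        splat = WORD_MASK if bit else 0          # broadcast code bit to all lanes
        result &= ~(plane ^ splat) & WORD_MASK   # keep lanes where the bits agree
    return result

w0, w1 = 0b11011010, 0b00101011   # bit planes of codes [2,2,1,2,3,0,3,1]
match = bitweaving_eq([w0, w1], 2)  # compare against Green (binary 10)
print(f"{match:08b}")  # 11010000
```

Every lane of the word is tested with a handful of bitwise instructions, which is exactly the sub-word parallelism that widens from 64 CPU lanes to thousands of GPU lanes.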
GPU BitWeaving Scan
ColumnData:      11011010 00101011 10000000
CompareCode:     11111111 11111111 11111111
ResultBitVector: 11010000 0000...

GPU width: 16,384-bit SIMD
GPU Scan Algorithm
▪ GPU uses very wide “words”
  ▪ CPU: 64 bits, or 256 bits with SIMD
  ▪ GPU: 16,384 bits (256 lanes × 64 bits)
▪ Memory and caches optimized for bandwidth
▪ HSA programming model
  ▪ No data copies
  ▪ Low CPU-GPU interaction overhead
CPU Aggregate Algorithm

ShirtAmount: 1 3 1 5 7 2 1 4 2
ResultBitVector: 11010000 0000...
Result: 1 + 3 + 5 + ...
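The bit-vector-driven aggregation can be sketched as follows (illustrative helper name; sequential Python standing in for the CPU loop):

```python
# Sum the ShirtAmount column at the rows where the scan's result
# bit-vector is set.
def aggregate_sum(amounts, bits):
    """Sum amounts at positions whose result bit is 1."""
    return sum(a for a, b in zip(amounts, bits) if b)

amounts = [1, 3, 1, 5, 7, 2, 1, 4, 2]  # ShirtAmount column
bits    = [1, 1, 0, 1, 0, 0, 0, 0, 0]  # scan matches for Green
print(aggregate_sum(amounts, bits))     # 9 (= 1 + 3 + 5)
```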
GPU Aggregate Algorithm
ResultBitVector: 11010000 0000...
Column Offsets: 0, 1, 3, ...
(On CPU)
GPU Aggregate Algorithm
Column Offsets: 0, 1, 3, ...
ShirtAmount: 1 3 1 5 7 2 1 4 2
Result: 1 + 3 + 5 + ...
(On GPU)
Aggregate Algorithm
▪ Two phases
  ▪ Convert from BitVector to offsets (on CPU)
  ▪ Materialize data and compute (offload to GPU)
▪ Two group-by algorithms (see paper)
▪ HSA programming model
  ▪ Fine-grained sharing
  ▪ Can offload a subset of computation
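The two-phase split can be sketched as below; `bits_to_offsets` and `gather_sum` are hypothetical names, and the GPU phase is simulated sequentially here:

```python
def bits_to_offsets(bits):
    """Phase 1 (on CPU): turn the result bit-vector into row offsets."""
    return [i for i, b in enumerate(bits) if b]

def gather_sum(amounts, offsets):
    """Phase 2 (GPU offload, simulated): gather matching rows and reduce."""
    return sum(amounts[i] for i in offsets)

bits = [1, 1, 0, 1, 0, 0, 0, 0, 0]      # scan result for Green
offsets = bits_to_offsets(bits)          # [0, 1, 3]
amounts = [1, 3, 1, 5, 7, 2, 1, 4, 2]   # ShirtAmount column
print(gather_sum(amounts, offsets))      # 9
```

Because HSA gives the GPU coherent access to the CPU's offsets array, phase 2 can start without copying data, which is what makes this fine-grained offload worthwhile.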
Outline
▪ Background
▪ Algorithms
▪ Results
Experimental Methods
▪ AMD A10-7850
  ▪ 4-core CPU
  ▪ 8-compute-unit GPU
  ▪ 16 GB capacity, 21 GB/s DDR3 memory
  ▪ Separate discrete GPU
▪ Watts-Up meter for full-system power
▪ TPC-H @ scale-factor 10
Scan Performance & Energy
Takeaway: Integrated GPU most efficient for scans
TPC-H Queries
Query 12 Performance · Query 12 Energy

Takeaway: Integrated GPU faster for both aggregate and scan computation
TPC-H Queries
Query 12 Performance · Query 12 Energy

More energy: decrease in latency does not offset power increase
Less energy: decrease in latency AND decrease in power
Future Die Stacked GPUs
▪ 3D die stacking
▪ Same physical & logical integration
▪ Increased compute
▪ Increased bandwidth

[Diagram: DRAM die-stacked on the GPU, alongside the CPU on the board]

Power et al. Implications of 3D GPUs on the Scan Primitive. SIGMOD Record, Volume 44, Issue 1, March 2015.
Conclusions
                  Discrete GPUs   Integrated GPUs   3D Stacked GPUs
Performance       High ☺          Moderate          High ☺
Memory Bandwidth  High ☺          Low ☹             High ☺
Overhead          High ☹          Low ☺             Low ☺
Memory Capacity   Low ☹           High ☺            Moderate
Questions?
HSA vs CUDA/OpenCL
▪ HSA defines a heterogeneous architecture
  ▪ Cache coherence
  ▪ Shared virtual addresses
  ▪ Architected queuing
  ▪ Intermediate language
▪ CUDA/OpenCL are a level above HSA
  ▪ Come with baggage
  ▪ Not as flexible
  ▪ May not be able to take advantage of all features
Scan Performance & Energy
Group-by Algorithms
All TPC-H Results
Average TPC-H Results
Average Performance · Average Energy
What’s Next?
▪ Developing a cost model for the GPU
  ▪ Using the GPU is just another algorithm to choose
  ▪ Evaluate exactly when the GPU is more efficient
▪ Future “database machines”
  ▪ GPUs are a good tradeoff between specialization and commodity
Conclusions
▪ Integrated GPUs viable for DBMS?
  ▪ Solve the problems with discrete GPUs
  ▪ (Somewhat) better performance and energy
▪ Looking toward the future...
  ▪ CPUs cannot keep up with bandwidth
  ▪ GPUs perfectly designed for these workloads