TRANSCRIPT
24.06.2011 DIMA – TU Berlin 1
Fachgebiet Datenbanksysteme und Informationsmanagement, Technische Universität Berlin
http://www.dima.tu-berlin.de/
GPU Processing in Database Systems
Max Heimel, Nikolaj Leischner,
Volker Markl, Michael Säcker
Agenda
■ Database Systems Background
■ Overview of GPU Technology
■ Architecture of a Hybrid CPU/GPU System
■ Database Operations on the GPU
■ Alternative Architectures
■ Research Challenges of Hybrid Architectures
■ Current GPU/CPU Hybrid DBMS Systems
Database Query
■ SQL is declarative: specifies what data is needed, not how to get it
SELECT DISTINCT o.name, a.driver
FROM owner o, car c, demographics d, accidents a
WHERE c.ownerid = o.id AND o.id = d.ownerid
  AND c.id = a.id
  AND c.make = 'Mazda' AND c.model = '323'
  AND o.country3 = 'EG' AND o.city = 'Cairo'
  AND d.age < 30;

Find the owner and driver of Mazda 323s that have been involved in accidents in Cairo, Egypt, where the driver was less than 30 years old.
Relational Database Operators
cars:
make    model   ownerid
Honda   Accord  1
Toyota  Camry   1
Mazda   323     2

owner:
ownerid  city
1        Berlin
2        Potsdam
3        Berlin

Selection/Filtering: σcity='Berlin'(owner)
ownerid  city
1        Berlin
3        Berlin

Projection: πcity(owner)
city
Berlin
Potsdam

Join: cars ⋈cars.ownerid=owner.ownerid owner
make    model   ownerid  city
Honda   Accord  1        Berlin
Toyota  Camry   1        Berlin
Mazda   323     2        Potsdam

Further important database operators: union, intersection, set difference, Cartesian cross product, grouping/aggregation, sort
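The slide's three operators can be sketched in a few lines of sequential Python, with lists of dicts standing in for tables (toy code for illustration only; table and column names follow the slide's example, and the projection removes duplicates, as the slide's result does):

```python
# Toy relational operators over rows represented as dicts.
owner = [
    {"ownerid": 1, "city": "Berlin"},
    {"ownerid": 2, "city": "Potsdam"},
    {"ownerid": 3, "city": "Berlin"},
]
cars = [
    {"make": "Honda",  "model": "Accord", "ownerid": 1},
    {"make": "Toyota", "model": "Camry",  "ownerid": 1},
    {"make": "Mazda",  "model": "323",    "ownerid": 2},
]

def select(rows, pred):
    # Selection: keep the rows that satisfy the predicate.
    return [r for r in rows if pred(r)]

def project(rows, cols):
    # Projection: keep only the listed columns, removing duplicates
    # (set semantics, as in the slide's example result).
    seen, out = set(), []
    for r in rows:
        t = tuple(r[c] for c in cols)
        if t not in seen:
            seen.add(t)
            out.append(dict(zip(cols, t)))
    return out

def join(left, right, key):
    # Equi-join on a column shared by both inputs.
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

berlin = select(owner, lambda r: r["city"] == "Berlin")
cities = project(owner, ["city"])
joined = join(cars, owner, "ownerid")
```

Each call reproduces one of the result tables shown above; a real DBMS runs the same logic over columnar, often compressed data with far more attention to memory layout.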
Query Execution Plan (QEP)
■ Bottom-up data flow graph
■ Strings together operators
■ Many QEPs for a query
  □ physical operators
  □ operator order
  □ pipelining vs. bulk transfer
■ Query optimization
  □ determines the "best" QEP
■ Cost model
  □ for each operator
  □ for operator combinations
Query:
SELECT o.name, a.driver
FROM owner o, car c, demographics d, accidents a
WHERE c.ownerid = o.id AND o.id = d.ownerid
  AND c.id = a.id
  AND c.make = 'Mazda' AND c.model = '323'
  AND o.country3 = 'EG' AND o.city = 'Cairo'
  AND d.age < 30;

Execution Plan:
[Operator tree: RETURN at the root over a nested-loop join (NLJN); one input is a SORT over a hash join (HSJN) that combines a SCAN of DEMOGRAPHICS with a second HSJN of index-based FETCHes (ISCAN) on CAR and OWNER; the other input is a SCAN/FETCH via ISCAN on ACCIDENTS. Each operator is annotated with its estimated cardinality (e.g., 605999 rows scanned from DEMOGRAPHICS, 7 rows in the final result).]
Agenda
■ Database Systems Background
■ Overview of GPU Technology
■ Architecture of a Hybrid CPU/GPU System
■ Database Operations on the GPU
■ Alternative Architectures
■ Research Challenges of Hybrid Architectures
■ Current GPU/CPU Hybrid DBMS Systems
A Case for GPU architectures
■ Challenge in processor architecture
  □ Moore's law is hitting a wall
    − power wall
    − memory wall
    − ILP wall
    − …
  □ Solution: parallel processing
■ GPGPUs offer massively parallel processing
  □ Many more cores
  □ High GFLOPS and memory bandwidth
  □ Lower cost per core
  □ Can be used to build and test massively parallel systems
GFLOPS & GB/s
■ GFLOPS & GB/s for different CPU & GPU chips
■ Beware: theoretical peak numbers
[Two charts (2006–2012), data from Wikipedia, Intel, AMD, Nvidia, IBM: memory bandwidth in GB/s (axis 0–160) and compute throughput in GFLOP/s (axis 0–600) for CPUs (Xeon E5320, X7460, W3540, X5680, E8870; Opteron 2360SE, 2435, 6180SE; POWER7 4.04 GHz; SPARC64 VIIIfx) and GPUs (FireStream 9270, 9370; Tesla C870, C1060, C2050).]
What is the difference? Look at a modern CPU…
■ Most die space devoted to control logic & caches
■ Maximize performance for arbitrary, sequential programs
[Die photo: AMD K8L; source: www.chip-architect.com]
And at a GPU…
■ Little control logic, a lot of execution logic
■ Maximize parallel computational throughput
[Die photo: ATI RV770; source: ixbtlabs.com]
GPU architecture
■ SIMD cores with small local memories (16–48 kB)
■ Shared high-bandwidth, high-latency DRAM (1–6 GB)
■ Multi-threading hides memory latency
■ Bulk-synchronous execution model
[Diagram: an x86 host attached to a GPU with several SIMD cores; each core is an array of ALUs with its own local memory, and all cores share one DRAM.]
GPUs & databases: opportunities
■ Most database operations are well-suited for massively parallel processing
■ Efficient algorithms for many database primitives exist
  □ scatter, gather/reduce, prefix sum, sort, …
■ Find ways to utilize the higher bandwidth & arithmetic throughput
GPUs & databases: challenges & constraints
■ Find a way to live with architectural constraints
  □ GPU-to-CPU bandwidth (PCIe) is smaller than CPU-to-RAM bandwidth
  □ Fast GPU memory is smaller than CPU RAM
■ Technical hurdles
  □ The GPU is a co-processor; it
    − needs the CPU to orchestrate work
    − cannot fetch data from storage devices itself
    − etc.
  □ GPU programming models (e.g., OpenCL, CUDA) are low-level
    − need architecture-specific tuning
    − lack database primitives
  □ Limited support for multi-tasking
    − requires extra work, such as aggregating operations into bulks
■ J. Owens: GPU architecture overview. In ACM SIGGRAPH 2007 courses (SIGGRAPH '07). ACM, New York, NY, USA, Article 2
■ H. Sutter: The Free Lunch is Over: A Fundamental Turn Toward Concurrency in Software, Dr. Dobb's Journal, 30(3), March 2005, http://www.gotw.ca/publications/concurrency-ddj.htm (Visited May 2011)
■ V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey: Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. SIGARCH Comput. Archit. News 38, 3 (June 2010), 451-460.
References & Further Reading
Agenda
■ Database Systems Background
■ Overview of GPU Technology
■ Architecture of a Hybrid CPU/GPU System
■ Database Operations on the GPU
■ Alternative Architectures
■ Research Challenges of Hybrid Architectures
■ Current GPU DBMS Systems
Hybrid CPU+GPU system
■ 1 or several (interconnected) multicore CPUs
■ 1 or several GPUs
[Diagram: two interconnected nodes; in each, a CPU with four RAM banks connects through an I/O hub to a GPU and an HDD.]
Bottlenecks
■ PCIe bandwidth & latency
■ GPU memory size
■ GPU has no direct access to storage & network
[Same two-node system diagram as above, with the PCIe link to the GPU, the GPU memory, and the storage/network paths as the bottlenecks.]
GDB: GPU query processing
■ Fully-fledged GPU query processor
  □ GPU hash indices, B+ trees, hash join, sort-merge join
  □ Handles data sizes larger than GPU memory, supports non-numeric data types
(B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, P. V. Sander: Relational query coprocessing on graphics processors)
GDB: results
■ 2x–7x speedup for compute-intensive operators if the data fits inside GPU memory
■ Mixed-mode operators provide no speedup
■ No speedup if the data does not fit into GPU memory
[Bar chart: performance of SQL queries in seconds (0–50) for a non-equi-join (NEJ) and an equi-join as hash join (EJ (HJ)), CPU-only vs. GPU-only.]
GPUTx: GPU OLTP query processing
■ Offload large bulks of OLTP queries to the GPU
■ The CPU generates a processing plan based on a transaction dependency graph; the GPU does the actual query processing
  □ System with 1 CPU + 1 GPU
  □ Data must fit inside GPU memory
  □ Only supports stored procedures
(B. He, J. Xu Yu: High-Throughput Transaction Executions on Graphics Processors)
GPUTx: results
■ Implementation based on H-Store
[Two bar charts of normalized throughput, CPU (4 cores) vs. GPUTx: TPC-B at scale factors 128, 256, 512, 1024, 2048 (axis 0–40) and TPC-C at scale factors 20, 40, 60, 80 (axis 0–25).]
■ B. He and J. Xu Yu: High-throughput transaction executions on graphics processors. Proc. VLDB Endow. 4, 5 (February 2011), 314-325.
■ B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander: Relational query coprocessing on graphics processors. ACM Trans. Database Syst. 34, 4, Article 21 (December 2009), 39 pages.
References & Further Reading
Challenges of GPU Programming
■ No dynamic memory allocation
■ Divergence
  □ Threads taking different code paths ⇒ serialization
■ Limited synchronization possibilities
  □ Only on block level
  □ Limited block size
Agenda
■ Database Systems Background
■ Overview of GPU Technology
■ Architecture of a Hybrid CPU/GPU System
■ Database Operations on the GPU
  □ Index Operations
  □ Compression
  □ Sorting
  □ Relational Operators
  □ Further Operations
■ Alternative Architectures
■ Research Challenges of Hybrid Architectures
■ Current GPU DBMS Systems
Opportunities and Challenges
► GPGPU architectures offer massively data-parallel processing
► "Fruit fly" architecture: today's GPU code may run on future multicore CPUs
► Bottlenecks (PCIe, local memory, disk and network access)
► Ensuring correctness while fully utilizing data parallelism is hard
► Synchronization mechanisms for read/write conflicts are limited
► Cost modeling is hard: some GPU details are unknown or change quickly
Primitives for DBMS Operator Implementations
■ Second-order functions on arrays
  □ Map
  □ Scatter (indexed write)
  □ Gather (indexed read)
  □ Prefix-scan (combines/reduces entries)
  □ Split (partitioning, e.g., hash or range)
  □ Sort (e.g., bitonic sort/merge sort, radix sort, quicksort)
■ Library of primitives (building blocks) for GPGPU query processing
(B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, P. V. Sander: Relational joins on graphics processors)
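As a rough illustration, four of these primitives can be written down sequentially in Python; on the GPU, each loop iteration would be handled by its own thread (the helper names are ours, not from the cited paper):

```python
def gather(values, idx):
    # Gather: out[i] = values[idx[i]] (indexed read).
    return [values[i] for i in idx]

def scatter(values, idx, size):
    # Scatter: out[idx[i]] = values[i] (indexed write).
    out = [None] * size
    for v, i in zip(values, idx):
        out[i] = v
    return out

def exclusive_scan(values):
    # Prefix-scan with '+': out[i] = sum of values[0..i-1].
    out, acc = [], 0
    for v in values:
        out.append(acc)
        acc += v
    return out

def split(values, num_parts, part_of):
    # Split: partition by a hash/range function, built from the
    # primitives above (histogram, exclusive scan, scatter).
    parts = [part_of(v) for v in values]
    counts = [0] * num_parts
    for p in parts:
        counts[p] += 1
    offsets = exclusive_scan(counts)   # start position of each partition
    pos = list(offsets)
    out = [None] * len(values)
    for v, p in zip(values, parts):
        out[pos[p]] = v
        pos[p] += 1
    return out, offsets
```

`split` shows how the primitives compose: a histogram plus an exclusive scan yields the scatter offsets for each partition, which is exactly the pattern GPU radix sort and hash partitioning build on.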
Indexing
■ Classic: B+ tree
■ The directory accelerates access by exploiting the fact that the data is sorted and organized hierarchically
■ Compression to map variable-length data to fixed size
  □ Dictionary encoding
■ CSS-tree: encodes directory and payload in an array; cache-aware
[Diagram: a tree whose inner nodes form the directory of keys and whose leaves hold the payload.]
Prefix Tree Index
■ The path of a key within the tree is defined by its absolute value
■ Split the key into equally sized sub-keys
■ Tree depth depends on key length
[Example: the INT key 7939 (decimal) = 0001 1111 0000 0011 (binary) is split into the 4-bit sub-keys 1, 15, 0, 3, which select the branch at each tree level.]
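For the slide's example key, splitting the 16-bit pattern 0001 1111 0000 0011 (decimal 7939) into 4-bit sub-keys yields the path 1 → 15 → 0 → 3. A minimal sketch (the helper function is hypothetical):

```python
def subkeys(key, bits_per_level, levels):
    # Split a key into fixed-width sub-keys, most significant first.
    # At each tree level, one sub-key selects the child slot, so the
    # tree depth is key_width / bits_per_level, independent of the data.
    mask = (1 << bits_per_level) - 1
    return [(key >> (bits_per_level * (levels - 1 - i))) & mask
            for i in range(levels)]

path = subkeys(7939, bits_per_level=4, levels=4)  # 0b0001_1111_0000_0011
```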
Speculative Hierarchical Index Traversal
■ Idea: group processing to exploit the faster local memory
■ Group computational trees into blocks
■ A tree block is executed by a GPU thread block
  □ uses shared memory for intermediate results
■ Generate the global result by traversing an internal result list
Database Compression
GPUs have
□ very limited memory capacity
□ high computation capabilities and bandwidth
⇒ Compression reduces the overhead of data transfer
⇒ Compression allows GPUs to address bigger problems

Usually, compression schemes are combined for higher compression ratios, e.g., an auxiliary scheme (RLE) with a main scheme (zero suppression).

Goal: carry operations (e.g., filtering) out on the compressed representation of the data.

Example: 1000, 1003, 1005 → 1000, +0003, +0002 → 1000, +3, +2

(W. Fang, B. He, Q. Luo: Database Compression on Graphics Processors, Proc. VLDB Endow. 3, 1-2 (September 2010), 670-680)
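The example line above (delta encoding, after which the small deltas can be stored in far fewer digits or bits) can be sketched as:

```python
def delta_encode(values):
    # Keep the first value, then store successive differences; for
    # sorted or slowly-changing columns the deltas are small and
    # compress well under a zero/width-suppression main scheme.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(encoded):
    # A running sum restores the original values.
    out = [encoded[0]]
    for d in encoded[1:]:
        out.append(out[-1] + d)
    return out

enc = delta_encode([1000, 1003, 1005])  # [1000, 3, 2], as on the slide
```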
Sorting with GPUs
■ Numerous GPU algorithms published
  □ bitonic sort, bucket sort, merge sort, quicksort, radix sort, sample sort, …
■ Existing work focuses on in-memory sorting with 1 GPU
■ State of the art: merge sort, radix sort
■ GPUs outperform multicore CPUs when the data transfer over PCIe is not considered
Radix Sort @ GPGPU
(D. Merrill, A. Grimshaw: Revisiting Sorting for GPGPU Stream Architectures)
[Dataflow diagram of one digit pass across four GPU cores: a bottom-level reduction phase in which each core bins its keys and accumulates per-digit counts, a top-level scan that combines the per-core histograms, and a bottom-level scan phase in which each core re-bins its keys, scans locally, and scatters them to their output positions.]
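Sequentially, one digit pass of such a radix sort is binning, an exclusive scan, and a stable scatter; the GPU version distributes the binning and scatter across cores and splices the scans together, but the per-digit logic is the same. A sketch (ours, not Merrill and Grimshaw's implementation):

```python
def radix_sort(keys, bits=4, key_width=32):
    # LSD radix sort: one binning + scan + scatter pass per digit.
    radix = 1 << bits
    for shift in range(0, key_width, bits):
        # Binning/reduction: histogram of the current digit.
        counts = [0] * radix
        for k in keys:
            counts[(k >> shift) & (radix - 1)] += 1
        # Exclusive prefix scan turns counts into scatter offsets.
        offsets, acc = [], 0
        for c in counts:
            offsets.append(acc)
            acc += c
        # Stable scatter of each key to its bucket position.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & (radix - 1)
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys
```

Stability of the scatter within each pass is what makes the digit-by-digit approach correct.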
GPU sorting performance
■ For GPU radix sort, the PCIe transfer dominates the running time
[Bar chart: throughput in million items per second (0–1200) for 32-bit keys and 32-bit key/value pairs, comparing GPU radix sort with PCIe transfer, GPU radix sort alone, and CPU radix sort.]
(D. Merrill, A. Grimshaw: Revisiting Sorting for GPGPU Stream Architectures; J. Wassenberg, P. Sanders: Faster Radix Sort via Virtual Memory and Write-Combining)
Example: Non-indexed Nested-Loop Join
1. Split the relations into blocks
2. Join the smaller blocks in parallel
[Diagram: relations R and S are partitioned into blocks R' and S'; each block pair is processed by one thread group (1,1) … (i,j), with threads 1 … T inside a group.]
Problem: to allocate memory for the result, the join has to be performed twice!
Also known as "fragment and replicate join" (symmetric/asymmetric). There is more: indexed NLJ, hash join, sort-merge join, shuffling rings, etc.
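The two-pass trick behind that problem can be sketched sequentially; each (R', S') block pair would map to one thread group, and the first pass only counts matches so that the exact output size is known before any result is written:

```python
def block_nlj(R, S, pred, block=2):
    # Pass 1: count matches per block pair. On the GPU this sizes the
    # output buffer, since device memory cannot be grown mid-kernel.
    blocks = [(i, j) for i in range(0, len(R), block)
                     for j in range(0, len(S), block)]
    counts = []
    for i, j in blocks:
        c = sum(1 for r in R[i:i+block] for s in S[j:j+block] if pred(r, s))
        counts.append(c)
    # Exclusive scan over the counts gives each block pair's write offset.
    offsets, acc = [], 0
    for c in counts:
        offsets.append(acc)
        acc += c
    # Pass 2: re-evaluate the join and scatter results to their offsets.
    out = [None] * acc
    for (i, j), off in zip(blocks, offsets):
        for r in R[i:i+block]:
            for s in S[j:j+block]:
                if pred(r, s):
                    out[off] = (r, s)
                    off += 1
    return out
```

The count pass plus exclusive scan is the standard GPU answer to the missing dynamic memory allocation mentioned earlier.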
Example: Aggregation
■ Brent-Kung circuit strategy
■ Only the upsweep phase is necessary, because only the final result is needed
■ Elements are permuted to minimize memory bank conflicts
■ A separate thread group combines the results of the blocks
[Diagram: a binary reduction tree over the inputs x0 … x15.]
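The upsweep over x0 … x15 amounts to log2(n) rounds of pairwise partial sums; within a round all additions are independent, so each round is one synchronized step of a thread block. A sketch (the bank-conflict-avoiding permutation is omitted, and a power-of-two input length is assumed):

```python
def upsweep_sum(xs):
    # Brent-Kung upsweep: after the round with stride s, index i
    # (for i = 2s-1, 4s-1, ...) holds the sum of the 2s elements
    # ending at i; the last slot ends up holding the full aggregate.
    data = list(xs)
    stride = 1
    while stride < len(data):
        for i in range(2 * stride - 1, len(data), 2 * stride):
            data[i] += data[i - stride]  # independent adds -> one parallel step
        stride *= 2
    return data[-1]
```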
There is more...
■ Map/Reduce
■ Regular expression matching
■ K-means clustering
■ Apriori frequent itemset mining
■ Exact string matching
■ …
■ J. Rao, K. A. Ross: Cache Conscious Indexing for Decision-Support in Main Memory, In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB '99), Malcolm P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, and Michael L. Brodie (Eds.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 78-89
■ C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. Brandt, P. Dubey: FAST: fast architecture sensitive tree search on modern CPUs and GPUs, ACM SIGMOD ‘10, 339-350, 2010
■ P. B. Volk, D. Habich, W. Lehner: GPU-Based Speculative Query Processing for Database Operations, First International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures, in conjunction with VLDB 2010
References & Further Reading: Indexing
■ D. G. Merrill and A. S. Grimshaw: Revisiting sorting for GPGPU stream architectures. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques (PACT '10). ACM, New York, NY, USA, 545-546
■ J. Wassenberg, P. Sanders: Faster Radix Sort via Virtual Memory and Write-Combining, CoRR 2010, http://arxiv.org/abs/1008.2849 (Visited May 2011)
■ N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey: Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In Proceedings of the 2010 international conference on Management of data (SIGMOD '10). ACM, New York, NY, USA, 351-362.
References & Further Reading: Sorting
■ B. He, K. Yang, R. Fang, M. Lu, N. Govindaraju, Q. Luo, P. Sander: Relational joins on graphics processors. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data (SIGMOD '08). ACM, New York, NY, USA, 511-524.
■ D. Merrill, A. Grimshaw: Parallel Scan for Stream Architectures. Technical Report CS2009-14, Department of Computer Science, University of Virginia. December 2009.
References & Further Reading: Rel. Operators
■ N. Cascarano, P. Rolando, F. Risso, R. Sisto: iNFAnt: NFA Pattern Matching on GPGPU Devices. SIGCOMM Comput. Commun. Rev. 40, 5, 20-26.
■ M.C. Schatz, C. Trapnell: Fast Exact String Matching on the GPU, Technical Report.
■ W. Fang, K. K. Lau, M. Lu, X. Xiao, C. K. Lam, P. Y. Yang, B. He, Q. Luo, P. V. Sander, K. Yang: Parallel Data Mining on Graphics Processors, Technical Report HKUST-CS08-07, Oct 2008.
■ B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang. 2008. Mars: a MapReduce framework on graphics processors. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques (PACT '08). ACM, New York, NY, USA, 260-269.
■ many more …
References & Further Reading: Further Ops
Agenda
■ Database Systems Background
■ Overview of GPU Technology
■ Architecture of a Hybrid CPU/GPU System
■ Database Operations on the GPU
■ Alternative Architectures
■ Research Challenges of Hybrid Architectures
■ Current GPU DBMS Systems
A blast from the past: DIRECT
■ Specialized database processors existed already in 1978
■ Back then, they could not keep up with the rise of commodity CPUs
■ Today is different: memory wall, ILP wall, power wall, …
[Diagram: a host and controller in front of several query processors, each paired with a memory module and connected to mass storage.]
(D. J. DeWitt: DIRECT - a multiprocessor organization for supporting relational data base management systems)
FPGAs
■ Massively parallel & reconfigurable processors
■ Lower clock speeds, but configuration for specific tasks makes up for this
■ Very difficult to program
■ Used in real-world appliances: Kickfire/Teradata, Netezza/IBM, …
[Diagram: an FPGA placed between the network/HDD and the CPU with its RAM.]
(J. Teubner, R. Mueller: FPGA: what's in it for a database?)
CPU/GPU hybrid (AMD Fusion)
■ CPU & GPU integrated on one die, with shared memory
■ Uses the GPU programming model (OpenCL)
■ No PCIe bottleneck, but only CPU-class memory bandwidth
[Diagram: x86 cores and SIMD-core arrays with local memories sharing one memory controller and RAM.]
(AMD Whitepaper: AMD Fusion Family of APUs)
There is more...
■ Network processors
  □ Many-core architectures
  □ Associative memory (constant-time search)
■ CELL
  □ Similar to a CPU/GPU hybrid: a CPU core plus several wide SIMD cores with local scratchpad memories
Common goals, common problems
■ Similar approaches
  □ Many-core, massively parallel
  □ Distributed on-chip memory
  □ Hope for better perf/$ and perf/watt than traditional CPUs
■ Similar difficulties
  □ Memory bandwidth is always a bottleneck
  □ Hard to program
    − parallelism
    − synchronization
    − low-level & architecture-specific
■ D. J. DeWitt: DIRECT - a multiprocessor organization for supporting relational data base management systems. In Proceedings of the 5th annual symposium on Computer architecture (ISCA '78). ACM, New York, NY, USA, 182-189.
■ R. Mueller and J. Teubner: FPGA: what's in it for a database?. In Proceedings of the 35th SIGMOD international conference on Management of data (SIGMOD '09), Carsten Binnig and Benoit Dageville (Eds.). ACM, New York, NY, USA, 999-1004.
■ AMD White paper: AMD Fusion family of APUs, http://sites.amd.com/us/Documents/48423B_fusion_whitepaper_WEB.pdf (Visited May 2011)
■ N. Bandi, A. Metwally, D. Agrawal, and A. El Abbadi: Fast data stream algorithms using associative memories. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data (SIGMOD '07). ACM, New York, NY, USA, 247-256.
■ B. Gedik, P. S. Yu, and R. R. Bordawekar: Executing stream joins on the cell processor. In Proceedings of the 33rd international conference on Very large data bases (VLDB '07). VLDB Endowment 363-374.
References & Further Reading
Agenda
■ Database Systems Background
■ Overview of GPU Technology
■ Architecture of a Hybrid CPU/GPU System
■ Database Operations on the GPU
■ Alternative Architectures
■ Research Challenges of Hybrid Architectures
■ Current GPU DBMS Systems
Architectural constraints
■ PCIe bottleneck
  □ Direct access to storage devices, network, etc.?
  □ Caching strategies for device memory
■ GPU memory size
  □ Deeper memory hierarchy (e.g., a few GB of fast GDDR + large "slow" DRAM)?
Performance portability challenges
■ Want forward scalability: performance should scale with the next generations of GPUs
  □ Existing work is often optimized for exactly one type of GPU chip
■ Need higher-level programming models
  □ Hide hardware details (processor count, SIMD width, local memory size, …)
■ Cost estimation of operators
Database-specific challenges
■ Big data volumes and limited memory
■ An execution plan consists of multiple operators
■ Where to execute each operator?
  □ Trade off the transfer time between CPU and GPU against the computational advantage
  □ Cost models for GPGPU, CPU, and hybrid execution
  □ Amdahl's law
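A toy cost model makes the operator-placement trade-off concrete (all constants and function names here are hypothetical; a real optimizer would calibrate per operator and device):

```python
def offload_wins(bytes_moved, pcie_gbps, cpu_secs, gpu_secs):
    # Sending an operator to the GPU only pays off if GPU compute time
    # plus the PCIe transfer it triggers still beats staying on the CPU.
    transfer_secs = bytes_moved / (pcie_gbps * 1e9)
    return gpu_secs + transfer_secs < cpu_secs

def amdahl_speedup(offloaded_fraction, operator_speedup):
    # Amdahl's law: the plan's overall speedup is capped by the
    # fraction of work that stays serial on the CPU.
    return 1.0 / ((1.0 - offloaded_fraction)
                  + offloaded_fraction / operator_speedup)
```

For example, a GPU operator that is 10x faster but covers only 90% of the plan's work caps the overall speedup near 5.3x, which is why transfer costs and the serial remainder dominate placement decisions.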
Agenda
■ Database Systems Background
■ Overview of GPU Technology
■ Architecture of a Hybrid CPU/GPU System
■ Database Operations on the GPU
■ Alternative Architectures
■ Research Challenges of Hybrid Architectures
■ Current GPU DBMS Systems
  □ MonetDB/CUDA
  □ ParStream
General Overview: Hybrid MonetDB
■ Joint project of TU Berlin and CWI
■ Goal: investigate GPGPU in the context of database systems
  □ We extend the open-source database system MonetDB using OpenCL
■ Why MonetDB?
  □ Open source
  □ Column-oriented in-memory database system
■ Why OpenCL?
  □ Open standard with support from many big players
  □ Not restricted to graphics cards
Hybrid MonetDB System Outline
■ Proof-of-concept implementation
  □ Supports only fixed-size columns fitting into device memory
  □ Implemented operators
    − Table scan
    − Nested-loop join
    − Aggregation
    − Group by
■ Implementation consists of:
  □ OpenCL context handling
  □ GPU memory management
  □ Integration into the MonetDB Assembly Language (MAL)
  □ Support for multiple GPUs
ParStream High-Performance Indexing
[Diagram keywords: Big Data, Partition, Index Compression, Parallel Execution.]
ParStream – Building Blocks
ParStream combines state-of-the-art database technologies with unique technologies:
■ Column store: fast data access for analytical processing
■ Custom query operators: SQL, JDBC, and a powerful C++ API enable fast query processing
■ In-memory technology: data and indices can stay in memory due to efficient compression
■ High-performance index: a unique index structure allows highly parallel execution of queries (patent pending)
■ S. Hong and H. Kim: An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. SIGARCH Comput. Archit. News 37, 3 (June 2009), 152-163.
■ L. Bic and R. L. Hartmann: AGM: a dataflow database machine. ACM Trans. Database Syst. 14, 1 (March 1989), 114-146.
■ H. Chafi, A. K. Sujeeth, K. J. Brown, H. Lee, A. R. Atreya, and K. Olukotun: A domain-specific approach to heterogeneous parallelism. In Proceedings of the 16th ACM symposium on Principles and practice of parallel programming (PPoPP '11). ACM, New York, NY, USA, 35-46.
■ http://www.dima.tu-berlin.de
■ http://www.monetdb.nl
■ http://www.parstream.com
References & Further Reading
Thank You
Danke · Merci · Grazie · Gracias · Obrigado
[The slide also shows "Thank You" in Japanese, Russian, Arabic, Traditional and Simplified Chinese, Hindi, Tamil, Thai, and Korean.]
Disclaimer & Attribution
■ The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
■ The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
■ NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
■ ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
■ AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.
■ The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.