
STUDENT PAPER: GPU Acceleration for SQL Queries on Large-Scale Distributed Systems

Linh Nguyen, Hampden-Sydney College

1 College Road, Hampden-Sydney, VA 23943 USA

[email protected]

Paul Hemler, Hampden-Sydney College

1 College Road, Hampden-Sydney, VA 23943 USA

[email protected]

ABSTRACT
General-purpose GPUs are powerful hardware with a number of applications in the realm of relational databases. We further extended a database framework designed to allow SQL queries to execute on the GPU. Our technique is novel in that it implements Dynamic Parallelism, a new feature in recent hardware, to accelerate SQL JOINs. Query execution results in a 1.25X speedup on average with respect to a previous method, also accelerated by GPUs, which employs a multi-dimensional CUDA grid.

More importantly, we divided the queries to run on multiple Blue Waters (BW) nodes to investigate the scalability of both SELECT and JOIN.

Keywords
GPGPU, BWSIP, CUDA, SQL, Distributed Computing

1. INTRODUCTION
A relational database management system (RDBMS) manages the organization, storage, access, security, and integrity of data. Manipulating an RDBMS requires the use of Structured Query Language (SQL), an industry-standard declarative language capable of performing very complex queries and aggregations over data sets. In an RDBMS, these data sets are organized in structured tables, and SQL serves as an intermediary between a client program and its databases. The explosive growth of various types of unstructured data has called for the next wave of innovation in data storage, management, and analysis, otherwise known as Big Data.

MapReduce presents one such innovation. Initially developed by Google, MapReduce is a programming paradigm widely used for data-intensive and large-scale data analysis applications [10]. The architecture is a simple abstraction that allows programmers to write a map function that processes key-value pairs associated with the input data. The reduce function then merges all the intermediate values associated with the same intermediate key. The programming model involves three phases: Map; Shuffle and Sort; Reduce.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright ©JOCSE, a supported publication of the Shodor Education Foundation Inc. DOI: https://doi.org/10.22369/issn.2153-4136/8/1/5

In some cases, a MapReduce framework has replaced a traditional SQL database, though the advantage of one over the other remains a debated topic. Both approaches are very general methods through which data can be processed. In fact, there are efforts to combine the two in order to ease programmers' learning curve. In particular, Apache Hive allows programmers to write queries in SQL-like fashion and then converts the queries to map/reduce [1]. Regardless of the specifics, SQL remains highly applicable in the Big Data era thanks to the power of its declarative syntax. As such, significant efforts have been made to improve the performance of SQL, targeting parallel heterogeneous hardware architectures, which incorporate many processor types in a single machine [5, 6, 7, 8, 11, 12, 13, 14, 16].

The SQL model of data organization fits a parallel execution model extremely well since different processors can work on different portions of a table simultaneously. Among relevant attempts, the most promising approach currently utilizes the newest generation of Graphics Processing Units (GPUs), which are implemented as massively parallel architectures designed for rapidly rendering complex and realistic graphical scenes. More importantly, several general-purpose software development frameworks for programming GPUs have become standard. Two such frameworks are the Compute Unified Device Architecture (CUDA) [19] and the Open Computing Language (OpenCL) [17]. These frameworks allow applications to simply and effectively harness the power of the GPU. In addition, they have also become a common method of accelerating many compute-intensive applications [9].

This project aims to extend a GPU-based database to allow for the execution of SQL queries on multiple GPUs contained within the Blue Waters (BW) supercomputer system [18]. The specific focus of this project targets two common but extremely important SQL queries: SELECT and JOIN. While most previous papers utilize parallel primitives, our implementation executes with an opcode virtual machine model. In short, the key contributions of this paper include:

• Dynamic Parallelism Approach - Earlier GPU hardware did not include the ability to launch new GPU functions directly from threads. The BW compute nodes provide newer GPUs with the Kepler Architecture, which includes this important feature, referred to as Dynamic Parallelism (DP). We employ DP to enhance hardware utilization in accelerating JOIN.

• Multi-GPU Configuration - To the best of our knowledge, all previous work has shown speedups on a single-GPU system. However, databases often reside on many compute nodes, and querying data incurs additional expensive communication costs. Since BW is a large-scale distributed system with thousands of nodes, it is particularly suitable for investigating this necessary aspect of both SELECT and JOIN queries.

Journal of Computational Science Education, Volume 8, Issue 1, ISSN 2153-4136, January 2017

• Educational Values - Work on this project has exposed me to a multitude of areas in Computational Science, such as Compilers, Computer Architecture, and Parallel Computing. Most importantly, I have gained significant experience in multi-GPU programming. Our specific implementation falls under the MapReduce dwarf. The relevant characteristics will be discussed in subsequent sections.

In summary, we performed an in-depth investigation of dividing queries across multiple GPUs and the implications of novel hardware features to explore the potential benefits of GPU acceleration for SQL queries.

2. RELATED BACKGROUND
This section provides overviews of SQL queries and MapReduce. It also discusses current solutions for parallel SQL execution on GPUs.

2.1 SQL SELECTs and JOINs
In the following discussion, we denote a table as Ti and each column in a table as ci, where i is the corresponding enumeration.

A SELECT statement extracts data from a database in an organized and readable format. A typical SELECT statement is accompanied by clauses such as FROM, WHERE, and ORDER BY. These clauses specify the conditions on which the data is filtered. A predicate includes one or more of these conditions.

T1:
c1 c2
1  w
2  z
3  z

⇒

c1 c2
2  z
3  z

Figure 1: The result rows of a SELECT query: SELECT * FROM T1 WHERE T1.c1 ≥ 2

A JOIN query requires at least two tables that share a common attribute, referred to as a foreign key. While there are many types of JOINs, we focus only on the cross-join, denoted ⋈, since it is the simplest JOIN query but still embodies all the computational characteristics of other JOINs. A cross-join computes the Cartesian product of all the keys in the relevant tables. Figure 2 highlights the result rows of a cross-join between T1 and T2 based on the predicate T1.c2 = T2.c2, where c2 is the foreign key. While the result consists of nine rows in total, the predicate reduces the result to just three rows.

A SELECT query involves a loop that examines every row in a table. A JOIN entails a nested loop, where for each row in the first table, the corresponding loop iterates over the entire second table and emits rows that match the predicate. Clearly, this entire process is computationally demanding for very large tables with millions of rows.
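The two loop structures just described can be sketched in Python. This is purely an illustrative sketch: the paper's implementation runs these patterns as CUDA kernels, and the function names here are invented for clarity.

```python
# Illustrative sketch of the SELECT scan and the JOIN nested loop
# described above; the actual implementation executes on the GPU.

def select_scan(table, predicate):
    """SELECT: a single loop that examines every row in a table."""
    return [row for row in table if predicate(row)]

def nested_loop_join(t1, t2, predicate):
    """JOIN: for each row of the first table, iterate over the entire
    second table and emit concatenated rows that match the predicate."""
    result = []
    for r1 in t1:            # outer loop: rows of T1
        for r2 in t2:        # inner loop: rows of T2
            if predicate(r1, r2):
                result.append(r1 + r2)
    return result

# Tables from Figures 1 and 2: T1(c1, c2) and T2(c2, c3).
T1 = [(1, "w"), (2, "z"), (3, "z")]
T2 = [("x", 5), ("z", 6), ("w", 7)]

# SELECT * FROM T1 WHERE T1.c1 >= 2
print(select_scan(T1, lambda row: row[0] >= 2))  # [(2, 'z'), (3, 'z')]

# SELECT * FROM T1, T2 WHERE T1.c2 = T2.c2
print(nested_loop_join(T1, T2, lambda r1, r2: r1[1] == r2[0]))
# three of the nine Cartesian-product rows survive the predicate
```

The quadratic cost of the nested loop is exactly what makes JOIN a natural target for GPU acceleration.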

T1:
c1 c2
1  w
2  z
3  z

⋈

T2:
c2 c3
x  5
z  6
w  7

T1 ⋈ T2:
c1 c2 c2 c3
1  w  x  5
1  w  z  6
1  w  w  7
2  z  x  5
2  z  z  6
2  z  w  7
3  z  x  5
3  z  z  6
3  z  w  7

Figure 2: The cross join T1 ⋈ T2: SELECT * FROM T1, T2 WHERE T1.c2 = T2.c2

2.2 MapReduce
As previously mentioned, the MapReduce paradigm is an active research area. The three main phases are shown in Figure 3.

Figure 3: The three phases of MapReduce.

Map Phase: The input data is partitioned into splits, and each split is assigned to a Map task to be processed on a separate node in the cluster. These Map tasks perform computations on each input key-value pair from their assigned partition of data. Finally, they generate a set of intermediate results for each key.

Shuffle and Sort Phase: This phase sorts the data generated by the Map tasks from other nodes and divides this data into regions to be further processed by the Reduce tasks. Therefore, all Map tasks must complete prior to this phase. In addition, this phase distributes the data as necessary to the nodes where the Reduce tasks will execute.

Reduce Phase: The Reduce tasks perform additional operations on the intermediate data by merging the values associated with a particular key into a smaller set of values to produce the output. The number of Reduce tasks does not need to be equal to the number of Map tasks. For more complex data processing procedures, multiple MapReduce calls may be linked together in sequence.

MapReduce is a simple and scalable approach to achieve high-speed and efficient parallel processing over a massive amount of data.
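The three phases above can be walked through with a minimal single-process Python sketch. The workload here is a toy word count rather than the paper's database queries, and all function names are illustrative:

```python
from collections import defaultdict

# Toy single-process walk-through of the three MapReduce phases.

def map_phase(splits):
    """Map: each split is processed independently, emitting
    intermediate (key, value) pairs."""
    intermediate = []
    for split in splits:
        for word in split.split():
            intermediate.append((word, 1))
    return intermediate

def shuffle_and_sort(intermediate):
    """Shuffle and Sort: group all values sharing an intermediate
    key; this runs only after every Map task has finished."""
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    """Reduce: merge each key's values into a smaller set of values."""
    return {key: sum(values) for key, values in groups.items()}

splits = ["to be or", "not to be"]
counts = reduce_phase(shuffle_and_sort(map_phase(splits)))
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

Section 4.2.2 maps the project's own stages (tablets, per-node VMs, the master node) onto these same three phases.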

2.3 Parallel SQL Databases


There have been other works in the literature [7, 8, 11, 14] that employ parallel primitives such as scan, sort, or filter to process queries on the GPU. Each of these primitives is a kernel launched on a given set of data in an order specified by the programmer. Every kernel then needs to retrieve the relevant data from global memory. Fundamentally, this approach entails many memory accesses and consequently affects performance.

We build directly on the work presented in [3, 4, 5, 6]. This line of research proposes using a virtual machine (VM) instead. From a high-level point of view, a VM is a sequence of jump instructions inside the GPU. Each instruction is identified by an opcode and performs a certain operation. Each thread is mapped to a data point, small enough to be loaded directly into a register.

The source code of the Virginian database, which embodies this approach, is open source [4]. While inspired by SQLite [2], the initial implementation presents a completely custom VM, which we will describe in Section 3. We modified this VM to accommodate the distributed architecture of the BW system.

3. IMPLEMENTATION
The program first builds an abstract syntax tree (AST) from the SQL query. The AST is then processed in several passes to generate a virtual program, namely a set of jump instructions that represents the query. The details of the parsing algorithm and code generation are documented on the Virginian database page [4].

3.1 Virtual Machine Infrastructure
A VM is implemented as a CUDA kernel that executes a sequence of instructions procedurally. Each instruction is identified by an opcode and contains four registers. Each opcode serves a certain functionality.
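To make the opcode-VM idea concrete, here is a minimal Python sketch of such a machine. The opcodes, register layout, and instruction format are invented for illustration; the Virginian database defines its own instruction set [4].

```python
# Minimal sketch of an opcode virtual machine of the kind described
# above. All opcodes and the four-field instruction layout are
# hypothetical; they only mirror the general structure.

def run_vm(program, row):
    """Execute a list of (opcode, a, b, dest) instructions against one
    row; on the GPU, each thread would run this loop on its own row."""
    registers = [0] * 8
    pc = 0
    while pc < len(program):
        op, a, b, dest = program[pc]
        if op == "LOAD_COL":      # load column a of the row into dest
            registers[dest] = row[a]
        elif op == "LOAD_CONST":  # load the constant a into dest
            registers[dest] = a
        elif op == "GE":          # dest = (reg a >= reg b)
            registers[dest] = registers[a] >= registers[b]
        elif op == "RESULT":      # emit the row if reg a is true
            return row if registers[a] else None
        pc += 1                   # proceed to the next instruction
    return None

# A virtual program for: SELECT * FROM T1 WHERE T1.c1 >= 2
program = [
    ("LOAD_COL", 0, 0, 0),    # r0 = row.c1
    ("LOAD_CONST", 2, 0, 1),  # r1 = 2
    ("GE", 0, 1, 2),          # r2 = (r0 >= r1)
    ("RESULT", 2, 0, 0),      # emit the row if r2 holds
]

T1 = [(1, "w"), (2, "z"), (3, "z")]
print([r for r in (run_vm(program, row) for row in T1) if r is not None])
# [(2, 'z'), (3, 'z')]
```

Because the query is compiled once into such a program and every thread interprets the same instructions, no per-primitive kernel launches or intermediate global-memory round trips are needed.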

The aforementioned work partitions a SQL table into chunks of data, called tablets [4, 5]. Each tablet is a fixed-size group of rows. As a result, accessing an entire table of data may involve multiple tablets, but accessing a single row of data involves only a single tablet. Partitioning is beneficial at two levels of parallelism, especially for SQL SELECT. First, for a heavily distributed table, each of the table's tablets resides on a different node and thus can be processed independently. Second, for each tablet, threads in a CUDA grid handle their corresponding rows.

Since SELECT operates on only a single source table, an extension was developed to this model to support operations over two or more tables, namely JOIN [3]. CUDA's organization of threads into a grid of up to three dimensions coincides with the structure of a Cartesian product, where a two-table join corresponds to a two-dimensional grid. Recall the example from Figure 2, where two tables T1 and T2 are cross-joined on the same foreign key. Here, the x-dimension of the CUDA grid corresponds to the row index of T1, while the y-dimension corresponds to the row index of T2. The shortcoming, however, is that this approach is limited to joining at most three tables, since three is the maximum number of dimensions a CUDA grid supports.

This method, which we will now refer to as the Grid method, is used as the performance baseline for our novel technique for accelerating JOIN, which will be discussed shortly. In addition, we scale both SELECT and JOIN to execute on multiple nodes and compare the performance with that of a single node.

3.2 Dynamic Parallelism for SQL JOINs
DP is a new functionality provided by the Kepler Architecture [19]. DP allows kernels to be launched from the device without going back to the host. The kernel, block, or thread that initiates the launch is referred to as the parent, and the kernel, block, or thread that is launched is referred to as the child. DP allows explicit synchronization between the parent and the child through a built-in device function. Launches can be nested from parent to child, then child to grandchild, and so on. The deepest nesting level that requires explicit synchronization is the synchronization depth. Parent and child kernels have coherent access to global memory; shared and local memory are exclusive to each kernel.

The JOIN algorithm involves at least two data arrays, taken from the related data tables. Each thread takes one element from the first array in the JOIN predicate and finds the matching keys in the other array. The DP implementation launches a kernel to gather the result elements in parallel for each thread.
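The parent/child structure of this algorithm can be sketched sequentially in Python. On the GPU, `parent` corresponds to one thread per element of the first array and `gather_child` to the kernel that thread launches via Dynamic Parallelism; here both are ordinary functions, named for illustration only:

```python
# Sequential sketch mirroring the DP join structure described above.
# On real hardware, dp_style_join's loop body is one GPU thread and
# gather_child is a child kernel launched from that thread.

def gather_child(key, r1, t2, out):
    """Child step: gather all matching result rows for one key."""
    for r2 in t2:
        if r2[0] == key:          # compare against the foreign key
            out.append(r1 + r2)

def dp_style_join(t1, t2):
    """Parent step: each 'thread' takes one element of the first
    array and launches a gather step for its matching keys."""
    out = []
    for r1 in t1:                 # one GPU thread per row of T1
        gather_child(r1[1], r1, t2, out)
    return out

T1 = [(1, "w"), (2, "z"), (3, "z")]
T2 = [("x", 5), ("z", 6), ("w", 7)]
print(dp_style_join(T1, T2))
# [(1, 'w', 'w', 7), (2, 'z', 'z', 6), (3, 'z', 'z', 6)]
```

The point of DP is that the gather work is spawned on demand per thread, rather than being pre-shaped into a fixed multi-dimensional grid as in the baseline Grid method.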

3.3 Multi-GPU Configuration
To further improve the performance of large-scale data processing, it is essential to share workloads among distributed compute nodes equipped with GPUs. As discussed in Section 1, this project is the first to examine the possibility of breaking up a data set and running a query concurrently on multiple GPUs. Our custom data structure is useful in the context of a heavily distributed database by enabling efficient management of data between networked machines. In our implementation, tablets play a major role since they vastly simplify the process of moving data to and from different nodes. Tablets allow each node to operate exclusively on records of known size.

[Figure 4 diagram: Tablet 0, Tablet 1, ..., Tablet N are each assigned to a node; the nodes produce Result 0, Result 1, ..., Result N, which are merged into the Final table]

Figure 4: Query plan for SQL SELECT

A table is first divided into tablets. This step is trivial thanks to the partitioning scheme of a SQL table. Each tablet is sent to an XK compute node through MPI. On each node, the tablet is then processed row by row by the same virtual machine on the node's GPU. After producing its corresponding set of result rows, each node passes these rows back to the master node, which then reorganizes the rows and emits the final table to the user. Figure 4 illustrates this approach for SELECT queries.
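The partition/process/gather plan of Figure 4 can be sketched as a single-process simulation. Real execution moves tablets between BW nodes with MPI; the tablet size and function names below are illustrative assumptions:

```python
# Single-process sketch of the distributed SELECT plan in Figure 4.
# The real system moves tablets over MPI; names here are illustrative.

TABLET_ROWS = 2  # fixed tablet size (an assumption for this sketch)

def make_tablets(table):
    """Partition a table into fixed-size tablets."""
    return [table[i:i + TABLET_ROWS]
            for i in range(0, len(table), TABLET_ROWS)]

def node_select(tablet, predicate):
    """Each node runs the same VM over its tablet, row by row."""
    return [row for row in tablet if predicate(row)]

def master_gather(partial_results):
    """The master node reorganizes partial results into the final table."""
    return [row for part in partial_results for row in part]

table = [(1, "w"), (2, "z"), (3, "z"), (4, "w")]
tablets = make_tablets(table)                      # one tablet per node
partials = [node_select(t, lambda r: r[0] >= 2) for t in tablets]
print(master_gather(partials))  # [(2, 'z'), (3, 'z'), (4, 'w')]
```

Because every tablet has the same fixed size, the scatter step needs no per-node negotiation about data volume.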


[Figure 5 diagram: the tablets of T1 and T2 form a grid of tablet pairs, distributed across Node 0 through Node N × M]

Figure 5: Query plan for SQL JOIN

Similarly, the two tables in a JOIN query are each divided into an array of tablets. Figure 5 demonstrates this plan. Table T1 contains N tablets, and table T2 contains M tablets. Since a JOIN statement is a Cartesian product of two arrays of data, each node is responsible for a different section of the cross product. For instance, Node 0 computes T1.Tablet 0 ⋈ T2.Tablet 0. The results of these computations are passed to the master node by the same procedure as in Figure 4.
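One way to assign the N × M tablet pairs to nodes is a simple row-major mapping from a flat node index. The enumeration order below is an assumption made for illustration; the paper does not specify how node indices map to tablet pairs:

```python
# Sketch of mapping a flat node index to a tablet pair in the JOIN
# plan of Figure 5: node k computes T1.tablet_i joined with
# T2.tablet_j. Row-major order is an assumption of this sketch.

def tablet_pair(node, m):
    """Map a node index onto an N x M grid of tablet pairs,
    where T2 contributes M tablets."""
    return node // m, node % m

N, M = 3, 4                       # T1 has N tablets, T2 has M tablets
pairs = [tablet_pair(k, M) for k in range(N * M)]
print(pairs[0], pairs[5], pairs[11])  # (0, 0) (1, 1) (2, 3)
```

Covering all N × M pairs this way guarantees that every section of the cross product is computed exactly once.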

Splitting data across multiple nodes has a two-fold advantage: data can be processed more quickly, and a larger quantity of data can be processed. The database can now support much larger tables while retaining the same performance. For the same data set, more computing power reduces the processing time.

4. RESULTS
We adopted the test data from [4, 3, 6], which includes 8 million rows of randomly generated numerical values. The columns consist of an integer primary key and two columns each with a uniform distribution in [-100, 100], a normal distribution with standard deviation 5, and another with standard deviation 20. Each of these distributions was generated once for a 32-bit integer column and once for an IEEE 754 32-bit floating-point column. The GNU Scientific Library was used to ensure the quality of the random distributions.
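A simplified version of this layout can be sketched as follows. The original generation used the GNU Scientific Library; Python's standard `random` module stands in here, and the sketch generates one column per distribution rather than the paper's full integer-plus-float pairing:

```python
import random

# Sketch of test rows in the layout described above: an integer
# primary key, a uniform column on [-100, 100], and normal columns
# with standard deviations 5 and 20. Simplified relative to the
# paper's data, which pairs each distribution with both an integer
# and a 32-bit float column.

def make_rows(n, seed=0):
    rng = random.Random(seed)     # seeded for reproducibility
    rows = []
    for key in range(n):
        uniform_col = rng.randint(-100, 100)   # uniform on [-100, 100]
        normal5_col = rng.gauss(0.0, 5.0)      # normal, std dev 5
        normal20_col = rng.gauss(0.0, 20.0)    # normal, std dev 20
        rows.append((key, uniform_col, normal5_col, normal20_col))
    return rows

rows = make_rows(8)   # the paper generates 8 million rows
print(len(rows), len(rows[0]))  # 8 4
```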

A suite of ten different queries was written for each of SELECT and JOIN to thoroughly test their different characteristics; hence, there are twenty queries in total. For SELECT, five of the queries involve integer values and the rest of the suite tests floating-point values. For JOIN, only one query operates exclusively on integer values and one query tests floating point exclusively; the rest of the suite tests arithmetic and conditional statements with a mixture of both integer and floating-point values.

We have little reason to believe results would change significantly with realistic data sets, since all rows are checked. Also, both textual and non-textual data are ultimately represented by numeric values. Besides, we have tested virtually all basic possible combinations of common-case queries.

This section also highlights the educational values of our project.

4.1 Performance
We analyze in detail the runtime growth of our implementation and then discuss performance improvement relative to the baseline as well as the scalability of our implementation as a function of the number of nodes.

4.1.1 SELECT
The characteristics of runtime growth on a single node have been discussed [5]. We therefore focus mainly on the scaling factor of the implementation. Figure 6 visually presents the speedups observed as we scale our program. Single-node execution was predictably slower by factors proportional to the number of nodes used in computation.

[Figure 6 plot: x-axis Number of Nodes (2 to 16), y-axis Speedup (X); curves for SELECT and JOIN]

Figure 6: Scaling factors for SELECT and JOIN

4.1.2 JOIN
Tests were performed using two 3500-row tables containing randomly generated values in the same layout as for SELECT. We chose this number of rows since the Cartesian product of two tables with 3500 rows each will span 3500 · 3500 = 12,250,000 rows, larger than the number of test rows for SELECT.

Figure 8(a) graphically demonstrates the differences in running times of the DP and Grid methods. The mean execution time over all ten queries is 0.2526 seconds for Grid and 0.233 seconds for DP. We noticed that floating-point queries are slightly faster and report the average over integer and floating-point runs. Figure 8(b) depicts the speedups of DP with respect to Grid, which average 1.245X.

Scaling SQL JOINs on distributed systems is less straightforward than with SELECT. Since the Cartesian product is two-dimensional, we chose the sizes of the two tablet arrays in Figure 5 as {1 × 2, 2 × 2, 2 × 3, 2 × 4, 3 × 4, 2 × 7, 4 × 4}. These configurations correspond to the same numbers of nodes as for SELECT, and their speedups are also displayed in Figure 6. We chose these dimensions to examine the effect the ratio N/M has on the scalability of JOINs. The graph showed weaker scalability for JOINs due to a more complex distribution plan. The mean percentage of theoretical speedup achieved is 54.4%.

[Figure 7 plot: x-axis Proportions of Cross Product, y-axis Execution Time (s); GPU: Kepler K20c]

Figure 7: As fewer rows in the Cartesian product are filtered, the GPU becomes less efficient

Figure 9 provides the latency breakdown, as percentages, of the critical parts in the execution cycle of a program. Predictably, as the number of nodes increases, the system incurs much more overhead to organize data onto the assigned nodes. We notice a gradual decline in the role actual computation plays with respect to total runtime. In contrast, the node-to-node communication overhead increases linearly, up to 75.2% at 16 nodes.

4.2 Educational Values
The use of GPUs in HPC is becoming extremely popular due to their high computational power and high bandwidth, coupled with the availability of de facto standard frameworks. However, effectively programming a hybrid system with MPI and CUDA is especially difficult as the growing complexities of applications require more and more processors. Because the communication patterns and resource scheduling are common to a wide range of domains, we generalize our experiences and hope that they can be helpful to future programmers faced with similar problems.

4.2.1 Multi-GPU Programming
Combining the two parallel programming frameworks, MPI and CUDA, enables solving problems with a data size too large to fit into the memory of a single GPU, or that would require an unreasonably long compute time on a single node. This combination also allows programmers to efficiently accelerate existing MPI applications with GPUs.

In multi-GPU programming, an MPI launcher starts multiple instances of an application and distributes these instances across the nodes of BW. Each instance then launches its own CUDA kernel. A major problem arises when each node does not receive an even amount of data or does not return the same amount of data, referred to as workload imbalance. Our implementation automatically solves the first problem since data is evenly divided into tablets. The latter is harder to address since imbalance may occur due to the specific characteristics of an application. For instance, we chose a uniform distribution for our test data since the distribution has equal probability over any random region of data, thus minimizing the chance of load imbalance. It is critical for the programmer to be aware of these challenges in order to address them properly.

[Figure 8 plots: (a) execution time (s) per query for Grid and DP; (b) DP speedup (X) per query]

Figure 8: Variations of two different methods for accelerating JOINs

4.2.2 MapReduce Introduction
MapReduce has been shown to work well on BW [15], which gives a compelling reason to develop a section on how to effectively utilize this paradigm. Database aggregation falls into the MapReduce dwarf since each stage maps directly to a phase of MapReduce, as described in Figure 3. In our case, the input data is the original tables and the output data is the result table.

Map: Each of the partitioned tablets in a table is mapped to a node. The node's number and its corresponding tablet form a key/value pair. Each node then executes the generated VM and produces its own set of result rows. These are the intermediate results to be passed to the next phase.

Shuffle and Sort: Each node sorts its own result table based on the programmer's specification. For instance, these nodes arrange numeric values in ascending or descending order.

Reduce: The master node is responsible for this phase, collecting the other nodes' intermediate tables based on their keys and then rearranging these tables into one single final table to display.

[Figure 9 plot: x-axis number of nodes (2 to 16), y-axis percentage (%) of query runtime; components: MPI, PCIe, Kernel Execution]

Figure 9: The fraction of each query runtime due to core parts of computation

5. CONCLUSION
The Virginian database was used as a platform for the project, enabling the use of an existing SQL parsing mechanism and switching between host and device. Execution on the GPU was supported by a completely custom VM implemented as a CUDA kernel. Our DP approach is on average 1.245X faster than the previously implemented Grid method. The characteristics of each query, the type of data being queried, the size of the result set, and the number of nodes involved were all significant factors in the performance of a GPU-enabled database. Despite these variations, the minimum speedup for DP was 1.1X.

SQL is an excellent interface through which the GPU can be accessed: it is much simpler and more widely used than many alternatives. Using SQL represents a break from the paradigm of previous research, which drove GPU queries through the use of operational primitives. Additionally, SQL dramatically reduces the effort required to employ GPUs for database acceleration. While still enjoying the simplicity of SQL, users can now benefit from a much reduced runtime. All the details are hidden so that users do not need to learn any additional material beyond their existing SQL knowledge.

Our opcode model allows the programmer to choose any granularity for database operations in conjunction with a relatively simple VM, while also enabling efficient data handling. Programmers can simply add or modify opcodes to support more complex queries. We also generalize the module to help with the learning curve by presenting the general challenges associated with multi-GPU programming as well as the project's relation to MapReduce.

6. REFLECTIONS
The Blue Waters Student Internship Program (BWSIP) has significantly impacted many aspects of my personal and professional development.

The Petascale Institute (PI) allowed me the opportunity to learn how to use the Blue Waters supercomputer and the various parallel programming frameworks it provides. This experience and the exchanges with my fellow students and the instructors improved my programming skills greatly. Also, the exposure to other disciplines through these conversations opened my eyes to the vast applications of Computer Science (CS). The areas I was introduced to include biomathematics, physical simulations, and financial modelling.

After the PI, working on my own project helped me apply these skills in practice. I gained invaluable experience working on a large software project that utilizes many software tools and involves various areas of CS, such as compilers, computer architecture, and software development. More importantly, the experience provided me with material to discuss in my graduate school applications. As a result, I was accepted to all the programs I applied to, including institutions well known for High Performance Computing such as the University of Michigan and Georgia Tech.

All in all, the BWSIP was an immense success and equipped me with the skills necessary to succeed in my career as a Computer Science researcher.

7. FUTURE WORK

Our implementation was designed partly to demonstrate a very general framework for GPU data processing. Using this framework, a next step is to implement and test additional database features, such as other types of join, including inner, outer, and natural joins. Modern RDBMSs are extremely complex, and much more work in this area is required to fully replicate their functionality in a GPU-friendly manner. We have shown that our opcode framework is adaptable to this additional functionality, with modifications to our VM as appropriate to facilitate inter-opcode communication [4, 3].

Regardless of the direction, the general area of GPU application development is ripe for future research.

8. ACKNOWLEDGMENTS

This work is supported by a grant from the Shodor Education Foundation through the Blue Waters Student Internship Program (BWSIP) and by the National Center for Supercomputing Applications, which provides the hardware used in this project.

9. REFERENCES

[1] Apache Hive. https://hive.apache.org/. Date accessed: 2016-04-11.

[2] The SQLite Virtual Machine. https://www.sqlite.org/opcode.html. Date accessed: 2016-04-06.

[3] K. Angstadt and E. Harcourt. A virtual machine model for accelerating relational database joins using a general purpose GPU. In Proceedings of the Symposium on High Performance Computing, HPC '15, pages 127–134, San Diego, CA, USA, 2015. Society for Computer Simulation International.

[4] P. Bakkum. The Virginian Database. https://github.com/bakks/virginian. Date accessed: 2015-05-10.

[5] P. Bakkum and S. Chakradhar. Efficient data management for GPU databases. Technical report, NEC Laboratories America, Princeton, NJ, 2013.

Journal of Computational Science Education Volume 8, Issue 1

January 2017 ISSN 2153-4136 25


[6] P. Bakkum and K. Skadron. Accelerating SQL database operations on a GPU with CUDA. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU '10), pages 94–103, New York, NY, 2010. ACM.

[7] S. Baxter. Relational Joins. https://nvlabs.github.io/moderngpu/join.html. Date accessed: 2016-03-19.

[8] S. Bress, M. Heimel, N. Siegmund, L. Bellatreche, and G. Saake. GPU-accelerated database systems: Survey and open challenges. 2014.

[9] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron. A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads. In Workload Characterization (IISWC), 2010 IEEE International Symposium on, pages 1–11, Dec 2010.

[10] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, Jan. 2008.

[11] N. K. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha. Fast computation of database operations using graphics processors. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD '04, pages 215–226, New York, NY, USA, 2004. ACM.

[12] R. J. Halstead, I. Absalyamov, W. A. Najjar, and V. J. Tsotras. FPGA-based multithreading for in-memory hash joins. In Conference on Innovative Data Systems Research (CIDR '15), 2015.

[13] R. J. Halstead, B. Sukhwani, H. Min, M. Thoennes, P. Dube, S. Asaad, and B. Iyer. Accelerating join operation for relational databases with FPGAs. In Field-Programmable Custom Computing Machines (FCCM), 2013 IEEE 21st Annual International Symposium on, pages 17–20, April 2013.

[14] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query coprocessing on graphics processors. ACM Trans. Database Syst., 34(4):21:1–21:39, Dec. 2009.

[15] M. Gajbe, K. Chadalavada, G. Bauer, and W. Kramer. Benchmarking and performance studies of MapReduce/Hadoop framework on Blue Waters supercomputer. In WorldComp '15 – ABDA '15 International Conference on Advances in Big Data Analytics, 2015.

[16] S. Meki and Y. Kambayashi. Acceleration of relationaldatabase operations on vector processors. Systems andComputers in Japan, 31(8):79–88, 2000.

[17] Advanced Micro Devices. OpenCL Programming Guide. http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf. Date accessed: 2015-05-10.

[18] National Center for Supercomputing Applications. Brief Blue Waters system overview. https://bluewaters.ncsa.illinois.edu/user-guide. Date accessed: 2016-03-19.

[19] NVIDIA Corporation. NVIDIA CUDA Programming Guide. http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf. Date accessed: 2015-05-10.

