

Page 1: NestGPU: Nested Query Processing on GPU · 2020. 11. 18.

NestGPU: Nested Query Processing on GPU

Sofoklis Floratos∗, Mengbai Xiao∗, Hao Wang∗, Chengxin Guo†, Yuan Yuan‡, Rubao Lee§, Xiaodong Zhang∗

∗The Ohio State University, Columbus, OH, USA
†Renmin University of China, Haidian District, China
‡Google Inc., Mountain View, CA, USA
§RateUp Inc., Columbus, OH, USA

Abstract—Nested queries are commonly used to express complex use-cases by connecting the output of a subquery as an input to the outer query block. However, their execution is highly time-consuming. Researchers have proposed various algorithms and techniques that unnest subqueries to improve performance. Since this is a customized approach that needs high algorithmic and engineering efforts, it is largely not an open feature in most existing database systems.

Our approach is general-purpose and GPU-acceleration based, aiming for high performance at a minimum development cost. We look into the major differences between nested and unnested query structures to identify their merits and limits for GPU processing. Furthermore, we focus on the nested approach, which is algorithmically simple, rich in parallelism, relatively low in space complexity, and generic in program structure. We create a new code generation framework that best fits GPU for the nested method. We also make several critical system optimizations, including massively parallel scanning with indexing, effective vectorization to optimize join operations, exploiting cache locality for loops, and efficient GPU memory management. We have implemented the proposed solutions in NestGPU, a GPU-based column-store database system that is GPU device independent. We have extensively evaluated and tested the system to show the effectiveness of our proposed methods.

I. INTRODUCTION

Nested queries are an important part of SQL for users to express complex use-cases by connecting the output of a subquery (the inner query block) as an input to the outer query block. Among them, a nested query is correlated if its subquery refers to attributes of the outer query block [1]. An efficient way to execute a correlated subquery is to unnest it into an equivalent flat query. This unnested method has been recognized as an active research field of databases for about 40 years [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17]. However, due to its customized nature, existing database systems cannot unnest arbitrary correlated subqueries by following some general rules, and they require high engineering efforts to implement unnesting strategies into a database engine.

For an example of such a query presented in Query 1, a conventional method [2] is to replace the subquery with derived tables. By replacing the table scans that contain the correlated columns, i.e., R.col1 = S.col1, in the nested loop with a JOIN in conjunction with a GROUP BY, the correlated nested query can be unnested into a flat one

Floratos and Xiao contributed equally to this work as the first author.

SELECT R.col1, R.col2
FROM R
WHERE R.col2 = (SELECT min(S.col2)
                FROM S
                WHERE R.col1 = S.col1);

Query 1: An example of a correlated nested query

SELECT R.col1, R.col2
FROM R, (SELECT min(S.col2) as t1_min_col2,
                S.col1 as t1_col1
         FROM S
         GROUP BY S.col1) T1
WHERE R.col1 = T1.t1_col1 AND
      R.col2 = T1.t1_min_col2;

Query 2: The unnested query of Query 1

as shown in Query 2. However, if the correlated operator in the subquery is >, <, !=, or others, e.g., the condition is changed from R.col1 = S.col1 to R.col1 > S.col1 in the WHERE clause of the nested loop, this query cannot be unnested without extending the current SQL standard and introducing new operators [16], [17].

A more general way to execute a subquery is the nested loop approach, which is the final option of most databases when a given nested query cannot be unnested. This nested method directly executes a correlated subquery by iterating over all tuples of the correlated columns from the outer query block and executing the subquery as many times as the number of outer-loop tuples [1]. The method is general-purpose since it is not affected by the type of correlated operators or subquery structures. However, it has a high computational complexity as the correlated columns in the subquery have to be accessed multiple times. In Query 1, the nested method has O(N^2) computational complexity, where N is the size of the correlated columns (assuming both R.col1 and S.col1 have N tuples). In contrast, the unnested approach in Query 2 has lower computational complexity as it only accesses the correlated columns in the subquery once: when using a GROUP BY and a JOIN to execute the subquery, the unnested method has O(N) computational complexity in both JOIN and GROUP BY (assuming a hash join is used). Such a significant complexity gap between the two methods and the optimization opportunities for flat queries have motivated


many research and development activities for the unnested method, even though it is not as general as the nested one and consumes more memory space due to the derived tables.
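The complexity contrast above can be illustrated with a small, self-contained sketch (this is not NestGPU code; the tables are held as Python lists of (col1, col2) tuples, which is our simplification of the layouts in Query 1 and Query 2):

```python
def nested(R, S):
    # Nested method: re-evaluate the subquery min(S.col2) for every
    # outer tuple of R, giving O(N^2) work as in the text.
    out = []
    for r1, r2 in R:
        matches = [s2 for s1, s2 in S if r1 == s1]
        if matches and r2 == min(matches):
            out.append((r1, r2))
    return out

def unnested(R, S):
    # Unnested method (Query 2): compute the GROUP BY on S.col1 once,
    # then hash-join with R, giving O(N) work with a hash join.
    t1 = {}
    for s1, s2 in S:
        t1[s1] = min(s2, t1.get(s1, s2))
    return [(r1, r2) for r1, r2 in R if t1.get(r1) == r2]
```

Both functions produce identical results for equality correlation; the nested one, however, stays correct if `r1 == s1` is replaced by `r1 > s1`, which is exactly the case the unnested rewrite cannot handle without new operators.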

In response to the ending of Moore's Law, GPUs have been widely used to accelerate query processing [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28]. A GPU can accelerate performance over a CPU due to its massively parallel architecture and high-bandwidth memory. However, nested queries are not yet in the scope of existing GPU-based query processing systems, including GPUDB [21], OmniSci [29], and Kinetica [30], due to software complexity. An input query is by default considered a flat query, and unnesting must be done before the execution.

In this paper, we aim to provide a generic method for nested query processing on GPU without unnesting queries. The rationale of our approach is as follows. The high computational complexity of the nested method can be offset by optimized parallel processing on GPU, and its low memory usage can sustain the nested method within the limited GPU memory capacity. This makes the nested method feasible for executing correlated subqueries on GPU. However, implementing such a method on GPU is challenging. On CPU, the nested method requires little engineering effort since recursively executing a subquery in a predicate of an operator (e.g., selection and join) is feasible and efficient, naturally following the same way of processing flat queries. In contrast, implementing a recursive method on GPU is complicated and inefficient. On GPU, the relational operators are usually implemented as GPU kernels [29], [30], [21], [20], [31], [23]. Recursively calling GPU kernels with dynamic parallelism [32], which is the only way for recursive kernel execution on GPU, not only incurs significant overhead at runtime [33], [34], [35], [36], but also complicates the system design, where different GPU kernels must be generated in advance for different subqueries, or the execution control flow, i.e., analyzing the whole query plan tree, has to be offloaded to GPU. To address this structural issue, we propose a code generation framework for nested query processing on GPU. Our method generates code to iteratively evaluate a subquery for all correlated tuples before evaluating the predicate containing the subquery, so that we can realize the subquery execution by manipulating pre-implemented and highly optimized GPU kernels in an iterative manner.

In order to execute correlated subqueries efficiently, we design and implement a set of optimizations on GPU: (1) GPU memory management. In order to avoid frequent GPU memory allocation and deallocation in relational operators running on GPU, we design and implement a memory pool mechanism and differentiate multiple types of memory requirements inside GPU kernels. (2) Indexing. Despite the high throughput of table scan on GPU, building an index and narrowing the scan range can significantly speed up the table scans that are repeatedly invoked in the iterations of the nested method. It is particularly effective when the inner table is large. (3) Invariant component extraction. During the code generation, we extract the invariant components inside the nested query. At runtime we execute them only once to avoid redundant computation. (4) Vectorization. Inside the nested query block, intermediate data might be too small to fully take advantage of the parallelism of GPU. We implement query vectorization [37] on GPU by fusing GPU kernels to vectorize operators across multiple nested iterations. This improves GPU occupancy effectively. (5) Caching for reused outer-loop tuples. With this effort, a large portion of computation can be saved if the correlated column contains duplicate values.
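Optimization (5) amounts to memoizing subquery results over the correlated values. A minimal sketch (function names are ours; `eval_subquery` stands in for the GPU-side subquery evaluation):

```python
def run_subquery_cached(outer_values, eval_subquery):
    # Evaluate the correlated subquery once per distinct outer value and
    # reuse the cached result for duplicates; Res keeps one slot per
    # outer-loop iteration, as in the generated code.
    cache = {}
    res = []
    for v in outer_values:
        if v not in cache:
            cache[v] = eval_subquery(v)
        res.append(cache[v])
    return res
```

If the correlated column has D distinct values among N tuples, the subquery is evaluated D times instead of N.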

With these optimizations on GPU, the nested method can effectively approach the unnested method in execution time. However, the gap in computational complexity still deteriorates the performance of the nested method if the correlated tables become large enough. Thus, we build a cost model to predict the execution time of the nested method. This model can provide a precise estimation of the execution time of the nested method for the query optimizer, so that the optimizer can switch to the unnested method when the nested method is unsuitable for performance reasons.
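As a sketch of how the optimizer would consume such a model (the decision rule and parameter names here are our own illustration, not NestGPU's actual interface):

```python
def choose_method(est_nested_ms, est_unnested_ms, unnested_fits_gpu_memory):
    # Hypothetical decision rule: the time estimates are assumed to come
    # from the cost model. Fall back to the nested method whenever the
    # unnested plan's derived tables cannot fit in GPU memory, or when
    # the nested method is predicted to be faster.
    if unnested_fits_gpu_memory and est_unnested_ms < est_nested_ms:
        return "unnested"
    return "nested"
```

The memory check matters because, as noted later, some unnested workloads cannot complete on GPU at all due to derived-table size.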

To integrate all these efforts together, we have developed a subquery processing system, called NestGPU. In short, GPU-accelerated nested query processing in NestGPU has the following unique merits. First, certain nested queries cannot be algorithmically unnested [16], [17], thus the only solution on GPU is to use the nested method of NestGPU. Second, due to the high memory capacity requirement, certain algorithmically unnested workloads could not complete their execution on GPU (see Section V). In contrast, they run well with NestGPU by effectively using the GPU device memory. Third, our experiments show that the execution performance for representative workloads by NestGPU and the unnested method is very comparable, and in certain cases, NestGPU outperforms the unnested one. Finally and most importantly, NestGPU is general-purpose without complex unnesting efforts, which strongly motivated us to design and implement the nested method on GPU for this critical reason in practice. Our contributions are summarized as follows.

• We look into the major structural differences between the nested and unnested methods, by comparing their merits and limits on GPU. Our study gives insights into the reasons why we are able to deliver high performance in GPU-accelerated processing for the nested method.

• We create a new code generation framework and develop a set of GPU optimizations for effective implementations of the nested method. We also develop a cost model to guide subquery processing by selecting the most effective execution path, which ensures optimal performance for each subquery processing.

• We design and implement a subquery processing system, called NestGPU. Having extensively evaluated and tested the system, we show the effectiveness of the system and the accuracy of the cost model. The unique value of NestGPU is its simple and general-purpose structure without complex unnesting efforts, which we believe is a fundamental contribution for subquery processing.


II. BACKGROUND

A. Nested Queries

In SQL, a query can be nested in the SELECT, FROM, WHERE or HAVING clause along with an expression operator, e.g., <, >, =, and others. It can also be part of an UPDATE or DELETE statement, or a SQL view. Four nesting types are generalized as type-A, type-N, type-J and type-JA nesting [2]. For type-A and type-N queries, the nested query block can be executed once and then the results are served to the outer query block. The execution of these two types of subqueries is similar to that of flat queries. For type-J and type-JA nesting, the inner query block contains parameters resolved from a relation of the outer block. Type-JA nesting further contains an aggregate function. These two types of nested queries are termed correlated subqueries. Because correlated subqueries are the most common and cannot be executed as straightforwardly as type-A and type-N nested queries, they are extensively studied in existing research on the unnested method. In this work, we also focus on correlated subqueries.

B. Nested and Unnested Approaches


Fig. 1: The query-plan trees of Query 1 and Query 2

Although the nested method has a simple structure, it executes a correlated subquery in an algorithmically inefficient way. In principle, the subquery must be re-evaluated for each tuple of the referenced column from the outer query block. This leads to high computational complexity. When evaluating such a subquery, the execution engine usually analyzes the query plan trees for the outer query block and the inner query block separately, and uses a syntax like the Subplan of PostgreSQL to connect them. For the nested subquery in Query 1, the query plan trees are shown in Figure 1 (a), where the left side represents the nested query block and the right side is the outer query block. The engine evaluates the subquery for each tuple of the column R.col1 and accesses the relation S multiple times.

A common way to reduce the complexity of the nested method is to unnest the nested loops. By performing JOIN and GROUP BY operations, the correlated subquery can be merged into its upper-level query block. One can use the algorithm described in [2] to unnest Query 1 and get the equivalent Query 2. Query 2 is a flat query and its query plan tree is shown in Figure 1 (b). In the execution, a derived table T1, which is the result of grouping on S.col1 and then aggregating on S.col2, is first generated. The final result is projected by a join on the relation R and the derived table T1 with the conditions R.col1 = t1_col1 and R.col2 = t1_min_col2. Notice that an unnested query usually requires larger memory space than the nested one because of the generated derived table, the size of which depends on the content of the inner query block.

C. Query Processing on GPU

A general framework of a GPU query-processing engine includes three major components. The first is a set of pre-implemented and optimized relational operators, such as Selection, Join, Aggregation, and many others. Each of these operators is usually dismantled into several GPU kernels (called primitives), e.g., scan, scatter, prefix-sum, etc. The second is a parser to transform the SQL queries into their query plan trees and apply necessary optimizations. The third is a code generator to translate the query plan trees into programs (called drive programs) with optimizations on both CPU and GPU. The commonly used operators are highly optimized on GPU [18], [20], [23], [38]. Memory management is also a major concern. To avoid unnecessary data movement, the intermediate tables between operators can be preserved in GPU device memory. A pipelined design is also used to overlap GPU computation and data movement [21], [24], [28].
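To make the primitive decomposition concrete, here is a serial sketch of how a Selection operator breaks into the scan, prefix-sum, and scatter primitives named above (each function stands in for what would be one GPU kernel; the structure, not the code, is from the text):

```python
def scan_predicate(col, pred):
    # Kernel 1 (scan): produce a 0-1 vector marking qualifying tuples.
    return [1 if pred(x) else 0 for x in col]

def prefix_sum(flags):
    # Kernel 2 (exclusive prefix sum): assign each hit its output slot
    # and compute the total number of qualifying tuples.
    out, total = [], 0
    for f in flags:
        out.append(total)
        total += f
    return out, total

def scatter(col, flags, offsets, total):
    # Kernel 3 (scatter/materialize): write qualifying tuples densely.
    res = [None] * total
    for i, f in enumerate(flags):
        if f:
            res[offsets[i]] = col[i]
    return res

def selection(col, pred):
    flags = scan_predicate(col, pred)
    offsets, total = prefix_sum(flags)
    return scatter(col, flags, offsets, total)
```

The 0-1 vector plus prefix-sum pattern is exactly the inter-kernel data that the memory-management discussion in Section III-C must keep on the device.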

D. Problem Statement

SELECT R.c1, R.c2
FROM R
WHERE R.c2 =
  (SELECT min(S.c2)
   FROM S
   WHERE R.c1 = S.c1);

open() {
  R.open();
}

getNext() {
  do {
    r = R.getNext();
    if (r == null) return null;
    subqRes = subquery(r.c1);
  } until (r.c2 == subqRes);
  return r;
}

close() {
  R.close();
}

Fig. 2: The nested method on CPU: the pseudo-code to execute Query 1 with the iterator interface method. The subquery is re-evaluated every time a new tuple is retrieved.

Figure 2 shows standard pseudo-code to execute Query 1 in the nested method on CPU. The subqueries are recursively evaluated for different correlated values by following the iterator interface method [39]. As shown in the figure, a subquery function, i.e., the aggregate operator min in this case, can be directly called at any level of nested queries, because recursive processing is naturally supported on CPU. Thus, the execution of a correlated subquery on CPU in the nested method is the same as that of a flat query. However, implementing such a framework on GPU is cumbersome and inefficient for two reasons. First, as mentioned above, a common practice of GPU query processing engines is to pre-implement the relational operators as GPU kernels while leaving the control flow on CPU. Thus, the execution is in a push mode, i.e., calling GPU kernels of operators from leaves


to the root of the query plan tree. To implement the recursive execution of subqueries on GPU, one has to implement the control flow completely on GPU, i.e., the traversal of the query plan tree, and also re-implement GPU operators as GPU device functions to allow an operator to call another one, since CPU program invocation from GPU is not allowed. This will lead to an increasingly large, complex, and inefficient GPU kernel. Second, dynamic parallelism [32] is the typical way to support recursive kernel invocation on GPU. However, this approach causes inferior performance due to the kernel launch overhead, the under-utilization of GPU resources in sub-kernels, and the branch divergence overhead in upper-level kernels [33], [34], [35], [36]. More importantly, the maximum depth of kernel invocation and the total amount of memory reserved in dynamic parallelism are limited and vary across generations of GPU [32], which makes a recursive implementation on GPU that uses dynamic parallelism not as general as that on CPU. A new framework rather than the recursive one is needed for the nested method on GPU.
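For concreteness, the iterator-interface execution of Figure 2 can be mirrored in plain Python (a CPU-side sketch only; tables are held as lists of (c1, c2) tuples, which is our simplification):

```python
class NestedScan:
    # Sketch of the Figure 2 iterator interface for Query 1:
    # the subquery is re-evaluated for every tuple retrieved from R.
    def __init__(self, R, S):
        self.R, self.S = R, S
        self.pos = 0

    def open(self):
        self.pos = 0

    def subquery(self, c1):
        # SELECT min(S.c2) FROM S WHERE S.c1 = c1
        vals = [s2 for s1, s2 in self.S if s1 == c1]
        return min(vals) if vals else None

    def get_next(self):
        # Pull tuples from R until one satisfies r.c2 == subquery(r.c1).
        while self.pos < len(self.R):
            r = self.R[self.pos]
            self.pos += 1
            if r[1] == self.subquery(r[0]):
                return r
        return None

    def close(self):
        pass
```

Note how `subquery` is an ordinary function call inside `get_next` — this is the recursive, pull-based pattern that is natural on CPU but, as argued above, does not map onto pre-compiled GPU kernels.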

Even if we have such a framework that can support the nested method on GPU, we still need to resolve the following performance issues, due to the nature of repeated evaluations of the subquery. The first one is the kernel launch issue. The nested method may incur high kernel launch overhead, because the number of kernel launches grows along with the iterations of the outer block. The second one is the memory management issue. The nested method requires more advanced device memory management. Allocating and deallocating device memory per operator is usual in unnested methods, but it will lead to unacceptable memory management overhead in the nested method, since each operator in a subquery is called as many times as the number of correlated tuples from the outer block. The third one is the inadequate parallelism issue. If the data set to be processed in an operator of a subquery is too small, GPU occupancy is lowered and the overall performance is degraded. This issue may not be a big concern in unnested methods on GPU, since the operator in the subquery manipulates the results of the join on the intermediate relation and the relation of the outer block. In contrast, in each iteration of the nested method, the operator in the subquery operates on the results of the selection that uses a single tuple of the outer-loop relation as input and produces much less output than that of a join.

In summary, the success of GPU-accelerated processing for correlated subqueries depends on two critical components: (1) a new code generation framework that best fits GPU for the nested method; and (2) a set of effective optimization methods to deliver high performance. With these two components, we have accomplished our goal of high performance and minimum development cost in the query processing engine, which will be presented in the rest of the sections.

III. SYSTEM DESIGN

A. Query Plan Trees

NestGPU introduces a new operator SUBQ into the query plan tree to represent a subquery. When the parser of NestGPU identifies a subquery, it inserts a SUBQ into the predicate containing the subquery, which will eventually be substituted with an iteration loop in the execution code. Since the subquery itself is a complete query block, it will be parsed into another query plan tree. Meanwhile, because a subquery may include another subquery, the NestGPU parser will finally generate a tree-of-trees structure. In our implementation, all query plan trees except the outermost one are stored in a list, so that any of them can be located by its index when traversing the tree during the code generation. The operator SUBQ has varying operands, including the subquery index and the correlated columns. Figure 3 shows an example of a three-level query plan tree that includes two subqueries. The first subquery resides in the predicate on the right table of the join operator at Level 0, and the second one is in the selection operator at Level 1.

Subqueries: [1, 2, …] (the outermost query)
Level 0: JOIN: LEFT.col1 > SUBQ(2, RIGHT.col1, …)
Level 1: SELECTION: R.col1 = SUBQ(1, R.col2, …)
Level 2: …

Fig. 3: A query plan tree with two correlated subqueries
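A minimal sketch of this tree-of-trees representation (class and field names are illustrative, not NestGPU's actual data structures; the bodies of the two subquery plan trees are elided):

```python
class SubQ:
    # A SUBQ operand: the index of the subquery's plan tree in the flat
    # list, plus the correlated columns it consumes from the outer block.
    def __init__(self, index, correlated_cols):
        self.index = index
        self.correlated_cols = correlated_cols

class PlanNode:
    def __init__(self, op, predicate, children=()):
        self.op, self.predicate, self.children = op, predicate, list(children)

# The shape of Figure 3: all plan trees except the outermost are stored
# in a list so that SUBQ operands can locate them by index (slot 0 unused).
subqueries = [None, PlanNode("SUBQUERY#1", None), PlanNode("SUBQUERY#2", None)]
outermost = PlanNode(
    "JOIN", (">", "LEFT.col1", SubQ(2, ["RIGHT.col1"])),
    children=[PlanNode("SELECTION", ("=", "R.col1", SubQ(1, ["R.col2"])))])
```

The code generator can then walk `outermost` and, on meeting a `SubQ` operand, look up `subqueries[subq.index]` to emit the iterative loop for that subquery.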

B. Code Generation

By traversing the query plan tree, NestGPU generates a drive program that will be further compiled and linked to pre-implemented relational operators. The code generator starts from the outermost query plan tree, from the leaves to the root. For any node, i.e., a query operator, in the query-plan tree having a subquery as an operand of its predicate, the code generator generates code that evaluates the subquery for all correlated values passed from the outer loop and sets up the results as a vector. Then the operator containing the subquery is evaluated with the result vector as the input.

SELECTION: R.col1 = SUBQ(1, R.col2)

LoadToGPU(R.col1);
LoadToGPU(R.col2);
for (int i = 0; i < R.tupleNum; i++) {
  // Generate code for subquery#1
  // Subqueries[1]: SELECTION: S.col1 = R.col2
  ...
  LoadToGPU(S.col1);
  Table t_in = Selection("=", S.col1, R.col2[i]);
  ...
  Res[i] = Materialize(resultTable);
}
Table t_out = Selection("=", R.col1, Res);

Fig. 4: Generating code for a subquery in Selection

Selection with a subquery. Figure 4 shows the pseudo-code generated for a selection operator containing a correlated subquery in the drive program. The regular execution code of a selection is to load all operands, e.g., a constant "10"


JOIN: LEFT.col1 = SUBQ(2, RIGHT.col1)

LoadToGPU(LEFT.col1);
LoadToGPU(RIGHT.col1);
for (int i = 0; i < RIGHT.tupleNum; i++) {
  // Generate code for subquery#2
  ...
  Res[i] = Materialize(resultTable);
}
Table t_out = JOIN("=", LEFT.col1, Res);

JOIN: LEFT.col1 = SUBQ(3, LEFT.col2, RIGHT.col1)

LoadToGPU(LEFT.col1);
LoadToGPU(LEFT.col2);
LoadToGPU(RIGHT.col1);
for (int i = 0; i < LEFT.tupleNum * RIGHT.tupleNum; i++) {
  // Generate code for subquery#3
  // LEFT.col2[i / RIGHT.tupleNum] and RIGHT.col1[i % RIGHT.tupleNum]
  // are used in current iteration
  ...
  Res[i] = Materialize(resultTable);
}
// The value in Res used to join is at left_idx * RIGHT.tupleNum + right_idx
Table t_out = JOIN("=", LEFT.col1, Res);

Fig. 5: Generating code for a subquery in Join

or a column R.col1, into GPU memory first, then to run GPU kernels over the loaded data to complete the selection operation, and to create a table that can be the input of the next operator. However, if one operand of the selection is a subquery, the code generator needs to generate code that processes the subquery. As shown in Figure 4, the correlated columns, i.e., R.col2, are loaded into GPU device memory. The code processing the subquery is generated in an iterative loop with the size of the input table of the selection operator, i.e., the tuple number of R. In the loop, the generated code uses each correlated value, i.e., R.col2[i], to do the selection. The results may be used as the input for the following operator in the subquery, and the final results are stored into the vector according to the current iteration number, i.e., Res[i]. Following the loop, an upper-level selection evaluates the predicate containing the subquery, i.e., the selection on R.col1 and Res. In this method, our code generator recursively traverses the query plan tree but generates iterative code for the subquery in the selection operator.

Join with a subquery. The code generation for a join operator that contains a correlated subquery is different from that for a selection operator, because we need to determine the iteration number for the subquery. There are two cases. First, if correlated tuples come from one of the two tables of the join, the iteration number is equal to the corresponding table size. Second, if correlated tuples come from both of the join tables, the iteration number is the size of the Cartesian product of the two tables. Figure 5 shows the generated code for these two cases, where the difference is marked in red. Besides the code generated for the subquery, the code for the join operator containing the subquery is also different. For the second case, the join operator following the loop needs to use both the left and right tuple IDs to locate the subquery result in Res. If the join containing correlated subqueries is a natural join and all its predicates are connected by AND, we can optimize the query by first joining the two tables with the predicates without correlated subqueries, and then performing a selection over the result table for the predicates containing the subqueries. This can effectively reduce the number of iterations needed to call the subqueries and hence improve the performance.
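The index arithmetic for the second case can be checked in isolation. The sketch below assumes the `left_idx * RIGHT.tupleNum + right_idx` layout of the Res vector used in Figure 5:

```python
def flatten(left_idx, right_idx, right_n):
    # Position of the subquery result for the pair (left_idx, right_idx)
    # in the flat Res vector, where right_n is RIGHT.tupleNum.
    return left_idx * right_n + right_idx

def unflatten(i, right_n):
    # Inverse mapping used inside the generated Cartesian-product loop
    # to pick the correlated values for iteration i.
    return i // right_n, i % right_n
```

The two functions are inverses, so the join kernel following the loop can locate `Res[flatten(l, r, right_n)]` for every candidate tuple pair.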

Subquery in a subquery. Our code generation framework can handle a query with arbitrary levels of correlated subqueries. Figure 6 shows a complete workflow to generate the drive program for the three-level query of Figure 3, in which both subqueries are correlated with their upper-level query blocks. In Figure 6, the code pieces marked in different colors correspond to the subqueries at different levels. At each level, the generated code evaluates the subquery and stores its results into a vector that is fed back to the operator at the upper level. In this way, we handle nested subqueries by recursively traversing the query plan tree on CPU and generating the iterative code blocks in a nested loop with the push mode.

For type-JA correlated subqueries, every subquery evaluation returns a scalar and the result size is fixed. The results can be organized as a vector residing in a space of continuous memory, like any table column. But type-J correlated subqueries are different because their results have variable lengths. An example is a type-J subquery with an IN operator, e.g., R.col1 IN (...). For such subqueries, we use a two-level array to store the results of the subquery evaluations. The first level stores the length of the variable-length result of each subquery evaluation, and the second stores the results themselves.
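A sketch of this two-level layout for type-J results (function names are ours; a real GPU implementation would precompute prefix-summed offsets rather than the linear scan used in `get_result` here):

```python
def pack_results(results_per_iteration):
    # Level 1: per-evaluation lengths; level 2: the concatenated values.
    lengths = [len(r) for r in results_per_iteration]
    values = [v for r in results_per_iteration for v in r]
    return lengths, values

def get_result(lengths, values, i):
    # Recover the i-th evaluation's (variable-length) result.
    start = sum(lengths[:i])
    return values[start:start + lengths[i]]
```

This is the same idea as a CSR-style layout: fixed-size metadata per iteration, variable-size payload in one contiguous buffer, which suits GPU memory well.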

C. Memory Management

Due to the iterative execution of subqueries, memory management becomes inefficient if raw system interfaces like malloc are frequently called. Thus, NestGPU builds memory pools to efficiently manage memory for table columns, metadata, intermediate tables, and inter-kernel results.

Memory pools: Memory pools are designed to contain data that needs to be frequently allocated and freed in both the host and device memory. Three types of data should be put into memory pools: metadata, intermediate tables, and inter-kernel results. The metadata is the meta information used at the host side, including the column types, the tuple number of a table, and others. The intermediate tables are composed of the column data produced by an operator, which is the input of the next operator. As an operator may consist of several GPU kernels, the inter-kernel results are temporary data on GPU passed from one kernel to the following one; e.g., a 0-1 vector generated by the prefix-sum kernel is used as the input of a materialization kernel to determine which tuples should be written to the device memory.

In NestGPU, individual memory pools are used for these three types of data. In all memory pools, memory is allocated linearly by moving the tail pointer forward. Deallocation moves the tail pointer backward, and setting the tail pointer to the head of a memory pool releases all memory allocated in it. For inter-kernel results, NestGPU


[Fig. 6 contents: the SQL query (left), its query plan trees (middle), and the generated drive program (right).]

SQL query:

SELECT ... FROM R, ...
WHERE ...
  R.col1 = (SELECT ... FROM LEFT, RIGHT, ...
            WHERE ...
              R.col2 = ... ...
              LEFT.col1 > (SELECT ... FROM ...
                           WHERE ... RIGHT.col1 = ... ...)
            ...)
...;

Query plan trees:
  Level 0 (the outermost query): SELECTION: R.col1 = SUBQ(1, R.col2)
  Level 1: JOIN: LEFT.col1 > SUBQ(2, RIGHT.col1)
  Level 2: the innermost subquery

Drive program:

...
LoadToGPU(R.col1); LoadToGPU(R.col2);
for (int i0 = 0; i0 < R.tupleNum; i0++) {
  // Generate code for subquery#1
  ...
  LoadToGPU(LEFT.col1); LoadToGPU(RIGHT.col1);
  for (int i1 = 0; i1 < RIGHT.tupleNum; i1++) {
    // Generate code for subquery#2
    ...
    Res1[i1] = Materialize(resultTable1);
  }
  Table joinRes = JOIN(">", LEFT.col1, Res1);
  ...
  Res0[i0] = Materialize(resultTable0);
}
Table selectionRes = SELECTION("=", R.col1, Res0);
...

Fig. 6: The work flow eventually generating a drive program for a 3-level correlated nested query

[Fig. 7 contents: an annotated drive program fragment showing the memory pool operations.]

...
void *hostPos1 = host_mempool.tail;    // record where to free for the metadata
void *interPos1 = inter_mempool.tail;  // record where to free for the intermediate tables
for (...) {
  // Generate code for a level-1 subquery
  ...
  void *hostPos2 = host_mempool.tail;
  void *interPos2 = inter_mempool.tail;
  for (...) {
    // Generate code for a level-2 subquery
    ...
    // alloc space for the inter-kernel results in the current operator
    Table t = SELECTION(..., inter_kern_mempool);
    // free space for the inter-kernel results
    inter_kern_mempool.tail = inter_kern_mempool.head;
    ...
    host_mempool.tail = hostPos2;    // free space for the metadata
    inter_mempool.tail = interPos2;  // free space for the intermediate tables
  }
  ...
  host_mempool.tail = hostPos1;
  inter_mempool.tail = interPos1;
}
...

Fig. 7: The use cases of memory pools for three types of data: metadata, intermediate tables, and inter-kernel results

clears the memory pool after the execution of an operator. For metadata and intermediate tables, their tail positions are recorded before the subquery execution. After each iteration of the subquery, the tail pointer position is recovered so that the space allocated in the previous iteration can be reused. Figure 7 shows how the memory pools are used in a drive program generated for a 3-level subquery.

Preloaded columns: To avoid frequent data loading, we preload the required columns into GPU device memory before the subquery loop is executed and release them after all iterations. If the device cannot hold all columns, we separate the device memory into two parts: one holds the preloaded columns and the other is used for on-demand loading. Two principles guide column preloading: (1) the columns used by a more deeply nested subquery have a higher priority; and (2) among the columns at the same level, those of a smaller table have a higher priority, as sequential reads are more efficient.
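The pool discipline described above (linear allocation, tail save/restore around each subquery iteration, tail-to-head clearing) can be modeled in a few lines. The class below is an illustrative Python model, not NestGPU's actual CUDA implementation:

```python
class MemoryPool:
    """Bump-pointer pool: allocate by advancing tail, free by rewinding it."""

    def __init__(self, capacity):
        self.head, self.tail, self.capacity = 0, 0, capacity

    def alloc(self, nbytes):
        assert self.tail + nbytes <= self.capacity, "pool exhausted"
        offset, self.tail = self.tail, self.tail + nbytes
        return offset

    def mark(self):            # record the tail before a subquery iteration
        return self.tail

    def rewind(self, pos):     # recover the tail: the space is reused next iteration
        self.tail = pos

    def clear(self):           # tail back to head: release everything in the pool
        self.tail = self.head

pool = MemoryPool(1024)
for _ in range(3):             # three subquery iterations
    saved = pool.mark()
    pool.alloc(256)            # a per-iteration intermediate table
    pool.rewind(saved)         # no malloc/free churn across iterations
assert pool.tail == 0
```

Allocation and deallocation are pointer arithmetic, which is why the iterative subquery structure amortizes to almost zero memory-management cost.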

Most of these techniques to improve GPU memory management, e.g., allocating large memory pools, assigning memory regions to different data, and preloading partial data depending on the available memory size, have been used in existing systems [40], [26]. We apply these techniques to the nested structures of subqueries for high performance of memory accesses.
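The two preloading rules (deeper subquery level first; smaller table first within a level) amount to a ranking followed by a greedy fill of the memory budget. A sketch with made-up column names and sizes:

```python
# Columns are (name, subquery_level, size_bytes); names are illustrative.

def preload_order(columns, budget):
    # Rank: deeper level first (higher priority), then smaller size first.
    ranked = sorted(columns, key=lambda c: (-c[1], c[2]))
    preloaded, used = [], 0
    for name, level, size in ranked:
        if used + size <= budget:      # greedily fill the preload partition
            preloaded.append(name)
            used += size
    return preloaded                   # everything else is loaded on demand

cols = [("R.col1", 0, 400), ("S.col1", 1, 300), ("T.col2", 1, 100)]
assert preload_order(cols, 450) == ["T.col2", "S.col1"]
```

With a 450-byte budget, both level-1 columns are preloaded and the outer-level column falls back to on-demand loading.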

D. Other Optimizations

Indexing: Since the passed-in parameter varies and the nested query block has to be re-evaluated in the loop, building an index over the correlated columns can narrow the scan range in every iteration and improve the performance. In Query 1, the index can be built on S.col1. Each time we scan table S with a different passed-in R.col1 tuple in the predicate R.col1 = S.col1, we can locate a valid range of S.col1 by binary search on the index and only scan a portion of table S. However, building such an index for correlated columns needs additional time as well as space, e.g., O(N log N) time to sort a correlated column (N is the size of the column) and O(2N) more space to store the sorted values and their original positions. Thus, we carefully evaluate whether the time saved by indexing offsets the extra time for the sort, and whether we have enough space to hold the indexing results.
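The index amounts to a sorted (value, original position) array plus a per-iteration binary search. A minimal sketch (helper names are ours, not NestGPU's):

```python
import bisect

def build_index(col):
    """Sort the correlated column once, O(N log N), keeping row positions."""
    return sorted((v, pos) for pos, v in enumerate(col))

def probe(index, value):
    """Rows where col == value; only the matching range is scanned."""
    keys = [v for v, _ in index]   # shown for clarity; stored once in practice
    lo = bisect.bisect_left(keys, value)
    hi = bisect.bisect_right(keys, value)
    return [pos for _, pos in index[lo:hi]]

s_col1 = [7, 3, 7, 1, 3]
idx = build_index(s_col1)
assert sorted(probe(idx, 3)) == [1, 4]   # only rows 1 and 4 of S are touched
assert probe(idx, 9) == []
```

Each iteration then scans |range| entries instead of all N, which is where the trade-off against the one-time O(N log N) sort comes from.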

Vectorization: Inside a nested loop, NestGPU evaluates the subquery iteratively, with the operators sequentially evaluated on GPU. If the size of the intermediate results between two operators is too small to fully exploit all GPU cores, the performance is degraded. Query 3 shows an example that may have such an issue. In the nested part of the query, a selection using the passed-in T.col1 tuple is executed before joining T and S. If the selection only produces a few result tuples, the join may have poor performance as most GPU cores are left unused. Thus, we implement vectorization, which fuses the GPU kernels of multiple subquery instances with different passed-in parameters and evaluates them in a single kernel. This increases the GPU occupancy and thus improves the overall performance.
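The effect of fusing per-parameter kernels into one batched pass can be sketched with plain Python stand-ins for the kernels (the real fusion happens in generated CUDA, which the paper does not list):

```python
def selection_kernel(t_rows, param):
    """One subquery instance: a tiny, GPU-underutilizing selection."""
    return [r for r in t_rows if r["col1"] == param]

def fused_selection(t_rows, params):
    """Many instances with different passed-in parameters, one pass."""
    out = {p: [] for p in params}
    for r in t_rows:                 # a single scan serves the whole batch
        if r["col1"] in out:
            out[r["col1"]].append(r)
    return [out[p] for p in params]

t = [{"col1": 1, "col3": 10}, {"col1": 2, "col3": 20}, {"col1": 1, "col3": 30}]
batch = fused_selection(t, [1, 2])
assert batch[0] == selection_kernel(t, 1)
assert batch[1] == selection_kernel(t, 2)
```

One launch processes the work of many small launches, so the per-launch data is large enough to occupy the GPU.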

Invariant Component Extraction: In the nested query block, we usually find invariant components that do not change with different passed-in parameters. In Query 3, although we join T and S every time we evaluate the subquery, we can build a hash table for table S once and only perform the


SELECT R.col1, R.col2
FROM R
WHERE R.col2 = (SELECT min(T.col2)
                FROM T, S
                WHERE T.col1 = R.col1 AND
                      S.col1 > 0 AND
                      T.col3 = S.col3);

Query 3: Two selections and a join in a nested query block

probe in each iteration. To extract such invariant components from a nested query block in a general way, we implement the method in [8] to distinguish which operations can be executed once from which ones cannot.

Before generating the iterative code for a subquery, we mark all nodes in its query plan tree as invariant, except the ones containing correlated columns, which are marked as transient. Then the transient flag spreads upward: if one child of a node is transient, the node is also marked transient, and so on until we reach the root. After the query plan tree of a subquery is traversed and all nodes are marked correctly, NestGPU generates code for the invariant nodes, which form small trees embedded in the original one. Their results are intermediate tables inside the nested query block that will be used later. A specific optimization applies to the join operator. If both children of the join are transient (or invariant), the code of each child is put inside (or outside) the iteration. If one child of the join is transient and the other is invariant, NestGPU builds the hash table on the invariant child and uses the transient child to probe.
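The marking pass can be sketched as a bottom-up traversal over a tiny plan-node class (illustrative only; the real pass runs over NestGPU's plan trees):

```python
class PlanNode:
    def __init__(self, name, correlated=False, children=()):
        self.name, self.correlated = name, correlated
        self.children = list(children)
        self.transient = False

def mark(node):
    """Leaves touching a correlated column are transient; the flag
    propagates to every ancestor on the way back up."""
    child_flags = [mark(c) for c in node.children]
    node.transient = node.correlated or any(child_flags)
    return node.transient

# JOIN(selection on T with correlated T.col1 = R.col1, plain scan of S)
scan_s = PlanNode("scan S")
sel_t = PlanNode("select T", correlated=True)
join = PlanNode("join", children=[sel_t, scan_s])
mark(join)
assert join.transient and sel_t.transient and not scan_s.transient
# scan S stays invariant: its hash table is built once, outside the loop.
```

The invariant subtree (here, the scan of S) is exactly what gets hoisted out of the subquery iteration.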

Caching: If the passed-in parameter is not a primary key, subqueries would be re-evaluated with duplicate values. Thus, we build a cache for the results evaluated from the subquery, whose key is the combination of the passed-in parameters. This caching technique improves performance if a highly skewed distribution is observed on the passed-in columns.

Although these techniques are used in existing systems and are not the major contributions of this work, some of them are especially effective for nested query processing. For example, under the nested structure, a single kernel is likely to underutilize GPU resources, so we use vectorization to fuse multiple kernels. Also, under the nested structure, if the passed-in tuples from the outer loop have locality, we use caching to avoid redundant computation in the subquery.

IV. COST MODEL

The cost model predicts the execution time of a nested query, involving a number of GPU-specific factors, e.g., the number of GPU cores, the bandwidth of data transfers, the latency of memory accesses, and the number of GPU kernels. These factors are considered on top of the regular problems in CPU query processing, including join cardinality, filter selectivity estimation, the time spent on fetching pages from the disk, etc. To keep the complexity reasonable, we assume nested queries are executed exclusively on a single GPU.

We first estimate the time of a single database operator, e.g., a selection or an aggregation. In NestGPU, these operations are performed by multiple GPU kernels followed by a materialization step. Thus, we break down the execution time of a single operator (T_op) into several parts as:

T_{op} = \sum_{i=1}^{N_k} \left( K_i \cdot \left\lceil \frac{D_i}{Th_i} \right\rceil \right) + \left( M \cdot D_r \cdot \sum_{i=1}^{R_c} R_{s_i} \right) + C. \quad (1)

The first part of Eq. (1) estimates the time needed by the GPU to execute all primitive kernels. We denote the number of GPU kernels involved as N_k, the input data size of kernel i as D_i, and its number of GPU threads as Th_i. We estimate how many iterations each thread performs by dividing D_i by Th_i. In the equation, K_i is the time needed to perform a single iteration, which depends on the instructions executed in the kernel. One method to estimate K_i is to actually run the query once and reuse this value for all predictions. An alternative way to estimate K_i is based on the total number of memory transactions [21]. In our evaluation, we use the former method, since optimizations such as using GPU registers for a fast sorting implementation [38] make the global-memory-transaction-based method less accurate. K_i is thus the parameter that captures the runtime statistics. This differs from previous studies that abstract runtime parameters into the cost model, but we find it efficient to incorporate runtime information in this way for two reasons. First, we assume that the GPU does not process any other workload, so K_i is stable and does not depend on the operating system. More importantly, in the nested execution structure that includes many iterations of a subquery, it is practical to execute the inner part only once and obtain the K_i values for all operators. The second part of Eq. (1) is the materialization cost. For most operations, the materialization cost is linear in the size of the output data. Thus, we calculate it by multiplying the number of rows in the result D_r, the time needed to materialize a single byte M, and the sum of R_{s_i}, the size of column i, over the R_c result columns. The third part is a constant time C: the time to execute a GPU kernel with an empty input.
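Eq. (1) transcribes directly into code. The numbers below are illustrative inputs, not measurements; K_i and the other constants would come from the profiled run described above:

```python
from math import ceil

def t_op(K, D, Th, M, Dr, Rs, C):
    """Eq. (1): K, D, Th are per-kernel lists; Rs lists result-column sizes
    in bytes; M is the per-byte materialization time; C the constant cost."""
    kernel_time = sum(k * ceil(d / th) for k, d, th in zip(K, D, Th))
    materialize_time = M * Dr * sum(Rs)
    return kernel_time + materialize_time + C

# Two kernels over 1M input tuples with 2048 threads each; materialize
# 100 result rows of two 4-byte columns.
t = t_op(K=[0.5, 0.2], D=[1_000_000, 1_000_000], Th=[2048, 2048],
         M=0.001, Dr=100, Rs=[4, 4], C=3.0)
assert abs(t - (0.5 * 489 + 0.2 * 489 + 0.001 * 100 * 8 + 3.0)) < 1e-9
```

Note the ceiling: 1,000,000 / 2048 rounds up to 489 iterations per thread, matching the ⌈D_i/Th_i⌉ term.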

Notice that the estimation of a join operation is more complicated than Eq. (1). As mentioned in Section III-D, NestGPU may build the hash table once and reuse it multiple times in the nested query block. Thus, we need to estimate the costs of building a hash table and probing a hash table separately. Furthermore, the materialization of a join operation is more complex, as it uses two different kernels to materialize the columns from the left table and the right table, respectively. We decompose the cost of a join into the time to build the hash table (T_jh), to probe the hash table (T_jp), and to materialize the result (T_jm):

T_{jh} = H_t \cdot \left\lceil \frac{D_i}{Th_i} \right\rceil, \quad (2)

T_{jp} = P \cdot \left\lceil \frac{D_i}{Th_i} \right\rceil, \quad (3)

T_{jm} = \left( M_l \cdot D_r \cdot \sum_{i=1}^{R_{cl}} R_{s_i} \right) + \left( M_r \cdot D_r \cdot \sum_{i=1}^{R_{cr}} R_{s_i} \right). \quad (4)


In Eq. (2), the execution time of an iteration to build the hash table is denoted as H_t, and in Eq. (3), the time of an iteration to probe an element is denoted as P. For T_jm in Eq. (4), the materialization cost is estimated as the sum of materializing results from the left and right tables. For data from the left table, we estimate the cost by multiplying the time needed to materialize a single byte M_l, the number of rows in the result D_r, and the size of each row from the left table, which is the sum of the lengths of the columns included in the result (\sum_{i=1}^{R_{cl}} R_{s_i}). Similarly, we can estimate the time of materializing results from the right table. Notice that we distinguish M_l and M_r, since the materialization times observed in our experiments differ between the left and right tables. This is because the structures of the tables, including the lengths of the materialized columns and their strides, significantly affect the time of memory copy on GPU. The final cost of a join, denoted as T_j, can be estimated as:

T_j = T_{jh} + T_{jp} + T_{jm} + C. \quad (5)
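Eqs. (2)-(5) combine into one function; when the hash table is invariant (Section III-D), the build term is paid once outside the loop. The parameter values below are illustrative:

```python
from math import ceil

def t_join(Ht, P, Di, Th, Ml, Mr, Dr, Rs_left, Rs_right, C):
    """Join cost per Eqs. (2)-(5); Rs_left/Rs_right are result-column
    sizes from the left and right tables, in bytes."""
    t_build = Ht * ceil(Di / Th)                              # Eq. (2)
    t_probe = P * ceil(Di / Th)                               # Eq. (3)
    t_mat = Ml * Dr * sum(Rs_left) + Mr * Dr * sum(Rs_right)  # Eq. (4)
    return t_build + t_probe + t_mat + C                      # Eq. (5)

tj = t_join(Ht=0.3, P=0.1, Di=10_000, Th=1000, Ml=0.001, Mr=0.002,
            Dr=50, Rs_left=[4], Rs_right=[4, 8], C=2.0)
assert abs(tj - (0.3 * 10 + 0.1 * 10
                 + 0.001 * 50 * 4 + 0.002 * 50 * 12 + 2.0)) < 1e-9
```

Separating t_build from t_probe is what lets the model credit the invariant-hash-table optimization in Eq. (6) below.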

We also need to consider the optimizations proposed in the previous section, specifically the caching optimization. If the size of the outer table is S and the number of cache hits is C_h, the nested part is executed S − C_h times. To estimate the time of the nested part, we add the cost of scans and aggregations from Eq. (1) and the cost of joins from Eq. (5). Notice that for a join, we only add T_jp and T_jm to the nested computation when the hash table build (with time T_jh) can be moved out of the loop. Assuming there are N_Ops scans and aggregations and N_Jops joins in the nested query block, the total cost of the nested part, denoted as N, can be estimated as:

N = (S - C_h) \cdot \left[ \sum_{i=1}^{N_{Ops}} T_{op_i} + \sum_{i=1}^{N_{Jops}} \left( T_{jp_i} + T_{jm_i} \right) \right] + \sum_{i=1}^{N_{Jops}} T_{jh_i}. \quad (6)

NestGPU optimizes memory accesses by using memory pools and preloading table columns. We consider both optimizations and estimate the memory-related cost only once for each operation in the nested block. We denote the memory-related cost of operation i as T_{mem_i} and calculate it based on how much data the operation needs to transfer and how much memory needs to be allocated. We estimate the whole memory-related cost of the nested part, denoted as N_mem, as:

N_{mem} = \sum_{i=1}^{N_{Ops}+N_{Jops}} T_{mem_i}. \quad (7)

Finally, to complete the estimation, we also need to add the cost of the outer block, denoted as U. We estimate it as the sum of all its scans and aggregations (U_Ops of them), all its joins (U_Jops of them), and their memory-related costs:

U = \sum_{i=1}^{U_{Ops}} T_{op_i} + \sum_{i=1}^{U_{Jops}} T_{j_i} + \sum_{i=1}^{U_{Ops}+U_{Jops}} T_{mem_i}. \quad (8)

The total cost of a nested query, denoted as Q, is:

Q = N + N_{mem} + U. \quad (9)
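Eqs. (6)-(9) assemble the per-operator estimates into the total. In the sketch below, the per-operator costs (T_op, T_jp, T_jm, T_jh, T_mem, and the outer-block cost U) are assumed to have been estimated already, e.g., via Eq. (1) and Eq. (5); the numbers are illustrative:

```python
def total_cost(S, Ch, Top_nested, Tjp, Tjm, Tjh, Tmem_nested, U):
    """Eqs. (6)-(9): S outer tuples, Ch cache hits; the lists hold
    per-operator costs inside the nested block."""
    per_iter = sum(Top_nested) + sum(p + m for p, m in zip(Tjp, Tjm))
    N = (S - Ch) * per_iter + sum(Tjh)   # Eq. (6): hash builds hoisted out
    Nmem = sum(Tmem_nested)              # Eq. (7): memory cost paid once
    return N + Nmem + U                  # Eq. (9), with U given by Eq. (8)

# 1000 outer tuples, 400 cache hits, one scan and one join in the nested block.
Q = total_cost(S=1000, Ch=400, Top_nested=[0.5], Tjp=[0.2], Tjm=[0.1],
               Tjh=[30.0], Tmem_nested=[12.0], U=25.0)
assert abs(Q - (600 * 0.8 + 30.0 + 12.0 + 25.0)) < 1e-9
```

The (S − C_h) factor makes explicit how caching and the hoisted hash-table builds reduce the dominant, per-iteration term.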

In our implementation, we also estimate the I/O time as the amount of data needed divided by the available disk bandwidth. The I/O time is not considered in the cost model evaluation of this work.

A good estimation of the cost model depends on the D_r values in Eq. (1) and Eq. (4), which represent filter selectivity and join cardinality, respectively. Estimating these two parameters has been explored in both offline methods and online solutions, e.g., sampling and indexing [41] and parallel algorithms [42]. We notice that there is a promising solution [43] for heterogeneous computing resources. To accurately estimate these parameters, we break the subquery into pieces as "execution islands" [43] by partitioning the iterations of the subquery. For each partition, we execute several iterations and use the results to estimate the D_r values. Under the nested structure, executing several iterations for each partition of the subquery may be expensive, so we set a threshold to determine the trade-off between the estimation cost and the prediction accuracy. We expect this method to highly improve the accuracy of the cost model, since the D_r values adapt during the subquery execution. In the cost model evaluation (Sec. V-C), we assume for simplicity that D_r is known in advance.

The reasons why we design our own cost model for correlated subqueries rather than use an existing one, like GPL [23], are summarized as follows. (1) The execution code generated by NestGPU is structurally different from that of other GPU-based database systems. For example, GPL is designed for flat queries and focuses on how to improve GPU utilization when concurrently executing multiple kernels. In GPL, TPC-H Q5, Q7, Q8, Q9, and Q14 are evaluated; none of them has correlated subqueries. (2) Furthermore, GPL's cost model might yield different results for correlated subqueries. Under the nested structure, a single kernel is likely to underutilize GPU resources, so we use the vectorization technique in Section III-D to fuse multiple kernels (from many to one); in contrast, to overlap CPU-GPU data movement and GPU kernel execution in a pipelined mode, GPL's cost model considers how to partition data into chunks and launch concurrent kernels (from one to many). (3) Some optimizations that are not included in GPL are crucial for efficient nested query processing. For example, caching can avoid redundant computations in the iterative structure of subquery processing in NestGPU. It is simple but effective. Therefore, in our cost model, C_h is added to characterize this optimization.

V. EVALUATION

We evaluate NestGPU on different queries and datasets.
Hardware configurations. We use a server with two Intel Xeon E5-2680 v4 CPUs (28 cores in total) running at 2.40GHz. The server also has 128GB main memory and runs Linux CentOS 7. The GPU on it is an NVIDIA Tesla V100 with 32GB high-bandwidth memory (HBM). We evaluate GPU memory utilization on a desktop machine that has an Intel Core i7-3770K CPU running at 3.50GHz, 32GB main memory, and an NVIDIA GTX 1080 GPU with 8GB GDDR5 memory.
Queries. We use TPC-H Q2, Q4, and Q17 [44]. These three


[Figures 8-10: execution time (ms, log scale) vs. scale factor (1-20).]
Fig. 8: TPC-H Q2 (pgSQL nested/unnested, MonetDB, OmniSci, GPUDB+, NestGPU)
Fig. 9: TPC-H Q4 (pgSQL nested/unnested, MonetDB, OmniSci, NestGPU)
Fig. 10: TPC-H Q17 (pgSQL nested/unnested, MonetDB, OmniSci, GPUDB+, NestGPU)

queries have a type-JA subquery. We also use modified Q2 queries to evaluate NestGPU under different conditions.
Datasets. We generate all datasets with the standard TPC-H data generator and change the scale factor from 1 to 20 (up to 100 only for the memory utilization test).
Peer systems. We use PostgreSQL v12.1 [45] and MonetDB v11.37 [46] on CPU, and the open-source systems OmniSci [29] and GPUDB [21] on GPU for comparison. We first enhance GPUDB from only allowing Star Schema queries to accepting TPC-H queries. We also enhance it by applying our memory management mechanisms. Thus, we call the enhanced GPUDB "GPUDB+" in the following experiments. All nested queries used in our experiments cannot be automatically unnested by PostgreSQL, OmniSci, or GPUDB+. Thus, if the queries can be unnested by Kim's method [2], we manually rewrite the nested queries into their unnested equivalents. For MonetDB, we measure the nested queries, since MonetDB can automatically unnest them during execution. We report the execution time in milliseconds, with the disk I/O time excluded.

A. Evaluations on TPC-H Q2, Q4, and Q17

Figures 8, 9, and 10 show the execution times of PostgreSQL, MonetDB, OmniSci, GPUDB+, and NestGPU on TPC-H Q2, Q4, and Q17 on our server node. For TPC-H Q2, shown in Figure 8, PostgreSQL takes ∼13 minutes and ∼31 minutes to run the nested version even with the table scale factors of 1 and 5, so we do not present its results for the larger scales. For the unnested version of TPC-H Q2, PostgreSQL takes 263 ms to 4763 ms to execute the query. By looking into the runtime information, we find that PostgreSQL only uses a single thread even on a multicore system. On GPU, OmniSci takes 188 ms to 1904 ms and GPUDB+ takes 143 ms to 1109 ms to run the unnested query, while NestGPU uses 246 ms to 4041 ms to run the nested one. OmniSci and GPUDB+ are at most 2.12x and 3.73x faster than NestGPU, respectively. Despite its higher computational complexity, NestGPU in this case delivers performance comparable to OmniSci and GPUDB+ on GPU.

Figure 9 shows the performance results on TPC-H Q4. In Q4, the subquery is in an EXISTS operator, and the execution can be optimized by applying a semi-join operator instead of executing the whole nested loop. OmniSci and GPUDB+ do not use this optimization. Furthermore, GPUDB+ has a memory issue in GROUP BY that fails the execution, so in this test we compare NestGPU with the other systems but not GPUDB+. The results show that NestGPU outperforms PostgreSQL in all cases: NestGPU is 2.44x, 4.91x, 6.10x, 6.86x, and 6.76x faster than PostgreSQL on the nested Q4 as the scale factor varies from 1 to 20. The performance gains come from the massive parallelism of the semi-join implemented on GPU. The unnested Q4 is executed more slowly by PostgreSQL than the nested one, because an additional GROUP BY is used to remove redundant tuples from the inner table. NestGPU is 14.55x, 48.17x, 60.65x, 65.65x, and 66.35x faster than PostgreSQL, and 6.97x, 13.17x, 15.11x, 11.18x, and 11.66x faster than OmniSci, when they run the unnested Q4 with the respective scale factors.

Figure 10 shows the execution time on TPC-H Q17. The structure of Q17 is similar to that of Q2, but Q17 has a much larger inner table, i.e., LINEITEM. PostgreSQL is significantly inferior to the others when it executes the nested Q17: it takes ∼23 minutes even on the smallest table with the scale factor 1, so we do not present its results for the larger scale factors. In this case, NestGPU is 2.11x, 5.51x, 4.74x, 4.19x, and 4.09x faster than PostgreSQL executing the unnested Q17. However, OmniSci and GPUDB+, which execute the unnested Q17, both have better performance, running up to 7.27x and 16.58x faster than NestGPU, even with all of NestGPU's optimizations enabled.

MonetDB is one of the fastest database systems on CPU, with optimized query plans for TPC-H queries and hardware-related optimizations. In our evaluation, MonetDB takes 31 ms to 138 ms, 30 ms to 334 ms, and 34 ms to 239 ms to run TPC-H Q2, Q4, and Q17, respectively. It achieves the best performance, which comes from three sources. First, the data movement between CPU and GPU accounts for a non-negligible percentage of the execution time in GPU-accelerated systems; for TPC-H Q2, the CPU-GPU data movement consumes up to 19.55% of the execution time of NestGPU. This overhead does not exist in MonetDB. Second, the evaluation of MonetDB is carried out on two Intel Xeon E5-2680 v4 CPUs, each of which has 14 cores and 35 MB of L3 cache, which is a powerful hardware configuration. Third, and most importantly, MonetDB applies unique techniques to optimize query plans, e.g., it can reduce cardinalities in the inner query by pushing down predicates from the outer query. This method particularly benefits Q2 and Q17.

B. Evaluations on TPC-H Q2 Variants

We further evaluate NestGPU with TPC-H Q2 variants. The variants are all derived from the base query presented in Query 4, which is TPC-H Q2 with an additional predicate


[Figures 11-13: execution time (ms) vs. scale factor (1-20).]
Fig. 11: Query 5 that cannot be unnested (pgSQL nested vs. NestGPU, log scale)
Fig. 12: Query 6 having a smaller outer table (GPUDB+ vs. NestGPU)
Fig. 13: Query 7 having a larger outer table (NestGPU vs. NestGPU Idx)

  1: SELECT ...
+13: AND p_brand = 'Brand#41'

Query 4: TPC-H Q2 with an additional predicate in the WHERE clause of the outer query block

to reduce the execution time on large tables. First, we change the predicate in Q2 to obtain Query 5, which cannot be unnested [16], [17] by the commonly used strategies. Second, we make Query 6, which has a smaller outer table. Third, we make Query 7, which has a larger outer table, to evaluate the indexing. Finally, we make Query 8, which has a larger inner table, to evaluate the GPU memory footprint of both the nested and unnested methods.

16: AND ps_supplycost > (...
22: p_partkey != ps_partkey

Query 5: from Query 4 but cannot be unnested

For Query 5, which cannot be unnested, PostgreSQL recursively evaluates the subquery as NestGPU does. The results are shown in Figure 11. When the scale factor varies from 1 to 20, the speedup of NestGPU over PostgreSQL ranges from 109x to 359x. In summary, when a nested query cannot be unnested, the nested method is the only way to execute the query, and NestGPU on GPU achieves two orders of magnitude better performance than PostgreSQL on CPU.

+11: AND p_container like '%BAG'
 12: AND p_size = 20

Query 6: from Query 4 but having a smaller outer table

We compare NestGPU with GPUDB+ on Query 6, which has a smaller outer table; the results are shown in Figure 12. For all scale factors, NestGPU outperforms GPUDB+ in execution time. More specifically, the nested query block of Query 6 is only evaluated 116 times because of the added predicate. For such a small outer table, it is more efficient to perform a simple aggregation over a column multiple times than a GROUP BY followed by a large JOIN. The experiment shows that when the outer table is sufficiently small, NestGPU can provide better performance than the unnested method on GPU. The cost model further provides quantified information to the query optimizer on whether the nested method is better.

We evaluate the indexing technique by using Query 7, which has a larger outer table. Figure 13 shows the execution time

-13: AND p_brand = ’Brand#41’

Query 7: from Query 4 but having a larger outer table

of NestGPU with and without the indexing technique. Without it, NestGPU completes the query in 772 ms to 22956 ms on the different scale factors; with it, the execution time is reduced to 570 ms to 10557 ms, which even includes the index building time. The experimental results show that for a larger outer table, indexing can lead to better performance.

-25: AND r_name != ’ASIA’

Query 8: from Query 4 but having a larger inner table

We also check when the unnested method runs out of GPU device memory. We make Query 8, which has a larger inner table, and run the test on our desktop machine with an NVIDIA GTX 1080 GPU and 8GB device memory. In this experiment, we use the scale factors 20, 40, 60, 80, and 100 to enlarge the memory usage, and show the results in Figure 14. At the scale factors of 20, 40, and 60, the performance gap between NestGPU and GPUDB+ ranges from 2x to 3.7x; the unnested method of GPUDB+ has higher performance due to its lower computational complexity compared with the nested method of NestGPU. However, when the scale factor reaches 80, GPUDB+ runs out of GPU device memory, while NestGPU can still correctly execute the query.

C. Cost Model Verification

Finally, we evaluate the cost model of NestGPU. We verify the accuracy for three relational operators from Query 4 and then for the whole nested query. In Figure 15, we denote the real execution time with (R) and the time estimated by the cost model with (E). The error rate between the estimated time and the real execution time ranges from 0.49% to 17.75%, from 4.03% to 17.48%, and from 0.15% to 7.66% for selection, join, and aggregation, respectively. Figure 16 shows the results on the whole nested query; the error rate is up to 12.7% at the scale factor 20.

VI. RELATED WORK

Optimizing query processing operators on GPU. He et al. [18] implement and optimize operators, e.g., map, scatter/gather, prefixScan, etc., on GPU and use them to construct join algorithms. Paul et al. [23] propose a pipelined execution mechanism on GPU, focusing on concurrent kernel


[Figures 14-16: execution time (ms) vs. scale factor.]
Fig. 14: Query 8 having a larger inner table (GPUDB+ vs. NestGPU; scale factors 20-100)
Fig. 15: Cost model verification for ops. (Select., Join, Agg; real (R) vs. estimated (E); scale factors 5-20)
Fig. 16: Cost model verification for Query 4 (Estimation vs. Real; scale factors 1-20)

execution and communication optimization between kernels. Sioulas et al. [26] optimize hash join on GPU by exploiting GPU architecture characteristics, partitioning large datasets, and executing the join in a pipelined manner.

GPU-accelerated database systems. GPUDB [21] is a query processing system on GPU. For a SQL query, GPUDB generates a drive program with pre-implemented GPU kernels and executes the program on GPU. With performance analysis on multiple queries, GPUDB concludes that the query processing performance on GPU is determined by the CPU-GPU data transfer and GPU device memory access times. MapD [22] (rebranded to OmniSci) is another GPU-based query processing system. It implements a memory management system that uses the LRU policy to manage GPU device memory and swap data between CPU and GPU. HorseQC [24] is a pipelined query processing system on GPU. It uses a compound kernel method to fuse multiple kernels into a single kernel so that intermediate data movement between kernels can be significantly reduced. HAPE [25] is a Heterogeneity-conscious Analytical query Processing Engine that uses both multiple CPUs and multiple GPUs for query execution. It implements hardware-conscious hash join operators [26] and efficient intra-device execution methods for multiple operators. H2TAP [27] optimizes HAPE by using lazy data loading and data transfer sharing to increase the CPU-GPU bandwidth usage and to bridge the throughput gap between GPU and PCIe. Hawk [47] is a hardware-adaptive query compiler that can compile queries into OpenCL kernels with low compilation time and without manual tuning. DogQC [28] identifies filter and expansion divergence in GPU-accelerated database systems and proposes the push-down parallelism and lane-refill techniques to balance the divergence effects. HetExchange [40] is another query execution engine on heterogeneous hardware, focusing on the multi-CPU-multi-GPU configuration. SEP-Graph [48] is a GPU query processing engine for graph algorithms. It automatically selects the execution path among three pairs of critical parameters, i.e., sync or async mode, push or pull communication mechanism, and topology- or data-driven traversing scheme, with the objective of achieving the shortest execution time in each iteration.

The code generation for subqueries presented in this paper is uniquely different from these GPU systems. First, a common practice in the existing systems is to use the CPU to call different pre-implemented GPU kernels in the generated program, as using a GPU kernel to call another GPU kernel is inefficient and impractical (on the contrary, on CPU, the code generation for a subquery operator SUBQ is to use a CPU function to call another CPU function, just like for a flat query). Thus, a database operator, e.g., Join and Selection, is represented as a single operator and corresponds to one or a limited number of GPU kernels both in the existing work and in ours. However, since a subquery can implement any query logic and hence can contain an arbitrary combination of database operators, it is impossible to pre-implement a limited number of GPU kernels for a subquery. Therefore, in our work, a crucial design is to generate a for loop for a SUBQ, instead of compiling it into GPU kernels as in other GPU database systems or into a CPU function as in CPU systems. Second, when dealing with a subquery, we analyze the query and apply corresponding optimizations to the query plan tree for different cases, e.g., Selection with a subquery, Join with a subquery, and Subquery in a subquery (in Section III-B). These subquery-specific optimizations are not used in existing systems. In summary, both the structural change in compilation and the specific optimizations for subqueries are new contributions.
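To make the structural difference concrete, the following Python sketch (purely illustrative; NestGPU generates GPU code, and the table layout and names here are our own assumptions) mimics the shape of a generated kernel body under the nested method: the outer loop corresponds to the threads launched for the outer table, and the correlated subquery becomes an inlined for loop over the inner table rather than a call to a separate pre-implemented kernel.

```python
def nested_select(outer, inner):
    """Evaluate: SELECT id FROM outer o
       WHERE o.qty > (SELECT AVG(qty) FROM inner i WHERE i.pid = o.id).
       Rows are (id, qty) and (pid, qty) tuples (hypothetical schema)."""
    result = []
    # On GPU, each iteration of this outer loop would run as one thread.
    for o_id, o_qty in outer:
        total, count = 0, 0
        # The generated SUBQ for loop: a scan of the inner table,
        # inlined into the kernel body instead of a nested kernel launch.
        for i_pid, i_qty in inner:
            if i_pid == o_id:  # correlated predicate
                total += i_qty
                count += 1
        if count > 0 and o_qty > total / count:
            result.append(o_id)
    return result

print(nested_select([(1, 10), (2, 3), (3, 8)],
                    [(1, 4), (1, 6), (2, 5), (3, 8)]))  # → [1]
```

The point of the sketch is only the control structure: because the SUBQ body is generated as a plain loop, any combination of operators inside the subquery can be emitted without needing a pre-built kernel for each combination.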

Unnested methods. The unnested methods have been extensively researched in the past 40 years [2], [3], [9], [14], [16]. There are two basic strategies for type-JA subqueries [2], [3]. Kim's method does an aggregation on the inner table followed by a join with the outer table [2]. Dayal's method first does an outerjoin on the correlated tables and then an aggregation followed by a filtering [3]. Galindo-Legaria et al. [9] develop the APPLY operator, identifying that both Kim's and Dayal's methods can be included in the normalization rule. Cao et al. [14] focus on unnesting type-J subqueries. They propose a nested relational algebra based approach to unnest non-aggregate subqueries with hash joins and to treat all subqueries in a uniform manner. Neumann and Kemper [16] show that the efficiency of an unnested strategy depends on the correlated operator of the subquery and the structure of the subquery. They propose dependent joins to unnest arbitrary subqueries with a new operator, i.e., MAGIC.

VII. CONCLUSION

We present NestGPU, a GPU-based database system to accelerate subquery processing under the nested structure. In contrast to the customized approach of the unnested method, our nested method is general-purpose, minimizing development effort in its usage while also retaining high performance. To accomplish our goals, we develop a new code generation framework to make the nested method best fit on GPU, along with a set of optimizations for high performance. With NestGPU, we are able to efficiently process nested queries that cannot be algorithmically unnested, as well as nested queries that can be unnested but fail to run on GPU due to limited GPU memory capacity. In other representative cases, NestGPU achieves performance comparable to that of the unnested method, while outperforming the unnested method when the outer table is relatively small. We also develop a cost model to predict the execution time of the relational operators under the nested structure, aiming to guide the database query optimizer to select the shortest execution method for nested queries.

In the post-Moore's Law era, a highly efficient program is no longer machine independent; instead, it comes from a balance among three critical factors: low algorithm complexity, low data movement, and high parallelism. In this paper, we have made a case for significantly raising the execution performance of the high-complexity nested method by exploiting massive parallelism and device memory locality on GPU. In this way, we accomplish the goal of both general-purpose software design and high performance in subquery processing.

VIII. ACKNOWLEDGEMENTS

We would like to thank the anonymous reviewers for their constructive comments. This work has been partially supported by the National Science Foundation under grants CCF-1513944, CCF-1629403, IIS-1718450, and CCF-2005884.

REFERENCES

[1] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price, “Access path selection in a relational database management system,” in SIGMOD, 1979.

[2] W. Kim, “On optimizing an sql-like nested query,” TODS, vol. 7, no. 3, 1982.

[3] U. Dayal, “Of nests and trees: A unified approach to processing queries that contain nested subqueries, aggregates, and quantifiers,” in VLDB, 1987.

[4] S. Chaudhuri, “An overview of query optimization in relational systems,” in PODS, 1998.

[5] M. Muralikrishna, “Improved unnesting algorithms for join aggregate sql queries,” in VLDB, 1992.

[6] L. Baekgaard and L. Mark, “Incremental computation of nested relational query expressions,” ACM Trans. Database Systs., vol. 20, no. 2, 1995.

[7] P. Seshadri, H. Pirahesh, and T. C. Leung, “Complex query decorrelation,” in ICDE, 1996.

[8] J. Rao and K. A. Ross, “Reusing invariants: A new strategy for correlated queries,” in SIGMOD, 1998.

[9] C. Galindo-Legaria and M. Joshi, “Orthogonal optimization of subqueries and aggregation,” in SIGMOD, 2001.

[10] C. Zuzarte, H. Pirahesh, W. Ma, Q. Cheng, L. Liu, and K. Wong, “Winmagic: subquery elimination using window aggregation,” in SIGMOD, 2003.

[11] M. O. Akinde and M. H. Bohlen, “Efficient computation of subqueries in complex olap,” in ICDE, 2003.

[12] M. Brantner, N. May, and G. Moerkotte, “Unnesting scalar sql queries in the presence of disjunction,” in ICDE, 2007.

[13] M. Elhemali, C. A. Galindo-Legaria, T. Grabs, and M. M. Joshi, “Execution strategies for sql subqueries,” in SIGMOD, 2007.

[14] B. Cao and A. Badia, “A nested relational approach to processing sql subqueries,” in SIGMOD, 2005.

[15] S. Bellamkonda, R. Ahmed, A. Witkowski, A. Amor, M. Zait, and C.-C. Lin, “Enhanced subquery optimizations in oracle,” in VLDB, 2009.

[16] T. Neumann and A. Kemper, “Unnesting arbitrary queries,” in BTW, 2015.

[17] J. Holsch, M. Grossniklaus, and M. H. Scholl, “Optimization of nested queries using the nf2 algebra,” in SIGMOD, 2016.

[18] B. He, K. Yang, R. Fang, M. Lu, N. Govindaraju, Q. Luo, and P. Sander, “Relational joins on graphics processors,” in SIGMOD, 2008.

[19] T. Kaldewey, G. Lohman, R. Mueller, and P. Volk, “Gpu join processing revisited,” in DaMoN, 2012.

[20] S. Zhang, J. He, B. He, and M. Lu, “Omnidb: Towards portable and efficient query processing on parallel cpu/gpu architectures,” PVLDB, vol. 6, no. 12, 2013.

[21] Y. Yuan, R. Lee, and X. Zhang, “The yin and yang of processing data warehousing queries on gpu devices,” PVLDB, vol. 6, no. 10, 2013.

[22] T. Mostak, “An overview of mapd (massively parallel database),” White paper, MIT, 2013.

[23] J. Paul, J. He, and B. He, “Gpl: A gpu-based pipelined query processing engine,” in SIGMOD, 2016.

[24] H. Funke, S. Breß, S. Noll, V. Markl, and J. Teubner, “Pipelined query processing in coprocessor environments,” in SIGMOD, 2018.

[25] P. Chrysogelos, P. Sioulas, and A. Ailamaki, “Hardware-conscious query processing in gpu-accelerated analytical engines,” in CIDR, 2019.

[26] P. Sioulas, P. Chrysogelos, M. Karpathiotakis, R. Appuswamy, and A. Ailamaki, “Hardware-conscious hash-joins on gpus,” in ICDE, 2019.

[27] A. Raza, P. Chrysogelos, P. Sioulas, V. Indjic, A. Christos Anadiotis, and A. Ailamaki, “Gpu-accelerated data management under the test of time,” in CIDR, 2020.

[28] H. Funke and J. Teubner, “Data-parallel query processing on non-uniform data,” PVLDB, vol. 13, no. 6, 2020.

[29] https://www.omnisci.com/platform/downloads.

[30] https://www.kinetica.com/.

[31] S. Breß and G. Saake, “Why it is time for a hype: A hybrid query processing engine for efficient gpu coprocessing in dbms,” PVLDB, vol. 6, no. 12, 2013.

[32] A. Adinets, “Cuda dynamic parallelism api and principles,” https://devblogs.nvidia.com/cuda-dynamic-parallelism-api-principles/, 2014.

[33] J. Wang and S. Yalamanchili, “Characterization and analysis of dynamic parallelism in unstructured gpu applications,” in IISWC, 2014.

[34] G. Chen and X. Shen, “Free launch: Optimizing gpu dynamic kernel launches through thread reuse,” in MICRO, 2015.

[35] I. E. Hajj, J. Gomez-Luna, C. Li, L.-W. Chang, D. Milojicic, and W.-m. Hwu, “Klap: Kernel launch aggregation and promotion for optimizing dynamic parallelism,” in MICRO, 2016.

[36] J. Zhang, A. M. Aji, M. L. Chu, H. Wang, and W.-c. Feng, “Taming irregular applications via advanced dynamic parallelism on gpus,” in CF, 2018.

[37] T. Kersten, V. Leis, A. Kemper, T. Neumann, A. Pavlo, and P. Boncz, “Everything you always wanted to know about compiled and vectorized queries but were afraid to ask,” PVLDB, vol. 11, no. 13, 2018.

[38] K. Hou, W. Liu, H. Wang, and W.-c. Feng, “Fast segmented sort on gpus,” in ICS, 2017.

[39] G. Graefe, “Volcano: an extensible and parallel query evaluation system,” TKDE, vol. 6, no. 1, 1994.

[40] P. Chrysogelos, M. Karpathiotakis, R. Appuswamy, and A. Ailamaki, “Hetexchange: Encapsulating heterogeneous cpu-gpu parallelism in jit compiled engines,” PVLDB, vol. 12, no. 5, 2019.

[41] V. Leis, B. Radke, A. Gubichev, A. Kemper, and T. Neumann, “Cardinality estimation done right: Index-based join sampling,” in CIDR, 2017.

[42] S. Heule, M. Nunkesser, and A. Hall, “Hyperloglog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm,” in EDBT, 2013.

[43] T. Karnagel, D. Habich, and W. Lehner, “Adaptive work placement for query processing on heterogeneous computing resources,” PVLDB, vol. 10, no. 7, 2017.

[44] “TPC-H benchmark,” http://www.tpc.org/tpch, 2018.

[45] PostgreSQL Global Development Group, “PostgreSQL,” http://www.postgresql.org, 2017.

[46] P. A. Boncz, M. Zukowski, and N. Nes, “Monetdb/x100: Hyper-pipelining query execution,” in CIDR, 2005.

[47] S. Breß, B. Kocher, H. Funke, S. Zeuch, T. Rabl, and V. Markl, “Generating custom code for efficient query execution on heterogeneous processors,” VLDB Journal, vol. 27, no. 6, 2018.

[48] H. Wang, L. Geng, R. Lee, K. Hou, Y. Zhang, and X. Zhang, “Sep-graph: Finding shortest execution paths for graph processing under a hybrid framework on gpu,” in PPoPP, 2019.