


SCHWERPUNKTBEITRAG

Datenbank Spektrum
https://doi.org/10.1007/s13222-018-0294-9

Integration of FPGAs in Database Management Systems: Challenges and Opportunities

Andreas Becher1 · Lekshmi B.G.1 · David Broneske2 · Tobias Drewes2 · Bala Gurumurthy2 · Klaus Meyer-Wegener1 · Thilo Pionteck2 · Gunter Saake2 · Jürgen Teich1 · Stefan Wildermann1

Received: 1 June 2018 / Accepted: 14 August 2018
© Springer-Verlag GmbH Deutschland, ein Teil von Springer Nature 2018

Abstract
In the presence of exponential growth of the data produced every day in volume, velocity, and variety, online analytical processing (OLAP) is becoming increasingly challenging. FPGAs offer hardware reconfiguration to enable query-specific pipelined and parallel data processing with the potential of maximizing throughput, speedup as well as energy and resource efficiency. However, dynamically configuring hardware accelerators to match a given OLAP query is a complex task. Furthermore, resource limitations restrict the coverage of OLAP operators. As a consequence, query optimization through partitioning the processing onto components of heterogeneous hardware/software systems seems a promising direction. While there exists work on operator placement for heterogeneous systems, it mainly targets systems combining multi-core CPUs with GPUs. However, an inclusion of FPGAs, which uniquely offer efficient and high-throughput pipelined processing at the expense of potential reconfiguration overheads, is still an open problem. We postulate that this challenge can only be met in a scalable fashion when providing a cooperative optimization between global and FPGA-specific optimizers. We demonstrate how this is addressed in two current research projects on FPGA-based query processing.

Keywords Database query processing · Hardware acceleration · Query optimization · FPGA · OLAP

1 Introduction

With the increasing volume, velocity, and variety of today's data, the available parallelism and multiplicity of emerging computer systems has to be exploited efficiently in order to process these data. Consequently, we observe an increasing research focus on accelerating database query processing with multi-core CPUs and attached co-processors. In addition to GPUs, FPGAs are deployed more and

This work has been supported by the German Research Foundation (DFG) as part of the Priority Programme 2037 (Grant No.: ME 943/9, PI447/9, SA 465/51, TE163/21, WI4415/1).

Andreas Becher
[email protected]

Lekshmi B.G.
[email protected]

David Broneske
[email protected]

Tobias Drewes
[email protected]

Bala Gurumurthy
[email protected]

more for data-processing contexts. Prominent examples are Microsoft's Catapult [38], Baidu's FPGA-based data analysis [17], and cloud services, e.g. the Amazon clouds. While drastic performance improvements can be expected when using heterogeneous architectures, they are not guaranteed per se.

FPGAs consist of a regular array of programmable logic blocks that can be dynamically configured to implement complex hardware modules. Such reconfigurable hardware

Klaus Meyer-Wegener
[email protected]

Thilo Pionteck
[email protected]

Gunter Saake
[email protected]

Jürgen Teich
[email protected]

Stefan Wildermann
[email protected]

1 Friedrich-Alexander-Universität Erlangen-Nürnberg, Nuremberg, Germany

2 Otto-von-Guericke-University, Magdeburg, Germany



Fig. 1 Example of (a) a QEP with a sub-query selected for FPGA acceleration (dashed lines) and (b) a pipelined and parallelized hardware architecture that directly implements the dataflow of the sub-query. (a) Query Evaluation Plan (QEP), (b) dedicated hardware pipeline for the QEP

has the ability to be tailored to a specific task. FPGAs show their full strength when parallelism and pipelining can be jointly exploited through the provision of high-throughput hardware pipelines, i.e., for processing streams of data. Here, the high energy and resource efficiency compared to CPUs and GPUs stems from the specificity of the hardware circuit and the locality of deeply pipelined processing components. These advantages come at the cost of achievable clock rates that are about a factor of 10 lower than those of modern CPUs and GPUs. Today, FPGAs are increasingly used for different data-analytics tasks, including data-stream processing, machine learning, extract-transform-load (ETL), and other data-transformation processes.

This article discusses the benefits and challenges of using FPGAs for online analytical processing (OLAP). OLAP queries can be represented by a query evaluation plan (QEP) that represents the operators and the dataflow of the underlying computations (illustrated in Fig. 1a). FPGAs provide hardware reconfigurability combining the advantages of hardware efficiency with programmability at the circuit level, thus making it possible to directly implement the dataflow according to the QEP, as illustrated in Fig. 1. All non-blocking operators can process the dataflow in a pipelined fashion at the speed of the I/O interface. In some cases, blocking operators like join and sort operations (pipeline breakers) can be evaluated by highly parallelized and specialized hardware implementations, as exemplified for a parallel sort-merge join implementation in Fig. 1b. In summary, the benefits for OLAP processing using FPGAs are:

● Pipelined processing of non-blocking operators at I/O rate.

● Energy efficiency and reduction of power consumption and/or ...

● ... speedup through parallel and specialized hardware implementations of OLAP operators.

● Resource efficiency by taking workload from processors and providing query-specific hardware acceleration.
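As a software analogy for this pipelined processing style, the non-blocking operators of a QEP can be modeled as a chain of streaming generators. The following minimal Python sketch is purely illustrative (the operator names and the example table are not from the paper); on an FPGA, each stage would become a hardware module and tuples would flow through all stages concurrently.

```python
# Hypothetical sketch: non-blocking operators (selection, projection) modeled
# as a streaming pipeline, the processing style an FPGA dataflow pipeline
# implements in hardware.

def scan(table):
    """Stream tuples from a table, analogous to reading at I/O rate."""
    yield from table

def select(pred, stream):
    """Non-blocking: each tuple passes through without buffering."""
    return (t for t in stream if pred(t))

def project(cols, stream):
    """Non-blocking: restructure each tuple on the fly."""
    return (tuple(t[c] for c in cols) for t in stream)

# A tiny 'QEP': SELECT price FROM sales WHERE price > 100
sales = [(1, 250), (2, 80), (3, 130)]          # (id, price)
plan = project([1], select(lambda t: t[1] > 100, scan(sales)))
print(list(plan))                               # -> [(250,), (130,)]
```

A blocking operator such as sort would break this chain, since it must consume its whole input before emitting the first tuple; this is exactly the pipeline-breaker case discussed above.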

However, in principle, the queries submitted to a database system may vary considerably and can be arbitrarily complex. Unlike sequential processing on CPUs, a direct hardware implementation may not be possible due to the limited amount of programmable logic blocks on FPGAs. Consequently, we believe that the benefits of FPGAs in DBMSs can only be utilized by carefully exploiting the synergy among three major concepts:

● Exploitation of scalable and heterogeneous hardware/software target architectures;

● Query partitioning and optimization;

● Dynamic hardware reconfiguration.

In this article, we survey how far related work is tackling these issues in Sect. 2. Sect. 3 summarizes open challenges of integrating FPGAs into heterogeneous DBMSs. Sect. 4 gives an overview of our approaches to cope with these challenges, before we give a conclusion.

2 Related Work

2.1 FPGAs for Query Processing

2.1.1 Designs for Accelerating Specific Operators

In recent years, many investigations have evaluated and quantified the benefits of using FPGAs to accelerate database operators. FPGA acceleration of sorting, which is an integral part of many database operations, has gained much attention, see e.g., [11, 32, 44]. Another focus of research has been on the hardware implementation of join operators. Similar to most other database operators, a multitude of algorithms for performing a join exist (e.g., hash join, sort-merge join, etc.). Halstead et al. [13] propose an



FPGA accelerator to perform hash joins. They are able to achieve about 11 times higher throughput on an FPGA compared to a commercial software implementation executed on a multi-core CPU. Casper and Olukotun [11] propose an FPGA implementation of a sort-merge join, which achieves a throughput of 6.45 GB/s. In comparison, an implementation on a multi-core CPU achieves a throughput of 1 GB/s according to [24], and [19] reported a throughput of 4.6 GB/s for a GPU implementation. Most notably, these results show two major potentials of using FPGAs for accelerating database operations: First, it is often possible to achieve I/O-rate processing, leading to a higher overall throughput. Second, while the reported results of CPUs and GPUs are obtained with clock frequencies of over 3.2 GHz and 1.5 GHz, respectively, FPGA implementations are often only clocked at about 200 MHz. Hence, a higher energy efficiency can be accomplished with FPGAs [2, 29].

Another computationally intensive operator in database query processing is the selection of records based on regular expressions. Here, software solutions for both GPUs and multi-core CPUs quickly become computation-bound with increasing complexity of the regular expressions. There is a large body of work that uses regular-expression matching on FPGAs, based on the work of Sidhu and Prasanna [41]. On the downside, these designs evaluate only a single or a few fixed regular expressions, whereas database users compose regular expressions at run-time. To address this issue, István et al. [18] and Becher et al. [5] propose run-time parametrizable regex operators.

Today's DBMSs use a multiplicity of algorithms to implement each operator: Which algorithm suits a given query best in terms of execution time often depends on the selectivity and the amount of data to be processed [39, 40]. Therefore, when a query is received, the optimizer selects the best algorithm to achieve the goal of either execution speed or throughput. However, the static FPGA designs for database operators discussed above all suffer from the drawback that they do not allow for run-time algorithm selection to support instantaneous query optimizations. On the other hand, FPGAs allow for (partial) dynamic reconfiguration of logic blocks and interconnect, thus enabling the exchange of hardware modules at run-time within less than a millisecond, down to a few tenths of a millisecond [3].

Ueda et al. [46] present an FPGA-based join accelerator that can switch between hash join and merge join by means of reconfiguration, allowing more resources to be used for each accelerator than in a combined static design. Instead of holding both implementations on the FPGA simultaneously, it is possible to change between both implementations by reconfiguring the FPGA at run-time. A query optimizer should then be able to select the most appropriate implementation based on a cost model. Koch and Torresen [26] propose to make use of partial reconfiguration for sort operations to adjust the resource allocation of a FIFO-based merge sorter and a tree-based merge sorter depending on the problem size.

Recent work [1] on accelerating joins on compute clusters with multi-core nodes has shown that parallel join methods can scale (almost) linearly with the number of cores. In this context, FPGAs may not only be used to partition the data for the parallel instances, as, e.g., described by Kara et al. [20]. Moreover, comparable scalability may also be expected when parallelizing join operators on single or even clustered FPGA nodes.

2.1.2 Static Designs for Accelerating Queries

All the cases given above investigate the acceleration of a single database operator. Queries without the accelerated operator in their QEP cannot benefit from FPGA acceleration. Furthermore, if an operator not available on the FPGA turns out to be the bottleneck of the query, the other accelerated operators do not give a speed benefit. Therefore, some work considers operator pipelines for accelerating whole queries instead of just a single operator. Glacier [30, 31], for example, is a query-to-hardware compiler. For a given query, it automatically instantiates the required operators from a component library and generates a data-stream-processing accelerator, which is tailored to the given query. It is not intended for acceleration in systems with frequently changing queries.

Sukhwani et al. [45] propose a hardware/software co-design including an FPGA accelerator that consists of a feed-forward pipeline of hardware kernels for selection, projection, and sorting. It can be used for the acceleration of arbitrary queries via run-time parametrization in a tuple-at-a-time processing fashion, and pre-processed data is forwarded to the CPU-based host system for post-processing. This approach demonstrates that it is reasonable to couple FPGA-based hardware acceleration with software performing the remaining operations not supported by the FPGA.

Sidler et al. [42] propose a system architecture consisting of a CPU extended by an FPGA. The FPGA has accelerators for the Skyline operator, stochastic gradient descent, and regular-expression matching. The system runs with the in-memory DBMS MonetDB, and the accelerators are integrated into the DBMS. Thus, hardware operators are just additional operators for the DBMS and can be selected by the query optimizer. Here, it is also possible to schedule multiple queries on the same hardware operators [34].

Baidu [17] also implements specific accelerators for different SQL core operators on the FPGA: filter, sort, aggregate, join, and group by. Baidu's query optimizer can select hardware accelerators for calculating parts of the query, and the processing is then carried out on the accelerator.



2.1.3 Reconfigurable Designs for Accelerating Queries

A drawback of the approaches mentioned in the previous subsection is the static hardware-accelerator architecture: It is not possible to adapt it to queries. Rather, the QEP must be adapted to match the hardware. For example, queries with many predicates requiring more hardware units for predicate evaluation than available in the accelerator cannot be mapped directly to the FPGA. Instead, the optimizer decomposes them into sub-operations, each fitting to the accelerator, and processes them sequentially. Operators not being part of the accelerator must be processed in software. Thus, queries with complex operations, such as hash joins and regular-expression matching, cannot benefit from FPGA acceleration.

In this context, dynamic reconfiguration can provide a viable means to adapt the hardware to the query by exchanging acceleration modules at run-time. This provides query-specific acceleration optimized for the respective use case, while at the same time it achieves a more efficient utilization of the limited FPGA resources.

Wang et al. [48] investigate this potential of run-time reconfiguration for query processing. They show that even for the same query it makes sense to provide multiple accelerators and reconfigure between them while processing the query. Again, each accelerator can utilize the FPGA resources exclusively to increase parallelism and thus decrease execution time. However, due to the overhead of reconfiguration, only large datasets can benefit from it, since only then the reduction in execution time is larger than the reconfiguration time. Wang et al. further propose a methodology for automatically generating multiple QEPs for a given query. Each consists of a set of hardware accelerators and a schedule of how to switch between them. At run-time, the plan with the shortest execution time according to a cost model is chosen from the set of generated evaluation plans. While this validates the benefit of adapting the hardware to the query by means of reconfiguration, it has the major drawback that each accelerator is tailored to a specific query. Synthesizing an accelerator can take from minutes to hours. Consequently, it is infeasible to assemble accelerators for ad-hoc queries at run-time. Therefore, this approach can only be applied to queries that are already known at design time. Furthermore, the system only supports full reconfiguration of the FPGA. This implies, first, a larger reconfiguration overhead, as the time increases linearly with the size of the FPGA area to be reconfigured (up to seconds according to [48]). Second, as the data contained in FPGA memory is typically lost during full reconfiguration, an additional overhead occurs for transferring the local memory contents back to the CPU-based host system.

Ziener et al. [52] present a methodology for on-the-fly data-path generation of a query-stream pipeline. Query plans can be accelerated by composing and placing pre-synthesized modules (e.g., aggregation and restriction operators) on the FPGA by means of partial reconfiguration. Furthermore, this has enabled the use of column-oriented operators, an approach also used by Kara et al. [20].

2.2 Heterogeneous DBMS

The systems described above feature FPGA accelerators complementing the host CPUs. Due to different computation models, they are considered to be heterogeneous. From an architectural view, the structure is similar to systems that incorporate GPUs as co-processors in addition to CPUs [15, 16, 51, 7]. Such heterogeneous systems pose specific requirements on algorithmic design, query optimization, and cost models for DBMSs. In the following, we will describe these requirements by means of the state of the art for such heterogeneous systems.

2.2.1 Optimizing Operators for Heterogeneous Hardware

Tuning operators to a single co-processor has already attracted much attention (cf. [15, 19, 25, 43] for GPUs, [36, 37] for the Xeon Phi, and [29, 32] for FPGAs). However, a holistic approach should be able to work on arbitrary co-processors. There are two common implementation approaches. The straightforward one is the hardware-oblivious approach [16], where operators are implemented against a parallel-programming library, e.g., OpenCL. This approach allows for portability of operator code, but not of performance [40]. Hence, a hardware-sensitive approach should be used, where the code is optimized for each co-processor separately. DBMSs following a hardware-sensitive approach use primitive-based execution or query-compilation strategies.

Primitive-Based Execution: Data-parallel primitives have been suggested by He et al. as a means to implement database operators in a CPU/GPU system [15]. The idea is to split an operator into smaller functions. These functions can then easily be tuned for any co-processor, allowing optimizations orthogonal to the original operator. Such optimization possibilities for heterogeneous co-processors have already been shown by Pirk et al. for vector algebra on heterogeneous systems [35].
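The primitive-based idea can be sketched as follows. This is an illustrative Python sketch, not He et al.'s actual implementation: a selection operator is composed of three data-parallel primitives (map, prefix sum, and scatter), each of which could be tuned separately per co-processor.

```python
# Illustrative decomposition of a selection operator into data-parallel
# primitives. Each primitive has a trivially parallel structure, so each
# could be reimplemented and tuned for a GPU, FPGA, or multi-core CPU.

def p_map(f, xs):
    """Primitive 1: apply f to every element independently."""
    return [f(x) for x in xs]

def p_prefix_sum(flags):
    """Primitive 2: exclusive prefix sum over 0/1 flags."""
    out, acc = [], 0
    for f in flags:
        out.append(acc)
        acc += f
    return out, acc          # positions, total count

def p_scatter(xs, flags, pos, n):
    """Primitive 3: write each qualifying element to its output slot."""
    out = [None] * n
    for x, f, p in zip(xs, flags, pos):
        if f:
            out[p] = x
    return out

def selection(xs, pred):
    flags = p_map(lambda x: 1 if pred(x) else 0, xs)   # evaluate predicate
    pos, n = p_prefix_sum(flags)                       # compute output slots
    return p_scatter(xs, flags, pos, n)                # compact the result

print(selection([5, 12, 7, 30], lambda x: x > 6))      # -> [12, 7, 30]
```

The point of the decomposition is that the operator itself stays hardware-agnostic, while each primitive can be swapped for a device-specific variant.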

Query Compilation for Heterogeneous Hardware: Query compilation is an alternative to the operator-based execution of queries [33]. Here, a query is split into execution units fusing operator code. The dataflow allows data to be pipelined up to a pipeline breaker (a blocking operation like join or aggregation). To enable query compilation for heterogeneous co-processors, Breß et al. present Hawk [10] and Funke et al. present HORSEQC [12] as hardware-adaptive query compilers. Besides other optimizations, Hawk


employs code optimizations to tune operator code for the used co-processor, and HORSEQC fuses operators.
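The operator-fusion idea behind such compilers can be illustrated with a minimal sketch (not the actual Hawk or HORSEQC output): a filter and an aggregation are fused into one loop, so tuples flow through the whole execution unit without being materialized in between.

```python
# Sketch of operator fusion: SELECT SUM(price) FROM sales WHERE price > 100.
# Values are illustrative.

prices = [250, 80, 130, 40]

# Operator-at-a-time: the intermediate result is materialized.
filtered = [p for p in prices if p > 100]   # operator 1: selection
total = sum(filtered)                       # operator 2: aggregation

# Fused execution unit: one pass, no intermediate materialization.
total_fused = 0
for p in prices:        # the pipeline runs up to the aggregation,
    if p > 100:         # which is the pipeline breaker
        total_fused += p

assert total == total_fused == 380
```

A compiler generating such fused loops per target device obtains both pipelining (no intermediates) and hardware-specific tuning of the loop body.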

2.2.2 Query Optimization in Heterogeneous DBMSs

Query optimization is a crucial part of DBMSs due to SQL's declarative nature. Traditional query optimization covers plan optimization and algorithm selection for a single processor. In a heterogeneous system, however, a multitude of new optimization possibilities arises (e.g., co-processor selection). In general, DBMSs either employ global query optimization, assigning operators at optimization time, or dynamic optimization at run-time. The latter can be achieved by query chopping [9], where operators are assigned to co-processors right when all their inputs are ready (i.e., all child operators are finished). While query chopping takes into account the load of the co-processors, it is a greedy exploitation, which may miss the global optimum that optimizers could find across processors.
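The greedy character of query chopping can be made concrete with a small sketch. All operator names, costs, and the load metric below are assumptions for illustration, not taken from [9]: operators whose children have finished are dispatched to the currently least-loaded co-processor.

```python
# Hedged sketch of query chopping: greedy run-time assignment of ready
# operators to co-processors, with no global look-ahead.

deps = {"scan_a": [], "scan_b": [], "join": ["scan_a", "scan_b"],
        "agg": ["join"]}                      # operator -> child operators
cost = {"scan_a": 2, "scan_b": 3, "join": 5, "agg": 1}
load = {"CPU": 0.0, "GPU": 0.0}               # running load per device

done, placement = set(), {}
while len(done) < len(deps):
    # an operator is 'ready' once all its child operators are finished
    ready = [op for op in deps if op not in done
             and all(c in done for c in deps[op])]
    for op in ready:
        target = min(load, key=load.get)      # greedy: least-loaded device
        placement[op] = target
        load[target] += cost[op]
        done.add(op)

print(placement)
```

Because each decision only looks at the current load, a globally better plan (e.g., co-locating the join with both of its inputs to avoid transfers) can be missed, which is exactly the limitation noted above.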

Global optimization needs to understand the capabilities of the co-processors. Karnagel et al. have conducted various studies in this area [21, 23]. According to them, query processing on heterogeneous co-processors is based on characteristics like data size, computational complexity of operators, co-processor properties, and execution times of physical operators. They propose a Heterogeneity-aware physical Operator Placement (HOP) that assigns the physical operators to co-processors at run-time using a cost model, which calculates the execution cost based on operator behavior, hardware characteristics, and run-time information. Using this cost model, HOP can find the best operator placement for any input-parameter values. Breß et al. also address the operator-placement problem in [6] using the optimizer library HyPE, which follows a greedy and cost-based approach.

However, the operator-placement strategy faces challenges such as an increasing number of operators, which leads to a search-space explosion, and an uncertainty in cardinality estimation for intermediate results. Karnagel et al. tackle this issue with a greedy technique, the so-called strong placement, which assigns an operator to one co-processor and places the remaining operators by majority voting [22]. Since both techniques rely on a greedy strategy, the optimal solution may not be found.

An adaptive placement approach was later proposed in [23] to tackle the unexpected behavior of an operator, caused by a complex query structure, large input data, and the location of intermediate results. In this technique, a part of the query, called an execution island, is optimized and executed. Since the optimization is limited to a part of the query, cardinalities of intermediate results and execution time can be estimated with high precision. Although this regional approach comes close to a global optimization, a tuple-at-a-time model would not benefit from it [23].

Considering energy-efficient query processing along with performance constraints is also a challenge in query optimization. In this case, an optimizer must find QEPs that meet the performance goals while consuming as little energy as possible w.r.t. the available (co-)processors. Lang et al. have designed a framework for query optimization that considers energy efficiency along with performance constraints. It works well for simple queries, but not for complex and concurrent queries [27]. Later, Xu et al. introduced a Power-Energy-Time (PET) framework for the PostgreSQL kernel [49]. They point out that energy-efficient QEPs may not have the shortest processing time. Ungethüm et al. proposed a hardware/software co-design approach on the Tomahawk architecture [47], where they utilize energy-efficient customizable instruction-set processors. Becher et al. [2] propose an FPGA-based hardware/software co-design for an energy-efficient hash-join operation.

Global query optimization in heterogeneous DBMSs greatly depends on the dynamically changing volume of data, the execution time of the query, the workload characteristics, and the unpredictable cardinality of data. Research addressing these characteristics in a fast as well as cost- and energy-efficient way is still highly active.

2.2.3 Cost Models for Operators on Heterogeneous Hardware

In the presence of heterogeneous co-processors, costs for operators on different co-processors have to be taken into account as well as transfer times between processors. There are two ways to determine the costs of an operator or query plan: using an algebraic or a learning-based cost model.

Algebraic Cost Models: An algebraic cost model is a parameterized function describing the cost for an operator according to the access pattern and the computation an operator causes. The specific parameters are usually determined in a calibration phase at start-up. Algebraic cost models give the most accurate cost estimations for an operator.
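As a minimal sketch of such a model, the cost of a scan-based selection can be written as a fixed setup cost plus per-tuple memory-access and compute terms. All constants below are assumed values standing in for what a real calibration phase would measure.

```python
# Minimal algebraic cost model sketch: cost(n) = setup + n * (mem + compute).
# The parameters are illustrative; a real system measures them at start-up.

def calibrate():
    # Assumed calibration results (seconds and seconds per tuple).
    return {"setup": 1e-4, "mem_access": 2e-9, "compute": 1e-9}

def selection_cost(n_tuples, params):
    return (params["setup"]
            + n_tuples * (params["mem_access"] + params["compute"]))

p = calibrate()
print(f"{selection_cost(10_000_000, p):.4f} s")   # cost estimate for 10M tuples
```

A per-device parameter set (one calibration per co-processor, plus a transfer-cost term) is what lets the same algebraic formula drive placement decisions across heterogeneous hardware.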

Cost models have a long history in DBMSs. For in-memory systems, Manegold et al. proposed a cost model taking cache hierarchies as the new bottleneck into account [28]. For GPUs, there is already a plethora of cost models for different operators [15, 50]. However, a holistic and comprehensive cost model is hard to find for arbitrary co-processor and operator combinations.

Learning-Based Cost Models: Learning-based cost models use machine-learning techniques to estimate the performance of an operator on a co-processor. For heterogeneous co-processors, Breß et al. introduced HyPE [6, 8], and Karnagel presented HOP [21] as learning-based cost estimators.


However, both estimators currently do not take concurrent execution into account.
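In its simplest form, a learning-based estimator fits a model from observed executions and predicts the runtime for unseen input sizes. The sketch below fits a linear model runtime = a·n + b per (operator, co-processor) pair; real estimators such as HyPE use richer features and models, so this only shows the principle, and the sample runtimes are invented.

```python
# Hedged sketch of a learning-based cost estimator: ordinary least squares
# on (input size, observed runtime) pairs, then prediction for a new size.

def fit_linear(samples):                 # samples: [(n_tuples, runtime), ...]
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n          # slope, intercept

# Assumed observations of one operator on one co-processor (size, seconds).
observed = [(1_000, 0.011), (10_000, 0.101), (100_000, 1.001)]
a, b = fit_linear(observed)
predicted = a * 50_000 + b               # estimated cost for 50k tuples
print(round(predicted, 3))
```

The appeal over algebraic models is that no device internals need to be modeled; the drawback, as noted above, is that effects such as concurrent execution are invisible unless they appear in the training observations.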

2.3 Summary

Due to resource limitations, there is no FPGA accelerator that supports all queries in the best way. For flexible and generic coverage of queries, mapping query operators to the available co-processors and to the software of a heterogeneous hardware/software system is necessary. While there is work on operator placement for heterogeneous systems, e.g., by Karnagel et al. [21] and by Breß [7], they mainly target systems combining multi-core CPUs with GPUs. How to extend these systems to FPGAs, which come with very specific overheads when applying reconfiguration between operators, is still an open issue.

3 Challenges

The study of related work clearly shows that an integration of FPGAs into a DBMS is challenging, ranging from general challenges of heterogeneous systems (query optimization and partitioning) to FPGA-specific challenges (dynamic reconfiguration).

3.1 Exploitation of Heterogeneous Hardware/Software Architectures

Since different hardware concepts (e.g., CPUs, GPUs, and FPGAs) have their individual strengths and weaknesses in data processing, a clever partitioning of data and work has to be done in a heterogeneous system. Although first attempts have been made to schedule and distribute data and operations (cf. CoGaDB, HyPE), they consider neither the opportunity of hardware reconfiguration of the FPGA nor its stream-oriented processing style. Furthermore, caching strategies will probably not apply to an FPGA with its limited amount of dedicated memory.

3.2 Query Partitioning and Optimization

For a query optimizer in a heterogeneous system, it is challenging and a so far unsolved problem to (1) find an optimal partitioning of a query on the different heterogeneous co-processors and (2) determine the configuration of the reconfigurable hardware itself.

The first challenge (global view) creates a huge optimization space when considering multiple different co-processors, multiple implementation variants for an operator, different granularities for executing an operator, concurrent resource utilization on a co-processor, and even multiple objectives.

The second challenge (local view) implies a maintenance task that composes and schedules a reconfiguration of the FPGA when a specific function is efficiently supported or frequently used. This involves (a) selecting appropriate co-processors for the given query while considering reconfiguration overhead, (b) mapping query operations to such accelerators, and (c) reconfiguring the hardware to generate data paths for the accelerators that enable stream processing as close as possible to I/O rate.

3.3 Dynamic Hardware Reconfiguration

The challenge of dynamic reconfiguration is to efficiently adapt FPGAs to a query and to time-multiplex query processing under scarce resources. Especially ensuring that reconfiguration pays off is a huge challenge, since FPGA reconfiguration itself may take a time ranging from 1 ms (partial reconfiguration) to the order of 1 s (full reconfiguration). Thus, the overall savings must be higher than the reconfiguration costs. So, either the amount of data processed in one query must be large enough or the query needs to occur frequently enough. Another challenge is that queries use operators for data types with different or even arbitrary lengths (e.g., strings). In order to restrict the space of hardware-accelerator configurations to be synthesized and stored, a big challenge here is to identify a representative set of kernel modules that occur often but are highly parameterizable in their data types, and that are loaded and configured at run-time. Finding these modules with enough soft parameters left to avoid a dynamic hardware reconfiguration or module re-synthesis is a central focus and challenge of our ongoing research.
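The pay-off condition above reduces to a simple break-even check: reconfiguring is only worthwhile when the execution time saved by the accelerator exceeds the reconfiguration time. The timing values in this sketch are assumed for illustration only.

```python
# Break-even sketch for dynamic reconfiguration: swap in a hardware
# accelerator only if (software time - hardware time) > reconfiguration time.

def worth_reconfiguring(n_tuples, t_sw_per_tuple, t_hw_per_tuple, t_reconf):
    saving = n_tuples * (t_sw_per_tuple - t_hw_per_tuple)
    return saving > t_reconf

# Assumed per-tuple times: 20 ns in software vs. 2 ns on the accelerator,
# against a 1 ms partial-reconfiguration overhead.
print(worth_reconfiguring(10_000, 20e-9, 2e-9, 1e-3))        # small query: no
print(worth_reconfiguring(100_000_000, 20e-9, 2e-9, 1e-3))   # large query: yes
```

With a full reconfiguration on the order of 1 s, the break-even data volume grows by roughly three orders of magnitude, which is why partial reconfiguration and highly parameterizable pre-synthesized modules matter so much here.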

In the remainder of this paper, we present our approaches to meet these challenges.

4 Proposed Solutions

A substantial part of the challenges refers to the query optimization for heterogeneous systems that include an FPGA. In the following, two research approaches are presented in which query partitioning and optimization play a major role: ADAMANT at the Otto-von-Guericke-Universität Magdeburg (OvGU) and ReProVide at the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). On an abstract level, the common ground of both methodologies is to get a grip on the huge solution space with a distribution of optimization tasks among cooperative optimizers.

4.1 Cooperative Optimizers

The idea of cooperative optimizers is to split the optimization into a global and a local optimization according to the global and local views described in Sect. 3.2. Global optimization applies strategies at the level of the QEP, with the major goal of partitioning the QEP into sub-trees and assigning them to the heterogeneous processors. The local optimizer is responsible for finding the best implementation of a sub-tree on the specific target hardware. Fig. 1a already showed an example where parts of the QEP are assigned to an FPGA (dashed line). The target-specific optimization of the sub-tree involves selecting the operator variants to be executed (e.g., sort-merge join) and the parallelism granularity (e.g., parallel sorting), as well as generating the hardware configurations of the FPGA and determining the schedule for reconfiguring between them.

Fig. 2 Interplay of global and local optimizer

Due to the huge solution space, distributing certain optimization decisions to the autonomous co-processors is reasonable, as it parallelizes query optimization and lowers the degree of detail and thus the load on the requesting CPU. The global optimizer decides based on the current load and also on co-processor properties according to cost models that are frequently refreshed at run-time with the help of information from the local optimizers. The capabilities of the heterogeneous co-processors, such as device power, buffer pool, and cost of operator evaluation, play a key role in the degree of accuracy of query-cost estimation. They are only known to the respective local optimizers and thus must be communicated to the global optimizer. Partitioning a query between heterogeneous processors as well as further operator placement and mapping of sub-queries are only effective after a communication between local and global optimizer. Our proposed approach of cooperative optimization involves exchanging hints between global and local optimizer. This is illustrated in Fig. 2. With the help of these hints, the global optimizer can decide how much optimization is required per plan and can maximize the performance benefits. We expect this bidirectional hint interface to enable scalable optimization in both directions, i.e., globally and locally, which helps to subsequently reduce optimization overhead.
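The hint exchange can be illustrated by the following sketch. All class and function names (`Hints`, `CostReport`, `LocalOptimizer`, `global_optimizer`), the per-tuple cost model, and the device parameters are invented for illustration and are not the projects' actual interfaces:

```python
from dataclasses import dataclass, field

@dataclass
class Hints:                       # hints sent along with a sub-plan
    frequency: int                 # expected number of QEP executions
    latency_bound_ms: float        # upper bound for end-to-end latency
    optimization_budget_ms: float  # time the local optimizer may spend

@dataclass
class CostReport:                  # feedback returned by a local optimizer
    plan_cost_ms: float
    subtree_costs: dict = field(default_factory=dict)
    unsupported: list = field(default_factory=list)

class LocalOptimizer:
    def __init__(self, ns_per_tuple, unsupported_ops=()):
        self.ns_per_tuple = ns_per_tuple
        self.unsupported_ops = set(unsupported_ops)

    def estimate(self, subplan, num_tuples, hints):
        # A real local optimizer would select accelerators and schedule
        # reconfigurations here; we only model a per-tuple cost.
        costs = {op: num_tuples * self.ns_per_tuple * 1e-6 for op in subplan}
        unsupported = [op for op in subplan if op in self.unsupported_ops]
        return CostReport(sum(costs.values()), costs, unsupported)

def global_optimizer(subplan, num_tuples, devices, hints):
    """Assign the sub-plan to the device that supports all of its
    operators and reports the lowest estimated cost."""
    reports = {n: d.estimate(subplan, num_tuples, hints) for n, d in devices.items()}
    feasible = {n: r for n, r in reports.items() if not r.unsupported}
    return min(feasible, key=lambda n: feasible[n].plan_cost_ms)

devices = {"cpu":  LocalOptimizer(ns_per_tuple=10.0),
           "fpga": LocalOptimizer(ns_per_tuple=1.0,
                                  unsupported_ops=("complex_udf",))}
hints = Hints(frequency=100, latency_bound_ms=50.0, optimization_budget_ms=5.0)
print(global_optimizer(["scan", "filter"], 1_000_000, devices, hints))  # fpga
```

The second feedback channel mentioned above is also modeled: a local optimizer can flag an operator as unsupported, so a query containing `"complex_udf"` would fall back to the CPU in this sketch.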

In the remainder of this section, we introduce two proposed solutions and demonstrate how they incorporate the interplay of local and global optimizers.

4.2 ADAMANT: Adaptive Data Management in Evolving Heterogeneous Hardware/Software Systems

ADAMANT integrates a DBMS with extensible, adaptable support for heterogeneous hardware. To support a multitude of devices, we employ the concept of Plug ’n’ Play by abstracting their common functionalities. The main objectives of our project in order to support this highly volatile environment are as follows:

Abstraction of devices: Seamless integration of heterogeneous devices with different interfaces (e.g., PCIe, QPI, CAPI) on a one-by-one basis is a difficult, time-consuming, and redundant task. Thus, it is necessary to find a useful abstraction level that considers the fundamental differences between accelerators.

Multi-layered optimization: The presence of multiple processing devices for query execution creates a new dimension of parallelism – cross-device parallelism. Additionally, FPGAs provide spatial parallelism, but run-time reconfiguration limits the temporal availability of operators. Along with the dimensions of data and functional parallelism, these new dimensions require an improved optimization process in order to overcome the difficulties resulting from the explosion of the search space.

Exploiting device features: Exploiting the different features available in heterogeneous co-processors results in more efficient implementations. Device-specific functions are improved by fine-tuning device-dependent implementation parameters, such as the loop-unrolling depth or the vector width, based on feedback from the environment. This goes beyond possible optimizations of OpenCL by also allowing for data-sensitive performance portability w.r.t. workload selectivity, for example [40]. Here, FPGAs allow for operator implementations using a wider set of possible architectures than fixed (co-)processors.

4.2.1 Architecture

Since the performance of database operators, especially domain-specific ones, depends on optimized implementations, a direct implementation can easily become inflexible, which hinders the adoption of new processing devices and custom operators. In order to decouple domain-specific operators and hardware processing units, we base our system on a layered architecture. This enables an efficient incorporation of FPGAs with their vastly different intrinsic features and requirements.

Fig. 3 ADAMANT: Adaptive DBMS architecture

The architecture of the ADAMANT system is shown in Fig. 3. The global optimizer shown at the top performs the optimization steps found in classical DBMSs. It also delegates parts of the optimization process to local optimizers that are assigned to the co-processors. Thus, a layered approach to optimization enables efficient support for heterogeneous hardware. In addition to the local device-specific optimizer, the device manager of each co-processor, such as GPUs or FPGAs as well as the host CPU, contains run-time monitoring facilities to support the higher-level optimization components. At runtime, the device manager aims to improve operator performance by going beyond static compiler optimizations for the underlying device, using techniques such as micro-adaptivity [39].

The storage manager on the right side of Fig. 3 is responsible for requesting data transfers between dedicated device memories and main memory. It also passes estimates to higher optimization layers, for example using cost models similar to [14]. Finally, edge cases such as devices that share a unified memory view are handled in this component. The data gathered by all managers is aggregated by a single global hardware supervisor, providing a unified abstract view of the complete system to the global optimizer.

Supporting the software part of the ADAMANT architecture with FPGAs requires more than custom implementations of the usual database operators. Both the architecture of the base design and the hardware interfaces between the static infrastructure and the dynamically configured operators have to be engineered for high throughput and flexibility. This is also part of the ADAMANT project.

4.2.2 Query Optimization

As mentioned above, a global optimizer determines the best execution plan for a given domain-specific query. As multiple devices can be used for processing that query, we add a new layer, parallelism optimization, to exploit concurrent processing on the available devices. However, adding this new level of parallelism to the existing solution space makes it even harder to find the optimal path. Hence, an efficient solution for traversing this multi-dimensional parallelism space in a reasonable time is required. To overcome this, we follow three basic concepts:

Information propagation: Since device-related characteristics are a key factor for optimizing a query execution plan, information about the devices is propagated to the global optimizer. The characteristics of these devices, such as memory organization and memory footprint, are considered for optimization.

Delegation: To further reduce the search space, we split the optimization into smaller steps and delegate them to the local optimizer. As it is hard to keep track of all device-related characteristics and tune operators based on them, especially the device-dependent optimization steps are delegated to the local optimizer available on the device.

Parallelism optimization: In order to fully exploit the data and functional parallelism, we consider different levels of granularity for operators (cf. Fig. 1b) for each of the underlying devices. The granularity and device-specific variant of an operator affect the overall performance of the system. Hence, we have to carefully consider these trade-offs between the devices, which requires the optimizers to collaborate and decide on the granularity level of the operators.

Overall, the QEP given by the global optimizer is further optimized to fit the underlying heterogeneous system. Since these hardware characteristics are considered for an optimized evaluation plan, a multi-level optimization is performed by propagating information about the devices to the global optimizer and pushing the device-specific optimizations to the local optimizers. Finally, each device has its own optimal implementation points for its supported primitives, leading to collaboration among the available devices for path selection.
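A minimal sketch can illustrate why this granularity dimension blows up the search space; the devices, variants, granularities, and cost numbers below are all invented for illustration:

```python
# Hypothetical cost table for one join operator:
# (device, variant, granularity) -> estimated ms per million tuples
COSTS = {
    ("cpu",  "nested-loop", "operator"):     120.0,
    ("cpu",  "hash",        "operator"):      40.0,
    ("gpu",  "hash",        "operator"):      15.0,
    ("gpu",  "hash",        "sub-operator"):  11.0,
    ("fpga", "sort-merge",  "pipeline"):       6.0,
    ("fpga", "sort-merge",  "operator"):       9.0,
}

def best_configuration(cost_table):
    """Pick the cheapest (device, variant, granularity) triple by exhaustive
    search; real optimizers must prune this space instead."""
    return min(cost_table, key=cost_table.get)

print(best_configuration(COSTS))  # ('fpga', 'sort-merge', 'pipeline')

# The space grows multiplicatively with plan size: with 3 devices, 3 variants,
# and 3 granularities per operator, a 10-operator plan already has 27^10 options.
print((3 * 3 * 3) ** 10)
```

Exhaustive enumeration is feasible for a single operator, but the final line shows why the cooperative optimizers must prune and delegate instead of enumerating whole plans.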

4.3 ReProVide: Query Optimization and Near-Data Processing on Reconfigurable SoCs for Big Data Analysis

ReProVide (standing for Reconfigurable Data ProVision) follows the idea of near-data processing. ReProVide units, as illustrated in Fig. 4, serve as data providers that are accessible via network and built around a reconfigurable SoC. The data is stored in non-volatile memory such as SSDs or in volatile DDR-SDRAM, which is directly attached to and managed by the platform. A main assumption and advantage of ReProVide is that it abstracts from the actual table and storage schema. Rather, the DBMS conveys which data it requires in which format. ReProVide units come with processing capabilities (in the form of hardware accelerators and closely-coupled multi-core processors) that can be used for data processing. Moreover, they contain special dataflow-control circuits (so-called ReOrder units [4]) to assemble tables and data as requested by the DBMS at I/O-rate (schema-on-read). This design enables ReProVide units to locally optimize the storage location of the data with regard to availability, access latency, power consumption, and the access patterns of the available hardware accelerators.

Fig. 4 A schematic overview of a ReProVide near-data query-processing system implemented on a programmable SoC with a multi-core CPU and multiple partially reconfigurable FPGA regions

These schema-on-read capabilities provided by the ReOrder units [4] bring multiple benefits:

(a) Flexibility to process data stored in row-oriented as well as column-oriented format.

(b) The data can be modified on the fly so that only columns which are actually needed are sent over the network to the requesting DBMS host, thus relaxing the I/O bottleneck.

(c) Additional columns can be inserted, e.g., while the data is fetched by the DBMS.

One example would be the on-the-fly calculation of intermediate results (e.g., price × discount), which are then inserted into the tuple stream, relieving not only the DBMS of this calculation, but possibly also the network by transmitting fewer attributes. Moreover, the host can request an operator-optimized schema. Such a schema could be a hybrid between column- and row-store to, e.g., separate join keys and join data to allow for faster sorting/joining of the keys.

In addition, the processing capabilities of ReProVide itself can be exploited to reduce the number of rows sent over the network by pushing restrictions down to the data.
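The combined effect of projection, derived-column insertion, and restriction push-down can be modeled in software as follows. The column names, the derived column, and the predicate are invented for illustration; the actual ReOrder units perform these steps in hardware at I/O-rate:

```python
def reorder_stream(tuples, needed, derived, predicate):
    """Software model of schema-on-read at the data provider: filter rows
    at the source, add derived columns on the fly, and ship only the
    attributes the DBMS actually asked for."""
    for row in tuples:
        if not predicate(row):
            continue                       # restriction pushed down to the data
        out = {k: row[k] for k in needed}  # projection: fewer attributes on the wire
        for name, fn in derived.items():
            out[name] = fn(row)            # e.g. price * discount, computed on the fly
        yield out

# Invented example table with an unneeded 'comment' attribute:
lineitem = [
    {"okey": 1, "price": 100.0, "discount": 0.10, "comment": "..."},
    {"okey": 2, "price": 500.0, "discount": 0.05, "comment": "..."},
    {"okey": 3, "price": 50.0,  "discount": 0.20, "comment": "..."},
]
rows = list(reorder_stream(
    lineitem,
    needed=["okey"],
    derived={"rebate": lambda r: r["price"] * r["discount"]},
    predicate=lambda r: r["price"] >= 100.0))
print(rows)  # [{'okey': 1, 'rebate': 10.0}, {'okey': 2, 'rebate': 25.0}]
```

Only two narrow tuples cross the network instead of three wide ones, which is exactly the I/O relief described above.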


4.3.1 Architecture

The ReProVide platform itself constitutes a heterogeneous hardware/software system, as shown in Fig. 4. While the CPU subsystem can process the full variety of operators and types, its performance may be limited. On the other hand, operators implemented in hardware can utilize parallel and pipelined implementations to achieve up to I/O-rate processing. ReProVide units therefore include partially reconfigurable areas (highlighted in blue in Fig. 4), into which query-/operator-specific accelerators can be dynamically loaded. The dynamic synthesis of a new operator (query compilation) can take from minutes up to multiple hours. We therefore make use of a library of pre-synthesized reconfigurable hardware accelerators.

As the storage space for this library and thus the number of accelerators is limited, these accelerators will most often implement multiple operators in order to gain a proper coverage of operators and queries. The challenge is to find a good trade-off between a high coverage of common operators, the resource limitations imposed by the partial areas, and the memory requirement for storing the library. Larger partial areas mean less resource restriction for implementing an accelerator, but increased reconfiguration time (start-up cost) and fewer concurrently available partial areas in total (reduced parallelism). We will therefore provide a design flow to automatically generate partially reconfigurable accelerators, i.e., pipelines of operators, based on statistical data collected by the ReProVide unit itself (e.g., which operators in which combination and with which frequency were requested). By making use of model-based multi-objective optimization, it is thus possible to automatically obtain an optimized set of accelerator designs, which constitutes an optimized solution for frequently occurring operators and queries.
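A greedy heuristic can illustrate this selection problem. The candidate designs, bitstream sizes, and request frequencies below are invented, and the real design flow uses model-based multi-objective optimization rather than this single-objective greedy sketch:

```python
def select_library(candidates, frequencies, budget):
    """Greedily pick accelerator designs maximizing newly covered request
    frequency per megabyte of bitstream storage until the budget is spent."""
    chosen, covered = [], set()
    remaining = budget
    while True:
        def gain(c):
            return sum(frequencies[p] for p in c["pipelines"] if p not in covered)
        affordable = [c for c in candidates
                      if c not in chosen and c["size_mb"] <= remaining and gain(c) > 0]
        if not affordable:
            return chosen
        best = max(affordable, key=lambda c: gain(c) / c["size_mb"])
        chosen.append(best)
        covered |= set(best["pipelines"])
        remaining -= best["size_mb"]

# Hypothetical pre-synthesized designs, each covering some operator pipelines:
candidates = [
    {"name": "filter+proj", "pipelines": ["filter", "filter|project"], "size_mb": 4},
    {"name": "join",        "pipelines": ["hash-join"],                "size_mb": 8},
    {"name": "regex",       "pipelines": ["regex-match"],              "size_mb": 6},
]
freq = {"filter": 50, "filter|project": 30, "hash-join": 40, "regex-match": 5}
lib = select_library(candidates, freq, budget=12)
print([c["name"] for c in lib])  # ['filter+proj', 'join']
```

With a 12 MB budget, the rarely requested regex design is dropped, mirroring the coverage-versus-storage trade-off described above.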

Fig. 5 Example of partitioning a QEP into sub-trees (dashed boxes), binding of partitions to accelerators, and scheduling of accelerators on allocated partial areas

To improve resource efficiency, unused parts of loaded accelerators can be used to generate meta-data on the data streamed through them, e.g., statistics and indices, if re-execution is likely. While indices can help a local optimizer, statistics are important for the global optimizer. Both optimizers need to work together closely to leverage the full potential of such a system.

4.3.2 Query Optimization

The interaction of the DBMS and ReProVide has already been sketched in Fig. 2. What is required is the investigation of novel global query-optimization techniques based on concepts known from multi-databases. The analysis and pre-processing of a QEP consists of three phases:

In the first phase, the global optimizer decides which operations of a QEP are worthwhile to be assigned to the ReProVide SoC. The local optimizer provides information on its capabilities and statistics of its local data. In order to obtain a fast, energy-efficient, and cost-effective query execution, a novel cost model as well as a novel hierarchical query-optimization technique need to be investigated.

In the second phase, the QEP is passed from the global optimizer to a local optimizer, including some hints. The local optimizer then selects accelerators, maps query operations to accelerators, and schedules the reconfiguration of the accelerators (see Fig. 5). We envision that the local optimizer not only returns the cost for the complete QEP, but also the cost for specific sub-trees. In Fig. 5, each QEP partition is annotated with the expected end-to-end execution time for processing the partition's input data based on cost models and selectivity statistics of the data sources. The goal of this strategy is to reduce the number of subsequent calls for cost estimations of other QEPs, as it gives the optimizer the possibility of anticipating the impact of reordering the QEP or of assigning only sub-trees of it to the ReProVide platform. Moreover, ReProVide may also answer that it cannot execute some operator at all, for instance, because some expressions are too complex. Along with the QEP, the frequency of QEP execution and an upper bound for latency are also transferred as hints. Thus, the local optimizer can keep often-used accelerators loaded and also set a time budget for its own optimization run.
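The bottom-up cost annotation of sub-trees can be sketched as follows, with invented selectivities and per-tuple costs standing in for the actual cost models and data-source statistics:

```python
from dataclasses import dataclass

@dataclass
class Node:
    op: str
    selectivity: float        # fraction of input tuples surviving this operator
    cost_ns_per_tuple: float  # hypothetical per-tuple processing cost
    child: "Node | None" = None

def annotate_costs(node, input_tuples):
    """Annotate every sub-tree with its estimated end-to-end time in ms,
    bottom-up, so the global optimizer can price partial assignments."""
    if node.child is not None:
        child_ms, tuples_in = annotate_costs(node.child, input_tuples)
    else:
        child_ms, tuples_in = 0.0, input_tuples
    node.est_ms = child_ms + tuples_in * node.cost_ns_per_tuple * 1e-6
    return node.est_ms, tuples_in * node.selectivity

# A linear three-operator plan over one million tuples:
plan = Node("aggregate", 0.001, 5.0,
            child=Node("filter", 0.1, 2.0,
                       child=Node("scan", 1.0, 1.0)))
total_ms, _ = annotate_costs(plan, 1_000_000)
print(round(total_ms, 3))       # 3.5  (whole QEP)
print(plan.child.est_ms)        # 3.0  (scan+filter sub-tree only)
```

Because every sub-tree carries its own estimate, the global optimizer can price an alternative assignment (e.g., shipping only the scan+filter sub-tree to the ReProVide unit) without a second round-trip to the local optimizer.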

In the third and final phase, the query is executed and the results are returned to the host system.

5 Conclusion

FPGAs have a huge potential in query processing in DBMSs because of their special capabilities. In order to use them efficiently, we address three major challenges. First, efficient partitioning of data and DBMS operations among the available processors is necessary to fully exploit all of their capabilities. Second, the availability of a multitude of heterogeneous devices requires improved optimization strategies that take into consideration the device-based characteristics for finding an efficient evaluation plan. Finally, FPGAs have high reconfiguration and design-iteration costs and hence require concepts to adapt an accelerator circuitry to a given query. Since all of these challenges lie in the domain of query optimization, we propose a cooperative optimization approach. To overcome the issue of the larger search space, we split the general optimizer into a global and a local optimizer so that device-related optimization is performed by the local optimizer.

Thus, the major contributions of this paper are: identifying the challenges of integrating FPGAs into DBMSs, providing a comprehensive survey of the state of the art in FPGA-accelerated DBMS processing, and proposing a two-layered cooperative optimizer design to efficiently traverse the enlarged search space.

References

1. Barthels C, Alonso G, Hoefler T, Schneider T, Müller I (2017) Distributed join algorithms on thousands of cores. Proceedings VLDB Endowment 10(5):517–528

2. Becher A, Ziener D, Meyer-Wegener K, Teich J (2015) A co-design approach for accelerated SQL query processing via FPGA-based data filtering. In: FPT, pp 192–195

3. Becher A, Pirkl J, Herrmann A, Teich J, Wildermann S (2016a) Hybrid energy-aware reconfiguration management on Xilinx Zynq SoCs. In: ReConFig, pp 1–7

4. Becher A, Wildermann S, Mühlenthaler M, Teich J (2016b) ReOrder: runtime datapath generation for high-throughput multi-stream processing. In: ReConFig, pp 1–8

5. Becher A, Wildermann S, Teich J (2018) Optimistic regular expression matching on FPGAs for near-data processing. In: Proc. 14th Int. Workshop on Data Management on New Hardware, Houston, 11.6.2018, pp 1–4

6. Breß S (2013) Why it is time for a HyPE: a hybrid query processing engine for efficient GPU coprocessing in DBMS. Proceedings VLDB Endowment 6(12):1398–1403

7. Breß S (2014) The design and implementation of CoGaDB: a column-oriented GPU-accelerated DBMS. Datenbank Spektrum 14(3):199–209. https://doi.org/10.1007/s13222-014-0164-z

8. Breß S, Beier F, Rauhe H, Sattler KU, Schallehn E, Saake G (2013) Efficient co-processor utilization in database query processing. Inf Syst 38(8):1084–1096. https://doi.org/10.1016/j.is.2013.05.004

9. Breß S, Funke H, Teubner J (2016) Robust query processing in co-processor-accelerated databases. In: SIGMOD, pp 1891–1906. https://doi.org/10.1145/2882903.2882936

10. Breß S, Köcher B, Funke H, Rabl T, Markl V (2018) Generating custom code for efficient query execution on heterogeneous processors. VLDB J. https://doi.org/10.1007/s00778-018-0512-y

11. Casper J, Olukotun K (2014) Hardware acceleration of database operations. In: FPGA, pp 151–160

12. Funke H, Breß S, Noll S, Markl V, Teubner J (2018) Pipelined query processing in coprocessor environments. In: SIGMOD, pp 1603–1618. https://doi.org/10.1145/3183713.3183734

13. Halstead RJ, Sukhwani B, Min H, Thoennes M, Dube P, Asaad SW, Iyer B (2013) Accelerating join operation for relational databases with FPGAs. In: FCCM, pp 17–20

14. Hampel V, Pionteck T, Maehle E (2012) An approach for performance estimation of hybrid systems with FPGAs and GPUs as co-processors. In: ARCS, pp 160–171

15. He B, Lu M, Yang K, Fang R, Govindaraju NK, Luo Q, Sander PV (2009) Relational query coprocessing on graphics processors. ACM Trans Database Syst 34(4):Art. No 21. https://doi.org/10.1145/1620585.1620588

16. Heimel M, Saecker M, Pirk H, Manegold S, Markl V (2013) Hardware-oblivious parallelism for in-memory column-stores. Proceedings VLDB Endowment 6(9):709–720

17. Hemsoth N (2016) Baidu takes FPGA approach to accelerating SQL at scale. The Next Platform. https://www.nextplatform.com/2016/08/24/baidu-takes-fpga-approach-accelerating-big-sql/. Accessed 24 Aug 2018

18. István Z, Sidler D, Alonso G (2016) Runtime parameterizable regular expression operators for databases. In: FCCM, pp 204–211

19. Kaldewey T, Lohman GM, Müller R, Volk PB (2012) GPU join processing revisited. In: DaMoN, pp 55–62

20. Kara K, Giceva J, Alonso G (2017) FPGA-based data partitioning. In: SIGMOD, pp 433–445

21. Karnagel T, Habich D, Schlegel B, Lehner W (2014) Heterogeneity-aware operator placement in column-store DBMS. Datenbank Spektrum 14(3):211–221. https://doi.org/10.1007/s13222-014-0167-9

22. Karnagel T, Habich D, Lehner W (2015) Local vs. global optimization: operator placement strategies in heterogeneous environments. In: Proceedings of the Workshops of the EDBT/ICDT 2015 Joint Conference, Brussels, Belgium, 27 March 2015, pp 48–55

23. Karnagel T, Habich D, Lehner W (2017) Adaptive work placement for query processing on heterogeneous computing resources. Proceedings VLDB Endowment 10(7):733–744

24. Kim C, Sedlar E, Chhugani J, Kaldewey T, Nguyen AD, Blas AD, Lee VW, Satish N, Dubey P (2009) Sort vs. hash revisited: fast join implementation on modern multi-core CPUs. Proceedings VLDB Endowment 2(2):1378–1389

25. Kim C, Chhugani J, Satish N, Sedlar E, Nguyen AD, Kaldewey T, Lee VW, Brandt SA, Dubey P (2010) FAST: fast architecture sensitive tree search on modern CPUs and GPUs. In: SIGMOD, pp 339–350

26. Koch D, Tørresen J (2011) FPGASort: a high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting. In: FPGA, pp 45–54


27. Lang W, Kandhan R, Patel JM (2011) Rethinking query processing for energy efficiency: slowing down to win the race. IEEE Data Eng Bull 34(1):12–23

28. Manegold S, Boncz P, Kersten ML (2002) Generic database cost models for hierarchical memory systems. In: VLDB, pp 191–202

29. Müller R, Teubner J (2009) FPGA: what’s in it for a database? In: SIGMOD, pp 999–1004

30. Müller R, Teubner J, Alonso G (2009) Streams on wires – a query compiler for FPGAs. Proceedings VLDB Endowment 2(1):229–240

31. Müller R, Teubner J, Alonso G (2010) Glacier: a query-to-hardware compiler. In: SIGMOD, pp 1159–1162

32. Müller R, Teubner J, Alonso G (2012) Sorting networks on FPGAs. VLDB J 21(1):1–23. http://dx.doi.org/10.1007/s00778-011-0232-z

33. Neumann T, Leis V (2014) Compiling database queries into machine code. IEEE Data Eng Bull 37(1):3–11

34. Owaida M, Sidler D, Kara K, Alonso G (2017) Centaur: a framework for hybrid CPU-FPGA databases. In: FCCM, pp 211–218

35. Pirk H, Moll O, Zaharia M, Madden S (2016) Voodoo – a vector algebra for portable database performance on modern hardware. Proceedings VLDB Endowment 9(14):1707–1718

36. Pohl C (2017) Exploiting manycore architectures for parallel data stream processing. In: GvDB, pp 66–71. http://ceur-ws.org/Vol-1858/paper13.pdf

37. Polychroniou O, Raghavan A, Ross KA (2015) Rethinking SIMD vectorization for in-memory databases. In: SIGMOD, pp 1493–1508

38. Putnam A, Caulfield A, Chung E, Chiou D, Constantinides K, Demme J, Esmaeilzadeh H, Fowers J, Gray J, Haselman M, Hauck S, Heil S, Hormati A, Kim JY, Lanka S, Peterson E, Smith A, Thong J, Xiao PY, Burger D, Larus J, Gopal GP, Pope S (2014) A reconfigurable fabric for accelerating large-scale datacenter services. In: ISCA, pp 13–24

39. Raducanu B, Boncz PA, Zukowski M (2013) Micro adaptivity in Vectorwise. In: SIGMOD, pp 1231–1242. https://doi.org/10.1145/2463676.2465292

40. Rosenfeld V, Heimel M, Viebig C, Markl V (2015) The operator variant selection problem on heterogeneous hardware. In: ADMS, pp 1–12

41. Sidhu RPS, Prasanna VK (2001) Fast regular expression matching using FPGAs. In: FCCM, pp 227–238

42. Sidler D, István Z, Owaida M, Alonso G (2017) Accelerating pattern matching queries in hybrid CPU-FPGA architectures. In: SIGMOD, pp 403–415

43. Sitaridi E, Ross K (2013) Optimizing select conditions on GPUs. In: DaMoN, pp 1–4

44. Sukhwani B, Thoennes M, Min H, Dube P, Brezzo B, Asaad SW, Dillenberger D (2013) Large payload streaming database sort and projection on FPGAs. In: SBAC-PAD, pp 25–32

45. Sukhwani B, Thoennes M, Min H, Dube P, Brezzo B, Asaad SW, Dillenberger D (2015) A hardware/software approach for database query acceleration with FPGAs. Int J Parallel Program 43(6):1129–1159

46. Ueda T, Ito M, Ohara M (2015) A dynamically reconfigurable equi-joiner on FPGA. IBM Technical Report RT0969

47. Ungethüm A, Habich D, Karnagel T, Lehner W, Asmussen N, Völp M, Noethen B, Fettweis GP (2015) Query processing on low-energy many-core processors. In: ICDE, pp 155–160

48. Wang Z, Paul J, Cheah HY, He B, Zhang W (2016) Relational query processing on OpenCL-based FPGAs. In: FPL, pp 1–10

49. Xu Z, Tu Y, Wang X (2012) PET: reducing database energy cost via query optimization. Proceedings VLDB Endowment 5(12):1954–1957

50. Yuan Y, Lee R, Zhang X (2013) The yin and yang of processing data warehousing queries on GPU devices. Proceedings VLDB Endowment 6(10):817–828

51. Zhang S, He J, He B, Lu M (2013) OmniDB: towards portable and efficient query processing on parallel CPU/GPU architectures. Proceedings VLDB Endowment 6(12):1374–1377

52. Ziener D, Bauer F, Becher A, Dennl C, Meyer-Wegener K, Schürfeld U, Teich J, Vogt J, Weber H (2016) FPGA-based dynamically reconfigurable SQL query processing. ACM Trans Reconfigurable Technol Syst 9(4):Article No. 25
