
SABER: Window-Based Hybrid Stream Processing for Heterogeneous Architectures

Alexandros Koliousis†, Matthias Weidlich†‡, Raul Castro Fernandez†, Alexander L. Wolf†, Paolo Costa], Peter Pietzuch†

†Imperial College London ‡Humboldt-Universität zu Berlin ]Microsoft Research

{akolious, mweidlic, rc3011, alw, costa, prp}@imperial.ac.uk

ABSTRACT
Modern servers have become heterogeneous, often combining multi-core CPUs with many-core GPGPUs. Such heterogeneous architectures have the potential to improve the performance of data-intensive stream processing applications, but they are not supported by current relational stream processing engines. For an engine to exploit a heterogeneous architecture, it must execute streaming SQL queries with sufficient data-parallelism to fully utilise all available heterogeneous processors, and decide how to use each in the most effective way. It must do this while respecting the semantics of streaming SQL queries, in particular with regard to window handling.

We describe SABER, a hybrid high-performance relational stream processing engine for CPUs and GPGPUs. SABER executes window-based streaming SQL queries in a data-parallel fashion using all available CPU and GPGPU cores. Instead of statically assigning query operators to heterogeneous processors, SABER employs a new adaptive heterogeneous lookahead scheduling strategy, which increases the share of queries executing on the processor that yields the highest performance. To hide data movement costs, SABER pipelines the transfer of stream data between CPU and GPGPU memory. Our experimental comparison against state-of-the-art engines shows that SABER increases processing throughput while maintaining low latency for a wide range of streaming SQL queries with both small and large window sizes.

1. INTRODUCTION
The recent generation of stream processing engines, such as S4 [45], Storm [52], Samza [1], Infosphere Streams [13], Spark Streaming [56] and SEEP [17], executes data-intensive streaming queries for real-time analytics and event-pattern detection over high volumes of stream data. These engines achieve high processing throughput through data parallelism: streams are partitioned so that multiple instances of the same query operator can process batches of the stream data in parallel. To have access to sufficient parallelism during execution, these engines are deployed on shared-nothing clusters, scaling out processing across servers.

An increasingly viable alternative for improving the performance of stream processing is to exploit the parallel processing capability offered by heterogeneous servers: besides multi-core CPUs with dozens of processing cores, such servers also contain accelerators such as GPGPUs with thousands of cores. As heterogeneous servers have become commonplace in modern data centres and cloud offerings [24], researchers have proposed a range of parallel streaming algorithms for heterogeneous architectures, including for joins [36, 51], sorting [28] and sequence alignment [42]. Yet, we lack the design of a general-purpose relational stream processing engine that can transparently take advantage of accelerators such as GPGPUs while executing arbitrary streaming SQL queries with windowing [7].

The design of such a streaming engine for heterogeneous architectures raises three challenges, addressed in this paper:

(1) When to use GPGPUs for streaming SQL queries? The speed-up that GPGPUs achieve depends on the type of computation: executing a streaming operator over a batch of data in parallel can greatly benefit from the higher degree of parallelism of a GPGPU, but this is only the case when the data accesses fit well with a GPGPU's memory model. As we show experimentally in §6, some streaming operators accelerate well on GPGPUs, while others exhibit lower performance than on a CPU. A streaming engine with GPGPU support must therefore make the right scheduling decisions as to which streaming SQL queries to execute on the GPGPU in order to achieve high throughput for arbitrary queries.

(2) How to use GPGPUs with streaming window semantics? While a GPGPU can naturally process a discrete amount of data in parallel, streaming SQL queries require efficient support for sliding windows [7]. Existing data-parallel engines such as Spark Streaming [56], however, tie the size of physical batches of stream data, used for parallel processing, to that of logical windows specified in the queries. This means that the window size and slide of a query impact processing throughput; current data-parallel engines, for example, struggle to support small window sizes and slides efficiently. A streaming engine with GPGPU support must instead support arbitrary windows in queries while exploiting the most effective level of parallelism without manual configuration by the user.

(3) How to reduce the cost of data movement with GPGPUs? In many cases, the speed-up achieved by GPGPUs is bounded by the data movement cost over the PCI express (PCIe) bus. Therefore, a challenge is to ensure that the data is moved to the GPGPU in a continuous fashion so that the GPGPU is never idle.

We describe the design and implementation of SABER, a hybrid relational stream processing engine in Java that executes streaming SQL queries on both a multi-core CPU and a many-core GPGPU to achieve high processing throughput. The main idea behind SABER is that, instead of offloading a fixed set of query operators to an accelerator, it adopts a hybrid execution model in which all query operators can utilise all heterogeneous processors interchangeably. By “processor” we refer here to either an individual CPU core or an entire GPGPU. More specifically, SABER makes the following technical contributions:

Hybrid stream processing model. SABER takes a streaming SQL query and translates it to an operator graph. The operator graph is bundled with a batch of stream data to form a query task that can be scheduled on a heterogeneous processor. SABER's hybrid stream processing model then features two levels of data parallelism: (i) tasks run in parallel across the CPU's multiple cores and the GPGPU and (ii) a task running on the GPGPU is further parallelised across its many cores.

One approach for selecting the processor on which to run a given query operator is to build an offline performance model that predicts the performance of operators on each processor type. Accurate performance models, however, are hard to obtain, especially when the processing performance can change dynamically, e.g. due to skewed distributions of input data.

In contrast, SABER uses a new heterogeneous lookahead scheduling (HLS) algorithm: the scheduler tries to assign each task to the heterogeneous processor that, based on past behaviour, achieves the highest throughput for that task. If that processor is currently unavailable, the scheduler instead assigns a task to another processor with lower throughput that still yields an earlier completion time. HLS thus avoids the need for a complex offline model and accounts for changes in the runtime performance of tasks.

Window-aware task processing. When SABER executes query tasks on different heterogeneous processors, it supports sliding window semantics and maintains high throughput for small window sizes and slides. A dispatcher splits the stream into fixed-sized batches that include multiple fragments of windows processed jointly by a task. The computation of window boundaries is postponed until the task executes, thus shifting the cost from the sequential dispatching stage to the highly parallel task execution stage. For sliding windows, tasks perform incremental computation on the windows in a batch to avoid redundant computation.

SABER preserves the order of the result stream after the parallel, out-of-order processing of tasks by first storing the results in local buffers and then releasing them incrementally in the correct order as tasks finish execution.

Pipelined stream data movement. To avoid delays due to data movement, SABER introduces a five-stage pipelining mechanism that interleaves data movement and task execution on the GPGPU: the first three stages are used to maintain high utilisation of the PCIe bandwidth by pipelining the Direct Memory Access (DMA) transfers of batches to and from the GPGPU with the execution of associated tasks; the other two stages hide the memory latency associated with copying batches in and out of managed heap memory.

The remainder of the paper is organised as follows: §2 motivates the need for high-throughput stream processing and describes the features of heterogeneous architectures; §3 introduces our hybrid stream processing model; §4 describes the SABER architecture, focussing on its dispatching and scheduling mechanisms; and §5 explains how SABER executes query operators. The paper finishes with evaluation results (§6), related work (§7) and conclusions (§8).

2. BACKGROUND

2.1 High-throughput stream processing
Stream processing has witnessed an uptake in many application domains, including credit fraud detection [26], urban traffic management [9], click stream analytics [5], and data centre management [44]. In these domains, continuous streams of input data (e.g. transaction events or sensor readings) need to be processed in an online manner. The key challenge is to maximise processing throughput while staying within acceptable latency bounds.

For example, already in 2011, Facebook reported that a query for click stream analytics had to be evaluated over input streams of 9 GB/s, with a latency of a few seconds [49]. Credit card fraud detection systems must process up to 40,000 transactions per second and detect fraudulent activity within 25 ms to implement effective countermeasures such as blocking transactions [26]. In finance, the NovaSparks engine processes a stream of cash equity options at 150 million messages per second with sub-microsecond latency [46].

The queries for the above use cases can be expressed in a streaming relational model [7]. A stream is a potentially infinite sequence of relational tuples, and queries over a stream are window-based. A window identifies a finite subsequence of a stream, based on tuple count or time, which can be viewed as a relation. Windows may overlap—tuples of one window may also be part of subsequent windows, in which case the window slides over the stream. Distinct subsequences of tuples that are logically related to the same set of windows are referred to as panes [41], i.e. a window is a concatenation of panes. Streaming relational query operators, such as projection, selection, aggregation and join, are executed per window.

To execute streaming relational queries with high throughput, existing stream processing engines exploit parallelism: they either execute different queries in parallel (task parallelism), or execute operators of a single query in parallel on different partitions of the input stream (data parallelism). The need to run expensive analytics queries over high-rate input streams has focused recent research efforts on data parallelism.

Current data-parallel stream processing engines such as Storm [52], Spark [56] and SEEP [17] process streams in a data-parallel fashion according to a scale-out model, distributing operator instances across nodes in a shared-nothing cluster. While this approach can achieve high processing throughput, it suffers from drawbacks: (i) it assumes the availability of compute clusters; (ii) it requires a high-bandwidth network that can transfer streams between nodes; (iii) it can only approximate time-based window semantics due to the lack of a global clock in a distributed environment; and (iv) the overlap between windows requires redundant data to be sent to multiple nodes or the complexity of a pane-based approach [11].

2.2 Heterogeneous architectures
In contrast to distributed cluster deployments, we explore a different source of parallelism for stream processing that is becoming increasingly available in data centres and cloud services: servers with heterogeneous architectures [10, 54]. These servers, featuring multi-core CPUs and accelerators such as GPGPUs, offer unprecedented degrees of parallelism within a single machine.

GPGPUs implement a throughput-oriented architecture. Unlike traditional CPUs, which focus on minimising the runtime of sequential programs using out-of-order execution, speculative execution and deep memory cache hierarchies, GPGPUs target embarrassingly parallel workloads, maximising total system throughput rather than individual core performance. This leads to fundamental architectural differences—while CPUs have dozens of sophisticated, powerful cores, GPGPUs feature thousands of simple cores, following the single-instruction, multiple-data (SIMD) model: in each cycle, a GPGPU core executes the same operation (referred to as a kernel) on different data. This simplifies the control logic and allows more functional units per chip: e.g. the NVIDIA Quadro K5200 has 2,304 cores and supports tens of thousands of threads [47].

GPGPU cores are grouped into streaming multi-processors (SMs), with each SM supporting thousands of threads. Threads have their own dedicated sets of registers and access to high-bandwidth, shared on-chip memory. While modern CPUs hide latency by using a deep cache hierarchy, aggressive hardware multi-threading in GPGPUs allows latency to be hidden by interleaving threads at the instruction level with near zero overhead. Therefore, GPGPU cores have only small cache sizes compared to CPUs: e.g. the NVIDIA Quadro K5200 has a 64 KB L1 cache per SM and a 1,536 KB L2 cache shared by all SMs.

A dedicated accelerator typically is attached to a CPU via a PCIe bus, which has limited bandwidth.¹ For example, a PCIe 3.0 ×16 bus has an effective bandwidth of approximately 8 GB/s. This introduces a fundamental throughput limit when moving data between CPU and GPGPU memory. Especially when data movements occur at a high frequency, as in streaming scenarios, care must be taken to ensure that these transfers happen in a pipelined fashion to avoid stalling the GPGPU.

The potential of GPGPUs for accelerating queries has been shown for traditional relational database engines. For example, Ocelot [32] is an in-memory columnar database engine that statically offloads queries to a GPGPU using a single “hardware-oblivious” representation. GDB [29] and HyPE [16] accelerate SQL queries by offloading data-parallel operations to GPGPUs. All these systems, however, target one-off queries instead of continuous streaming queries with window semantics.

2.3 Using GPGPUs for stream processing
The dataflow model of GPGPUs purportedly is a good match for stream processing. In particular, performing computation on a group of tuples that is laid out in memory linearly should greatly benefit from the SIMD parallelism of GPGPUs. At the same time, the scarcity of designs for relational stream processing engines on GPGPUs is evidence of a range of challenges:

Different acceleration potential. Not all stream processing queries experience a speed-up when executed on a GPGPU compared to a CPU. Any speed-up depends on a range of factors, including the amount of computation per stream tuple, the size of the tuples, any memory accesses to temporary state and the produced result size. For example, a simple projection operator that removes attributes from tuples typically is bottlenecked by memory accesses, whereas an aggregation operator with a complex aggregation function is likely to be compute-bound. This makes it challenging to predict if the performance of a given query can be improved by a GPGPU.

Existing proposals for accelerating traditional database queries (e.g. [37, 55]) assume that a comprehensive analytical model can establish the relation between a query workload and its performance on a heterogeneous processor. Based on such a performance model, they offload queries to a GPGPU either statically [32] or speculatively [29]. Fundamentally, performance models are vulnerable to inaccuracy, which means that they may not make the most effective use of the available accelerators, given an unknown query workload.

The challenge is to devise an approach that makes effective decisions as to which queries can benefit from GPGPU offloading.

Support for sliding windows. Any data-parallel stream processing must split streams into batches for concurrent processing, but each streaming SQL query also includes a window specification that affects its result. The batch size should thus be independent from the window specification—the batch size is a physical implementation parameter that determines the efficiency of parallel processing, whereas the window size and slide are part of the query semantics.

¹Despite recent advances in integrated CPU/GPGPU designs [22, 30, 31], discrete GPGPUs have a substantial advantage in terms of performance and power efficiency.

[Figure 1: Performance of a streaming GROUP-BY query with a 5-second window and different window slides under Spark Streaming. The plot shows throughput (10^6 tuples/s) against the window slide (10^6 tuples).]

Existing data-parallel engines, however, tie the definition of batches to that of windows because this simplifies window-based processing. For example, Spark Streaming [56] requires the window size and slide to be a multiple of the batch size, enabling parallel window computation to occur on different nodes in lockstep.

Such a coupling of batches and windows has a performance impact. In Fig. 1, we show the throughput of a streaming GROUP-BY query with a 5-second sliding window executing on Spark Streaming with a changing window slide. As the slide becomes smaller (i.e. there is more overlap between windows), the throughput decreases—when the batches are smaller, the parallel window processing becomes less efficient. This makes it infeasible to process queries with fewer than several million 64-byte tuples in each window slide while sustaining high throughput.

The challenge is how an engine can decouple batches and windows from each other and still ensure a consistently high throughput.

Cost of stream data movement. Since a stream processing engine processes continuous data, the bandwidth of the PCIe bus can limit the throughput of queries offloaded to the GPGPU. In particular, any data movement to and from the GPGPU causes a delay—a DMA-managed memory transfer from the CPU to the GPGPU takes around 10 microseconds [43], which is orders of magnitude slower than a CPU memory access.

The challenge is how an engine can amortise the delay of data movement to and from the GPGPU while processing data streams.

2.4 Stream data and query model
We assume a relational stream data model with window-based queries, similar to the continuous query language (CQL) [7].

Data streams. Let T be a universe of tuples, each being a sequence of values of primitive data types. A stream S = 〈t1, t2, ...〉 is an infinite sequence of such tuples, and T* denotes the set of all streams. A tuple t has a timestamp, τ(t) ∈ T, and the order of tuples in a stream respects these timestamps, i.e. for two tuples ti and tj, i < j implies that τ(ti) ≤ τ(tj). We assume a discrete, ordered time domain T for timestamps, given as the non-negative integers {0, 1, ...}. Timestamps refer to logical application time: they record when a given event occurred and not when it arrived in the system. There is only a finite number of tuples with equal timestamps.

Window-based queries. We consider queries that are based on windows, which are finite sequences of tuples and thus form a subset of T*; the set of all, potentially infinite, sequences of windows is (T*)*. A window-based query q is defined for n input streams by three components: (i) an n-tuple of window functions 〈ω_1^q, ..., ω_n^q〉, such that each function ω_i^q : T* → (T*)*, when evaluated over an input stream, yields a possibly infinite sequence of windows; (ii) a possibly compound n-ary operator function f^q : (T*)^n → T* that is applied to n windows, one per input stream, and produces a window result (a sequence of tuples); and (iii) a stream function φ^q : (T*)* → T* that transforms a sequence of window results into a stream.


Common window functions define count- or time-based windows with a window size s and a window slide l: the window size determines the amount of enclosed data; the window slide controls the difference to subsequent windows. Let S = 〈t1, t2, ...〉 be a stream, and ω(s,l) be a window function with size s and slide l that creates a sequence of windows ω(s,l)(S) = 〈w1, w2, ...〉. Given a window wi = 〈tk, ..., tm〉, i ∈ N, if ω is count-based, then m = k + s − 1 and the next window is wi+1 = 〈tk+l, ..., tm+l〉; if ω is time-based, then the next window is wi+1 = 〈tk′, ..., tm′〉 such that τ(tm) − τ(tk) ≤ s, τ(tm+1) − τ(tk) > s, τ(tk′) − τ(tk) ≤ l, and τ(tk′+1) − τ(tk) > l. This model thus supports sliding (l < s) and tumbling windows (l = s). We do not consider predicate-based windows in this work.

Query operators. The above model allows for the flexible definition of an operator function that transforms a window into a window result, both being sequences of tuples. In this work, we support relational streaming operators [7], i.e. projection (π), selection (σ), aggregation (α) and θ-join (⋈). These operators interpret a window as a relation (i.e. the sequence is interpreted as a multi-set) over which they are evaluated. For the aggregation operator, we also consider the use of GROUP-BY clauses.
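As a small illustration of the count-based case, the following hypothetical helper (the class and method names, and the 0-based indexing, are our assumptions rather than anything from the paper) maps a window index to the range of tuple indices it covers for a given size s and slide l.

```java
// Hypothetical helper: enumerates the tuple index ranges of count-based
// windows w_i = <t_k, ..., t_{k+s-1}> that advance by a slide of l tuples.
final class CountWindows {
    // Returns the [start, end] tuple indices (0-based) of the i-th window
    // for a count-based window function with size s and slide l.
    static long[] window(long i, long s, long l) {
        long start = i * l;        // each window starts l tuples after the previous one
        long end = start + s - 1;  // m = k + s - 1
        return new long[] { start, end };
    }

    public static void main(String[] args) {
        // With s = 3 and l = 1, windows cover tuples [0,2], [1,3], [2,4], ...
        for (long i = 0; i < 3; i++) {
            long[] w = window(i, 3, 1);
            System.out.printf("w%d = tuples %d..%d%n", i + 1, w[0], w[1]);
        }
    }
}
```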

Operator functions may also be specified as user-defined functions (UDFs), which implement bespoke computation per window. An example of a UDF is an n-ary partition join, which takes as input an n-tuple of windows, one per input stream, and first partitions all windows based on tuple values before joining the corresponding partitions of the windows. Despite its similarity, a partition join cannot be realised with a standard θ-join operator.

Stream creation. Window results are turned into a stream by a stream function. We consider stream functions that are based on relation-to-stream operators [7]. The RStream function concatenates the results of windows under the assumption that this order respects the tuple timestamps: with 〈w1, w2, ...〉 as a sequence of window results, and wi = 〈t_{k_i}, ..., t_{m_i}〉 for i ∈ N, the result stream is a sequence SR = 〈t_{k_1}, ..., t_{m_1}, t_{k_2}, ..., t_{m_2}, ...〉. The IStream function considers only the tuples of a window result that were not part of the previous result: with 〈..., t_{j_{i+1}}, ...〉 as the projection of wi+1 to all tuples t_{j_{i+1}} ∉ {t_{k_i}, ..., t_{m_i}} that are not part of wi, the result stream is SR = 〈t_{k_1}, ..., t_{m_1}, ..., t_{j_2}, ..., t_{j_3}, ...〉.

Typically, specific operator functions are combined with particular stream functions [7]: using the IStream function with projection and selection operators gives intuitive semantics, whereas aggregation and θ-join operators are used with the RStream function. In the remainder, we assume these default combinations.

3. HYBRID STREAM PROCESSING MODEL
Our idea is to sidestep the problem of “when” and “what” operators to offload to a heterogeneous processor by introducing a hybrid stream processing model. In this hybrid model, the stream processing engine always tries to utilise all available heterogeneous processors for query execution opportunistically, thus achieving the aggregate throughput. In return, the engine does not have to make an early decision regarding which type of query to execute on a given heterogeneous processor, be it a CPU core or a GPGPU.

Query tasks. The hybrid stream processing model requires that each query can be scheduled on any heterogeneous processor, leaving the scheduling decision until runtime. To achieve this, queries are executed as a set of data-parallel query tasks that are runnable on either a CPU core or the GPGPU. For a query q with n input streams, a query task v = (f^q, B) consists of (i) the n-ary operator function f^q specified in the query and (ii) a sequence of n stream batches B = 〈b1, ..., bn〉, bi ∈ T*, i.e. n finite sequences of tuples, one per input stream. We use ι(v) = q as a shorthand notation to refer to the query of task v (§4.2, Alg. 1).

[Figure 2: Stream batches under two different window definitions: an input data stream t1, ..., t11 is split into batches of 5 tuples, once for windows with size s=3 and slide l=1 and once for windows with size s=7 and slide l=2.]

The query task size ϕ is a system parameter that specifies how much stream data a query task must process, thus determining the computational cost of executing a task. Let v = (f^q, B) with B = 〈b1, ..., bn〉 be a query task, then the query task size is the sum Σ_{i=1}^{n} |bi| of the stream batch sizes. The size of a stream batch bi = 〈t1, ..., tm〉 is the data volume of its m tuples. The query task size equals the stream batch size in the case of single-input queries.

As we show experimentally in §6.4, a fixed query task size can in practice be chosen independently of the query workload, simply based on the properties of the stream processing engine implementation and its underlying hardware. There is a trade-off when selecting the query task size ϕ: a large size leads to higher throughput because it amortises scheduling overhead, exhibits more parallelism on the GPGPU and makes the transmission of query tasks to the GPGPU more efficient; on the other hand, a small size achieves lower processing latency because tasks complete more quickly.

Window handling. An important invariant under our hybrid model is that a stream batch, and thus a query task, are independent of the definition of windows over the input streams. Unlike pane-based processing [11, 41], the size of a stream batch is therefore not determined by the window slide. This makes window handling in the hybrid model fundamentally different from that of existing approaches for data-parallel stream processing, such as Spark Streaming [56] or Storm [52]. By decoupling the parallelisation level (i.e. the query task) from the specifics of the query (i.e. the window definition), the hybrid model can support queries over fine-grained windows (i.e. ones with small slides) with full data-parallelism.

Fig. 2 shows the relationship between the stream batches of a query task and the respective window definitions from a query. In this example, stream batches contain 5 tuples. With a window definition of ω(3,1), stream batch b1 contains 3 complete windows, w1, w2, and w3, and fragments of two windows, w4 and w5. The remaining tuples of the latter two windows are contained in stream batch b2. For the larger window definition ω(7,2), the first stream batch b′1 contains only window fragments: none of w1, w2, and w3 are complete windows. Instead, these windows also span across tuples in stream batch b′2.

Operators. A stream batch can contain complete windows or only window fragments. Hence, the result of the stream query for a particular sequence of windows (one per input stream) is assembled from the results of multiple query tasks. That is, a window w = 〈t1, ..., tk〉 of one of the input streams is partitioned into m window fragments {d1, ..., dm}, di ∈ T*, such that the concatenation of the window fragments yields the window.

As shown in Fig. 3, this has consequences for the realisation of the n-ary operator function f^q: it must be decomposed into a fragment operator function f_f^q and an assembly operator function f_a^q. The fragment operator function f_f^q : (T*)^n → T* defines the processing for a sequence of n window fragments, one per stream, and yields a sequence of tuples, i.e. a window fragment result.

[Figure 3: Example of computation based on window fragments: batch operator functions f_b(b1) and f_b(b2) apply fragment operator functions f_f to the window fragments in batches b1 and b2; assembly operator functions f_a(w1) and f_a(w2) then construct the results for windows w1 and w2 from the window fragment results.]

The assembly operator function f_a^q : T* → T* constructs the complete window result by combining the window fragment results from f_f^q. The composition of the fragment operator function f_f^q and the assembly operator function f_a^q yields the operator function f^q.

The decomposition of function f^q into functions f_f^q and f_a^q is operator-specific. For many associative and commutative functions (e.g. aggregation functions for count or max), both f_f^q and f_a^q are the original operator function f^q. For other functions, such as median, more elaborate decompositions must be defined [50].

The decomposition also introduces a synchronisation point in the data-parallel execution because the assembly function may refer to the results of multiple tasks. In §4.3, we describe how this synchronisation can be implemented efficiently by reusing threads executing query tasks to perform the assembly function.

Incremental computation. When processing a query task with a sliding window, it is more efficient to use incremental computation: the computation for a window fragment should exploit the results already obtained for preceding window fragments in the same batch.

The hybrid model captures this optimisation through a batch operator function f_b^q : ((T*)^n)^m → (T*)^m that is applied to a query task. The batch operator function takes as input m sequences of n window fragments, where n is the number of input streams and m is the maximum number of window fragments per stream batch in the task. It outputs a sequence of m window fragment results. The same result would be obtained by applying the fragment operator function f_f^q m times to each of the m sequences of n window fragments.

Example. The effect of applying the hybrid stream processing model is shown in Fig. 3. Windows w1 and w2 span two stream batches, and they are processed by two query tasks. For each task, the batch operator function computes the results for the window fragments contained in that task. The assembly operator function constructs the window results from the window fragment results.
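To make the decomposition concrete, here is a minimal sketch, assuming a simple SUM aggregation over numeric values; the function names mirror f_f and f_a, but the code is illustrative rather than SABER's implementation.

```java
import java.util.Arrays;

// Illustrative sketch: decomposing a SUM aggregate into a fragment operator
// function f_f (partial sums per window fragment) and an assembly operator
// function f_a (merging fragment results across query tasks).
final class SumDecomposition {
    // f_f: computes a partial sum over the tuples of one window fragment.
    static long fragment(long[] fragmentValues) {
        return Arrays.stream(fragmentValues).sum();
    }

    // f_a: combines window fragment results into the complete window result.
    // SUM is associative and commutative, so assembly is the same operation.
    static long assemble(long[] fragmentResults) {
        return Arrays.stream(fragmentResults).sum();
    }

    public static void main(String[] args) {
        // A window that spans two stream batches is processed by two tasks,
        // each producing a fragment result that the result stage later merges.
        long r1 = fragment(new long[] {1, 2, 3});   // fragment of w1 in batch b1
        long r2 = fragment(new long[] {4, 5});      // fragment of w1 in batch b2
        System.out.println("sum(w1) = " + assemble(new long[] {r1, r2})); // 15
    }
}
```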

4. SABER ARCHITECTURE
We describe SABER, a relational stream processing engine for CPUs and GPGPUs that realises our hybrid stream processing model. We begin this section with an overview of the engine's architecture, and then explain how query tasks are dispatched (§4.1), scheduled (§4.2), and their results collected (§4.3).

Processing stages. The architecture of SABER consists of four stages controlling the lifecycle of query tasks, as illustrated in Fig. 4:

(i) A dispatching stage creates query tasks of a fixed size. These tasks can run on either type of processor and are placed into a single, system-wide queue.

(ii) A scheduling stage decides on the next task each processor should execute. SABER uses a new heterogeneous lookahead scheduling (HLS) algorithm that achieves full utilisation of all heterogeneous processors, while accounting for the different performance characteristics of query operators.

[Figure 4: Overview of the SABER architecture: input data streams are buffered in circular input buffers; the dispatching stage creates query tasks that are placed in a queue of query tasks; the scheduling stage assigns tasks to CPU-based or GPGPU-based execution; and the result stage collects results in a result buffer and emits the output data stream.]

(iii) An execution stage runs the query tasks on a processor by evaluating the batch operator function on the input window fragments. A task executes either on one of the CPU cores or the GPGPU.

(iv) A result stage reorders the query task results that may arrive out-of-order due to the parallel execution of tasks. It also assembles window results from window fragment results by means of the assembly operator function.

Worker thread model. All four stages are executed by worker threads. Each worker is a CPU thread that is bound to a physical core, and one worker uses the GPGPU for the execution of query tasks. Worker threads handle the complete lifecycle of query tasks, which means that they are always fully utilised. This is in contrast to having separate threads for the dispatching or result stage, which could block due to out-of-order processing of query tasks.

Starting with the scheduling stage, the lifecycle of a worker thread is as follows: When a worker thread becomes idle, it invokes the scheduling algorithm to retrieve a task to execute from the system-wide task queue, indicating if it will execute the task on a CPU core or the GPGPU. The worker thread then executes the retrieved task on its associated heterogeneous processor. After task execution completes, the worker thread enters the result stage. In synchronisation with other worker threads, it orders window fragment results and, if possible, assembles complete window results from their fragments. Any assembled window results are then appended to the corresponding query's output data stream.

4.1 Dispatching of query tasks
Dispatching query tasks involves two steps: the storage of the incoming streaming data and the creation of query tasks. Query tasks are inserted into a system-wide queue to be scheduled for execution on the heterogeneous processors.

Buffering incoming data. To store incoming tuples, SABER uses a circular buffer per input stream and per query. By maintaining a buffer per query, processing does not need to be synchronised among the queries. Data can be removed from the buffer as soon as it is no longer required for query task execution.

The circular buffer is backed by an array, with pointers to the start and end of the buffer content. SABER uses byte arrays, so that tuples are inserted into the buffer without prior deserialisation. The implementation is lock-free, which is achieved by ensuring that only one worker thread inserts data into the buffer (worker threads are only synchronised in the result stage) and by explicitly controlling which data is no longer kept in the buffer. A query task contains a start pointer and an end pointer that define the respective stream batch. In addition, for each input buffer of a query, a query task contains a free pointer that indicates up to which position data can be released from the buffer.

A worker thread that executes a query task has only read-access to the buffer. Releasing data as part of the result stage, in turn, means moving the start pointer of the buffer to the position of the free pointer of a query task for which the results have been processed.


Query task creation. The worker thread that inserts data into a circular buffer also handles the creation of query tasks. Instead of first deserialising tuples and computing windows in the sequential dispatching stage, all window computation is handled by the highly parallel query execution stage.

In the dispatching stage, the data inserted into the buffer is added to the current stream batch and, as soon as the sum of stream batch sizes in all query input streams exceeds the query task size ϕ, a query task is created. A query task identifier is assigned to each task so that all tasks of a query are totally ordered. This permits the result stage to order the results of query tasks.
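A minimal sketch of this dispatching step for a single-input query is given below. The pointer arithmetic is simplified (it ignores wrap-around of the circular buffer and the retention of tuples needed by windows that span batches), and all class, record and field names are our own assumptions rather than SABER's code.

```java
import java.util.Optional;

// Hypothetical dispatcher sketch for a single-input query: tuples arrive as
// raw bytes in a circular buffer, and a query task is created once the
// current stream batch reaches the configured query task size (phi).
final class Dispatcher {
    record QueryTask(long taskId, long startPointer, long endPointer, long freePointer) {}

    private final long taskSizeBytes;   // query task size (phi)
    private long nextTaskId = 0;        // totally orders tasks of this query
    private long batchStart = 0;        // logical start pointer of the current batch
    private long writePointer = 0;      // logical end of the buffered data

    Dispatcher(long taskSizeBytes) { this.taskSizeBytes = taskSizeBytes; }

    // Called by the single worker thread that inserts 'length' bytes of tuple data.
    Optional<QueryTask> onDataInserted(int length) {
        writePointer += length;
        if (writePointer - batchStart < taskSizeBytes) return Optional.empty();
        // The free pointer tells the result stage up to which position buffer
        // space may be released once this task's results have been processed
        // (simplified here: it ignores data that later windows still need).
        QueryTask task = new QueryTask(nextTaskId++, batchStart, writePointer, batchStart);
        batchStart = writePointer;      // the next batch starts where this one ended
        return Optional.of(task);
    }
}
```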

4.2 Scheduling of query tasks
The scheduling goal is to utilise all heterogeneous processors fully, yet to account for their different performance characteristics. The challenge is that the throughput of a given type of heterogeneous processor is query-dependent. Query tasks should therefore primarily execute on the processor that achieves the highest throughput for them. At the same time, overloading processors must be avoided because this would reduce the overall throughput. The scheduling decision must thus be query-specific and take the (planned) utilisation of the processors into account.

Instead of using a performance model, SABER observes the query task throughput, i.e. the number of query tasks executed per unit of time, defined per query and processor. Based on this, it estimates which processor type is preferred for executing a query task.

At runtime, the preferred processor may be busy executing other tasks. In such cases, SABER permits a task to execute on a non-preferred processor as long as it is expected to complete before the preferred processor for that task would become available.

Making scheduling decisions based on the observed task throughput has two advantages: (i) there is no need for hardware-specific performance models and (ii) the scheduling decisions are adaptive to changes in the query workload, e.g. when the selectivity of a query changes, the preferred processor for the tasks may also change. However, this approach assumes that the query behaviour is not too dynamic because, in the long term, the past execution of tasks must allow for the estimation of the performance of future tasks for a given query. Our experimental evaluation shows that this is indeed the case, even for workloads with dynamic changes.

Observing query task throughput. SABER maintains a matrix that records the relative performance differences when executing a query on a given type of processor. The query task throughput, denoted by ρ(q, p), is defined as the number of query tasks of query q that can be executed per time unit on processor p. For the CPU, this value denotes the aggregated throughput of all CPU cores; for the GPGPU, it is the overall throughput including data movement overheads.

For a set of queries Q = {q1, ..., qn} and two types of processors, P = {CPU, GPGPU}, we define the query task throughput matrix as

    C = | ρ(q1, CPU)   ρ(q1, GPGPU) |
        |    ...           ...      |
        | ρ(qn, CPU)   ρ(qn, GPGPU) |

Intuitively, in each row of this matrix, the column with the largest value indicates the preferred processor (highest throughput) for the query; the ratio r = ρ(q, CPU)/ρ(q, GPGPU) is the speed-up (r > 1) or slow-down (r < 1) of the CPU compared to the GPGPU.

The matrix is initialised under a uniform assumption, with a fixed value for all entries. It is then continuously updated by measuring the number of tasks of a query that are executed in a certain time span on a particular processor. Based on the average time over a fixed set of task executions, the throughput is computed and updated.

Algorithm 1: Hybrid lookahead scheduling (HLS) algorithm

input:  C, a query task throughput matrix over Q = {q1, ..., qn} and P = {CPU, GPGPU},
        p ∈ {CPU, GPGPU}, the processor for which a task shall be scheduled,
        w = 〈v1, ..., vk〉, k ≥ 1, a queue of query tasks, ι(vi) ∈ Q for 1 ≤ i ≤ k,
        count : Q × P → N, a function assigning a number of executions per query and processor,
        st, a switch threshold.
output: v ∈ {v1, ..., vk}, the query task selected for execution on processor p

 1  pos ← 1                                      // Initialise position in queue
 2  delay ← 0                                    // Initialise delay
    // While end of queue is not reached
 3  while pos ≤ k do
 4      q ← ι(w[pos])                            // Select query of task at current position
 5      p_pref ← argmax_{p′ ∈ P} C(q, p′)        // Determine preferred processor
        // If p is preferred, or preference ignored due to delay or switch threshold
 6      if (p = p_pref ∧ count(q, p) < st) ∨
           (p ≠ p_pref ∧ (count(q, p_pref) ≥ st ∨ delay ≥ 1/C(q, p))) then
            // If preference was ignored, reset execution counter
 7          if count(q, p_pref) ≥ st then count(q, p_pref) ← 0
 8          count(q, p) ← count(q, p) + 1        // Increment execution counter
 9          return w[pos]                        // Select task at current position
10      delay ← delay + 1/C(q, p_pref)           // Update delay
11      pos ← pos + 1                            // Move current position in queue
12  return w[pos]

Hybrid lookahead scheduling (HLS). The HLS algorithm takes as input (i) a query task throughput matrix; (ii) the system-wide queue of query tasks; (iii) the processor type (CPU or GPGPU) on which a worker thread intends to execute a task; (iv) a function that counts how often query tasks have been executed on each processor; and (v) a switch threshold to ensure that tasks are not executed only on one processor. It iterates over the tasks in the queue and returns the task to be executed by the worker thread on the given processor.

At a high level, the decision as to whether a task v is selected by the HLS algorithm depends on the identified preferred processor and the outstanding work that needs to be done by the preferred processor for earlier tasks in the queue. Such tasks may have the same preferred processor and will be executed before task v, thereby delaying it. If this delay is large, it may be better to execute task v on a slower processor if task execution would complete before the preferred processor completes all of its other tasks.

To perform this lookahead and quantify the amount of outstanding work for the preferred processor, the HLS algorithm sums up the inverses of the throughput values (variable delay). If that delay is larger than the inverse of the throughput obtained for the current task with the slower processor, the slower processor leads to an earlier estimated completion time compared to the preferred processor.

Over time, the above approach may lead to the situation that tasks of a particular query are always executed on the same processor, rendering it impossible to observe the throughput of other processors. Therefore, the HLS algorithm includes a switch threshold, which imposes a limit on how often a given task can be executed on the same processor without change: a given task is not scheduled on the preferred processor if the number of tasks of the same query executed on the preferred processor exceeds this threshold.

The HLS algorithm is defined more formally in Alg. 1. It iterates over the tasks in the queue (lines 3–11). For each task, the query operator function of the task is selected (line 4). Then, the preferred processor for this query is determined by looking up the highest value in the row of the query task throughput matrix (line 5).

The current task is executed on the processor associated with the worker thread: (i) if the preferred processor matches the worker's processor, and the number of executions of tasks of this query on this processor is below the switch threshold; or (ii) if the worker's processor is not the preferred one but the accumulated delay of the preferred processor means that execution on the worker's processor is beneficial, or the number of tasks of this query on the preferred processor is higher than the switch threshold (line 6).

[Figure 5: Example of hybrid lookahead scheduling: a query task queue with tasks of queries q1–q3, worker threads z1 (CPU) and z2 (GPGPU) taking tasks from the queue head over time, and the query task throughput matrix C for q1–q3.]

If a non-preferred processor is selected for the current task, the execution counter of the preferred processor is reset (line 7). The execution counter for the selected processor is incremented (line 8), and the current task is returned for execution (line 9). If the current task is not selected, the accumulated delay is updated (line 10), because the current task is planned to be executed on the other processor. Finally, the next query task in the queue is considered (line 11).

Example. Fig. 5 gives an example of the execution of the HLS algorithm. The query task queue contains tasks for three queries, q1–q3. Assuming that a worker thread z1 for processor CPU becomes available, HLS proceeds as follows: the head of the queue, v1, is a task of query q2 that is preferably executed on GPGPU. Hence, it is not selected and the planned delay for GPGPU is set to 1/15. The second and third tasks are handled similarly, resulting in an accumulated delay for GPGPU of 1/15 + 1/15 + 1/30 = 1/6. The fourth task v4 is also preferably executed on GPGPU. However, the estimation of the accumulated delay for GPGPU suggests that the execution of v4 on CPU would lead to an earlier completion time, hence it is scheduled there. Next, if a worker thread z2 that executes tasks on GPGPU becomes available, it takes the head of the queue because GPGPU is the preferred processor for tasks of query q2.
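The selection logic of Alg. 1 can also be summarised in code. The following is an illustrative sketch under our own simplifying assumptions (a dense throughput array indexed by query and processor, the queue represented as a list of query indices, and no synchronisation); class and method names are ours, and this is not SABER's implementation.

```java
import java.util.List;

// Illustrative HLS sketch: pick the task a worker should run on processor p,
// looking ahead past tasks whose preferred processor is busy with earlier work.
final class HlsScheduler {
    enum Proc { CPU, GPGPU }

    private final double[][] throughput;   // C[query][processor], tasks per time unit
    private final int[][] count;           // executions per query and processor
    private final int switchThreshold;     // st

    HlsScheduler(double[][] throughput, int switchThreshold) {
        this.throughput = throughput;
        this.count = new int[throughput.length][Proc.values().length];
        this.switchThreshold = switchThreshold;
    }

    // 'queue' holds the query index (iota(v)) of each pending task, head first;
    // the return value is the position of the selected task in the queue.
    int schedule(List<Integer> queue, Proc p) {
        double delay = 0.0;                 // outstanding work of preferred processors
        for (int pos = 0; pos < queue.size(); pos++) {
            int q = queue.get(pos);
            Proc pref = throughput[q][Proc.CPU.ordinal()] >= throughput[q][Proc.GPGPU.ordinal()]
                    ? Proc.CPU : Proc.GPGPU;
            boolean preferredAndUnderThreshold =
                    p == pref && count[q][p.ordinal()] < switchThreshold;
            boolean takeDespitePreference = p != pref
                    && (count[q][pref.ordinal()] >= switchThreshold
                        || delay >= 1.0 / throughput[q][p.ordinal()]);
            if (preferredAndUnderThreshold || takeDespitePreference) {
                if (count[q][pref.ordinal()] >= switchThreshold) count[q][pref.ordinal()] = 0;
                count[q][p.ordinal()]++;
                return pos;                 // execute this task on processor p
            }
            delay += 1.0 / throughput[q][pref.ordinal()];
        }
        return queue.size() - 1;            // fall back to the last task (cf. line 12)
    }
}
```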

4.3 Handling of task results
The result stage collects the results from all tasks that belong to a specific query and creates the output data stream. To maintain low processing latency, results of query tasks should be processed as soon as they are available, continuously creating the output data stream from the discretised input data streams. This is challenging for two reasons: first, the results of query tasks may arrive out-of-order due to their parallel execution, which requires reordering to ensure correct results; second, the window fragment results of a window may span multiple query tasks, which creates dependencies when executing the assembly operator function.

To address these challenges, the result stage reduces the synchronisation needed among the worker threads. To this end, processing of query task results is organised in three phases, each synchronised separately: (i) storing the task results, i.e. the window fragment results, in a buffer; (ii) executing the assembly operator function over window fragment results to construct the window results; and (iii) constructing the output data stream from the window results.

Result storage. When worker threads enter the result stage, they store the window fragment results in a circular result buffer, which is backed by a byte array. The order in which results are stored is determined by the query task identifiers assigned in the dispatch stage: a worker thread accesses a particular slot in the result buffer, which is the query task identifier modulo the buffer size.

To ensure that worker threads do not overwrite results of earlier query tasks stored in the result buffer, a control buffer records status flags for all slots in the result buffer. It is accessed by worker threads using an atomic compare-and-swap instruction. To avoid blocking a worker thread if a given slot is already populated, the result buffer has more slots than the number of worker threads. This guarantees that the results in a given slot will have been processed before this slot is accessed again by another worker thread.

Assembly of window results. After a worker thread has stored the results of a query task, it may proceed by assembling window results from window fragment results. In general, the results of multiple tasks may be needed to assemble the result of a single window, and assembly proceeds in a pairwise fashion. Given the results of two consecutive tasks, the windows are processed in their logical order, as defined by the query task identifier. The window result is calculated by applying the assembly operator function to all results of relevant fragments, which are contained in the two query task results. The obtained window results directly replace the window fragment results in the result buffer.

If additional fragment results are needed to complete the assembly for a window, i.e. the window spans more than two query task results, multiple assembly steps are performed. The assembly operator function is then applied to the result of the first assembly step and the fragment results that are part of the next query task result.

If a window spans more than two query tasks, and the assembly of window results is not just a concatenation of fragment results, the above procedure is not commutative—a window fragment result may be needed to construct the results of several windows, which are part of different evaluations of the assembly operator function. In that case, the pairwise assembly steps must be executed in the order defined by the query task identifiers. Such ordering dependencies can be identified by considering for each query task whether it contained windows that opened or closed in one of the stream batches. A query task result is ready for assembly either if it does not contain windows that are open or closed in one of the stream batches, or if the preceding query task result has already been assembled.

The information about task results that are ready for assembly and task results that have been assembled are kept as status flags in the control buffer. After a worker thread stores the result of a query task in the result buffer, it checks for slots in the result buffer that are ready for assembly. If there is no such slot, the worker thread does not block, but continues with the creation of the output stream.

Output stream construction. After window result assembly, the output data stream is constructed. A worker thread that finished assembly of window results (or skipped over it) checks if the slot of the result buffer that contains the next window result to be appended to the output stream is ready for result construction. If so, the worker thread locks the respective slot and appends the window results to the output stream. If there is no such slot, the worker thread leaves the result stage and returns to the scheduling stage.
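The per-slot status flags that coordinate these three phases could be kept roughly as sketched below; the status values and method names are our assumptions, not SABER's code, and the sketch omits the byte-array result buffer itself.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical sketch of the control buffer used by the result stage:
// one status flag per result-buffer slot, updated with compare-and-swap.
final class ControlBuffer {
    static final int FREE = 0, STORED = 1, ASSEMBLED = 2;

    private final AtomicIntegerArray status;
    private final int slots;

    ControlBuffer(int slots) {
        this.slots = slots;                       // more slots than worker threads
        this.status = new AtomicIntegerArray(slots);
    }

    // A worker stores the window fragment results of task 'taskId' in slot
    // taskId % slots, but only if the slot's previous contents were consumed.
    boolean tryStore(long taskId) {
        int slot = (int) (taskId % slots);
        return status.compareAndSet(slot, FREE, STORED);
    }

    // Mark a slot's fragment results as assembled into window results.
    boolean markAssembled(long taskId) {
        int slot = (int) (taskId % slots);
        return status.compareAndSet(slot, STORED, ASSEMBLED);
    }

    // Release a slot after its window results were appended to the output stream.
    void release(long taskId) {
        status.set((int) (taskId % slots), FREE);
    }
}
```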

5. QUERY EXECUTION
We now describe SABER's query execution stage in more detail. A challenge is that the performance of streaming operators may be restricted by their memory accesses. SABER thus explicitly manages memory to avoid unnecessary data deserialisation and dynamic object creation (§5.1). Whether the GPGPU can be utilised fully depends on how efficiently data is transferred to and from the GPGPU. SABER employs a novel approach to pipelined stream data movement that interleaves the execution of query tasks with data transfer operations (§5.2). Finally, we describe our implementation of streaming operators on the CPU (§5.3) and GPGPU (§5.4).

5.1 Memory management

[Figure 6: Data movement and pipelining for GPGPU-based execution of query tasks. Data flows from the Java heap through pinned input memory to GPGPU local memory and back, via the copyin, movein, execute, moveout and copyout operations, which are pipelined across query tasks over four in-flight buffers.]

Dynamic memory allocation is one of the main causes of performance issues in stream processing engines implemented in managed languages with garbage collection, such as Java or C#. The SABER implementation therefore minimises the number of memory allocations by means of lazy deserialisation and object pooling.

Lazy deserialisation. To manage the data in input streams efficiently, SABER uses lazy deserialisation: tuples are stored in their byte representation and deserialised only if and when needed. Deserialisation is controlled per attribute, i.e. tuple values that are not accessed or needed to compute windows are not deserialised. Deserialisation only generates primitive types, which can be efficiently packed in byte arrays without the overhead of pointer dereferencing of object types. In addition, whenever possible, operators in SABER realise direct byte forwarding, i.e. they copy the byte representation of the input data to buffers that store intermediate results.

Object pooling. To avoid dynamic memory allocation on the critical processing path, SABER uses statically allocated pools of objects, for all query tasks, and of byte arrays, for storing intermediate window fragment results. To avoid contention for the byte arrays when many worker threads access them, each thread maintains a separate pool.
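The following Java sketch illustrates the per-attribute lazy deserialisation and direct byte forwarding described above; the fixed 32-byte tuple layout, the class name and the accessor methods are illustrative assumptions rather than SABER's actual memory-management API.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Tuples stay in their byte representation; only the attribute that is
// actually accessed is decoded, and only into a primitive value.
final class LazyTupleViewSketch {
    private static final int TUPLE_SIZE = 32;   // assumed fixed-size tuple layout
    private final ByteBuffer batch;             // raw bytes of a stream batch

    LazyTupleViewSketch(byte[] serialised) {
        this.batch = ByteBuffer.wrap(serialised).order(ByteOrder.LITTLE_ENDIAN);
    }

    long timestamp(int tupleIndex) {                       // decode only the timestamp
        return batch.getLong(tupleIndex * TUPLE_SIZE);
    }

    float floatAttribute(int tupleIndex, int attrIndex) {  // decode a single attribute
        return batch.getFloat(tupleIndex * TUPLE_SIZE + 8 + attrIndex * 4);
    }

    // Direct byte forwarding: copy a tuple's bytes to an intermediate result
    // buffer without deserialising it at all.
    void forwardTuple(int tupleIndex, ByteBuffer intermediateResult) {
        for (int b = 0; b < TUPLE_SIZE; b++) {
            intermediateResult.put(batch.get(tupleIndex * TUPLE_SIZE + b));
        }
    }
}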

5.2 Pipelined stream data movement
Executing a query task on the GPGPU involves several data movement operations, as shown on the left side of Fig. 6: (i) the input data is copied from Java heap memory to pinned host memory (copyin); (ii) using DMA, data is transferred from pinned host memory to GPGPU memory for processing (movein); (iii) after the query task has executed (execute), the results are transferred back from the GPGPU memory to pinned host memory (moveout); and (iv) the results are copied back to Java heap memory (copyout).

Performing these data movement operations sequentially would under-utilise the GPGPU. Given that stream processing handles small amounts of data in practice, the PCIe throughput becomes the bottleneck—executing the copyin and copyout operations sequentially would reduce the throughput to half of the PCIe bandwidth.

SABER therefore uses a five-stage pipelining mechanism, which interleaves I/O and compute operations to reduce idle periods and yield higher throughput. Pipelining is achieved by having dedicated threads execute each of the five data movement operations in parallel: two CPU threads copy data from and to Java heap memory (copyin and copyout), and two GPGPU threads implement the data movement from and to GPGPU memory (movein and moveout). The remaining GPGPU threads execute query tasks (execute).

The right side of Fig. 6 shows the interleaving of data movement operations. Each row shows the five operations for a single query task. The interleaving of operations can be seen, for example, at time t1, when moveout of task 1, execute of task 2, movein of task 3, and the copy operations of task 4 all execute concurrently.

The interleaving of operations respects data and thread dependencies. For a specific query task, all operations are executed sequentially because each operation relies on the data of the previous one. The CPU threads executing the copy operations, however, run in parallel so that the result data of the i-th task is copied by the copyout operation of the (i+4)-th task: as shown in Fig. 6, task 5's copyout operation returns the results of task 1. If task 1's moveout operation has not completed, task 5 is blocked until it receives a notification from the dedicated GPGPU thread performing moveout.

The execution of each data movement operation by a thread also results in the sequential execution of the same operation for different tasks: in the example, task 2's moveout operation must complete before the respective thread executes task 3's moveout operation.
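The following Java sketch mimics this behaviour with one dedicated thread per operation and bounded hand-off queues standing in for the four in-flight buffers; the Stage interface and the queue-based hand-off are simplifications of SABER's actual implementation, which drives GPGPU command queues and DMA transfers rather than Java queues.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Simplified pipeline: stage s consumes task identifiers from queue s and
// hands them to queue s+1, so the same operation of different tasks runs
// sequentially on its dedicated thread while different operations overlap.
final class PipelineSketch {
    interface Stage { void apply(int taskId) throws InterruptedException; }

    static void run(int numTasks, Stage... stages) throws InterruptedException {
        @SuppressWarnings("unchecked")
        BlockingQueue<Integer>[] handoff = new BlockingQueue[stages.length + 1];
        for (int i = 0; i < handoff.length; i++) {
            handoff[i] = new ArrayBlockingQueue<>(4);   // bounded in-flight buffers
        }
        for (int s = 0; s < stages.length; s++) {
            final int idx = s;
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        int task = handoff[idx].take();   // wait for the previous stage
                        stages[idx].apply(task);          // copyin/movein/execute/moveout/copyout
                        handoff[idx + 1].put(task);       // release the task to the next stage
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "stage-" + idx);
            t.setDaemon(true);
            t.start();
        }
        for (int task = 1; task <= numTasks; task++) {
            handoff[0].put(task);                         // feed query tasks into the pipeline
        }
    }
}

Invoked with five stage bodies for copyin, movein, execute, moveout and copyout, the sketch reproduces the two ordering properties above: operations of one task run in sequence, while each dedicated thread processes the same operation of different tasks in task order.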

5.3 CPU operator implementations
To execute a query task on a CPU core, a worker thread applies the window batch function of a given query operator to the window fragments of the stream batches for this task. SABER implements the operator functions as follows.

Projection and selection operators are both stateless, and their batch operator function is thus a single scan over the stream batch of the corresponding query task, applying the projection or selection to each tuple. The assembly operator function then concatenates the fragment results needed to obtain a window result according to the stream creation function (see §2.4).

Aggregation operators, such as sum, count and average, over sliding windows exploit incremental computation [12, 50]. In a first step, the batch operator function computes window fragments by partitioning the stream batch of the query task into panes, i.e. distinct subsequences of tuples. To compute the aggregation value per window fragment, the batch operator function reuses the aggregated value—the result—of previous window fragments and incorporates tuples of the current fragment.
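As a simplified illustration of this incremental scheme, the following Java sketch computes a running sum per window fragment over the panes of a stream batch, carrying the result of earlier fragments forward; the array-based pane representation and the restriction to a sum aggregate are assumptions made for brevity.

// Each pane is given as an array of attribute values; fragment p's result
// reuses the aggregate of fragments 0..p-1 and adds the tuples of pane p.
final class IncrementalSumSketch {
    static double[] fragmentSums(double[][] panes) {
        double[] fragmentResult = new double[panes.length];
        double carry = 0.0;                           // aggregate of previous fragments
        for (int p = 0; p < panes.length; p++) {
            double paneSum = 0.0;
            for (double v : panes[p]) {
                paneSum += v;                         // incorporate tuples of the current pane
            }
            carry += paneSum;                         // incremental reuse of earlier work
            fragmentResult[p] = carry;
        }
        return fragmentResult;
    }
}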

For an aggregation operator, a window result cannot be constructed by simply concatenating window fragment results, but the operator function must be evaluated for a set of fragment results. The assembly of window results from window fragment results is done stepwise, processing the window fragment results of two query tasks at a time. As described in §4.3, SABER keeps track for each fragment whether it is part of a window that opened, closed, is pending (i.e. neither opened nor closed), or is fully contained in the stream batch of the query task. This is achieved by storing window fragment results in four different buffers (byte arrays) that contain only fragments that belong to windows in specific states.

SABER also supports the definition of HAVING and GROUP-BY clauses. The implementation of the HAVING clause reuses the selection operator. For the GROUP-BY clause, the batch operator function maintains a hash table with an aggregation value per group. To avoid dynamic memory allocation, SABER uses a statically allocated pool of hash table objects, which are backed by byte arrays.

Join operators implement a streaming θ-join, as defined by Kang et al. [35]. The execution of query tasks for the join is sequential because, unlike other task-parallel join implementations [27], SABER achieves parallelism by the data-parallel execution of query tasks.
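A minimal sketch of such a pooled, array-backed GROUP-BY hash table is shown below; the table size, the hash function, the use of 0 as the empty-slot marker and the pool discipline are illustrative assumptions, and the sketch omits the per-thread handling of pools used in the engine.

import java.util.ArrayDeque;
import java.util.Arrays;

// Open-addressing hash table with linear probing, backed by plain arrays so
// that no objects are allocated per tuple or per group on the critical path.
final class PooledGroupBySketch {
    static final int SLOTS = 1024;                    // power of two, assumed

    static final class Table {
        final int[] keys = new int[SLOTS];            // 0 marks an empty slot (keys assumed non-zero)
        final float[] sums = new float[SLOTS];
        void clear() { Arrays.fill(keys, 0); Arrays.fill(sums, 0f); }
        void add(int key, float value) {
            int slot = ((key * 0x9E3779B1) >>> 22) & (SLOTS - 1);   // cheap hash
            while (keys[slot] != 0 && keys[slot] != key) {
                slot = (slot + 1) & (SLOTS - 1);                    // linear probing
            }
            keys[slot] = key;
            sums[slot] += value;                                    // aggregate per group
        }
    }

    // Statically allocated pool: tables are reused across query tasks instead
    // of being created dynamically on the critical path.
    private static final ArrayDeque<Table> POOL = new ArrayDeque<>();
    static { for (int i = 0; i < 16; i++) POOL.add(new Table()); }

    static Table acquire() {
        Table t = POOL.poll();
        if (t == null) t = new Table();               // fall back if the pool is exhausted
        t.clear();
        return t;
    }

    static void release(Table t) { POOL.offer(t); }
}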

5.4 GPGPU operator implementations
GPGPU operators are implemented in OpenCL [40]. Each operator has a generic code template that contains its core functionality. SABER populates these templates with tuple data structures, based on the input and output stream schemas of each operator, and query-specific functions over tuple attributes (e.g. selection predicates or aggregate functions). Tuple attributes are represented both as primitive types and vectors, which allows the implementation to use vectorised GPGPU instructions whenever possible.
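A minimal sketch of this template population is given below; the placeholder names %TUPLE_STRUCT% and %PREDICATE% and the method are illustrative, not the actual markers used in SABER's OpenCL templates.

// The generic kernel source contains placeholders that are replaced with the
// query-specific tuple struct and predicate before the OpenCL program is
// compiled for the query.
final class KernelTemplateSketch {
    static String populateSelection(String templateSource,
                                    String tupleStruct,
                                    String predicate) {
        return templateSource
                .replace("%TUPLE_STRUCT%", tupleStruct)   // derived from the stream schema
                .replace("%PREDICATE%", predicate);       // e.g. a comparison over an attribute
    }
}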


Each GPGPU thread evaluates the fragment operator function. To exploit the cache memory shared by the synchronised threads on a single GPGPU core, tuples that are part of the same window are assigned to the same work group. The assembly operator function that derives the window results from window fragment results (see §4.3) is evaluated by one of the CPU worker threads.

Projection and selection. To perform a projection, the threads in a work group load the tuples into the group's cache memory, compute the projected values, and write the results to the global GPGPU memory. Moving the data to the cache memory is beneficial because the deserialisation and serialisation of tuples becomes faster. For the selection operator, the result from a fragment operator function is stored as a binary vector in the group's cache memory. Here, each vector entry indicates whether a tuple has been selected.

In a second step, SABER uses a prefix-sum (or scan) operation [14] to write the output data to a continuous global GPGPU memory region: it scans the binary vector and obtains continuous memory addresses to which selected tuples are written. These addresses are computed per window fragment and offset based on the start address of each fragment in order to compress the results for all fragments in a stream batch.

Aggregation first assigns a work group to each of the window fragments in the stream batch of the query task. For the implemented commutative and associative operator functions, each thread of the work group repeatedly reads two tuples from the global GPGPU memory and aggregates them. The threads thus form a reduction tree from which the window fragment result is derived. As in CPU-based execution (§5.3), the results are stored while keeping track for each fragment whether it is part of a window that opens, closes, is pending, or is fully contained in the stream batch.
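The following Java sketch shows the scan-based compaction sequentially for clarity; on the GPGPU the exclusive prefix sum is computed in parallel by the work group, and the per-fragment offsets are added on top, which the sketch omits. Types and names are illustrative.

// Compacts the tuples selected by a predicate into a contiguous output,
// using an exclusive prefix sum over the binary selection vector.
final class ScanCompactionSketch {

    // selected[i] == 1 iff tuple i passed the predicate; the returned array
    // holds the write position of tuple i in the compacted output.
    static int[] writePositions(int[] selected) {
        int[] pos = new int[selected.length];
        int running = 0;
        for (int i = 0; i < selected.length; i++) {
            pos[i] = running;              // exclusive prefix sum
            running += selected[i];
        }
        return pos;
    }

    static int[] compact(int[] tuples, int[] selected) {
        int[] pos = writePositions(selected);
        int last = selected.length - 1;
        int count = selected.length == 0 ? 0 : pos[last] + selected[last];
        int[] out = new int[count];
        for (int i = 0; i < tuples.length; i++) {
            if (selected[i] == 1) {
                out[pos[i]] = tuples[i];   // write to the scanned address
            }
        }
        return out;
    }
}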

To implement aggregations with HAVING clauses, SABER relies on the aforementioned approach for selection, which is applied to the aggregation results. For a GROUP-BY clause, every work group populates an open-addressing hash table. The table uses a linear probing approach combined with a memory-efficient representation. The size of the table and the hash function used on the CPU and GPGPU are the same to ensure that, given a tuple in the CPU table, it can be searched in the GPGPU table, and vice versa.

An additional challenge, compared to the CPU implementation, is to ensure that the hash table for a window fragment is thread-safe. We reserve an integer in the intermediate tuple representation to store the index of the first input tuple that has occupied a table slot. Threads try to atomically compare-and-set this index. When a collision occurs, a thread checks if the key stored is the same as the current key, in which case it atomically increments the aggregate value, or it moves to the next available slot.
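The Java sketch below mirrors this insertion logic on the CPU for a count aggregate, using java.util.concurrent atomics in place of the OpenCL atomics; the table size, the keyOf function and the count-only aggregate are assumptions for illustration.

import java.util.concurrent.atomic.AtomicIntegerArray;
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.function.IntUnaryOperator;

// A slot stores the index of the first tuple that occupied it; claiming a
// slot is an atomic compare-and-set on that index, and the group key of a
// slot is derived from the claiming tuple. Assumes fewer distinct keys than
// SLOTS, otherwise probing would not terminate.
final class AtomicGroupCountSketch {
    static final int SLOTS = 1024;
    static final int EMPTY = -1;

    final AtomicIntegerArray occupyingTuple = new AtomicIntegerArray(SLOTS);
    final AtomicLongArray counts = new AtomicLongArray(SLOTS);
    final IntUnaryOperator keyOf;                         // maps a tuple index to its group key

    AtomicGroupCountSketch(IntUnaryOperator keyOf) {
        this.keyOf = keyOf;
        for (int i = 0; i < SLOTS; i++) occupyingTuple.set(i, EMPTY);
    }

    void add(int tupleIndex) {
        int key = keyOf.applyAsInt(tupleIndex);
        int slot = (key & 0x7fffffff) % SLOTS;
        while (true) {
            if (occupyingTuple.compareAndSet(slot, EMPTY, tupleIndex)) {
                counts.incrementAndGet(slot);             // claimed an empty slot
                return;
            }
            if (keyOf.applyAsInt(occupyingTuple.get(slot)) == key) {
                counts.incrementAndGet(slot);             // collision with the same key
                return;
            }
            slot = (slot + 1) % SLOTS;                    // probe the next available slot
        }
    }
}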

If the window fragment is incomplete, the hash table is output as is—this is necessary because the result aggregation logic is the same for both CPU and GPGPU. Otherwise, all sparsely populated hash tables are compacted at the end of processing.

Join operators exploit task-parallelism, similar to the cell join [27]. To avoid quadratic memory usage, however, SABER adopts a technique used in join processing for in-memory column stores [32] and performs a join in two steps: (i) the number of tuples that match the join predicate is counted and (ii) the results are compressed in the global GPGPU memory, as for selection.
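The two-step evaluation can be sketched as follows, again sequentially for clarity; counting first makes the size of the contiguous output region known before any result is written, so no quadratic intermediate buffer is needed. The plain-int tuple representation and the predicate interface are illustrative.

// Step 1 counts the matches per left tuple; an exclusive prefix sum over the
// counts yields each left tuple's write offset; step 2 writes the matching
// pairs into a contiguous output region.
final class TwoStepJoinSketch {
    interface ThetaPredicate { boolean test(int left, int right); }

    static int[][] join(int[] left, int[] right, ThetaPredicate theta) {
        int[] counts = new int[left.length];
        for (int i = 0; i < left.length; i++) {
            for (int j = 0; j < right.length; j++) {
                if (theta.test(left[i], right[j])) counts[i]++;   // step 1: count matches
            }
        }
        int[] offsets = new int[left.length];
        int total = 0;
        for (int i = 0; i < left.length; i++) {
            offsets[i] = total;                                   // exclusive prefix sum
            total += counts[i];
        }
        int[][] out = new int[total][2];
        for (int i = 0; i < left.length; i++) {
            int pos = offsets[i];
            for (int j = 0; j < right.length; j++) {
                if (theta.test(left[i], right[j])) {
                    out[pos][0] = left[i];                        // step 2: write compactly
                    out[pos][1] = right[j];
                    pos++;
                }
            }
        }
        return out;
    }
}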

6. EVALUATION
We evaluate SABER to show the benefits of its hybrid stream processing model. First, we demonstrate that such a model achieves higher performance for both real-world and synthetic query benchmarks (§6.2). We then explore the trade-off between CPU and GPGPU execution for different queries (§6.3). After that, we investigate the impact of the query task size on performance (§6.4) and show that SABER's CPU operator implementations scale (§6.5). Finally, we evaluate the effectiveness of HLS scheduling (§6.6).

Datasets                  # Attr.  Queries    Windows          Operators        Values
Synthetic (Syn)           7        PROJm      various          π a1,...,am      1 ≤ m ≤ 10
                                   SELECTn    various          σ p1,...,pn      1 ≤ n ≤ 64
                                   AGGf       various          α f              f ∈ {avg, sum}
                                   GROUP-BYo  various          γ g1,...,go      1 ≤ o ≤ 64
                                   JOINr      various          ⋈ p1,...,pr      1 ≤ r ≤ 64
Cluster Monitoring (CM)   12       CM1        ω60,1            π, γ, αsum
                                   CM2        ω60,1            π, σ, γ, αavg
Smart Grid (SG)           7        SG1        ω3600,1          π, αavg
                                   SG2        ω3600,1          π, γ, αavg
                                   SG3        ω1,1, ω1,1       π, σ, ⋈
Linear Road Benchmark     7        LRB1       ω∞               π
(LRB)                              LRB2       ω30,1, ωpart     πdistinct, ⋈
                                   LRB3       ω300,1           π, σ, γ, αavg
                                   LRB4       ω30,1            π, γ, αcount

Table 1: Summary of evaluation datasets and workloads

6.1 Experimental set-up and workloads
All experiments are performed on a server with 2 Intel Xeon E5-2640 v3 2.60 GHz CPUs with a total of 16 physical CPU cores, a 20 MB LLC cache and 64 GB of memory. The server also hosts an NVIDIA Quadro K5200 GPGPU with 2,304 cores and 8 GB of GDDR5 memory, connected via a PCIe 3.0 (×16) bus. When streaming data to SABER from the network, we use a 10 Gbps Ethernet NIC. The server runs Ubuntu 14.04 with Linux kernel 3.16 and NVIDIA driver 346.47. SABER uses the Oracle JVM 7 with the mark-and-sweep garbage collector.

Table 1 summarises the query workloads and datasets in our evaluation. We use both synthetic queries, for exploring query parameters, and application queries from real-world use cases. The CQL representations of the application queries are listed in Appendix A.

Synthetically-generated workload (Syn). We generate synthetic data streams of 32-byte tuples, each consisting of a 64-bit timestamp and six 32-bit attribute values drawn from a uniform distribution—the default being integer values, but for aggregation and projection the first value being a float. We then generate a set of queries using the operators from §2.4, and vary key parameters: the number of attributes and arithmetic expressions in a projection; the number of predicates in a selection or a join; the function in an aggregation; and the number of groups in an aggregation with GROUP-BY.

Compute cluster monitoring (CM). Our second workload emulates a cluster management scenario. We use a trace of time-stamped tuples collected from an 11,000-machine compute cluster at Google [53]. Each tuple is a monitoring event related to the tasks of compute jobs that execute on the cluster, such as the successful completion of a task, the failure of a task, or the submission of a high-priority task for a production job.

We use two queries, CM1 and CM2, that express common cluster monitoring tasks [23, 38]: CM1 combines a projection and an aggregation with a GROUP-BY clause to compute the sum of the requested share of CPU utilisation per job category; and CM2 combines a projection, a selection, and an aggregation with a GROUP-BY to report the average requested CPU utilisation of submitted tasks.

Anomaly detection in smart grids (SG). This workload focuses on anomaly detection in energy consumption data from a smart electricity grid. The trace is a stream of smart meter readings from different electrical devices in households [34].

We execute three queries, SG1–3, that analyse the stream to detect outliers. SG1 is a projection and aggregation that computes the sliding global average of the meter load; SG2 combines a projection and aggregation with a GROUP-BY to derive the sliding load average per plug in a household in a house; and SG3 uses a join to compare the global load average with the plug load average by joining the results of SG1 and SG2, and then counts the houses for which the local average is higher than the global one.


[Figure 7: Performance for application benchmark queries. Throughput (10^6 tuples/s) of Esper and SABER, with SABER's GPGPU contribution shown, for CM1, CM2, SG1–SG3 and LRB1–LRB4.]

[Figure 8: Performance for synthetic queries. Throughput (GB/s) of CPU-only, GPGPU-only and hybrid execution for PROJ4, SELECT16, AGG* and GROUP-BY8, and separately for JOIN1. AGG* evaluates all aggregate functions and GROUP-BY evaluates cnt and sum.]

Linear Road Benchmark (LRB). This workload is the Linear Road Benchmark [8] for evaluating stream processing performance. The benchmark models a network of toll roads, in which incurred tolls depend on the level of congestion and the time-of-day. Tuples in the input data stream denote position events of vehicles on a highway lane, driving with a specific speed in a particular direction.

6.2 Is hybrid stream processing effective?
We first explore the performance of SABER for our application benchmark queries. To put the achieved processing throughput into perspective, we also compare to an implementation of the queries in Esper [2], an open-source multi-threaded stream processing engine. The input streams for the queries are generated by a separate machine connected through a 10 Gbps network link.

Fig. 7 shows the processing throughput for SABER and Esper for the application queries. For SABER, we also report the split of the contribution of the CPU (upper black part) and the GPGPU (lower grey part) to the achieved throughput. SABER manages to saturate the 10 Gbps network link for many queries, including CM1, CM2 and LRB1. In contrast, the performance of Esper remains two orders of magnitude lower due to the synchronisation overhead of its implementation and the lack of GPGPU acceleration.

Different queries exhibit different acceleration potential on the CPU versus the GPGPU, which SABER automatically selects according to its hybrid execution model. For CM1, the aggregation executes efficiently on the CPU, whereas the selection in CM2 benefits from the GPGPU's high throughput, leaving the aggregation over sliding windows to the CPU. SG1 and LRB1 do not benefit from the GPGPU because the CPU workers can keep up with the task generation rate. SG2 and LRB3 utilise both heterogeneous processors equally because they can split the load.

[Figure 9: Performance compared to Spark Streaming. Throughput (10^6 tuples/s) for CM1, CM2 and SG1.]

[Figure 10: Performance impact of query parameters. Throughput (GB/s) versus the number of predicates for (a) SELECTn with ω32KB,32KB and (b) JOINr with ω4KB,4KB, for CPU-only, GPGPU-only and hybrid execution.]

Next, we use our synthetic operator queries to compare the performance achieved by SABER to CPU-only and GPGPU-only execution, respectively. Fig. 8 shows that the hybrid approach always achieves higher throughput than only using a single type of heterogeneous processor. The contention in the dispatching and result stages in SABER means that the combined throughput is less than the sum of the throughput achieved by the CPU and the GPGPU only.

Fig. 9 shows the performance of SABER and Spark Streaming with the CM1, CM2 and SG1 queries. Since Spark does not support count-based windows, we change the queries to use 500 ms tumbling windows based on system time for comparable results. For CM1 and CM2, SABER's throughput is limited by the network bandwidth, while, for SG1, SABER saturates 90% of the network bandwidth, showing a throughput that is 6× higher than Spark Streaming, which is limited due to scheduling overhead.

We put SABER's performance into perspective by comparing it against MonetDB [33], an efficient in-memory columnar database engine. We run a θ-join query over two 1 MB tables with 32-byte tuples. The tables are populated with synthetic data such that the selectivity of the query is 1%. In SABER, we emulate the join by treating the tables as streams and applying a tumbling window of 1 MB. In MonetDB, we partition the two tables and join the partitions pairwise so that the engine can execute each of the partial θ-joins in parallel. The output is the union of all partial join results.

When the query output has only two columns—those on which we perform the θ-join—MonetDB and SABER exhibit similar performance: with 15 threads, MonetDB runs the query in 980 ms and SABER in 1,088 ms (we report the average of 5 runs; in all SABER runs, the GPGPU consumes approximately 1/3 of the dispatched tasks). When the query output includes all columns (select *), MonetDB is 2× slower than SABER because, as a columnar engine, it spends 40% of the time reconstructing the output table after the join evaluation. However, when we change the query to an equi-join with the same selectivity (rather than a θ-join), MonetDB is 2.7× faster because it is highly optimised for such queries.

6.3 What is the CPU/GPGPU trade-off?
As we increase the complexity of some query operators, the CPU processing throughput degrades. Keeping the number of worker threads fixed to 15, Fig. 10 shows this effect for SELECTn and JOINr when varying the number of predicates (the other operators yield similar results). For SELECTn, we further observe that, for few predicates, the throughput is bound by the rate at which the dispatcher can generate tasks.


[Figure 12: Performance impact of query task size φ for different query types. Throughput (GB/s) and latency (sec) versus task size (KB) for (a) SELECT10, (b) AGGavg GROUP-BY64 and (c) JOIN4, each with ω32KB,32KB.]

[Figure 13: Performance impact of query task size φ for different window sizes and slides. Throughput (GB/s) versus task size (KB) for SELECT1 with (a) ω32B,32B, (b) ω32KB,32B and (c) ω32KB,32KB, with the PCIe bus bandwidth indicated.]

[Figure 11: Performance impact of window slide. Throughput (GB/s) and latency (sec) versus window slide size (KB) for (a) SELECT10 with ω32KB,x and (b) AGGavg with ω32KB,x.]

We compare the CPU throughput with that obtained from using just the GPGPU. Fig. 10 shows that there are regions where the GPGPU is faster than the CPU, and vice versa. The GPGPU is bound by the data movement overhead, and any performance benefit diminishes when the query is not complex enough. Except for the case in which the dispatcher is the bottleneck because of a simple query structure (SELECTn, 1 ≤ n ≤ 4), hybrid execution is beneficial for both queries, achieving nearly additive throughput.

Fig. 11 shows the impact of the window slide on the CPU/GPGPU trade-off for two sliding window queries, SELECT10 and AGGavg, under a fixed task size of 1 MB. The window size is 32 KB, while the slide varies from 32 bytes (1 tuple) to 32 KB. As expected, the slide does not influence the throughput or latency on any processor for SELECT10 because the selection operator does not maintain any state. For AGGavg, the CPU uses incremental computation, which increases the throughput until it is bound by the dispatcher. On the GPGPU, an increase of the window slide reduces the number of windows processed, which increases throughput. At some point, however, the GPGPU throughput is limited by the PCIe bandwidth.

6.4 How does task size affect performance?
Next, we investigate the impact of the query task size on performance. Intuitively, larger tasks improve throughput but negatively affect latency. We first consider three queries, SELECT10, AGGavg GROUP-BY64, and JOIN4, with ω32KB,32KB. In each experiment, we measure the throughput and latency when varying query task sizes from 64 KB to 4 MB.

The results in Fig. 12 confirm our expectation. While the absolute throughput values vary across the different queries, they exhibit a similar trend: initially, the throughput grows linearly with the query task size and then plateaus at around 1 MB. The only exception is the GPGPU-only implementation of the JOIN query, for which throughput collapses with query task sizes beyond 512 KB. This is due to a limitation of our current implementation: the computation of the window boundaries is always executed on the CPU.

To validate our claim that the batch size is independent of the window size and slide, we use a single type of query, SELECT1, and vary the window size and slide: from a window size of just 1 tuple (ω32B,32B), to a slide of just 1 tuple (ω32KB,32B), and to a large tumbling window (ω32KB,32KB).

Fig. 13 shows that SABER achieves comparable performance in all three cases, with the throughput growing up to a task size of approximately 1 MB. This demonstrates that the batch size is independent of the executed query and only depends on the hardware.

6.5 Does SABER scale on the CPU?
We now turn our attention to the scalability of SABER's CPU operator implementations. We measure the throughput achieved by each operator in isolation when varying the number of worker threads. Fig. 14 shows the results for a PROJ6 query with ω32KB,32KB windows. We omit the results for the other operators because they exhibit similar trends. The results show that the CPU operator implementation scales linearly up to 16 worker threads. Beyond this number, the performance plateaus (or slightly deteriorates for some operators) due to the context-switching overhead when the number of threads exceeds the number of available physical CPU cores.

6.6 What is the effect of HLS scheduling?
We finish by investigating the benefit of hybrid lookahead scheduling (HLS). For this, we consider a workload W1 with two queries, Q1 = PROJ6* (i.e. PROJ6 with 100 arithmetic expressions for each attribute) with ω32KB,32KB, and Q2 = AGGcnt GROUP-BY1 with ω32KB,16KB, executed in sequence. These queries exhibit opposite performance: when run in isolation, Q1 has higher throughput on the GPGPU (1,475 MB/s vs. 292 MB/s), while Q2 has higher throughput on the CPU (2,362 MB/s vs. 372 MB/s).

We compare the performance achieved by HLS against two baselines: a "first-come, first-served" (FCFS) scheduling policy and a static scheduling policy (Static), in which Q1's tasks are scheduled on the GPGPU and Q2's tasks are scheduled on the CPU. Note that this policy is infeasible in practice with dynamic workloads.


[Figure 14: Scalability of CPU operator implementation. Throughput (GB/s) versus the number of worker threads.]

[Figure 15: Performance impact of HLS scheduling. Throughput (GB/s) of FCFS, Static and HLS for workloads W1 and W2.]

[Figure 16: Adaptation of HLS scheduling. Selectivity and throughput (GB/s) over time, with SABER's GPGPU contribution shown.]


Fig. 15 shows that, as expected, FCFS has the worst performance as tasks are not matched with the preferred processor. More interesting is the comparison between Static and HLS: although Static significantly improves over FCFS, HLS achieves higher throughput because it exploits all available resources.

Next, we consider a scenario in which Static would under-perform. Workload W2 consists of two queries, Q3 = PROJ1 with ω32KB,32KB and Q4 = AGGsum with ω32KB,32KB, executed in sequence. Statically assigning Q3 to the CPU and Q4 to the GPGPU yields full utilisation of the GPGPU, but CPU cores are under-utilised. Reversing the assignment has the opposite effect. In Fig. 15, we show the latter assignment due to its higher throughput. Splitting the workload with FCFS, both CPU and GPGPU are fully utilised with a CPU/GPGPU split of approximately 1:1.5 for both Q3 and Q4. HLS saturates both processors too, but achieves peak throughput by converging to a better CPU/GPGPU split of 1:2.5 for Q3 and 1:0.5 for Q4. The results for W1 and W2 thus show the ability of HLS to support heterogeneous processors without a priori knowledge of the workload.

Fig. 16 shows how HLS adapts to workload changes. We run a SELECT500 query over the cluster management trace, filtering task failure events. The selection predicate has the form p1 ∧ (p2 ∨ ··· ∨ p500) such that when a failure event is selected (p1) all other predicates are evaluated too, making query tasks with high selectivity expensive to run. The trace contains a period with a surge of task failure events, which we repeat. The resulting selectivity is shown in the upper part of Fig. 16. To enable HLS to react to frequent changes, we update the query task throughput matrix every 100 ms.

Fig. 16 shows how HLS adapts: when selectivity is low, the CPU is faster than the GPGPU and monopolises the task queue—the GPGPU contribution is limited to tasks permitted to run by HLS's switch threshold rule. When the selectivity increases, the GPGPU is the faster processor. As the query task throughput for both processors changes, HLS adapts by scheduling more tasks on the GPGPU to sustain a high throughput.

7. RELATED WORK
Centralised stream processing engines have existed for decades, yet engines such as STREAM [6], TelegraphCQ [20], and NiagaraCQ [21] focus on single-core execution. Recent systems such as Esper [2], Oracle CEP [3], and Microsoft StreamInsight [39] support multi-core architectures at the expense of weakening stream ordering guarantees in the presence of windows. Research prototypes such as S-Store [18] and Trill [19] have strong window semantics with SQL-like queries. However, S-Store does not perform parallel window computation. Trill parallelises window processing through a map/reduce model, but it does not support hybrid query execution.

Distributed stream processing systems such as Storm [52], Samza [1] and SEEP [17] execute streaming queries with data-parallelism on a cluster of nodes. They do not respect window semantics by default. MillWheel [4] provides strong window semantics, but it does not perform parallel computation on windows and instead assumes partitioned input streams. Spark Streaming [56] has a batched-stream model, and permits window definitions over this model, thus creating dependencies between window semantics, throughput and latency. Recent work on adapting the batch size for Spark Streaming [25] automatically tunes the batch size to stay within a given latency bound. Unlike Spark Streaming, SABER decouples window semantics from system performance.

Window computation. Pane-based approaches for window processing [41] divide overlapping windows into panes. Aggregation occurs at the pane level, and multiple panes are combined to provide the final window result. In contrast to our hybrid stream processing model, the goal is to partition windows so as to avoid redundant computation, not to achieve data parallelism. Balkesen and Tatbul [11] propose pane processing for distribution, which incurs a different set of challenges compared to SABER due to its shared-nothing model.

Recent work [12, 50] has focused on the problem of avoiding redundant computation in sliding window aggregation. This involves keeping efficient aggregation data structures, which can be maintained with minimum overhead [12], even for general aggregation functions that are neither invertible nor commutative [50]. Such approaches are compatible with the batch operator functions that SABER uses to process multiple overlapping window fragments.

Accelerated query processing. In-memory databases have explored co-processing with CPUs and GPGPUs for accelerating database queries for both in-cache and discrete systems. All such systems [15, 16, 22, 30, 48, 55], however, target one-off and not streaming queries, and they therefore do not require parallel window semantics or efficient fine-grained data movement to the accelerator.

GStream [57] executes streaming applications on GPGPU clusters, and provides a new API that abstracts away communication primitives. Unlike GStream, SABER supports SQL window semantics and deployments on single-node hybrid architectures.

8. CONCLUSIONS
We have presented SABER, a stream processing system that employs a new hybrid data-parallel processing model to run continuous queries on servers equipped with heterogeneous processors, namely a CPU and a GPGPU. SABER does not make explicit decisions about the processor on which a query should run but rather gives preference to the fastest one, while enabling other processors to contribute to the aggregate query throughput. We have shown that SABER achieves high processing throughput (over 6 GB/s) and sub-second latency for a wide range of streaming queries.

Acknowledgements. Research partially funded by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 318521, the German Research Foundation (DFG) under grant agreement number WE 4891/1-1, and a PhD CASE Award by EPSRC/BAE Systems.


9. REFERENCES

[1] Apache Samza. http://samza.apache.org/. Last access: 10/02/16.
[2] Esper. http://www.espertech.com/esper/. Last access: 10/02/16.
[3] Oracle® Stream Explorer. http://bit.ly/1L6tKz3. Last access: 10/02/16.
[4] T. Akidau, A. Balikov, K. Bekiroglu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle. MillWheel: Fault-tolerant Stream Processing at Internet Scale. Proc. VLDB Endow., 6(11):1033–1044, Aug. 2013.
[5] M. H. Ali, C. Gerea, B. S. Raman, B. Sezgin, T. Tarnavski, T. Verona, P. Wang, P. Zabback, A. Ananthanarayan, A. Kirilov, M. Lu, A. Raizman, R. Krishnan, R. Schindlauer, T. Grabs, S. Bjeletich, B. Chandramouli, J. Goldstein, S. Bhat, Y. Li, V. Di Nicola, X. Wang, D. Maier, S. Grell, O. Nano, and I. Santos. Microsoft CEP Server and Online Behavioral Targeting. Proc. VLDB Endow., 2(2):1558–1561, Aug. 2009.
[6] A. Arasu, B. Babcock, S. Babu, M. Datar, K. Ito, R. Motwani, I. Nishizawa, U. Srivastava, D. Thomas, R. Varma, and J. Widom. STREAM: The Stanford Stream Data Manager. IEEE Data Eng. Bull., 26(1):19–26, Mar. 2003.
[7] A. Arasu, S. Babu, and J. Widom. The CQL Continuous Query Language: Semantic Foundations and Query Execution. The VLDB Journal, 15(2):121–142, June 2006.
[8] A. Arasu, M. Cherniack, E. Galvez, D. Maier, A. S. Maskey, E. Ryvkina, M. Stonebraker, and R. Tibbetts. Linear Road: A Stream Data Management Benchmark. In Proceedings of the 30th International Conference on Very Large Data Bases, VLDB '04, pages 480–491. VLDB Endowment, 2004.
[9] A. Artikis, M. Weidlich, F. Schnitzler, I. Boutsis, T. Liebig, N. Piatkowski, C. Bockermann, K. Morik, V. Kalogeraki, J. Marecek, A. Gal, S. Mannor, D. Gunopulos, and D. Kinane. Heterogeneous Stream Processing and Crowdsourcing for Urban Traffic Management. In Proceedings of the 17th International Conference on Extending Database Technology, EDBT '14, pages 712–723. OpenProceedings.org, 2014.
[10] aws.amazon.com. Announcing Cluster GPU Instances for Amazon EC2. http://amzn.to/1RrDflL, Nov. 2010. Last access: 10/02/16.
[11] C. Balkesen and N. Tatbul. Scalable Data Partitioning Techniques for Parallel Sliding Window Processing over Data Streams. In Proceedings of the 8th International Workshop on Data Management for Sensor Networks, DMSN '11, 2011.
[12] P. Bhatotia, U. A. Acar, F. P. Junqueira, and R. Rodrigues. Slider: Incremental Sliding Window Analytics. In Proceedings of the 15th International Middleware Conference, Middleware '14, pages 61–72. ACM, 2014.
[13] A. Biem, E. Bouillet, H. Feng, A. Ranganathan, A. Riabov, O. Verscheure, H. Koutsopoulos, and C. Moran. IBM Infosphere Streams for Scalable, Real-time, Intelligent Transportation Services. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 1093–1104. ACM, 2010.
[14] G. E. Blelloch. Vector Models for Data-parallel Computing. MIT Press, 1990.
[15] S. Breß. The Design and Implementation of CoGaDB: A Column-oriented GPU-accelerated DBMS. Datenbank-Spektrum, 14(3):199–209, 2014.
[16] S. Breß and G. Saake. Why It is Time for a HyPE: A Hybrid Query Processing Engine for Efficient GPU Coprocessing in DBMS. Proc. VLDB Endow., 6(12):1398–1403, Aug. 2013.
[17] R. Castro Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch. Integrating Scale out and Fault Tolerance in Stream Processing Using Operator State Management. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 725–736. ACM, 2013.
[18] U. Cetintemel, J. Du, T. Kraska, S. Madden, D. Maier, J. Meehan, A. Pavlo, M. Stonebraker, E. Sutherland, N. Tatbul, K. Tufte, H. Wang, and S. Zdonik. S-Store: A Streaming NewSQL System for Big Velocity Applications. Proc. VLDB Endow., 7(13):1633–1636, Aug. 2014.
[19] B. Chandramouli, J. Goldstein, M. Barnett, R. DeLine, D. Fisher, J. C. Platt, J. F. Terwilliger, and J. Wernsing. Trill: A High-performance Incremental Query Processor for Diverse Analytics. Proc. VLDB Endow., 8(4):401–412, Dec. 2014.
[20] S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. R. Madden, F. Reiss, and M. A. Shah. TelegraphCQ: Continuous Dataflow Processing. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD '03, pages 668–668. ACM, 2003.
[21] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A Scalable Continuous Query System for Internet Databases. SIGMOD Rec., 29(2):379–390, May 2000.
[22] L. Chen, X. Huo, and G. Agrawal. Accelerating MapReduce on a Coupled CPU-GPU Architecture. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, pages 25:1–25:11. IEEE Computer Society Press, 2012.
[23] X. Chen, C.-D. Lu, and K. Pattabiraman. Failure Analysis of Jobs in Compute Clouds: A Google Cluster Case Study. In Proceedings of the 25th International Symposium on Software Reliability Engineering, ISSRE '14, pages 167–177. IEEE Computer Society Press, 2014.
[24] S. Crago, K. Dunn, P. Eads, L. Hochstein, D.-I. Kang, M. Kang, D. Modium, K. Singh, J. Suh, and J. Walters. Heterogeneous Cloud Computing. In Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER '11, pages 378–385. IEEE Press, 2011.
[25] T. Das, Y. Zhong, I. Stoica, and S. Shenker. Adaptive Stream Processing Using Dynamic Batch Sizing. In Proceedings of the ACM Symposium on Cloud Computing, SOCC '14, pages 16:1–16:13. ACM, 2014.
[26] feedzai.com. Modern Payment Fraud Prevention at Big Data Scale. http://bit.ly/1KCskD5, 2013. Last access: 10/02/16.
[27] B. Gedik, R. R. Bordawekar, and P. S. Yu. CellJoin: A Parallel Stream Join Operator for the Cell Processor. The VLDB Journal, 18(2):501–519, Apr. 2009.
[28] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Database Management. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, SIGMOD '06, pages 325–336. ACM, 2006.
[29] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational Query Coprocessing on Graphics Processors. ACM Trans. Database Syst., 34(4):21:1–21:39, Dec. 2009.
[30] J. He, M. Lu, and B. He. Revisiting Co-processing for Hash Joins on the Coupled CPU-GPU Architecture. Proc. VLDB Endow., 6(10):889–900, Aug. 2013.
[31] J. He, S. Zhang, and B. He. In-cache Query Co-processing on Coupled CPU-GPU Architectures. Proc. VLDB Endow., 8(4):329–340, Dec. 2014.
[32] M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl. Hardware-oblivious Parallelism for In-memory Column-stores. Proc. VLDB Endow., 6(9):709–720, July 2013.
[33] S. Idreos, F. Groffen, N. Nes, S. Manegold, K. S. Mullender, and M. L. Kersten. MonetDB: Two Decades of Research in Column-oriented Database Architectures. IEEE Data Engineering Bulletin, 35(1):40–45, 2012.
[34] Z. Jerzak and H. Ziekow. The DEBS 2014 Grand Challenge. In Proceedings of the 8th ACM International Conference on Distributed Event-Based Systems, DEBS '14, pages 266–269. ACM, 2014.
[35] J. Kang, J. Naughton, and S. Viglas. Evaluating Window Joins over Unbounded Streams. In Proceedings of the 19th IEEE International Conference on Data Engineering, ICDE '03, pages 341–352. IEEE Press, 2003.
[36] T. Karnagel, D. Habich, B. Schlegel, and W. Lehner. The HELLS-join: A Heterogeneous Stream Join for Extremely Large Windows. In Proceedings of the 9th International Workshop on Data Management on New Hardware, DaMoN '13, pages 2:1–2:7. ACM, 2013.
[37] T. Karnagel, M. Hille, M. Ludwig, D. Habich, W. Lehner, M. Heimel, and V. Markl. Demonstrating Efficient Query Processing in Heterogeneous Environments. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 693–696. ACM, 2014.
[38] S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan. An Analysis of Traces from a Production MapReduce Cluster. In Proceedings of the 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, CCGrid '10, pages 94–103. IEEE Computer Society, 2010.
[39] S. J. Kazemitabar, U. Demiryurek, M. Ali, A. Akdogan, and C. Shahabi. Geospatial Stream Query Processing Using Microsoft SQL Server StreamInsight. Proc. VLDB Endow., 3(1-2):1537–1540, Sept. 2010.
[40] Khronos OpenCL Working Group. The OpenCL Specification, version 1.2, 2012.
[41] J. Li, D. Maier, K. Tufte, V. Papadimos, and P. A. Tucker. No Pane, No Gain: Efficient Evaluation of Sliding-window Aggregates over Data Streams. SIGMOD Rec., 34(1):39–44, Mar. 2005.
[42] W. Liu, B. Schmidt, G. Voss, and W. Muller-Wittig. Streaming Algorithms for Biological Sequence Alignment on GPUs. IEEE Trans. Parallel Distrib. Syst., 18(9):1270–1281, Sept. 2007.
[43] D. Lustig and M. Martonosi. Reducing GPU Offload Latency via Fine-grained CPU-GPU Synchronization. In Proceedings of the 19th IEEE International Symposium on High Performance Computer Architecture, HPCA '13, pages 354–365. IEEE Computer Society, 2013.
[44] M. L. Massie, B. N. Chun, and D. E. Culler. The Ganglia Distributed Monitoring System: Design, Implementation, and Experience. Parallel Computing, 30(7):817–840, July 2004.
[45] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed Stream Computing Platform. In Proceedings of the 2010 IEEE International Conference on Data Mining Workshops, ICDMW '10, pages 170–177. IEEE Computer Society, 2010.
[46] Novasparks™. NovaSparks Releases FPGA-based Feed Handler for OPRA. http://bit.ly/1kGbQ0D, Aug. 2014. Last access: 10/02/16.
[47] NVIDIA®. Quadro K5200 Data Sheet. http://bit.ly/1O3D0Yx. Last access: 10/02/16.
[48] H. Pirk, S. Manegold, and M. Kersten. Waste Not... Efficient Co-processing of Relational Data. In Proceedings of the 30th IEEE International Conference on Data Engineering, ICDE '14, pages 508–519. IEEE Press, 2014.
[49] Z. Shao. Real-time Analytics at Facebook. Presented at the Fifth Conference on Extremely Large Databases, XLDB5, Oct. 2011. http://stanford.io/1HqxPmw. Last access: 10/02/16.
[50] K. Tangwongsan, M. Hirzel, S. Schneider, and K.-L. Wu. General Incremental Sliding-window Aggregation. Proc. VLDB Endow., 8(7):702–713, Feb. 2015.
[51] J. Teubner and R. Mueller. How Soccer Players Would Do Stream Joins. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, pages 625–636. ACM, 2011.
[52] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and D. Ryaboy. Storm@Twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 147–156. ACM, 2014.
[53] J. Wilkes. More Google Cluster Data. Google Research Blog, http://bit.ly/1A38mfR, Nov. 2011. Last access: 10/02/16.
[54] wired.com. Google Erects Fake Brain With... Graphics Chips? http://wrd.cm/1CyGIYQ, May 2013. Last access: 10/02/16.
[55] Y. Yuan, R. Lee, and X. Zhang. The Yin and Yang of Processing Data Warehousing Queries on GPU Devices. Proc. VLDB Endow., 6(10):817–828, Aug. 2013.
[56] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized Streams: Fault-tolerant Streaming Computation at Scale. In Proceedings of the 24th ACM Symposium on Operating Systems Principles, SOSP '13, pages 423–438. ACM, 2013.
[57] Y. Zhang and F. Mueller. GStream: A General-Purpose Data Streaming Framework on GPU Clusters. In Proceedings of the International Conference on Parallel Processing, ICPP '11, pages 245–254. IEEE Press, 2011.

APPENDIX
A. BENCHMARK QUERIES

A.1 Cluster monitoring

-- Query 1
--
-- Input: TaskEvents
--   long timestamp
--   long jobId
--   long taskId
--   long machineId
--   int eventType
--   int userId
--   int category
--   int priority
--   float cpu
--   float ram
--   float disk
--   int constraints
-- Output: CPUusagePerCategory
--   long timestamp
--   int category
--   float totalCpu
--
select timestamp, category, sum(cpu) as totalCpu
from TaskEvents [range 60 slide 1]
group by category

-- Query 2
--
-- Input: TaskEvents
-- Output: CPUusagePerJob
--   long timestamp
--   long jobId
--   float avgCpu
--
select timestamp, jobId, avg(cpu) as avgCpu
from TaskEvents [range 60 slide 1]
where eventType == 1
group by jobId

A.2 Smart grid

-- Query 1
--
-- Input: SmartGridStr
--   long timestamp
--   float value
--   int property
--   int plug
--   int household
--   int house
--   [int padding]
-- Output: GlobalLoadStr
--   long timestamp
--   float globalAvgLoad
--
select timestamp, AVG(value) as globalAvgLoad
from SmartGridStr [range 3600 slide 1]

-- Query 2
--
-- Input: SmartGridStr
-- Output: LocalLoadStr
--   long timestamp
--   int plug
--   int household
--   int house
--   float localAvgLoad
--
select timestamp, plug, household, house,
       AVG(value) as localAvgLoad
from SmartGridStr [range 3600 slide 1]
group by plug, household, house

-- Query 3
--
-- Input: GlobalLoadStr, LocalLoadStr
-- Output: Outliers
--   long timestamp
--   int house
--   float count
--
(select L.timestamp, L.plug, L.household, L.house
 from LocalLoadStr [range 1 slide 1] as L,
      GlobalLoadStr [range 1 slide 1] as G
 where L.house == G.house and
       L.localAvgLoad > G.globalAvgLoad) as R
--
select timestamp, house, count(*)
from R
group by house

A.3 Linear Road Benchmark

-- Query 1
--
-- Input: PosSpeedStr
--   long timestamp
--   int vehicle
--   float speed
--   int highway
--   int lane
--   int direction
--   int position
-- Output: SegSpeedStr
--   long timestamp
--   int vehicle
--   float speed
--   int highway
--   int lane
--   int direction
--   int segment
--
select timestamp, vehicle, speed,
       highway, lane, direction,
       (position / 5280) as segment
from PosSpeedStr [range unbounded]

-- Query 2
--
-- Input: SegSpeedStr
-- Output: VehicleSegEntryStr
--   long timestamp
--   int vehicle
--   float speed
--   int highway
--   int lane
--   int direction
--   int segment
--
select distinct
       L.timestamp, L.vehicle, L.speed, L.highway, L.lane,
       L.direction, L.segment
from SegSpeedStr [range 30 slide 1] as A,
     SegSpeedStr [partition by vehicle rows 1] as L
where A.vehicle == L.vehicle

-- Query 3
--
-- Input: SegSpeedStr
-- Output: CongestedSegRel
--   long timestamp
--   int highway
--   int direction
--   int segment
--   float avgSpeed
--
select timestamp, highway, direction, segment,
       AVG(speed) as avgSpeed
from SegSpeedStr [range 300 slide 1]
group by highway, direction, segment
having avgSpeed < 40.0

-- Query 4
--
-- Input: SegSpeedStr
-- Output: SegVolRel
--   long timestamp
--   int highway
--   int direction
--   int segment
--   float numVehicles
--
(select timestamp, vehicle, highway, direction, segment,
        count(*)
 from SegSpeedStr [range 30 slide 1]
 group by highway, direction, segment, vehicle) as R
--
select timestamp, highway, direction, segment,
       count(vehicle) as numVehicles
from R
group by highway, direction, segment