
Enabling Task Parallelism in the CUDA Scheduler

Marisabel Guevara    Chris Gregg    Kim Hazelwood    Kevin Skadron
Department of Computer Science, University of Virginia
<marisabel,chg5w,hazelwood,skadron>@virginia.edu

Abstract

General purpose computing on graphics processing units (GPUs) introduces the challenge of scheduling independent tasks on devices designed for data parallel or SPMD applications. This paper proposes an issue queue that merges workloads that would underutilize GPU processing resources such that they can be run concurrently on an NVIDIA GPU. Using kernels from microbenchmarks and two applications we show that throughput is increased in all cases where the GPU would have been underused by a single kernel. An exception is the case of memory-bound kernels, seen in a Nearest Neighbor application, for which the execution time still outperforms the same kernels executed serially by 12-20%. It can also be beneficial to choose a merged kernel that over-extends the GPU resources, as we show the worst case to be bounded by executing the kernels serially. This paper also provides an analysis of the latency penalty that can occur when two kernels with varying completion times are merged.

1. Introduction

Specialized accelerator hardware shows great potential for high performance computing, sacrificing generality to achieve greater throughput and energy efficiency. As a separate coprocessor, the programming model generally consists of identifying tasks or kernels to be offloaded. This model is present, for example, in the main languages for general-purpose computing on graphics processors: CUDA [13], OpenCL [12], and Brook [4]. GPUs are of particular interest because their high parallelism and memory bandwidth offer very high performance across a range of applications [5] as well as considerably improved energy efficiency [3] in a low-cost commodity platform that is PC-compatible. Both Intel and AMD have announced products that will integrate the GPU onto the same chip as the CPU. At the same time, the shift to multicore CPUs already poses a requirement for parallel programming, so this aspect of the GPU does not constitute a major additional burden.

Many kernels do not present enough work to fully utilize the coprocessor, and unfortunately few systems allow tasks from different processes to run concurrently on a single coprocessor. In fact, CUDA and OpenCL do not even allow independent kernels from the same process to run concurrently.1 However, GPUs employ such deep multithreading (the highest end GPU today, the GTX 280, currently supports 30,720 concurrent thread contexts) that underutilization is common. Moore's Law is likely to lead to even greater thread capacities, making it harder to fill the GPU with enough threads and hence even further reducing utilization. We conjecture that allowing independent kernels (from the same or different processes) to share the GPU would boost throughput and energy efficiency substantially.

In order to support concurrent execution, a system-level scheduler must be able to examine pending kernels, decide whether co-scheduling is desirable, and initiate the desired concurrent execution. Co-scheduling decisions should ensure that resources are not over-subscribed, including thread contexts, memory capacity, memory bandwidth, and in the case of GPUs, possibly other resources such as registers, constant memory, and texture memory,2 and that the expected run-times of the co-scheduled kernels are as equal as possible. The current method for issuing work to a coprocessor pushes the programmer to define kernels if and only if substantial work is to be done in order to make up for the cost of the offload. Instead, an issue queue that co-schedules independent kernels would allow the programmer to define more modular kernels (possibly a larger number of these that would be considered insubstantial for offload under the current programming model) with the certainty that efficiency and concurrency will be achieved behind the scenes.

This paper presents a runtime environment that intercepts the stream of kernel invocations to an accelerator, makes scheduling decisions, and merges kernels as necessary. Our prototype solution targets CUDA and NVIDIA GPUs, and currently supports merging a maximum of two kernels.

1 This restriction appears to be limited to general-purpose applications, since OpenGL and Direct3D applications do use multiple concurrent kernels to set up the graphics pipeline.

2 Although constant and texture memory spaces are abstractions unique to the GPU, they have proven to be valuable in accelerating GPGPU applications [5].


Figure 1. NVIDIA GPU Simultaneous Multiprocessor Architecture.

It has primarily been tested with microbenchmarks to show feasibility, but we also present results with a full-size, data-intensive program, nearest neighbor (NN) search, and another compute-intensive program, Gaussian Elimination. Both the microbenchmark and NN results show that issuing merged kernels via our proposed cusub successfully boosts throughput over the existing First Come First Serve approach to launching kernels. When both kernels fit simultaneously on the GPU, execution time is reduced to the larger of the two instead of the sum; in the case that the merged kernel requests more threads than the GPU can execute, the resulting performance is at least as good as the serial execution.

This paper is organized as follows: Section 2 describes general purpose graphics processing unit (GPGPU) computing and introduces the architecture and programming model of NVIDIA GPUs. We place our work in context in Section 3. Section 4 describes how we achieve task parallelism using CUDA on NVIDIA GPUs, describes the merge decision process, and outlines the method used for runtime modification. Section 5 presents analysis of results using microbenchmarks and two applications. Finally, Section 6 concludes the paper and provides ideas for future work.

2. Background

Lindholm et al. [10] provide an overview of the architecture of NVIDIA GPUs, and Nickolls et al. [13] describe specifically how this is exposed for general purpose programming through CUDA. Figure 1 is an outline of the Simultaneous Multiprocessors (SMs) that make up an NVIDIA GPU; one SM can execute 24 warps of 32 threads on each of its 8 cores, making a total of 768 active threads per SM [14]. The number of SMs on a GPU differs across generations, increasing from 14 in the GT9800 series to 30 in the GTX 280 [14]. This leads to more than 6,000 active parallel threads executing the same instructions on independent data.

The CUDA API does not provide a mechanism to interrupt a kernel that has already begun execution. This prevents traditional resource scheduling models, such as time-slicing, from being part of this experimentation because the GPU resources are not accessible in the same way as CPUs. In general, the programming model for GPGPUs is a master-slave model, where the CPU thread is the master that launches slave threads on the GPU [14]. The GPU driver holds ready kernels in an issue queue until these are processed in a first come, first serve fashion.

A CPU thread that launches a kernel continues execution up to a cudaThreadSynchronize() barrier, at which point the process blocks until the kernel is finished.3 In this context, the throughput of the GPU becomes important if many CPU processes are attempting to launch kernels. If a kernel is already running on the GPU and another kernel is launched (either by the same process or another process), the second kernel is blocked until the first kernel finishes. The latency of a kernel call (how long it takes to receive results from a kernel) is also affected by how many kernels are in the queue; the longer the queue, the greater the latency for the CPU thread waiting for that kernel's results. For our development and testing we assume an environment where the GPU is a heavily contended resource, guaranteeing a populated issue queue and hence an opportunity to combine kernels before they are executed.
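To make the launch-and-block pattern above concrete, the host-side sketch below launches a hypothetical kernel (the kernel body, names, and sizes are illustrative, not taken from the paper) and then waits at the synchronization barrier before copying results back; while it runs, any other kernel issued to the same GPU waits in the driver's queue.

    // Sketch of the master-slave launch pattern (CUDA 2.x runtime API).
    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;          // trivial illustrative work
    }

    void runOnGpu(float *hostData, int n) {
        float *devData;
        cudaMalloc((void **)&devData, n * sizeof(float));
        cudaMemcpy(devData, hostData, n * sizeof(float), cudaMemcpyHostToDevice);

        dim3 dimBlock(256);
        dim3 dimGrid((n + 255) / 256);
        scale<<<dimGrid, dimBlock>>>(devData, 2.0f, n);

        // The CPU thread blocks here until the kernel finishes; a kernel launched
        // by another process in the meantime is queued behind this one.
        cudaThreadSynchronize();

        cudaMemcpy(hostData, devData, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(devData);
    }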

3. Related Work

Heterogeneous architectures with specialized coprocessors such as GPUs show great promise for high performance computing, because the specialized nature of the coprocessors allows them to dedicate more area toward computational resources (at the expense of general-purpose flexibility). In addition to GPUs, the Cell BE and FPGAs are also frequently mentioned as potential coprocessors.

In the context of GPU computing, a variety of applications have shown as much as 100-200X speedups compared to single-core CPUs. These are generally data-parallel applications, but the GPU can also be effective at exploiting task parallelism; for example, Boyer et al. [3] reported 98X speedup with the GPU over a single-core CPU implementation and 26X speedup over a quad-core implementation for leukocyte detection and tracking in video microscopy sequences. Hwu et al. [16] have published work on optimizing single applications for CUDA, Aji and Feng [2] have investigated increasing the performance of data-serial applications on NVIDIA GPUs using CUDA, and similarly Duran et al. [7] extend the OpenMP programming model in the search for task parallelism. In our work, however, the performance improvements are a result of a novel approach to issuing tasks to an accelerator, and not due to identifying data or task parallelism within a workload.

Another stream of research has looked at how to interface coprocessors within the system architecture. The Merge framework from Linderman et al. [9] achieves performance and energy efficiency across heterogeneous cores via a dynamic library-based dispatch system. This approach is similar to our methodology as it is more closely related to the work queue of an accelerator.

3 Furthermore, if a kernel is currently running on the GPU, both the cudaMalloc() and cudaMemcpy() functions block the CPU as well.


Merge also proposes a programming model that abstracts the architecture and requires additional information from the programmer for dispatch decisions, whereas our approach does not modify the existing CUDA programming model. Harmony similarly proposes a programming and execution model to improve scheduling decisions by considering a heterogeneous computing environment as an entity instead of individual cores [6].

What distinguishes our work from these prior research efforts is our proposed cusub scheduler, which achieves a finer-grained increase in performance that would benefit a larger system-wide scheduler such as Merge or Harmony. Jacket from AccelerEyes employs a kernel execution model that does not guarantee immediate execution of a compute kernel in order to provide more optimal kernel scheduling [1], and could also benefit from our proposed method of issuing more than one kernel at a time to broaden scheduling options. There has also been industry interest in leveraging GPGPU CUDA processing in cluster computing, such as with the work done at Penguin Computing. Their solution schedules work across many GPUs, each performing specific tasks, which is orthogonal to our method [15]. McCool considers difficulties such as detecting and balancing task parallelism within an application [11]; however, our approach to task parallelism occurs at a coarser granularity in which independent kernels from the same or different processes are all sources of parallelism. Many problems remain to be solved regarding scheduling decisions for heterogeneous architectures, and our work provides a new aspect to parallel execution that increases flexibility for scheduling decisions and can naturally be extended to other accelerators.

4. Approach

There are three components to this research: the first explores methods for coercing kernels to run concurrently (Section 4.1). The actual merging of the kernels is performed prior to runtime, but this is specific to our implementation (Section 4.2). Finally, applications with CUDA kernels are executed through a rudimentary scheduler, cusub, that makes the decision to run either a merged kernel or the kernels individually (Section 4.3).

4.1. Achieving task parallelism

Given two different kernels to be executed by the GPU, our approach to merging them into one kernel is to use the block identifiers to assign work to the threads. All CUDA programs define the execution configuration in two instructions preceding the kernel launch, by setting the block dimension and the grid dimension (variables dimBlock and dimGrid in Algorithm 1, respectively). The block dimension specifies the number of threads per block, and the grid dimension is the total number of blocks to be launched. The threads that make up a block execute instructions in lock-step, hence by assigning the kernels to threads on different SMs there is no concern of inflicting thread divergence by adding a second kernel to be executed. Figure 2 shows the execution model achieved with block partitioning; two or more kernels are executed by the sum of the number of blocks each kernel requests, and the threads execute in parallel because they are mapped to different SMs. Task parallelism is achieved, and the transformations at the source code level are explained in more detail next.

The code for the merged kernel includes three differences from the original versions of the kernels. First, the merged kernel combines the data pointers from the two original kernels; to prevent name collisions the variables have to be renamed, as detailed in Section 4.2. Next, as in Algorithm 1, the merged kernel includes the code of the independent kernels by separating these using a single if-else clause. Lastly, the indexes for all data accesses in the second kernel are updated when offsets are calculated based on thread ID. This is a common method of writing kernels in the CUDA programming model as it allows offsets to be calculated by all threads executing one instruction and hence prevents control flow divergence. Hence, the logic for new thread ID indexing needs to replace all of the original indexing found in the second kernel. As in Algorithm 1, the logic for having the first a blocks work on one kernel and the next b blocks work on the other is a good choice as it is straightforward and independent of the size of each kernel.
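As an illustration of these three changes, the sketch below merges two hypothetical kernels (a scale and an add; the bodies and parameter names are ours, not the paper's benchmarks): parameters carry per-kernel suffixes to avoid name collisions, a single if-else on the block index distributes the blocks, and the second kernel's block index is rebased before its thread offsets are computed.

    // Illustrative block-partitioned merged kernel (assumed names; not the paper's code).
    __global__ void mergedKernel(float *out_k1, const float *in_k1, int n_k1,   // kernel 1 data
                                 float *out_k2, const float *in_k2, int n_k2,   // kernel 2 data
                                 int gridDimK1)    // number of blocks assigned to kernel 1
    {
        if (blockIdx.x < gridDimK1) {
            // Body of the first kernel, indexed exactly as in its original form.
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n_k1) out_k1[i] = in_k1[i] * 2.0f;
        } else {
            // Body of the second kernel, with the block index rebased by gridDimK1.
            int i = (blockIdx.x - gridDimK1) * blockDim.x + threadIdx.x;
            if (i < n_k2) out_k2[i] = in_k2[i] + 1.0f;
        }
    }

On the host, the merged launch simply requests the sum of the two grids, e.g. mergedKernel<<<gridDimK1 + gridDimK2, dimBlock>>>(...), so each original kernel receives exactly the blocks it asked for.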

Alternatives to block partitioning that would allow two kernels to be issued as one with the current CUDA scheduler are thread interleaving and block interleaving. The first uses the thread identifiers instead of the block indexes to interleave the kernels across the threads. The occupancy of the SMs would be increased and it would still be possible to achieve a performance improvement by reducing the overheads associated with a kernel launch, yet in the strictest sense of efficiency, thread interleaving violates the SIMD lockstep execution of threads in the same warp by inserting control flow divergence. The latter alternative interleaves the kernels at the block level, avoiding intra-warp thread divergence at runtime; but the indexing pattern for memory accesses in each kernel would have to be updated to account for the even versus odd block indexes, increasing the complexity of extending the technique to more than two kernels compared to block partitioning.
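For contrast, a block-interleaved version of the same two hypothetical kernels is sketched below; thread interleaving would test threadIdx.x instead of blockIdx.x, which is what introduces the intra-warp divergence discussed above.

    // Illustrative block interleaving (assumed kernel bodies): even blocks run the
    // first kernel, odd blocks the second, and each sees a rebased block index.
    __global__ void interleavedKernel(float *out_k1, const float *in_k1, int n_k1,
                                      float *out_k2, const float *in_k2, int n_k2)
    {
        int vblock = blockIdx.x / 2;                      // per-kernel "virtual" block index
        int i = vblock * blockDim.x + threadIdx.x;
        if (blockIdx.x % 2 == 0) {
            if (i < n_k1) out_k1[i] = in_k1[i] * 2.0f;    // first kernel's work
        } else {
            if (i < n_k2) out_k2[i] = in_k2[i] + 1.0f;    // second kernel's work
        }
    }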

4.2. Statically Merged Kernels

As shown in Section 4.1, the modifications to the kernel source code to support merging are simple and general. Thus, a single pass compiler can create the merged kernel source code by renaming the variables, adding the if-else control flow, and adding an instruction to update indexing in the kernel that is executed by blocks that are offset from blockID 0.


Figure 2. Block Partitioning. Each kernel contains n blocks.

Renaming the variables following a naming scheme, such as appending variable names with the kernel ID, prevents naming collisions. The kernel distribution and the indexing update are similar additions that can be done at the source code level by a source-to-source optimization prior to full compilation. For the experiments presented in this paper, however, the microbenchmarks and the application kernels were merged by hand; the evaluation is explored in Section 5.

The time it takes for the compiler to produce the merged kernel is not vital to the goal of this research, as it can be done statically for different combinations of merged kernels. Thus, the kernel issue queue can then look up the appropriate merged kernel from a library of merged kernels and execute it as detailed in Section 4.3 in place of the two individual kernels. A more advanced implementation could, for example, exploit nvcc's just-in-time compilation of kernels to merge the kernels at runtime.

4.3. Issue Queue

Analogous to common multiprocessor scheduling queues, we have developed a simple issue queue for applications that are contending for use of the GPU. Applications that will launch CUDA kernels are themselves instantiated via cusub, the queue handling program. Cusub monitors the applications and "hijacks" the CUDA function calls that request memory from the device, transfer data to and from the device, and launch the kernels. Figure 3 shows an example of the static merge and the flow of cusub. Note that the kernels may belong to separate processes or the same process; in the case of merging kernels from the same process, an explicit cudaThreadSynchronize() between the kernels removes them from consideration for merging.

Algorithm 1. Code excerpt for thread interleaving.

    mergedkernel<<<dimGrid, dimBlock>>>(ptr1_1, ptr2_1, ..., ptr1_2, ptr2_2, ..., int dimGridKernel1) {
        int index ← blockID * dimBlock + threadID
        if blockID <= dimGridKernel1 then
            kernel_1()
        else if index < dimGrid then
            // Update the indexing
            index ← index − dimGridKernel1
            kernel_2()
        end if
    }

Figure 3. Static merge and cusub operation.


As described in Section 4.2, cusub needs information about the amount of memory requested by each kernel, and the parameters for the number of blocks that will be needed on the GPU. If it is the case that two kernels should be merged, a previously-merged kernel is substituted for the two separate kernels, and cusub handles the data transfer from the merged kernel to the original applications. This also implies that merging the kernels does not actually remove a kernel from the issue queue, and in order to not demote any kernels, the new merged kernel replaces the kernel with the highest priority that it has merged. This leads to kernel promotion but not demotion, and would have to be considered in an actual implementation of kernel merging to prevent priority inversion.
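The paper does not describe cusub's internal data structures; one plausible organization, sketched below with hypothetical types and names, keys a library of pre-merged kernels by the pair of kernel identifiers so the queue can substitute the merged variant when the decision logic approves it.

    #include <cstddef>
    #include <map>
    #include <string>
    #include <utility>

    // Hypothetical description of a queued kernel request intercepted by cusub.
    struct PendingKernel {
        std::string name;        // kernel identifier
        int         numBlocks;   // requested grid size
        size_t      memBytes;    // global memory footprint of its cudaMalloc() calls
    };

    // Launcher for a statically pre-merged kernel pair (hypothetical signature).
    typedef void (*MergedLauncher)(const PendingKernel &, const PendingKernel &);

    // Library of pre-merged kernels produced ahead of time by the merging pass.
    static std::map<std::pair<std::string, std::string>, MergedLauncher> mergedLibrary;

    // Returns the pre-merged launcher for a pair of queued kernels, or NULL if the
    // static pass never generated a merged version of this combination.
    MergedLauncher findMergedKernel(const PendingKernel &a, const PendingKernel &b) {
        std::map<std::pair<std::string, std::string>, MergedLauncher>::const_iterator it =
            mergedLibrary.find(std::make_pair(a.name, b.name));
        if (it == mergedLibrary.end())
            it = mergedLibrary.find(std::make_pair(b.name, a.name));
        return it == mergedLibrary.end() ? NULL : it->second;
    }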

Ideally, cusub-like functionality would be included in the CUDA device driver, and the driver could provide real-time merging (as opposed to static compiler pre-merging). Our cusub stands as a proof-of-concept demonstration of the decision and merged-kernel launch procedure. Knowledge of the GPU's status would allow cusub to simply launch the ready kernel at the head of the queue if the device is free, yet there is currently no API function for determining whether the GPU is busy. Hence cusub passes the merged kernel on to the driver and its own issue queue, only intercepting and possibly merging the kernels for applications executed using cusub. The technique, if housed in the GPU driver or ultimately the OS, extends to intercepting and considering all CUDA kernels launched.

4.4. The Decision to Merge

It is not always the case that multiple kernels should be merged together; a single kernel could have enough data parallelism to fully utilize the GPU, in which case it should be run alone. However, most applications, especially those that cannot be termed embarrassingly parallel, will not be able to fully utilize the GPU. The cudaGetDeviceProperties() function call can be used to retrieve the number of multiprocessors and the amount of global memory available on the GPU. The number of blocks needed for the kernel can be calculated with the following formula:

#blocks needed = #threads requested / #threads per block

Once the number of blocks needed by each kernel is known, the sum of the dimGrid declarations of the kernels can be compared to the number of multiprocessors reported by the device; if the sum is greater than the number of blocks the GPU can execute concurrently, the two (or more) kernels should not be merged.

Likewise, the amount of memory available can be compared to the cudaMemcpy() function parameters, and if the sum of the memory requests is larger than the amount of memory available, the kernels should not be combined. Note that when merging kernels, as detailed in Section 4.1, the combined global, texture, and constant memory footprint of the kernels needs to be considered. The shared memory structure that is private to each multiprocessor can be used by the kernel independent of whether it is a merged kernel or not. A decision flowchart showing the overall method is shown in Figure 3.
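A minimal sketch of this check, assuming the per-kernel thread counts, block size, and memory requests have already been captured from the intercepted execution configurations and cudaMalloc()/cudaMemcpy() calls (the helper name and parameters are ours):

    #include <cstddef>
    #include <cuda_runtime.h>

    // Decide whether two pending kernels can be merged without over-subscribing
    // the device (sketch of the checks described in Section 4.4).
    bool shouldMerge(int threadsA, int threadsB, int threadsPerBlock,
                     size_t memBytesA, size_t memBytesB)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // #blocks needed = #threads requested / #threads per block (rounded up).
        int blocksA = (threadsA + threadsPerBlock - 1) / threadsPerBlock;
        int blocksB = (threadsB + threadsPerBlock - 1) / threadsPerBlock;

        // The combined grid must not exceed the SMs that can each run one block...
        if (blocksA + blocksB > prop.multiProcessorCount)
            return false;
        // ...and the combined memory requests must fit in global memory.
        if (memBytesA + memBytesB > prop.totalGlobalMem)
            return false;
        return true;
    }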

5. Performance Analysis

The method for merging two independent kernels used in the testing is block partitioning, as it achieves concurrency and requires only a small number of modifications that a compiler can perform, as detailed in Section 4.1. The goal of this research is to increase the throughput of a GPU by executing more than one kernel at a time. Also of interest is the additional latency that a kernel would incur compared to dedicated usage of the processing resource.

The experimental testbed is an NVIDIA GeForce 9800 GT that has 14 SMs, 512 MB of global memory, 16 KB of shared memory per block, and operates at 1.512 GHz. It is hosted on a system that runs the Ubuntu 8.10 distribution of Linux on an Intel Pentium D CPU running at a clock frequency of 3.20 GHz. A second system was also used with an NVIDIA GeForce GTX 280 that has 30 SMs. The observed behavior of the merged kernels is qualitatively similar on both cards and we include only data from the first for simplicity. The test code is written in NVIDIA's CUDA and executed using the CUDA 2.1 driver and toolkit.

5.1. Benchmarks

We primarily evaluate this prototype using microbenchmarks, which emphasize two opposite behaviors. First, the Compute kernel is a computationally expensive kernel adopted from a random number generator [8] that uses a seed and a varying number of iterations based on the thread identifiers to generate n different random numbers. The second kernel, termed the Memory kernel, stresses the GPU interconnect bandwidth with memory requests from each thread, characteristic of several kernels where the threads need source data. The kernels for both microbenchmarks have similar execution times, but this need not be the case, as any two kernels can be merged; related considerations are further explored in Section 5.3. Merging the microbenchmarks allows a fine-grained analysis of penalties associated with using more GPU resources concurrently.
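The microbenchmark sources are not listed in the paper; the two sketches below are hypothetical stand-ins that capture the same contrast, a compute-bound kernel that iterates a pseudo-random recurrence and a memory-bound kernel that repeatedly reads global memory.

    // Compute-bound stand-in: per-thread iteration of an LCG, touching memory once.
    __global__ void computeKernel(unsigned int *out, unsigned int seed, int iters) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int x = seed + tid;
        for (int k = 0; k < iters + (tid & 31); ++k)    // iteration count varies by thread ID
            x = 1664525u * x + 1013904223u;             // arithmetic only, no memory traffic
        out[tid] = x;
    }

    // Memory-bound stand-in: each thread issues several strided global-memory loads.
    __global__ void memoryKernel(float *out, const float *in, int stride, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float acc = 0.0f;
        for (int k = 0; k < 8; ++k)                     // repeated global-memory reads
            acc += in[(tid + k * stride) % n];
        out[tid] = acc;
    }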

The Nearest Neighbor (NN) application is part of a benchmark suite of unstructured data applications [18] that has been ported to CUDA. The benchmark calculates the Euclidean distance from a target location to every record in a data set and finds the k nearest neighbors. With 512 threads running on each multiprocessor, each NN kernel uses a little over 3 of the available SMs. The Gaussian Elimination (GE) application solves a set of linear equations using a parallel algorithm that has been ported to CUDA.

5.2. Throughput

A kernel launched on the GPU executes until it is completed and the results are explicitly copied back to the host CPU in the source code. By executing independent tasks in parallel our aim is to make greater use of the processing units and achieve lower execution times for the merged kernels than the current issue mechanism. Ideally, if the device has idle processing cores for each of the independent kernels, executing them concurrently should take as long as the most time consuming kernel. In our experiments we discovered this is not necessarily the case.
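The paper does not state how kernel execution time was measured; one conventional approach, assumed in the sketch below, is to bracket the launch with CUDA events so that the reported time reflects only device-side execution.

    #include <cuda_runtime.h>

    __global__ void someKernel(float *data, int n) {    // trivial stand-in kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    // Returns the device-side execution time of one launch, in milliseconds.
    float timeKernelMs(dim3 dimGrid, dim3 dimBlock, float *devData, int n) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        someKernel<<<dimGrid, dimBlock>>>(devData, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);                     // wait for the kernel to finish

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }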

Figure 4 shows the execution times of merged microbenchmark kernels compared to executing them serially. In each case, one kernel is held at 3000 threads and the second kernel launches a varying number of threads; similarly for the sequential kernels, the first kernel uses 3000 threads and the next kernel's size is swept. The merged kernels indeed execute in the same amount of time as the longest running of the single kernels, and this is essentially half the time on the GPU processing resources that it takes to execute the kernels back-to-back as is done by the current issue queue. There are additional latencies that are amortized by executing the kernels together instead of sequentially, such as the amount of time to actually launch the kernel at the software and the hardware level. Figure 4 only compares the execution time on the GPU itself, and as long as the number of blocks requested by the merged kernel is not greater than the number of SMs available on the GPU, the merged kernel cuts the execution time of the two single kernels in half.

Over-extending the processing cores on the GPU is also shown in Figure 5, where a jump in the execution time for the merged kernel can be seen when the total number of threads launched exceeds the number of SMs available on the 9800 GT. When a kernel requests m processing elements on a GPU with n SMs, where m > n, the hardware executes the kernel on the available SMs for the first n blocks requested, and the remaining m − n blocks of the kernel execute once the SMs become available again.


!"

#!"

$!"

%!"

&!"

'!"

(!"

)!"

*!"

+!"

%" (" *" ##" #%" #(" #*" $#" $&"

!"#$%&'()*+,

#)-,./)

*'012)(%,3#4)'5)064#17.)-06'%.1(7./)

Single Compute

Sequential Compute

Single Memory

Sequential Memory

Merged Compute

Merged Memory

Merged Compute-Memory

Figure 4. Comparison of executing mergedmicrobenchmark kernels versus issuingthese serially. For single kernels that donot fill the SMs, the merged kernels performbetter than under sequential issue. At threadcounts large enough for the single kernels tofill the SMs, the merged kernels perform noworse than executing these serially.

In the case of a merged kernel that requests more processing elements than are available on the GPU, the execution time for the kernel requires two or more executions of the kernel. As in Figure 5, over-extending the processing resources doubles the execution time of the merged compute and memory kernel compared to when it requests the same number of thread blocks as, or fewer than, the available SMs. The increase in execution time is not a clearly defined step, but rather a jagged and incremental increase. We hypothesize that this behavior is due to the hardware mechanism for issuing thread blocks to multiprocessors as soon as these are available; since thread blocks are not uniform in execution time, in particular when accessing memory, the thread blocks that are generated in the interval of the jump in execution time are not fully populated and hence execute faster than two full iterations of the kernel.

The increase in execution time in Figure 5 would also occur when the kernels are executed serially, and the difference in execution times would in fact be more than doubled. For example, if the compute kernel requested a SMs and the memory kernel requested b SMs where a > n and b > n but a + b ≤ 3n, and both kernels execute in one time unit, executing the kernels independently requires four units of compute time whereas the merged version would complete in three time units. Hence, a merged kernel that requests more blocks than the GPU's number of available SMs is beneficial, as its worst case execution time will still outperform executing the single kernels independently, as can be seen in Figure 4.
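Instantiating this with the 9800 GT's 14 SMs and made-up block counts of a = b = 20 (so that a > n, b > n, and a + b ≤ 3n), with one block per SM per time unit:

\text{serial: } \lceil 20/14 \rceil + \lceil 20/14 \rceil = 2 + 2 = 4 \text{ units}, \qquad \text{merged: } \lceil (20+20)/14 \rceil = \lceil 40/14 \rceil = 3 \text{ units}.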

Figure 5. Execution times for the merged Compute and Memory kernel (kernel execution time in ms), sweeping the number of threads of the memory kernel from 3.0 to 5.0 thousand for three sizes of the compute kernel (3000, 3500, and 4000 threads). The jagged increase in execution time reveals inconsistent execution times per thread block as the thread count reaches full-SM utilization.

The kernels execute using 512 threads per block, thus one block executes per SM and up to 7,168 threads can execute at a time on the 9800 GT. In the sequential cases, the first kernel executes for 3000 threads followed by the second kernel executing anywhere between 64 and 7,168 threads before launching a third wave if it requires more than 7,168 threads. Executing the same merged kernels with a smaller block size of 256 threads per block rather than 512 threads per block results in an equivalent jump, again occurring at a block usage of fifteen or greater. As expected, the behavior is qualitatively similar on the GTX 280; when the number of blocks being used on the device exceeds thirty, the execution time jumps.

Figure 6 compares the execution time of running two kernels (NN and GE) independently versus the same kernels merged using block partitioning. The execution times in the figures reflect the cumulative time spent executing the kernel code. Table 1 shows the average execution time of a single NN kernel and that of two NN kernels merged as one. The merged kernel takes more time to execute than a single NN kernel, but less time than the two kernels issued independently. For the different numbers of records used in the tests, the execution time of the merged kernel is from 12% to 20% less than that of the independent kernels. This result is a success, but it falls short of the original expectation that the kernels would execute in the same amount of time as a single kernel.

In NN, the threads load data from global memory on the device prior to computing the Euclidean distances, and the bandwidth from global memory to shared memory becomes a limiting factor.


!"

#!!"

$!!"

%!!"

&!!"

16 21 43 86 171 342 684

Execution Time (msec)

Number of Record (thousands)

Two Single NN Kernels

Merged NN Kernel

(a) Nearest Neighbors ExecutionTime

0

200

400

600

800

1000

400 500 600 700 800 900 1000

Execution Time (msec)

Matrix Dimension (n x n)

Two Single GE Kernels

Merged GE Kernel

(b) Gaussian Elimination ExecutionTime

Figure 6. Execution Times

                             Merged NN Kernel    Single NN Kernel
  Execution time per kernel  34 µs               21 µs
  Throughput per kernel      37.7 MB/s           30.5 MB/s

Table 1. Average Execution Time for Single and Merged Nearest Neighbor Kernels

!"#

$%#

$$#

$&#

$'#

%(# !%# )$# *(# %'%# $)!# (*)# !!+++#

!"#$%&"'%())

*+,-./01)

2%34/#)$5)6/0$#7.)*89)("$%.:97.1)

,-./01#22#314.10# 514/16#22#314.10#

Figure 7. Nearest Neighbors Throughput

Running a bandwidth test included in the NVIDIA SDK shows a transfer rate of 44.2 GB/s on device memory, compared to the peak bandwidth listed at 86.4 GB/s [14]. Even executing at peak bandwidth, the number of threads executing the same fetch from global memory can easily saturate the interconnect. The 44.2 GB/s transfer rate from the bandwidth test can be divided by the clock frequency and expressed as 29 bytes per clock cycle. In comparison, the merged NN kernel has 3,000 threads each trying to load 32 bytes, and although different warps may execute the load at different times, even within a warp the requested transfer is 1,024 bytes per clock cycle, which far exceeds the transfer rate observed in the bandwidth test. It is common and recommended to move data from global memory to shared memory to prevent repeated expensive latencies. Hence, this is both standard behavior and also a well known performance bottleneck [17]. In fact, a single NN kernel is deemed bandwidth limited as most of its execution time is due to global memory access, and thus merging the kernels only aggravates the problem. A possible solution is to utilize faster memory for input data, such as the constant or texture cache [14], though that is an optimization of the kernel itself and out of the scope of this project. The merging of kernels is still a success, as evidenced by Figure 6(a), for a bandwidth limited kernel, and in the case of kernels not limited by this bottleneck we have seen the execution time of the merged kernel to be almost the same as that of a single kernel.
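The per-cycle figures quoted above follow from the measured bandwidth, the 1.512 GHz clock, and the 32-byte load issued by each of a warp's 32 threads:

\frac{44.2 \text{ GB/s}}{1.512 \text{ GHz}} \approx 29 \text{ bytes per clock cycle}, \qquad 32 \text{ threads} \times 32 \text{ bytes} = 1024 \text{ bytes requested per warp}.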

The Gaussian Elimination (GE) algorithm launches two back-to-back kernels in order to perform a forward substitution to transform a square n × n matrix into triangular form. The merged-kernel version of the application merges the two kernels from one GE problem with the two kernels of a second GE problem. Figure 6(b) shows that the merged application completes in less time than two single applications running sequentially, averaging a 1.3x speedup.

Even though the GPU does not provide results at a constant rate, throughput normalizes the amount of data received from the device by time. Thus, Figure 7 compares the throughput achieved by a single NN kernel with that of the merged kernel. Although the motivation for this research led us to hypothesize that the throughput could be doubled, the bandwidth bottleneck also affects the throughput. In the worst case, where a single kernel has fully saturated the global memory bandwidth, merging this kernel with another will still output at the same throughput as the fully saturated kernel. Hence, our approach does not degrade the throughput of a given memory-bound kernel, and for certain kernels it could increase throughput, given a kernel that makes full use of caches with higher bandwidth than global memory.

5.3. Latency

Although the motivation for this project is to increase the throughput of the GPU by increasing the usage of available cores, the effect of this approach on application latency is also a concern, since the GPU does not return results until the execution of the entire kernel has completed. This means that a low latency kernel could be merged with one that runs much longer, and hence its results will be available later than had it been executed alone.

Using NN this phenomenon can be seen by merging a kernel that executes the full Euclidean distance calculation with one that always returns a constant value. Independently, one execution of the low latency NN kernel returns in 16 µs, whereas a full NN kernel returns in 28 µs. When merged, the latency of one kernel execution is 34 µs. This result once again shows the effect on performance of the limited global memory bandwidth, as both kernels need to store their results to device memory before they are transferred back to the host. Thus, the CPU thread that is waiting on the low latency kernel results sees a much larger latency, longer even than the 28 µs that is the maximum of the individual kernels.

The best case latency for a low latency kernel that is merged with a high latency kernel is the execution time of the high latency kernel. In the worst case, the execution time achieved will degrade to the sum of the two independent execution times if both kernels fully saturate the global to shared memory interconnect. There is no method to fully prevent a low latency kernel from being merged with a high latency one, as the execution time of a given kernel is not known prior to execution. However, the static merging approach can be tuned to take into consideration the number and types of operations in each kernel before merging them. Applications that are latency intolerant could also signal the issue queue to not consider their kernels for merging. Alternatively, a history of execution times for kernels could be used to approximate future behavior of the same kernel and consider it in the merging decision.

6. Conclusions and Future Work

This paper analyzed methods for running general purpose task parallel workloads on a graphics processing unit designed for data parallel workloads. We demonstrate how merged CUDA kernels can increase the throughput on a computing platform that utilizes an NVIDIA GPU for general purpose computing tasks, and we implement an issue queue that decides when to merge kernels for concurrent processing. We also provide a description of a simple kernel queue manager that makes the decision to run merged kernels based on the selection of a single or multiple kernels. Using microbenchmarks, a Nearest Neighbor search, and a Gaussian Elimination application, we showed that a merged kernel executes in the same amount of time as the longest running of the single kernels. An exception proves to be kernels whose runtime is dominated by global-to-local memory transfers; merging such kernels increases the stress on the device interconnect such that an improvement of only 12-20% over executing the kernels sequentially is observed. In the case of merging kernels beyond the processing resources of the GPU, our experiments confirm that some of the work must be serialized on the device, yet this alternative can enhance schedulers as it can be beneficial when, for example, two single kernels that over-extend the processing resources themselves are merged.

Future work includes more thorough benchmarking using a larger selection of CUDA applications, and an improved queue manager. Research is still needed regarding how to merge kernels programmatically, and whether it is feasible to do so in real time. Further investigation of how to generalize the process for generic task-parallel to data-parallel translation is also needed. Scheduling general purpose work on a system with heterogeneous cores is an existing problem that will become more prevalent in next generation processors, and our dynamic approach to increasing concurrency and efficiency on GPUs improves the ability of a system-wide scheduler to achieve more optimal execution.

Acknowledgements

This work was supported in part by NSF grants IIS-0612049, CSR-0615277, CNS-0747273, and CCF-0811302, grants from SRC GRC under task nos. 1607.001 and 1790.001, and grants from Microsoft, Intel Research and NVIDIA Research. We would also like to thank the anonymous reviewers for their helpful comments.

References

[1] AccelerEyes. Jacket User Guide, May 2009.
[2] A. Aji and W. Feng. Accelerating Data-Serial Applications on Data-Parallel GPGPUs: A Systems Approach. 2008.
[3] M. Boyer, D. Tarjan, S. T. Acton, and K. Skadron. Accelerating leukocyte tracking using CUDA: A case study in leveraging manycore coprocessors. 23rd IEEE International Parallel and Distributed Processing Symposium, May 2009.
[4] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: stream computing on graphics hardware. In International Conference on Computer Graphics and Interactive Techniques, pages 777-786. ACM New York, NY, USA, 2004.
[5] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, and K. Skadron. A performance study of general-purpose applications on graphics processors using CUDA. Journal of Parallel and Distributed Computing, 68(10):1370-1380, 2008.
[6] G. F. Diamos and S. Yalamanchili. Harmony: an execution model and runtime for heterogeneous many core systems. In HPDC '08: 17th International Symposium on High-Performance Distributed Computing, pages 197-200, New York, NY, USA, 2008. ACM.
[7] A. Duran, J. Perez, E. Ayguade, R. Badia, and J. Labarta. Extending the OpenMP tasking model to allow dependent tasks. Lecture Notes in Computer Science, 5004:111, 2008.
[8] W. Langdon. A fast high quality pseudo random number generator for graphics processing units. In IEEE Congress on Evolutionary Computation, pages 459-465, 2008.
[9] M. D. Linderman, J. D. Collins, H. Wang, and T. H. Meng. Merge: a programming model for heterogeneous multi-core systems. SIGARCH Computer Architecture News, 36(1):287-296, 2008.
[10] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, pages 39-55, 2008.
[11] M. McCool. Scalable programming models for massively multicore processors. Proceedings of the IEEE, 96(5):816-831, May 2008.
[12] A. Munshi. OpenCL. SIGGRAPH 2008: Beyond Programmable Shading Course.
[13] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40-53, 2008.
[14] NVIDIA. CUDA Programming Guide 2.1, 2008.
[15] Penguin Computing. Penguin Computing Offers Optimized NVIDIA Tesla GPU Computing Clusters for Technical Workgroup Computing at Unparalleled Price/Performance. Press Release. Reviewed on 2009-09-03. http://www.penguincomputing.com/products/TeslaSolutions.
[16] S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S. Ueng, J. Stratton, and W. Wen-mei. Program optimization space pruning for a multithreaded GPU. In 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization, pages 195-204. ACM New York, NY, USA, 2008.
[17] J. Seland. CUDA Programming. Winter School in Parallel Computing, Geilo Winter School, Geilo, Norway, Jan 2008.
[18] C. Smullen, S. Tarapore, and S. Gurumurthi. A Benchmark Suite for Unstructured Data Processing. In International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI), pages 79-83, 2007.