Generic System Calls for GPUs
Jan Vesely, Rutgers University
jan.vesely@cs.rutgers.edu
Gabriel H. Loh, Advanced Micro Devices, Inc.
gabriel.loh@amd.com
Arkaprava Basu, Indian Institute of Science
arkapravab@iisc.ac.in
Mark Oskin, University of Washington / Advanced Micro Devices, Inc.
mark.oskin@amd.com
Abhishek Bhattacharjee, Rutgers University
abhib@cs.rutgers.edu
Steven K. Reinhardt, Microsoft
stever@microsoft.com
Abstract—GPUs are becoming first-class compute citizens and increasingly support programmability-enhancing features such as shared virtual memory and hardware cache coherence. This enables them to run a wider variety of programs. However, a key aspect of general-purpose programming where GPUs still have room for improvement is the ability to invoke system calls.
We explore how to directly invoke system calls from GPUs. We examine how system calls can be integrated with GPGPU programming models, where thousands of threads are organized in a hierarchy of execution groups. To answer questions on GPU system call usage and efficiency, we implement GENESYS, a generic GPU system call interface for Linux. Numerous architectural and OS issues are considered, and subtle changes to Linux are necessary, as the existing kernel assumes that only CPUs invoke system calls. We assess the performance of GENESYS using micro-benchmarks and applications that exercise system calls for signals, memory management, filesystems, and networking.
Keywords—accelerators and domain-specific architectures; graphics-oriented architecture; multicore and parallel architectures; virtualization and OS; GPUs
I. INTRODUCTION
GPUs have evolved from fixed-function 3D accelerators to fully programmable units [1-3] and are now widely used for high-performance computing (HPC), machine learning, and data analytics. Increasing deployments of general-purpose GPUs (GPGPUs) have been partly enabled by programmability-enhancing features like virtual memory [4-9] and cache coherence [10-14].
GPU programming models have evolved with these hardware changes. Early GPGPU programmers [15-19] adapted graphics-oriented programming models such as OpenGL [20] and DirectX [21] for general-purpose usage. Since then, GPGPU programming has become increasingly accessible to traditional CPU programmers, with graphics APIs giving way to computation-oriented languages with a familiar C/C++ heritage, such as OpenCL [22], C++AMP [23], and CUDA [24]. However, access to privileged OS services via system calls is an important aspect of CPU programming that remains out of reach for GPU programs.
Author contributed to this work while working at AMD Research
Designers have begun exploring ways to fill this research void. Studies on filesystem I/O (GPUfs [25]), networking I/O (GPUnet [26]), and GPU-to-CPU callbacks [27] established that direct GPU invocation of some specific system calls can improve GPU performance and programmability. These studies have two themes. First, they focus on specific system calls (i.e., for filesystems and networking). Second, they replace the traditional POSIX-based APIs of these system calls with custom APIs to drive performance. In this study, we ask: can we design an interface for invoking any system call from GPUs, and can we do so with standard POSIX semantics to enable seamless and wider adoption? To answer these questions, we design the first framework for generic system call invocation on GPUs, or GENESYS. GENESYS offers several concrete benefits.
First, GENESYS can support implementation of most of Linux's 300+ system calls. As a proof of concept, we go beyond the specific set of system calls from prior work [25-27] and implement not only filesystem and networking system calls but also those for asynchronous signals, memory management, resource querying, and device control.
Second, GENESYS's use of POSIX allows programmers to reap the benefits of standard APIs developed over decades of real-world usage. Recent work on SPIN [28] takes a step in this direction by considering how to modify the specific system calls in GPUfs to match traditional POSIX semantics. But GENESYS's generality in supporting all system calls means that it goes further, enabling, among other things, backwards compatibility: GENESYS makes it possible to deploy on GPUs the vast body of legacy software written to invoke OS-managed services.
Third, GENESYS's generality enables GPU acceleration of programs that were previously considered a poor match for GPUs. For example, it allows applications to directly manage their memory, query the system for resource information, employ signals, interface with the terminal, etc., in a manner that lowers programming effort for GPU deployment. These examples underscore GENESYS's ability to support new programming strategies and even legacy applications (e.g., using terminals/signals).
Finally, GENESYS can leverage the benefits of support for important OS features that prior work cannot. For example, GPUnet's use of custom APIs precludes the use of traffic shaping and firewalls that are already built into the OS.
When designing GENESYS, we ran into several design questions. For example, what system calls make sense for GPUs? System calls such as pread/pwrite to file(s), or send/recv to and from the network stack, are useful because they underpin many I/O activities required to complete a task. But system calls such as fork and execv do not, for now, seem necessary for GPU threads. In the middle are many system calls that need adaptation for GPU execution and are heavily dependent on architectural features that could be supported by future GPUs. For example, getrusage can be adapted to return information about GPU resource usage. We summarize the conclusions from this qualitative study.
We then perform a detailed design space study on the performance benefits of GENESYS. Key questions are:
How does the GPU's hardware execution model impact system call invocation strategies? To manage parallelism, the GPU's underlying hardware architecture decomposes work into a hierarchy of execution groups. The granularity of these groups ranges from work-items (or GPU threads), to work-groups (composed of hundreds of work-items), to kernels (composed of hundreds of work-groups).1 This naturally presents the following research question: at which of these granularities should GPU system calls be invoked?
How should GPU system calls be ordered? CPU system calls are implemented such that instructions prior to the system call have completed execution, while code following the system call remains unexecuted. This model is a good fit for CPUs, which generally target single-threaded execution. But such "strong ordering" semantics may be overly conservative for GPUs. It acts as an implicit synchronization barrier across thousands of work-items, compromising performance. Similar questions arise as to whether GPU system calls should be "blocking" or "non-blocking".
Where and how should GPU system calls be processed? Like all prior work, we assume that system calls invoked by GPU programs need to ultimately be serviced by the CPU. This makes efficient GPU-CPU communication and CPU-side processing fundamental to GPU system calls. We find that careful use of modern GPU features like shared virtual addressing [4] and page fault support [8, 9], coupled with traditional interrupt mechanisms, can enable efficient CPU-GPU communication of system call requests and responses.
To explore these questions, we study GENESYS with microbenchmarks and end-to-end applications. Overall, our
1 Without loss of generality, we use AMD's terminology of work-items, work-groups, kernels, and compute units (CUs), although our work applies equally to NVIDIA's threads, thread-blocks, kernels, and streaming multiprocessors (SMs), respectively.
contributions are:
(1) We take a step toward realizing truly heterogeneous programming by enabling GPUs to directly invoke OS-managed services, just like CPUs. This builds on the promise of recent work [25-28] but goes further by enabling direct invocation of any system call through standard POSIX APIs. This permits GPUs to use the entire ecosystem of OS-managed system services developed over decades of research.
(2) As a proof of concept, we use GENESYS to realize system calls previously unavailable on GPUs to directly invoke OS services for memory management, signals, and specialized filesystem use. Additionally, we continue supporting all the system services made available by prior work (i.e., GPUfs, GPUnet, SPIN), but do so with standard POSIX APIs.
(3) We shed light on several novel OS and architectural design issues in supporting GPU system calls. We also offer the first set of design guidelines for practitioners on how to directly invoke system calls in a manner that meshes with the execution hierarchy of GPUs to maximize performance. While we use Linux as a testbed to evaluate our concepts, our design choices are applicable more generally across OSes.
(4) We publicly release GENESYS, hosted under the Radeon Open Compute stack [29-33], offering its benefits broadly.
II. MOTIVATION
A reasonable question to ponder is: why equip GPUs with system call invocation capabilities at all? Conceptually, OSes have traditionally provided a standardized abstract machine, in the form of a process, to user programs executing on the CPU. Parts of this process abstraction, such as memory layout, the location of program arguments, and the ISA, have benefited GPU programs. Other aspects, however, such as standardized and direct protected access to the filesystem, network, and memory allocation, are extremely important for processes but are yet lacking for GPU code. Allowing GPU code to invoke system calls is a further step toward providing a more complete process abstraction to GPU code.
Unfortunately, GPU programs can currently only invoke system calls indirectly and thereby suffer from performance challenges. Consider the diagram on the left in Figure 1. Programmers are currently forced to delay system call requests until the end of the GPU kernel invocation. This is not ideal because developers have to take what was a single conceptual GPU kernel and partition it into two: one before the system call and one after it. This model, which is akin to continuations, is notoriously difficult to program [34]. Compiler technologies can assist the process [35], but the effect of ending the GPU kernel and restarting another is the same as a barrier synchronization across all GPU threads, and it adds unnecessary round-trips between the CPU and the GPU, both of which incur significant overhead.
Table I. GENESYS ENABLES NEW CLASSES OF APPLICATIONS AND SUPPORTS ALL PRIOR WORK

Type | Application | Syscalls | Description
Previously Unrealizable:
Memory Management | miniAMR | madvise, getrusage | Uses madvise to return unused memory to the OS (Sec. VIII-A)
Signals | signal-search | rt_sigqueueinfo | Uses signals to notify the host about partial work completion (Sec. VIII-B)
Filesystem | grep | read, open, close | Work-item invocations not supported by prior work; prints to terminal (Sec. VIII-C)
Device Control (ioctl) | bmp-display | ioctl, mmap | Kernel-granularity invocation to query and set up framebuffer properties (Sec. VIII-E)
Previously Realizable:
Filesystem | wordsearch | pread, read | Supports the same workloads as prior work (GPUfs) (Sec. VIII-C)
Network | memcached | sendto, recvfrom | Possible with GPUnet, but we do not need RDMA for performance (Sec. VIII-D)
Figure 1. (Left) Timeline of events when the GPU has to rely on a CPU to handle system services, and (right) when the GPU has system call invocation capabilities.
In response, recent studies invoke OS services from GPUs [25-28], as shown on the right in Figure 1. This approach eliminates the need for repeated kernel launches, enabling better GPU efficiency. System calls (e.g., request data) still require CPU processing, as they often require access to hardware resources that only the CPU interacts with. However, CPUs only need to schedule tasks in response to GPU system calls as and when needed. CPU system call processing also overlaps with the execution of other GPU threads. Studies on GPUfs, GPUnet, GPU-to-CPU callbacks, and SPIN are seminal in demonstrating such direct invocation of OS-managed services, but they suffer from key drawbacks.
Lack of generality. They target specific OS-managed services and therefore realize only specific APIs for filesystem or networking services. These interfaces are not readily extensible to other system calls/OS services.
Lack of flexibility. They focus on specific system call invocation strategies. Consequently, there has been no design space exploration on the general GPU system call interface. Questions such as the best invocation granularity (i.e., whether system calls should be invoked per work-item, work-group, or kernel) or ordering remain unexplored and, as we show, can affect performance in subtle and important ways.
Reliance on non-standard APIs. Their use of custom APIs precludes the use of many OS-managed services (e.g., memory management, signals, process/thread management, the scheduler). Further, custom APIs do not readily take advantage of existing OS code-paths that enable a richer set of system features. Recent work on SPIN points this out for filesystems, where using custom APIs causes issues with page caches, filesystem consistency, and incompatibility with virtual block devices such as software RAID.
While these past efforts broke new ground and demonstrated the value of OS services for GPU programs, they did not explore the question we pose: why not simply provide generic access from the GPU to all POSIX system calls?
III. HIGH-LEVEL DESIGN
Figure 2. High-level overview of how GPU system calls are invoked and processed on CPUs.
Figure 2 outlines the steps used by GENESYS. When the GPU invokes a system call, it has to rely on the CPU to process system calls on its behalf. Therefore, in step 1, the GPU places system call arguments and information in a portion of system memory also visible to the CPU. We designate a portion of memory as the syscall area to store this information. In step 2, the GPU interrupts the CPU, conveying the need for system call processing. Within the interrupt message, the GPU also sends the ID number of the wavefront issuing the system call. This triggers the execution of an interrupt handler on the CPU. The CPU uses the wavefront ID to read the system call arguments from the syscall area in step 3. Subsequently, in step 4, the CPU processes the interrupt and writes the results back into the syscall area. Finally, in step 5, the CPU notifies the GPU wavefront that its system call has been processed.
We rely on the ability of the GPU to interrupt the CPU and use readily available hardware [36-38] for this. However, this is not a fundamental design requirement; in fact, prior work [25, 27] uses a CPU polling thread to service a limited set of GPU system service requests instead. Further, while increasingly widespread features such as shared virtual memory and CPU-GPU cache coherence [4, 8, 9, 13] are beneficial to our design, they are not necessary. CPU-GPU communication can also be achieved via atomic reads/writes in system memory or GPU device memory [39].
IV. ANALYZING SYSTEM CALLS
While designing GENESYS, we classified all of Linux's over 300 system calls and assessed which ones to support. Some of the classifications were subjective and were debated even among ourselves. Many of the classification issues relate to the unique nature of the GPU's execution model.
Recall that GPUs use SIMD execution on thousands of concurrent threads. To keep such massive parallelism tractable, GPGPU programming languages like OpenCL [22] and CUDA [24] expose hierarchical groups of concurrently executing threads to programmers. The smallest granularity of execution is the GPU work-item (akin to a CPU thread). Several work-items (e.g., 32-64) operate in lockstep in the unit of wavefronts, the smallest hardware-scheduled unit of execution. Many wavefronts (e.g., 16) constitute programmer-visible work-groups and execute on a single GPU compute unit (CU). Work-items in a work-group can communicate among themselves using local CU caches and/or scratchpads. Hundreds of work-groups comprise a GPU kernel. The CPU dispatches work to a GPU at the granularity of a kernel. Each work-group in a kernel can execute independently. Further, it is possible to synchronize just the work-items within a single work-group [12, 22]. This avoids the cost of globally synchronizing across thousands of work-items in a kernel, which is often unnecessary in a GPU program and might not be possible under all circumstances.2
The bottom line is that GPUs rely on far greater forms of parallelism than CPUs. This implies the following OS and architectural considerations in designing system calls.
Level of OS visibility into the GPU. When a CPU thread invokes the OS, that thread has a representation within the
2 Although there is no single formally specified barrier to synchronize across work-groups today, recent work shows how to achieve the same effect by extending existing non-portable GPU inter-work-group barriers to use OpenCL 2.0 atomic operations [40].
kernel. The vast majority of modern OS kernels maintain a data structure for each thread for several common tasks (e.g., kernel resource use, permission checks, auditing). GPU tasks, however, have traditionally not been represented in the OS kernel. We believe this should not change. As discussed above, GPU threads are numerous and short-lived. Creating and managing a kernel structure for thousands of individual GPU threads would vastly slow down the system. These structures are also only useful for GPU threads that invoke the OS, and represent wasted overhead for the rest. Hence, we process system calls in OS worker threads and switch CPU contexts if necessary (see Section VI). As more GPUs support system calls, this is an area that will require careful consideration by kernel developers.
Evolution of GPU hardware. Many system calls are heavily dependent on architectural features that could be supported by future GPUs. For example, consider that modern GPUs do not expose their thread scheduler to software. This means that system calls to manage thread affinity (e.g., sched_setaffinity) are not implementable on GPUs today. However, a wealth of research has shown the benefits of GPU warp scheduling [41-45], so should GPUs require more sophisticated thread scheduling support appropriate for implementation in software, such system calls may ultimately become valuable.
With these design themes in mind, we discuss our classification of Linux's system calls.
(1) Readily implementable. Examples include pread, pwrite, mmap, munmap, etc. This group is also the largest subset, comprising nearly 79% of Linux's system calls. In GENESYS, we implemented 14 such system calls for filesystems (read, write, pread, pwrite, open, close, lseek), networking (sendto, recvfrom), memory management (mmap, munmap, madvise), system calls to query resource usage (getrusage), and signal invocation (rt_sigqueueinfo). Furthermore, we also implement device control ioctls. Some of these system calls, like read, write, and lseek, are stateful. Thus, GPU programmers must use them carefully: the current value of the file pointer determines what value is read or written by the read or write system call. This can be arbitrary if invoked at work-item or work-group granularity for the same file descriptor, because many work-items/work-groups can execute concurrently.
An important design issue is that GENESYS's support for standard POSIX APIs allows GPUs to read, write, and mmap any file descriptor Linux provides. This is particularly beneficial because of Linux's "everything is a file" philosophy: GENESYS readily supports features like the terminal for user I/O, pipes (including redirection of stdin, stdout, and stderr), files in /proc to query process environments, files in /sys to query and manipulate kernel parameters, etc. Although our studies focus on Linux, the broader domains of OS services represented by the specific system calls generalize to other
Table II. EXAMPLES OF SYSTEM CALLS THAT REQUIRE HARDWARE CHANGES TO BE IMPLEMENTABLE ON GPUS. IN TOTAL, THIS GROUP CONSISTS OF 13% OF ALL LINUX SYSTEM CALLS; IN CONTRAST, WE BELIEVE THAT 79% OF LINUX SYSTEM CALLS ARE READILY IMPLEMENTABLE.

Type | Examples | Reason that it is not currently implementable
capabilities | capget, capset | Needs GPU thread representation in the kernel
namespace | setns | Needs GPU thread representation in the kernel
policies | set_mempolicy | Needs GPU thread representation in the kernel
thread scheduling | sched_yield, set_cpu_affinity | Needs better control over the GPU scheduler
signals | sigaction, sigsuspend, sigreturn, sigprocmask | Signals require the target thread to be paused and then resumed after the signal action has been completed; GPU threads cannot be targeted. It is currently not possible to independently set program counters of individual threads. Executing signal actions in newly spawned threads might require freeing of GPU resources.
architecture-specific | ioperm | Not accessible from GPU
OSes like FreeBSD and Solaris.
At the application level, implementing this array of system calls opens new domains of OS-managed services for GPUs. In Table I, we summarize previously unimplementable system calls realized and studied in this paper. These include applications that use madvise for memory management and rt_sigqueueinfo for signals. We also go beyond prior work on GPUfs by supporting filesystem services that require more flexible APIs, with work-item invocation capabilities for good performance. Finally, we continue to support previously implementable system calls.
(2) Useful, but implementable only with changes to GPU hardware. Several system calls (13% of the total) seem useful for GPU code but are not easily implementable because of Linux's existing design. Consider sigsuspend/sigaction: there is no kernel representation of a GPU work-item to manage and dispatch a signal to. Additionally, there is no lightweight method to alter the GPU program counter of a work-item from the CPU kernel. One approach is for signal masks to be associated with the GPU context and for signals to be delivered as additional work-items. This works around the absence of GPU work-item representation in the kernel. However, POSIX requires threads that process signals to pause execution and resume only after the signal has been processed. Targeting the entire GPU context would mean that all GPU execution needs to halt while the work-item processing the signal executes, which goes against the parallel nature of GPU execution. Recent work has, however, shown the benefits of hardware support for dynamic kernel launch, which allows on-demand spawning of kernels on the GPU without any CPU intervention [46]. Should such approaches be extended to support thread recombination, assembling multiple signal handlers into a single warp (akin to prior approaches on control flow divergence [42]), sigsuspend or sigaction may become implementable. Table II presents more examples of system calls that are not currently implementable (in their original semantics).
(3) Requires extensive modification to be supported. This group (8% of the total) contains perhaps the most controversial set of system calls. At this time, we do not believe that it is worth the implementation effort to support these system calls. For example, fork necessitates cloning a copy of the executing caller's GPU state. Technically, this can be done (e.g., it is how GPGPU code is context-switched with the graphics shaders), but it seems unnecessary at this time.
V. DESIGN SPACE EXPLORATION
A. GPU-Side Design Considerations
Invocation granularity. In assessing how best to use GPU system calls, several questions arise. The first and most important question is: how should system calls be aligned with the hierarchy of GPU execution groups? Should a GPU system call be invoked separately for each work-item, once for every work-group, or once for the entire GPU kernel?
Consider a GPU program that writes sorted integers to a single output file. One might, at first blush, invoke the write system call at each work-item. This can present correctness issues, however, because write is position-relative and requires access to a global file pointer. Without synchronizing the execution of write across work-items, the output file will be written in a non-deterministic, unsorted order.
Using different system call invocation granularities can fix this issue. One could, for example, use a memory location to temporarily buffer the sorted output. Once all work-items have updated this buffer, a single write system call at the end of the kernel can be used to write the contents of the buffer to the output file. This approach loses some benefits of the GPU's parallel resources, because the entire system call latency is exposed on the program's critical path and might not be overlapped with the execution of other work-items. Alternatively, one could use pwrite system calls instead of write system calls. Because pwrite allows programmers to specify the absolute file position where the output is to be written, per-work-item invocations present no correctness issue. However, per-work-item pwrites result in a flood of system calls, potentially harming performance.
Overly coarse kernel-grained invocations also restrict performance by reducing the possibility of overlapping system call processing with GPU execution. A compromise may be to invoke a pwrite system call per work-group, buffering
Figure 3. Work-items in a work-group (shown as a blue box) execute strongly ordered system calls.
Figure 4. Work-group invocations can be relax-ordered by removing one of the two barriers.
the sorted output of the work-items until the per-work-group buffers are fully populated. Section VII demonstrates that these decisions can lead to a 1.75x performance difference.
System call ordering semantics. When programmers invoke system calls on CPUs, they expect that all program instructions before the system call will complete execution before the system call executes. They also expect that instructions after the system call will only commence once the system call returns. We call this "strong ordering". For GPUs, however, we introduce "relaxed ordering" semantics. The notion of relaxed ordering is tied to the hierarchical execution scopes of the GPU and is needed for both correctness and performance.
Figure 3 shows a programmer-visible work-group (in blue) consisting of four wavefronts: A, B, C, and D. Each wavefront has two work-items (e.g., A0 and A1). If system calls (SysCs) are invoked per work-item, they are strongly ordered. Another approach is depicted in Figure 4, where one work-item, A0, invokes a system call on behalf of the entire work-group. Strong ordering is achieved by placing work-group barriers (Bar1, Bar2) before and after system call execution. One can remove these barriers with relaxed ordering, allowing threads in wavefronts B, C, and D to execute, overlapping with the CPU's processing of A's system call.
For correctness, we need to allow programmers to use relaxed ordering when system calls are invoked at kernel granularity (across work-groups). This is because kernels can consist of more work-items than can concurrently execute on the GPU. For strongly ordered system calls at the kernel level, all kernel work-items must finish executing pre-invocation instructions prior to invoking the system call. But all work-items cannot execute concurrently, because GPU runtimes do not preemptively context-switch work-items of the same kernel. It is not always possible for all work-items to execute all instructions prior to the system call. Strong ordering at kernel granularity risks deadlocking the GPU.
At the work-group invocation granularity, relaxed ordering can improve performance by avoiding synchronization overheads and overlapping CPU-side system call processing with the execution of other work-items. The key is to remove the barriers Bar1/Bar2 in Figure 4. To do this, consider that, from the application point of view, system calls are usually producers or consumers of data. Take a consumer call like write, invoked at the work-group level. Real-world GPU programs may use multiple work-items to generate the data for the write, but instructions after the write call typically do not depend on the write's outcome. Therefore, other work-items in the group need not wait for the completion of the system call, meaning that we can remove Bar2, improving performance. Similarly, producer system calls like read typically require system calls to return before other work-items can start executing program instructions post-read, but do not necessarily require other work-items in the work-group to finish executing all instructions before the read invocation. Bar1 in Figure 4 becomes unnecessary in these cases. The same observations apply to per-kernel system call invocations, which need relaxed ordering for correctness anyway.
In summary, work-item invocations imply strong ordering. Programmers balance performance/programmability for work-group invocations by choosing strong or relaxed ordering. Finally, programmers must use relaxed ordering for kernel invocations so that the GPU does not deadlock.
Blocking versus non-blocking approaches. Most traditional CPU system calls (barring those for asynchronous I/O, e.g., aio_read, aio_write) return only after the system call completes execution. Blocking system calls have a disproportionate impact on performance because GPUs use SIMD execution. In particular, if even a single work-item is blocked on a system call, no other work-item in the wavefront can make progress either. We find that GPUs can often benefit from non-blocking system calls that can immediately return before system call processing completes. Non-blocking system calls can therefore overlap system call processing on the CPU with useful GPU work, improving performance by as much as 30% in some of our studies (see Section VII).
The concepts of blocking/non-blocking and strong/relaxed ordering are related but orthogonal. The strictness of ordering refers to the question of when a system call can be invoked with respect to the progress of work-items within its granularity of invocation. System call blocking refers to how the return from a system call relates to the completion of its processing. Relaxed ordering and non-blocking can be combined in several useful ways. For example, consider a case where a GPU program writes to a file at work-group granularity. The execution may not depend upon the output of the write, but the programmer may want to ensure that the write successfully completes. In such a scenario, blocking writes may be invoked with weak ordering. Weak ordering permits all but one wavefront in the work-group to proceed without waiting for completion of the write (see Section VI).
Blocking invocation, however, ensures that one wavefront waits for the write to complete and can raise an alarm if the write fails. Consider another scenario, where a programmer wishes to prefetch data using read system calls but may not use the results immediately. Here, weak ordering with non-blocking invocation is likely to provide the best performance without breaking the program's semantics. In short, different combinations of blocking and ordering enable programmers to fine-tune performance and programmability tradeoffs.
B. CPU Hardware
GPUs rely on extreme parallelism for performance. This means there may be bursts of system calls that CPUs need to process. System call coalescing is one way to increase the throughput of system call processing. The idea is to collect several GPU system call requests and batch their processing on the CPU. The benefit is that CPUs can manage multiple system calls as a single unit, scheduling a single software task to process them. This reduction in task management, scheduling, and processing overheads can often boost performance (see Section VII). Coalescing must be performed judiciously, as it improves system call processing throughput at the expense of latency. It also implicitly serializes the processing of system calls within a coalesced bundle.
To allow the GPGPU programmer to balance the benefits of coalescing with its potential challenges, GENESYS accepts two parameters: a time window length within which the CPU coalesces system calls, and the maximum number of system call invocations that can be coalesced within the time window. Section VII shows that system call coalescing can improve performance by as much as 10-15%.
C. CPU-GPU Communication Hardware
Prior work implemented system calls using polling, where GPU wavefronts monitored predesignated memory locations populated by CPUs upon system call completion. But recent advances in GPU hardware enable alternate modes of CPU-GPU communication. For example, AMD GPUs now allow wavefronts to relay interrupts to CPUs and then halt execution, relinquishing SIMD hardware resources [36]. CPUs can in turn message the GPU to wake up halted wavefronts.
We have implemented polling and halt-resume approaches in GENESYS. With polling, if the number of memory locations that needs to be polled by the GPU exceeds its cache size, frequent cache misses lower performance. On the other hand, halt-resume has its own overheads, namely the latency to resume a halted wavefront.
We have found that polling yields better performance when system calls are invoked at coarser work-group granularities, since fewer memory locations are needed to convey information at the work-group versus work-item level. Consequently, the memory locations easily fit in the GPU's L2 data cache. When system calls are invoked per work-item, however, the sheer number of such memory locations becomes so high that cache thrashing becomes an issue. In these situations, halt-resume approaches outperform polling. We quantify the impact of this in the following sections.

Table III
SYSTEM CONFIGURATION USED FOR OUR STUDIES

SoC:              Mobile AMD FX-9800P
CPU:              4x 2.7GHz, AMD Family 21h Model 101h
Integrated GPU:   758 MHz, AMD GCN 3 GPU
Memory:           16 GB Dual-Channel DDR4-1066MHz
Operating system: Fedora 26 using ROCm stack 1.6 (based on Linux 4.11)
Compiler:         HCC-0.10.17166 + LLVM 5.0; C++AMP with HC extensions

Figure 5. Content of each slot in syscall area.

Figure 6. State transition diagram for a slot in syscall area. Green shows GPU state/actions; blue shows that of the CPU.
VI. IMPLEMENTING GENESYS
We implemented GENESYS on the system in Table III. We used an AMD FX-9800P processor with an integrated GPU and ran the open-source Radeon Open Compute (ROCm) software stack [47]. Although we use a system with an integrated GPU, GENESYS is not specific to integrated GPUs and generalizes to discrete GPUs. We modified the GPU driver and Linux kernel to enable GPU system calls. We also modified the HCC compiler to permit GPU system call invocations in C++AMP.
Invoking GPU system calls: GENESYS permits work-item, work-group, and kernel-level invocation. At work-group or kernel-level invocations, a single work-item is designated as the caller. For strongly ordered work-group invocations, we use work-group scope barriers before and after system call invocations. For relaxed ordering, a barrier is placed either before (for consumer system calls) or after (for producer calls) system call invocation.
GPU to CPU communication: GENESYS facilitates efficient GPU to CPU communication of system call requests. GENESYS uses a preallocated shared memory syscall area to allow the GPU to convey parameters to the CPU (see Section III). The syscall area maintains one slot for each active GPU work-item. The OS and driver code can identify the desired slot by using the hardware ID of the active work-item, which is available to GPU runtimes. This hardware ID is distinct from the programmer-visible work-item ID. Each of the work-items has a unique work-item ID visible to the application programmer. At any one point in time, however, only a subset of these work-items executes (as permitted by the GPU's CU count, supported work-groups per CU, and SIMD width). The hardware ID distinguishes among these active work-items. Overall, our system uses 64 bytes per slot, totaling 1.25 MB of syscall area.
Figure 5 shows the contents in a slot: the requested system call number, the request state, system call arguments (as many as 6 in Linux), and padding to avoid false sharing. The field for arguments is also re-purposed for the return value of the system call. When a GPU program's work-item invokes a system call, it atomically updates the state of the corresponding slot from free to populating (Figure 6). If the slot is not free, system call invocation is delayed. Once the state is populating, the invoking work-item populates the slot with system call information and changes the state to ready. The work-item also adds one bit of information about whether the invocation is blocking or non-blocking. The work-item then interrupts the CPU using a scalar wavefront GPU instruction³ (s_sendmsg on AMD GPUs). For blocking invocation, the work-item either waits and polls the state of the slot or suspends itself using halt-resume.
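The slot layout and the GPU-side state transitions can be sketched in C. The field names, exact layout, and the function invoke_syscall are our assumptions; only the 64-byte slot size, the six-argument limit, the blocking bit, and the free → populating → ready transitions come from the text.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

enum slot_state { FREE, POPULATING, READY, PROCESSING, FINISHED };

/* One per-work-item slot, padded to a single 64-byte cache line. */
typedef struct {
    _Atomic uint32_t state;    /* enum slot_state */
    uint32_t syscall_nr;       /* requested system call number */
    uint64_t args[6];          /* up to 6 args; re-purposed for the result */
    uint8_t  blocking;         /* one bit: blocking or non-blocking */
    uint8_t  pad[7];           /* fill out the cache line */
} syscall_slot_t;

_Static_assert(sizeof(syscall_slot_t) == 64, "slot must be one cache line");

/* GPU-side invocation: claim the slot (free -> populating), fill it,
 * then publish it (populating -> ready) before interrupting the CPU.
 * Returns 0 if the slot was busy, in which case invocation is delayed. */
int invoke_syscall(syscall_slot_t *slot, uint32_t nr,
                   const uint64_t args[6], int blocking) {
    uint32_t expected = FREE;
    if (!atomic_compare_exchange_strong(&slot->state, &expected, POPULATING))
        return 0;
    slot->syscall_nr = nr;
    memcpy(slot->args, args, sizeof slot->args);
    slot->blocking = (uint8_t)blocking;
    atomic_store(&slot->state, READY);  /* next: s_sendmsg to the CPU */
    return 1;
}
```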
CPU-side system call processing: Once the GPU interrupts the CPU, system call processing commences. The interrupt handler creates a new kernel task and adds it to Linux's work-queue. This task is also passed the hardware ID of the requesting wavefront. At an expedient future point in time, an OS worker thread executes this task. The task scans the 64 syscall slots of the given hardware ID and atomically switches any ready system call requests to the processing state. The task then carries out the system call work.
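The CPU-side claiming step can be sketched similarly. The 64-slot scan and the ready → processing transition follow the text; the function name claim_ready_slots and the representation of states as an atomic int array are our assumptions.

```c
#include <assert.h>
#include <stdatomic.h>

enum { FREE, POPULATING, READY, PROCESSING };
#define SLOTS_PER_HWID 64   /* one slot per work-item of a wavefront */

/* CPU worker task: scan the 64 slots of one hardware ID and atomically
 * claim every ready request (ready -> processing). The caller would
 * then perform the actual system call work for each claimed slot.
 * Returns the number of requests claimed. */
int claim_ready_slots(_Atomic int states[SLOTS_PER_HWID]) {
    int claimed = 0;
    for (int i = 0; i < SLOTS_PER_HWID; i++) {
        int expected = READY;
        if (atomic_compare_exchange_strong(&states[i], &expected, PROCESSING))
            claimed++;
    }
    return claimed;
}
```

The compare-and-swap makes concurrent worker tasks safe: each ready slot is claimed by exactly one task, and slots still being populated are left untouched.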
A challenge is that Linux's traditional system call routines implicitly assume that they are to be executed within the context of the original process invoking the system call. Consequently, they use context-specific variables (e.g., the current variable used to identify the process context). This presents a challenge for GPU system calls, which are instead serviced purely in the context of the OS's worker thread. GENESYS overcomes this challenge in two ways: it either switches to the context of the original CPU program that invoked the GPU kernel, or it provides additional context information in the code performing system call processing.
³ Scalar wavefront instructions are part of the Graphics Core Next ISA and are executed once for the entire wavefront, rather than for each active work-item. See Chapter 4.1 in the manual [36].
The exact strategy is determined on a case-by-case basis.
GENESYS implements coalescing by waiting for a predetermined amount of time in the interrupt handler before enqueueing a task to process a system call. If multiple requests are made to the CPU during this time window, they are coalesced such that they can be handled as a single unit of work by the OS worker-thread. GENESYS uses Linux's sysfs interface to communicate coalescing parameters.
Communicating results from the CPU to the GPU: Once the CPU worker-thread finishes processing the system call, the results are put in the field for arguments in the slot for blocking requests. Further, the thread also changes the state of the slot to finished for blocking system calls. For non-blocking invocations, the state is changed to free. The invoking GPU work-item is then re-scheduled (on hardware that supports wavefront suspension) or automatically restarted because it was polling on this state. The work-item can consume the result and continue execution.
Other architectural design considerations: GENESYS requires two key data structures to be exchanged between GPUs and CPUs: the syscall area and, for some system calls, syscall buffers that maintain data required for system call processing. We discovered that carefully leveraging architectural support for CPU-GPU cache coherence in the context of these data structures was vital to overall performance.
Consider, for example, the syscall area. As described in Section III, every GPU work-item invoking a system call is allocated space in the syscall area. Like many GPUs, the one used as our experimental platform supports L2 data caches that are coherent with CPU caches/memory but integrates non-coherent L1 data caches. At first blush, one may decide to manually invalidate L1 data cache lines. However, we sidestepped this issue by restricting per-work-item slots in the syscall area to individual cache lines. This design permits us to use atomic instructions to access memory; these atomic instructions, by design, force lookups of the L2 data cache and guarantee visibility of the entire cacheline, sidestepping the coherence challenges of L1 GPU data caches. We quantify the measured overheads of the atomic operations we use for GENESYS in Table IV, comparing them to the latency of a standard load operation. Through experimentation, we achieved good performance using cmp-swap atomics to claim a slot in the syscall area when GPU work-items invoked system calls, atomic-swaps to change state, and atomic-loads to poll for completion.
Unfortunately, the same approach of using atomics does not yield good performance for accesses to syscall buffers. The key issue is that syscall buffers can be large and span multiple cache lines. Using atomics here meant that we suffered the latency of several L2 data cache accesses to syscall buffers. We found that a better approach was to eschew atomics in favor of manual software L1 data cache coherence. This
Table IV
PROFILED PERFORMANCE OF GPU ATOMIC OPERATIONS

Op:        cmp-swap  swap   atomic-load  load
Time (us): 1.352     0.782  0.360        0.243
[Figure 7 shows two plots of GPU time (s) versus file size (32-2048 MB). Left: work-item (wi), work-group (wg), and kernel invocation granularities. Right: work-group sizes wg64, wg256, wg512, and wg1024.]

Figure 7. Impact of system call invocation granularity. The graphs show average and standard deviation of 10 runs.
meant, for example, that we preceded sys_write system calls with an L1 data cache flush.
VII. MICROBENCHMARK EVALUATIONS
Invocation granularity: Figure 7 quantifies performance for a microbenchmark that uses pread on files in tmpfs⁴. The x-axis plots the file size, and the y-axis shows read time, with lower values being better. Within each cluster, we separate runtimes for different pread invocation granularities.
Figure 7(left) shows that work-item invocation granularities tend to perform worst. This is not surprising, as it is the finest granularity of system call invocation and leads to a flood of individual system calls that overwhelm the CPU. On the other end of the granularity spectrum, kernel-level invocation is also challenging, as it generates a single system call and fails to leverage any potential parallelism in processing of system call requests. This is particularly pronounced at large file sizes (e.g., 2GB). A good compromise is to use work-group invocation granularities. It does not overwhelm the CPU with system call requests while still being able to exploit parallelism in servicing system calls.
When using work-group invocation, an important question is how many work-items should constitute a work-group. While Figure 7(left) uses 64 work-items in a work-group, Figure 7(right) quantifies the performance impact of pread system calls as we vary work-group sizes from 64 (wg64) to 1024 (wg1024) work-items. In general, larger work-group sizes enable better performance, as there are fewer unique system calls necessary to handle the same amount of work.
Blocking and ordering strategies: To quantify the impact of blocking/non-blocking with strong/relaxed ordering, we
⁴ Tmpfs is a filesystem without permanent backing storage. In other words, all structures and data are memory-resident.
[Figure 8 plots GPU time per permutation (ms) versus the number of compute permutations (0-40) for weak-block, weak-non-block, strong-block, and strong-non-block.]

Figure 8. Performance implications of system call blocking and ordering semantics. The graph shows average and standard deviation of 80 runs.
designed a GPU microbenchmark that performs block permutation on an array, similar to the permutation steps performed in DES encryption. The input data array is preloaded with random values and divided into 8KB blocks. Work-groups, each executing 1024 work-items, independently permute blocks. The results are written to a file using pwrite at work-group granularity. The pwrite system calls for one block of data are overlapped with permutations on other blocks of data. To vary the amount of computation per system call, we permute multiple times before writing the result.
Figure 8 quantifies the impact of using blocking versus non-blocking system calls with strong and weak ordering. The x-axis plots the number of permutation iterations performed on each block by each work-group before writing the results. The y-axis plots execution time for one permutation (lower is better). Figure 8 shows that strongly ordered blocking invocations (strong-block) hurt performance. This is expected, as they require work-group scoped barriers to be placed before and after pwrite invocations. The GPU's hardware resources for work-groups are stalled waiting for the pwrite to complete. Not only does the inability to overlap GPU parallelism hurt strongly ordered blocking performance, it also means that GPU performance is heavily influenced by CPU-side performance vagaries like synchronization overheads, servicing other processes, etc. This is particularly true at iteration counts where system call overheads dominate GPU-side computation (below 15 compute iterations). Even when the application becomes GPU-compute bound, performance remains non-ideal.
Figure 8 shows that when pwrite is invoked in a non-blocking manner (with strong ordering), performance improves. This is because non-blocking pwrites permit the GPU to end the invoking work-group, freeing the GPU resources it was occupying. CPU-side pwrite processing can overlap with the execution of another work-group permuting on a different block of data. Figure 8 shows that latencies generally drop by 30% compared to blocking calls at low iteration counts. At higher iteration counts (beyond 16), these benefits diminish because the latency to perform repeated
[Figure 9 plots CPU memcpy throughput (GB/s) versus the number of GPU atomically polled cachelines (0-20000) on the AMD FX-9800P with DDR4-1066MHz.]

Figure 9. Impact of polling on memory contention.
permutations dominates any system call processing times.
For relaxed ordering with blocking (weak-block), the post-system-call work-group-scope barrier is eliminated. One out of every 16 wavefronts⁵ in the work-group waits for the blocking system call while others exit, freeing GPU resources. The GPU can use these freed resources to run other work-items to hide CPU-side system call latency. Consequently, the performance trends follow those of strong-non-block, with minor differences in performance arising from differences in GPU scheduling of the work-groups for execution. Finally, Figure 8 shows that system call latency is best hidden using relaxed and non-blocking approaches (weak-non-block).
Polling/halt-resume and memory contention: As previously discussed, polling at the work-item invocation granularity leads to memory reads of several thousands of per-work-item memory locations. We quantify the resulting memory contention in Figure 9, which showcases how the throughput of CPU accesses decreases as the number of polled GPU cache lines increases. Once the number of polled memory locations outstrips the GPU's L2 cache capacity (roughly 4096 cache lines on our platform), the GPU polls locations spilled to the DRAM. This causes contention on the memory controllers shared between GPUs and CPUs. We advocate using GENESYS with halt-resume approaches at such granularities of system call invocation.
Interrupt coalescing: Figure 10 shows the performance impact of coalescing system calls. We use a microbenchmark that invokes pread. We read data from files of different sizes using a constant number of work-items. More data is read per pread system call from the larger files. The x-axis shows the amount of data read and quantifies the latency per requested byte. We present two bars for each point on the x-axis, illustrating the average time needed to read one byte with the system call in the absence of coalescing and when up to eight system calls are coalesced. Coalescing is most beneficial when small amounts of data are read. Reading more data takes longer; the overhead reduction is then negligible compared to the significantly longer time to process the system call.
⁵ Each wavefront has 64 work-items. Thus, a 1024 work-item work-group has 16 wavefronts.
[Figure 10 plots GPU time per requested byte (ms) versus the size per syscall (64B-4096B), comparing uncoalesced invocations with coalescing of up to 8 interrupts.]

Figure 10. Implications of system call coalescing. The graph shows average and standard deviation of 20 runs.
VIII. CASE STUDIES
A. Memory Workload
We now assess the end-to-end performance of workloads that use GENESYS. Our first application requires memory management system calls. We studied miniAMR [48] and used the madvise system call directly from the GPU to better manage memory. MiniAMR performs 3D stencil calculations using adaptive mesh refinement and is a candidate for memory management because it varies its memory needs in a data-model-dependent manner. For example, when simulating regions experiencing turbulence, miniAMR needs to model with higher resolution. However, if lower resolution is possible without compromising simulation accuracy, miniAMR reduces memory and computation usage, making it possible to free memory. While relinquishing excess memory in this way is not possible in traditional GPU programs without explicitly dividing the offload into multiple kernels interspersed with CPU code (see Figure 1), GENESYS permits this with direct GPU madvise invocations. We invoke madvise at work-group granularity with non-blocking and weak ordering. We leverage GENESYS' generality by also using getrusage to read the application resident set size (RSS). When the RSS exceeds a watermark, madvise relinquishes memory.
[Figure 11 plots RSS (GB, 0-4.5) over time (s, 0-200) for the rss-3GB and rss-4GB watermarks.]

Figure 11. Memory footprint of miniAMR using getrusage and madvise to hint at unused memory.
We execute miniAMR with a dataset of 4.1GB, just beyond the hard limit we put on physical memory available to the GPU. Without using madvise, memory swapping increases so dramatically that it triggers GPU timeouts, causing the existing GPU device driver to terminate the application. Because of this, there is no baseline to compare to, as the baseline simply does not complete.
Figure 11 shows two results: one for a 3GB RSS watermark and one for 4GB. Not only does GENESYS enable miniAMR to complete, it also permits the programmer to balance memory usage and performance. While rss-3GB lowers memory utilization, it also worsens runtime compared to rss-4GB. This performance gap is expected: the more memory is released, the greater the likelihood that the GPU program suffers from page faults when the memory is touched again in the future, and the more frequently the madvise system call is invoked. Overall, Figure 11 illustrates that GENESYS allows programmers to perform memory allocation to trade memory usage and performance on GPUs, analogous to CPUs.
B. Workload Using Signals
[Figure 12 compares running time (s, 0-12) of baseline versus GENESYS.]

Figure 12. Runtime of CPU-GPU map-reduce workload.
GENESYS also enables system calls that permit the GPU to send asynchronous notifications to the CPU. This is useful in many scenarios. We study one such scenario and implement a map-reduce application called signal-search. The application runs in two phases. The first phase performs parallel lookup in a
data array, while the second phase processes blocks of retrieved data and computes sha512 checksums on them. The first phase is a good fit for GPUs, since the highly parallel lookup fits their execution model, while the second phase is more appropriate for CPUs, many of which support performance acceleration of sha checksums via dedicated instructions.
Without support for signal invocation on GPUs, programmers would launch a kernel with the entire parallel lookup on the GPU and wait for it to conclude before proceeding with the sha512 checksums. GENESYS overcomes this restriction and permits a heterogeneous version of this code, where GPU work-groups can emit signals using rt_sigqueueinfo to the CPU, indicating per-block completions of the parallel search. As a result, the CPU can begin processing these blocks immediately, permitting overlap with GPU execution.
Operationally, rt_sigqueueinfo is a generic signal system call that allows the caller to fill a siginfo structure that is passed along with the signal. In our workload, we find that work-group level invocations perform best, so the GPU passes the identifier of this work-group through the siginfo structure. Figure 12 shows that using work-group invocation granularity and non-blocking invocation results in roughly 14% performance speedup over the baseline.
C. Storage Workloads
We have also studied how to use GENESYS to support storage workloads. In some cases, GENESYS permits the implementation of workloads supported by GPUfs, but in more efficient ways.
[Figure 13(a) plots running time (s) for CPU-grep, GPU-GENESYS (WG), GPU-GENESYS (WI), GPU-GENESYS (WI)-suspend, and CPU-openmp. Figure 13(b) plots running time (s) for GPU-GENESYS, GPU-NOSYS, and CPU.]

Figure 13. a) Standard grep and OpenMP grep versus work-item/work-group invocations with GENESYS. b) Comparing the performance of CPU, GPU with no system calls, and GENESYS implementations of wordcount. Graphs show average and standard deviation of 10 runs.
Supporting storage workloads more efficiently than GPUfs: We implement a workload that takes as input a list of words and a list of files. It then performs grep -F -l on the GPU, identifying which files contain any of the words on the list. As soon as these files are found, they are printed to the console. This workload cannot be supported by GPUfs without significant code refactoring because of its use of custom APIs. Instead, since GENESYS naturally adheres to the UNIX "everything is a file" philosophy, porting grep to GPUs requires only a few hours of programming.
Figure 13 shows the results of our GPU grep experiments. We compare a standard CPU implementation, a parallelized OpenMP CPU implementation, and two GPU implementations with GENESYS, with non-blocking system calls invoked at work-item (WI) and work-group (WG) granularities. Furthermore, since work-item invocations can sometimes achieve better performance using halt-resume (versus work-group and kernel invocations, which always achieve better performance with polling), we separate results for WI-polling and WI-halt-resume. GENESYS enables our GPU grep implementation to print output lines to the console or files using standard OS output. GENESYS achieves 20-23× speedups over an OpenMP version of grep.
Figure 13 also shows that GENESYS' flexibility with invocation granularity, blocking/non-blocking, and strong/relaxed ordering can boost performance (see Section V). For our grep example, because a file only needs to be searched until the first instance of a matching word, a work-item can immediately invoke a write of the filename rather than waiting for all matching files. We find that WI-halt-resume outperforms both WG and WI-polling by roughly 3-4%.
[Figure 14 plots CPU utilization (%, 0-100) and I/O throughput (MB/s, 0-200) over time (s, 0-45) for GPU-NOSYS, GPU-GENESYS, and CPU.]

Figure 14. Wordcount I/O and CPU utilization reading from SSD. Graphs show average and standard deviation of 10 runs.
Workload from prior work: GENESYS also supports a version of the same workload that is evaluated in the original GPUfs work: traditional word count using open, read, and close system calls. Figure 13 shows our results. We compare the performance of a parallelized CPU version of the workload with OpenMP, a GPU version of the workload with no system calls, and a GPU version with GENESYS. Both CPU and GPU workloads are configured to search for occurrences of 64 strings. We found that with GENESYS, these system calls were best invoked at work-group granularity with blocking and weak-ordering semantics. All results are collected on a system with an SSD.
We found that GENESYS achieves nearly 6× performance improvement over the CPU version. Without system call support, the GPU version is far worse than the CPU version. Figure 14 sheds light on these benefits. We plot traces for CPU and GPU utilization and disk I/O throughput. GENESYS extracts much higher throughput from the underlying storage device (up to 170 MB/s compared to the CPU's 30 MB/s). Offloading search string processing to the GPU frees up the CPU to process system calls effectively. The change in CPU utilization between the GPU workload and CPU workload reveals this trend. In addition, we found that the GPU's ability to launch more concurrent I/O requests enabled the I/O scheduler to make better scheduling decisions.
D. Network Workloads
We have studied the benefits of GENESYS for network I/O in the context of memcached. While this can technically be implemented using GPUnet, the performance implications are unclear because the original GPUnet paper used APIs that extracted performance using dedicated RDMA hardware. We make no such assumptions and focus on the two core commands, SET and GET. SET stores a key-value pair, and GET retrieves the value associated with a key if it is present. Our implementation supports a binary UDP memcached version with a fixed-size hash table as back-end storage. The hash table is shared between CPU and GPU, and its bucket size and count are configurable. Further, this memcached
[Figure 15 plots latency (ms) and throughput (ops/s) for hits and misses, comparing CPU, GPU-GENESYS, and GPU-NOSYS.]

Figure 15. Latency and throughput of memcached. Graphs show average and standard deviation of 20 runs.
implementation enables concurrent operations: CPUs can handle SETs and GETs, while the GPU supports only GETs. Our GPU implementation parallelizes the hash computation, bucket lookup, and data copy. We use sendto and recvfrom system calls for UDP network access. These system calls are invoked at work-group granularity with blocking and weak ordering, as this performs best.
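The network path can be sketched with a minimal sendto/recvfrom round trip over loopback. This is an illustrative host-side sketch of the two system calls the GPU invokes, not the memcached protocol itself; udp_roundtrip is a hypothetical helper that simply echoes the request.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send `msg` from a client socket to a loopback "server" socket, echo
 * it back, and read the reply: one sendto/recvfrom pair in each
 * direction. Returns 0 on success with a NUL-terminated reply. */
int udp_roundtrip(const char *msg, char *reply, size_t replysz) {
    int srv = socket(AF_INET, SOCK_DGRAM, 0);
    int cli = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                       /* OS picks a free port */
    if (srv < 0 || cli < 0 ||
        bind(srv, (struct sockaddr *)&addr, sizeof addr) != 0)
        return -1;
    socklen_t alen = sizeof addr;
    getsockname(srv, (struct sockaddr *)&addr, &alen);  /* learn port */

    sendto(cli, msg, strlen(msg), 0, (struct sockaddr *)&addr, alen);
    char buf[512];
    struct sockaddr_in peer;
    socklen_t plen = sizeof peer;
    ssize_t n = recvfrom(srv, buf, sizeof buf, 0,
                         (struct sockaddr *)&peer, &plen);
    sendto(srv, buf, (size_t)n, 0, (struct sockaddr *)&peer, plen);
    ssize_t m = recvfrom(cli, reply, replysz - 1, 0, NULL, NULL);
    close(srv);
    close(cli);
    if (m < 0)
        return -1;
    reply[m] = '\0';
    return 0;
}
```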
Figure 15 compares the performance of a CPU version of this workload with GPU versions using GENESYS. GPUs accelerate memcached by parallelizing lookups on buckets with more elements. For example, Figure 15 shows speedups when there are 1024 elements per bucket (with 1KB data size). Without system calls, GPU performance lags behind CPU performance. However, GENESYS achieves 30-40% latency and throughput benefits over not just CPU versions but also GPU versions without direct system calls.
E. Device Control
Finally, we also used GENESYS to implement ioctl system calls. As an example, we used ioctl to query and control framebuffer settings. The implementation is straightforward: the GPU opens /dev/fb0 and issues a series of ioctl commands to query and set settings of the active frame buffer. It then proceeds to mmap the framebuffer memory and fill it with data from a previously mmaped raster image. This results in the image being displayed on the computer screen. While not a critical GPGPU application, this ioctl example demonstrates the generality and flexibility of OS interfaces implemented by GENESYS.
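Since most test environments lack /dev/fb0, the query side of the open/ioctl pattern can be demonstrated with FIONREAD, a generic ioctl that asks how many bytes are waiting on a descriptor. bytes_pending is a hypothetical helper; the framebuffer-specific commands the text describes (e.g., FBIOGET_VSCREENINFO) follow the same request/response shape.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Query how many bytes are waiting to be read on `fd` using FIONREAD,
 * a generic ioctl available on pipes and sockets. The framebuffer
 * example in the text follows the same open/ioctl(/mmap) pattern,
 * just with fb-specific request codes issued against /dev/fb0. */
int bytes_pending(int fd) {
    int n = 0;
    if (ioctl(fd, FIONREAD, &n) != 0)
        return -1;
    return n;
}
```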
IX. DISCUSSION
Asynchronous system call handling: GENESYS enqueues the GPU system call's kernel task and processes it outside of the interrupt context. We do this because Linux is designed such that few operations can be processed in an interrupt handler. A potential concern with this design, however, is that it defers system call processing to a point potentially past the end of the lifetime of the GPU thread, and potentially of the process that created the GPU thread itself. It is an example of a more general problem with asynchronous system calls [49].
Figure 16. Raster image copied to the framebuffer by the GPU.
Our solution is to provide a new function call, invoked by the CPU, that ensures all GPU system calls have completed before the termination of the process.
Related work: Beyond work already discussed [25, 26, 28], the latest generation of C++AMP [23] and CUDA [24] provide access to the memory allocator. These efforts use a user-mode service thread on the CPU to proxy requests from the GPU [27]. Like GENESYS, system call requests are placed in a shared queue by the GPU. From there, however, the designs are different. Their user-mode thread polls this shared queue and "thunks" the request to the libc runtime or OS. This incurs added overhead for entering and leaving the OS kernel.
Some studies provide network access to GPU code [26, 50-52]. NVidia provides GPUDirect [52], used by several MPI libraries [53-55], which allows the NIC to bypass main memory and communicate directly with memory on the GPU itself. GPUDirect does not provide a programming interface for GPU-side code; the CPU must initiate communication with the NIC. Oden exposed the memory-mapped control interface of the NIC to the GPU and thereby allowed the GPU to directly communicate with the NIC [51]. This low-level interface, however, lacks the benefits of a traditional OS interface (e.g., protection, sockets, TCP).
X. CONCLUSIONS
We shed light on research questions fundamental to the idea of accessing OS services from accelerators by realizing an interface for generic POSIX system call support on GPUs. Enabling such support requires subtle changes to existing kernels. In particular, traditional OSes assume that system call processing occurs in the same context as the invoking thread, and this needs to change for accelerators. We have released GENESYS to make these benefits accessible for broader research on GPUs.
XI. ACKNOWLEDGMENTS
We thank the National Science Foundation, which partially supported this work through grants 1253700 and 1337147. We thank Guilherme Cox and Mark Silberstein for their feedback. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a registered trademark of PCI-SIG Corporation. ARM is a registered trademark of ARM Limited. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. © 2018 Advanced Micro Devices, Inc. All rights reserved.
REFERENCES
[1] S Mittal and J Vetter ldquoA Survey of CPU-GPU HeterogeneousComputing Techniquesrdquo in ACM Computing Surveys 2015
[2] J Owens D Luebke N Govindaraju M Harris J Kruger A Lefohnand T Purcell ldquoA Survey of General-Purpose Computation onGraphics Hardwarerdquo in Computer Graphics Forum Vol 26 Num1 pp 80-113 2007
[3] D Tarditi S Puri and J Oglesby ldquoAcelerator Using Data Paral-lelism to Program GPUs for General-Purpose Usesrdquo in InternationalConference on Architectural Support for Programming Languages andOperating Systems (ASPLOS) 2006
[4] J Vesely A Basu M Oskin G Loh and A Bhattacharjee ldquoObser-vations and Opportunities in Architecting Shared Virtual Memory forHeterogeneous Systemsrdquo in International Symposium on PerformanceAnalysis of Systems and Software (ISPASS) 2016
[5] J Obert J V Waveran and G Sellers ldquoVirtual Texturing in Softwareand Hardwarerdquo in SIGGRAPH Courses 2012
[6] G Cox and A Bhattacharjee ldquoEfficient Address Translation with Mul-tiple Page Sizesrdquo in International Conference on Architectural Supportfor Programming Languages and Operating Systems (ASPLOS) 2017
[7] L Olson J Power M Hill and D Wood ldquoBorder Control Sandbox-ing Acceleratorsrdquo in International Symposium on Microarchitecture(MICRO) 2007
[8] B Pichai L Hsu and A Bhattacharjee ldquoArchitectural Supportfor Address Translation on GPUsrdquo in International Conference onArchitectural Support for Programming Languages and OperatingSystems (ASPLOS) 2014
[9] J Power M Hill and D Wood ldquoSupporting x86-64 AddressTranslation for 100s of GPU Lanesrdquo in International Symposiumon High Performance Computer Architecture (HPCA) 2014
[10] N Agarwal D Nellans E Ebrahimi T Wenisch J Danskin andS Keckler ldquoSelective GPU Caches to Eliminate CPU-GPU HWCache Coherencerdquo in International Symposium on High PerformanceComputer Architecture (HPCA) 2016
[11] H Foundation ldquoHSA Programmerrsquos Reference Manual [Online]rdquo 2015httpwwwhsafoundationcomddownload=4945
[12] G Kyriazis ldquoHeterogeneous System Architecture A Technical Reviewrdquo2012
[13] J Power A Basu J Gu S Puthoor B Beckmann M HillS Reinhardt and D Wood ldquoHeterogeneous System Coherencefor Integrated CPU-GPU Systemsrdquo in International Symposium onMicroarchitecture (MICRO) 2013
[14] S Sahar S Bergman and M Silberstein ldquoActive Pointers A Case forSoftware Address Translation on GPUsrdquo in International Symposiumon Computer Architecture (ISCA) 2016
[15] C. J. Thompson, S. Hahn, and M. Oskin, "Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis," in International Symposium on Microarchitecture (MICRO), 2002.
[16] J. Owens, U. Kapasi, P. Mattson, B. Towles, B. Serebrin, S. Rixner, and W. Dally, "Media Processing Applications on the Imagine Stream Processor," in International Conference on Computer Design (ICCD), 2002.
[17] J. Owens, B. Khailany, B. Towles, and W. Dally, "Comparing Reyes and OpenGL on a Stream Architecture," in SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2002.
[18] S. Rixner, W. Dally, U. Kapasi, B. Khailany, A. Lopez-Lagunas, P. Mattson, and J. Owens, "A Bandwidth-Efficient Architecture for Media Processing," in International Symposium on Microarchitecture (MICRO), 1998.
[19] B. Serebrin, J. Owens, B. Khailany, P. Mattson, U. Kapasi, C. Chen, J. Namkoong, S. Crago, S. Rixner, and W. Dally, "A Stream Processor Development Platform," in International Conference on Computer Design (ICCD), 2002.
[20] Khronos Group, "OpenGL," 1992. https://www.khronos.org/opengl
[21] Microsoft, "Programming Guide for DirectX [Online]," 2016. https://msdn.microsoft.com/en-us/library/windows/desktop/ff476455(v=vs.85).aspx
[22] Khronos Group, "OpenCL," 2009. https://www.khronos.org/opencl
[23] Microsoft, "C++ AMP (C++ Accelerated Massive Parallelism) [Online]," 2016. https://msdn.microsoft.com/en-us/library/hh265137.aspx
[24] NVIDIA, "CUDA Toolkit Documentation," 2016. http://docs.nvidia.com/cuda
[25] M. Silberstein, B. Ford, I. Keidar, and E. Witchel, "GPUfs: Integrating File Systems with GPUs," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.
[26] S. Kim, S. Huh, X. Zhang, Y. Hu, A. Wated, E. Witchel, and M. Silberstein, "GPUnet: Networking Abstractions for GPU Programs," in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
[27] J. Owens, J. Stuart, and M. Cox, "GPU-to-CPU Callbacks," in Workshop on Unconventional High Performance Computing, 2010.
[28] S. Bergman, T. Brokhman, T. Cohen, and M. Silberstein, "SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs," in USENIX Annual Technical Conference (ATC), 2017.
[29] AMD, "Radeon Open Compute [Online]," 2017. https://rocm.github.io
[30] AMD, "ROCK-syscall," 2017. https://github.com/RadeonOpenCompute/ROCK-syscall
[31] AMD, "ROCT-syscall," 2017. https://github.com/RadeonOpenCompute/ROCT-syscall
[32] AMD, "HCC-syscall," 2017. https://github.com/RadeonOpenCompute/HCC-syscall
[33] AMD, "Genesys-syscall-test," 2017. https://github.com/RadeonOpenCompute/Genesys-syscall-test
[34] R. Draves, B. Bershad, R. Rashid, and R. Dean, "Using Continuations to Implement Thread Management and Communication in Operating Systems," in Symposium on Operating Systems Principles (SOSP), 1991.
[35] O. Shivers and M. Might, "Continuations and Transducer Composition," in International Symposium on Programming Languages Design and Implementation (PLDI), 2006.
[36] AMD, "Graphics Core Next 3.0 ISA [Online]," 2013. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf
[37] Intel, "Intel Open Source HD Graphics Programmer's Reference Manual [Online]," 2014. https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-bdw-vol02b-commandreference-instructions_0_0.pdf
[38] NVIDIA, "NVIDIA Compute PTX: Parallel Thread Execution," 2009. https://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf
[39] PCI-SIG, "Atomic Operations [Online]," 2008. https://pcisig.com/sites/default/files/specification_documents/ECN_Atomic_Ops_080417.pdf
[40] T. Sorensen, A. F. Donaldson, M. Batty, G. Gopalakrishnan, and Z. Rakamaric, "Portable Inter-Workgroup Barrier Synchronisation for GPUs," in International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2016.
[41] T. Rogers, M. O'Connor, and T. Aamodt, "Cache-Conscious Wavefront Scheduling," in International Symposium on Microarchitecture (MICRO), 2012.
[42] T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," in International Symposium on Microarchitecture (MICRO), 2013.
[43] A. ElTantawy and T. Aamodt, "Warp Scheduling for Fine-Grained Synchronization," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[44] H. Wang, F. Luo, M. Ibrahim, O. Kayiran, and A. Jog, "Efficient and Fair Multi-programming in GPUs via Pattern-Based Effective Bandwidth Management," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[45] O. Kayiran, A. Jog, M. Kandemir, and C. Das, "Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs," in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013.
[46] X. Tang, A. Pattnaik, H. Jiang, O. Kayiran, A. Jog, S. Pai, M. Ibrahim, M. T. Kandemir, and C. R. Das, "Controlled Kernel Launch for Dynamic Parallelism in GPUs," in International Symposium on High Performance Computer Architecture (HPCA), 2017.
[47] AMD, "ROCm: A New Era in Open GPU Computing [Online]," 2016. https://radeonopencompute.github.io/index.html
[48] A. Sasidharan and M. Snir, "MiniAMR: A Miniapp for Adaptive Mesh Refinement," 2014. http://hdl.handle.net/2142/91046
[49] Z. Brown, "Asynchronous System Calls," in Ottawa Linux Symposium, 2007.
[50] J. A. Stuart, P. Balaji, and J. D. Owens, "Extending MPI to Accelerators," in PACT 2011 Workshop Series: Architectures and Systems for Big Data, Oct. 2011.
[51] L. Oden, "Direct Communication Methods for Distributed GPUs," Dissertation, Ruprecht-Karls-Universität Heidelberg, 2015.
[52] NVIDIA, "NVIDIA GPUDirect [Online]," 2013. https://developer.nvidia.com/gpudirect
[53] H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, "GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation," IEEE Transactions on Parallel and Distributed Systems, vol. 25, pp. 2595-2605, Oct. 2014.
[54] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, "Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation," in Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, pp. 97-104, Sept. 2004.
[55] IBM, "IBM Spectrum MPI," 2017. http://www-03.ibm.com/systems/spectrum-computing/products/mpi
Finally, GENESYS can leverage the benefits of support for important OS features that prior work cannot. For example, GPUnet's use of custom APIs precludes the use of traffic shaping and firewalls that are already built into the OS.
When designing GENESYS, we ran into several design questions. For example, what system calls make sense for GPUs? System calls such as pread/pwrite to file(s), or send/recv to and from the network stack, are useful because they underpin many I/O activities required to complete a task. But system calls such as fork and execv do not, for now, seem necessary for GPU threads. In the middle are many system calls that need adaptation for GPU execution and are heavily dependent on architectural features that could be supported by future GPUs. For example, getrusage can be adapted to return information about GPU resource usage. We summarize the conclusions from this qualitative study.
We then perform a detailed design space study on the performance benefits of GENESYS. Key questions are:
How does the GPU's hardware execution model impact system call invocation strategies? To manage parallelism, the GPU's underlying hardware architecture decomposes work into a hierarchy of execution groups. The granularity of these groups ranges from work-items (or GPU threads) to work-groups (composed of hundreds of work-items) to kernels (composed of hundreds of work-groups).¹ This naturally presents the following research question: at which of these granularities should GPU system calls be invoked?
How should GPU system calls be ordered? CPU system calls are implemented such that instructions prior to the system call have completed execution, while code following the system call remains unexecuted. This model is a good fit for CPUs, which generally target single-threaded execution. But such "strong ordering" semantics may be overly conservative for GPUs: it acts as an implicit synchronization barrier across thousands of work-items, compromising performance. Similar questions arise as to whether GPU system calls should be "blocking" or "non-blocking".
Where and how should GPU system calls be processed? Like all prior work, we assume that system calls invoked by GPU programs need to ultimately be serviced by the CPU. This makes efficient GPU-CPU communication and CPU-side processing fundamental to GPU system calls. We find that careful use of modern GPU features like shared virtual addressing [4] and page fault support [8, 9], coupled with traditional interrupt mechanisms, can enable efficient CPU-GPU communication of system call requests and responses.
To explore these questions, we study GENESYS with microbenchmarks and end-to-end applications. Overall, our
¹ Without loss of generality, we use AMD's terminology of work-items, work-groups, kernels, and compute units (CUs), although our work applies equally to NVIDIA's threads, thread-blocks, kernels, and streaming multiprocessors (SMs), respectively.
contributions are:
① We take a step toward realizing truly heterogeneous programming by enabling GPUs to directly invoke OS-managed services, just like CPUs. This builds on the promise of recent work [25-28], but goes further by enabling direct invocation of any system call through standard POSIX APIs. This permits GPUs to use the entire ecosystem of OS-managed system services developed over decades of research.
② As a proof of concept, we use GENESYS to realize system calls previously unavailable on GPUs to directly invoke OS services for memory management, signals, and specialized filesystem use. Additionally, we continue supporting all the system services made available by prior work (i.e., GPUfs, GPUnet, SPIN), but do so with standard POSIX APIs.
③ We shed light on several novel OS and architectural design issues in supporting GPU system calls. We also offer the first set of design guidelines for practitioners on how to directly invoke system calls in a manner that meshes with the execution hierarchy of GPUs to maximize performance. While we use Linux as a testbed to evaluate our concepts, our design choices are applicable more generally across OSes.
④ We publicly release GENESYS, hosted under the Radeon Open Compute stack [29-33], offering its benefits broadly.
II. MOTIVATION
A reasonable question to ponder is why equip GPUs with system call invocation capabilities at all. Conceptually, OSes have traditionally provided a standardized abstract machine, in the form of a process, to user programs executing on the CPU. Parts of this process abstraction, such as memory layout, the location of program arguments, and the ISA, have benefited GPU programs. Other aspects, however, such as standardized and direct protected access to the filesystem, network, and memory allocation, are extremely important for processes but are as yet lacking for GPU code. Allowing GPU code to invoke system calls is a further step toward providing a more complete process abstraction to GPU code.
Unfortunately, GPU programs can currently only invoke system calls indirectly, and thereby suffer from performance challenges. Consider the diagram on the left in Figure 1. Programmers are currently forced to delay system call requests until the end of the GPU kernel invocation. This is not ideal because developers have to take what was a single conceptual GPU kernel and partition it into two: one before the system call and one after it. This model, which is akin to continuations, is notoriously difficult to program [34]. Compiler technologies can assist the process [35], but the effect of ending the GPU kernel and restarting another is the same as a barrier synchronization across all GPU threads, and adds unnecessary round-trips between the CPU and the GPU, both of which incur significant overhead.
Table I
GENESYS ENABLES NEW CLASSES OF APPLICATIONS AND SUPPORTS ALL PRIOR WORK

Type | Application | Syscalls | Description

Previously Unrealizable:
Memory Management | miniAMR | madvise, getrusage | Uses madvise to return unused memory to the OS (Sec. VIII-A)
Signals | signal-search | rt_sigqueueinfo | Uses signals to notify the host about partial work completion (Sec. VIII-B)
Filesystem | grep | read, open, close | Work-item invocations not supported by prior work; prints to terminal (Sec. VIII-C)
Device Control (ioctl) | bmp-display | ioctl, mmap | Kernel-granularity invocation to query and set up framebuffer properties (Sec. VIII-E)

Previously Realizable:
Filesystem | wordsearch | pread, read | Supports the same workloads as prior work (GPUfs) (Sec. VIII-C)
Network | memcached | sendto, recvfrom | Possible with GPUnet, but we do not need RDMA for performance (Sec. VIII-D)
Figure 1. (Left) Timeline of events when the GPU has to rely on a CPU to handle system services, and (right) when the GPU has system call invocation capabilities.
In response, recent studies invoke OS services from GPUs [25-28], as shown on the right in Figure 1. This approach eliminates the need for repeated kernel launches, enabling better GPU efficiency. System calls (e.g., request data) still require CPU processing, as they often require access to hardware resources that only the CPU interacts with. However, CPUs only need to schedule tasks in response to GPU system calls as and when needed. CPU system call processing also overlaps with the execution of other GPU threads. Studies on GPUfs, GPUnet, GPU-to-CPU callbacks, and SPIN are seminal in demonstrating such direct invocation of OS-managed services, but suffer from key drawbacks:
Lack of generality: They target specific OS-managed services and therefore realize only specific APIs for filesystem or networking services. These interfaces are not readily extensible to other system calls/OS services.
Lack of flexibility: They focus on specific system call invocation strategies. Consequently, there has been no design space exploration on the general GPU system call interface. Questions such as the best invocation granularity (i.e., whether system calls should be invoked per work-item, work-group, or kernel) or ordering remain unexplored and, as we show, can affect performance in subtle and important ways.
Reliance on non-standard APIs: Their use of custom APIs precludes the use of many OS-managed services (e.g., memory management, signals, process/thread management, scheduler). Further, custom APIs do not readily take advantage of existing OS code paths that enable a richer set of system features. Recent work on SPIN points this out for filesystems, where using custom APIs causes issues with page caches, filesystem consistency, and incompatibility with virtual block devices such as software RAID.
While these past efforts broke new ground and demonstrated the value of OS services for GPU programs, they did not explore the question we pose: why not simply provide generic access from the GPU to all POSIX system calls?
III. HIGH-LEVEL DESIGN
Figure 2. High-level overview of how GPU system calls are invoked and processed on CPUs.
Figure 2 outlines the steps used by GENESYS. When the GPU invokes a system call, it has to rely on the CPU to process the system call on its behalf. Therefore, in step 1, the GPU places system call arguments and information in a portion of system memory also visible to the CPU. We designate a portion of memory as the syscall area to store this information. In step 2, the GPU interrupts the CPU, conveying the need for system call processing. Within the interrupt message, the GPU also sends the ID number of the wavefront issuing the system call. This triggers the execution of an interrupt handler on the CPU. The CPU uses the wavefront ID to read the system call arguments from the syscall area in step 3. Subsequently, in step 4, the CPU processes the interrupt and writes the results back into the syscall area. Finally, in step 5, the CPU notifies the GPU wavefront that its system call has been processed.
We rely on the ability of the GPU to interrupt the CPU, and use readily available hardware [36-38] for this. However, this is not a fundamental design requirement; in fact, prior work [25, 27] uses a CPU polling thread to service a limited set of GPU system service requests instead. Further, while increasingly widespread features such as shared virtual memory and CPU-GPU cache coherence [4, 8, 9, 13] are beneficial to our design, they are not necessary. CPU-GPU communication can also be achieved via atomic reads/writes in system memory or GPU device memory [39].
IV. ANALYZING SYSTEM CALLS
While designing GENESYS, we classified all of Linux's over 300 system calls and assessed which ones to support. Some of the classifications were subjective and were debated even among ourselves. Many of the classification issues relate to the unique nature of the GPU's execution model.
Recall that GPUs use SIMD execution on thousands of concurrent threads. To keep such massive parallelism tractable, GPGPU programming languages like OpenCL [22] and CUDA [24] expose hierarchical groups of concurrently executing threads to programmers. The smallest granularity of execution is the GPU work-item (akin to a CPU thread). Several work-items (e.g., 32-64) operate in lockstep in the unit of wavefronts, the smallest hardware-scheduled unit of execution. Many wavefronts (e.g., 16) constitute programmer-visible work-groups and execute on a single GPU compute unit (CU). Work-items in a work-group can communicate among themselves using local CU caches and/or scratchpads. Hundreds of work-groups comprise a GPU kernel. The CPU dispatches work to a GPU at the granularity of a kernel. Each work-group in a kernel can execute independently. Further, it is possible to synchronize just the work-items within a single work-group [12, 22]. This avoids the cost of globally synchronizing across thousands of work-items in a kernel, which is often unnecessary in a GPU program and might not be possible under all circumstances.²
The bottom line is that GPUs rely on far greater forms of parallelism than CPUs. This implies the following OS and architectural considerations in designing system calls.
Level of OS visibility into the GPU: When a CPU thread invokes the OS, that thread has a representation within the
² Although there is no single formally specified barrier to synchronize across work-groups today, recent work shows how to achieve the same effect by extending existing non-portable GPU inter-work-group barriers to use OpenCL 2.0 atomic operations [40].
kernel. The vast majority of modern OS kernels maintain a data structure for each thread for several common tasks (e.g., kernel resource use, permission checks, auditing). GPU tasks, however, have traditionally not been represented in the OS kernel. We believe this should not change. As discussed above, GPU threads are numerous and short-lived. Creating and managing a kernel structure for thousands of individual GPU threads would vastly slow down the system. These structures are also only useful for GPU threads that invoke the OS, and represent wasted overhead for the rest. Hence, we process system calls in OS worker threads and switch CPU contexts if necessary (see Section VI). As more GPUs support system calls, this is an area that will require careful consideration by kernel developers.
Evolution of GPU hardware: Many system calls are heavily dependent on architectural features that could be supported by future GPUs. For example, consider that modern GPUs do not expose their thread scheduler to software. This means that system calls to manage thread affinity (e.g., sched_setaffinity) are not implementable on GPUs today. However, a wealth of research has shown the benefits of GPU warp scheduling [41-45], so should GPUs require more sophisticated thread scheduling support appropriate for implementation in software, such system calls may ultimately become valuable.
With these design themes in mind, we discuss our classification of Linux's system calls.
① Readily implementable: Examples include pread, pwrite, mmap, munmap, etc. This group is also the largest subset, comprising nearly 79% of Linux's system calls. In GENESYS, we implemented 14 such system calls for filesystems (read, write, pread, pwrite, open, close, lseek), networking (sendto, recvfrom), memory management (mmap, munmap, madvise), querying resource usage (getrusage), and signal invocation (rt_sigqueueinfo). Furthermore, we also implement device control ioctls. Some of these system calls, like read, write, and lseek, are stateful. Thus GPU programmers must use them carefully: the current value of the file pointer determines what value is read or written by the read or write system call. This can be arbitrary if invoked at work-item or work-group granularity for the same file descriptor, because many work-items/work-groups can execute concurrently.
An important design issue is that GENESYS's support for standard POSIX APIs allows GPUs to read, write, and mmap any file descriptor Linux provides. This is particularly beneficial because of Linux's "everything is a file" philosophy: GENESYS readily supports features like the terminal for user I/O, pipes (including redirection of stdin, stdout, and stderr), files in /proc to query process environments, files in /sys to query and manipulate kernel parameters, etc. Although our studies focus on Linux, the broader domains of OS services represented by the specific system calls generalize to other
Table II
EXAMPLES OF SYSTEM CALLS THAT REQUIRE HARDWARE CHANGES TO BE IMPLEMENTABLE ON GPUS. IN TOTAL, THIS GROUP CONSISTS OF 13% OF ALL LINUX SYSTEM CALLS; IN CONTRAST, WE BELIEVE THAT 79% OF LINUX SYSTEM CALLS ARE READILY IMPLEMENTABLE.

Type | Examples | Reason that it is not currently implementable
capabilities | capget, capset | Needs GPU thread representation in the kernel
namespace | setns | Needs GPU thread representation in the kernel
policies | set_mempolicy | Needs GPU thread representation in the kernel
thread scheduling | sched_yield, sched_setaffinity | Needs better control over GPU scheduler
signals | sigaction, sigsuspend, sigreturn, sigprocmask | Signals require the target thread to be paused and then resumed after the signal action has been completed. GPU threads cannot be targeted. It is currently not possible to independently set program counters of individual threads. Executing signal actions in newly spawned threads might require freeing of GPU resources.
architecture-specific | ioperm | Not accessible from GPU
OSes like FreeBSD and Solaris.
At the application level, implementing this array of system calls opens new domains of OS-managed services for GPUs. In Table I we summarize previously unimplementable system calls realized and studied in this paper. These include applications that use madvise for memory management and rt_sigqueueinfo for signals. We also go beyond prior work on GPUfs by supporting filesystem services that require more flexible APIs, with work-item invocation capabilities for good performance. Finally, we continue to support previously implementable system calls.
② Useful, but implementable only with changes to GPU hardware: Several system calls (13% of the total) seem useful for GPU code, but are not easily implementable because of Linux's existing design. Consider sigsuspend/sigaction: there is no kernel representation of a GPU work-item to manage and dispatch a signal to. Additionally, there is no lightweight method to alter the GPU program counter of a work-item from the CPU kernel. One approach is for signal masks to be associated with the GPU context and for signals to be delivered as additional work-items. This works around the absence of GPU work-item representation in the kernel. However, POSIX requires threads that process signals to pause execution and resume only after the signal has been processed. Targeting the entire GPU context would mean that all GPU execution needs to halt while the work-item processing the signal executes, which goes against the parallel nature of GPU execution. Recent work has, however, shown the benefits of hardware support for dynamic kernel launch that allows on-demand spawning of kernels on the GPU without any CPU intervention [46]. Should such approaches be extended to support thread recombination, assembling multiple signal handlers into a single warp (akin to prior approaches on control flow divergence [42]), sigsuspend or sigaction may become implementable. Table II presents more examples of system calls that are currently not implementable (in their original semantics).
③ Requires extensive modification to be supported: This group (8% of the total) contains perhaps the most controversial set of system calls. At this time, we do not believe that it is worth the implementation effort to support these system calls. For example, fork necessitates cloning a copy of the executing caller's GPU state. Technically this can be done (e.g., it is how GPGPU code is context-switched with the graphics shaders), but it seems unnecessary at this time.
V. DESIGN SPACE EXPLORATION
A. GPU-Side Design Considerations
Invocation granularity: In assessing how best to use GPU system calls, several questions arise. The first and most important question is: how should system calls be aligned with the hierarchy of GPU execution groups? Should a GPU system call be invoked separately for each work-item, once for every work-group, or once for the entire GPU kernel?
Consider a GPU program that writes sorted integers to a single output file. One might, at first blush, invoke the write system call at each work-item. This can present correctness issues, however, because write is position-relative and requires access to a global file pointer. Without synchronizing the execution of write across work-items, the output file will be written in a non-deterministic, unsorted order.
Using different system call invocation granularities can fix this issue. One could, for example, use a memory location to temporarily buffer the sorted output. Once all work-items have updated this buffer, a single write system call at the end of the kernel can be used to write the contents of the buffer to the output file. This approach loses some benefits of the GPU's parallel resources, because the entire system call latency is exposed on the program's critical path and might not be overlapped with the execution of other work-items. Alternatively, one could use pwrite system calls instead of write system calls. Because pwrite allows programmers to specify the absolute file position where the output is to be written, per-work-item invocations present no correctness issue. However, per-work-item pwrites result in a flood of system calls, potentially harming performance.
Overly coarse kernel-grained invocations also restrict performance by reducing the possibility of overlapping system call processing with GPU execution. A compromise may be to invoke a pwrite system call per work-group, buffering
Figure 3. Work-items in a work-group (shown as a blue box) execute strongly ordered system calls.
Figure 4. Work-group invocations can be relax-ordered by removing one of the two barriers.
the sorted output of the work-items until the per-work-group buffers are fully populated. Section VII demonstrates that these decisions can lead to a 1.75× performance difference.
System call ordering semantics: When programmers invoke system calls on CPUs, they expect that all program instructions before the system call will complete execution before the system call executes. They also expect that instructions after the system call will only commence once the system call returns. We call this "strong ordering". For GPUs, however, we introduce "relaxed ordering" semantics. The notion of relaxed ordering is tied to the hierarchical execution scopes of the GPU, and is needed for both correctness and performance.
Figure 3 shows a programmer-visible work-group (in blue) consisting of four wavefronts: A, B, C, and D. Each wavefront has two work-items (e.g., A0 and A1). If system calls (SysCs) are invoked per work-item, they are strongly ordered. Another approach is depicted in Figure 4, where one work-item, A0, invokes a system call on behalf of the entire work-group. Strong ordering is achieved by placing work-group barriers (Bar1, Bar2) before and after system call execution. One can remove these barriers with relaxed ordering, allowing threads in wavefronts B, C, and D to execute, overlapping with the CPU's processing of A's system call.
For correctness, we need to allow programmers to use relaxed ordering when system calls are invoked at kernel granularity (across work-groups). This is because kernels can consist of more work-items than can concurrently execute on the GPU. For strongly ordered system calls at the kernel level, all kernel work-items must finish executing pre-invocation instructions prior to invoking the system call. But all work-items cannot execute concurrently, because GPU runtimes do not preemptively context switch work-items of the same kernel. It is not always possible for all work-items to execute all instructions prior to the system call. Strong ordering at kernel granularity risks deadlocking the GPU.
At the work-group invocation granularity, relaxed ordering can improve performance by avoiding synchronization overheads and overlapping CPU-side system call processing with the execution of other work-items. The key is to remove the barriers Bar1/Bar2 in Figure 4. To do this, consider that from the application point of view, system calls are usually producers or consumers of data. Take a consumer call like write, invoked at the work-group level. Real-world GPU programs may use multiple work-items to generate the data for the write, but instructions after the write call typically do not depend on the write outcome. Therefore, other work-items in the group need not wait for the completion of the system call, meaning that we can remove Bar2, improving performance. Similarly, producer system calls like read typically require system calls to return before other work-items can start executing program instructions post-read, but do not necessarily require other work-items in the work-group to finish executing all instructions before the read invocation. Bar1 in Figure 4 becomes unnecessary in these cases. The same observations apply to per-kernel system call invocations, which need relaxed ordering for correctness anyway.
In summary, work-item invocations imply strong ordering. Programmers balance performance and programmability for work-group invocations by choosing strong or relaxed ordering. Finally, programmers must use relaxed ordering for kernel invocations so that the GPU does not deadlock.
Blocking versus non-blocking approaches: Most traditional CPU system calls (barring those for asynchronous I/O, e.g., aio_read and aio_write) return only after the system call completes execution. Blocking system calls have a disproportionate impact on performance because GPUs use SIMD execution. In particular, if even a single work-item is blocked on a system call, no other work-item in the wavefront can make progress either. We find that GPUs can often benefit from non-blocking system calls that can immediately return before system call processing completes. Non-blocking system calls can therefore overlap system call processing on the CPU with useful GPU work, improving performance by as much as 30% in some of our studies (see Section VII).
The concepts of blocking/non-blocking and strong/relaxed ordering are related but orthogonal. The strictness of ordering refers to the question of when a system call can be invoked with respect to the progress of work-items within its granularity of invocation. System call blocking refers to how the return from a system call relates to the completion of its processing. Relaxed ordering and non-blocking can be combined in several useful ways. For example, consider a case where a GPU program writes to a file at work-group granularity. The execution may not depend upon the output of the write, but the programmer may want to ensure that the write successfully completes. In such a scenario, blocking writes may be invoked with weak ordering. Weak ordering permits all but one wavefront in the work-group to proceed without waiting for completion of the write (see Section VI).
Blocking invocation, however, ensures that one wavefront waits for the write to complete and can raise an alarm if the write fails. Consider another scenario where a programmer wishes to prefetch data using read system calls but may not use the results immediately. Here, weak ordering with non-blocking invocation is likely to provide the best performance without breaking the program's semantics. In short, different combinations of blocking and ordering enable programmers to fine-tune performance and programmability tradeoffs.
B. CPU Hardware
GPUs rely on extreme parallelism for performance. This means there may be bursts of system calls that CPUs need to process. System call coalescing is one way to increase the throughput of system call processing. The idea is to collect several GPU system call requests and batch their processing on the CPU. The benefit is that CPUs can manage multiple system calls as a single unit, scheduling a single software task to process them. This reduction in task management, scheduling, and processing overheads can often boost performance (see Section VII). Coalescing must be performed judiciously, as it improves system call processing throughput at the expense of latency. It also implicitly serializes the processing of system calls within a coalesced bundle.
To allow the GPGPU programmer to balance the benefits of coalescing with its potential challenges, GENESYS accepts two parameters: a time window length within which the CPU coalesces system calls, and the maximum number of system call invocations that can be coalesced within the time window. Section VII shows that system call coalescing can improve performance by as much as 10-15%.
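The two coalescing knobs can be mimicked with a toy batcher: a batch closes when its time window expires or when it reaches the maximum size, whichever comes first. The function below is an illustrative model, not GENESYS code; timestamps and parameter values are made up.

```python
def coalesce(arrival_times, window, max_batch):
    """Group request timestamps into batches: a batch closes when the
    window opened by its first request expires, or when it reaches
    max_batch requests, whichever comes first."""
    batches = []
    current, t_open = [], None
    for t in arrival_times:
        if current and (t - t_open > window or len(current) == max_batch):
            batches.append(current)       # hand the batch to one OS task
            current, t_open = [], None
        if not current:
            t_open = t                    # first request opens the window
        current.append(t)
    if current:
        batches.append(current)
    return batches
```

For example, eight requests arriving at times [0, 1, 2, 10, 11, 12, 13, 14] with a window of 5 and a cap of 3 yield three batches instead of eight separately scheduled tasks.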
C. CPU-GPU Communication Hardware
Prior work implemented system calls using polling, where GPU wavefronts monitored predesignated memory locations populated by CPUs upon system call completion. But recent advances in GPU hardware enable alternate modes of CPU-GPU communication. For example, AMD GPUs now allow wavefronts to relay interrupts to CPUs and then halt execution, relinquishing SIMD hardware resources [36]. CPUs can in turn message the GPU to wake up halted wavefronts.
We have implemented both polling and halt-resume approaches in GENESYS. With polling, if the number of memory locations that needs to be polled by the GPU exceeds its cache size, frequent cache misses lower performance. On the other hand, halt-resume has its own overheads, namely the latency to resume a halted wavefront.
We have found that polling yields better performance when system calls are invoked at coarser work-group granularities, since fewer memory locations are needed to convey information at the work-group versus work-item level. Consequently, the memory locations easily fit in the GPU's L2 data cache. When system calls are invoked per work-item, however, the
Table III
SYSTEM CONFIGURATION USED FOR OUR STUDIES

SoC               Mobile AMD FX-9800P
CPU               4x 2.7 GHz, AMD Family 21h Model 101h
Integrated GPU    758 MHz AMD GCN 3 GPU
Memory            16 GB Dual-Channel DDR4-1066MHz
Operating system  Fedora 26 using ROCm stack 1.6 (based on Linux 4.11)
Compiler          HCC-0.10.17166 + LLVM 5.0, C++AMP with HC extensions
Figure 5. Content of each slot in the syscall area.
Figure 6. State transition diagram for a slot in the syscall area. Green shows GPU state/actions; blue shows that of the CPU.
sheer number of such memory locations becomes so high that cache thrashing becomes an issue. In these situations, halt-resume approaches outperform polling. We quantify the impact of this in the following sections.
VI. IMPLEMENTING GENESYS
We implemented GENESYS on the system in Table III. We used an AMD FX-9800P processor with an integrated GPU and ran the open-source Radeon Open Compute (ROCm) software stack [47]. Although we use a system with an integrated GPU, GENESYS is not specific to integrated GPUs and generalizes to discrete GPUs. We modified the GPU driver and Linux kernel to enable GPU system calls. We also modified the HCC compiler to permit GPU system call invocations in C++AMP.
Invoking GPU system calls: GENESYS permits work-item, work-group, and kernel-level invocation. For work-group or kernel-level invocations, a single work-item is designated as the caller. For strongly ordered work-group invocations, we use work-group scope barriers before and after system call invocations. For relaxed ordering, a barrier is placed either before (for consumer system calls) or after (for producer calls) the system call invocation.
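The barrier placement rule above can be written down directly. This is a sketch of the policy only; the string labels are ours, not GENESYS API names.

```python
def barriers(ordering, call_type):
    """Work-group scope barriers around a system call invocation.
    Returns (barrier_before, barrier_after) per the scheme above."""
    if ordering == "strong":
        return (True, True)
    # Relaxed ordering: consumers (e.g., write) only need the data
    # ready before the call; producers (e.g., read) only need the
    # call done before work-items consume its result.
    if call_type == "consumer":
        return (True, False)
    return (False, True)
```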
GPU to CPU communication: GENESYS facilitates efficient GPU to CPU communication of system call requests.
GENESYS uses a preallocated shared memory syscall area to allow the GPU to convey parameters to the CPU (see Section III). The syscall area maintains one slot for each active GPU work-item. The OS and driver code can identify the desired slot by using the hardware ID of the active work-item, which is available to GPU runtimes. This hardware ID is distinct from the programmer-visible work-item ID: each work-item has a unique work-item ID visible to the application programmer, but at any one point in time only a subset of these work-items executes (as permitted by the GPU's CU count, supported work-groups per CU, and SIMD width). The hardware ID distinguishes among these active work-items. Overall, our system uses 64 bytes per slot, totaling 1.25 MB of syscall area.
Figure 5 shows the contents of a slot: the requested system call number, the request state, the system call arguments (as many as 6 in Linux), and padding to avoid false sharing. The field for arguments is also re-purposed for the return value of the system call. When a GPU program's work-item invokes a system call, it atomically updates the state of the corresponding slot from free to populating (Figure 6). If the slot is not free, system call invocation is delayed. Once the state is populating, the invoking work-item fills the slot with system call information and changes the state to ready. The work-item also adds one bit of information about whether the invocation is blocking or non-blocking. The work-item then interrupts the CPU using a scalar wavefront GPU instruction3 (s_sendmsg on AMD GPUs). For blocking invocation, the work-item either waits and polls the state of the slot or suspends itself using halt-resume.
CPU-side system call processing: Once the GPU interrupts the CPU, system call processing commences. The interrupt handler creates a new kernel task and adds it to Linux's work-queue. This task is also passed the hardware ID of the requesting wavefront. At an expedient future point in time, an OS worker thread executes this task. The task scans the 64 syscall slots of the given hardware ID and atomically switches any ready system call requests to the processing state. The task then carries out the system call work.
A challenge is that Linux's traditional system call routines implicitly assume that they are executed within the context of the original process invoking the system call. Consequently, they use context-specific variables (e.g., the current variable used to identify the process context). This presents a challenge for GPU system calls, which are instead serviced purely in the context of the OS's worker thread. GENESYS overcomes this challenge in two ways: it either switches to the context of the original CPU program that invoked the GPU kernel, or it provides additional context information in the code performing system call processing.
3 Scalar wavefront instructions are part of the Graphics Core Next ISA and are executed once for the entire wavefront, rather than for each active work-item. See Chapter 4.1 in the manual [36].
The exact strategy is determined on a case-by-case basis. GENESYS implements coalescing by waiting for a predetermined amount of time in the interrupt handler before enqueueing a task to process a system call. If multiple requests are made to the CPU during this time window, they are coalesced such that they can be handled as a single unit of work by the OS worker-thread. GENESYS uses Linux's sysfs interface to communicate coalescing parameters.
Communicating results from the CPU to the GPU: Once the CPU worker-thread finishes processing the system call, the results are put in the arguments field of the slot for blocking requests. The thread then changes the state of the slot to finished for blocking system calls; for non-blocking invocations, the state is changed to free. The invoking GPU work-item is then re-scheduled (on hardware that supports wavefront suspension) or automatically restarted because it was polling on this state. The work-item can consume the result and continue execution.
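Taken together, the slot protocol of Figures 5 and 6 amounts to a small state machine. The sketch below models it single-threaded in Python: the atomic cmp-swap/swap/load operations GENESYS uses become plain state checks, and the class and method names are illustrative, not GENESYS code.

```python
FREE, POPULATING, READY, PROCESSING, FINISHED = range(5)

class Slot:
    """Toy model of one syscall-area slot's lifecycle."""
    def __init__(self):
        self.state = FREE
        self.args = None        # re-purposed for the return value
        self.blocking = None

    def gpu_invoke(self, syscall_args, blocking):
        """Work-item side: claim the slot (cmp-swap in GENESYS),
        fill it, and mark it ready for the CPU."""
        if self.state != FREE:
            return False        # slot busy: invocation is delayed
        self.state = POPULATING
        self.args, self.blocking = syscall_args, blocking
        self.state = READY
        return True

    def cpu_process(self, handler):
        """OS worker side: switch ready -> processing, run the call,
        then publish the result (finished if blocking, else free)."""
        if self.state != READY:
            return
        self.state = PROCESSING
        self.args = handler(self.args)   # result overwrites arguments
        self.state = FINISHED if self.blocking else FREE

    def gpu_complete(self):
        """Blocking work-item consumes the result and frees the slot."""
        assert self.state == FINISHED
        result, self.state = self.args, FREE
        return result
```

A blocking invocation walks free -> populating -> ready -> processing -> finished -> free; a non-blocking one skips the finished state, returning the slot to free as soon as processing ends.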
Other architectural design considerations: GENESYS requires two key data structures to be exchanged between GPUs and CPUs: the syscall area and, for some system calls, syscall buffers that maintain data required for system call processing. We discovered that carefully leveraging architectural support for CPU-GPU cache coherence in the context of these data structures was vital to overall performance.
Consider, for example, the syscall area. As described in Section III, every GPU work-item invoking a system call is allocated space in the syscall area. Like many GPUs, the one used as our experimental platform supports L2 data caches that are coherent with CPU caches/memory but integrates non-coherent L1 data caches. At first blush, one may decide to manually invalidate L1 data cache lines. However, we sidestepped this issue by restricting per-work-item slots in the syscall area to individual cache lines. This design permits us to use atomic instructions to access memory; these atomic instructions, by design, force lookups of the L2 data cache and guarantee visibility of the entire cache line, sidestepping the coherence challenges of L1 GPU data caches. We quantify the measured overheads of the atomic operations we use for GENESYS in Table IV, comparing them to the latency of a standard load operation. Through experimentation, we achieved good performance using cmp-swap atomics to claim a slot in the syscall area when GPU work-items invoke system calls, atomic-swaps to change state, and atomic-loads to poll for completion.
Unfortunately, the same approach of using atomics does not yield good performance for accesses to syscall buffers. The key issue is that syscall buffers can be large and span multiple cache lines. Using atomics here meant that we suffered the latency of several L2 data cache accesses to syscall buffers. We found that a better approach was to eschew atomics in favor of manual software L1 data cache coherence. This
Table IV
PROFILED PERFORMANCE OF GPU ATOMIC OPERATIONS

Op         cmp-swap   swap    atomic-load   load
Time (us)  1.352      0.782   0.360         0.243
[Figure 7: two plots of GPU time (s) versus file size (MB, 32-2048). Left: work-item (wi), work-group (wg), and kernel invocation granularities. Right: work-group sizes wg64, wg256, wg512, and wg1024.]

Figure 7. Impact of system call invocation granularity. The graphs show average and standard deviation of 10 runs.
meant, for example, that we preceded sys_write system calls with an L1 data cache flush.
VII. MICROBENCHMARK EVALUATIONS
Invocation granularity: Figure 7 quantifies performance for a microbenchmark that uses pread on files in tmpfs.4 The x-axis plots the file size, and the y-axis shows read time, with lower values being better. Within each cluster, we separate runtimes for different pread invocation granularities.
Figure 7(left) shows that work-item invocation granularities tend to perform worst. This is not surprising, as it is the finest granularity of system call invocation and leads to a flood of individual system calls that overwhelm the CPU. On the other end of the granularity spectrum, kernel-level invocation is also challenging, as it generates a single system call and fails to leverage any potential parallelism in the processing of system call requests. This is particularly pronounced at large file sizes (e.g., 2GB). A good compromise is to use work-group invocation granularities: this does not overwhelm the CPU with system call requests while still being able to exploit parallelism in servicing system calls.
When using work-group invocation, an important question is how many work-items should constitute a work-group. While Figure 7(left) uses 64 work-items in a work-group, Figure 7(right) quantifies the performance impact of pread system calls as we vary work-group sizes from 64 (wg64) to 1024 (wg1024) work-items. In general, larger work-group sizes enable better performance, as there are fewer unique system calls necessary to handle the same amount of work.
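The granularity tradeoff is ultimately arithmetic: the number of invocations the CPU must service shrinks from one per work-item, to one per work-group, to one per kernel. A sketch, with hypothetical work-item counts:

```python
def syscalls_issued(total_work_items, granularity, work_group_size=64):
    """How many system call invocations reach the CPU for one kernel,
    assuming each work-item's data is covered exactly once."""
    if granularity == "work-item":
        return total_work_items
    if granularity == "work-group":
        # ceiling division: one designated caller per work-group
        return -(-total_work_items // work_group_size)
    return 1   # kernel granularity: a single aggregated call
```

For a hypothetical kernel of 4096 work-items, work-item granularity floods the CPU with 4096 requests, wg64 issues 64, wg1024 issues 4, and kernel granularity issues just one, which explains both ends of the tradeoff in Figure 7.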
Blocking and ordering strategies: To quantify the impact of blocking/non-blocking with strong/relaxed ordering, we
4 Tmpfs is a filesystem without permanent backing storage. In other words, all structures and data are memory-resident.
[Figure 8: GPU time per permutation (ms) versus compute permutations (0-40), for weak-block, weak-non-block, strong-block, and strong-non-block.]

Figure 8. Performance implications of system call blocking and ordering semantics. The graph shows average and standard deviation of 80 runs.
designed a GPU microbenchmark that performs block permutation on an array, similar to the permutation steps performed in DES encryption. The input data array is preloaded with random values and divided into 8KB blocks. Work-groups, each executing 1024 work-items, independently permute blocks. The results are written to a file using pwrite at work-group granularity. The pwrite system calls for one block of data are overlapped with permutations on other blocks of data. To vary the amount of computation per system call, we permute multiple times before writing the result.
Figure 8 quantifies the impact of using blocking versus non-blocking system calls with strong and weak ordering. The x-axis plots the number of permutation iterations performed on each block by each work-group before writing the results. The y-axis plots execution time for one permutation (lower is better). Figure 8 shows that strongly ordered blocking invocations (strong-block) hurt performance. This is expected, as they require work-group scoped barriers to be placed before and after pwrite invocations. The GPU's hardware resources for work-groups are stalled waiting for the pwrite to complete. Not only does the inability to overlap GPU parallelism hurt strongly ordered blocking performance, it also means that GPU performance is heavily influenced by CPU-side performance vagaries like synchronization overheads, servicing other processes, etc. This is particularly true at iteration counts where system call overheads dominate GPU-side computation, i.e., below 15 compute iterations. Even when the application becomes GPU-compute bound, performance remains non-ideal.
Figure 8 shows that when pwrite is invoked in a non-blocking manner (with strong ordering), performance improves. This is because non-blocking pwrites permit the GPU to end the invoking work-group, freeing the GPU resources it was occupying. CPU-side pwrite processing can overlap with the execution of another work-group permuting a different block of data. Figure 8 shows that latencies generally drop by 30% compared to blocking calls at low iteration counts. At higher iteration counts (beyond 16), these benefits diminish because the latency to perform repeated
[Figure 9: CPU memcpy throughput (GB/s) versus the number of GPU atomically polled cache lines (0-20000), on the AMD FX-9800P with DDR4-1066MHz.]

Figure 9. Impact of polling on memory contention.
permutations dominates any system call processing times. For relaxed ordering with blocking (weak-block), the post-system-call work-group-scope barrier is eliminated. One out of every 16 wavefronts5 in the work-group waits for the blocking system call while the others exit, freeing GPU resources. The GPU can use these freed resources to run other work-items to hide CPU-side system call latency. Consequently, the performance trends follow those of strong-non-block, with minor differences arising from differences in the GPU's scheduling of work-groups for execution. Finally, Figure 8 shows that system call latency is best hidden using relaxed, non-blocking approaches (weak-non-block).
Polling/halt-resume and memory contention: As previously discussed, polling at work-item invocation granularity leads to memory reads of several thousands of per-work-item memory locations. We quantify the resulting memory contention in Figure 9, which shows how the throughput of CPU memory accesses decreases as the number of polled GPU cache lines increases. Once the number of polled memory locations outstrips the GPU's L2 cache capacity (roughly 4096 cache lines on our platform), the GPU polls locations spilled to DRAM. This causes contention on the memory controllers shared between GPUs and CPUs. We advocate using GENESYS with halt-resume approaches at such granularities of system call invocation.
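Whether polling spills out of the L2 follows from a simple capacity check: one polled cache line per slot, compared against roughly 4096 L2 lines. The active work-item count and default work-group size below are illustrative, not measured platform values.

```python
L2_CACHE_LINES = 4096   # approximate L2 capacity on our platform

def polled_lines(active_work_items, granularity, work_group_size=256):
    """One polled cache line per slot: per work-item, or one per group."""
    if granularity == "work-item":
        return active_work_items
    return -(-active_work_items // work_group_size)   # ceiling division

def polling_spills_to_dram(active_work_items, granularity, wg=256):
    """Polling thrashes once polled lines exceed L2 capacity."""
    return polled_lines(active_work_items, granularity, wg) > L2_CACHE_LINES
```

With, say, 20,480 active work-items, per-work-item polling touches 20,480 lines and spills to DRAM, while work-group polling touches only 80, which is why halt-resume wins at the finer granularity.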
Interrupt coalescing: Figure 10 shows the performance impact of coalescing system calls. We use a microbenchmark that invokes pread. We read data from files of different sizes using a constant number of work-items; more data is read per pread system call from the larger files. The x-axis shows the amount of data read per system call, and the y-axis quantifies the latency per requested byte. We present two bars for each point on the x-axis, illustrating the average time needed to read one byte in the absence of coalescing and when up to eight system calls are coalesced. Coalescing is most beneficial when small amounts of data are read. When more data is read, the system call takes longer, and the overhead reduction is negligible compared to the significantly longer time to process the system call.
5 Each wavefront has 64 work-items; thus, a 1024 work-item work-group has 16 wavefronts.
[Figure 10: GPU time per requested byte (ms) versus size per syscall (64B-4096B), uncoalesced versus coalesced (8 interrupts).]

Figure 10. Implications of system call coalescing. The graph shows average and standard deviation of 20 runs.
VIII. CASE STUDIES
A. Memory Workload
We now assess the end-to-end performance of workloads that use GENESYS. Our first application requires memory management system calls. We studied miniAMR [48] and used the madvise system call directly from the GPU to better manage memory. MiniAMR performs 3D stencil calculations using adaptive mesh refinement and is a candidate for memory management because it varies its memory needs in a data-model-dependent manner. For example, when simulating regions experiencing turbulence, miniAMR needs to model with higher resolution. However, if lower resolution is possible without compromising simulation accuracy, miniAMR reduces memory and computation usage, making it possible to free memory. While relinquishing excess memory in this way is not possible in traditional GPU programs without explicitly dividing the offload into multiple kernels interspersed with CPU code (see Figure 1), GENESYS permits this with direct GPU madvise invocations. We invoke madvise at work-group granularity with non-blocking invocation and weak ordering. We leverage GENESYS' generality by also using getrusage to read the application's resident set size (RSS). When the RSS exceeds a watermark, madvise relinquishes memory.
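The watermark policy reduces to a simple check. The sketch below simulates it with a stubbed release callback standing in for madvise(MADV_DONTNEED)-style hints and hypothetical RSS, watermark, and range values; actually issuing madvise from the GPU requires GENESYS.

```python
def maybe_release(rss_bytes, watermark_bytes, unused_ranges, release_fn):
    """If resident set size exceeds the watermark, advise the OS that
    currently-unused ranges are no longer needed. release_fn stands in
    for a madvise(addr, length, MADV_DONTNEED)-style call."""
    if rss_bytes <= watermark_bytes:
        return 0                       # under the watermark: keep memory
    for start, length in unused_ranges:
        release_fn(start, length)
    return len(unused_ranges)

# Hypothetical usage: RSS of ~4.1GB against a 3GB watermark.
released = []
n = maybe_release(4_100_000_000, 3_000_000_000,
                  [(0x7f0000000000, 1 << 20)],
                  lambda start, length: released.append((start, length)))
```

A lower watermark triggers the release path more often, which is exactly the memory-versus-runtime tradeoff the rss-3GB and rss-4GB curves exhibit.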
[Figure 11: RSS (GB) over time (s) for the rss-3GB and rss-4GB watermarks.]

Figure 11. Memory footprint of miniAMR using getrusage and madvise to hint at unused memory.
We execute miniAMR with a dataset of 4.1GB, just beyond the hard limit we put on the physical memory available to the GPU. Without using madvise, memory swapping increases
so dramatically that it triggers GPU timeouts, causing the existing GPU device driver to terminate the application. Because of this, there is no baseline to compare to, as the baseline simply does not complete.
Figure 11 shows two results: one for a 3GB RSS watermark and one for 4GB. Not only does GENESYS enable miniAMR to complete, it also permits the programmer to balance memory usage and performance. While rss-3GB lowers memory utilization, it also worsens runtime compared to rss-4GB. This performance gap is expected: the more memory is released, the greater the likelihood that the GPU program suffers page faults when the memory is touched again in the future, and the more frequently the madvise system call is invoked. Overall, Figure 11 illustrates that GENESYS allows programmers to manage memory so as to trade memory usage for performance on GPUs, analogous to CPUs.
B. Workload Using Signals
[Figure 12: running time (s) of the baseline versus the GENESYS version.]

Figure 12. Runtime of CPU-GPU map-reduce workload.
GENESYS also enables system calls that permit the GPU to send asynchronous notifications to the CPU. This is useful in many scenarios. We study one such scenario and implement a map-reduce application called signal-search. The application runs in two phases. The first phase performs parallel lookup in a
data array, while the second phase processes blocks of retrieved data and computes sha512 checksums on them. The first phase is a good fit for GPUs, since the highly parallel lookup fits their execution model, while the second phase is more appropriate for CPUs, many of which support performance acceleration of sha checksums via dedicated instructions.
Without support for signal invocation on GPUs, programmers would launch a kernel with the entire parallel lookup on the GPU and wait for it to conclude before proceeding with the sha512 checksums. GENESYS overcomes this restriction and permits a heterogeneous version of this code, where GPU work-groups can emit signals using rt_sigqueueinfo to the CPU, indicating per-block completions of the parallel search. As a result, the CPU can begin processing these blocks immediately, permitting overlap with GPU execution.
Operationally, rt_sigqueueinfo is a generic signal system call that allows the caller to fill a siginfo structure that is passed along with the signal. In our workload, we find that work-group level invocations perform best, so the GPU passes the identifier of the work-group through the siginfo structure. Figure 12 shows that using work-group invocation granularity and non-blocking invocation results in a roughly 14% performance speedup over the baseline.
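The overlap that signals enable is a producer-consumer pipeline: the CPU checksums each block as soon as its completion is announced rather than after the whole lookup finishes. In the sketch below, a thread stands in for the GPU lookup phase and a queue stands in for rt_sigqueueinfo notifications; hashlib.sha512 is the only real API used.

```python
import hashlib
import queue
import threading

def overlapped_checksums(blocks):
    """'GPU' thread announces each finished block; the 'CPU' consumer
    computes sha512 digests as notifications arrive, so phase 2 of one
    block overlaps phase 1 of the next."""
    done = queue.Queue()

    def lookup_phase():
        for i, blk in enumerate(blocks):
            done.put((i, blk))      # stands in for rt_sigqueueinfo
        done.put(None)              # sentinel: no more blocks

    digests = {}
    producer = threading.Thread(target=lookup_phase)
    producer.start()
    while (item := done.get()) is not None:
        i, blk = item
        digests[i] = hashlib.sha512(blk).hexdigest()
    producer.join()
    return digests

d = overlapped_checksums([b"alpha", b"beta"])
```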
C. Storage Workloads
We have also studied how to use GENESYS to support storage workloads. In some cases, GENESYS permits the implementation of workloads supported by GPUfs, but in more efficient ways.
[Figure 13: running times (s). a) CPU-grep, GPU-GENESYS (WG), GPU-GENESYS (WI), GPU-GENESYS (WI)-suspend, and CPU-openmp. b) GPU-GENESYS, GPU-NOSYS, and CPU.]

Figure 13. a) Standard grep and OpenMP grep versus work-item/work-group invocations with GENESYS. b) Comparing the performance of CPU, GPU with no system calls, and GENESYS implementations of wordcount. Graphs show average and standard deviation of 10 runs.
Supporting storage workloads more efficiently than GPUfs: We implement a workload that takes as input a list of words and a list of files. It then performs grep -F -l on the GPU, identifying which files contain any of the words on the list. As soon as these files are found, their names are printed to the console. This workload cannot be supported by GPUfs without significant code refactoring because of its use of custom APIs. Instead, since GENESYS naturally adheres to the UNIX "everything is a file" philosophy, porting grep to GPUs requires only a few hours of programming.
Figure 13 shows the results of our GPU grep experiments. We compare a standard CPU implementation, a parallelized OpenMP CPU implementation, and two GPU implementations with GENESYS, with non-blocking system calls invoked at work-item (WI) and work-group (WG) granularities. Furthermore, since work-item invocations can sometimes achieve better performance using halt-resume (versus work-group and kernel invocations, which always achieve better performance with polling), we separate results for WI-polling and WI-halt-resume. GENESYS enables our GPU grep implementation to print output lines to the console or to files using standard OS output. GENESYS achieves 20-23x speedups over an OpenMP version of grep.
Figure 13 also shows that GENESYS' flexibility with invocation granularity, blocking/non-blocking invocation, and strong/relaxed ordering can boost performance (see Section V). For our grep example, because a file only needs to be searched until the first instance of a matching word, a work-item can immediately invoke a write of the filename rather than waiting for all matching files. We find that WI-halt-resume outperforms both WG and WI-polling by roughly 3-4%.
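The early-exit property that makes per-work-item writes attractive comes from the semantics of grep -F -l itself: each file's scan stops at the first matching line. An in-memory sketch (a dict of hypothetical file names to contents stands in for real files):

```python
def grep_F_l(words, files):
    """Return names of files containing any of the fixed strings,
    stopping each file's scan at its first match (grep -F -l)."""
    matched = []
    for name, text in files.items():
        for line in text.splitlines():
            if any(w in line for w in words):
                matched.append(name)   # emit immediately; skip the rest
                break
    return matched
```

In the GPU version, the point where the sketch appends the name is where a work-item can issue its write at once, instead of holding results until the kernel ends.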
[Figure 14: CPU utilization (%) and I/O throughput (MB/s) over time (s) for GPU-NOSYS, GPU-GENESYS, and CPU.]

Figure 14. Wordcount I/O and CPU utilization reading from SSD. Graphs show average and standard deviation of 10 runs.
Workload from prior work: GENESYS also supports a version of the same workload that is evaluated in the original GPUfs work: traditional word count using open, read, and close system calls. Figure 13 shows our results. We compare the performance of a parallelized CPU version of the workload with OpenMP, a GPU version of the workload with no system calls, and a GPU version with GENESYS. Both CPU and GPU workloads are configured to search for occurrences of 64 strings. We found that with GENESYS, these system calls were best invoked at work-group granularity with blocking and weak-ordering semantics. All results are collected on a system with an SSD.
We found that GENESYS achieves nearly 6x performance improvement over the CPU version. Without system call support, the GPU version is far worse than the CPU version. Figure 14 sheds light on these benefits. We plot traces for CPU and GPU utilization and disk I/O throughput. GENESYS extracts much higher throughput from the underlying storage device (up to 170MB/s, compared to the CPU's 30MB/s). Offloading search string processing to the GPU frees up the CPU to process system calls effectively. The change in CPU utilization between the GPU workload and the CPU workload reveals this trend. In addition, we found that the GPU's ability to launch more concurrent I/O requests enabled the I/O scheduler to make better scheduling decisions.
D. Network Workloads
We have studied the benefits of GENESYS for network I/O in the context of memcached. While this could technically be implemented using GPUnet, the performance implications are unclear, because the original GPUnet paper used APIs that extracted performance from dedicated RDMA hardware. We make no such assumptions and focus on the two core commands: SET and GET. SET stores a key-value pair, and GET retrieves the value associated with a key if it is present. Our implementation supports a binary UDP memcached version with a fixed-size hash table as back-end storage. The hash table is shared between CPU and GPU, and its bucket size and count are configurable. Further, this memcached
[Figure 15: latency (ms) and throughput (ops/s) for misses and hits, comparing CPU, GPU-GENESYS, and GPU-NOSYS.]

Figure 15. Latency and throughput of memcached. Graphs show average and standard deviation of 20 runs.
implementation enables concurrent operations: CPUs can handle SETs and GETs, while the GPU supports only GETs. Our GPU implementation parallelizes the hash computation, bucket lookup, and data copy. We use the sendto and recvfrom system calls for UDP network access. These system calls are invoked at work-group granularity with blocking and weak ordering, as this performs best.
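The GET path over the shared fixed-size hash table can be sketched sequentially; on the GPU, the bucket scan and data copy are what get parallelized. The class name, bucket count, and bucket size below are illustrative, not our actual configuration.

```python
class FixedHashTable:
    """Toy model of the shared back-end store: a fixed number of
    buckets, each a bounded list of (key, value) pairs."""
    def __init__(self, bucket_count=8, bucket_size=4):
        self.bucket_count = bucket_count
        self.bucket_size = bucket_size
        self.buckets = [[] for _ in range(bucket_count)]

    def set(self, key, value):
        """CPU-side SET: update in place, or append if there is room."""
        bucket = self.buckets[hash(key) % self.bucket_count]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return True
        if len(bucket) == self.bucket_size:
            return False                 # bucket full: fixed-size table
        bucket.append((key, value))
        return True

    def get(self, key):
        """GET: hash the key, then scan one bucket. On the GPU, the
        scan over bucket entries is done by parallel work-items."""
        for k, v in self.buckets[hash(key) % self.bucket_count]:
            if k == key:
                return v
        return None
```

The larger the buckets, the more entries a single GET must scan, which is why GPU parallelism pays off at 1024 elements per bucket in Figure 15.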
Figure 15 compares the performance of a CPU version of this workload with GPU versions using GENESYS. GPUs accelerate memcached by parallelizing lookups on buckets with more elements. For example, Figure 15 shows speedups when there are 1024 elements per bucket (with 1KB data size). Without system calls, GPU performance lags behind CPU performance. However, GENESYS achieves 30-40% latency and throughput benefits over not just the CPU versions but also GPU versions without direct system calls.
E. Device Control
Finally, we also used GENESYS to implement ioctl system calls. As an example, we used ioctl to query and control framebuffer settings. The implementation is straightforward: the GPU opens /dev/fb0 and issues a series of ioctl commands to query and set the settings of the active frame buffer. It then proceeds to mmap the framebuffer memory and fill it with data from a previously mmaped raster image. This results in the image being displayed on the computer screen. While not a critical GPGPU application, this ioctl example demonstrates the generality and flexibility of the OS interfaces implemented by GENESYS.
IX. DISCUSSION
Asynchronous system call handling: GENESYS enqueues the GPU system call's kernel task and processes it outside of the interrupt context. We do this because Linux is designed such that few operations can be processed in an interrupt handler. A potential concern with this design, however, is that it defers system call processing, potentially past the end of the lifetime of the GPU thread, and potentially past that of the process that created the GPU thread itself. It is an example of a more general problem with asynchronous system calls [49].
Figure 16. Raster image copied to the framebuffer by the GPU.
Our solution is to provide a new function call, invoked by the CPU, that ensures all GPU system calls have completed before the termination of the process.
Related work: Beyond work already discussed [25, 26, 28], the latest generations of C++AMP [23] and CUDA [24] provide access to the memory allocator. These efforts use a user-mode service thread on the CPU to proxy requests from the GPU [27]. Like GENESYS, system call requests are placed in a shared queue by the GPU. From there, however, the designs are different: their user-mode thread polls this shared queue and "thunks" the request to the libc runtime or OS. This incurs added overhead for entering and leaving the OS kernel.
Some studies provide network access to GPU code [26, 50-52]. NVIDIA provides GPUDirect [52], used by several MPI libraries [53-55], which allows the NIC to bypass main memory and communicate directly with memory on the GPU itself. GPUDirect does not provide a programming interface for GPU-side code; the CPU must initiate communication with the NIC. Oden exposed the memory-mapped control interface of the NIC to the GPU and thereby allowed the GPU to directly communicate with the NIC [51]. This low-level interface, however, lacks the benefits of a traditional OS interface (e.g., protection, sockets, TCP).
X. CONCLUSIONS
We shed light on research questions fundamental to the idea of accessing OS services from accelerators by realizing an interface for generic POSIX system call support on GPUs. Enabling such support requires subtle changes to existing kernels. In particular, traditional OSes assume that system call processing occurs in the same context as the invoking thread, and this needs to change for accelerators. We have released GENESYS to make these benefits accessible for broader research on GPUs.
XI. ACKNOWLEDGMENTS
We thank the National Science Foundation, which partially supported this work through grants 1253700 and 1337147. We thank Guilherme Cox and Mark Silberstein for their feedback. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a registered trademark of PCI-SIG Corporation. ARM is a registered trademark of ARM Limited. OpenCL is a trademark of Apple Inc. used by permission by Khronos. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. (c) 2018 Advanced Micro Devices, Inc. All rights reserved.
REFERENCES
[1] S. Mittal and J. Vetter, "A Survey of CPU-GPU Heterogeneous Computing Techniques," in ACM Computing Surveys, 2015.
[2] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. Lefohn, and T. Purcell, "A Survey of General-Purpose Computation on Graphics Hardware," in Computer Graphics Forum, Vol. 26, Num. 1, pp. 80-113, 2007.
[3] D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006.
[4] J. Vesely, A. Basu, M. Oskin, G. Loh, and A. Bhattacharjee, "Observations and Opportunities in Architecting Shared Virtual Memory for Heterogeneous Systems," in International Symposium on Performance Analysis of Systems and Software (ISPASS), 2016.
[5] J. Obert, J. V. Waveran, and G. Sellers, "Virtual Texturing in Software and Hardware," in SIGGRAPH Courses, 2012.
[6] G. Cox and A. Bhattacharjee, "Efficient Address Translation with Multiple Page Sizes," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017.
[7] L. Olson, J. Power, M. Hill, and D. Wood, "Border Control: Sandboxing Accelerators," in International Symposium on Microarchitecture (MICRO), 2015.
[8] B. Pichai, L. Hsu, and A. Bhattacharjee, "Architectural Support for Address Translation on GPUs," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014.
[9] J. Power, M. Hill, and D. Wood, "Supporting x86-64 Address Translation for 100s of GPU Lanes," in International Symposium on High Performance Computer Architecture (HPCA), 2014.
[10] N. Agarwal, D. Nellans, E. Ebrahimi, T. Wenisch, J. Danskin, and S. Keckler, "Selective GPU Caches to Eliminate CPU-GPU HW Cache Coherence," in International Symposium on High Performance Computer Architecture (HPCA), 2016.
[11] HSA Foundation, "HSA Programmer's Reference Manual. [Online]," 2015. http://www.hsafoundation.com/?ddownload=4945
[12] G. Kyriazis, "Heterogeneous System Architecture: A Technical Review," 2012.
[13] J. Power, A. Basu, J. Gu, S. Puthoor, B. Beckmann, M. Hill, S. Reinhardt, and D. Wood, "Heterogeneous System Coherence for Integrated CPU-GPU Systems," in International Symposium on Microarchitecture (MICRO), 2013.
[14] S. Sahar, S. Bergman, and M. Silberstein, "Active Pointers: A Case for Software Address Translation on GPUs," in International Symposium on Computer Architecture (ISCA), 2016.
[15] C. J. Thompson, S. Hahn, and M. Oskin, "Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis," in International Symposium on Microarchitecture (MICRO), 2002.
[16] J. Owens, U. Kapasi, P. Mattson, B. Towles, B. Serebrin, S. Rixner, and W. Dally, "Media Processing Applications on the Imagine Stream Processor," in International Conference on Computer Design (ICCD), 2002.
[17] J. Owens, B. Khailany, B. Towles, and W. Dally, "Comparing Reyes and OpenGL on a Stream Architecture," in SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2002.
[18] S. Rixner, W. Dally, U. Kapasi, B. Khailany, A. Lopez-Lagunas, P. Mattson, and J. Owens, "A Bandwidth-Efficient Architecture for Media Processing," in International Symposium on Microarchitecture (MICRO), 1998.
[19] B. Serebrin, J. Owens, B. Khailany, P. Mattson, U. Kapasi, C. Chen, J. Namkoong, S. Crago, S. Rixner, and W. Dally, "A Stream Processor Development Platform," in International Conference on Computer Design (ICCD), 2002.
[20] Khronos Group, "OpenGL," 1992. https://www.khronos.org/opengl
[21] Microsoft, "Programming Guide for DirectX. [Online]," 2016. https://msdn.microsoft.com/en-us/library/windows/desktop/ff476455(v=vs.85).aspx
[22] Khronos Group, "OpenCL," 2009. https://www.khronos.org/opencl
[23] Microsoft, "C++ AMP (C++ Accelerated Massive Parallelism). [Online]," 2016. https://msdn.microsoft.com/en-us/library/hh265137.aspx
[24] NVidia, "CUDA Toolkit Documentation," 2016. http://docs.nvidia.com/cuda
[25] M. Silberstein, B. Ford, I. Keidar, and E. Witchel, "GPUfs: Integrating File Systems with GPUs," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.
[26] S. Kim, S. Huh, X. Zhang, Y. Hu, A. Wated, E. Witchel, and M. Silberstein, "GPUnet: Networking Abstractions for GPU Programs," in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
[27] J. Owens, J. Stuart, and M. Cox, "GPU-to-CPU Callbacks," in Workshop on Unconventional High Performance Computing, 2010.
[28] S. Bergman, T. Brokhman, T. Cohen, and M. Silberstein, "SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs," in USENIX Annual Technical Conference (ATC), 2017.
[29] AMD, "Radeon Open Compute. [Online]," 2017. https://rocm.github.io
[30] AMD, "ROCK_syscall," 2017. https://github.com/RadeonOpenCompute/ROCK_syscall
[31] AMD, "ROCT_syscall," 2017. https://github.com/RadeonOpenCompute/ROCT_syscall
[32] AMD, "HCC_syscall," 2017. https://github.com/RadeonOpenCompute/HCC_syscall
[33] AMD, "Genesys_syscall_test," 2017. https://github.com/RadeonOpenCompute/Genesys_syscall_test
[34] R. Draves, B. Bershad, R. Rashid, and R. Dean, "Using Continuations to Implement Thread Management and Communication in Operating Systems," in Symposium on Operating Systems Principles (SOSP), 1991.
[35] O. Shivers and M. Might, "Continuations and Transducer Composition," in International Symposium on Programming Languages Design and Implementation (PLDI), 2006.
[36] AMD, "Graphics Core Next 3.0 ISA. [Online]," 2013. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf
[37] Intel, "Intel Open Source HD Graphics Programmers Reference Manual. [Online]," 2014. https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-bdw-vol02b-commandreference-instructions_0_0.pdf
[38] NVidia, "NVIDIA Compute PTX: Parallel Thread Execution," 2009. https://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf
[39] PCI-SIG, "Atomic Operations. [Online]," 2008. https://pcisig.com/sites/default/files/specification_documents/ECN_Atomic_Ops_080417.pdf
[40] T. Sorensen, A. F. Donaldson, M. Batty, G. Gopalakrishnan, and Z. Rakamaric, "Portable Inter-Workgroup Barrier Synchronisation for GPUs," in Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), pp. 39-58, 2016.
[41] T. Rogers, M. O'Connor, and T. Aamodt, "Cache-Conscious Wavefront Scheduling," in International Symposium on Microarchitecture (MICRO), 2012.
[42] T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," in International Symposium on Microarchitecture (MICRO), 2013.
[43] A. ElTantawy and T. Aamodt, "Warp Scheduling for Fine-Grained Synchronization," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[44] H. Wang, F. Luo, M. Ibrahim, O. Kayiran, and A. Jog, "Efficient and Fair Multi-programming in GPUs via Pattern-Based Effective Bandwidth Management," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[45] O. Kayiran, A. Jog, M. Kandemir, and C. Das, "Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs," in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013.
[46] X. Tang, A. Pattnaik, H. Jiang, O. Kayiran, A. Jog, S. Pai, M. Ibrahim, M. T. Kandemir, and C. R. Das, "Controlled Kernel Launch for Dynamic Parallelism in GPUs," in International Symposium on High Performance Computer Architecture (HPCA), 2017.
[47] AMD, "ROCm: a New Era in Open GPU Computing. [Online]," 2016. https://radeonopencompute.github.io/index.html
[48] A. Sasidharan and M. Snir, "MiniAMR - A miniapp for Adaptive Mesh Refinement," 2014. http://hdl.handle.net/2142/91046
[49] Z. Brown, "Asynchronous System Calls," in Ottawa Linux Symposium, 2007.
[50] J. A. Stuart, P. Balaji, and J. D. Owens, "Extending MPI to Accelerators," in PACT 2011 Workshop Series: Architectures and Systems for Big Data, Oct. 2011.
[51] L. Oden, "Direct Communication Methods for Distributed GPUs," Dissertation Thesis, Ruprecht-Karls-Universitat Heidelberg, 2015.
[52] NVidia, "NVIDIA GPUDirect. [Online]," 2013. https://developer.nvidia.com/gpudirect
[53] H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, "GPU-aware MPI on RDMA-enabled Clusters: Design, Implementation and Evaluation," IEEE Transactions on Parallel and Distributed Systems, vol. 25, pp. 2595-2605, Oct. 2014.
[54] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, "Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation," in Proceedings, 11th European PVM/MPI Users' Group Meeting (Budapest, Hungary), pp. 97-104, September 2004.
[55] IBM, "IBM Spectrum MPI," 2017. http://www-03.ibm.com/systems/spectrum-computing/products/mpi
Table I
GENESYS ENABLES NEW CLASSES OF APPLICATIONS AND SUPPORTS ALL PRIOR WORK

Type | Application | Syscalls | Description

Previously Unrealizable:
Memory Management | miniAMR | madvise, getrusage | Uses madvise to return unused memory to the OS (Sec. VIII-A)
Signals | signal-search | rt_sigqueueinfo | Uses signals to notify the host about partial work completion (Sec. VIII-B)
Filesystem | grep | read, open, close | Work-item invocations not supported by prior work; prints to terminal (Sec. VIII-C)
Device Control (ioctl) | bmp-display | ioctl, mmap | Kernel-granularity invocation to query and set up framebuffer properties (Sec. VIII-E)

Previously Realizable:
Filesystem | wordsearch | pread, read | Supports the same workloads as prior work (GPUfs) (Sec. VIII-C)
Network | memcached | sendto, recvfrom | Possible with GPUnet, but we do not need RDMA for performance (Sec. VIII-D)
Figure 1. (Left) Timeline of events when the GPU has to rely on a CPU to handle system services, and (right) when the GPU has system call invocation capabilities.
In response, recent studies invoke OS services from GPUs [25–28], as shown on the right in Figure 1. This approach eliminates the need for repeated kernel launches, enabling better GPU efficiency. System calls (e.g., request data) still require CPU processing, as they often require access to hardware resources that only the CPU interacts with. However, CPUs only need to schedule tasks in response to GPU system calls as and when needed. CPU system call processing also overlaps with the execution of other GPU threads. Studies on GPUfs, GPUnet, GPU-to-CPU callbacks, and SPIN are seminal in demonstrating such direct invocation of OS-managed services, but suffer from key drawbacks.
Lack of generality: They target specific OS-managed services and therefore realize only specific APIs for filesystem or networking services. These interfaces are not readily extensible to other system calls/OS services.
Lack of flexibility: They focus on specific system call invocation strategies. Consequently, there has been no design space exploration on the general GPU system call interface. Questions such as the best invocation granularity (i.e., whether system calls should be invoked per work-item, work-group, or kernel) or ordering remain unexplored and, as we show, can affect performance in subtle and important ways.
Reliance on non-standard APIs: Their use of custom APIs precludes the use of many OS-managed services (e.g., memory management, signals, process/thread management, scheduler). Further, custom APIs do not readily take advantage of existing OS code paths that enable a richer set of system features. Recent work on SPIN points this out for filesystems, where using custom APIs causes issues with page caches, filesystem consistency, and incompatibility with virtual block devices such as software RAID.
While these past efforts broke new ground and demonstrated the value of OS services for GPU programs, they did not explore the question we pose: why not simply provide generic access from the GPU to all POSIX system calls?
III. HIGH-LEVEL DESIGN

Figure 2. High-level overview of how GPU system calls are invoked and processed on CPUs.
Figure 2 outlines the steps used by GENESYS. When the GPU invokes a system call, it has to rely on the CPU to process the system call on its behalf. Therefore, in step 1, the GPU places system call arguments and information in a portion of system memory also visible to the CPU. We designate a portion of memory as the syscall area to store this information. In step 2, the GPU interrupts the CPU, conveying the need for system call processing. Within the interrupt message, the GPU also sends the ID number of the wavefront issuing the system call. This triggers the execution of an interrupt handler on the CPU. The CPU uses the wavefront ID to read the system call arguments from the syscall area in step 3. Subsequently, in step 4, the CPU processes the interrupt and writes the results back into the syscall area. Finally, in step 5, the CPU notifies the GPU wavefront that its system call has been processed.
We rely on the ability of the GPU to interrupt the CPU and use readily-available hardware [36–38] for this. However, this is not a fundamental design requirement; in fact, prior work [25, 27] uses a CPU polling thread to service a limited set of GPU system service requests instead. Further, while increasingly widespread features such as shared virtual memory and CPU-GPU cache coherence [4, 8, 9, 13] are beneficial to our design, they are not necessary. CPU-GPU communication can also be achieved via atomic reads/writes in system memory or GPU device memory [39].
IV. ANALYZING SYSTEM CALLS

While designing GENESYS, we classified all of Linux's over 300 system calls and assessed which ones to support. Some of the classifications were subjective and were debated even among ourselves. Many of the classification issues relate to the unique nature of the GPU's execution model.
Recall that GPUs use SIMD execution on thousands of concurrent threads. To keep such massive parallelism tractable, GPGPU programming languages like OpenCL [22] and CUDA [24] expose hierarchical groups of concurrently executing threads to programmers. The smallest granularity of execution is the GPU work-item (akin to a CPU thread). Several work-items (e.g., 32-64) operate in lockstep in the unit of wavefronts, the smallest hardware-scheduled unit of execution. Many wavefronts (e.g., 16) constitute programmer-visible work-groups and execute on a single GPU compute unit (CU). Work-items in a work-group can communicate among themselves using local CU caches and/or scratchpads. Hundreds of work-groups comprise a GPU kernel. The CPU dispatches work to a GPU at the granularity of a kernel. Each work-group in a kernel can execute independently. Further, it is possible to synchronize just the work-items within a single work-group [12, 22]. This avoids the cost of globally synchronizing across thousands of work-items in a kernel, which is often unnecessary in a GPU program and might not be possible under all circumstances.2
The bottom line is that GPUs rely on far greater forms of parallelism than CPUs. This implies the following OS/architectural considerations in designing system calls.
Level of OS visibility into the GPU: When a CPU thread invokes the OS, that thread has a representation within the kernel. The vast majority of modern OS kernels maintain a data structure for each thread for several common tasks (e.g., kernel resource use, permission checks, auditing). GPU tasks, however, have traditionally not been represented in the OS kernel. We believe this should not change. As discussed above, GPU threads are numerous and short-lived. Creating and managing a kernel structure for thousands of individual GPU threads would vastly slow down the system. These structures are also only useful for GPU threads that invoke the OS, and represent wasted overhead for the rest. Hence, we process system calls in OS worker threads and switch CPU contexts if necessary (see Section VI). As more GPUs support system calls, this is an area that will require careful consideration by kernel developers.

2 Although there is no single formally-specified barrier to synchronize across work-groups today, recent work shows how to achieve the same effect by extending existing non-portable GPU inter-work-group barriers to use OpenCL 2.0 atomic operations [40].
Evolution of GPU hardware: Many system calls are heavily dependent on architectural features that could be supported by future GPUs. For example, consider that modern GPUs do not expose their thread scheduler to software. This means that system calls to manage thread affinity (e.g., sched_setaffinity) are not implementable on GPUs today. However, a wealth of research has shown the benefits of GPU warp scheduling [41–45], so should GPUs require more sophisticated thread scheduling support appropriate for implementation in software, such system calls may ultimately become valuable.
With these design themes in mind, we discuss our classification of Linux's system calls.
1) Readily implementable: Examples include pread, pwrite, mmap, munmap, etc. This group is also the largest subset, comprising nearly 79% of Linux's system calls. In GENESYS, we implemented 14 such system calls for filesystems (read, write, pread, pwrite, open, close, lseek), networking (sendto, recvfrom), memory management (mmap, munmap, madvise), system calls to query resource usage (getrusage), and signal invocation (rt_sigqueueinfo). Furthermore, we also implement device control ioctls. Some of these system calls, like read, write, and lseek, are stateful. Thus, GPU programmers must use them carefully: the current value of the file pointer determines what value is read or written by the read or write system call. This can be arbitrary if invoked at work-item or work-group granularity for the same file descriptor, because many work-items/work-groups can execute concurrently.
An important design issue is that GENESYS's support for standard POSIX APIs allows GPUs to read, write, and mmap any file descriptor Linux provides. This is particularly beneficial because of Linux's "everything is a file" philosophy: GENESYS readily supports features like the terminal for user I/O, pipes (including redirection of stdin, stdout, and stderr), files in /proc to query process environments, files in /sys to query and manipulate kernel parameters, etc. Although our studies focus on Linux, the broader domains of OS services represented by the specific system calls generalize to other
Table II
EXAMPLES OF SYSTEM CALLS THAT REQUIRE HARDWARE CHANGES TO BE IMPLEMENTABLE ON GPUS. IN TOTAL, THIS GROUP CONSISTS OF 13% OF ALL LINUX SYSTEM CALLS; IN CONTRAST, WE BELIEVE THAT 79% OF LINUX SYSTEM CALLS ARE READILY IMPLEMENTABLE.

Type | Examples | Reason that it is not currently implementable
capabilities | capget, capset | Needs GPU thread representation in the kernel
namespace | setns | Needs GPU thread representation in the kernel
policies | set_mempolicy | Needs GPU thread representation in the kernel
thread scheduling | sched_yield, set_cpu_affinity | Needs better control over the GPU scheduler
signals | sigaction, sigsuspend, sigreturn, sigprocmask | Signals require the target thread to be paused and then resumed after the signal action has been completed; GPU threads cannot be targeted. It is currently not possible to independently set program counters of individual threads. Executing signal actions in newly spawned threads might require freeing of GPU resources.
architecture specific | ioperm | Not accessible from GPU
OSes like FreeBSD and Solaris.
At the application level, implementing this array of system calls opens new domains of OS-managed services for GPUs. In Table I, we summarize previously unimplementable system calls realized and studied in this paper. These include applications that use madvise for memory management and rt_sigqueueinfo for signals. We also go beyond prior work on GPUfs by supporting filesystem services that require more flexible APIs with work-item invocation capabilities for good performance. Finally, we continue to support previously implementable system calls.
2) Useful, but implementable only with changes to GPU hardware: Several system calls (13% of the total) seem useful for GPU code but are not easily implementable because of Linux's existing design. Consider sigsuspend/sigaction: there is no kernel representation of a GPU work-item to manage and dispatch a signal to. Additionally, there is no lightweight method to alter the GPU program counter of a work-item from the CPU kernel. One approach is for signal masks to be associated with the GPU context and for signals to be delivered as additional work-items. This works around the absence of GPU work-item representation in the kernel. However, POSIX requires threads that process signals to pause execution and resume only after the signal has been processed. Targeting the entire GPU context would mean that all GPU execution needs to halt while the work-item processing the signal executes, which goes against the parallel nature of GPU execution. Recent work has, however, shown the benefits of hardware support for dynamic kernel launch that allows on-demand spawning of kernels on the GPU without any CPU intervention [46]. Should such approaches be extended to support thread recombination, assembling multiple signal handlers into a single warp (akin to prior approaches on control flow divergence [42]), sigsuspend or sigaction may become implementable. Table II presents more examples of currently unimplementable system calls (in their original semantics).
3) Requires extensive modification to be supported: This group (8% of the total) contains perhaps the most controversial set of system calls. At this time, we do not believe that it is worth the implementation effort to support these system calls. For example, fork necessitates cloning a copy of the executing caller's GPU state. Technically, this can be done (e.g., it is how GPGPU code is context-switched with the graphics shaders), but it seems unnecessary at this time.
V. DESIGN SPACE EXPLORATION

A. GPU-Side Design Considerations
Invocation granularity: In assessing how best to use GPU system calls, several questions arise. The first and most important question is: how should system calls be aligned with the hierarchy of GPU execution groups? Should a GPU system call be invoked separately for each work-item, once for every work-group, or once for the entire GPU kernel?
Consider a GPU program that writes sorted integers to a single output file. One might, at first blush, invoke the write system call at each work-item. This can present correctness issues, however, because write is position-relative and requires access to a global file pointer. Without synchronizing the execution of write across work-items, the output file will be written in a non-deterministic, unsorted order.
Using different system call invocation granularities can fix this issue. One could, for example, use a memory location to temporarily buffer the sorted output. Once all work-items have updated this buffer, a single write system call at the end of the kernel can be used to write the contents of the buffer to the output file. This approach loses some benefits of the GPU's parallel resources, because the entire system call latency is exposed on the program's critical path and might not be overlapped with the execution of other work-items. Alternatively, one could use pwrite system calls instead of write system calls. Because pwrite allows programmers to specify the absolute file position where the output is to be written, per-work-item invocations present no correctness issue. However, per-work-item pwrites result in a flood of system calls, potentially harming performance.
Overly coarse kernel-grained invocations also restrict performance by reducing the possibility of overlapping system call processing with GPU execution. A compromise may be to invoke a pwrite system call per work-group, buffering the sorted output of the work-items until the per-work-group buffers are fully populated. Section VII demonstrates that these decisions can lead to a 1.75× performance difference.

Figure 3. Work-items in a work-group (shown as a blue box) execute strongly ordered system calls.

Figure 4. Work-group invocations can be relax-ordered by removing one of the two barriers.
System call ordering semantics: When programmers invoke system calls on CPUs, they expect that all program instructions before the system call will complete execution before the system call executes. They also expect that instructions after the system call will only commence once the system call returns. We call this "strong ordering." For GPUs, however, we introduce "relaxed ordering" semantics. The notion of relaxed ordering is tied to the hierarchical execution scopes of the GPU and is needed for both correctness and performance.
Figure 3 shows a programmer-visible work-group (in blue) consisting of four wavefronts: A, B, C, and D. Each wavefront has two work-items (e.g., A0 and A1). If system calls (SysCs) are invoked per work-item, they are strongly ordered. Another approach is depicted in Figure 4, where one work-item, A0, invokes a system call on behalf of the entire work-group. Strong ordering is achieved by placing work-group barriers (Bar1, Bar2) before and after system call execution. One can remove these barriers with relaxed ordering, allowing threads in wavefronts B, C, and D to execute, overlapping with the CPU's processing of A's system call.
For correctness, we need to allow programmers to use relaxed ordering when system calls are invoked at kernel granularity (across work-groups). This is because kernels can consist of more work-items than can concurrently execute on the GPU. For strongly ordered system calls at the kernel level, all kernel work-items must finish executing pre-invocation instructions prior to invoking the system call. But all work-items cannot execute concurrently, because GPU runtimes do not preemptively context switch work-items of the same kernel. It is not always possible for all work-items to execute all instructions prior to the system call. Strong ordering at kernel granularity risks deadlocking the GPU.
At the work-group invocation granularity, relaxed ordering can improve performance by avoiding synchronization overheads and overlapping CPU-side system call processing with the execution of other work-items. The key is to remove the barriers Bar1/Bar2 in Figure 4. To do this, consider that from the application point of view, system calls are usually producers or consumers of data. Take a consumer call like write, invoked at the work-group level. Real-world GPU programs may use multiple work-items to generate the data for the write, but instructions after the write call typically do not depend on the write outcome. Therefore, other work-items in the group need not wait for the completion of the system call, meaning that we can remove Bar2, improving performance. Similarly, producer system calls like read typically require system calls to return before other work-items can start executing program instructions post-read, but do not necessarily require other work-items in the work-group to finish executing all instructions before the read invocation. Bar1 in Figure 4 becomes unnecessary in these cases. The same observations apply to per-kernel system call invocations, which need relaxed ordering for correctness anyway.
In summary, work-item invocations imply strong ordering. Programmers balance performance/programmability for work-group invocations by choosing strong or relaxed ordering. Finally, programmers must use relaxed ordering for kernel invocations so that the GPU does not deadlock.
Blocking versus non-blocking approaches: Most traditional CPU system calls, barring those for asynchronous I/O (e.g., aio_read, aio_write), return only after the system call completes execution. Blocking system calls have a disproportionate impact on performance because GPUs use SIMD execution. In particular, if even a single work-item is blocked on a system call, no other work-item in the wavefront can make progress either. We find that GPUs can often benefit from non-blocking system calls that can immediately return before system call processing completes. Non-blocking system calls can therefore overlap system call processing on the CPU with useful GPU work, improving performance by as much as 30% in some of our studies (see Section VII).
The concepts of blocking/non-blocking and strong/relaxed ordering are related but orthogonal. The strictness of ordering refers to the question of when a system call can be invoked with respect to the progress of work-items within its granularity of invocation. System call blocking refers to how the return from a system call relates to the completion of its processing. Relaxed ordering and non-blocking can be combined in several useful ways. For example, consider a case where a GPU program writes to a file at work-group granularity. The execution may not depend upon the output of the write, but the programmer may want to ensure that the write successfully completes. In such a scenario, blocking writes may be invoked with weak ordering. Weak ordering permits all but one wavefront in the work-group to proceed without waiting for completion of the write (see Section VI). Blocking invocation, however, ensures that one wavefront waits for the write to complete and can raise an alarm if the write fails. Consider another scenario, where a programmer wishes to prefetch data using read system calls but may not use the results immediately. Here, weak ordering with non-blocking invocation is likely to provide the best performance without breaking the program's semantics. In short, different combinations of blocking and ordering enable programmers to fine-tune performance and programmability tradeoffs.
B. CPU Hardware

GPUs rely on extreme parallelism for performance. This means there may be bursts of system calls that CPUs need to process. System call coalescing is one way to increase the throughput of system call processing. The idea is to collect several GPU system call requests and batch their processing on the CPU. The benefit is that CPUs can manage multiple system calls as a single unit, scheduling a single software task to process them. This reduction in task management, scheduling, and processing overheads can often boost performance (see Section VII). Coalescing must be performed judiciously, as it improves system call processing throughput at the expense of latency. It also implicitly serializes the processing of system calls within a coalesced bundle.
To allow the GPGPU programmer to balance the benefits of coalescing with its potential challenges, GENESYS accepts two parameters: a time window length within which the CPU coalesces system calls, and the maximum number of system call invocations that can be coalesced within the time window. Section VII shows that system call coalescing can improve performance by as much as 10-15%.
C. CPU-GPU Communication Hardware

Prior work implemented system calls using polling, where GPU wavefronts monitored predesignated memory locations populated by CPUs upon system call completion. But recent advances in GPU hardware enable alternate modes of CPU-GPU communication. For example, AMD GPUs now allow wavefronts to relay interrupts to CPUs and then halt execution, relinquishing SIMD hardware resources [36]. CPUs can in turn message the GPU to wake up halted wavefronts.
We have implemented polling and halt-resume approaches in GENESYS. With polling, if the number of memory locations that needs to be polled by the GPU exceeds its cache size, frequent cache misses lower performance. On the other hand, halt-resume has its own overheads, namely the latency to resume a halted wavefront.
We have found that polling yields better performance when system calls are invoked at coarser work-group granularities, since fewer memory locations are needed to convey information at the work-group versus work-item level. Consequently, the memory locations easily fit in the GPU's L2 data cache. When system calls are invoked per work-item, however, the sheer number of such memory locations becomes so high that cache thrashing becomes an issue. In these situations, halt-resume approaches outperform polling. We quantify the impact of this in the following sections.

Table III
SYSTEM CONFIGURATION USED FOR OUR STUDIES

SoC              | Mobile AMD FX-9800P™
CPU              | 4× 2.7GHz, AMD Family 21h Model 101h
Integrated GPU   | 758 MHz, AMD GCN 3 GPU
Memory           | 16 GB Dual-Channel DDR4-1066MHz
Operating system | Fedora 26 using ROCm stack 1.6 (based on Linux 4.11)
Compiler         | HCC-01017166 + LLVM 5.0, C++AMP with HC extensions

Figure 5. Content of each slot in the syscall area.

Figure 6. State transition diagram for a slot in the syscall area. Green shows GPU state/actions; blue shows that of the CPU.
VI. IMPLEMENTING GENESYS

We implemented GENESYS on the system in Table III. We used an AMD FX-9800P processor with an integrated GPU and ran the open-source Radeon Open Compute (ROCm) software stack [47]. Although we use a system with an integrated GPU, GENESYS is not specific to integrated GPUs and generalizes to discrete GPUs. We modified the GPU driver and Linux kernel to enable GPU system calls. We also modified the HCC compiler to permit GPU system call invocations in C++AMP.
Invoking GPU system calls: GENESYS permits work-item, work-group, and kernel-level invocation. At work-group or kernel-level invocations, a single work-item is designated as the caller. For strongly ordered work-group invocations, we use work-group scope barriers before and after system call invocations. For relaxed ordering, a barrier is placed either before (for consumer system calls) or after (for producer calls) system call invocation.
GPU to CPU communication: GENESYS facilitates efficient GPU to CPU communication of system call requests.
GENESYS uses a preallocated shared memory syscall area to allow the GPU to convey parameters to the CPU (see Section III). The syscall area maintains one slot for each active GPU work-item. The OS and driver code can identify the desired slot by using the hardware ID of the active work-item, which is available to GPU runtimes. This hardware ID is distinct from the programmer-visible work-item ID. Each of the work-items has a unique work-item ID visible to the application programmer. At any one point in time, however, only a subset of these work-items executes (as permitted by the GPU's CU count, supported work-groups per CU, and SIMD width). The hardware ID distinguishes among these active work-items. Overall, our system uses 64 bytes per slot, totaling 1.25 MB of syscall area.
Figure 5 shows the contents in a slot – the requested system call number, the request state, system call arguments (as many as 6 in Linux), and padding to avoid false sharing. The field for arguments is also re-purposed for the return value of the system call. When a GPU program's work-item invokes a system call, it atomically updates the state of the corresponding slot from free to populating (Figure 6). If the slot is not free, system call invocation is delayed. Once the state is populating, the invoking work-item populates the slot with system call information and changes the state to ready. The work-item also adds one bit of information about whether the invocation is blocking or non-blocking. The work-item then interrupts the CPU using a scalar wavefront GPU instruction³ (s_sendmsg on AMD GPUs). For blocking invocation, the work-item either waits and polls the state of the slot or suspends itself using halt-resume.
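The GPU-side half of this slot protocol can be sketched with host-side C11 atomics. This is a simplified illustration of the handshake just described, not GENESYS's actual definitions: the slot layout, state names, and helper functions are stand-ins.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative slot states, mirroring Figure 6. */
enum slot_state { FREE, POPULATING, READY, PROCESSING, FINISHED };

/* One slot per active work-item; _Alignas(64) keeps each slot on its own
   cache line so a single atomic access covers the whole slot's line. */
typedef struct {
    _Alignas(64) _Atomic int state;
    int      syscall_no;
    int      blocking;       /* one bit: blocking vs. non-blocking */
    uint64_t args[6];        /* also reused for the return value */
} syscall_slot_t;

/* GPU side: claim a free slot, fill it, and mark it ready.
   Returns 0 on success, -1 if the slot was busy (invocation delayed). */
int invoke_syscall(syscall_slot_t *slot, int no, int blocking,
                   const uint64_t args[6]) {
    int expected = FREE;
    /* free -> populating must be atomic: cmp-swap claims the slot */
    if (!atomic_compare_exchange_strong(&slot->state, &expected, POPULATING))
        return -1;
    slot->syscall_no = no;
    slot->blocking = blocking;
    for (int i = 0; i < 6; i++) slot->args[i] = args[i];
    atomic_store(&slot->state, READY);  /* now visible to the CPU */
    /* the real system would now raise s_sendmsg to interrupt the CPU */
    return 0;
}

/* Blocking caller: poll until the CPU marks the slot finished,
   then retrieve the return value and release the slot. */
uint64_t wait_for_result(syscall_slot_t *slot) {
    while (atomic_load(&slot->state) != FINISHED)
        ;  /* halt-resume hardware would suspend instead of spinning */
    uint64_t ret = slot->args[0];
    atomic_store(&slot->state, FREE);
    return ret;
}
```

The cmp-swap on the state field is what delays a second invocation while the slot is occupied, matching the free-to-populating transition in Figure 6.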
CPU-side system call processing: Once the GPU interrupts the CPU, system call processing commences. The interrupt handler creates a new kernel task and adds it to Linux's work-queue. This task is also passed the hardware ID of the requesting wavefront. At an expedient future point in time, an OS worker thread executes this task. The task scans the 64 syscall slots of the given hardware ID and atomically switches any ready system call requests to the processing state. The task then carries out the system call work.
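The scan-and-claim step on the CPU side can be sketched the same way with C11 atomics. This is a self-contained host-side illustration; the 64-slot count follows the wavefront width, and the state values are stand-ins for GENESYS's actual ones.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative slot states, mirroring Figure 6. */
enum { S_FREE, S_POPULATING, S_READY, S_PROCESSING, S_FINISHED };

/* Worker-thread task: scan the 64 slots belonging to one wavefront's
   hardware ID and atomically claim every READY request for processing.
   claimed_idx receives the indices of claimed slots; returns the count. */
size_t claim_ready_slots(_Atomic int states[64], size_t claimed_idx[64]) {
    size_t n = 0;
    for (size_t i = 0; i < 64; i++) {
        int expected = S_READY;
        /* ready -> processing: the cmp-swap guarantees each request is
           handed to exactly one worker even if tasks race */
        if (atomic_compare_exchange_strong(&states[i], &expected,
                                           S_PROCESSING))
            claimed_idx[n++] = i;
    }
    return n;
}
```

After claiming, the worker would execute the requested system calls and transition each slot to finished (blocking) or free (non-blocking).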
A challenge is that Linux's traditional system call routines implicitly assume that they are to be executed within the context of the original process invoking the system call. Consequently, they use context-specific variables (e.g., the current variable used to identify the process context). This presents a challenge for GPU system calls, which are instead serviced purely in the context of the OS' worker thread. GENESYS overcomes this challenge in two ways – it either switches to the context of the original CPU program that invoked the GPU kernel, or it provides additional context information in the code performing system call processing.
³ Scalar wavefront instructions are part of the Graphics Core Next ISA and are executed once for the entire wavefront, rather than for each active work-item. See Chapter 4.1 in the manual [36].
The exact strategy is determined on a case-by-case basis. GENESYS implements coalescing by waiting for a predetermined amount of time in the interrupt handler before enqueueing a task to process a system call. If multiple requests are made to the CPU during this time window, they are coalesced such that they can be handled as a single unit of work by the OS worker-thread. GENESYS uses Linux's sysfs interface to communicate coalescing parameters.
Communicating results from the CPU to the GPU: Once the CPU worker-thread finishes processing the system call, the results are put in the field for arguments in the slot for blocking requests. Further, the thread also changes the state of the slot to finished for blocking system calls. For non-blocking invocations, the state is changed to free. The invoking GPU work-item is then re-scheduled (on hardware that supports wavefront suspension) or automatically restarted because it was polling on this state. The work-item can consume the result and continue execution.
Other architectural design considerations: GENESYS requires two key data structures to be exchanged between GPUs and CPUs – the syscall area and, for some system calls, syscall buffers that maintain data required for system call processing. We discovered that carefully leveraging architectural support for CPU-GPU cache coherence in the context of these data structures was vital to overall performance.
Consider, for example, the syscall area. As described in Section III, every GPU work-item invoking a system call is allocated space in the syscall area. Like many GPUs, the one used as our experimental platform supports L2 data caches that are coherent with CPU caches/memory but integrates non-coherent L1 data caches. At first blush, one may decide to manually invalidate L1 data cache lines. However, we sidestepped this issue by restricting per-work-item slots in the syscall area to individual cache lines. This design permits us to use atomic instructions to access memory – these atomic instructions, by design, force lookups of the L2 data cache and guarantee visibility of the entire cacheline, sidestepping the coherence challenges of L1 GPU data caches. We quantify the measured overheads of the atomic operations we use for GENESYS in Table IV, comparing them to the latency of a standard load operation. Through experimentation, we achieved good performance using cmp-swap atomics to claim a slot in the syscall area when GPU work-items invoked system calls, atomic-swaps to change state, and atomic-loads to poll for completion.
Unfortunately, the same approach of using atomics does not yield good performance for accesses to syscall buffers. The key issue is that syscall buffers can be large and span multiple cache lines. Using atomics here meant that we suffered the latency of several L2 data cache accesses to syscall buffers. We found that a better approach was to eschew atomics in favor of manual software L1 data cache coherence. This
Table IV. Profiled performance of GPU atomic operations

Op         cmp-swap   swap    atomic-load   load
Time (us)  1.352      0.782   0.360         0.243
[Figure 7 graphs: GPU time (s) vs. file size (MB); the left panel compares wi, wg, and kernel invocation granularities, the right panel compares work-group sizes wg64, wg256, wg512, and wg1024.]
Figure 7. Impact of system call invocation granularity. The graphs show average and standard deviation of 10 runs.
meant, for example, that we preceded sys_write system calls with an L1 data cache flush.
VII. MICROBENCHMARK EVALUATIONS
Invocation granularity: Figure 7 quantifies performance for a microbenchmark that uses pread on files in tmpfs⁴. The x-axis plots the file size, and the y-axis shows read time, with lower values being better. Within each cluster, we separate runtimes for different pread invocation granularities.
Figure 7(left) shows that work-item invocation granularities tend to perform worst. This is not surprising, as it is the finest granularity of system call invocation and leads to a flood of individual system calls that overwhelm the CPU. On the other end of the granularity spectrum, kernel-level invocation is also challenging, as it generates a single system call and fails to leverage any potential parallelism in processing of system call requests. This is particularly pronounced at large file sizes (e.g., 2GB). A good compromise is to use work-group invocation granularities. It does not overwhelm the CPU with system call requests while still being able to exploit parallelism in servicing system calls.
When using work-group invocation, an important question is how many work-items should constitute a work-group. While Figure 7(left) uses 64 work-items in a work-group, Figure 7(right) quantifies the performance impact of pread system calls as we vary work-group sizes from 64 (wg64) to 1024 (wg1024) work-items. In general, larger work-group sizes enable better performance, as there are fewer unique system calls necessary to handle the same amount of work.
Blocking and ordering strategies: To quantify the impact of blocking/non-blocking with strong/relaxed ordering, we
⁴ Tmpfs is a filesystem without permanent backing storage. In other words, all structures and data are memory-resident.
[Figure 8 graph: GPU time per permutation (ms) vs. compute permutations, comparing weak-block, weak-non-block, strong-block, and strong-non-block.]
Figure 8. Performance implications of system call blocking and ordering semantics. The graph shows average and standard deviation of 80 runs.
designed a GPU microbenchmark that performs block permutation on an array, similar to the permutation steps performed in DES encryption. The input data array is preloaded with random values and divided into 8KB blocks. Work-groups, each executing 1024 work-items, independently permute blocks. The results are written to a file using pwrite at work-group granularity. The pwrite system calls for one block of data are overlapped with permutations on other blocks of data. To vary the amount of computation per system call, we permute multiple times before writing the result.
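The per-block pattern of this benchmark can be sketched sequentially on the host. This is a simplified illustration: the byte-swap permutation is a stand-in for the DES-like permutation, and in GENESYS the pwrite is issued from the GPU and overlaps with permutation of other blocks rather than running sequentially.

```c
#include <stdint.h>
#include <stddef.h>
#include <unistd.h>
#include <fcntl.h>   /* open(), used in the usage example */

#define BLOCK_SIZE 8192  /* 8KB blocks, as in the benchmark */

/* Stand-in permutation: swap adjacent byte pairs within a block. */
static void permute_block(uint8_t *block) {
    for (size_t i = 0; i + 1 < BLOCK_SIZE; i += 2) {
        uint8_t t = block[i];
        block[i] = block[i + 1];
        block[i + 1] = t;
    }
}

/* Permute each block `iters` times, then pwrite it to its file offset.
   Varying `iters` varies the compute per system call, as in Figure 8.
   Returns 0 on success, -1 on a short write. */
int permute_and_write(int fd, uint8_t *data, size_t nblocks, int iters) {
    for (size_t b = 0; b < nblocks; b++) {
        uint8_t *block = data + b * BLOCK_SIZE;
        for (int i = 0; i < iters; i++)
            permute_block(block);
        if (pwrite(fd, block, BLOCK_SIZE, (off_t)(b * BLOCK_SIZE))
                != BLOCK_SIZE)
            return -1;
    }
    return 0;
}
```

Because pwrite takes an explicit offset, independent blocks can be written without coordinating a shared file position, which is what lets the writes overlap with computation on other blocks.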
Figure 8 quantifies the impact of using blocking versus non-blocking system calls with strong and weak ordering. The x-axis plots the number of permutation iterations performed on each block by each work-group before writing the results. The y-axis plots execution time for one permutation (lower is better). Figure 8 shows that strongly ordered blocking invocations (strong-block) hurt performance. This is expected, as they require work-group scoped barriers to be placed before and after pwrite invocations. The GPU's hardware resources for work-groups are stalled, waiting for the pwrite to complete. Not only does the inability to overlap GPU parallelism hurt strongly ordered blocking performance, it also means that GPU performance is heavily influenced by CPU-side performance vagaries like synchronization overheads, servicing other processes, etc. This is particularly true at iteration counts where system call overheads dominate GPU-side computation – below 15 compute iterations. Even when the application becomes GPU-compute bound, performance remains non-ideal.
Figure 8 shows that when pwrite is invoked in a non-blocking manner (with strong ordering), performance improves. This is because non-blocking pwrites permit the GPU to end the invoking work-group, freeing GPU resources it was occupying. CPU-side pwrite processing can overlap with the execution of another work-group permuting on a different block of data. Figure 8 shows that latencies generally drop by 30% compared to blocking calls at low iteration counts. At higher iteration counts (beyond 16), these benefits diminish because the latency to perform repeated
[Figure 9 graph: CPU memcpy throughput (GB/s) vs. number of GPU atomically polled cachelines, on AMD FX-9800P with DDR4-1066MHz.]
Figure 9. Impact of polling on memory contention.
permutations dominates any system call processing times.

For relaxed ordering with blocking (weak-block), the post-system-call work-group-scope barrier is eliminated. One out of every 16 wavefronts⁵ in the work-group waits for the blocking system call while others exit, freeing GPU resources. The GPU can use these freed resources to run other work-items to hide CPU-side system call latency. Consequently, the performance trends follow those of strong-non-block, with minor differences in performance arising from differences in GPU scheduling of the work-groups for execution. Finally, Figure 8 shows system call latency is best hidden using relaxed and non-blocking approaches (weak-non-block).
Polling/halt-resume and memory contention: As previously discussed, polling at the work-item invocation granularity leads to memory reads of several thousands of per-work-item memory locations. We quantify the resulting memory contention in Figure 9, which showcases how the throughput of CPU accesses decreases as the number of polled GPU cache lines increases. Once the number of polled memory locations outstrips the GPU's L2 cache capacity (roughly 4096 cache lines in our platform), the GPU polls locations spilled to the DRAM. This causes contention on the memory controllers shared between GPUs and CPUs. We advocate using GENESYS with halt-resume approaches at such granularities of system call invocation.
Interrupt coalescing: Figure 10 shows the performance impact of coalescing system calls. We use a microbenchmark that invokes pread. We read data from files of different sizes using a constant number of work-items. More data is read per pread system call from the larger files. The x-axis shows the amounts of data read and quantifies the latency per requested byte. We present two bars for each point on the x-axis, illustrating the average time needed to read one byte with the system call in the absence of coalescing and when up to eight system calls are coalesced. Coalescing is most beneficial when small amounts of data are read. When more data is read, the overhead reduction is negligible compared to the significantly longer time to process the system call.
⁵ Each wavefront has 64 work-items. Thus, a 1024 work-item work-group has 16 wavefronts.
[Figure 10 graph: GPU time per requested byte (ms) vs. size per syscall (B), comparing uncoalesced and coalesced (8 interrupts).]
Figure 10. Implications of system call coalescing. The graph shows average and standard deviation of 20 runs.
VIII. CASE STUDIES
A. Memory Workload
We now assess the end-to-end performance of workloads that use GENESYS. Our first application requires memory management system calls. We studied miniAMR [48] and used the madvise system call directly from the GPU to better manage memory. MiniAMR performs 3D stencil calculations using adaptive mesh refinement and is a candidate for memory management because it varies its memory needs in a data-model-dependent manner. For example, when simulating regions experiencing turbulence, miniAMR needs to model with higher resolution. However, if lower resolution is possible without compromising simulation accuracy, miniAMR reduces memory and computation usage, making it possible to free memory. While relinquishing excess memory in this way is not possible in traditional GPU programs without explicitly dividing the offload into multiple kernels interspersed with CPU code (see Figure 1), GENESYS permits this with direct GPU madvise invocations. We invoke madvise using work-group granularities with non-blocking and weak ordering. We leverage GENESYS' generality by also using getrusage to read the application resident set size (RSS). When the RSS exceeds a watermark, madvise relinquishes memory.
[Figure 11 graph: RSS (GB) vs. time (s) for rss-3GB and rss-4GB.]
Figure 11. Memory footprint of miniAMR using getrusage and madvise to hint at unused memory.
We execute miniAMR with a dataset of 4.1GB – just beyond the hard limit we put on physical memory available to the GPU. Without using madvise, memory swapping increases
so dramatically that it triggers GPU timeouts, causing the existing GPU device driver to terminate the application. Because of this, there is no baseline to compare to, as the baseline simply does not complete.
Figure 11 shows two results, one for a 3GB RSS watermark and one for 4GB. Not only does GENESYS enable miniAMR to complete, it also permits the programmer to balance memory usage and performance. While rss-3GB lowers memory utilization, it also worsens runtime compared to rss-4GB. This performance gap is expected: the more memory is released, the greater the likelihood that the GPU program suffers from page faults when the memory is touched again in the future, and the more frequently the madvise system call is invoked. Overall, Figure 11 illustrates that GENESYS allows programmers to perform memory allocation to trade memory usage and performance on GPUs, analogous to CPUs.
B. Workload Using Signals
[Figure 12 graph: running time (s) for baseline and GENESYS.]
Figure 12. Runtime of CPU-GPU map-reduce workload.
GENESYS also enables system calls that permit the GPU to send asynchronous notifications to the CPU. This is useful in many scenarios. We study one such scenario and implement a map-reduce application called signal-search. The application runs in two phases. The first phase performs parallel lookup in a
data array, while the second phase processes blocks of retrieved data and computes sha512 checksums on them. The first phase is a good fit for GPUs, since the highly parallel lookup fits its execution model, while the second phase is more appropriate for CPUs, many of which support performance acceleration of sha checksums via dedicated instructions.
Without support for signal invocation on GPUs, programmers would launch a kernel with the entire parallel lookup on the GPU and wait for it to conclude before proceeding with the sha512 checksums. GENESYS overcomes this restriction and permits a heterogeneous version of this code, where GPU work-groups can emit signals using rt_sigqueueinfo to the CPU, indicating per-block completions of the parallel search. As a result, the CPU can begin processing these blocks immediately, permitting overlap with GPU execution.
Operationally, rt_sigqueueinfo is a generic signal system call that allows the caller to fill a siginfo structure that is passed along with the signal. In our workload, we find that work-group level invocations perform best, so the GPU passes the identifier of this work-group through the siginfo structure. Figure 12 shows that using work-group invocation granularity and non-blocking invocation results in roughly 14% performance speedup over the baseline.
C. Storage Workloads
We have also studied how to use GENESYS to support storage workloads. In some cases, GENESYS permits the implementation of workloads supported by GPUfs, but in more efficient ways.
[Figure 13 graphs: (a) running time (s) of CPU-grep, GPU-GENESYS (WG), GPU-GENESYS (WI), GPU-GENESYS (WI)-suspend, and CPU-openmp; (b) running time (s) of GPU-GENESYS, GPU-NOSYS, and CPU.]
Figure 13. a) Standard grep and OpenMP grep versus work-item/work-group invocations with GENESYS. b) Comparing the performance of CPU, GPU with no system calls, and GENESYS implementations of wordcount. Graphs show average and standard deviation of 10 runs.
Supporting storage workloads more efficiently than GPUfs: We implement a workload that takes as input a list of words and a list of files. It then performs grep -F -l on the GPU, identifying which files contain any of the words on the list. As soon as these files are found, they are printed to the console. This workload cannot be supported by GPUfs without significant code refactoring because of its use of custom APIs. Instead, since GENESYS naturally adheres to the UNIX "everything is a file" philosophy, porting grep to GPUs requires only a few hours of programming.
Figure 13 shows the results of our GPU grep experiments. We compare a standard CPU implementation, a parallelized OpenMP CPU implementation, and two GPU implementations with GENESYS, with non-blocking system calls invoked at work-item (WI) and work-group (WG) granularities. Furthermore, since work-item invocations can sometimes achieve better performance using halt-resume (versus work-group and kernel invocations, which always achieve better performance with polling), we separate results for WI-polling and WI-halt-resume. GENESYS enables our GPU grep implementation to print output lines to console or files using standard OS output. GENESYS achieves 20-23× speedups over an OpenMP version of grep.
Figure 13 also shows that GENESYS' flexibility with invocation granularity, blocking/non-blocking, and strong/relaxed ordering can boost performance (see Section V). For our grep example, because a file only needs to be searched until the first instance of a matching word, a work-item can immediately invoke a write of the filename rather than waiting for all matching files. We find that WI-halt-resume outperforms both WG and WI-polling by roughly 3-4%.
[Figure 14 graphs: CPU utilization (%) and I/O throughput (MB/s) vs. time (s) for GPU-NOSYS, GPU-GENESYS, and CPU.]
Figure 14. Wordcount I/O and CPU utilization reading from SSD. Graphs show average and standard deviation of 10 runs.
Workload from prior work: GENESYS also supports a version of the same workload that is evaluated in the original GPUfs work: traditional word count using open, read, and close system calls. Figure 13 shows our results. We compare the performance of a parallelized CPU version of the workload with OpenMP, a GPU version of the workload with no system calls, and a GPU version with GENESYS. Both CPU and GPU workloads are configured to search for occurrences of 64 strings. We found that with GENESYS these system calls were best invoked at work-group granularity with blocking and weak-ordering semantics. All results are collected on a system with an SSD.
We found that GENESYS achieves nearly 6× performance improvement over the CPU version. Without system call support, the GPU version is far worse than the CPU version. Figure 14 sheds light on these benefits. We plot traces for CPU and GPU utilization and disk I/O throughput. GENESYS extracts much higher throughput from the underlying storage device (up to 170MB/s compared to the CPU's 30MB/s). Offloading search string processing to the GPU frees up the CPU to process system calls effectively. The change in CPU utilization between the GPU workload and CPU workload reveals this trend. In addition, we found that the GPU's ability to launch more concurrent I/O requests enabled the I/O scheduler to make better scheduling decisions.
D. Network Workloads
We have studied the benefits of GENESYS for network I/O in the context of memcached. While this can technically be implemented using GPUnet, the performance implications are unclear because the original GPUnet paper used APIs that extracted performance using dedicated RDMA hardware. We make no such assumptions and focus on the two core commands – SET and GET. SET stores a key-value pair, and GET retrieves a value associated with a key if it is present. Our implementation supports a binary UDP memcached version with a fixed-size hash table as back-end storage. The hash table is shared between CPU and GPU, and its bucket size and count are configurable. Further, this memcached
[Figure 15 graphs: latency (ms) and throughput (ops/s) for hits and misses, comparing CPU, GPU-GENESYS, and GPU-NOSYS.]
Figure 15. Latency and throughput of memcached. Graphs show average and standard deviation of 20 runs.
implementation enables concurrent operations: CPUs can handle SETs and GETs, while the GPU supports only GETs. Our GPU implementation parallelizes the hash computation, bucket lookup, and data copy. We use sendto and recvfrom system calls for UDP network access. These system calls are invoked at work-group granularity with blocking and weak ordering, as this performs best.
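The UDP request path can be sketched with the same sendto/recvfrom calls on the host over loopback. This is an illustration only: the plain-text "GET" framing and helper names stand in for the binary memcached protocol and the GPU-side invocation.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/* Create a UDP socket bound to 127.0.0.1 on an ephemeral port;
   *out_addr receives the bound address. Returns the fd or -1. */
int udp_bind_loopback(struct sockaddr_in *out_addr) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return -1;
    struct sockaddr_in a = {0};
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = inet_addr("127.0.0.1");
    a.sin_port = 0;  /* ephemeral port */
    if (bind(fd, (struct sockaddr *)&a, sizeof a) < 0) { close(fd); return -1; }
    socklen_t len = sizeof *out_addr;
    if (getsockname(fd, (struct sockaddr *)out_addr, &len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Client side: send a GET request datagram and wait for the reply.
   In GENESYS a GPU work-group issues these sendto/recvfrom calls. */
ssize_t get_request(int fd, const struct sockaddr_in *server,
                    const char *key, char *reply, size_t reply_cap) {
    if (sendto(fd, key, strlen(key), 0,
               (const struct sockaddr *)server, sizeof *server) < 0)
        return -1;
    return recvfrom(fd, reply, reply_cap, 0, NULL, NULL);
}

/* Server side: receive one request and echo a value back to the sender.
   A real memcached server would look the key up in the hash table. */
ssize_t serve_one(int fd, const char *value) {
    char buf[512];
    struct sockaddr_in peer;
    socklen_t plen = sizeof peer;
    ssize_t n = recvfrom(fd, buf, sizeof buf, 0,
                         (struct sockaddr *)&peer, &plen);
    if (n < 0) return -1;
    return sendto(fd, value, strlen(value), 0,
                  (struct sockaddr *)&peer, plen);
}
```

Because UDP datagrams carry the peer address with each message, the server needs no per-connection state, which is what makes independent GET lookups easy to parallelize across GPU work-groups.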
Figure 15 compares the performance of a CPU version of this workload with GPU versions using GENESYS. GPUs accelerate memcached by parallelizing lookups on buckets with more elements. For example, Figure 15 shows speedups when there are 1024 elements per bucket (with 1KB data size). Without system calls, GPU performance lags behind CPU performance. However, GENESYS achieves 30-40% latency and throughput benefits over not just CPU versions but also GPU versions without direct system calls.
E. Device Control
Finally, we also used GENESYS to implement ioctl system calls. As an example, we used ioctl to query and control framebuffer settings. The implementation is straightforward: the GPU opens /dev/fb0 and issues a series of ioctl commands to query and set settings of the active frame buffer. It then proceeds to mmap the framebuffer memory and fill it with data from a previously mmaped raster image. This results in the image being displayed on the computer screen. While not a critical GPGPU application, this ioctl example demonstrates the generality and flexibility of OS interfaces implemented by GENESYS.
IX. DISCUSSION
Asynchronous system call handling: GENESYS enqueues the GPU system call's kernel task and processes it outside of the interrupt context. We do this because Linux is designed such that few operations can be processed in an interrupt handler. A potential concern with this design, however, is that it defers the system call processing potentially past the end of the lifetime of the GPU thread, and potentially of the process that created the GPU thread itself. It is an example of a more general problem with asynchronous system calls [49].
Figure 16. Raster image copied to the framebuffer by the GPU.
Our solution is to provide a new function call, invoked by the CPU, that ensures all GPU system calls have completed before the termination of the process.
Related work: Beyond work already discussed [25, 26, 28], the latest generation of C++AMP [23] and CUDA [24] provide access to the memory allocator. These efforts use a user-mode service thread on the CPU to proxy requests from the GPU [27]. Like GENESYS, system call requests are placed in a shared queue by the GPU. From there, however, the designs are different. Their user-mode thread polls this shared queue and "thunks" the request to the libc runtime or OS. This incurs added overhead for entering and leaving the OS kernel.
Some studies provide network access to GPU code [26, 50–52]. NVidia provides GPUDirect [52], used by several MPI libraries [53–55], which allows the NIC to bypass main memory and communicate directly with memory on the GPU itself. GPUDirect does not provide a programming interface for GPU-side code; the CPU must initiate communication with the NIC. Oden exposed the memory-mapped control interface of the NIC to the GPU and thereby allowed the GPU to directly communicate with the NIC [51]. This low-level interface, however, lacks the benefits of a traditional OS interface (e.g., protection, sockets, TCP).
X. CONCLUSIONS
We shed light on research questions fundamental to the idea of accessing OS services from accelerators by realizing an interface for generic POSIX system call support on GPUs. Enabling such support requires subtle changes to existing kernels. In particular, traditional OSes assume that system call processing occurs in the same context as the invoking thread, and this needs to change for accelerators. We have released GENESYS to make these benefits accessible for broader research on GPUs.
XI. ACKNOWLEDGMENTS
We thank the National Science Foundation, which partially supported this work through grants 1253700 and 1337147. We thank Guilherme Cox and Mark Silberstein for their feedback. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a registered trademark of PCI-SIG Corporation. ARM is a registered trademark of ARM Limited. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. © 2018 Advanced Micro Devices, Inc. All rights reserved.
REFERENCES

[1] S. Mittal and J. Vetter, "A Survey of CPU-GPU Heterogeneous Computing Techniques," in ACM Computing Surveys, 2015.
[2] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. Lefohn, and T. Purcell, "A Survey of General-Purpose Computation on Graphics Hardware," in Computer Graphics Forum, Vol. 26, Num. 1, pp. 80-113, 2007.
[3] D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006.
[4] J. Vesely, A. Basu, M. Oskin, G. Loh, and A. Bhattacharjee, "Observations and Opportunities in Architecting Shared Virtual Memory for Heterogeneous Systems," in International Symposium on Performance Analysis of Systems and Software (ISPASS), 2016.
[5] J. Obert, J. V. Waveran, and G. Sellers, "Virtual Texturing in Software and Hardware," in SIGGRAPH Courses, 2012.
[6] G. Cox and A. Bhattacharjee, "Efficient Address Translation with Multiple Page Sizes," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017.
[7] L. Olson, J. Power, M. Hill, and D. Wood, "Border Control: Sandboxing Accelerators," in International Symposium on Microarchitecture (MICRO), 2007.
[8] B. Pichai, L. Hsu, and A. Bhattacharjee, "Architectural Support for Address Translation on GPUs," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014.
[9] J. Power, M. Hill, and D. Wood, "Supporting x86-64 Address Translation for 100s of GPU Lanes," in International Symposium on High Performance Computer Architecture (HPCA), 2014.
[10] N. Agarwal, D. Nellans, E. Ebrahimi, T. Wenisch, J. Danskin, and S. Keckler, "Selective GPU Caches to Eliminate CPU-GPU HW Cache Coherence," in International Symposium on High Performance Computer Architecture (HPCA), 2016.
[11] HSA Foundation, "HSA Programmer's Reference Manual [Online]," 2015. http://www.hsafoundation.com/?ddownload=4945
[12] G. Kyriazis, "Heterogeneous System Architecture: A Technical Review," 2012.
[13] J. Power, A. Basu, J. Gu, S. Puthoor, B. Beckmann, M. Hill, S. Reinhardt, and D. Wood, "Heterogeneous System Coherence for Integrated CPU-GPU Systems," in International Symposium on Microarchitecture (MICRO), 2013.
[14] S. Sahar, S. Bergman, and M. Silberstein, "Active Pointers: A Case for Software Address Translation on GPUs," in International Symposium on Computer Architecture (ISCA), 2016.
[15] C. J. Thompson, S. Hahn, and M. Oskin, "Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis," in International Symposium on Microarchitecture (MICRO), 2002.
[16] J. Owens, U. Kapasi, P. Mattson, B. Towles, B. Serebrin, S. Rixner, and W. Dally, "Media Processing Applications on the Imagine Stream Processor," in International Conference on Computer Design (ICCD), 2002.
[17] J. Owens, B. Khailany, B. Towles, and W. Dally, "Comparing Reyes and OpenGL on a Stream Architecture," in SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2002.
[18] S. Rixner, W. Dally, U. Kapasi, B. Khailany, A. Lopez-Lagunas, P. Mattson, and J. Owens, "A Bandwidth-Efficient Architecture for Media Processing," in International Symposium on Microarchitecture (MICRO), 1998.
[19] B. Serebrin, J. Owens, B. Khailany, P. Mattson, U. Kapasi, C. Chen, J. Namkoong, S. Crago, S. Rixner, and W. Dally, "A Stream Processor Development Platform," in International Conference on Computer Design (ICCD), 2002.
[20] Khronos Group, "OpenGL," 1992. https://www.khronos.org/opengl
[21] Microsoft, "Programming Guide for DirectX [Online]," 2016. https://msdn.microsoft.com/en-us/library/windows/desktop/ff476455(v=vs.85).aspx
[22] Khronos Group, "OpenCL," 2009. https://www.khronos.org/opencl
[23] Microsoft, "C++ AMP (C++ Accelerated Massive Parallelism) [Online]," 2016. https://msdn.microsoft.com/en-us/library/hh265137.aspx
[24] NVidia, "CUDA Toolkit Documentation," 2016. http://docs.nvidia.com/cuda
[25] M. Silberstein, B. Ford, I. Keidar, and E. Witchel, "GPUfs: Integrating File Systems with GPUs," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.
[26] S. Kim, S. Huh, X. Zhang, Y. Hu, A. Wated, E. Witchel, and M. Silberstein, "GPUnet: Networking Abstractions for GPU Programs," in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
[27] J. Owens, J. Stuart, and M. Cox, "GPU-to-CPU Callbacks," in Workshop on Unconventional High Performance Computing, 2010.
[28] S. Bergman, T. Brokhman, T. Cohen, and M. Silberstein, "SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs," in USENIX Annual Technical Conference (ATC), 2017.
[29] AMD, "Radeon Open Compute [Online]," 2017. https://rocm.github.io
[30] AMD, "ROCK syscall," 2017. https://github.com/RadeonOpenCompute/ROCK_syscall
[31] AMD, "ROCT syscall," 2017. https://github.com/RadeonOpenCompute/ROCT_syscall
[32] AMD, "HCC syscall," 2017. https://github.com/RadeonOpenCompute/HCC_syscall
[33] AMD, "Genesys syscall test," 2017. https://github.com/RadeonOpenCompute/Genesys_syscall_test
[34] R. Draves, B. Bershad, R. Rashid, and R. Dean, "Using Continuations to Implement Thread Management and Communication in Operating Systems," in Symposium on Operating Systems Principles (SOSP), 1991.
[35] O. Shivers and M. Might, "Continuations and Transducer Composition," in International Symposium on Programming Languages Design and Implementation (PLDI), 2006.
[36] AMD, "Graphics Core Next 3.0 ISA [Online]," 2013. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf
[37] Intel, "Intel Open Source HD Graphics Programmer's Reference Manual [Online]," 2014. https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-bdw-vol02b-commandreference-instructions_0_0.pdf
[38] NVidia, "NVIDIA Compute PTX: Parallel Thread Execution," 2009. https://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf
[39] PCI-SIG, "Atomic Operations [Online]," 2008. https://pcisig.com/sites/default/files/specification_documents/ECN_Atomic_Ops_080417.pdf
[40] T. Sorensen, A. F. Donaldson, M. Batty, G. Gopalakrishnan, and Z. Rakamaric, "Portable Inter-Workgroup Barrier Synchronisation for GPUs," in Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2016), pp. 39-58, ACM, 2016.
[41] T. Rogers, M. O'Connor, and T. Aamodt, "Cache-Conscious Wavefront Scheduling," in International Symposium on Microarchitecture (MICRO), 2012.
[42] T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," in International Symposium on Microarchitecture (MICRO), 2013.
[43] A. ElTantawy and T. Aamodt, "Warp Scheduling for Fine-Grained Synchronization," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[44] H Wang F Luo M Ibrahim O Kayiran and A Jog ldquoEfficient and
Fair Multi-programming in GPUs via Pattern-Based Effictive Band-width Managementrdquo in International Symposium on High PerformanceComputer Architecture (HPCA) 2018
[45] O Kayiran A Jog M Kandemir and C Das ldquoNeither More Nor LessOptimizing Thread-Level Parallelism for GPGPUsrdquo in InternationalConference on Parallel Architectures and Compilation Techniques(PACT) 2013
[46] X Tang A Pattnaik H Jiang O Kayiran A Jog S Pai M IbrahimM T Kandemir and C R Das ldquoControlled Kernel Launch forDynamic Parallelism in GPUsrdquo in International Symposium on HighPerformance Computer Architecture (HPCA) 2017
[47] AMD ldquoROCm a New Era in Open GPU Computing [Online]rdquo 2016httpsradeonopencomputegithubioindexhtml
[48] A Sasidharan and M Snir ldquoMiniAMR - A miniapp for AdaptiveMesh Refinementrdquo 2014 httphdlhandlenet214291046
[49] Z Brown ldquoAsynchronous System Callsrdquo in Ottawa Linux Symposium2007
[50] J A Stuart P Balaji and J D Owens ldquoExtending mpi to acceleratorsrdquoPACT 2011 Workshop Series Architectures and Systems for Big DataOct 2011
[51] L Oden ldquoDirect Communication Methods for Distributed GPUsrdquo inDissertation Thesis Ruprecht-Karls-Universitat Heidelberg 2015
[52] NVidia ldquoNVIDIA GPUDirect [Online]rdquo 2013 httpsdevelopernvidiacomgpudirect
[53] H Wang S Potluri D Bureddy C Rosales and D K Panda ldquoGpu-aware mpi on rdma-enabled clusters Design implementation andevaluationrdquo IEEE Transactions on Parallel and Distributed Systemsvol 25 pp 2595ndash2605 Oct 2014
[54] E Gabriel G E Fagg G Bosilca T Angskun J J Dongarra J MSquyres V Sahay P Kambadur B Barrett A Lumsdaine R HCastain D J Daniel R L Graham and T S Woodall ldquoOpen MPIGoals concept and design of a next generation MPI implementationrdquoin Proceedings 11th European PVMMPI Usersrsquo Group Meeting(Budapest Hungary) pp 97ndash104 September 2004
[55] IBM ldquoIBM Spectrum MPIrdquo 2017 httpwww-03ibmcomsystemsspectrum-computingproductsmpi
interrupt message, the GPU also sends the ID number of the wavefront issuing the system call. This triggers the execution of an interrupt handler on the CPU. The CPU uses the wavefront ID to read the system call arguments from the syscall area in step 3. Subsequently, in step 4, the CPU processes the interrupt and writes the results back into the syscall area. Finally, in step 5, the CPU notifies the GPU wavefront that its system call has been processed.
We rely on the ability of the GPU to interrupt the CPU and use readily-available hardware [36–38] for this. However, this is not a fundamental design requirement; in fact, prior work [25, 27] uses a CPU polling thread to service a limited set of GPU system service requests instead. Further, while increasingly widespread features such as shared virtual memory and CPU-GPU cache coherence [4, 8, 9, 13] are beneficial to our design, they are not necessary. CPU-GPU communication can also be achieved via atomic reads/writes in system memory or GPU device memory [39].
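The five-step request flow described above can be sketched in miniature. The following is a single-threaded Python simulation, not the GENESYS implementation: the slot field names (`no`, `args`, `result`) and the `process` stand-in are illustrative assumptions, and a real CPU handler runs asynchronously in interrupt context.

```python
# A minimal sketch of the five-step GPU-to-CPU system call flow.
# The slot layout and field names are illustrative assumptions.

syscall_area = {}  # wavefront ID -> slot in the shared syscall area

def gpu_invoke(wf_id, syscall_no, args):
    # Steps 1-2: the GPU populates its slot and "interrupts" the CPU,
    # passing only the wavefront ID.
    syscall_area[wf_id] = {"no": syscall_no, "args": args, "result": None}
    return cpu_interrupt_handler(wf_id)

def cpu_interrupt_handler(wf_id):
    # Step 3: the CPU uses the wavefront ID to locate the arguments.
    slot = syscall_area[wf_id]
    # Step 4: process the request and write the result into the slot.
    slot["result"] = process(slot["no"], slot["args"])
    # Step 5: notify the waiting wavefront (here, simply return).
    return slot["result"]

def process(no, args):
    # Stand-in for real system call processing on the CPU.
    return sum(args) if no == 0 else -1

print(gpu_invoke(wf_id=7, syscall_no=0, args=[1, 2, 3]))  # -> 6
```

In hardware, step 2 is a real interrupt carrying only the wavefront ID; everything else travels through the shared syscall area.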
IV. ANALYZING SYSTEM CALLS

While designing GENESYS, we classified all of Linux's more than 300 system calls and assessed which ones to support. Some of the classifications were subjective and were debated even among ourselves. Many of the classification issues relate to the unique nature of the GPU's execution model.
Recall that GPUs use SIMD execution on thousands of concurrent threads. To keep such massive parallelism tractable, GPGPU programming languages like OpenCL [22] and CUDA [24] expose hierarchical groups of concurrently executing threads to programmers. The smallest granularity of execution is the GPU work-item (akin to a CPU thread). Several work-items (e.g., 32–64) operate in lockstep in the unit of wavefronts, the smallest hardware-scheduled unit of execution. Many wavefronts (e.g., 16) constitute programmer-visible work-groups and execute on a single GPU compute unit (CU). Work-items in a work-group can communicate among themselves using local CU caches and/or scratchpads. Hundreds of work-groups comprise a GPU kernel. The CPU dispatches work to a GPU at the granularity of a kernel. Each work-group in a kernel can execute independently. Further, it is possible to synchronize just the work-items within a single work-group [12, 22]. This avoids the cost of globally synchronizing across thousands of work-items in a kernel, which is often unnecessary in a GPU program and might not be possible under all circumstances.2
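The hierarchy above can be captured numerically with a back-of-envelope helper. This sketch is illustrative only; the default sizes (64-wide wavefronts, 1024 work-items per work-group) follow the example figures used in the text, not any particular GPU.

```python
# Hedged back-of-envelope view of the GPU execution hierarchy:
# work-items -> wavefronts -> work-groups -> kernel.

def hierarchy(kernel_items, wg_size=1024, wavefront_size=64):
    wavefronts_per_wg = wg_size // wavefront_size   # e.g., 16 per group
    work_groups = kernel_items // wg_size           # groups in the kernel
    return {"work_groups": work_groups,
            "wavefronts_per_wg": wavefronts_per_wg,
            "wavefronts_total": work_groups * wavefronts_per_wg}

# A hypothetical 1M work-item kernel.
print(hierarchy(1 << 20))
```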
The bottom line is that GPUs rely on far greater forms of parallelism than CPUs. This implies the following OS and architectural considerations in designing system calls.
Level of OS visibility into the GPU: When a CPU thread invokes the OS, that thread has a representation within the
2 Although there is no single formally-specified barrier to synchronize across work-groups today, recent work shows how to achieve the same effect by extending existing non-portable GPU inter-work-group barriers to use OpenCL 2.0 atomic operations [40].
kernel. The vast majority of modern OS kernels maintain a data structure for each thread for several common tasks (e.g., kernel resource use, permission checks, auditing). GPU tasks, however, have traditionally not been represented in the OS kernel. We believe this should not change. As discussed above, GPU threads are numerous and short-lived. Creating and managing a kernel structure for thousands of individual GPU threads would vastly slow down the system. These structures are also only useful for GPU threads that invoke the OS, and represent wasted overhead for the rest. Hence, we process system calls in OS worker threads and switch CPU contexts if necessary (see Section VI). As more GPUs support system calls, this is an area that will require careful consideration by kernel developers.
Evolution of GPU hardware: Many system calls are heavily dependent on architectural features that could be supported by future GPUs. For example, consider that modern GPUs do not expose their thread scheduler to software. This means that system calls to manage thread affinity (e.g., sched_setaffinity) are not implementable on GPUs today. However, a wealth of research has shown the benefits of GPU warp scheduling [41–45], so should GPUs acquire more sophisticated thread scheduling support appropriate for implementation in software, such system calls may ultimately become valuable.

With these design themes in mind, we discuss our classification of Linux's system calls.
① Readily-implementable: Examples include pread, pwrite, mmap, munmap, etc. This group is also the largest subset, comprising nearly 79% of Linux's system calls. In GENESYS, we implemented 14 such system calls for filesystems (read, write, pread, pwrite, open, close, lseek), networking (sendto, recvfrom), memory management (mmap, munmap, madvise), querying resource usage (getrusage), and signal invocation (rt_sigqueueinfo). Furthermore, we also implement device control ioctls. Some of these system calls, like read, write, and lseek, are stateful. Thus, GPU programmers must use them carefully: the current value of the file pointer determines what value is read or written by the read or write system call. This can be arbitrary if invoked at work-item or work-group granularity for the same file descriptor, because many work-items/work-groups can execute concurrently.
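The statefulness warning can be demonstrated with ordinary POSIX calls from Python. In this contrived sketch, two "work-items" arrive in the wrong order: write depends on the shared file pointer and simply reproduces the arrival order, while pwrite with offsets derived from work-item IDs is order-independent. The record layout and IDs are illustrative assumptions.

```python
import os, tempfile

# write() advances a shared file pointer, so concurrent work-items see
# order-dependent results; pwrite() takes an absolute offset and does not.

fd, path = tempfile.mkstemp()
for data in (b"bb", b"aa"):            # work-items arrive out of order
    os.write(fd, data)                 # order-dependent: pointer moves
shared = open(path, "rb").read()       # reflects arrival order

for item_id, data in ((1, b"bb"), (0, b"aa")):
    os.pwrite(fd, data, item_id * 2)   # offset derived from the item ID
fixed = open(path, "rb").read()        # deterministic layout

os.close(fd)
os.remove(path)
print(shared, fixed)  # b'bbaa' b'aabb'
```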
An important design issue is that GENESYS's support for standard POSIX APIs allows GPUs to read, write, and mmap any file descriptor Linux provides. This is particularly beneficial because of Linux's "everything is a file" philosophy – GENESYS readily supports features like the terminal for user I/O, pipes (including redirection of stdin, stdout, and stderr), files in /proc to query process environments, files in /sys to query and manipulate kernel parameters, etc. Although our studies focus on Linux, the broader domains of OS services represented by the specific system calls generalize to other
Table II
EXAMPLES OF SYSTEM CALLS THAT REQUIRE HARDWARE CHANGES TO BE IMPLEMENTABLE ON GPUS. IN TOTAL, THIS GROUP CONSISTS OF 13% OF ALL LINUX SYSTEM CALLS; IN CONTRAST, WE BELIEVE THAT 79% OF LINUX SYSTEM CALLS ARE READILY-IMPLEMENTABLE.

Type                  | Examples                                      | Reason that it is not currently implementable
capabilities          | capget, capset                                | Needs GPU thread representation in the kernel
namespace             | setns                                         | Needs GPU thread representation in the kernel
policies              | set_mempolicy                                 | Needs GPU thread representation in the kernel
thread scheduling     | sched_yield, set_cpu_affinity                 | Needs better control over the GPU scheduler
signals               | sigaction, sigsuspend, sigreturn, sigprocmask | Signals require the target thread to be paused and then resumed after the signal action has been completed; GPU threads cannot be targeted, and it is currently not possible to independently set program counters of individual threads. Executing signal actions in newly spawned threads might require freeing of GPU resources.
architecture specific | ioperm                                        | Not accessible from GPU
OSes like FreeBSD and Solaris.

At the application level, implementing this array of system calls opens new domains of OS-managed services for GPUs. In Table I, we summarize previously unimplementable system calls realized and studied in this paper. These include applications that use madvise for memory management and rt_sigqueueinfo for signals. We also go beyond prior work on GPUfs by supporting filesystem services that require more flexible APIs, with work-item invocation capabilities for good performance. Finally, we continue to support previously implementable system calls.
② Useful, but implementable only with changes to GPU hardware: Several system calls (13% of the total) seem useful for GPU code but are not easily implementable because of Linux's existing design. Consider sigsuspend/sigaction – there is no kernel representation of a GPU work-item to manage and dispatch a signal to. Additionally, there is no lightweight method to alter the GPU program counter of a work-item from the CPU kernel. One approach is for signal masks to be associated with the GPU context and for signals to be delivered as additional work-items. This works around the absence of GPU work-item representation in the kernel. However, POSIX requires threads that process signals to pause execution and resume only after the signal has been processed. Targeting the entire GPU context would mean that all GPU execution needs to halt while the work-item processing the signal executes, which goes against the parallel nature of GPU execution. Recent work has, however, shown the benefits of hardware support for dynamic kernel launch that allows on-demand spawning of kernels on the GPU without any CPU intervention [46]. Should such approaches be extended to support thread recombination – assembling multiple signal handlers into a single warp (akin to prior approaches to control flow divergence [42]) – sigsuspend or sigaction may become implementable. Table II presents more examples of system calls that are not currently implementable (in their original semantics).
③ Requires extensive modification to be supported: This group (8% of the total) contains perhaps the most controversial set of system calls. At this time, we do not believe that it is worth the implementation effort to support these system calls. For example, fork necessitates cloning a copy of the executing caller's GPU state. Technically, this can be done (e.g., it is how GPGPU code is context-switched with the graphics shaders), but it seems unnecessary at this time.
V. DESIGN SPACE EXPLORATION

A. GPU-Side Design Considerations

Invocation granularity: In assessing how best to use GPU system calls, several questions arise. The first and most important question is: how should system calls be aligned with the hierarchy of GPU execution groups? Should a GPU system call be invoked separately for each work-item, once for every work-group, or once for the entire GPU kernel?
Consider a GPU program that writes sorted integers to a single output file. One might, at first blush, invoke the write system call at each work-item. This can present correctness issues, however, because write is position-relative and requires access to a global file pointer. Without synchronizing the execution of write across work-items, the output file will be written in a non-deterministic, unsorted order.
Using different system call invocation granularities can fix this issue. One could, for example, use a memory location to temporarily buffer the sorted output. Once all work-items have updated this buffer, a single write system call at the end of the kernel can be used to write the contents of the buffer to the output file. This approach loses some benefits of the GPU's parallel resources, because the entire system call latency is exposed on the program's critical path and might not be overlapped with the execution of other work-items. Alternatively, one could use pwrite system calls instead of write system calls. Because pwrite allows programmers to specify the absolute file position where the output is to be written, per-work-item invocations present no correctness issue. However, per-work-item pwrites result in a flood of system calls, potentially harming performance.
Overly coarse kernel-grained invocations also restrict performance by reducing the possibility of overlapping system call processing with GPU execution. A compromise may be to invoke a pwrite system call per work-group, buffering
Figure 3. Work-items in a work-group (shown as a blue box) execute strongly ordered system calls.

Figure 4. Work-group invocations can be relax-ordered by removing one of the two barriers.
the sorted output of the work-items until the per-work-group buffers are fully populated. Section VII demonstrates that these decisions can lead to a 1.75× performance difference.
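The per-work-group compromise can be sketched as follows: work-items fill a shared buffer, and a single designated caller issues one pwrite for the whole group at an absolute offset derived from the group ID, so that groups may complete in any order. The group size and data below are illustrative assumptions.

```python
import os, tempfile

WG_SIZE = 4  # illustrative work-group size (real groups hold hundreds)

def run_work_group(fd, group_id, values):
    buf = bytearray(WG_SIZE)
    for item_id, v in enumerate(values):   # each work-item fills its slot
        buf[item_id] = v
    # One system call per work-group, at an offset derived from its ID.
    os.pwrite(fd, bytes(buf), group_id * WG_SIZE)

fd, path = tempfile.mkstemp()
run_work_group(fd, 1, [5, 6, 7, 8])   # groups may complete out of order
run_work_group(fd, 0, [1, 2, 3, 4])
os.close(fd)
content = open(path, "rb").read()
os.remove(path)
print(content)  # b'\x01\x02\x03\x04\x05\x06\x07\x08'
```

One call per group rather than per item keeps the CPU from being flooded, while absolute offsets keep the result deterministic.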
System call ordering semantics: When programmers invoke system calls on CPUs, they expect that all program instructions before the system call will complete execution before the system call executes. They also expect that instructions after the system call will only commence once the system call returns. We call this "strong ordering". For GPUs, however, we introduce "relaxed ordering" semantics. The notion of relaxed ordering is tied to the hierarchical execution scopes of the GPU and is needed for both correctness and performance.
Figure 3 shows a programmer-visible work-group (in blue) consisting of four wavefronts: A, B, C, and D. Each wavefront has two work-items (e.g., A0 and A1). If system calls (SysCs) are invoked per work-item, they are strongly ordered. Another approach is depicted in Figure 4, where one work-item, A0, invokes a system call on behalf of the entire work-group. Strong ordering is achieved by placing work-group barriers (Bar1, Bar2) before and after system call execution. One can remove these barriers with relaxed ordering, allowing threads in wavefronts B, C, and D to execute, overlapping with the CPU's processing of A's system call.
For correctness, we need to allow programmers to use relaxed ordering when system calls are invoked at kernel granularity (across work-groups). This is because kernels can consist of more work-items than can concurrently execute on the GPU. For strongly ordered system calls at the kernel level, all kernel work-items must finish executing pre-invocation instructions prior to invoking the system call. But all work-items cannot execute concurrently, because GPU runtimes do not preemptively context switch work-items of the same kernel. It is therefore not always possible for all work-items to execute all instructions prior to the system call. Strong ordering at kernel granularity risks deadlocking the GPU.
At the work-group invocation granularity, relaxed ordering can improve performance by avoiding synchronization overheads and overlapping CPU-side system call processing with the execution of other work-items. The key is to remove the barriers Bar1/Bar2 in Figure 4. To do this, consider that from the application's point of view, system calls are usually producers or consumers of data. Take a consumer call like write, invoked at the work-group level. Real-world GPU programs may use multiple work-items to generate the data for the write, but instructions after the write call typically do not depend on the write's outcome. Therefore, other work-items in the group need not wait for the completion of the system call, meaning that we can remove Bar2, improving performance. Similarly, producer system calls like read typically require system calls to return before other work-items can start executing program instructions post-read, but do not necessarily require other work-items in the work-group to finish executing all instructions before the read invocation. Bar1 in Figure 4 becomes unnecessary in these cases. The same observations apply to per-kernel system call invocations, which need relaxed ordering for correctness anyway.
In summary, work-item invocations imply strong ordering. Programmers balance performance and programmability for work-group invocations by choosing strong or relaxed ordering. Finally, programmers must use relaxed ordering for kernel invocations so that the GPU does not deadlock.
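The summary above amounts to a small compatibility table, sketched here in Python; the granularity and ordering names follow the text, and the dictionary is only an illustration of the rules, not a GENESYS interface.

```python
# Which ordering semantics are legal at each invocation granularity:
# relaxed is mandatory at kernel granularity to avoid GPU deadlock.

ALLOWED_ORDERING = {
    "work-item":  {"strong"},             # implied by per-item invocation
    "work-group": {"strong", "relaxed"},  # programmer's tradeoff
    "kernel":     {"relaxed"},            # strong ordering risks deadlock
}

def ordering_ok(granularity, ordering):
    return ordering in ALLOWED_ORDERING[granularity]

assert ordering_ok("work-item", "strong")
assert not ordering_ok("kernel", "strong")   # would risk deadlock
print("ordering rules ok")
```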
Blocking versus non-blocking approaches: Most traditional CPU system calls – barring those for asynchronous I/O (e.g., aio_read, aio_write) – return only after the system call completes execution. Blocking system calls have a disproportionate impact on performance because GPUs use SIMD execution. In particular, if even a single work-item is blocked on a system call, no other work-item in the wavefront can make progress either. We find that GPUs can often benefit from non-blocking system calls that can immediately return before system call processing completes. Non-blocking system calls can therefore overlap system call processing on the CPU with useful GPU work, improving performance by as much as 30% in some of our studies (see Section VII).
The concepts of blocking/non-blocking and strong/relaxed ordering are related but orthogonal. The strictness of ordering refers to the question of when a system call can be invoked with respect to the progress of work-items within its granularity of invocation. System call blocking refers to how the return from a system call relates to the completion of its processing. Relaxed ordering and non-blocking can be combined in several useful ways. For example, consider a case where a GPU program writes to a file at work-group granularity. The execution may not depend upon the output of the write, but the programmer may want to ensure that the write successfully completes. In such a scenario, blocking writes may be invoked with weak ordering. Weak ordering permits all but one wavefront in the work-group to proceed without waiting for completion of the write (see Section VI). Blocking invocation, however, ensures that one wavefront waits for the write to complete and can raise an alarm if the write fails. Consider another scenario, where a programmer wishes to prefetch data using read system calls but may not use the results immediately. Here, weak ordering with non-blocking invocation is likely to provide the best performance without breaking the program's semantics. In short, different combinations of blocking and ordering enable programmers to fine-tune performance and programmability tradeoffs.
B. CPU Hardware

GPUs rely on extreme parallelism for performance. This means there may be bursts of system calls that CPUs need to process. System call coalescing is one way to increase the throughput of system call processing. The idea is to collect several GPU system call requests and batch their processing on the CPU. The benefit is that CPUs can manage multiple system calls as a single unit, scheduling a single software task to process them. This reduction in task management, scheduling, and processing overheads can often boost performance (see Section VII). Coalescing must be performed judiciously, as it improves system call processing throughput at the expense of latency. It also implicitly serializes the processing of system calls within a coalesced bundle.
To allow the GPGPU programmer to balance the benefits of coalescing with its potential challenges, GENESYS accepts two parameters – a time window length within which the CPU coalesces system calls, and the maximum number of system call invocations that can be coalesced within the time window. Section VII shows that system call coalescing can improve performance by as much as 10–15%.
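A minimal sketch of how the two knobs interact, using hypothetical request timestamps: requests accumulate into a batch until either the time window (measured from the batch's first request) closes or the batch cap is reached. GENESYS applies this policy in the interrupt handler; the code below is only a policy illustration.

```python
# Batch requests under two limits: a time window and a per-batch cap.

def coalesce(timestamps, window, max_batch):
    batches, current = [], []
    for t in sorted(timestamps):
        # Close the current batch if the window expired or it is full.
        if current and (t - current[0] > window or len(current) == max_batch):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches

# Six requests; a window of 5 time units, at most 3 calls per batch.
print(coalesce([0, 1, 2, 3, 9, 10], window=5, max_batch=3))
# -> [[0, 1, 2], [3], [9, 10]]
```

A longer window or larger cap raises throughput per batch at the cost of latency for the earliest request in each bundle, which is exactly the tradeoff described above.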
C. CPU-GPU Communication Hardware

Prior work implemented system calls using polling, where GPU wavefronts monitored predesignated memory locations populated by CPUs upon system call completion. But recent advances in GPU hardware enable alternate modes of CPU-GPU communication. For example, AMD GPUs now allow wavefronts to relay interrupts to CPUs and then halt execution, relinquishing SIMD hardware resources [36]. CPUs can in turn message the GPU to wake up halted wavefronts.

We have implemented both polling and halt-resume approaches in GENESYS. With polling, if the number of memory locations that needs to be polled by the GPU exceeds its cache size, frequent cache misses lower performance. On the other hand, halt-resume has its own overheads, namely the latency to resume a halted wavefront.

We have found that polling yields better performance when system calls are invoked at coarser work-group granularities, since fewer memory locations are needed to convey information at the work-group versus work-item level. Consequently, the memory locations easily fit in the GPU's L2 data cache. When system calls are invoked per work-item, however, the
Table III
SYSTEM CONFIGURATION USED FOR OUR STUDIES

SoC              | Mobile AMD FX-9800P
CPU              | 4× 2.7 GHz AMD Family 21h Model 101h
Integrated GPU   | 758 MHz AMD GCN 3 GPU
Memory           | 16 GB Dual-Channel DDR4-1066MHz
Operating system | Fedora 26 using ROCm stack 1.6 (based on Linux 4.11)
Compiler         | HCC-0.10.17166 + LLVM 5.0; C++AMP with HC extensions
Figure 5. Content of each slot in the syscall area.

Figure 6. State transition diagram for a slot in the syscall area. Green shows GPU states/actions; blue shows those of the CPU.
sheer number of such memory locations becomes so high that cache thrashing becomes an issue. In these situations, halt-resume approaches outperform polling. We quantify the impact of this in the following sections.
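The tradeoff just described can be phrased as a toy cost model. All costs and sizes below are illustrative assumptions, not measurements: polling is modeled as cheap while the polled slots fit in the L2 cache and expensive once they thrash it, at which point halt-resume's fixed resume latency wins.

```python
# Toy model of polling vs. halt-resume. The constants are hypothetical.

def pick_mechanism(active_invocations, slot_bytes=64,
                   l2_bytes=1 << 20, resume_cost=2.0):
    footprint = active_invocations * slot_bytes
    # Polling is cheap while slots fit in L2; thrashing makes it costly.
    poll_cost = 1.0 if footprint <= l2_bytes else 10.0
    return "polling" if poll_cost < resume_cost else "halt-resume"

print(pick_mechanism(1 << 10))   # work-group granularity: few slots
print(pick_mechanism(1 << 20))   # work-item granularity: many slots
```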
VI. IMPLEMENTING GENESYS

We implemented GENESYS on the system in Table III. We used an AMD FX-9800P processor with an integrated GPU and ran the open-source Radeon Open Compute (ROCm) software stack [47]. Although we use a system with an integrated GPU, GENESYS is not specific to integrated GPUs and generalizes to discrete GPUs. We modified the GPU driver and Linux kernel to enable GPU system calls. We also modified the HCC compiler to permit GPU system call invocations in C++AMP.
Invoking GPU system calls: GENESYS permits work-item, work-group, and kernel-level invocation. For work-group or kernel-level invocations, a single work-item is designated as the caller. For strongly ordered work-group invocations, we use work-group scope barriers before and after system call invocations. For relaxed ordering, a barrier is placed either before (for consumer system calls) or after (for producer calls) the system call invocation.
GPU to CPU communication: GENESYS facilitates efficient GPU to CPU communication of system call requests. GENESYS uses a preallocated shared memory syscall area to allow the GPU to convey parameters to the CPU (see Section III). The syscall area maintains one slot for each active GPU work-item. The OS and driver code can identify the desired slot by using the hardware ID of the active work-item, which is available to GPU runtimes. This hardware ID is distinct from the programmer-visible work-item ID. Each of the work-items has a unique work-item ID visible to the application programmer. At any one point in time, however, only a subset of these work-items executes (as permitted by the GPU's CU count, supported work-groups per CU, and SIMD width). The hardware ID distinguishes among these active work-items. Overall, our system uses 64 bytes per slot, totaling 1.25 MB of syscall area.
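The 64-byte slot can be illustrated with a ctypes layout. The field names and the exact split (a 32-bit call number and state, six 64-bit arguments, and padding out to one cache line) are assumptions for illustration; Figure 5 gives the authoritative layout.

```python
import ctypes

# Illustrative 64-byte syscall-area slot: one cache line per active
# work-item, sized so that atomics on the slot touch a single line.

class SyscallSlot(ctypes.Structure):
    _fields_ = [
        ("syscall_no", ctypes.c_uint32),      # requested system call
        ("state",      ctypes.c_uint32),      # free/populating/ready/...
        ("args",       ctypes.c_uint64 * 6),  # also holds the return value
        ("pad",        ctypes.c_uint8 * 8),   # fill out the cache line
    ]

print(ctypes.sizeof(SyscallSlot))  # 64
```

At 64 bytes per slot, roughly 20K active work-item slots account for the 1.25 MB syscall area mentioned above.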
Figure 5 shows the contents of a slot: the requested system call number, the request state, the system call arguments (as many as 6 in Linux), and padding to avoid false sharing. The field for arguments is also re-purposed for the return value of the system call. When a GPU program's work-item invokes a system call, it atomically updates the state of the corresponding slot from free to populating (Figure 6). If the slot is not free, system call invocation is delayed. Once the state is populating, the invoking work-item populates the slot with system call information and changes the state to ready. The work-item also adds one bit of information about whether the invocation is blocking or non-blocking. The work-item then interrupts the CPU using a scalar wavefront GPU instruction3 (s_sendmsg on AMD GPUs). For a blocking invocation, the work-item either waits and polls the state of the slot, or suspends itself using halt-resume.
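The slot lifecycle just described can be simulated directly. In this sketch a Python lock stands in for the GPU's atomic cmp-swap; the state names follow Figure 6, and the finished-to-free transition is omitted for brevity.

```python
import threading

# Simulation of the per-slot state machine, driven by compare-and-swap.

FREE, POPULATING, READY, PROCESSING, FINISHED = range(5)

class Slot:
    def __init__(self):
        self.state = FREE
        self._lock = threading.Lock()

    def cas(self, expect, new):
        with self._lock:           # stands in for an atomic cmp-swap
            if self.state == expect:
                self.state = new
                return True
            return False

slot = Slot()
assert slot.cas(FREE, POPULATING)      # GPU work-item claims the slot
assert not slot.cas(FREE, POPULATING)  # a second claim must fail
assert slot.cas(POPULATING, READY)     # arguments have been written
assert slot.cas(READY, PROCESSING)     # CPU worker picks the request up
assert slot.cas(PROCESSING, FINISHED)  # blocking call: result is ready
print("state machine ok")
```

Claiming the slot with cmp-swap is what makes delayed invocation safe: a work-item that loses the race simply retries rather than corrupting an in-flight request.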
CPU-side system call processing: Once the GPU interrupts the CPU, system call processing commences. The interrupt handler creates a new kernel task and adds it to Linux's work-queue. This task is also passed the hardware ID of the requesting wavefront. At an expedient future point in time, an OS worker thread executes this task. The task scans the 64 syscall slots of the given hardware ID and atomically switches any ready system call requests to the processing state. The task then carries out the system call work.
A challenge is that Linux's traditional system call routines implicitly assume that they are executed within the context of the original process invoking the system call. Consequently, they use context-specific variables (e.g., the current variable used to identify the process context). This presents a challenge for GPU system calls, which are instead serviced purely in the context of the OS's worker thread. GENESYS overcomes this challenge in two ways – it either switches to the context of the original CPU program that invoked the GPU kernel, or it provides additional context information in the code performing system call processing.
3 Scalar wavefront instructions are part of the Graphics Core Next ISA and are executed once for the entire wavefront, rather than for each active work-item. See Chapter 4.1 in the manual [36].
The exact strategy is determined on a case-by-case basis.

GENESYS implements coalescing by waiting for a predetermined amount of time in the interrupt handler before enqueueing a task to process a system call. If multiple requests are made to the CPU during this time window, they are coalesced such that they can be handled as a single unit of work by the OS worker thread. GENESYS uses Linux's sysfs interface to communicate the coalescing parameters.
Communicating results from the CPU to the GPU: Once the CPU worker thread finishes processing the system call, the results are put in the arguments field of the slot for blocking requests. Further, the thread also changes the state of the slot to finished for blocking system calls. For non-blocking invocations, the state is changed to free. The invoking GPU work-item is then re-scheduled (on hardware that supports wavefront suspension) or automatically restarted, because it was polling on this state. The work-item can consume the result and continue execution.
Other architectural design considerations: GENESYS requires two key data structures to be exchanged between GPUs and CPUs – the syscall area and, for some system calls, syscall buffers that maintain data required for system call processing. We discovered that carefully leveraging architectural support for CPU-GPU cache coherence in the context of these data structures was vital to overall performance.
Consider, for example, the syscall area. As described in Section III, every GPU work-item invoking a system call is allocated space in the syscall area. Like many GPUs, the one used as our experimental platform supports L2 data caches that are coherent with CPU caches/memory, but integrates non-coherent L1 data caches. At first blush, one may decide to manually invalidate L1 data cache lines. However, we sidestepped this issue by restricting per-work-item slots in the syscall area to individual cache lines. This design permits us to use atomic instructions to access memory – these atomic instructions, by design, force lookups of the L2 data cache and guarantee visibility of the entire cache line, sidestepping the coherence challenges of L1 GPU data caches. We quantify the measured overheads of the atomic operations we use for GENESYS in Table IV, comparing them to the latency of a standard load operation. Through experimentation, we achieved good performance using cmp-swap atomics to claim a slot in the syscall area when GPU work-items invoke system calls, atomic-swaps to change state, and atomic-loads to poll for completion.
Unfortunately, the same approach of using atomics does not yield good performance for accesses to syscall buffers. The key issue is that syscall buffers can be large and span multiple cache lines. Using atomics here meant that we suffered the latency of several L2 data cache accesses to syscall buffers. We found that a better approach was to eschew atomics in favor of manual software L1 data cache coherence. This
Table IV
PROFILED PERFORMANCE OF GPU ATOMIC OPERATIONS

Op        | cmp-swap | swap  | atomic-load | load
Time (us) | 1.352    | 0.782 | 0.360       | 0.243
Figure 7. Impact of system call invocation granularity. Both graphs plot GPU time (s) against file size (32 MB–2048 MB); the left compares work-item (wi), work-group (wg), and kernel invocation granularities, while the right compares work-group sizes from 64 (wg64) to 1024 (wg1024) work-items. The graphs show the average and standard deviation of 10 runs.
meant, for example, that we preceded sys_write system calls with an L1 data cache flush.
VII. MICROBENCHMARK EVALUATIONS

Invocation granularity: Figure 7 quantifies performance for a microbenchmark that uses pread on files in tmpfs.4 The x-axis plots the file size, and the y-axis shows read time, with lower values being better. Within each cluster, we separate runtimes for different pread invocation granularities.
Figure 7(left) shows that work-item invocation granularitiestend to perform worst This is not surprising as it is the finestgranularity of system call invocation and leads to a flood ofindividual system calls that overwhelm the CPU On the otherend of the granularity spectrum kernel-level invocation isalso challenging as it generates a single system call and failsto leverage any potential parallelism in processing of systemcall requests This is particularly pronounced at large filesizes (eg 2GB) A good compromise is to use work-groupinvocation granularities It does not overwhelm the CPUwith system call requests while still being able to exploitparallelism in servicing system calls
When using work-group invocation, an important question is how many work-items should constitute a work-group. While Figure 7 (left) uses 64 work-items in a work-group, Figure 7 (right) quantifies the performance impact of pread system calls as we vary work-group sizes from 64 (wg64) to 1024 (wg1024) work-items. In general, larger work-group sizes enable better performance, as there are fewer unique system calls necessary to handle the same amount of work.
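The work-group granularity discussed above can be illustrated with a host-side sketch: a "leader" work-item issues one pread for the whole group's chunk at an absolute offset, and every work-item then consumes its slice of the buffered chunk. The sizes and names are hypothetical, and os.pread assumes a POSIX host:

```python
import os, tempfile

WG_SIZE = 64       # work-items per work-group (hypothetical)
ITEM_BYTES = 16    # bytes each work-item consumes (hypothetical)

def wg_pread(fd, wg_id):
    """One pread per work-group: a leader work-item reads the group's
    whole chunk at an absolute offset, then every work-item takes its
    slice of the buffered chunk. No shared file pointer is needed, so
    work-groups may issue their reads in any order."""
    chunk_bytes = WG_SIZE * ITEM_BYTES
    chunk = os.pread(fd, chunk_bytes, wg_id * chunk_bytes)
    return [chunk[i * ITEM_BYTES:(i + 1) * ITEM_BYTES]
            for i in range(WG_SIZE)]

fd, path = tempfile.mkstemp()
data = bytes(range(256)) * 16          # 4096 bytes of test input
os.write(fd, data)
slices = wg_pread(fd, wg_id=1)         # second work-group's chunk
assert slices[0] == data[1024:1040]    # its first work-item's slice
os.close(fd)
os.remove(path)
```

One call amortizes system call overhead over 64 work-items, which is the compromise between per-work-item floods and a single kernel-grained call.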
Blocking and ordering strategies. To quantify the impact of blocking/non-blocking invocation with strong/relaxed ordering, we
4 Tmpfs is a filesystem without permanent backing storage. In other words, all structures and data are memory-resident.
[Figure 8 appears here: GPU time per permutation (ms, 4.5-7.5) versus the number of compute permutations (0-40), comparing weak-block, weak-non-block, strong-block, and strong-non-block.]
Figure 8. Performance implications of system call blocking and ordering semantics. The graph shows average and standard deviation of 80 runs.
designed a GPU microbenchmark that performs block permutation on an array, similar to the permutation steps performed in DES encryption. The input data array is preloaded with random values and divided into 8KB blocks. Work-groups of 1024 work-items each independently permute blocks. The results are written to a file using pwrite at work-group granularity. The pwrite system calls for one block of data are overlapped with permutations on other blocks of data. To vary the amount of computation per system call, we permute multiple times before writing the result.
Figure 8 quantifies the impact of using blocking versus non-blocking system calls with strong and weak ordering. The x-axis plots the number of permutation iterations performed on each block by each work-group before writing the results. The y-axis plots execution time for one permutation (lower is better). Figure 8 shows that strongly ordered blocking invocations (strong-block) hurt performance. This is expected, as they require work-group scoped barriers to be placed before and after pwrite invocations. The GPU's hardware resources for work-groups are stalled waiting for the pwrite to complete. Not only does the inability to overlap GPU parallelism hurt strongly ordered blocking performance, it also means that GPU performance is heavily influenced by CPU-side performance vagaries like synchronization overheads, servicing other processes, etc. This is particularly true at iteration counts where system call overheads dominate GPU-side computation (below 15 compute iterations). Even when the application becomes GPU-compute bound, performance remains non-ideal.
Figure 8 shows that when pwrite is invoked in a non-blocking manner (with strong ordering), performance improves. This is because non-blocking pwrites permit the GPU to end the invoking work-group, freeing the GPU resources it was occupying. CPU-side pwrite processing can overlap with the execution of another work-group permuting a different block of data. Figure 8 shows that latencies generally drop by 30% compared to blocking calls at low iteration counts. At higher iteration counts (beyond 16), these benefits diminish because the latency to perform repeated
[Figure 9 appears here: CPU memcpy throughput (GB/s, 0-6) versus the number of GPU atomically polled cachelines (0-20000), measured on an AMD FX-9800P with DDR4-1066 memory.]
Figure 9. Impact of polling on memory contention.
permutations dominates any system call processing times. For relaxed ordering with blocking (weak-block), the post-system-call work-group-scope barrier is eliminated. One out of every 16 wavefronts (see footnote 5) in the work-group waits for the blocking system call while the others exit, freeing GPU resources. The GPU can use these freed resources to run other work-items to hide CPU-side system call latency. Consequently, the performance trends follow those of strong-non-block, with minor differences arising from differences in the GPU's scheduling of the work-groups for execution. Finally, Figure 8 shows that system call latency is best hidden using relaxed and non-blocking approaches (weak-non-block).
Polling/halt-resume and memory contention. As previously discussed, polling at the work-item invocation granularity leads to memory reads of several thousands of per-work-item memory locations. We quantify the resulting memory contention in Figure 9, which shows how the throughput of CPU accesses decreases as the number of polled GPU cache lines increases. Once the number of polled memory locations outstrips the GPU's L2 cache capacity (roughly 4096 cache lines on our platform), the GPU polls locations spilled to the DRAM. This causes contention on the memory controllers shared between the GPU and CPUs. We advocate using GENESYS with halt-resume approaches at such granularities of system call invocation.
Interrupt coalescing. Figure 10 shows the performance impact of coalescing system calls. We use a microbenchmark that invokes pread. We read data from files of different sizes using a constant number of work-items, so more data is read per pread system call from the larger files. The x-axis shows the amount of data read per call, and the y-axis quantifies the latency per requested byte. We present two bars for each point on the x-axis, illustrating the average time needed to read one byte with the system call in the absence of coalescing and when up to eight system calls are coalesced. Coalescing is most beneficial when small amounts of data are read. When more data is read, the overhead reduction is negligible compared to the significantly longer time to process the system call itself.
5 Each wavefront has 64 work-items. Thus, a 1024 work-item work-group has 16 wavefronts.
[Figure 10 appears here: GPU time per requested byte (ms, 0-5) versus size per syscall (64B-4096B), comparing uncoalesced invocations with coalescing of up to 8 interrupts.]
Figure 10. Implications of system call coalescing. The graph shows average and standard deviation of 20 runs.
VIII CASE STUDIES
A Memory Workload
We now assess the end-to-end performance of workloads that use GENESYS. Our first application requires memory management system calls. We studied miniAMR [48] and used the madvise system call directly from the GPU to better manage memory. MiniAMR performs 3D stencil calculations using adaptive mesh refinement and is a candidate for memory management because it varies its memory needs in a data-model-dependent manner. For example, when simulating regions experiencing turbulence, miniAMR needs to model with higher resolution. However, if lower resolution is possible without compromising simulation accuracy, miniAMR reduces memory and computation usage, making it possible to free memory. While relinquishing excess memory in this way is not possible in traditional GPU programs without explicitly dividing the offload into multiple kernels interspersed with CPU code (see Figure 1), GENESYS permits this with direct GPU madvise invocations. We invoke madvise at work-group granularity with non-blocking and weak ordering. We leverage GENESYS' generality by also using getrusage to read the application's resident set size (RSS). When the RSS exceeds a watermark, madvise relinquishes memory.
[Figure 11 appears here: RSS (GB, 0-4.5) versus time (s, 0-200) for the rss-3GB and rss-4GB watermarks.]
Figure 11. Memory footprint of miniAMR using getrusage and madvise to hint at unused memory.
We execute miniAMR with a dataset of 4.1GB, just beyond the hard limit we put on the physical memory available to the GPU. Without using madvise, memory swapping increases so dramatically that it triggers GPU timeouts, causing the existing GPU device driver to terminate the application. Because of this, there is no baseline to compare to; the baseline simply does not complete.
Figure 11 shows two results, one for a 3GB RSS watermark and one for 4GB. Not only does GENESYS enable miniAMR to complete, it also permits the programmer to balance memory usage and performance. While rss-3GB lowers memory utilization, it also worsens runtime compared to rss-4GB. This performance gap is expected: the more memory is released, the greater the likelihood that the GPU program suffers page faults when the memory is touched again in the future, and the more frequently the madvise system call is invoked. Overall, Figure 11 illustrates that GENESYS allows programmers to manage memory allocation to trade memory usage for performance on GPUs, analogous to CPUs.
B Workload Using Signals
[Figure 12 appears here: running time (s, 0-12) of the baseline versus GENESYS.]
Figure 12. Runtime of CPU-GPU map-reduce workload.
GENESYS also enables system calls that permit the GPU to send asynchronous notifications to the CPU. This is useful in many scenarios. We study one such scenario and implement a map-reduce application called signal-search. The application runs in two phases. The first phase performs parallel lookup in a data array, while the second phase processes blocks of retrieved data and computes sha512 checksums on them. The first phase is a good fit for GPUs, since the highly parallel lookup fits their execution model, while the second phase is more appropriate for CPUs, many of which support performance acceleration of sha checksums via dedicated instructions.
Without support for signal invocation on GPUs, programmers would launch a kernel with the entire parallel lookup on the GPU and wait for it to conclude before proceeding with the sha512 checksums. GENESYS overcomes this restriction and permits a heterogeneous version of this code, where GPU work-groups can emit signals using rt_sigqueueinfo to the CPU, indicating per-block completions of the parallel search. As a result, the CPU can begin processing these blocks immediately, permitting overlap with GPU execution.
Operationally, rt_sigqueueinfo is a generic signal system call that allows the caller to fill a siginfo structure that is passed along with the signal. In our workload, we find that work-group level invocations perform best, so the GPU passes the identifier of the work-group through the siginfo structure. Figure 12 shows that using work-group invocation granularity and non-blocking invocation results in a roughly 14% performance speedup over the baseline.
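The notification pattern can be approximated on the host as below. Python's signal API cannot carry a siginfo payload the way rt_sigqueueinfo does, so a mailbox list stands in for the siginfo field; all names are illustrative, and os.kill to self substitutes for the GPU-side invocation:

```python
import os, signal

completed = []    # per-block completions observed by the CPU side
mailbox = []      # stand-in for the siginfo payload field

def on_block_done(signum, frame):
    # With rt_sigqueueinfo, the work-group id would arrive in siginfo;
    # here the handler pulls it from the mailbox instead.
    completed.append(mailbox.pop(0))

signal.signal(signal.SIGUSR1, on_block_done)

def gpu_emit(wg_id):
    """Simulates a GPU work-group signalling that its block is done."""
    mailbox.append(wg_id)
    os.kill(os.getpid(), signal.SIGUSR1)    # stand-in for rt_sigqueueinfo

for wg in (0, 1, 2):
    gpu_emit(wg)    # CPU can start checksumming block `wg` right away

assert completed == [0, 1, 2]
```

The point of the pattern is the overlap: each handler invocation hands the CPU a finished block while the producer keeps searching.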
C Storage Workloads
We have also studied how to use GENESYS to support storage workloads. In some cases, GENESYS permits the implementation of workloads supported by GPUfs, but in more efficient ways.
[Figure 13 appears here: a) running time (s, broken axis up to 26) of CPU-grep, GPU-GENESYS (WG), GPU-GENESYS (WI), GPU-GENESYS (WI)-suspend, and CPU-openmp; b) running time (s, 0-45) of GPU-GENESYS, GPU-NOSYS, and CPU.]
Figure 13. a) Standard grep and OpenMP grep versus work-item/work-group invocations with GENESYS; b) comparing the performance of CPU, GPU with no system calls, and GENESYS implementations of wordcount. Graphs show average and standard deviation of 10 runs.
Supporting storage workloads more efficiently than GPUfs. We implement a workload that takes as input a list of words and a list of files. It then performs grep -F -l on the GPU, identifying which files contain any of the words on the list. As soon as these files are found, they are printed to the console. This workload cannot be supported by GPUfs without significant code refactoring because of its use of custom APIs. Instead, since GENESYS naturally adheres to the UNIX "everything is a file" philosophy, porting grep to GPUs requires only a few hours of programming.
Figure 13 shows the results of our GPU grep experiments. We compare a standard CPU implementation, a parallelized OpenMP CPU implementation, and two GPU implementations with GENESYS, with non-blocking system calls invoked at work-item (WI) and work-group (WG) granularities. Furthermore, since work-item invocations can sometimes achieve better performance using halt-resume (versus work-group and kernel invocations, which always achieve better performance with polling), we separate results for WI-polling and WI-halt-resume. GENESYS enables our GPU grep implementation to print output lines to the console or files using standard OS output. GENESYS achieves 20-23× speedups over an OpenMP version of grep.
Figure 13 also shows that GENESYS' flexibility with invocation granularity, blocking/non-blocking, and strong/relaxed ordering can boost performance (see Section V). For our grep example, because a file only needs to be searched until the first instance of a matching word, a work-item can immediately invoke a write of the filename rather than waiting for all matching files. We find that WI-halt-resume outperforms both WG and WI-polling by roughly 3-4%.
[Figure 14 appears here: CPU utilization (%, 0-100) and IO throughput (MB/s, 0-200) over time (s, 0-45) for GPU-NOSYS, GPU-GENESYS, and CPU.]
Figure 14. Wordcount IO and CPU utilization reading from SSD. Graphs show average and standard deviation of 10 runs.
Workload from prior work. GENESYS also supports a version of the same workload that is evaluated in the original GPUfs work: traditional word count using open, read, and close system calls. Figure 13 shows our results. We compare the performance of a parallelized CPU version of the workload with OpenMP, a GPU version of the workload with no system calls, and a GPU version with GENESYS. Both CPU and GPU workloads are configured to search for occurrences of 64 strings. We found that with GENESYS these system calls were best invoked at work-group granularity with blocking and weak-ordering semantics. All results are collected on a system with an SSD.
We found that GENESYS achieves a nearly 6× performance improvement over the CPU version. Without system call support, the GPU version is far worse than the CPU version. Figure 14 sheds light on these benefits. We plot traces for CPU and GPU utilization and disk IO throughput. GENESYS extracts much higher throughput from the underlying storage device (up to 170MB/s compared to the CPU's 30MB/s). Offloading search-string processing to the GPU frees up the CPU to process system calls effectively. The change in CPU utilization between the GPU workload and CPU workload reveals this trend. In addition, we found that the GPU's ability to launch more concurrent IO requests enabled the IO scheduler to make better scheduling decisions.
D Network Workloads
We have studied the benefits of GENESYS for network IO in the context of memcached. While this could technically be implemented using GPUnet, the performance implications are unclear because the original GPUnet paper used APIs that extracted performance from dedicated RDMA hardware. We make no such assumptions and focus on the two core commands, SET and GET. SET stores a key-value pair, and GET retrieves the value associated with a key if it is present. Our implementation supports a binary UDP memcached version with a fixed-size hash table as back-end storage. The hash table is shared between the CPU and GPU, and its bucket size and count are configurable. Further, this memcached
[Figure 15 appears here: left, latency (ms, 0-2.5) for misses and hits; right, throughput (ops/s, 0-60000) for misses and hits; each comparing CPU, GPU-GENESYS, and GPU-NOSYS.]
Figure 15. Latency and throughput of memcached. Graphs show average and standard deviation of 20 runs.
implementation enables concurrent operations: CPUs can handle SETs and GETs, while the GPU supports only GETs. Our GPU implementation parallelizes the hash computation, bucket lookup, and data copy. We use sendto and recvfrom system calls for UDP network access. These system calls are invoked at work-group granularity with blocking and weak ordering, as this performs best.
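A minimal sketch of the GET path follows. The FNV-1a hash and the bucket layout are our assumptions for illustration, not the paper's implementation; the bucket scan in memc_get is the portion GPU work-items would perform in parallel:

```python
NUM_BUCKETS = 64    # fixed-size table, as in the paper (count is ours)

def fnv1a(key: bytes) -> int:
    # FNV-1a: a simple hash both the CPU and GPU sides could compute.
    h = 0xcbf29ce484222325
    for b in key:
        h = ((h ^ b) * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

table = [[] for _ in range(NUM_BUCKETS)]   # bucket -> list of (key, value)

def memc_set(key, value):
    """CPU-side SET: insert into the shared hash table."""
    table[fnv1a(key) % NUM_BUCKETS].append((key, value))

def memc_get(key):
    """GPU-side GET: hash, pick the bucket, scan its entries. The
    entry scan is what work-items would do in parallel on the GPU."""
    for k, v in table[fnv1a(key) % NUM_BUCKETS]:
        if k == key:
            return v
    return None

memc_set(b"user:42", b"alice")
assert memc_get(b"user:42") == b"alice"    # hit
assert memc_get(b"user:43") is None        # miss
```

The parallel win grows with bucket occupancy, which is why the speedups below are reported at 1024 elements per bucket.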
Figure 15 compares the performance of a CPU version of this workload with GPU versions using GENESYS. GPUs accelerate memcached by parallelizing lookups on buckets with more elements. For example, Figure 15 shows speedups when there are 1024 elements per bucket (with 1KB data size). Without system calls, GPU performance lags behind CPU performance. However, GENESYS achieves 30-40% latency and throughput benefits over not just the CPU versions but also the GPU versions without direct system calls.
E Device Control
Finally, we also used GENESYS to implement ioctl system calls. As an example, we used ioctl to query and control framebuffer settings. The implementation is straightforward: the GPU opens /dev/fb0 and issues a series of ioctl commands to query and set the settings of the active frame buffer. It then proceeds to mmap the framebuffer memory and fill it with data from a previously mmaped raster image. This results in the image being displayed on the computer screen. While not a critical GPGPU application, this ioctl example demonstrates the generality and flexibility of the OS interfaces implemented by GENESYS.
IX DISCUSSION
Asynchronous system call handling. GENESYS enqueues the GPU system call's kernel task and processes it outside of the interrupt context. We do this because Linux is designed such that few operations can be processed in an interrupt handler. A potential concern with this design, however, is that it may defer system call processing past the end of the lifetime of the GPU thread, or even of the process that created the GPU thread. This is an example of a more general problem with asynchronous system calls [49].
Figure 16. Raster image copied to the framebuffer by the GPU.
Our solution is to provide a new function call, invoked by the CPU, that ensures all GPU system calls have completed before the termination of the process.
Related work. Beyond work already discussed [25, 26, 28], the latest generations of C++AMP [23] and CUDA [24] provide access to the memory allocator. These efforts use a user-mode service thread on the CPU to proxy requests from the GPU [27]. Like GENESYS, system call requests are placed in a shared queue by the GPU. From there, however, the designs differ. Their user-mode thread polls this shared queue and "thunks" the request to the libc runtime or OS. This incurs added overhead for entering and leaving the OS kernel.
Some studies provide network access to GPU code [26, 50-52]. NVidia provides GPUDirect [52], used by several MPI libraries [53-55], which allows the NIC to bypass main memory and communicate directly with memory on the GPU itself. GPUDirect does not provide a programming interface for GPU-side code; the CPU must initiate communication with the NIC. Oden exposed the memory-mapped control interface of the NIC to the GPU and thereby allowed the GPU to communicate directly with the NIC [51]. This low-level interface, however, lacks the benefits of a traditional OS interface (e.g., protection, sockets, TCP).
X CONCLUSIONS
We shed light on research questions fundamental to the idea of accessing OS services from accelerators by realizing an interface for generic POSIX system call support on GPUs. Enabling such support requires subtle changes to existing kernels. In particular, traditional OSes assume that system call processing occurs in the same context as the invoking thread, and this needs to change for accelerators. We have released GENESYS to make these benefits accessible for broader research on GPUs.
XI ACKNOWLEDGMENTS
We thank the National Science Foundation, which partially supported this work through grants 1253700 and 1337147. We thank Guilherme Cox and Mark Silberstein for their feedback. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a registered trademark of PCI-SIG Corporation. ARM is a registered trademark of ARM Limited. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. © 2018 Advanced Micro Devices, Inc. All rights reserved.
REFERENCES
[1] S. Mittal and J. Vetter, "A Survey of CPU-GPU Heterogeneous Computing Techniques," in ACM Computing Surveys, 2015.
[2] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. Lefohn, and T. Purcell, "A Survey of General-Purpose Computation on Graphics Hardware," in Computer Graphics Forum, Vol. 26, Num. 1, pp. 80-113, 2007.
[3] D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006.
[4] J. Vesely, A. Basu, M. Oskin, G. Loh, and A. Bhattacharjee, "Observations and Opportunities in Architecting Shared Virtual Memory for Heterogeneous Systems," in International Symposium on Performance Analysis of Systems and Software (ISPASS), 2016.
[5] J. Obert, J. V. Waveran, and G. Sellers, "Virtual Texturing in Software and Hardware," in SIGGRAPH Courses, 2012.
[6] G. Cox and A. Bhattacharjee, "Efficient Address Translation with Multiple Page Sizes," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017.
[7] L. Olson, J. Power, M. Hill, and D. Wood, "Border Control: Sandboxing Accelerators," in International Symposium on Microarchitecture (MICRO), 2007.
[8] B. Pichai, L. Hsu, and A. Bhattacharjee, "Architectural Support for Address Translation on GPUs," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014.
[9] J. Power, M. Hill, and D. Wood, "Supporting x86-64 Address Translation for 100s of GPU Lanes," in International Symposium on High Performance Computer Architecture (HPCA), 2014.
[10] N. Agarwal, D. Nellans, E. Ebrahimi, T. Wenisch, J. Danskin, and S. Keckler, "Selective GPU Caches to Eliminate CPU-GPU HW Cache Coherence," in International Symposium on High Performance Computer Architecture (HPCA), 2016.
[11] HSA Foundation, "HSA Programmer's Reference Manual [Online]," 2015. http://www.hsafoundation.com/?ddownload=4945
[12] G. Kyriazis, "Heterogeneous System Architecture: A Technical Review," 2012.
[13] J. Power, A. Basu, J. Gu, S. Puthoor, B. Beckmann, M. Hill, S. Reinhardt, and D. Wood, "Heterogeneous System Coherence for Integrated CPU-GPU Systems," in International Symposium on Microarchitecture (MICRO), 2013.
[14] S. Sahar, S. Bergman, and M. Silberstein, "Active Pointers: A Case for Software Address Translation on GPUs," in International Symposium on Computer Architecture (ISCA), 2016.
[15] C. J. Thompson, S. Hahn, and M. Oskin, "Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis," in International Symposium on Microarchitecture (MICRO), 2002.
[16] J. Owens, U. Kapasi, P. Mattson, B. Towles, B. Serebrin, S. Rixner, and W. Dally, "Media Processing Applications on the Imagine Stream Processor," in International Conference on Computer Design (ICCD), 2002.
[17] J. Owens, B. Khailany, B. Towles, and W. Dally, "Comparing Reyes and OpenGL on a Stream Architecture," in SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2002.
[18] S. Rixner, W. Dally, U. Kapasi, B. Khailany, A. Lopez-Lagunas, P. Mattson, and J. Owens, "A Bandwidth-Efficient Architecture for Media Processing," in International Symposium on Microarchitecture (MICRO), 1998.
[19] B. Serebrin, J. Owens, B. Khailany, P. Mattson, U. Kapasi, C. Chen, J. Namkoong, S. Crago, S. Rixner, and W. Dally, "A Stream Processor Development Platform," in International Conference on Computer Design (ICCD), 2002.
[20] Khronos Group, "OpenGL," 1992. https://www.khronos.org/opengl
[21] Microsoft, "Programming Guide for DirectX [Online]," 2016. https://msdn.microsoft.com/en-us/library/windows/desktop/ff476455(v=vs.85).aspx
[22] Khronos Group, "OpenCL," 2009. https://www.khronos.org/opencl
[23] Microsoft, "C++ AMP (C++ Accelerated Massive Parallelism) [Online]," 2016. https://msdn.microsoft.com/en-us/library/hh265137.aspx
[24] NVidia, "CUDA Toolkit Documentation," 2016. http://docs.nvidia.com/cuda
[25] M. Silberstein, B. Ford, I. Keidar, and E. Witchel, "GPUfs: Integrating File Systems with GPUs," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.
[26] S. Kim, S. Huh, X. Zhang, Y. Hu, A. Wated, E. Witchel, and M. Silberstein, "GPUnet: Networking Abstractions for GPU Programs," in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
[27] J. Owens, J. Stuart, and M. Cox, "GPU-to-CPU Callbacks," in Workshop on Unconventional High Performance Computing, 2010.
[28] S. Bergman, T. Brokhman, T. Cohen, and M. Silberstein, "SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs," in Usenix Annual Technical Conference (ATC), 2017.
[29] AMD, "Radeon Open Compute [Online]," 2017. https://rocm.github.io
[30] AMD, "ROCK syscall," 2017. https://github.com/RadeonOpenCompute/ROCK_syscall
[31] AMD, "ROCT syscall," 2017. https://github.com/RadeonOpenCompute/ROCT_syscall
[32] AMD, "HCC syscall," 2017. https://github.com/RadeonOpenCompute/HCC_syscall
[33] AMD, "Genesys syscall test," 2017. https://github.com/RadeonOpenCompute/Genesys_syscall_test
[34] R. Draves, B. Bershad, R. Rashid, and R. Dean, "Using Continuations to Implement Thread Management and Communication in Operating Systems," in Symposium on Operating Systems Principles (SOSP), 1991.
[35] O. Shivers and M. Might, "Continuations and Transducer Composition," in International Symposium on Programming Languages Design and Implementation (PLDI), 2006.
[36] AMD, "Graphics Core Next 3.0 ISA [Online]," 2013. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf
[37] Intel, "Intel Open Source HD Graphics Programmers' Reference Manual [Online]," 2014. https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-bdw-vol02b-commandreference-instructions_0_0.pdf
[38] NVidia, "NVIDIA Compute PTX: Parallel Thread Execution," 2009. https://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf
[39] PCI-SIG, "Atomic Operations [Online]," 2008. https://pcisig.com/sites/default/files/specification_documents/ECN_Atomic_Ops_080417.pdf
[40] T. Sorensen, A. F. Donaldson, M. Batty, G. Gopalakrishnan, and Z. Rakamaric, "Portable Inter-Workgroup Barrier Synchronisation for GPUs," in Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), New York, NY, USA, pp. 39-58, ACM, 2016.
[41] T. Rogers, M. O'Connor, and T. Aamodt, "Cache-Conscious Wavefront Scheduling," in International Symposium on Microarchitecture (MICRO), 2012.
[42] T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," in International Symposium on Microarchitecture (MICRO), 2013.
[43] A. ElTantawy and T. Aamodt, "Warp Scheduling for Fine-Grained Synchronization," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[44] H. Wang, F. Luo, M. Ibrahim, O. Kayiran, and A. Jog, "Efficient and Fair Multi-programming in GPUs via Pattern-Based Effective Bandwidth Management," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[45] O. Kayiran, A. Jog, M. Kandemir, and C. Das, "Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs," in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013.
[46] X. Tang, A. Pattnaik, H. Jiang, O. Kayiran, A. Jog, S. Pai, M. Ibrahim, M. T. Kandemir, and C. R. Das, "Controlled Kernel Launch for Dynamic Parallelism in GPUs," in International Symposium on High Performance Computer Architecture (HPCA), 2017.
[47] AMD, "ROCm: a New Era in Open GPU Computing [Online]," 2016. https://radeonopencompute.github.io/index.html
[48] A. Sasidharan and M. Snir, "MiniAMR - A miniapp for Adaptive Mesh Refinement," 2014. http://hdl.handle.net/2142/91046
[49] Z. Brown, "Asynchronous System Calls," in Ottawa Linux Symposium, 2007.
[50] J. A. Stuart, P. Balaji, and J. D. Owens, "Extending MPI to Accelerators," PACT 2011 Workshop Series: Architectures and Systems for Big Data, Oct. 2011.
[51] L. Oden, "Direct Communication Methods for Distributed GPUs," Dissertation Thesis, Ruprecht-Karls-Universitat Heidelberg, 2015.
[52] NVidia, "NVIDIA GPUDirect [Online]," 2013. https://developer.nvidia.com/gpudirect
[53] H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, "GPU-aware MPI on RDMA-Enabled Clusters: Design, Implementation, and Evaluation," IEEE Transactions on Parallel and Distributed Systems, vol. 25, pp. 2595-2605, Oct. 2014.
[54] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, "Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation," in Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, pp. 97-104, September 2004.
[55] IBM, "IBM Spectrum MPI," 2017. http://www-03.ibm.com/systems/spectrum-computing/products/mpi
Table II
EXAMPLES OF SYSTEM CALLS THAT REQUIRE HARDWARE CHANGES TO BE IMPLEMENTABLE ON GPUS. IN TOTAL, THIS GROUP CONSISTS OF 13% OF ALL LINUX SYSTEM CALLS; IN CONTRAST, WE BELIEVE THAT 79% OF LINUX SYSTEM CALLS ARE READILY IMPLEMENTABLE.

Type                  | Examples                                      | Reason that it is not currently implementable
capabilities          | capget, capset                                | Needs GPU thread representation in the kernel
namespace             | setns                                         | Needs GPU thread representation in the kernel
policies              | set_mempolicy                                 | Needs GPU thread representation in the kernel
thread scheduling     | sched_yield, set_cpu_affinity                 | Needs better control over the GPU scheduler
signals               | sigaction, sigsuspend, sigreturn, sigprocmask | Signals require the target thread to be paused and then resumed after the signal action has completed; GPU threads cannot be targeted, since it is currently not possible to independently set the program counters of individual threads. Executing signal actions in newly spawned threads might require freeing of GPU resources
architecture specific | ioperm                                        | Not accessible from GPU
OSes like FreeBSD and Solaris.
At the application level, implementing this array of system calls opens new domains of OS-managed services for GPUs. In Table I, we summarize previously unimplementable system calls realized and studied in this paper. These include applications that use madvise for memory management and rt_sigqueueinfo for signals. We also go beyond prior work on GPUfs by supporting filesystem services that require more flexible APIs with work-item invocation capabilities for good performance. Finally, we continue to support previously implementable system calls.
(2) Useful, but implementable only with changes to GPU hardware. Several system calls (13% of the total) seem useful for GPU code but are not easily implementable because of Linux's existing design. Consider sigsuspend/sigaction: there is no kernel representation of a GPU work-item to manage and dispatch a signal to. Additionally, there is no lightweight method to alter the GPU program counter of a work-item from the CPU kernel. One approach is for signal masks to be associated with the GPU context and for signals to be delivered as additional work-items. This works around the absence of GPU work-item representation in the kernel. However, POSIX requires threads that process signals to pause execution and resume only after the signal has been processed. Targeting the entire GPU context would mean that all GPU execution needs to halt while the work-item processing the signal executes, which goes against the parallel nature of GPU execution. Recent work has, however, shown the benefits of hardware support for dynamic kernel launch that allows on-demand spawning of kernels on the GPU without any CPU intervention [46]. Should such approaches be extended to support thread recombination, assembling multiple signal handlers into a single warp (akin to prior approaches on control flow divergence [42]), sigsuspend or sigaction may become implementable. Table II presents more examples of system calls that are currently not implementable in their original semantics.
(3) Requires extensive modification to be supported. This group (8% of the total) contains perhaps the most controversial set of system calls. At this time, we do not believe that it is worth the implementation effort to support these system calls. For example, fork necessitates cloning a copy of the executing caller's GPU state. Technically, this can be done (e.g., it is how GPGPU code is context switched with the graphics shaders), but it seems unnecessary at this time.
V DESIGN SPACE EXPLORATION
A GPU-Side Design Considerations
Invocation granularity. In assessing how best to use GPU system calls, several questions arise. The first and most important question is: how should system calls be aligned with the hierarchy of GPU execution groups? Should a GPU system call be invoked separately for each work-item, once for every work-group, or once for the entire GPU kernel?
Consider a GPU program that writes sorted integers to a single output file. One might, at first blush, invoke the write system call at each work-item. This can present correctness issues, however, because write is position-relative and requires access to a global file pointer. Without synchronizing the execution of write across work-items, the output file will be written in a non-deterministic, unsorted order.
Using different system call invocation granularities can fix this issue. One could, for example, use a memory location to temporarily buffer the sorted output. Once all work-items have updated this buffer, a single write system call at the end of the kernel can be used to write the contents of the buffer to the output file. This approach loses some benefits of the GPU's parallel resources, because the entire system call latency is exposed on the program's critical path and might not be overlapped with the execution of other work-items. Alternatively, one could use pwrite system calls instead of write system calls. Because pwrite allows programmers to specify the absolute file position where the output is to be written, per-work-item invocations present no correctness issue. However, per-work-item pwrites result in a flood of system calls, potentially harming performance.
Overly coarse kernel-grained invocations also restrict performance by reducing the possibility of overlapping system call processing with GPU execution. A compromise may be to invoke a pwrite system call per work-group, buffering
Figure 3. Work-items in a work-group (shown as a blue box) execute strongly ordered system calls.
Figure 4. Work-group invocations can be relax-ordered by removing one of the two barriers.
the sorted output of the work-items until the per-work-group buffers are fully populated. Section VII demonstrates that these decisions can lead to a 1.75× performance difference.
System call ordering semantics: When programmers invoke system calls on CPUs, they expect that all program instructions before the system call will complete execution before the system call executes. They also expect that instructions after the system call will only commence once the system call returns. We call this "strong ordering". For GPUs, however, we introduce "relaxed ordering" semantics. The notion of relaxed ordering is tied to the hierarchical execution scopes of the GPU and is needed for both correctness and performance.
Figure 3 shows a programmer-visible work-group (in blue) consisting of four wavefronts: A, B, C, and D. Each wavefront has two work-items (e.g., A0 and A1). If system calls (SysCs) are invoked per work-item, they are strongly ordered. Another approach is depicted in Figure 4, where one work-item, A0, invokes a system call on behalf of the entire work-group. Strong ordering is achieved by placing work-group barriers (Bar1, Bar2) before and after system call execution. One can remove these barriers with relaxed ordering, allowing threads in wavefronts B, C, and D to execute, overlapping with the CPU's processing of A's system call.
For correctness, we need to allow programmers to use relaxed ordering when system calls are invoked at kernel granularity (across work-groups). This is because kernels can consist of more work-items than can concurrently execute on the GPU. For strongly ordered system calls at the kernel level, all kernel work-items must finish executing pre-invocation instructions prior to invoking the system call. But all work-items cannot execute concurrently, because GPU runtimes do not preemptively context switch work-items of the same kernel; it is not always possible for all work-items to execute all instructions prior to the system call. Strong ordering at kernel granularity risks deadlocking the GPU.
At the work-group invocation granularity, relaxed ordering can improve performance by avoiding synchronization overheads and overlapping CPU-side system call processing with the execution of other work-items. The key is to remove the barriers Bar1/Bar2 in Figure 4. To do this, consider that from the application's point of view, system calls are usually producers or consumers of data. Take a consumer call like write, invoked at the work-group level. Real-world GPU programs may use multiple work-items to generate the data for the write, but instructions after the write call typically do not depend on the write's outcome. Therefore, other work-items in the group need not wait for the completion of the system call, meaning that we can remove Bar2, improving performance. Similarly, producer system calls like read typically require system calls to return before other work-items can start executing program instructions post-read, but do not necessarily require other work-items in the work-group to finish executing all instructions before the read invocation. Bar1 in Figure 4 becomes unnecessary in these cases. The same observations apply to per-kernel system call invocations, which need relaxed ordering for correctness anyway.
In summary, work-item invocations imply strong ordering. Programmers balance performance/programmability for work-group invocations by choosing strong or relaxed ordering. Finally, programmers must use relaxed ordering for kernel invocations so that the GPU does not deadlock.
Blocking versus non-blocking approaches: Most traditional CPU system calls – barring those for asynchronous I/O (e.g., aio_read, aio_write) – return only after the system call completes execution. Blocking system calls have a disproportionate impact on performance because GPUs use SIMD execution. In particular, if even a single work-item is blocked on a system call, no other work-item in the wavefront can make progress either. We find that GPUs can often benefit from non-blocking system calls that can immediately return before system call processing completes. Non-blocking system calls can therefore overlap system call processing on the CPU with useful GPU work, improving performance by as much as 30% in some of our studies (see Section VII).
The concepts of blocking/non-blocking and strong/relaxed ordering are related but orthogonal. The strictness of ordering refers to the question of when a system call can be invoked with respect to the progress of work-items within its granularity of invocation. System call blocking refers to how the return from a system call relates to the completion of its processing. Relaxed ordering and non-blocking can be combined in several useful ways. For example, consider a case where a GPU program writes to a file at work-group granularity. The execution may not depend upon the output of the write, but the programmer may want to ensure that the write successfully completes. In such a scenario, blocking writes may be invoked with weak ordering. Weak ordering permits all but one wavefront in the work-group to proceed without waiting for completion of the write (see Section VI).
Blocking invocation, however, ensures that one wavefront waits for the write to complete and can raise an alarm if the write fails. Consider another scenario, where a programmer wishes to prefetch data using read system calls but may not use the results immediately. Here, weak ordering with non-blocking invocation is likely to provide the best performance without breaking the program's semantics. In short, different combinations of blocking and ordering enable programmers to fine-tune performance and programmability tradeoffs.
B. CPU Hardware
GPUs rely on extreme parallelism for performance. This means there may be bursts of system calls that CPUs need to process. System call coalescing is one way to increase the throughput of system call processing. The idea is to collect several GPU system call requests and batch their processing on the CPU. The benefit is that CPUs can manage multiple system calls as a single unit, scheduling a single software task to process them. This reduction in task management, scheduling, and processing overheads can often boost performance (see Section VII). Coalescing must be performed judiciously, as it improves system call processing throughput at the expense of latency. It also implicitly serializes the processing of system calls within a coalesced bundle.
To allow the GPGPU programmer to balance the benefits of coalescing with its potential challenges, GENESYS accepts two parameters – a time window length within which the CPU coalesces system calls, and the maximum number of system call invocations that can be coalesced within the time window. Section VII shows that system call coalescing can improve performance by as much as 10-15%.
C. CPU-GPU Communication Hardware
Prior work implemented system calls using polling, where GPU wavefronts monitored predesignated memory locations populated by CPUs upon system call completion. But recent advances in GPU hardware enable alternate modes of CPU-GPU communication. For example, AMD GPUs now allow wavefronts to relay interrupts to CPUs and then halt execution, relinquishing SIMD hardware resources [36]. CPUs can, in turn, message the GPU to wake up halted wavefronts.
We have implemented polling and halt-resume approaches in GENESYS. With polling, if the number of memory locations that needs to be polled by the GPU exceeds its cache size, frequent cache misses lower performance. On the other hand, halt-resume has its own overheads, namely the latency to resume a halted wavefront.
We have found that polling yields better performance when system calls are invoked at coarser work-group granularities, since fewer memory locations are needed to convey information at the work-group versus work-item level. Consequently, the memory locations easily fit in the GPU's L2 data cache. When system calls are invoked per work-item, however, the
Table III. SYSTEM CONFIGURATION USED FOR OUR STUDIES

SoC: Mobile AMD FX-9800P™
CPU: 4× 2.7GHz AMD Family 21h Model 101h
Integrated GPU: 758 MHz AMD GCN 3 GPU
Memory: 16 GB Dual-Channel DDR4-1066MHz
Operating system: Fedora 26 using ROCm stack 1.6 (based on Linux 4.11)
Compiler: HCC-0.10.17166 + LLVM 5.0, C++AMP with HC extensions
Figure 5. Content of each slot in the syscall area.
Figure 6. State transition diagram for a slot in the syscall area. Green shows GPU state/actions; blue shows that of the CPU.
sheer number of such memory locations becomes so high that cache thrashing becomes an issue. In these situations, halt-resume approaches outperform polling. We quantify the impact of this in the following sections.
VI. IMPLEMENTING GENESYS
We implemented GENESYS on the system in Table III. We used an AMD FX-9800P processor with an integrated GPU and ran the open-source Radeon Open Compute (ROCm) software stack [47]. Although we use a system with an integrated GPU, GENESYS is not specific to integrated GPUs and generalizes to discrete GPUs. We modified the GPU driver and Linux kernel to enable GPU system calls. We also modified the HCC compiler to permit GPU system call invocations in C++AMP.
Invoking GPU system calls: GENESYS permits work-item, work-group, and kernel-level invocation. At work-group or kernel-level invocations, a single work-item is designated as the caller. For strongly ordered work-group invocations, we use work-group scope barriers before and after system call invocations. For relaxed ordering, a barrier is placed either before (for consumer system calls) or after (for producer calls) system call invocation.
GPU to CPU communication: GENESYS facilitates efficient GPU to CPU communication of system call requests.
GENESYS uses a preallocated shared memory syscall area to allow the GPU to convey parameters to the CPU (see Section III). The syscall area maintains one slot for each active GPU work-item. The OS and driver code can identify the desired slot by using the hardware ID of the active work-item, which is available to GPU runtimes. This hardware ID is distinct from the programmer-visible work-item ID. Each of the work-items has a unique work-item ID visible to the application programmer. At any one point in time, however, only a subset of these work-items executes (as permitted by the GPU's CU count, supported work-groups per CU, and SIMD width). The hardware ID distinguishes among these active work-items. Overall, our system uses 64 bytes per slot, totaling 1.25 MB of syscall area.
Figure 5 shows the contents of a slot – the requested system call number, the request state, system call arguments (as many as 6 in Linux), and padding to avoid false sharing. The field for arguments is also re-purposed for the return value of the system call. When a GPU program's work-item invokes a system call, it atomically updates the state of the corresponding slot from free to populating (Figure 6). If the slot is not free, system call invocation is delayed. Once the state is populating, the invoking work-item populates the slot with system call information and changes the state to ready. The work-item also adds one bit of information about whether the invocation is blocking or non-blocking. The work-item then interrupts the CPU using a scalar wavefront GPU instruction³ (s_sendmsg on AMD GPUs). For blocking invocations, the work-item either waits and polls the state of the slot, or suspends itself using halt-resume.
CPU-side system call processing: Once the GPU interrupts the CPU, system call processing commences. The interrupt handler creates a new kernel task and adds it to Linux's work-queue. This task is also passed the hardware ID of the requesting wavefront. At an expedient future point in time, an OS worker thread executes this task. The task scans the 64 syscall slots of the given hardware ID and atomically switches any ready system call requests to the processing state. The task then carries out the system call work.
A challenge is that Linux's traditional system call routines implicitly assume that they are to be executed within the context of the original process invoking the system call. Consequently, they use context-specific variables (e.g., the current variable used to identify the process context). This presents a challenge for GPU system calls, which are instead serviced purely in the context of the OS's worker thread. GENESYS overcomes this challenge in two ways – it either switches to the context of the original CPU program that invoked the GPU kernel, or it provides additional context information in the code performing system call processing.
³ Scalar wavefront instructions are part of the Graphics Core Next ISA and are executed once for the entire wavefront, rather than for each active work-item. See Chapter 4.1 in the manual [36].
The exact strategy is determined on a case-by-case basis.
GENESYS implements coalescing by waiting for a predetermined amount of time in the interrupt handler before enqueueing a task to process a system call. If multiple requests are made to the CPU during this time window, they are coalesced such that they can be handled as a single unit of work by the OS worker-thread. GENESYS uses Linux's sysfs interface to communicate the coalescing parameters.
Communicating results from the CPU to the GPU: Once the CPU worker-thread finishes processing the system call, the results are put in the field for arguments in the slot for blocking requests. Further, the thread also changes the state of the slot to finished for blocking system calls. For non-blocking invocations, the state is changed to free. The invoking GPU work-item is then re-scheduled (on hardware that supports wavefront suspension) or automatically restarted, because it was polling on this state. The work-item can consume the result and continue execution.
Other architectural design considerations: GENESYS requires two key data structures to be exchanged between GPUs and CPUs – the syscall area and, for some system calls, syscall buffers that maintain data required for system call processing. We discovered that carefully leveraging architectural support for CPU-GPU cache coherence in the context of these data structures was vital to overall performance.
Consider, for example, the syscall area. As described in Section III, every GPU work-item invoking a system call is allocated space in the syscall area. Like many GPUs, the one used as our experimental platform supports L2 data caches that are coherent with CPU caches/memory, but integrates non-coherent L1 data caches. At first blush, one may decide to manually invalidate L1 data cache lines. However, we sidestepped this issue by restricting per-work-item slots in the syscall area to individual cache lines. This design permits us to use atomic instructions to access memory – these atomic instructions, by design, force lookups of the L2 data cache and guarantee visibility of the entire cacheline, sidestepping the coherence challenges of L1 GPU data caches. We quantify the measured overheads of the atomic operations we use for GENESYS in Table IV, comparing them to the latency of a standard load operation. Through experimentation, we achieved good performance using cmp-swap atomics to claim a slot in the syscall area when GPU work-items invoke system calls, atomic-swaps to change state, and atomic-loads to poll for completion.
Unfortunately, the same approach of using atomics does not yield good performance for accesses to syscall buffers. The key issue is that syscall buffers can be large and span multiple cache lines. Using atomics here meant that we suffered the latency of several L2 data cache accesses to syscall buffers. We found that a better approach was to eschew atomics in favor of manual software L1 data cache coherence. This
Table IV. PROFILED PERFORMANCE OF GPU ATOMIC OPERATIONS

Op: cmp-swap | swap | atomic-load | load
Time (μs): 1.352 | 0.782 | 0.360 | 0.243
Figure 7. Impact of system call invocation granularity. The graphs show average and standard deviation of 10 runs.
meant, for example, that we preceded sys_write system calls with an L1 data cache flush.
VII. MICROBENCHMARK EVALUATIONS
Invocation granularity: Figure 7 quantifies performance for a microbenchmark that uses pread on files in tmpfs⁴. The x-axis plots the file size, and the y-axis shows read time, with lower values being better. Within each cluster, we separate runtimes for different pread invocation granularities.
Figure 7(left) shows that work-item invocation granularities tend to perform worst. This is not surprising, as it is the finest granularity of system call invocation and leads to a flood of individual system calls that overwhelm the CPU. On the other end of the granularity spectrum, kernel-level invocation is also challenging, as it generates a single system call and fails to leverage any potential parallelism in the processing of system call requests. This is particularly pronounced at large file sizes (e.g., 2GB). A good compromise is to use work-group invocation granularities. It does not overwhelm the CPU with system call requests, while still being able to exploit parallelism in servicing system calls.
When using work-group invocation, an important question is how many work-items should constitute a work-group. While Figure 7(left) uses 64 work-items in a work-group, Figure 7(right) quantifies the performance impact of pread system calls as we vary work-group sizes from 64 (wg64) to 1024 (wg1024) work-items. In general, larger work-group sizes enable better performance, as there are fewer unique system calls necessary to handle the same amount of work.
Blocking and ordering strategies: To quantify the impact of blocking/non-blocking with strong/relaxed ordering, we
⁴ Tmpfs is a filesystem without permanent backing storage. In other words, all structures and data are memory-resident.
Figure 8. Performance implications of system call blocking and ordering semantics. The graph shows average and standard deviation of 80 runs.
designed a GPU microbenchmark that performs block permutation on an array, similar to the permutation steps performed in DES encryption. The input data array is preloaded with random values and divided into 8KB blocks. Work-groups, each executing 1024 work-items, independently permute blocks. The results are written to a file using pwrite at work-group granularity. The pwrite system calls for one block of data are overlapped with permutations on other blocks of data. To vary the amount of computation per system call, we permute multiple times before writing the result.
Figure 8 quantifies the impact of using blocking versus non-blocking system calls with strong and weak ordering. The x-axis plots the number of permutation iterations performed on each block by each work-group before writing the results. The y-axis plots execution time for one permutation (lower is better). Figure 8 shows that strongly ordered blocking invocations (strong-block) hurt performance. This is expected, as they require work-group scoped barriers to be placed before and after pwrite invocations. The GPU's hardware resources for work-groups are stalled, waiting for the pwrite to complete. Not only does the inability to overlap GPU parallelism hurt strongly ordered blocking performance, it also means that GPU performance is heavily influenced by CPU-side performance vagaries like synchronization overheads, servicing other processes, etc. This is particularly true at iteration counts where system call overheads dominate GPU-side computation – below 15 compute iterations. Even when the application becomes GPU-compute bound, performance remains non-ideal.
Figure 8 shows that when pwrite is invoked in a non-blocking manner (with strong ordering), performance improves. This is because non-blocking pwrites permit the GPU to end the invoking work-group, freeing the GPU resources it was occupying. CPU-side pwrite processing can overlap with the execution of another work-group permuting a different block of data. Figure 8 shows that latencies generally drop by 30% compared to blocking calls at low iteration counts. At higher iteration counts (beyond 16), these benefits diminish, because the latency to perform repeated
Figure 9. Impact of polling on memory contention (AMD FX-9800P, DDR4-1066MHz).
permutations dominates any system call processing time.
For relaxed ordering with blocking (weak-block), the post-system-call work-group-scope barrier is eliminated. One out of every 16 wavefronts⁵ in the work-group waits for the blocking system call while the others exit, freeing GPU resources. The GPU can use these freed resources to run other work-items to hide CPU-side system call latency. Consequently, the performance trends follow those of strong-non-block, with minor differences in performance arising from differences in GPU scheduling of the work-groups for execution. Finally, Figure 8 shows that system call latency is best hidden using relaxed and non-blocking approaches (weak-non-block).
Polling/halt-resume and memory contention: As previously discussed, polling at the work-item invocation granularity leads to memory reads of several thousands of per-work-item memory locations. We quantify the resulting memory contention in Figure 9, which showcases how the throughput of CPU accesses decreases as the number of polled GPU cache lines increases. Once the number of polled memory locations outstrips the GPU's L2 cache capacity (roughly 4096 cache lines on our platform), the GPU polls locations spilled to the DRAM. This creates contention on the memory controllers shared between GPUs and CPUs. We advocate using GENESYS with halt-resume approaches at such granularities of system call invocation.
Interrupt coalescing: Figure 10 shows the performance impact of coalescing system calls. We use a microbenchmark that invokes pread. We read data from files of different sizes using a constant number of work-items; more data is read per pread system call from the larger files. The x-axis shows the amounts of data read and quantifies the latency per requested byte. We present two bars for each point on the x-axis, illustrating the average time needed to read one byte with the system call in the absence of coalescing and when up to eight system calls are coalesced. Coalescing is most beneficial when small amounts of data are read. When more data is read, the call takes longer, and the overhead reduction is negligible compared to the significantly longer time to process the system call.
⁵ Each wavefront has 64 work-items. Thus, a 1024 work-item work-group has 16 wavefronts.
Figure 10. Implications of system call coalescing. The graph shows average and standard deviation of 20 runs.
VIII. CASE STUDIES
A. Memory Workload
We now assess the end-to-end performance of workloads that use GENESYS. Our first application requires memory management system calls. We studied miniAMR [48] and used the madvise system call directly from the GPU to better manage memory. MiniAMR performs 3D stencil calculations using adaptive mesh refinement and is a candidate for memory management because it varies its memory needs in a data-model-dependent manner. For example, when simulating regions experiencing turbulence, miniAMR needs to model with higher resolution. However, if lower resolution is possible without compromising simulation accuracy, miniAMR reduces memory and computation usage, making it possible to free memory. While relinquishing excess memory in this way is not possible in traditional GPU programs without explicitly dividing the offload into multiple kernels interspersed with CPU code (see Figure 1), GENESYS permits this with direct GPU madvise invocations. We invoke madvise at work-group granularity with non-blocking and weak ordering. We leverage GENESYS' generality by also using getrusage to read the application's resident set size (RSS). When the RSS exceeds a watermark, madvise relinquishes memory.
Figure 11. Memory footprint of miniAMR, using getrusage and madvise to hint at unused memory.
We execute miniAMR with a dataset of 4.1GB – just beyond the hard limit we put on physical memory available to the GPU. Without using madvise, memory swapping increases so dramatically that it triggers GPU timeouts, causing the existing GPU device driver to terminate the application. Because of this, there is no baseline to compare to, as the baseline simply does not complete.
Figure 11 shows two results: one for a 3GB RSS watermark, and one for 4GB. Not only does GENESYS enable miniAMR to complete, it also permits the programmer to balance memory usage and performance. While rss-3GB lowers memory utilization, it also worsens runtime compared to rss-4GB. This performance gap is expected: the more memory is released, the greater the likelihood that the GPU program suffers from page faults when the memory is touched again in the future, and the more frequently the madvise system call is invoked. Overall, Figure 11 illustrates that GENESYS allows programmers to perform memory allocation to trade memory usage and performance on GPUs, analogous to CPUs.
B. Workload Using Signals
Figure 12. Runtime of CPU-GPU map-reduce workload.
GENESYS also enables system calls that permit the GPU to send asynchronous notifications to the CPU. This is useful in many scenarios. We study one such scenario and implement a map-reduce application called signal-search. The application runs in two phases. The first phase performs parallel lookup in a
data array, while the second phase processes blocks of retrieved data and computes sha512 checksums on them. The first phase is a good fit for GPUs, since the highly parallel lookup fits their execution model, while the second phase is more appropriate for CPUs, many of which support performance acceleration of SHA checksums via dedicated instructions.
Without support for signal invocation on GPUs, programmers would launch a kernel with the entire parallel lookup on the GPU and wait for it to conclude before proceeding with the sha512 checksums. GENESYS overcomes this restriction and permits a heterogeneous version of this code, where GPU work-groups can emit signals using rt_sigqueueinfo to the CPU, indicating per-block completions of the parallel search. As a result, the CPU can begin processing these blocks immediately, permitting overlap with GPU execution.
Operationally, rt_sigqueueinfo is a generic signal system call that allows the caller to fill a siginfo structure that is passed along with the signal. In our workload, we find that work-group level invocations perform best, so the GPU passes the identifier of this work-group through the siginfo structure. Figure 12 shows that using work-group invocation granularity and non-blocking invocation results in a roughly 14% performance speedup over the baseline.
C. Storage Workloads
We have also studied how to use GENESYS to support storage workloads. In some cases, GENESYS permits the implementation of workloads supported by GPUfs, but in more efficient ways.
Figure 13. a) Standard grep and OpenMP grep versus work-item/work-group invocations with GENESYS; b) comparing the performance of CPU, GPU with no system calls, and GENESYS implementations of wordcount. Graphs show average and standard deviation of 10 runs.
Supporting storage workloads more efficiently than GPUfs: We implement a workload that takes as input a list of words and a list of files. It then performs grep -F -l on the GPU, identifying which files contain any of the words on the list. As soon as these files are found, they are printed to the console. This workload cannot be supported by GPUfs without significant code refactoring because of its use of custom APIs. Instead, since GENESYS naturally adheres to the UNIX "everything is a file" philosophy, porting grep to GPUs requires only a few hours of programming.
Figure 13 shows the results of our GPU grep experiments. We compare a standard CPU implementation, a parallelized OpenMP CPU implementation, and two GPU implementations with GENESYS, with non-blocking system calls invoked at work-item (WI) and work-group (WG) granularities. Furthermore, since work-item invocations can sometimes achieve better performance using halt-resume (versus work-group and kernel invocations, which always achieve better performance with polling), we separate results for WI-polling and WI-halt-resume. GENESYS enables our GPU grep implementation to print output lines to the console or files using standard OS output. GENESYS achieves 20-23× speedups over an OpenMP version of grep.
Figure 13 also shows that GENESYS' flexibility with invocation granularity, blocking/non-blocking, and strong/relaxed ordering can boost performance (see Section V). For our grep example, because a file only needs to be searched until the first instance of a matching word, a work-item can immediately invoke a write of the filename rather than waiting for all matching files. We find that WI-halt-resume outperforms both WG and WI-polling by roughly 3-4%.
Figure 14. Wordcount I/O and CPU utilization reading from an SSD. Graphs show average and standard deviation of 10 runs.
Workload from prior work: GENESYS also supports a version of the same workload that is evaluated in the original GPUfs work: traditional word count using open, read, and close system calls. Figure 13 shows our results. We compare the performance of a parallelized CPU version of the workload with OpenMP, a GPU version of the workload with no system calls, and a GPU version with GENESYS. Both CPU and GPU workloads are configured to search for occurrences of 64 strings. We found that with GENESYS, these system calls were best invoked at work-group granularity with blocking and weak-ordering semantics. All results are collected on a system with an SSD.
We found that GENESYS achieves nearly 6× performance improvement over the CPU version. Without system call support, the GPU version is far worse than the CPU version. Figure 14 sheds light on these benefits. We plot traces for CPU and GPU utilization and disk I/O throughput. GENESYS extracts much higher throughput from the underlying storage device (up to 170MB/s, compared to the CPU's 30MB/s). Offloading search string processing to the GPU frees up the CPU to process system calls effectively. The change in CPU utilization between the GPU workload and CPU workload reveals this trend. In addition, we found that the GPU's ability to launch more concurrent I/O requests enabled the I/O scheduler to make better scheduling decisions.
D. Network Workloads
We have studied the benefits of GENESYS for network I/O in the context of memcached. While this could technically be implemented using GPUnet, the performance implications are unclear because the original GPUnet paper used APIs that extracted performance using dedicated RDMA hardware. We make no such assumptions and focus on the two core commands, SET and GET. SET stores a key-value pair, and GET retrieves the value associated with a key if it is present. Our implementation supports a binary UDP memcached version with a fixed-size hash table as back-end storage. The hash table is shared between the CPU and GPU, and its bucket size and count are configurable. Further, this memcached implementation enables concurrent operations: CPUs can handle SETs and GETs, while the GPU supports only GETs. Our GPU implementation parallelizes the hash computation, bucket lookup, and data copy. We use sendto and recvfrom system calls for UDP network access. These system calls are invoked at work-group granularity with blocking and weak ordering, as this performs best.

Figure 15. Latency and throughput of memcached. Graphs show average and standard deviation of 20 runs.
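To make the SET/GET protocol above concrete, here is a minimal loopback sketch of our own (not the paper's implementation, and the wire format is ad hoc): a tiny UDP key-value server serviced with the same sendto/recvfrom system calls that GENESYS invokes from the GPU.

```python
# Minimal sketch (ours, not the paper's memcached) of a UDP key-value
# server handling SET/GET via the sendto/recvfrom system calls.
import socket, threading

def serve_once(sock, table):
    """Handle exactly one UDP request: b'SET k v' stores, b'GET k' fetches."""
    data, addr = sock.recvfrom(2048)          # recvfrom system call
    parts = data.split(b" ", 2)
    if parts[0] == b"SET" and len(parts) == 3:
        table[parts[1]] = parts[2]
        sock.sendto(b"STORED", addr)          # sendto system call
    elif parts[0] == b"GET" and len(parts) == 2:
        sock.sendto(table.get(parts[1], b"MISS"), addr)
    else:
        sock.sendto(b"ERROR", addr)

def kv_roundtrip(requests):
    """Run a loopback server thread and return the reply to each request."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))
    port = server.getsockname()[1]
    table = {}
    t = threading.Thread(target=lambda: [serve_once(server, table)
                                         for _ in requests])
    t.start()
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    replies = []
    for req in requests:
        client.sendto(req, ("127.0.0.1", port))
        replies.append(client.recvfrom(2048)[0])
    t.join()
    server.close()
    client.close()
    return replies
```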
Figure 15 compares the performance of a CPU version of this workload with GPU versions using GENESYS. GPUs accelerate memcached by parallelizing lookups on buckets with more elements. For example, Figure 15 shows speedups when there are 1024 elements per bucket (with 1KB data size). Without system calls, GPU performance lags behind CPU performance. However, GENESYS achieves 30-40% latency and throughput benefits over not just CPU versions but also GPU versions without direct system calls.
E. Device Control
Finally, we also used GENESYS to implement ioctl system calls. As an example, we used ioctl to query and control framebuffer settings. The implementation is straightforward: the GPU opens /dev/fb0 and issues a series of ioctl commands to query and set settings of the active framebuffer. It then proceeds to mmap the framebuffer memory and fill it with data from a previously mmap-ed raster image. This results in the image being displayed on the computer screen. While not a critical GPGPU application, this ioctl example demonstrates the generality and flexibility of the OS interfaces implemented by GENESYS.
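The query step of this sequence can be sketched from the host side. The FBIOGET_VSCREENINFO request code and the leading field layout of struct fb_var_screeninfo come from Linux's &lt;linux/fb.h&gt;; everything else here is our own illustrative helper, and the live ioctl is kept in a separate function since /dev/fb0 may not exist on a given machine.

```python
# Sketch of the ioctl pattern described above (host-side analogue, not the
# GPU code). FBIOGET_VSCREENINFO and the fb_var_screeninfo prefix layout
# are from <linux/fb.h>; decode_var_screeninfo only reads the first fields.
import os, struct, fcntl

FBIOGET_VSCREENINFO = 0x4600  # request code from <linux/fb.h>

def decode_var_screeninfo(raw: bytes):
    """Return (xres, yres, bits_per_pixel) from the start of struct
    fb_var_screeninfo: xres, yres, xres_virtual, yres_virtual, xoffset,
    yoffset, bits_per_pixel -- all 32-bit unsigned integers."""
    xres, yres, _, _, _, _, bpp = struct.unpack_from("<7I", raw)
    return xres, yres, bpp

def framebuffer_geometry(path="/dev/fb0"):
    """Query the active framebuffer, mirroring the ioctl sequence above.
    Requires a real framebuffer device, so it is not exercised in tests."""
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = bytearray(160)  # sizeof(struct fb_var_screeninfo)
        fcntl.ioctl(fd, FBIOGET_VSCREENINFO, buf)
        return decode_var_screeninfo(bytes(buf))
    finally:
        os.close(fd)
```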
IX. DISCUSSION
Asynchronous system call handling. GENESYS enqueues the GPU system call's kernel task and processes it outside of the interrupt context. We do this because Linux is designed such that few operations can be processed in an interrupt handler. A potential concern with this design, however, is that it defers system call processing, potentially past the end of the lifetime of the GPU thread, and potentially past that of the process that created the GPU thread itself. It is an example of a more general problem with asynchronous system calls [49].
Figure 16. Raster image copied to the framebuffer by the GPU.
Our solution is to provide a new function call, invoked by the CPU, that ensures all GPU system calls have completed before the termination of the process.
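One plausible construction of such a drain call (our own sketch, not the GENESYS API) is a counter of in-flight GPU system calls paired with a condition variable: the CPU blocks in the drain call until the counter reaches zero.

```python
# Sketch (assumption: our own construction, not the GENESYS API) of the
# drain call described above: a counter of in-flight GPU system calls plus
# a condition variable lets the CPU block until every call has completed.
import threading

class SyscallTracker:
    def __init__(self):
        self._cv = threading.Condition()
        self._inflight = 0

    def begin(self):                    # GPU-side proxy: call issued
        with self._cv:
            self._inflight += 1

    def end(self):                      # CPU worker: processing finished
        with self._cv:
            self._inflight -= 1
            if self._inflight == 0:
                self._cv.notify_all()

    def drain(self, timeout=None):
        """The new CPU-invoked call: block until all GPU system calls in
        flight have completed, so the process may safely terminate."""
        with self._cv:
            return self._cv.wait_for(lambda: self._inflight == 0, timeout)
```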
Related work. Beyond work already discussed [25, 26, 28], the latest generations of C++AMP [23] and CUDA [24] provide access to the memory allocator. These efforts use a user-mode service thread on the CPU to proxy requests from the GPU [27]. Like GENESYS, system call requests are placed in a shared queue by the GPU. From there, however, the designs differ: their user-mode thread polls this shared queue and "thunks" the request to the libc runtime or OS, incurring added overhead for entering and leaving the OS kernel.
Some studies provide network access to GPU code [26, 50-52]. NVIDIA provides GPUDirect [52], used by several MPI libraries [53-55], which allows the NIC to bypass main memory and communicate directly with memory on the GPU itself. GPUDirect does not provide a programming interface for GPU-side code; the CPU must initiate communication with the NIC. Oden exposed the memory-mapped control interface of the NIC to the GPU, and thereby allowed the GPU to communicate directly with the NIC [51]. This low-level interface, however, lacks the benefits of a traditional OS interface (e.g., protection, sockets, TCP).
X. CONCLUSIONS
We shed light on research questions fundamental to the idea of accessing OS services from accelerators by realizing an interface for generic POSIX system call support on GPUs. Enabling such support requires subtle changes to existing kernels. In particular, traditional OSes assume that system call processing occurs in the same context as the invoking thread, and this needs to change for accelerators. We have released GENESYS to make these benefits accessible for broader research on GPUs.
XI. ACKNOWLEDGMENTS
We thank the National Science Foundation, which partially supported this work through grants 1253700 and 1337147. We thank Guilherme Cox and Mark Silberstein for their feedback. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a registered trademark of PCI-SIG Corporation. ARM is a registered trademark of ARM Limited. OpenCL is a trademark of Apple Inc. used by permission by Khronos. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. (c) 2018 Advanced Micro Devices, Inc. All rights reserved.
REFERENCES
[1] S. Mittal and J. Vetter, "A Survey of CPU-GPU Heterogeneous Computing Techniques," in ACM Computing Surveys, 2015.
[2] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. Lefohn, and T. Purcell, "A Survey of General-Purpose Computation on Graphics Hardware," in Computer Graphics Forum, Vol. 26, Num. 1, pp. 80-113, 2007.
[3] D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006.
[4] J. Vesely, A. Basu, M. Oskin, G. Loh, and A. Bhattacharjee, "Observations and Opportunities in Architecting Shared Virtual Memory for Heterogeneous Systems," in International Symposium on Performance Analysis of Systems and Software (ISPASS), 2016.
[5] J. Obert, J. V. Waveran, and G. Sellers, "Virtual Texturing in Software and Hardware," in SIGGRAPH Courses, 2012.
[6] G. Cox and A. Bhattacharjee, "Efficient Address Translation with Multiple Page Sizes," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017.
[7] L. Olson, J. Power, M. Hill, and D. Wood, "Border Control: Sandboxing Accelerators," in International Symposium on Microarchitecture (MICRO), 2015.
[8] B. Pichai, L. Hsu, and A. Bhattacharjee, "Architectural Support for Address Translation on GPUs," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014.
[9] J. Power, M. Hill, and D. Wood, "Supporting x86-64 Address Translation for 100s of GPU Lanes," in International Symposium on High Performance Computer Architecture (HPCA), 2014.
[10] N. Agarwal, D. Nellans, E. Ebrahimi, T. Wenisch, J. Danskin, and S. Keckler, "Selective GPU Caches to Eliminate CPU-GPU HW Cache Coherence," in International Symposium on High Performance Computer Architecture (HPCA), 2016.
[11] HSA Foundation, "HSA Programmer's Reference Manual [Online]," 2015. http://www.hsafoundation.com/?ddownload=4945
[12] G. Kyriazis, "Heterogeneous System Architecture: A Technical Review," 2012.
[13] J. Power, A. Basu, J. Gu, S. Puthoor, B. Beckmann, M. Hill, S. Reinhardt, and D. Wood, "Heterogeneous System Coherence for Integrated CPU-GPU Systems," in International Symposium on Microarchitecture (MICRO), 2013.
[14] S. Sahar, S. Bergman, and M. Silberstein, "Active Pointers: A Case for Software Address Translation on GPUs," in International Symposium on Computer Architecture (ISCA), 2016.
[15] C. J. Thompson, S. Hahn, and M. Oskin, "Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis," in International Symposium on Microarchitecture (MICRO), 2002.
[16] J. Owens, U. Kapasi, P. Mattson, B. Towles, B. Serebrin, S. Rixner, and W. Dally, "Media Processing Applications on the Imagine Stream Processor," in International Conference on Computer Design (ICCD), 2002.
[17] J. Owens, B. Khailany, B. Towles, and W. Dally, "Comparing Reyes and OpenGL on a Stream Architecture," in SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2002.
[18] S. Rixner, W. Dally, U. Kapasi, B. Khailany, A. Lopez-Lagunas, P. Mattson, and J. Owens, "A Bandwidth-Efficient Architecture for Media Processing," in International Symposium on Microarchitecture (MICRO), 1998.
[19] B. Serebrin, J. Owens, B. Khailany, P. Mattson, U. Kapasi, C. Chen, J. Namkoong, S. Crago, S. Rixner, and W. Dally, "A Stream Processor Development Platform," in International Conference on Computer Design (ICCD), 2002.
[20] Khronos Group, "OpenGL," 1992. https://www.khronos.org/opengl
[21] Microsoft, "Programming Guide for DirectX [Online]," 2016. https://msdn.microsoft.com/en-us/library/windows/desktop/ff476455(v=vs.85).aspx
[22] Khronos Group, "OpenCL," 2009. https://www.khronos.org/opencl
[23] Microsoft, "C++ AMP (C++ Accelerated Massive Parallelism) [Online]," 2016. https://msdn.microsoft.com/en-us/library/hh265137.aspx
[24] NVIDIA, "CUDA Toolkit Documentation," 2016. http://docs.nvidia.com/cuda
[25] M. Silberstein, B. Ford, I. Keidar, and E. Witchel, "GPUfs: Integrating File Systems with GPUs," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.
[26] S. Kim, S. Huh, X. Zhang, Y. Hu, A. Wated, E. Witchel, and M. Silberstein, "GPUnet: Networking Abstractions for GPU Programs," in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
[27] J. Owens, J. Stuart, and M. Cox, "GPU-to-CPU Callbacks," in Workshop on Unconventional High Performance Computing, 2010.
[28] S. Bergman, T. Brokhman, T. Cohen, and M. Silberstein, "SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs," in USENIX Annual Technical Conference (ATC), 2017.
[29] AMD, "Radeon Open Compute [Online]," 2017. https://rocm.github.io
[30] AMD, "ROCK_syscall," 2017. https://github.com/RadeonOpenCompute/ROCK_syscall
[31] AMD, "ROCT_syscall," 2017. https://github.com/RadeonOpenCompute/ROCT_syscall
[32] AMD, "HCC_syscall," 2017. https://github.com/RadeonOpenCompute/HCC_syscall
[33] AMD, "Genesys_syscall_test," 2017. https://github.com/RadeonOpenCompute/Genesys_syscall_test
[34] R. Draves, B. Bershad, R. Rashid, and R. Dean, "Using Continuations to Implement Thread Management and Communication in Operating Systems," in Symposium on Operating Systems Principles (SOSP), 1991.
[35] O. Shivers and M. Might, "Continuations and Transducer Composition," in International Symposium on Programming Languages Design and Implementation (PLDI), 2006.
[36] AMD, "Graphics Core Next 3.0 ISA [Online]," 2013. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf
[37] Intel, "Intel Open Source HD Graphics Programmers' Reference Manual [Online]," 2014. https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-bdw-vol02b-commandreference-instructions_0_0.pdf
[38] NVIDIA, "NVIDIA Compute PTX: Parallel Thread Execution," 2009. https://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf
[39] PCI-SIG, "Atomic Operations [Online]," 2008. https://pcisig.com/sites/default/files/specification_documents/ECN_Atomic_Ops_080417.pdf
[40] T. Sorensen, A. F. Donaldson, M. Batty, G. Gopalakrishnan, and Z. Rakamaric, "Portable Inter-Workgroup Barrier Synchronisation for GPUs," in ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), pp. 39-58, 2016.
[41] T. Rogers, M. O'Connor, and T. Aamodt, "Cache-Conscious Wavefront Scheduling," in International Symposium on Microarchitecture (MICRO), 2012.
[42] T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," in International Symposium on Microarchitecture (MICRO), 2013.
[43] A. ElTantawy and T. Aamodt, "Warp Scheduling for Fine-Grained Synchronization," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[44] H. Wang, F. Luo, M. Ibrahim, O. Kayiran, and A. Jog, "Efficient and Fair Multi-programming in GPUs via Pattern-Based Effective Bandwidth Management," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[45] O. Kayiran, A. Jog, M. Kandemir, and C. Das, "Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs," in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013.
[46] X. Tang, A. Pattnaik, H. Jiang, O. Kayiran, A. Jog, S. Pai, M. Ibrahim, M. T. Kandemir, and C. R. Das, "Controlled Kernel Launch for Dynamic Parallelism in GPUs," in International Symposium on High Performance Computer Architecture (HPCA), 2017.
[47] AMD, "ROCm: a New Era in Open GPU Computing [Online]," 2016. https://radeonopencompute.github.io/index.html
[48] A. Sasidharan and M. Snir, "MiniAMR - A miniapp for Adaptive Mesh Refinement," 2014. http://hdl.handle.net/2142/91046
[49] Z. Brown, "Asynchronous System Calls," in Ottawa Linux Symposium, 2007.
[50] J. A. Stuart, P. Balaji, and J. D. Owens, "Extending MPI to Accelerators," in PACT 2011 Workshop Series: Architectures and Systems for Big Data, Oct. 2011.
[51] L. Oden, "Direct Communication Methods for Distributed GPUs," Dissertation Thesis, Ruprecht-Karls-Universitat Heidelberg, 2015.
[52] NVIDIA, "NVIDIA GPUDirect [Online]," 2013. https://developer.nvidia.com/gpudirect
[53] H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, "GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation," IEEE Transactions on Parallel and Distributed Systems, vol. 25, pp. 2595-2605, Oct. 2014.
[54] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, "Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation," in Proceedings, 11th European PVM/MPI Users' Group Meeting (Budapest, Hungary), pp. 97-104, September 2004.
[55] IBM, "IBM Spectrum MPI," 2017. http://www-03.ibm.com/systems/spectrum-computing/products/mpi
Figure 3. Work-items in a work-group (shown as a blue box) execute strongly ordered system calls.

Figure 4. Work-group invocations can be relax-ordered by removing one of the two barriers.
the sorted output of the work-items until the per-work-group buffers are fully populated. Section VII demonstrates that these decisions can lead to a 1.75x performance difference.
System call ordering semantics. When programmers invoke system calls on CPUs, they expect that all program instructions before the system call will complete execution before the system call executes. They also expect that instructions after the system call will only commence once the system call returns. We call this "strong ordering." For GPUs, however, we introduce "relaxed ordering" semantics. The notion of relaxed ordering is tied to the hierarchical execution scopes of the GPU and is needed for both correctness and performance.
Figure 3 shows a programmer-visible work-group (in blue) consisting of four wavefronts: A, B, C, and D. Each wavefront has two work-items (e.g., A0 and A1). If system calls (SysCs) are invoked per work-item, they are strongly ordered. Another approach is depicted in Figure 4, where one work-item, A0, invokes a system call on behalf of the entire work-group. Strong ordering is achieved by placing work-group barriers (Bar1, Bar2) before and after system call execution. One can remove these barriers with relaxed ordering, allowing threads in wavefronts B, C, and D to execute, overlapping with the CPU's processing of A's system call.
For correctness, we need to allow programmers to use relaxed ordering when system calls are invoked at kernel granularity (across work-groups). This is because kernels can consist of more work-items than can concurrently execute on the GPU. For strongly ordered system calls at the kernel level, all kernel work-items must finish executing pre-invocation instructions prior to invoking the system call. But all work-items cannot execute concurrently, because GPU runtimes do not preemptively context switch work-items of the same kernel. It is not always possible for all work-items to execute all instructions prior to the system call. Strong ordering at kernel granularity therefore risks deadlocking the GPU.
At the work-group invocation granularity, relaxed ordering can improve performance by avoiding synchronization overheads and overlapping CPU-side system call processing with the execution of other work-items. The key is to remove the barriers Bar1/Bar2 in Figure 4. To do this, consider that from the application point of view, system calls are usually producers or consumers of data. Take a consumer call like write, invoked at the work-group level. Real-world GPU programs may use multiple work-items to generate the data for the write, but instructions after the write call typically do not depend on the write's outcome. Therefore, other work-items in the group need not wait for the completion of the system call, meaning that we can remove Bar2, improving performance. Similarly, producer system calls like read typically require the system call to return before other work-items can start executing program instructions post-read, but do not necessarily require other work-items in the work-group to finish executing all instructions before the read invocation. Bar1 in Figure 4 becomes unnecessary in these cases. The same observations apply to per-kernel system call invocations, which need relaxed ordering for correctness anyway.
In summary, work-item invocations imply strong ordering. Programmers balance performance and programmability for work-group invocations by choosing strong or relaxed ordering. Finally, programmers must use relaxed ordering for kernel invocations so that the GPU does not deadlock.
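The barrier-placement rules above can be encoded compactly. This is our own summary of Figures 3 and 4 and the text, not GENESYS source: strong ordering brackets the call with two work-group barriers, while relaxed ordering keeps only the barrier a producer (read-like) or consumer (write-like) call actually needs.

```python
# Sketch of the barrier-placement rules summarized above (our encoding of
# Figures 3 and 4, not GENESYS source code).
def barriers_for(granularity: str, ordering: str, call_kind: str):
    """Return (barrier_before, barrier_after) for a system call.
    granularity: 'work-item' | 'work-group' | 'kernel'
    ordering:    'strong' | 'relaxed'
    call_kind:   'producer' (e.g., read) | 'consumer' (e.g., write)
    """
    if granularity == "work-item":
        return (False, False)   # each work-item orders only itself
    if granularity == "kernel" and ordering == "strong":
        # Not all work-items can run concurrently; waiting for them would
        # hang, so strong ordering at kernel granularity is disallowed.
        raise ValueError("strong ordering at kernel granularity risks deadlock")
    if ordering == "strong":
        return (True, True)     # Bar1 and Bar2 (Figure 3)
    # Relaxed ordering: a consumer call needs its input data ready (keep
    # Bar1, drop Bar2); a producer call must return before dependents run
    # (drop Bar1, keep Bar2) -- Figure 4.
    return (False, True) if call_kind == "producer" else (True, False)
```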
Blocking versus non-blocking approaches. Most traditional CPU system calls, barring those for asynchronous I/O (e.g., aio_read, aio_write), return only after the system call completes execution. Blocking system calls have a disproportionate impact on performance because GPUs use SIMD execution. In particular, if even a single work-item is blocked on a system call, no other work-item in the wavefront can make progress either. We find that GPUs can often benefit from non-blocking system calls that can immediately return before system call processing completes. Non-blocking system calls can therefore overlap system call processing on the CPU with useful GPU work, improving performance by as much as 30% in some of our studies (see Section VII).
The concepts of blocking/non-blocking and strong/relaxed ordering are related but orthogonal. The strictness of ordering refers to the question of when a system call can be invoked with respect to the progress of work-items within its granularity of invocation. System call blocking refers to how the return from a system call relates to the completion of its processing. Relaxed ordering and non-blocking can be combined in several useful ways. For example, consider a case where a GPU program writes to a file at work-group granularity. The execution may not depend upon the output of the write, but the programmer may want to ensure that the write successfully completes. In such a scenario, blocking writes may be invoked with weak ordering. Weak ordering permits all but one wavefront in the work-group to proceed without waiting for completion of the write (see Section VI).
Blocking invocation, however, ensures that one wavefront waits for the write to complete and can raise an alarm if the write fails. Consider another scenario, where a programmer wishes to prefetch data using read system calls but may not use the results immediately. Here, weak ordering with non-blocking invocation is likely to provide the best performance without breaking the program's semantics. In short, different combinations of blocking and ordering enable programmers to fine-tune performance and programmability tradeoffs.
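The benefit of non-blocking invocation can be illustrated with a small host-side analogue (ours, with hypothetical names): the "system call" is handed to a CPU-side worker and the caller overlaps its own compute, reaping the result later, much as a non-blocking GPU call frees the wavefront to continue.

```python
# Sketch illustrating why non-blocking invocation helps: the "system call"
# is handed to a CPU-side worker pool while the caller overlaps its own
# compute, collecting results afterwards. All names here are ours.
from concurrent.futures import ThreadPoolExecutor
import time

def process_syscall(payload):
    time.sleep(0.05)          # stand-in for CPU-side system call work
    return len(payload)

def overlapped(payloads):
    """Non-blocking pattern: issue all calls, compute, then reap results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(process_syscall, p) for p in payloads]
        local = sum(range(1000))            # useful work overlaps the calls
        return local, [f.result() for f in futures]
```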
B. CPU Hardware
GPUs rely on extreme parallelism for performance. This means there may be bursts of system calls that CPUs need to process. System call coalescing is one way to increase the throughput of system call processing. The idea is to collect several GPU system call requests and batch their processing on the CPU. The benefit is that CPUs can manage multiple system calls as a single unit, scheduling a single software task to process them. This reduction in task management, scheduling, and processing overheads can often boost performance (see Section VII). Coalescing must be performed judiciously, as it improves system call processing throughput at the expense of latency. It also implicitly serializes the processing of system calls within a coalesced bundle.
To allow the GPGPU programmer to balance the benefits of coalescing with its potential challenges, GENESYS accepts two parameters: a time window length within which the CPU coalesces system calls, and the maximum number of system call invocations that can be coalesced within the time window. Section VII shows that system call coalescing can improve performance by as much as 10-15%.
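The effect of these two knobs can be sketched as a batching policy. This is our own model, not GENESYS code: requests arriving within `window` seconds of the first request in a batch are grouped, up to `max_batch` per batch; timestamps are passed in explicitly so the policy is easy to test.

```python
# Sketch (our model, not GENESYS code) of the two coalescing knobs:
# a time window and a maximum batch size.
def coalesce(arrival_times, window, max_batch):
    """Group sorted arrival timestamps into batches; return batch sizes."""
    batches, start, count = [], None, 0
    for t in arrival_times:
        if count == 0:
            start, count = t, 1           # first request opens a batch
        elif t - start <= window and count < max_batch:
            count += 1                    # coalesce into the open batch
        else:
            batches.append(count)         # window expired or batch full
            start, count = t, 1
    if count:
        batches.append(count)
    return batches
```

A larger `window` or `max_batch` yields fewer, bigger batches (higher throughput, higher latency), matching the tradeoff described above.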
C. CPU-GPU Communication Hardware
Prior work implemented system calls using polling, where GPU wavefronts monitored predesignated memory locations populated by CPUs upon system call completion. But recent advances in GPU hardware enable alternate modes of CPU-GPU communication. For example, AMD GPUs now allow wavefronts to relay interrupts to CPUs and then halt execution, relinquishing SIMD hardware resources [36]. CPUs can in turn message the GPU to wake up halted wavefronts.
We have implemented polling and halt-resume approaches in GENESYS. With polling, if the number of memory locations that needs to be polled by the GPU exceeds its cache size, frequent cache misses lower performance. On the other hand, halt-resume has its own overheads, namely the latency to resume a halted wavefront.
We have found that polling yields better performance when system calls are invoked at coarser work-group granularities, since fewer memory locations are needed to convey information at the work-group versus work-item level. Consequently, the memory locations easily fit in the GPU's L2 data cache. When system calls are invoked per work-item, however, the sheer number of such memory locations becomes so high that cache thrashing becomes an issue. In these situations, halt-resume approaches outperform polling. We quantify the impact of this in the following sections.

Table III. System configuration used for our studies.

SoC:              Mobile AMD FX-9800P
CPU:              4x 2.7 GHz, AMD Family 21h Model 101h
Integrated GPU:   758 MHz AMD GCN 3 GPU
Memory:           16 GB Dual-Channel DDR4-1066MHz
Operating system: Fedora 26 using ROCm stack 1.6 (based on Linux 4.11)
Compiler:         HCC-0.10.17166 + LLVM 5.0; C++AMP with HC extensions

Figure 5. Content of each slot in the syscall area.

Figure 6. State transition diagram for a slot in the syscall area. Green shows GPU state/actions; blue shows that of the CPU.
VI. IMPLEMENTING GENESYS
We implemented GENESYS on the system in Table III. We used an AMD FX-9800P processor with an integrated GPU and ran the open-source Radeon Open Compute (ROCm) software stack [47]. Although we use a system with an integrated GPU, GENESYS is not specific to integrated GPUs and generalizes to discrete GPUs. We modified the GPU driver and Linux kernel to enable GPU system calls. We also modified the HCC compiler to permit GPU system call invocations in C++AMP.
Invoking GPU system calls. GENESYS permits work-item, work-group, and kernel-level invocation. At work-group or kernel-level invocations, a single work-item is designated as the caller. For strongly ordered work-group invocations, we use work-group scope barriers before and after system call invocations. For relaxed ordering, a barrier is placed either before (for consumer system calls) or after (for producer calls) system call invocation.
GPU to CPU communication. GENESYS facilitates efficient GPU to CPU communication of system call requests.
GENESYS uses a preallocated shared memory syscall area to allow the GPU to convey parameters to the CPU (see Section III). The syscall area maintains one slot for each active GPU work-item. The OS and driver code can identify the desired slot by using the hardware ID of the active work-item, which is available to GPU runtimes. This hardware ID is distinct from the programmer-visible work-item ID. Each of the work-items has a unique work-item ID visible to the application programmer. At any one point in time, however, only a subset of these work-items executes (as permitted by the GPU's CU count, supported work-groups per CU, and SIMD width). The hardware ID distinguishes among these active work-items. Overall, our system uses 64 bytes per slot, totaling 1.25 MB of syscall area.
Figure 5 shows the contents in a slot: the requested system call number, the request state, system call arguments (as many as six in Linux), and padding to avoid false sharing. The field for arguments is also re-purposed for the return value of the system call. When a GPU program's work-item invokes a system call, it atomically updates the state of the corresponding slot from free to populating (Figure 6). If the slot is not free, system call invocation is delayed. Once the state is populating, the invoking work-item populates the slot with system call information and changes the state to ready. The work-item also adds one bit of information about whether the invocation is blocking or non-blocking. The work-item then interrupts the CPU using a scalar wavefront GPU instruction3 (s_sendmsg on AMD GPUs). For blocking invocation, the work-item either waits and polls the state of the slot or suspends itself using halt-resume.
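The slot layout and state machine can be sketched as follows. This is our reading of Figures 5 and 6, not driver code: the exact field widths are our assumption (the paper specifies only 64 bytes total and up to six arguments), and a lock stands in for the GPU's atomic compare-and-swap.

```python
# Sketch of a syscall-area slot (our reading of Figures 5/6, not driver
# code): a 64-byte, cache-line-sized record plus the free -> populating ->
# ready transitions performed with a compare-and-swap.
import struct, threading

# state (u32) + syscall number (u32) + 6 x i64 args + 8 pad bytes = 64 B.
# Field widths are our assumption; the 64-byte total is from the paper.
SLOT_LAYOUT = struct.Struct("=II6q8x")
FREE, POPULATING, READY, PROCESSING, FINISHED = range(5)

class Slot:
    def __init__(self):
        self._lock = threading.Lock()   # models the GPU's atomic cmp-swap
        self.state = FREE
        self.num, self.args = 0, [0] * 6

    def cas_state(self, expect, new):
        """Atomically move state from `expect` to `new`; False if taken."""
        with self._lock:
            if self.state != expect:
                return False
            self.state = new
            return True

    def invoke(self, num, args):
        """Work-item side: claim the slot, fill it, publish it as READY."""
        if not self.cas_state(FREE, POPULATING):
            return False                # slot busy: invocation is delayed
        self.num, self.args = num, (list(args) + [0] * 6)[:6]
        assert self.cas_state(POPULATING, READY)
        return True
```

The CPU-side worker would then use the same compare-and-swap to move READY slots to PROCESSING before carrying out the call.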
CPU-side system call processing. Once the GPU interrupts the CPU, system call processing commences. The interrupt handler creates a new kernel task and adds it to Linux's work-queue. This task is also passed the hardware ID of the requesting wavefront. At an expedient future point in time, an OS worker thread executes this task. The task scans the 64 syscall slots of the given hardware ID and atomically switches any ready system call requests to the processing state. The task then carries out the system call work.
A challenge is that Linux's traditional system call routines implicitly assume that they are to be executed within the context of the original process invoking the system call. Consequently, they use context-specific variables (e.g., the current variable used to identify the process context). This presents a challenge for GPU system calls, which are instead serviced purely in the context of the OS' worker thread. GENESYS overcomes this challenge in two ways: it either switches to the context of the original CPU program that invoked the GPU kernel, or it provides additional context information in the code performing system call processing.
3 Scalar wavefront instructions are part of the Graphics Core Next ISA and are executed once for the entire wavefront, rather than for each active work-item. See Chapter 4.1 in the manual [36].
The exact strategy is determined on a case-by-case basis.

GENESYS implements coalescing by waiting for a predetermined amount of time in the interrupt handler before enqueueing a task to process a system call. If multiple requests are made to the CPU during this time window, they are coalesced such that they can be handled as a single unit of work by the OS worker-thread. GENESYS uses Linux's sysfs interface to communicate coalescing parameters.
Communicating results from the CPU to the GPU. Once the CPU worker-thread finishes processing the system call, the results are put in the field for arguments in the slot for blocking requests. Further, the thread also changes the state of the slot to finished for blocking system calls. For non-blocking invocations, the state is changed to free. The invoking GPU work-item is then re-scheduled (on hardware that supports wavefront suspension) or automatically restarted because it was polling on this state. The work-item can consume the result and continue execution.
Other architectural design considerations. GENESYS requires two key data structures to be exchanged between GPUs and CPUs: the syscall area and, for some system calls, syscall buffers that maintain data required for system call processing. We discovered that carefully leveraging architectural support for CPU-GPU cache coherence in the context of these data structures was vital to overall performance.
Consider, for example, the syscall area. As described in Section III, every GPU work-item invoking a system call is allocated space in the syscall area. Like many GPUs, the one used as our experimental platform supports L2 data caches that are coherent with CPU caches/memory but integrates non-coherent L1 data caches. At first blush, one may decide to manually invalidate L1 data cache lines. However, we sidestepped this issue by restricting per-work-item slots in the syscall area to individual cache lines. This design permits us to use atomic instructions to access memory: these atomic instructions, by design, force lookups of the L2 data cache and guarantee visibility of the entire cacheline, sidestepping the coherence challenges of L1 GPU data caches. We quantify the measured overheads of the atomic operations we use for GENESYS in Table IV, comparing them to the latency of a standard load operation. Through experimentation, we achieved good performance using cmp-swap atomics to claim a slot in the syscall area when GPU work-items invoked system calls, atomic-swaps to change state, and atomic-loads to poll for completion.
Unfortunately, the same approach of using atomics does not yield good performance for accesses to syscall buffers. The key issue is that syscall buffers can be large and span multiple cache lines. Using atomics here meant that we suffered the latency of several L2 data cache accesses to syscall buffers. We found that a better approach was to eschew atomics in favor of manual software L1 data cache coherence. This meant, for example, that we preceded sys_write system calls with an L1 data cache flush.

Table IV. Profiled performance of GPU atomic operations.

Op:        cmp-swap  swap   atomic-load  load
Time (us): 1.352     0.782  0.360        0.243

Figure 7. Impact of system call invocation granularity. Left: GPU time (s) versus file size (MB) for work-item (wi), work-group (wg), and kernel invocation granularities. Right: GPU time (s) versus file size (MB) for work-group sizes of 64 (wg64) to 1024 (wg1024) work-items. The graphs show average and standard deviation of 10 runs.
VII. MICROBENCHMARK EVALUATIONS
Invocation granularity. Figure 7 quantifies performance for a microbenchmark that uses pread on files in tmpfs4. The x-axis plots the file size, and the y-axis shows read time, with lower values being better. Within each cluster, we separate runtimes for different pread invocation granularities.
Figure 7 (left) shows that work-item invocation granularities tend to perform worst. This is not surprising, as it is the finest granularity of system call invocation and leads to a flood of individual system calls that overwhelm the CPU. On the other end of the granularity spectrum, kernel-level invocation is also challenging, as it generates a single system call and fails to leverage any potential parallelism in the processing of system call requests. This is particularly pronounced at large file sizes (e.g., 2GB). A good compromise is to use work-group invocation granularities. It does not overwhelm the CPU with system call requests while still being able to exploit parallelism in servicing system calls.
When using work-group invocation, an important question is how many work-items should constitute a work-group. While Figure 7 (left) uses 64 work-items in a work-group, Figure 7 (right) quantifies the performance impact of pread system calls as we vary work-group sizes from 64 (wg64) to 1024 (wg1024) work-items. In general, larger work-group sizes enable better performance, as there are fewer unique system calls necessary to handle the same amount of work.
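The core of the microbenchmark can be sketched on the host (our simplification, with file and chunk sizes of our choosing): the same file is read either as one large pread or as many chunked preads issued by a pool of workers, mirroring the kernel versus work-group granularities.

```python
# Sketch of the microbenchmark's core idea (ours, simplified): read a file
# with many positioned preads issued concurrently. os.pread carries its
# own offset, so workers need no shared seek pointer.
import os, tempfile
from concurrent.futures import ThreadPoolExecutor

def chunked_pread(path, chunk, workers=4):
    """Read the whole file as `chunk`-byte preads and reassemble it."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        offsets = range(0, size, chunk)
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # pool.map preserves offset order, so the join is correct even
            # though the preads complete out of order.
            parts = list(pool.map(lambda o: os.pread(fd, chunk, o), offsets))
        return b"".join(parts)
    finally:
        os.close(fd)
```

A smaller `chunk` models finer invocation granularity: more system calls for the same amount of data.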
Blocking and ordering strategies To quantify the impactof blockingnon-blocking with strongrelaxed ordering we
4 Tmpfs is a filesystem without permanent backing storage In otherwords all structures and data are memory-resident
[Figure 8: Performance implications of system call blocking and ordering semantics. GPU time per permutation (ms) vs. number of compute permutations, for weak-block, weak-non-block, strong-block, and strong-non-block. The graph shows average and standard deviation of 80 runs.]
designed a GPU microbenchmark that performs block permutation on an array, similar to the permutation steps performed in DES encryption. The input data array is preloaded with random values and divided into 8KB blocks. Work-groups, each executing 1024 work-items, independently permute blocks. The results are written to a file using pwrite at work-group granularity. The pwrite system calls for one block of data are overlapped with permutations on other blocks of data. To vary the amount of computation per system call, we permute multiple times before writing the result.
Figure 8 quantifies the impact of using blocking versus non-blocking system calls with strong and weak ordering. The x-axis plots the number of permutation iterations performed on each block by each work-group before writing the results. The y-axis plots execution time for one permutation (lower is better). Figure 8 shows that strongly ordered blocking invocations (strong-block) hurt performance. This is expected, as they require work-group scoped barriers to be placed before and after pwrite invocations. The GPU's hardware resources for work-groups are stalled waiting for the pwrite to complete. Not only does the inability to overlap GPU parallelism hurt strongly ordered blocking performance, it also means that GPU performance is heavily influenced by CPU-side performance vagaries like synchronization overheads, servicing other processes, etc. This is particularly true at iteration counts where system call overheads dominate GPU-side computation (below 15 compute iterations). Even when the application becomes GPU-compute bound, performance remains non-ideal.
Figure 8 shows that when pwrite is invoked in a non-blocking manner (with strong ordering), performance improves. This is because non-blocking pwrites permit the GPU to end the invoking work-group, freeing the GPU resources it was occupying. CPU-side pwrite processing can overlap with the execution of another work-group permuting a different block of data. Figure 8 shows that latencies generally drop by 30% compared to blocking calls at low iteration counts. At higher iteration counts (beyond 16), these benefits diminish because the latency to perform repeated
[Figure 9: Impact of polling on memory contention. CPU memcpy throughput (GB/s) vs. number of atomically polled GPU cache lines, measured on an AMD FX-9800P with DDR4-1066MHz memory.]
permutations dominates any system call processing times.
For relaxed ordering with blocking (weak-block), the post-system-call work-group-scope barrier is eliminated. One out of every 16 wavefronts5 in the work-group waits for the blocking system call while the others exit, freeing GPU resources. The GPU can use these freed resources to run other work-items to hide CPU-side system call latency. Consequently, the performance trends follow those of strong-non-block, with minor differences in performance arising from differences in GPU scheduling of the work-groups for execution. Finally, Figure 8 shows that system call latency is best hidden using relaxed and non-blocking approaches (weak-non-block).
Polling/halt-resume and memory contention. As previously discussed, polling at the work-item invocation granularity leads to memory reads of several thousands of per-work-item memory locations. We quantify the resulting memory contention in Figure 9, which shows how the throughput of CPU accesses decreases as the number of polled GPU cache lines increases. Once the number of polled memory locations outstrips the GPU's L2 cache capacity (roughly 4096 cache lines on our platform), the GPU polls locations spilled to the DRAM. This causes contention on the memory controllers shared between GPUs and CPUs. We advocate using GENESYS with halt-resume approaches at such granularities of system call invocation.
Interrupt coalescing. Figure 10 shows the performance impact of coalescing system calls. We use a microbenchmark that invokes pread. We read data from files of different sizes using a constant number of work-items; more data is read per pread system call from the larger files. The x-axis shows the amount of data read per system call and quantifies the latency per requested byte. We present two bars for each point on the x-axis, illustrating the average time needed to read one byte with the system call in the absence of coalescing and when up to eight system calls are coalesced. Coalescing is most beneficial when small amounts of data are read. When more data is read, the system call takes longer, and the overhead reduction is negligible compared to the significantly longer time to process the system call.
5 Each wavefront has 64 work-items. Thus, a 1024 work-item work-group has 16 wavefronts.
[Figure 10: Implications of system call coalescing. GPU time per requested byte (ms) vs. size per syscall (B), comparing uncoalesced with coalesced (8 interrupts). The graph shows average and standard deviation of 20 runs.]
VIII. CASE STUDIES

A. Memory Workload
We now assess the end-to-end performance of workloads that use GENESYS. Our first application requires memory management system calls. We studied miniAMR [48] and used the madvise system call directly from the GPU to better manage memory. MiniAMR performs 3D stencil calculations using adaptive mesh refinement and is a candidate for memory management because it varies its memory needs in a data-model-dependent manner. For example, when simulating regions experiencing turbulence, miniAMR needs to model with higher resolution. However, if lower resolution is possible without compromising simulation accuracy, miniAMR reduces memory and computation usage, making it possible to free memory. While relinquishing excess memory in this way is not possible in traditional GPU programs without explicitly dividing the offload into multiple kernels interspersed with CPU code (see Figure 1), GENESYS permits this with direct GPU madvise invocations. We invoke madvise at work-group granularity with non-blocking and weak ordering. We leverage GENESYS' generality by also using getrusage to read the application's resident set size (RSS). When the RSS exceeds a watermark, madvise relinquishes memory.
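The watermark policy just described can be sketched with ordinary POSIX calls. This is a CPU-side illustration of the getrusage/madvise pairing, not the GPU-resident GENESYS code, and the watermark value is an arbitrary parameter:

```cpp
#include <cstddef>
#include <sys/mman.h>
#include <sys/resource.h>

// If the process's resident set size exceeds `watermark_kb`, hint to the
// kernel that a no-longer-needed buffer can be reclaimed. Returns true if
// the hint was issued. (ru_maxrss is reported in KB on Linux.)
bool release_if_over_watermark(void* buf, size_t len, long watermark_kb) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return false;
    if (ru.ru_maxrss <= watermark_kb) return false;
    // Pages are dropped and refault as zero-fill if touched again later,
    // mirroring the page-fault cost the paper observes for rss-3GB.
    return madvise(buf, len, MADV_DONTNEED) == 0;
}
```

The choice of MADV_DONTNEED makes the memory/performance trade explicit: a lower watermark releases more aggressively but pays more refaults, exactly the rss-3GB versus rss-4GB trade-off shown in Figure 11.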
[Figure 11: Memory footprint of miniAMR using getrusage and madvise to hint at unused memory. RSS (GB) vs. time (s) for rss-3GB and rss-4GB watermarks.]
We execute miniAMR with a dataset of 4.1GB, just beyond the hard limit we put on the physical memory available to the GPU. Without using madvise, memory swapping increases so dramatically that it triggers GPU timeouts, causing the existing GPU device driver to terminate the application. Because of this, there is no baseline to compare to, as the baseline simply does not complete.
Figure 11 shows two results: one for a 3GB RSS watermark and one for 4GB. Not only does GENESYS enable miniAMR to complete, it also permits the programmer to balance memory usage and performance. While rss-3GB lowers memory utilization, it also worsens runtime compared to rss-4GB. This performance gap is expected: the more memory is released, the greater the likelihood that the GPU program suffers from page faults when the memory is touched again in the future, and the more frequently the madvise system call is invoked. Overall, Figure 11 illustrates that GENESYS allows programmers to perform memory allocation to trade memory usage and performance on GPUs, analogous to CPUs.
B. Workload Using Signals
[Figure 12: Running time (s) of the CPU-GPU map-reduce workload, baseline versus GENESYS.]
GENESYS also enables system calls that permit the GPU to send asynchronous notifications to the CPU. This is useful in many scenarios. We study one such scenario and implement a map-reduce application called signal-search. The application runs in two phases. The first phase performs parallel lookup in a data array, while the second phase processes blocks of retrieved data and computes sha512 checksums on them. The first phase is a good fit for GPUs, since the highly parallel lookup fits their execution model, while the second phase is more appropriate for CPUs, many of which support performance acceleration of sha checksums via dedicated instructions.
Without support for signal invocation on GPUs, programmers would launch a kernel with the entire parallel lookup on the GPU and wait for it to conclude before proceeding with the sha512 checksums. GENESYS overcomes this restriction and permits a heterogeneous version of this code, where GPU work-groups can emit signals using rt_sigqueueinfo to the CPU, indicating per-block completions of the parallel search. As a result, the CPU can begin processing these blocks immediately, permitting overlap with GPU execution.
Operationally, rt_sigqueueinfo is a generic signal system call that allows the caller to fill a siginfo structure that is passed along with the signal. In our workload, we find that work-group level invocations perform best, so the GPU passes the identifier of the work-group through the siginfo structure. Figure 12 shows that using work-group invocation granularity and non-blocking invocation results in roughly a 14% performance speedup over the baseline.
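The siginfo payload mechanism can be exercised from plain CPU code via the libc sigqueue wrapper around rt_sigqueueinfo. In this sketch the "work-group id" is a made-up value standing in for the identifier the GPU would pass:

```cpp
#include <csignal>
#include <unistd.h>

static volatile sig_atomic_t received = -1;

// Handler reads the integer payload carried with the signal; in the paper's
// workload this would be the identifier of the finishing work-group.
static void on_block_done(int, siginfo_t* info, void*) {
    received = info->si_value.sival_int;
}

// Send ourselves a real-time signal carrying `wg_id` and wait for delivery.
int signal_roundtrip(int wg_id) {
    struct sigaction sa = {};
    sa.sa_flags = SA_SIGINFO;              // request the siginfo payload
    sa.sa_sigaction = on_block_done;
    sigaction(SIGRTMIN, &sa, nullptr);

    union sigval v;
    v.sival_int = wg_id;                   // hypothetical work-group id
    sigqueue(getpid(), SIGRTMIN, v);       // libc wrapper over rt_sigqueueinfo

    while (received == -1) { }             // spin until the handler runs
    return received;
}
```

A receiving CPU thread would use the delivered work-group id to pick which completed block to checksum next, overlapping with the still-running GPU kernel.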
C. Storage Workloads
We have also studied how to use GENESYS to support storage workloads. In some cases, GENESYS permits the implementation of workloads supported by GPUfs, but in more efficient ways.
[Figure 13: a) Running time (s) of standard grep and OpenMP grep versus work-item/work-group invocations with GENESYS (CPU-grep, GPU-GENESYS (WG), GPU-GENESYS (WI), GPU-GENESYS (WI)-suspend, CPU-openmp). b) Running time (s) of CPU, GPU with no system calls (GPU-NOSYS), and GENESYS implementations of wordcount. Graphs show average and standard deviation of 10 runs.]
Supporting storage workloads more efficiently than GPUfs. We implement a workload that takes as input a list of words and a list of files. It then performs grep -F -l on the GPU, identifying which files contain any of the words on the list. As soon as these files are found, they are printed to the console. This workload cannot be supported by GPUfs without significant code refactoring because of its use of custom APIs. Instead, since GENESYS naturally adheres to the UNIX "everything is a file" philosophy, porting grep to GPUs requires only a few hours of programming.
Figure 13 shows the results of our GPU grep experiments. We compare a standard CPU implementation, a parallelized OpenMP CPU implementation, and two GPU implementations with GENESYS, with non-blocking system calls invoked at work-item (WI) and work-group (WG) granularities. Furthermore, since work-item invocations can sometimes achieve better performance using halt-resume (versus work-group and kernel invocations, which always achieve better performance with polling), we separate results for WI-polling and WI-halt-resume. GENESYS enables our GPU grep implementation to print output lines to the console or files using standard OS output. GENESYS achieves 20-23× speedups over an OpenMP version of grep.
Figure 13 also shows that GENESYS' flexibility with invocation granularity, blocking/non-blocking, and strong/relaxed ordering can boost performance (see Section V). For our grep example, because a file only needs to be searched until the first instance of a matching word, a work-item can immediately invoke a write of the filename rather than waiting for all matching files. We find that WI-halt-resume outperforms both WG and WI-polling by roughly 3-4%.
[Figure 14: Wordcount I/O throughput (MB/s) and CPU utilization (%) over time (s), reading from SSD, for GPU-NOSYS, GPU-GENESYS, and CPU. Graphs show average and standard deviation of 10 runs.]
Workload from prior work. GENESYS also supports a version of the same workload that is evaluated in the original GPUfs work: traditional word count using open, read, and close system calls. Figure 13 shows our results. We compare the performance of a parallelized CPU version of the workload with OpenMP, a GPU version of the workload with no system calls, and a GPU version with GENESYS. Both CPU and GPU workloads are configured to search for occurrences of 64 strings. We found that with GENESYS these system calls were best invoked at work-group granularity with blocking and weak-ordering semantics. All results are collected on a system with an SSD.
We found that GENESYS achieves nearly a 6× performance improvement over the CPU version. Without system call support, the GPU version is far worse than the CPU version. Figure 14 sheds light on these benefits. We plot traces for CPU and GPU utilization and disk I/O throughput. GENESYS extracts much higher throughput from the underlying storage device (up to 170MB/s compared to the CPU's 30MB/s). Offloading search string processing to the GPU frees up the CPU to process system calls effectively. The change in CPU utilization between the GPU workload and the CPU workload reveals this trend. In addition, we found that the GPU's ability to launch more concurrent I/O requests enabled the I/O scheduler to make better scheduling decisions.
D. Network Workloads
We have studied the benefits of GENESYS for network I/O in the context of memcached. While this can technically be implemented using GPUnet, the performance implications are unclear because the original GPUnet paper used APIs that extracted performance using dedicated RDMA hardware. We make no such assumptions and focus on the two core commands, SET and GET. SET stores a key-value pair, and GET retrieves the value associated with a key if it is present. Our implementation supports a binary UDP memcached version with a fixed-size hash table as back-end storage. The hash table is shared between CPU and GPU, and its bucket size and count are configurable. Further, this memcached
[Figure 15: Latency (ms) and throughput (ops/s) of memcached for hits and misses, comparing CPU, GPU-GENESYS, and GPU-NOSYS. Graphs show average and standard deviation of 20 runs.]
implementation enables concurrent operations: CPUs can handle SETs and GETs, while the GPU supports only GETs. Our GPU implementation parallelizes the hash computation, bucket lookup, and data copy. We use sendto and recvfrom system calls for UDP network access. These system calls are invoked at work-group granularity with blocking and weak ordering, as this performs best.
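The sendto/recvfrom pattern used for the UDP GET path can be exercised with ordinary sockets. The loopback exchange below is a host-side sketch; the request payload is arbitrary and does not follow the real memcached binary protocol:

```cpp
#include <arpa/inet.h>
#include <cstring>
#include <sys/socket.h>
#include <unistd.h>

// Send `msg` from a client UDP socket to a loopback "server" socket and
// read it back with recvfrom. Returns true if the payload arrives intact.
bool udp_roundtrip(const char* msg) {
    int srv = socket(AF_INET, SOCK_DGRAM, 0);
    sockaddr_in addr = {};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                                  // let the kernel pick a port
    bind(srv, (sockaddr*)&addr, sizeof(addr));
    socklen_t alen = sizeof(addr);
    getsockname(srv, (sockaddr*)&addr, &alen);          // learn the chosen port

    int cli = socket(AF_INET, SOCK_DGRAM, 0);
    size_t len = strlen(msg) + 1;
    sendto(cli, msg, len, 0, (sockaddr*)&addr, sizeof(addr));

    char buf[512] = {};
    sockaddr_in from = {};
    socklen_t flen = sizeof(from);
    ssize_t n = recvfrom(srv, buf, sizeof(buf), 0, (sockaddr*)&from, &flen);

    close(cli);
    close(srv);
    return n == (ssize_t)len && strcmp(buf, msg) == 0;
}
```

In the GENESYS version, a designated work-item per work-group issues these same calls directly from the GPU, with the reply address from recvfrom reused for the sendto carrying the GET response.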
Figure 15 compares the performance of a CPU version of this workload with GPU versions using GENESYS. GPUs accelerate memcached by parallelizing lookups on buckets with more elements. For example, Figure 15 shows speedups when there are 1024 elements per bucket (with 1KB data size). Without system calls, GPU performance lags behind CPU performance. However, GENESYS achieves 30-40% latency and throughput benefits over not just the CPU versions but also the GPU versions without direct system calls.
E. Device Control
Finally, we also used GENESYS to implement ioctl system calls. As an example, we used ioctl to query and control framebuffer settings. The implementation is straightforward: the GPU opens /dev/fb0 and issues a series of ioctl commands to query and set settings of the active frame buffer. It then proceeds to mmap the framebuffer memory and fill it with data from a previously mmaped raster image. This results in the image being displayed on the computer screen. While not a critical GPGPU application, this ioctl example demonstrates the generality and flexibility of the OS interfaces implemented by GENESYS.
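On the CPU side, the query step of this sequence uses the standard Linux fbdev ioctl interface. A minimal sketch that degrades gracefully on machines without an accessible framebuffer:

```cpp
#include <cstdio>
#include <fcntl.h>
#include <linux/fb.h>
#include <sys/ioctl.h>
#include <unistd.h>

// Query the active framebuffer's mode via the FBIOGET_VSCREENINFO ioctl.
// Returns the bit depth, or -1 if the device cannot be opened or queried
// (e.g., on a headless machine).
int query_fb_bpp(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    fb_var_screeninfo vinfo;
    int bpp = -1;
    if (ioctl(fd, FBIOGET_VSCREENINFO, &vinfo) == 0) {
        // Resolution and bit depth of the active framebuffer.
        std::printf("%ux%u, %u bpp\n", vinfo.xres, vinfo.yres, vinfo.bits_per_pixel);
        bpp = static_cast<int>(vinfo.bits_per_pixel);
    }
    close(fd);
    return bpp;
}
```

The GPU version issues the same open/ioctl sequence through GENESYS, then mmaps the framebuffer and writes pixels directly.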
IX. DISCUSSION
Asynchronous system call handling. GENESYS enqueues the GPU system call's kernel task and processes it outside of the interrupt context. We do this because Linux is designed such that few operations can be processed in an interrupt handler. A potential concern with this design, however, is that it defers system call processing potentially past the end of the lifetime of the GPU thread, and potentially of the process that created the GPU thread itself. It is an example of a more general problem with asynchronous system calls [49].
Figure 16. Raster image copied to the framebuffer by the GPU.
Our solution is to provide a new function call, invoked by the CPU, that ensures all GPU system calls have completed before the termination of the process.
Related work. Beyond work already discussed [25, 26, 28], the latest generations of C++ AMP [23] and CUDA [24] provide access to the memory allocator. These efforts use a user-mode service thread on the CPU to proxy requests from the GPU [27]. Like GENESYS, system call requests are placed in a shared queue by the GPU. From there, however, the designs are different: their user-mode thread polls this shared queue and "thunks" the request to the libc runtime or OS. This incurs added overhead for entering and leaving the OS kernel.
Some studies provide network access to GPU code [26, 50-52]. NVidia provides GPUDirect [52], used by several MPI libraries [53-55], which allows the NIC to bypass main memory and communicate directly with memory on the GPU itself. GPUDirect does not provide a programming interface for GPU-side code; the CPU must initiate communication with the NIC. Oden exposed the memory-mapped control interface of the NIC to the GPU and thereby allowed the GPU to directly communicate with the NIC [51]. This low-level interface, however, lacks the benefits of a traditional OS interface (e.g., protection, sockets, TCP).
X. CONCLUSIONS
We shed light on research questions fundamental to the idea of accessing OS services from accelerators by realizing an interface for generic POSIX system call support on GPUs. Enabling such support requires subtle changes to existing kernels. In particular, traditional OSes assume that system call processing occurs in the same context as the invoking thread, and this needs to change for accelerators. We have released GENESYS to make these benefits accessible for broader research on GPUs.
XI. ACKNOWLEDGMENTS
We thank the National Science Foundation, which partially supported this work through grants 1253700 and 1337147. We thank Guilherme Cox and Mark Silberstein for their feedback. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a registered trademark of PCI-SIG Corporation. ARM is a registered trademark of ARM Limited. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. © 2018 Advanced Micro Devices, Inc. All rights reserved.
REFERENCES
[1] S. Mittal and J. Vetter, "A Survey of CPU-GPU Heterogeneous Computing Techniques," in ACM Computing Surveys, 2015.
[2] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. Lefohn, and T. Purcell, "A Survey of General-Purpose Computation on Graphics Hardware," in Computer Graphics Forum, Vol. 26, Num. 1, pp. 80-113, 2007.
[3] D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006.
[4] J. Vesely, A. Basu, M. Oskin, G. Loh, and A. Bhattacharjee, "Observations and Opportunities in Architecting Shared Virtual Memory for Heterogeneous Systems," in International Symposium on Performance Analysis of Systems and Software (ISPASS), 2016.
[5] J. Obert, J. V. Waveran, and G. Sellers, "Virtual Texturing in Software and Hardware," in SIGGRAPH Courses, 2012.
[6] G. Cox and A. Bhattacharjee, "Efficient Address Translation with Multiple Page Sizes," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017.
[7] L. Olson, J. Power, M. Hill, and D. Wood, "Border Control: Sandboxing Accelerators," in International Symposium on Microarchitecture (MICRO), 2007.
[8] B. Pichai, L. Hsu, and A. Bhattacharjee, "Architectural Support for Address Translation on GPUs," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014.
[9] J. Power, M. Hill, and D. Wood, "Supporting x86-64 Address Translation for 100s of GPU Lanes," in International Symposium on High Performance Computer Architecture (HPCA), 2014.
[10] N. Agarwal, D. Nellans, E. Ebrahimi, T. Wenisch, J. Danskin, and S. Keckler, "Selective GPU Caches to Eliminate CPU-GPU HW Cache Coherence," in International Symposium on High Performance Computer Architecture (HPCA), 2016.
[11] HSA Foundation, "HSA Programmer's Reference Manual [Online]," 2015. http://www.hsafoundation.com/?ddownload=4945
[12] G. Kyriazis, "Heterogeneous System Architecture: A Technical Review," 2012.
[13] J. Power, A. Basu, J. Gu, S. Puthoor, B. Beckmann, M. Hill, S. Reinhardt, and D. Wood, "Heterogeneous System Coherence for Integrated CPU-GPU Systems," in International Symposium on Microarchitecture (MICRO), 2013.
[14] S. Sahar, S. Bergman, and M. Silberstein, "Active Pointers: A Case for Software Address Translation on GPUs," in International Symposium on Computer Architecture (ISCA), 2016.
[15] C. J. Thompson, S. Hahn, and M. Oskin, "Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis," in International Symposium on Microarchitecture (MICRO), 2002.
[16] J. Owens, U. Kapasi, P. Mattson, B. Towles, B. Serebrin, S. Rixner, and W. Dally, "Media Processing Applications on the Imagine Stream Processor," in International Conference on Computer Design (ICCD), 2002.
[17] J. Owens, B. Khailany, B. Towles, and W. Dally, "Comparing Reyes and OpenGL on a Stream Architecture," in SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2002.
[18] S. Rixner, W. Dally, U. Kapasi, B. Khailany, A. Lopez-Lagunas, P. Mattson, and J. Owens, "A Bandwidth-Efficient Architecture for Media Processing," in International Symposium on Microarchitecture (MICRO), 1998.
[19] B. Serebrin, J. Owens, B. Khailany, P. Mattson, U. Kapasi, C. Chen, J. Namkoong, S. Crago, S. Rixner, and W. Dally, "A Stream Processor Development Platform," in International Conference on Computer Design (ICCD), 2002.
[20] Khronos Group, "OpenGL," 1992. https://www.khronos.org/opengl
[21] Microsoft, "Programming Guide for DirectX [Online]," 2016. https://msdn.microsoft.com/en-us/library/windows/desktop/ff476455(v=vs.85).aspx
[22] Khronos Group, "OpenCL," 2009. https://www.khronos.org/opencl
[23] Microsoft, "C++ AMP (C++ Accelerated Massive Parallelism) [Online]," 2016. https://msdn.microsoft.com/en-us/library/hh265137.aspx
[24] NVidia, "CUDA Toolkit Documentation," 2016. http://docs.nvidia.com/cuda
[25] M. Silberstein, B. Ford, I. Keidar, and E. Witchel, "GPUfs: Integrating File Systems with GPUs," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.
[26] S. Kim, S. Huh, X. Zhang, Y. Hu, A. Wated, E. Witchel, and M. Silberstein, "GPUnet: Networking Abstractions for GPU Programs," in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
[27] J. Owens, J. Stuart, and M. Cox, "GPU-to-CPU Callbacks," in Workshop on Unconventional High Performance Computing, 2010.
[28] S. Bergman, T. Brokhman, T. Cohen, and M. Silberstein, "SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs," in USENIX Annual Technical Conference (ATC), 2017.
[29] AMD, "Radeon Open Compute [Online]," 2017. https://rocm.github.io
[30] AMD, "ROCK syscall," 2017. https://github.com/RadeonOpenCompute/ROCK_syscall
[31] AMD, "ROCT syscall," 2017. https://github.com/RadeonOpenCompute/ROCT_syscall
[32] AMD, "HCC syscall," 2017. https://github.com/RadeonOpenCompute/HCC_syscall
[33] AMD, "Genesys syscall test," 2017. https://github.com/RadeonOpenCompute/Genesys_syscall_test
[34] R. Draves, B. Bershad, R. Rashid, and R. Dean, "Using Continuations to Implement Thread Management and Communication in Operating Systems," in Symposium on Operating Systems Principles (SOSP), 1991.
[35] O. Shivers and M. Might, "Continuations and Transducer Composition," in International Symposium on Programming Languages Design and Implementation (PLDI), 2006.
[36] AMD, "Graphics Core Next 3.0 ISA [Online]," 2013. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf
[37] Intel, "Intel Open Source HD Graphics Programmers Reference Manual [Online]," 2014. https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-bdw-vol02b-commandreference-instructions_0_0.pdf
[38] NVidia, "NVIDIA Compute PTX: Parallel Thread Execution," 2009. https://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf
[39] PCI-SIG, "Atomic Operations [Online]," 2008. https://pcisig.com/sites/default/files/specification_documents/ECN_Atomic_Ops_080417.pdf
[40] T. Sorensen, A. F. Donaldson, M. Batty, G. Gopalakrishnan, and Z. Rakamaric, "Portable Inter-Workgroup Barrier Synchronisation for GPUs," in Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2016), New York, NY, USA, pp. 39-58, ACM, 2016.
[41] T. Rogers, M. O'Connor, and T. Aamodt, "Cache-Conscious Wavefront Scheduling," in International Symposium on Microarchitecture (MICRO), 2012.
[42] T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," in International Symposium on Microarchitecture (MICRO), 2013.
[43] A. ElTantawy and T. Aamodt, "Warp Scheduling for Fine-Grained Synchronization," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[44] H. Wang, F. Luo, M. Ibrahim, O. Kayiran, and A. Jog, "Efficient and Fair Multi-programming in GPUs via Pattern-Based Effective Bandwidth Management," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[45] O. Kayiran, A. Jog, M. Kandemir, and C. Das, "Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs," in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013.
[46] X. Tang, A. Pattnaik, H. Jiang, O. Kayiran, A. Jog, S. Pai, M. Ibrahim, M. T. Kandemir, and C. R. Das, "Controlled Kernel Launch for Dynamic Parallelism in GPUs," in International Symposium on High Performance Computer Architecture (HPCA), 2017.
[47] AMD, "ROCm: a New Era in Open GPU Computing [Online]," 2016. https://radeonopencompute.github.io/index.html
[48] A. Sasidharan and M. Snir, "MiniAMR - A miniapp for Adaptive Mesh Refinement," 2014. http://hdl.handle.net/2142/91046
[49] Z. Brown, "Asynchronous System Calls," in Ottawa Linux Symposium, 2007.
[50] J. A. Stuart, P. Balaji, and J. D. Owens, "Extending MPI to Accelerators," in PACT 2011 Workshop Series: Architectures and Systems for Big Data, Oct. 2011.
[51] L. Oden, "Direct Communication Methods for Distributed GPUs," Dissertation Thesis, Ruprecht-Karls-Universität Heidelberg, 2015.
[52] NVidia, "NVIDIA GPUDirect [Online]," 2013. https://developer.nvidia.com/gpudirect
[53] H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, "GPU-aware MPI on RDMA-enabled Clusters: Design, Implementation and Evaluation," IEEE Transactions on Parallel and Distributed Systems, vol. 25, pp. 2595-2605, Oct. 2014.
[54] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, "Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation," in Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, pp. 97-104, September 2004.
[55] IBM, "IBM Spectrum MPI," 2017. http://www-03.ibm.com/systems/spectrum-computing/products/mpi
Blocking invocation, however, ensures that one wavefront waits for the write to complete and can raise an alarm if the write fails. Consider another scenario where a programmer wishes to prefetch data using read system calls but may not use the results immediately. Here, weak ordering with non-blocking invocation is likely to provide the best performance without breaking the program's semantics. In short, different combinations of blocking and ordering enable programmers to fine-tune performance and programmability tradeoffs.
B. CPU Hardware
GPUs rely on extreme parallelism for performance. This means there may be bursts of system calls that CPUs need to process. System call coalescing is one way to increase the throughput of system call processing. The idea is to collect several GPU system call requests and batch their processing on the CPU. The benefit is that CPUs can manage multiple system calls as a single unit, scheduling a single software task to process them. This reduction in task management, scheduling, and processing overheads can often boost performance (see Section VII). Coalescing must be performed judiciously, as it improves system call processing throughput at the expense of latency. It also implicitly serializes the processing of system calls within a coalesced bundle.
To allow the GPGPU programmer to balance the benefits of coalescing with its potential challenges, GENESYS accepts two parameters: a time window length within which the CPU coalesces system calls, and the maximum number of system call invocations that can be coalesced within the time window. Section VII shows that system call coalescing can improve performance by as much as 10-15%.
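A minimal sketch of this two-parameter batching policy follows. The types and thresholds are illustrative; the real GENESYS logic lives in the kernel driver:

```cpp
#include <chrono>
#include <vector>

// Collects incoming system call requests and flushes them as one batch when
// either the time window closes or the batch reaches its maximum size.
struct Coalescer {
    std::chrono::milliseconds window;     // parameter 1: coalescing time window
    std::size_t max_batch;                // parameter 2: max calls per batch
    std::vector<int> pending;             // queued request ids
    std::chrono::steady_clock::time_point window_start{};
    std::size_t batches_flushed = 0;

    Coalescer(std::chrono::milliseconds w, std::size_t m)
        : window(w), max_batch(m) {}

    void add(int request, std::chrono::steady_clock::time_point now) {
        if (pending.empty()) window_start = now;  // batch begins with first request
        pending.push_back(request);
        if (pending.size() >= max_batch || now - window_start >= window) flush();
    }

    void flush() {
        if (pending.empty()) return;
        // One CPU software task would process all `pending` requests here,
        // amortizing task creation and scheduling overheads across the batch.
        ++batches_flushed;
        pending.clear();
    }
};
```

The latency/throughput tension is visible in the two flush triggers: a longer window or bigger batch amortizes more overhead but delays the first request in the bundle, and requests within a bundle are processed serially.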
C. CPU-GPU Communication Hardware
Prior work implemented system calls using polling, where GPU wavefronts monitored predesignated memory locations populated by CPUs upon system call completion. But recent advances in GPU hardware enable alternate modes of CPU-GPU communication. For example, AMD GPUs now allow wavefronts to relay interrupts to CPUs and then halt execution, relinquishing SIMD hardware resources [36]. CPUs can in turn message the GPU to wake up halted wavefronts.
We have implemented polling and halt-resume approaches in GENESYS. With polling, if the number of memory locations that need to be polled by the GPU exceeds its cache size, frequent cache misses lower performance. On the other hand, halt-resume has its own overheads, namely the latency to resume a halted wavefront.
We have found that polling yields better performance when system calls are invoked at coarser work-group granularities, since fewer memory locations are needed to convey information at the work-group versus work-item level. Consequently, the memory locations easily fit in the GPU's L2 data cache. When system calls are invoked per work-item, however, the
Table III. SYSTEM CONFIGURATION USED FOR OUR STUDIES

SoC:              Mobile AMD FX-9800P
CPU:              4× 2.7GHz AMD Family 21h Model 101h
Integrated GPU:   758 MHz AMD GCN 3 GPU
Memory:           16 GB Dual-Channel DDR4-1066MHz
Operating system: Fedora 26 using ROCm stack 1.6 (based on Linux 4.11)
Compiler:         HCC-0.10.17166 + LLVM 5.0, C++ AMP with HC extensions
Figure 5. Content of each slot in the syscall area.
Figure 6. State transition diagram for a slot in the syscall area. Green shows GPU state/actions; blue shows that of the CPU.
sheer number of such memory locations becomes so high that cache thrashing becomes an issue. In these situations, halt-resume approaches outperform polling. We quantify the impact of this in the following sections.
VI. IMPLEMENTING GENESYS
We implemented GENESYS on the system in Table III. We used an AMD FX-9800P processor with an integrated GPU and ran the open-source Radeon Open Compute (ROCm) software stack [47]. Although we use a system with an integrated GPU, GENESYS is not specific to integrated GPUs and generalizes to discrete GPUs. We modified the GPU driver and Linux kernel to enable GPU system calls. We also modified the HCC compiler to permit GPU system call invocations in C++ AMP.
Invoking GPU system calls. GENESYS permits work-item, work-group, and kernel-level invocation. At work-group or kernel-level invocations, a single work-item is designated as the caller. For strongly ordered work-group invocations, we use work-group scope barriers before and after system call invocations. For relaxed ordering, a barrier is placed either before (for consumer system calls) or after (for producer calls) the system call invocation.
GPU to CPU communication. GENESYS facilitates efficient GPU to CPU communication of system call requests.
GENESYS uses a preallocated shared memory syscall area to allow the GPU to convey parameters to the CPU (see Section III). The syscall area maintains one slot for each active GPU work-item. The OS and driver code can identify the desired slot by using the hardware ID of the active work-item, which is available to GPU runtimes. This hardware ID is distinct from the programmer-visible work-item ID. Each of the work-items has a unique work-item ID visible to the application programmer. At any one point in time, however, only a subset of these work-items executes (as permitted by the GPU's CU count, supported work-groups per CU, and SIMD width). The hardware ID distinguishes among these active work-items. Overall, our system uses 64 bytes per slot, totaling 1.25 MBs of syscall area.
Figure 5 shows the contents of a slot: the requested system call number, the request state, system call arguments (as many as 6 in Linux), and padding to avoid false sharing. The field for arguments is also re-purposed for the return value of the system call. When a GPU program's work-item invokes a system call, it atomically updates the state of the corresponding slot from free to populating (Figure 6). If the slot is not free, system call invocation is delayed. Once the state is populating, the invoking work-item populates the slot with system call information and changes the state to ready. The work-item also adds one bit of information about whether the invocation is blocking or non-blocking. The work-item then interrupts the CPU using a scalar wavefront GPU instruction³ (s_sendmsg on AMD GPUs). For blocking invocation, the work-item either waits and polls the state of the slot or suspends itself using halt-resume.
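The GPU-side half of this protocol can be sketched as follows. This is a host-side Python model, not GENESYS code: the `Slot` class and state constants are illustrative names, and a lock stands in for the cache-line-sized atomic compare-and-swap the GPU actually uses.

```python
import threading

# Slot states from the transition diagram in Figure 6 (illustrative constants).
FREE, POPULATING, READY, PROCESSING, FINISHED = range(5)

class Slot:
    """One 64-byte syscall-area slot: syscall number, state, up to 6 args,
    and the blocking/non-blocking bit."""
    def __init__(self):
        self.state = FREE
        self.syscall_no = 0
        self.args = [0] * 6            # also reused for the return value
        self.blocking = False
        self._lock = threading.Lock()  # stands in for a cache-line CAS

    def cas_state(self, expected, new):
        """Atomic compare-and-swap on the slot's state word."""
        with self._lock:
            if self.state == expected:
                self.state = new
                return True
            return False

def invoke(slot, syscall_no, args, blocking=True):
    """GPU work-item side: claim the slot, fill it, mark it ready."""
    if not slot.cas_state(FREE, POPULATING):
        return False                   # slot busy: invocation is delayed
    slot.syscall_no = syscall_no
    slot.args[:len(args)] = args
    slot.blocking = blocking
    assert slot.cas_state(POPULATING, READY)
    # Here the real GPU would issue s_sendmsg to interrupt the CPU,
    # then poll the state or halt until it becomes FINISHED.
    return True
```

For example, `invoke(s, 1, [fd, buf_ptr, count])` models a work-item requesting a `write`-like call; a second invocation on the same slot is delayed until the CPU has freed it.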
CPU-side system call processing: Once the GPU interrupts the CPU, system call processing commences. The interrupt handler creates a new kernel task and adds it to Linux's work-queue. This task is also passed the hardware ID of the requesting wavefront. At an expedient future point in time, an OS worker thread executes this task. The task scans the 64 syscall slots of the given hardware ID and atomically switches any ready system call requests to the processing state. The task then carries out the system call work.
A challenge is that Linux's traditional system call routines implicitly assume that they are to be executed within the context of the original process invoking the system call. Consequently, they use context-specific variables (e.g., the current variable used to identify the process context). This presents a challenge for GPU system calls, which are instead serviced purely in the context of the OS' worker thread. GENESYS overcomes this challenge in two ways: it either switches to the context of the original CPU program that invoked the GPU kernel, or it provides additional context information in the code performing system call processing.
³ Scalar wavefront instructions are part of the Graphics Core Next ISA and are executed once for the entire wavefront, rather than for each active work-item. See Chapter 4.1 in the manual [36].
The exact strategy is determined on a case-by-case basis. GENESYS implements coalescing by waiting for a predetermined amount of time in the interrupt handler before enqueueing a task to process a system call. If multiple requests are made to the CPU during this time window, they are coalesced such that they can be handled as a single unit of work by the OS worker-thread. GENESYS uses Linux's sysfs interface to communicate coalescing parameters.
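The coalescing window described above can be modeled in a few lines. This is a sketch under stated assumptions: `Coalescer`, its window parameter, and the request objects are hypothetical names, standing in for the interrupt-handler logic and the sysfs-configurable window.

```python
import time

class Coalescer:
    """Models GENESYS-style interrupt coalescing: the first request opens a
    time window; requests arriving inside the window are batched into a
    single unit of worker-thread work."""
    def __init__(self, window_s):
        self.window_s = window_s   # the sysfs-configurable parameter
        self.pending = []
        self.deadline = None

    def on_interrupt(self, request, now=None):
        """Called from the (modeled) interrupt handler for each GPU request."""
        now = time.monotonic() if now is None else now
        if not self.pending:
            self.deadline = now + self.window_s  # first request opens the window
        self.pending.append(request)

    def maybe_enqueue(self, now=None):
        """Returns the coalesced batch once the window has elapsed, else None."""
        now = time.monotonic() if now is None else now
        if self.pending and now >= self.deadline:
            batch, self.pending = self.pending, []
            return batch
        return None
```

The design choice mirrors classic NIC interrupt coalescing: a longer window trades per-request latency for fewer worker-thread wakeups.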
Communicating results from the CPU to the GPU: Once the CPU worker-thread finishes processing the system call, the results are put in the field for arguments in the slot for blocking requests. Further, the thread also changes the state of the slot to finished for blocking system calls. For non-blocking invocations, the state is changed to free. The invoking GPU work-item is then re-scheduled (on hardware that supports wavefront suspension) or automatically restarted because it was polling on this state. The work-item can consume the result and continue execution.
Other architectural design considerations: GENESYS requires two key data structures to be exchanged between GPUs and CPUs: the syscall area and, for some system calls, syscall buffers that maintain data required for system call processing. We discovered that carefully leveraging architectural support for CPU-GPU cache coherence in the context of these data structures was vital to overall performance.
Consider, for example, the syscall area. As described in Section III, every GPU work-item invoking a system call is allocated space in the syscall area. Like many GPUs, the one used as our experimental platform supports L2 data caches that are coherent with CPU caches/memory but integrates non-coherent L1 data caches. At first blush, one may decide to manually invalidate L1 data cache lines. However, we sidestepped this issue by restricting per-work-item slots in the syscall area to individual cache lines. This design permits us to use atomic instructions to access memory; these atomic instructions, by design, force lookups of the L2 data cache and guarantee visibility of the entire cache line, sidestepping the coherence challenges of L1 GPU data caches. We quantify the measured overheads of the atomic operations we use for GENESYS in Table IV, comparing them to the latency of a standard load operation. Through experimentation, we achieved good performance using cmp-swap atomics to claim a slot in the syscall area when GPU work-items invoked system calls, atomic-swaps to change state, and atomic-loads to poll for completion.
Unfortunately, the same approach of using atomics does not yield good performance for accesses to syscall buffers. The key issue is that syscall buffers can be large and span multiple cache lines. Using atomics here meant that we suffered the latency of several L2 data cache accesses to syscall buffers. We found that a better approach was to eschew atomics in favor of manual software L1 data cache coherence. This
Table IV. PROFILED PERFORMANCE OF GPU ATOMIC OPERATIONS

Op        | cmp-swap | swap  | atomic-load | load
Time (µs) | 1.352    | 0.782 | 0.360       | 0.243
Figure 7. Impact of system call invocation granularity. The graphs show average and standard deviation of 10 runs. (Left: GPU time (s) versus file size (MB) for work-item (wi), work-group (wg), and kernel invocation granularities. Right: GPU time (s) versus file size (MB) for work-group sizes of 64 (wg64), 256 (wg256), 512 (wg512), and 1024 (wg1024).)
meant, for example, that we preceded sys_write system calls with an L1 data cache flush.
VII. MICROBENCHMARK EVALUATIONS
Invocation granularity: Figure 7 quantifies performance for a microbenchmark that uses pread on files in tmpfs⁴. The x-axis plots the file size and the y-axis shows read time, with lower values being better. Within each cluster, we separate runtimes for different pread invocation granularities.
Figure 7 (left) shows that work-item invocation granularities tend to perform worst. This is not surprising, as it is the finest granularity of system call invocation and leads to a flood of individual system calls that overwhelm the CPU. On the other end of the granularity spectrum, kernel-level invocation is also challenging, as it generates a single system call and fails to leverage any potential parallelism in the processing of system call requests. This is particularly pronounced at large file sizes (e.g., 2GB). A good compromise is to use work-group invocation granularities. It does not overwhelm the CPU with system call requests while still being able to exploit parallelism in servicing system calls.
When using work-group invocation, an important question is how many work-items should constitute a work-group. While Figure 7 (left) uses 64 work-items in a work-group, Figure 7 (right) quantifies the performance impact of pread system calls as we vary work-group sizes from 64 (wg64) to 1024 (wg1024) work-items. In general, larger work-group sizes enable better performance, as there are fewer unique system calls necessary to handle the same amount of work.
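The granularity trade-off can be mimicked on a CPU: the same total read is split into many small preads (work-item-like), a few medium preads (work-group-like), or one large pread (kernel-like). This is a hedged host-side sketch using `os.pread`, not the GPU benchmark itself; the helper name and chunk sizes are illustrative.

```python
import os, tempfile

def read_at_granularity(fd, total, chunk):
    """Issue ceil(total/chunk) pread calls of up to `chunk` bytes each; a
    stand-in for work-item (small chunk), work-group (medium chunk), and
    kernel (chunk == total) invocation granularity."""
    data = bytearray()
    for off in range(0, total, chunk):
        data += os.pread(fd, min(chunk, total - off), off)
    return bytes(data)

# Hypothetical setup: a small scratch file standing in for the tmpfs file.
fd, path = tempfile.mkstemp()
payload = os.urandom(1 << 16)   # 64 KB
os.write(fd, payload)

fine   = read_at_granularity(fd, len(payload), 256)           # "work-item"-like
medium = read_at_granularity(fd, len(payload), 16 * 1024)     # "work-group"-like
coarse = read_at_granularity(fd, len(payload), len(payload))  # "kernel"-like
assert fine == medium == coarse == payload   # same data, different syscall counts

os.close(fd)
os.remove(path)
```

All three variants recover identical data; what differs is the number of system calls (256 here for the fine case versus 1 for the coarse case), which is exactly the knob Figure 7 varies.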
Blocking and ordering strategies: To quantify the impact of blocking/non-blocking with strong/relaxed ordering, we
⁴ Tmpfs is a filesystem without permanent backing storage. In other words, all structures and data are memory-resident.
Figure 8. Performance implications of system call blocking and ordering semantics. GPU time per permutation (ms) versus the number of compute permutations, for weak-block, weak-non-block, strong-block, and strong-non-block invocation. The graph shows average and standard deviation of 80 runs.
designed a GPU microbenchmark that performs block permutation on an array, similar to the permutation steps performed in DES encryption. The input data array is preloaded with random values and divided into 8KB blocks. Work-groups, each executing 1024 work-items, independently permute blocks. The results are written to a file using pwrite at work-group granularity. The pwrite system calls for one block of data are overlapped with permutations on other blocks of data. To vary the amount of computation per system call, we permute multiple times before writing the result.
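The structure of this benchmark (permute a block, then overlap its pwrite with the next block's permutation) can be sketched with a thread pool. This is a host-side model: the byte-rotation `permute` is an illustrative, reversible stand-in for the DES-like permutation, and the pool stands in for CPU-side system call servicing overlapping GPU compute.

```python
import os, tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 8192  # 8 KB blocks, as in the microbenchmark

def permute(block, iterations):
    """Illustrative stand-in for the DES-like permutation: rotate the block
    by one byte per iteration (data-independent and reversible)."""
    r = iterations % len(block)
    return block[r:] + block[:r]

def run(data, out_fd, iterations):
    """Per-'work-group' loop: permute a block, then hand its pwrite to a
    worker so the write overlaps with permuting the next block."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futs = []
        for i in range(0, len(data), BLOCK):
            block = permute(data[i:i + BLOCK], iterations)
            futs.append(pool.submit(os.pwrite, out_fd, block, i))
        for f in futs:
            f.result()   # surface any I/O errors
```

Raising `iterations` shifts the balance from syscall-dominated to compute-dominated, the x-axis that Figure 8 sweeps.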
Figure 8 quantifies the impact of using blocking versus non-blocking system calls with strong and weak ordering. The x-axis plots the number of permutation iterations performed on each block by each work-group before writing the results. The y-axis plots execution time for one permutation (lower is better). Figure 8 shows that strongly ordered blocking invocations (strong-block) hurt performance. This is expected, as they require work-group scoped barriers to be placed before and after pwrite invocations. The GPU's hardware resources for work-groups are stalled waiting for the pwrite to complete. Not only does the inability to overlap GPU parallelism hurt strongly ordered blocking performance, it also means that GPU performance is heavily influenced by CPU-side performance vagaries like synchronization overheads, servicing other processes, etc. This is particularly true at iteration counts where system call overheads dominate GPU-side computation, that is, below 15 compute iterations. Even when the application becomes GPU-compute bound, performance remains non-ideal.
Figure 8 shows that when pwrite is invoked in a non-blocking manner (with strong ordering), performance improves. This is because non-blocking pwrites permit the GPU to end the invoking work-group, freeing the GPU resources it was occupying. CPU-side pwrite processing can overlap with the execution of another work-group permuting a different block of data. Figure 8 shows that latencies generally drop by 30% compared to blocking calls at low iteration counts. At higher iteration counts (beyond 16), these benefits diminish because the latency to perform repeated
Figure 9. Impact of polling on memory contention: CPU memcpy throughput (GB/s) versus the number of GPU atomically polled cache lines, on an AMD FX-9800P with DDR4-1066.
permutations dominates any system call processing times. For relaxed ordering with blocking (weak-block), the post-system-call work-group-scope barrier is eliminated. One out of every 16 wavefronts⁵ in the work-group waits for the blocking system call while the others exit, freeing GPU resources. The GPU can use these freed resources to run other work-items to hide CPU-side system call latency. Consequently, the performance trends follow those of strong-non-block, with minor differences in performance arising from differences in GPU scheduling of the work-groups for execution. Finally, Figure 8 shows that system call latency is best hidden using relaxed and non-blocking approaches (weak-non-block).
Polling/halt-resume and memory contention: As previously discussed, polling at the work-item invocation granularity leads to memory reads of several thousands of per-work-item memory locations. We quantify the resulting memory contention in Figure 9, which showcases how the throughput of CPU accesses decreases as the number of polled GPU cache lines increases. Once the number of polled memory locations outstrips the GPU's L2 cache capacity (roughly 4096 cache lines on our platform), the GPU polls locations spilled to the DRAM. This causes contention on the memory controllers shared between GPUs and CPUs. We advocate using GENESYS with halt-resume approaches at such granularities of system call invocation.
Interrupt coalescing: Figure 10 shows the performance impact of coalescing system calls. We use a microbenchmark that invokes pread. We read data from files of different sizes using a constant number of work-items. More data is read per pread system call from the larger files. The x-axis shows the amount of data read, and the y-axis quantifies the latency per requested byte. We present two bars for each point on the x-axis, illustrating the average time needed to read one byte with the system call in the absence of coalescing and when up to eight system calls are coalesced. Coalescing is most beneficial when small amounts of data are read. When more data is read, the overhead reduction is negligible compared to the significantly longer time to process the system call itself.
⁵ Each wavefront has 64 work-items. Thus, a 1024 work-item work-group has 16 wavefronts.
Figure 10. Implications of system call coalescing: GPU time per requested byte (ms) versus size per syscall (B), uncoalesced versus coalesced (8 interrupts). The graph shows average and standard deviation of 20 runs.
VIII. CASE STUDIES
A. Memory Workload
We now assess the end-to-end performance of workloads that use GENESYS. Our first application requires memory management system calls. We studied miniAMR [48] and used the madvise system call directly from the GPU to better manage memory. MiniAMR performs 3D stencil calculations using adaptive mesh refinement and is a candidate for memory management because it varies its memory needs in a data-model-dependent manner. For example, when simulating regions experiencing turbulence, miniAMR needs to model with higher resolution. However, if lower resolution is possible without compromising simulation accuracy, miniAMR reduces memory and computation usage, making it possible to free memory. While relinquishing excess memory in this way is not possible in traditional GPU programs without explicitly dividing the offload into multiple kernels interspersed with CPU code (see Figure 1), GENESYS permits this with direct GPU madvise invocations. We invoke madvise at work-group granularity with non-blocking and weak ordering. We leverage GENESYS' generality by also using getrusage to read the application's resident set size (RSS). When the RSS exceeds a watermark, madvise relinquishes memory.
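The watermark policy can be sketched with the CPU-side equivalents of these calls. This is a hedged model, not the miniAMR code: `resource.getrusage` reports peak RSS in KB on Linux (a coarse stand-in for the RSS read in the case study), `mmap.madvise` requires Linux and Python 3.8+, and the helper names and watermark are illustrative.

```python
import mmap, resource

def rss_bytes():
    """Peak resident set size via getrusage; ru_maxrss is in KB on Linux."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024

def release_if_over(buf, watermark_bytes):
    """If RSS exceeds the watermark, hint to the kernel that buf's pages are
    unused: the role madvise plays in the miniAMR case study."""
    if rss_bytes() > watermark_bytes and hasattr(buf, "madvise"):
        buf.madvise(mmap.MADV_DONTNEED)  # Linux-only, Python >= 3.8
        return True
    return False

# A private anonymous mapping standing in for a refinement-level buffer.
buf = mmap.mmap(-1, 16 * 1024 * 1024,
                flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
buf[:] = b"\x42" * len(buf)                          # touch every page
released = release_if_over(buf, watermark_bytes=1)   # watermark far below RSS
```

After MADV_DONTNEED, the pages are dropped and refault as zero-fill on the next touch, which is exactly the page-fault cost versus footprint trade-off that the rss-3GB and rss-4GB watermarks expose.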
Figure 11. Memory footprint of miniAMR using getrusage and madvise to hint at unused memory: RSS (GB) versus time (s) for the rss-3GB and rss-4GB watermarks.
We execute miniAMR with a dataset of 4.1GB, just beyond the hard limit we put on the physical memory available to the GPU. Without using madvise, memory swapping increases
so dramatically that it triggers GPU timeouts, causing the existing GPU device driver to terminate the application. Because of this, there is no baseline to compare to, as the baseline simply does not complete.
Figure 11 shows two results: one for a 3GB RSS watermark and one for 4GB. Not only does GENESYS enable miniAMR to complete, it also permits the programmer to balance memory usage and performance. While rss-3GB lowers memory utilization, it also worsens runtime compared to rss-4GB. This performance gap is expected: the more memory is released, the greater the likelihood that the GPU program suffers from page faults when the memory is touched again in the future, and the more frequently the madvise system call is invoked. Overall, Figure 11 illustrates that GENESYS allows programmers to manage memory allocation to trade memory usage against performance on GPUs, analogous to CPUs.
B. Workload Using Signals
Figure 12. Runtime of the CPU-GPU map-reduce workload, baseline versus GENESYS.
GENESYS also enables system calls that permit the GPU to send asynchronous notifications to the CPU. This is useful in many scenarios. We study one such scenario and implement a map-reduce application called signal-search. The application runs in two phases. The first phase performs parallel lookup in a data array, while the second phase processes blocks of retrieved data and computes sha512 checksums on them. The first phase is a good fit for GPUs, since the highly parallel lookup fits their execution model, while the second phase is more appropriate for CPUs, many of which support performance acceleration of sha checksums via dedicated instructions.
Without support for signal invocation on GPUs, programmers would launch a kernel with the entire parallel lookup on the GPU and wait for it to conclude before proceeding with the sha512 checksums. GENESYS overcomes this restriction and permits a heterogeneous version of this code, where GPU work-groups can emit signals using rt_sigqueueinfo to the CPU, indicating per-block completions of the parallel search. As a result, the CPU can begin processing these blocks immediately, permitting overlap with GPU execution.
Operationally, rt_sigqueueinfo is a generic signal system call that allows the caller to fill a siginfo structure that is passed along with the signal. In our workload, we find that work-group level invocations perform best, so the GPU passes the identifier of this work-group through the siginfo structure. Figure 12 shows that using work-group invocation granularity and non-blocking invocation results in roughly a 14% performance speedup over the baseline.
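The notification pattern can be modeled entirely on the CPU. A hedged caveat: Python's signal module does not expose rt_sigqueueinfo's siginfo payload, so this sketch carries the block identifier in a shared queue alongside a plain SIGUSR1; the function names model the two sides and are not GENESYS APIs.

```python
import collections, os, signal

completed = collections.deque()  # stands in for the siginfo payload channel
drained = []

def on_completion(signum, frame):
    """CPU-side handler: start consuming blocks as soon as they arrive,
    overlapping with the (modeled) GPU search still in flight."""
    while completed:
        drained.append(completed.popleft())

signal.signal(signal.SIGUSR1, on_completion)

def gpu_work_group_done(block_id):
    """Models a work-group announcing per-block completion; GENESYS would
    instead carry block_id inside the siginfo of rt_sigqueueinfo."""
    completed.append(block_id)
    os.kill(os.getpid(), signal.SIGUSR1)

for b in (0, 1, 2):
    gpu_work_group_done(b)
```

Because standard signals are not queued, the handler drains every pending completion rather than assuming one signal per block; this matches the usual pattern for coalesced asynchronous notifications.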
C. Storage Workloads
We have also studied how to use GENESYS to support storage workloads. In some cases, GENESYS permits the implementation of workloads supported by GPUfs, but in more efficient ways.
Figure 13. a) Standard grep and OpenMP grep versus work-item/work-group invocations with GENESYS (running time in seconds for CPU-grep, GPU-GENESYS (WG), GPU-GENESYS (WI), GPU-GENESYS (WI)-suspend, and CPU-openmp); b) comparing the performance of CPU, GPU with no system calls (GPU-NOSYS), and GENESYS implementations of wordcount. Graphs show average and standard deviation of 10 runs.
Supporting storage workloads more efficiently than GPUfs: We implement a workload that takes as input a list of words and a list of files. It then performs grep -F -l on the GPU, identifying which files contain any of the words on the list. As soon as these files are found, they are printed to the console. This workload cannot be supported by GPUfs without significant code refactoring because of its use of custom APIs. Instead, since GENESYS naturally adheres to the UNIX "everything is a file" philosophy, porting grep to GPUs requires only a few hours of programming.
Figure 13 shows the results of our GPU grep experiments. We compare a standard CPU implementation, a parallelized OpenMP CPU implementation, and two GPU implementations with GENESYS, with non-blocking system calls invoked at work-item (WI) and work-group (WG) granularities. Furthermore, since work-item invocations can sometimes achieve better performance using halt-resume (versus work-group and kernel invocations, which always achieve better performance with polling), we separate results for WI-polling and WI-halt-resume. GENESYS enables our GPU grep implementation to print output lines to the console or files using standard OS output. GENESYS achieves 20-23× speedups over an OpenMP version of grep.
Figure 13 also shows that GENESYS' flexibility with invocation granularity, blocking/non-blocking, and strong/relaxed ordering can boost performance (see Section V). For our grep example, because a file only needs to be searched until the first instance of a matching word, a work-item can immediately invoke a write of the filename rather than waiting for all matching files. We find that WI-halt-resume outperforms both WG and WI-polling by roughly 3-4%.
Figure 14. Wordcount CPU utilization (%) and I/O throughput (MB/s) over time, reading from an SSD, for GPU-NOSYS, GPU-GENESYS, and CPU. Graphs show average and standard deviation of 10 runs.
Workload from prior work: GENESYS also supports a version of the same workload that is evaluated in the original GPUfs work: traditional word count, using open, read, and close system calls. Figure 13 shows our results. We compare the performance of a parallelized CPU version of the workload with OpenMP, a GPU version of the workload with no system calls, and a GPU version with GENESYS. Both CPU and GPU workloads are configured to search for occurrences of 64 strings. We found that with GENESYS these system calls were best invoked at work-group granularity with blocking and weak-ordering semantics. All results are collected on a system with an SSD.
We found that GENESYS achieves nearly a 6× performance improvement over the CPU version. Without system call support, the GPU version is far worse than the CPU version. Figure 14 sheds light on these benefits. We plot traces for CPU and GPU utilization and disk I/O throughput. GENESYS extracts much higher throughput from the underlying storage device (up to 170MB/s compared to the CPU's 30MB/s). Offloading search string processing to the GPU frees up the CPU to process system calls effectively. The change in CPU utilization between the GPU workload and CPU workload reveals this trend. In addition, we found that the GPU's ability to launch more concurrent I/O requests enabled the I/O scheduler to make better scheduling decisions.
D. Network Workloads
We have studied the benefits of GENESYS for network I/O in the context of memcached. While this can technically be implemented using GPUnet, the performance implications are unclear because the original GPUnet paper used APIs that extracted performance using dedicated RDMA hardware. We make no such assumptions and focus on the two core commands: SET and GET. SET stores a key-value pair, and GET retrieves a value associated with a key if it is present. Our implementation supports a binary UDP memcached version with a fixed-size hash table as back-end storage. The hash table is shared between CPU and GPU, and its bucket size and count are configurable. Further, this memcached
Figure 15. Latency (ms) and throughput (ops/s) of memcached for hits and misses, comparing CPU, GPU-GENESYS, and GPU-NOSYS. Graphs show average and standard deviation of 20 runs.
implementation enables concurrent operations: CPUs can handle SETs and GETs, while the GPU supports only GETs. Our GPU implementation parallelizes the hash computation, bucket lookup, and data copy. We use sendto and recvfrom system calls for UDP network access. These system calls are invoked at work-group granularity with blocking and weak ordering, as this performs best.
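The network legs of the GET path are plain sendto/recvfrom over UDP; a minimal loopback sketch follows. The one-request server, the fixed table, and the raw-key "protocol" are illustrative assumptions, not the paper's memcached wire format, but the two system calls are exactly the ones GENESYS lets a work-group invoke directly.

```python
import socket, threading

def serve_get(sock, table):
    """Models the memcached back-end for a single GET: receive a key,
    send back its value (or a miss marker)."""
    key, addr = sock.recvfrom(2048)
    sock.sendto(table.get(key, b"MISS"), addr)

# Hypothetical fixed table standing in for the shared CPU/GPU hash table.
table = {b"k1": b"v1"}

srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
srv.bind(("127.0.0.1", 0))   # ephemeral port on loopback
threading.Thread(target=serve_get, args=(srv, table), daemon=True).start()

cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cli.settimeout(2.0)
# In GENESYS, a designated work-item issues these same two calls at
# work-group granularity, with blocking and weak ordering.
cli.sendto(b"k1", srv.getsockname())
value, _ = cli.recvfrom(2048)
```

Because UDP is connectionless, each GET is a self-contained sendto/recvfrom pair, which is what makes work-group-granularity invocation natural here.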
Figure 15 compares the performance of a CPU version of this workload with GPU versions using GENESYS. GPUs accelerate memcached by parallelizing lookups on buckets with more elements. For example, Figure 15 shows speedups when there are 1024 elements per bucket (with 1KB data size). Without system calls, GPU performance lags behind CPU performance. However, GENESYS achieves 30-40% latency and throughput benefits over not just CPU versions but also GPU versions without direct system calls.
E. Device Control
Finally, we also used GENESYS to implement ioctl system calls. As an example, we used ioctl to query and control framebuffer settings. The implementation is straightforward: the GPU opens /dev/fb0 and issues a series of ioctl commands to query and set settings of the active frame buffer. It then proceeds to mmap the framebuffer memory and fill it with data from a previously mmaped raster image. This results in the image being displayed on the computer screen. While not a critical GPGPU application, this ioctl example demonstrates the generality and flexibility of the OS interfaces implemented by GENESYS.
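The query step of this sequence can be sketched with the CPU-side fbdev ioctl. FBIOGET_VSCREENINFO (0x4600) is the real Linux request number, and xres/yres are the first two 32-bit fields of struct fb_var_screeninfo; the helper name is illustrative, and the function degrades gracefully on headless machines with no /dev/fb0.

```python
import fcntl, os, struct

FBIOGET_VSCREENINFO = 0x4600  # Linux fbdev ioctl: query variable screen info

def framebuffer_resolution(path="/dev/fb0"):
    """Query (xres, yres) via ioctl, as the GPU does in the case study.
    Returns None if no framebuffer device is available (e.g., headless)."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError:
        return None
    try:
        buf = bytearray(160)  # struct fb_var_screeninfo; xres/yres lead it
        fcntl.ioctl(fd, FBIOGET_VSCREENINFO, buf)
        xres, yres = struct.unpack_from("II", buf, 0)
        return xres, yres
    except OSError:
        return None
    finally:
        os.close(fd)
```

The subsequent steps in the case study (mmap of the framebuffer and the raster copy) follow the same pattern: ordinary file descriptors and standard OS interfaces, with no GPU-specific API.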
IX. DISCUSSION
Asynchronous system call handling: GENESYS enqueues the GPU system call's kernel task and processes it outside of the interrupt context. We do this because Linux is designed such that few operations can be processed in an interrupt handler. A potential concern with this design, however, is that it defers system call processing, potentially past the end of the lifetime of the GPU thread, and potentially past that of the process that created the GPU thread itself. It is an example of a more general problem with asynchronous system calls [49].
Figure 16. Raster image copied to the framebuffer by the GPU.
Our solution is to provide a new function call, invoked by the CPU, that ensures all GPU system calls have completed before the termination of the process.
Related work: Beyond work already discussed [25, 26, 28], the latest generations of C++AMP [23] and CUDA [24] provide access to the memory allocator. These efforts use a user-mode service thread on the CPU to proxy requests from the GPU [27]. Like GENESYS, system call requests are placed in a shared queue by the GPU. From there, however, the designs are different. Their user-mode thread polls this shared queue and "thunks" the request to the libc runtime or OS. This incurs added overhead for entering and leaving the OS kernel.
Some studies provide network access to GPU code [26, 50-52]. NVidia provides GPUDirect [52], used by several MPI libraries [53-55], which allows the NIC to bypass main memory and communicate directly with memory on the GPU itself. GPUDirect does not provide a programming interface for GPU-side code; the CPU must initiate communication with the NIC. Oden exposed the memory-mapped control interface of the NIC to the GPU and thereby allowed the GPU to directly communicate with the NIC [51]. This low-level interface, however, lacks the benefits of a traditional OS interface (e.g., protection, sockets, TCP).
X. CONCLUSIONS
We shed light on research questions fundamental to the idea of accessing OS services from accelerators by realizing an interface for generic POSIX system call support on GPUs. Enabling such support requires subtle changes to existing kernels. In particular, traditional OSes assume that system call processing occurs in the same context as the invoking thread, and this needs to change for accelerators. We have released GENESYS to make these benefits accessible for broader research on GPUs.
XI. ACKNOWLEDGMENTS
We thank the National Science Foundation, which partially supported this work through grants 1253700 and 1337147. We thank Guilherme Cox and Mark Silberstein for their feedback. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a registered trademark of PCI-SIG Corporation. ARM is a registered trademark of ARM Limited. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. © 2018 Advanced Micro Devices, Inc. All rights reserved.
REFERENCES
[1] S Mittal and J Vetter ldquoA Survey of CPU-GPU HeterogeneousComputing Techniquesrdquo in ACM Computing Surveys 2015
[2] J Owens D Luebke N Govindaraju M Harris J Kruger A Lefohnand T Purcell ldquoA Survey of General-Purpose Computation onGraphics Hardwarerdquo in Computer Graphics Forum Vol 26 Num1 pp 80-113 2007
[3] D Tarditi S Puri and J Oglesby ldquoAcelerator Using Data Paral-lelism to Program GPUs for General-Purpose Usesrdquo in InternationalConference on Architectural Support for Programming Languages andOperating Systems (ASPLOS) 2006
[4] J Vesely A Basu M Oskin G Loh and A Bhattacharjee ldquoObser-vations and Opportunities in Architecting Shared Virtual Memory forHeterogeneous Systemsrdquo in International Symposium on PerformanceAnalysis of Systems and Software (ISPASS) 2016
[5] J Obert J V Waveran and G Sellers ldquoVirtual Texturing in Softwareand Hardwarerdquo in SIGGRAPH Courses 2012
[6] G Cox and A Bhattacharjee ldquoEfficient Address Translation with Mul-tiple Page Sizesrdquo in International Conference on Architectural Supportfor Programming Languages and Operating Systems (ASPLOS) 2017
[7] L Olson J Power M Hill and D Wood ldquoBorder Control Sandbox-ing Acceleratorsrdquo in International Symposium on Microarchitecture(MICRO) 2007
[8] B Pichai L Hsu and A Bhattacharjee ldquoArchitectural Supportfor Address Translation on GPUsrdquo in International Conference onArchitectural Support for Programming Languages and OperatingSystems (ASPLOS) 2014
[9] J Power M Hill and D Wood ldquoSupporting x86-64 AddressTranslation for 100s of GPU Lanesrdquo in International Symposiumon High Performance Computer Architecture (HPCA) 2014
[10] N Agarwal D Nellans E Ebrahimi T Wenisch J Danskin andS Keckler ldquoSelective GPU Caches to Eliminate CPU-GPU HWCache Coherencerdquo in International Symposium on High PerformanceComputer Architecture (HPCA) 2016
[11] H Foundation ldquoHSA Programmerrsquos Reference Manual [Online]rdquo 2015httpwwwhsafoundationcomddownload=4945
[12] G Kyriazis ldquoHeterogeneous System Architecture A Technical Reviewrdquo2012
[13] J Power A Basu J Gu S Puthoor B Beckmann M HillS Reinhardt and D Wood ldquoHeterogeneous System Coherencefor Integrated CPU-GPU Systemsrdquo in International Symposium onMicroarchitecture (MICRO) 2013
[14] S Sahar S Bergman and M Silberstein ldquoActive Pointers A Case forSoftware Address Translation on GPUsrdquo in International Symposiumon Computer Architecture (ISCA) 2016
[15] C J Thompson S Hahn and M Oskin ldquoUsing Modern GraphicsArchitectures for General-Purpose Computing A Framework andAnalysisrdquo in International Symposium on Microarchitecture (MICRO)2002
[16] J Owens U Kapasi P Mattson B Towles B Serebrin S Rixnerand W Dally ldquoMedia Processing Applications on the Imagine StreamProcessorrdquo in International Conference on Computer Design (ICCD)2002
[17] J Owens B Khailany B Towles and W Dally ldquoComparingReyes and OpenGL on a Stream Architecturerdquo in SIGGRAPHEUROGRAPHICS Conference on Graphics Hardware 2002
[18] S Rixner W Dally U Kapasi B Khailany A Lopez-LagunasP Mattson and J Owens ldquoA Bandwidth-Efficient Architecture for
Media Processingrdquo in International Symposium on Microarchitecture(MICRO) 1998
[19] B Serebrin J Owens B Khailany P Mattson U Kapasi C ChenJ Namkoong S Crago S Rixner and W Dally ldquoA Stream ProcessorDevelopment Platformrdquo in International Conference on ComputerDesign (ICCD) 2002
[20] K Group ldquoOpenGLrdquo 1992 httpswwwkhronosorgopengl[21] Microsoft ldquoProgramming Guide for DirectX [Online]rdquo 2016
httpsmsdnmicrosoftcomen-uslibrarywindowsdesktopff476455(v=vs85)aspx
[22] K Group ldquoOpenCLrdquo 2009 httpswwwkhronosorgopencl[23] Microsoft ldquoC++ AMP (C++ Accelerated Massive Parallelism) [On-
line]rdquo 2016 httpsmsdnmicrosoftcomen-uslibraryhh265137aspx[24] NVidia ldquoCUDA Toolkit Documentationrdquo 2016 rdquohttpdocsnvidia
comcudardquo[25] M Silberstein B Ford I Keidar and E Witchel ldquoGPUfs integrating
file systems with GPUsrdquo in International Conference on Architec-tural Support for Programming Languages and Operating Systems(ASPLOS) 2013
[26] S. Kim, S. Huh, X. Zhang, Y. Hu, A. Wated, E. Witchel, and M. Silberstein, "GPUnet: Networking Abstractions for GPU Programs," in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
[27] J. Owens, J. Stuart, and M. Cox, "GPU-to-CPU Callbacks," in Workshop on Unconventional High Performance Computing, 2010.
[28] S. Bergman, T. Brokhman, T. Cohen, and M. Silberstein, "SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs," in USENIX Annual Technical Conference (ATC), 2017.
[29] AMD, "Radeon Open Compute [Online]," 2017. https://rocm.github.io
[30] AMD, "ROCK syscall," 2017. https://github.com/RadeonOpenCompute/ROCK_syscall
[31] AMD, "ROCT syscall," 2017. https://github.com/RadeonOpenCompute/ROCT_syscall
[32] AMD, "HCC syscall," 2017. https://github.com/RadeonOpenCompute/HCC_syscall
[33] AMD, "Genesys syscall test," 2017. https://github.com/RadeonOpenCompute/Genesys_syscall_test
[34] R. Draves, B. Bershad, R. Rashid, and R. Dean, "Using Continuations to Implement Thread Management and Communication in Operating Systems," in Symposium on Operating Systems Principles (SOSP), 1991.
[35] O. Shivers and M. Might, "Continuations and Transducer Composition," in International Symposium on Programming Language Design and Implementation (PLDI), 2006.
[36] AMD, "Graphics Core Next 3.0 ISA [Online]," 2013. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf
[37] Intel, "Intel Open Source HD Graphics Programmer's Reference Manual [Online]," 2014. https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-bdw-vol02b-commandreference-instructions_0_0.pdf
[38] NVidia, "NVIDIA Compute PTX: Parallel Thread Execution," 2009. https://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf
[39] PCI-SIG, "Atomic Operations [Online]," 2008. https://pcisig.com/sites/default/files/specification_documents/ECN_Atomic_Ops_080417.pdf
[40] T. Sorensen, A. F. Donaldson, M. Batty, G. Gopalakrishnan, and Z. Rakamaric, "Portable Inter-Workgroup Barrier Synchronisation for GPUs," in Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), pp. 39–58, ACM, 2016.
[41] T. Rogers, M. O'Connor, and T. Aamodt, "Cache-Conscious Wavefront Scheduling," in International Symposium on Microarchitecture (MICRO), 2012.
[42] T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," in International Symposium on Microarchitecture (MICRO), 2013.
[43] A. ElTantawy and T. Aamodt, "Warp Scheduling for Fine-Grained Synchronization," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[44] H. Wang, F. Luo, M. Ibrahim, O. Kayiran, and A. Jog, "Efficient and Fair Multi-Programming in GPUs via Pattern-Based Effective Bandwidth Management," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[45] O. Kayiran, A. Jog, M. Kandemir, and C. Das, "Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs," in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013.
[46] X. Tang, A. Pattnaik, H. Jiang, O. Kayiran, A. Jog, S. Pai, M. Ibrahim, M. T. Kandemir, and C. R. Das, "Controlled Kernel Launch for Dynamic Parallelism in GPUs," in International Symposium on High Performance Computer Architecture (HPCA), 2017.
[47] AMD, "ROCm: a New Era in Open GPU Computing [Online]," 2016. https://radeonopencompute.github.io/index.html
[48] A. Sasidharan and M. Snir, "MiniAMR - A miniapp for Adaptive Mesh Refinement," 2014. http://hdl.handle.net/2142/91046
[49] Z. Brown, "Asynchronous System Calls," in Ottawa Linux Symposium, 2007.
[50] J. A. Stuart, P. Balaji, and J. D. Owens, "Extending MPI to Accelerators," in PACT 2011 Workshop Series: Architectures and Systems for Big Data, Oct. 2011.
[51] L. Oden, "Direct Communication Methods for Distributed GPUs," Dissertation Thesis, Ruprecht-Karls-Universität Heidelberg, 2015.
[52] NVidia, "NVIDIA GPUDirect [Online]," 2013. https://developer.nvidia.com/gpudirect
[53] H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, "GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation," IEEE Transactions on Parallel and Distributed Systems, vol. 25, pp. 2595–2605, Oct. 2014.
[54] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, "Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation," in Proceedings, 11th European PVM/MPI Users' Group Meeting, (Budapest, Hungary), pp. 97–104, September 2004.
[55] IBM, "IBM Spectrum MPI," 2017. http://www-03.ibm.com/systems/spectrum-computing/products/mpi
GENESYS uses a preallocated shared memory syscall area to allow the GPU to convey parameters to the CPU (see Section III). The syscall area maintains one slot for each active GPU work-item. The OS and driver code can identify the desired slot by using the hardware ID of the active work-item, which is available to GPU runtimes. This hardware ID is distinct from the programmer-visible work-item ID. Each of the work-items has a unique work-item ID visible to the application programmer. At any one point in time, however, only a subset of these work-items executes (as permitted by the GPU's CU count, supported work-groups per CU, and SIMD width). The hardware ID distinguishes among these active work-items. Overall, our system uses 64 bytes per slot, totaling 1.25 MB of syscall area.
Figure 5 shows the contents of a slot: the requested system call number, the request state, the system call arguments (as many as 6 in Linux), and padding to avoid false sharing. The field for arguments is also re-purposed for the return value of the system call. When a GPU program's work-item invokes a system call, it atomically updates the state of the corresponding slot from free to populating (Figure 6). If the slot is not free, system call invocation is delayed. Once the state is populating, the invoking work-item populates the slot with system call information and changes the state to ready. The work-item also adds one bit of information about whether the invocation is blocking or non-blocking. The work-item then interrupts the CPU using a scalar wavefront GPU instruction³ (s_sendmsg on AMD GPUs). For blocking invocations, the work-item either waits and polls the state of the slot or suspends itself using halt-resume.
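The slot layout and the free → populating → ready transition described above can be sketched in host-side C with C11 atomics. The struct fields, state names, and function below are illustrative stand-ins, not GENESYS's actual definitions; on the real system these operations are GPU atomic instructions.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative slot states mirroring the protocol in the text:
 * free -> populating -> ready -> processing -> finished. */
enum slot_state { SLOT_FREE, SLOT_POPULATING, SLOT_READY,
                  SLOT_PROCESSING, SLOT_FINISHED };

/* One 64-byte, cache-line-sized slot per active work-item. */
struct syscall_slot {
    _Atomic int state;       /* atomically updated request state      */
    int32_t     syscall_nr;  /* requested system call number          */
    int32_t     blocking;    /* one bit: blocking vs. non-blocking    */
    int64_t     args[6];     /* up to 6 args; reused for return value */
} __attribute__((aligned(64)));

/* GPU-side invocation, simulated on the host: claim the slot with a
 * compare-and-swap, fill it in, then mark it ready for the CPU. */
bool invoke_syscall(struct syscall_slot *slot, int nr,
                    const int64_t args[6], bool blocking)
{
    int expected = SLOT_FREE;
    /* If the slot is not free, invocation is delayed (here: fails). */
    if (!atomic_compare_exchange_strong(&slot->state, &expected,
                                        SLOT_POPULATING))
        return false;
    slot->syscall_nr = nr;
    slot->blocking = blocking;
    for (int i = 0; i < 6; i++)
        slot->args[i] = args[i];
    atomic_store(&slot->state, SLOT_READY); /* now visible to the CPU */
    return true;
}
```

The cmp-swap claim, state-changing swap, and polling load map onto the GPU atomics that the paper measures in Table IV; this sketch only illustrates the state machine.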
CPU-side system call processing. Once the GPU interrupts the CPU, system call processing commences. The interrupt handler creates a new kernel task and adds it to Linux's work-queue. This task is also passed the hardware ID of the requesting wavefront. At an expedient future point in time, an OS worker thread executes this task. The task scans the 64 syscall slots of the given hardware ID and atomically switches any ready system call requests to the processing state. The task then carries out the system call work.
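The worker-thread scan over a wavefront's 64 slots can be sketched as follows; this is a host-side model with illustrative names, not the GENESYS kernel code:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative slot states from the protocol in the text. */
enum { SLOT_FREE, SLOT_POPULATING, SLOT_READY, SLOT_PROCESSING };

#define WAVEFRONT_SIZE 64  /* one slot per work-item in the wavefront */

/* Scan the 64 slots belonging to one hardware wavefront ID and
 * atomically claim every fully populated (ready) request. The CAS
 * from ready to processing skips free/populating slots and avoids
 * racing with work-items that are still filling their slots.
 * Returns the number of requests claimed for processing. */
size_t claim_ready_slots(_Atomic int slots[WAVEFRONT_SIZE])
{
    size_t claimed = 0;
    for (size_t i = 0; i < WAVEFRONT_SIZE; i++) {
        int expected = SLOT_READY;
        if (atomic_compare_exchange_strong(&slots[i], &expected,
                                           SLOT_PROCESSING))
            claimed++;
    }
    return claimed;
}
```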
A challenge is that Linux's traditional system call routines implicitly assume that they are executed within the context of the original process invoking the system call. Consequently, they use context-specific variables (e.g., the current variable used to identify the process context). This presents a challenge for GPU system calls, which are instead serviced purely in the context of the OS's worker thread. GENESYS overcomes this challenge in two ways: it either switches to the context of the original CPU program that invoked the GPU kernel, or it provides additional context information in the code performing system call processing.
³Scalar wavefront instructions are part of the Graphics Core Next ISA and are executed once for the entire wavefront rather than for each active work-item. See Chapter 4.1 in the manual [36].
The exact strategy is determined on a case-by-case basis.

GENESYS implements coalescing by waiting for a predetermined amount of time in the interrupt handler before enqueueing a task to process a system call. If multiple requests are made to the CPU during this time window, they are coalesced so that they can be handled as a single unit of work by the OS worker thread. GENESYS uses Linux's sysfs interface to communicate the coalescing parameters.
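The effect of the coalescing window can be illustrated with a small host-side model: requests whose arrival times fall inside the window of the first pending request are merged into one batch handled by a single worker-thread task. The function name and window representation are illustrative; GENESYS reads the real parameters from sysfs.

```c
#include <stddef.h>

/* Count how many worker-thread tasks (batches) a sorted sequence of
 * request arrival times (in microseconds) produces when requests
 * within `window_us` of a batch's first request are coalesced. */
size_t count_batches(const unsigned long arrival_us[], size_t n,
                     unsigned long window_us)
{
    size_t batches = 0;
    size_t i = 0;
    while (i < n) {
        unsigned long start = arrival_us[i];
        batches++;                             /* one task per window */
        while (i < n && arrival_us[i] - start <= window_us)
            i++;                               /* coalesced into batch */
    }
    return batches;
}
```

With a burst of requests close together, a modest window collapses many interrupts into one unit of CPU work, which is the behavior Figure 10 measures.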
Communicating results from the CPU to the GPU. Once the CPU worker thread finishes processing the system call, the results are placed in the arguments field of the slot for blocking requests. Further, the thread changes the state of the slot to finished for blocking system calls. For non-blocking invocations, the state is changed to free. The invoking GPU work-item is then re-scheduled (on hardware that supports wavefront suspension) or, because it was polling on this state, automatically resumes. The work-item can consume the result and continue execution.
Other architectural design considerations. GENESYS requires two key data structures to be exchanged between GPUs and CPUs: the syscall area and, for some system calls, syscall buffers that maintain data required for system call processing. We discovered that carefully leveraging architectural support for CPU-GPU cache coherence in the context of these data structures was vital to overall performance.
Consider, for example, the syscall area. As described in Section III, every GPU work-item invoking a system call is allocated space in the syscall area. Like many GPUs, the one used as our experimental platform supports L2 data caches that are coherent with CPU caches/memory but integrates non-coherent L1 data caches. At first blush, one may decide to manually invalidate L1 data cache lines. However, we sidestepped this issue by restricting per-work-item slots in the syscall area to individual cache lines. This design permits us to use atomic instructions to access memory; these atomic instructions, by design, force lookups of the L2 data cache and guarantee visibility of the entire cacheline, sidestepping the coherence challenges of L1 GPU data caches. We quantify the measured overheads of the atomic operations we use for GENESYS in Table IV, comparing them to the latency of a standard load operation. Through experimentation, we achieved good performance using cmp-swap atomics to claim a slot in the syscall area when GPU work-items invoke system calls, atomic-swaps to change state, and atomic-loads to poll for completion.
Unfortunately, the same approach of using atomics does not yield good performance for accesses to syscall buffers. The key issue is that syscall buffers can be large and span multiple cache lines. Using atomics here meant that we suffered the latency of several L2 data cache accesses to syscall buffers. We found that a better approach was to eschew atomics in favor of manual software L1 data cache coherence. This
Table IV
PROFILED PERFORMANCE OF GPU ATOMIC OPERATIONS

Op        | cmp-swap | swap  | atomic-load | load
Time (µs) | 1.352    | 0.782 | 0.360       | 0.243
Figure 7. Impact of system call invocation granularity. Left: GPU time (s) vs. file size (MB) for work-item (wi), work-group (wg), and kernel invocation granularities. Right: GPU time (s) vs. file size (MB) for work-group sizes of 64 (wg64), 256 (wg256), 512 (wg512), and 1024 (wg1024) work-items. The graphs show average and standard deviation of 10 runs.
meant, for example, that we preceded sys_write system calls with an L1 data cache flush.

VII. MICROBENCHMARK EVALUATIONS
Invocation granularity. Figure 7 quantifies performance for a microbenchmark that uses pread on files in tmpfs⁴. The x-axis plots the file size, and the y-axis shows read time, with lower values being better. Within each cluster, we separate runtimes for different pread invocation granularities.
Figure 7 (left) shows that work-item invocation granularities tend to perform worst. This is not surprising, as it is the finest granularity of system call invocation and leads to a flood of individual system calls that overwhelm the CPU. On the other end of the granularity spectrum, kernel-level invocation is also challenging, as it generates a single system call and fails to leverage any potential parallelism in the processing of system call requests. This is particularly pronounced at large file sizes (e.g., 2GB). A good compromise is to use work-group invocation granularities. It does not overwhelm the CPU with system call requests while still being able to exploit parallelism in servicing system calls.
When using work-group invocation, an important question is how many work-items should constitute a work-group. While Figure 7 (left) uses 64 work-items in a work-group, Figure 7 (right) quantifies the performance impact of pread system calls as we vary work-group sizes from 64 (wg64) to 1024 (wg1024) work-items. In general, larger work-group sizes enable better performance, as fewer unique system calls are necessary to handle the same amount of work.
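The reason fewer system calls suffice at coarser granularities is simple arithmetic; the enum and function below are illustrative, not part of GENESYS:

```c
#include <stddef.h>

enum granularity { PER_WORK_ITEM, PER_WORK_GROUP, PER_KERNEL };

/* Number of system call invocations needed to service `total_items`
 * work-items grouped into work-groups of `wg_size`, at each of the
 * invocation granularities discussed in the text. */
size_t syscalls_needed(size_t total_items, size_t wg_size,
                       enum granularity g)
{
    switch (g) {
    case PER_WORK_ITEM:
        return total_items;                          /* one per item  */
    case PER_WORK_GROUP:
        return (total_items + wg_size - 1) / wg_size; /* one per group */
    case PER_KERNEL:
        return 1;                                    /* one per kernel */
    }
    return 0;
}
```

For a million work-items and 1024-item work-groups, work-group invocation issues 1024 calls rather than a million (or a single one), which is the balance between CPU load and service parallelism that Figure 7 quantifies.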
Blocking and ordering strategies. To quantify the impact of blocking/non-blocking invocation with strong/relaxed ordering, we

⁴Tmpfs is a filesystem without permanent backing storage. In other words, all structures and data are memory-resident.
Figure 8. Performance implications of system call blocking and ordering semantics: GPU time per permutation (ms) vs. the number of compute permutations, for weak-block, weak-non-block, strong-block, and strong-non-block invocations. The graph shows average and standard deviation of 80 runs.
designed a GPU microbenchmark that performs block permutation on an array, similar to the permutation steps performed in DES encryption. The input data array is preloaded with random values and divided into 8KB blocks. Work-groups, each executing 1024 work-items, independently permute blocks. The results are written to a file using pwrite at work-group granularity. The pwrite system calls for one block of data are overlapped with permutations on other blocks of data. To vary the amount of computation per system call, we permute multiple times before writing the result.
Figure 8 quantifies the impact of using blocking versus non-blocking system calls with strong and weak ordering. The x-axis plots the number of permutation iterations performed on each block by each work-group before writing the results. The y-axis plots execution time for one permutation (lower is better). Figure 8 shows that strongly ordered blocking invocations (strong-block) hurt performance. This is expected, as they require work-group-scoped barriers to be placed before and after pwrite invocations. The GPU's hardware resources for work-groups are stalled waiting for the pwrite to complete. Not only does the inability to overlap GPU parallelism hurt strongly ordered blocking performance, it also means that GPU performance is heavily influenced by CPU-side performance vagaries like synchronization overheads, servicing other processes, etc. This is particularly true at iteration counts where system call overheads dominate GPU-side computation, below 15 compute iterations. Even when the application becomes GPU-compute bound, performance remains non-ideal.
Figure 8 shows that when pwrite is invoked in a non-blocking manner (with strong ordering), performance improves. This is because non-blocking pwrites permit the GPU to end the invoking work-group, freeing the GPU resources it was occupying. CPU-side pwrite processing can overlap with the execution of another work-group permuting a different block of data. Figure 8 shows that latencies generally drop by 30% compared to blocking calls at low iteration counts. At higher iteration counts (beyond 16), these benefits diminish because the latency to perform repeated
Figure 9. Impact of polling on memory contention: CPU memcpy throughput (GB/s) vs. the number of GPU atomically polled cachelines, measured on an AMD FX-9800P with DDR4-1066 memory.
permutations dominates any system call processing times.

For relaxed ordering with blocking (weak-block), the post-system-call work-group-scope barrier is eliminated. One out of every 16 wavefronts⁵ in the work-group waits for the blocking system call while the others exit, freeing GPU resources. The GPU can use these freed resources to run other work-items to hide CPU-side system call latency. Consequently, the performance trends follow those of strong-non-block, with minor differences in performance arising from differences in GPU scheduling of the work-groups for execution. Finally, Figure 8 shows that system call latency is best hidden using relaxed and non-blocking approaches (weak-non-block).
Polling/halt-resume and memory contention. As previously discussed, polling at the work-item invocation granularity leads to memory reads of several thousands of per-work-item memory locations. We quantify the resulting memory contention in Figure 9, which shows how the throughput of CPU accesses decreases as the number of polled GPU cache lines increases. Once the number of polled memory locations outstrips the GPU's L2 cache capacity (roughly 4096 cache lines on our platform), the GPU polls locations spilled to DRAM. This causes contention on the memory controllers shared between GPUs and CPUs. We advocate using GENESYS with halt-resume approaches at such granularities of system call invocation.
Interrupt coalescing. Figure 10 shows the performance impact of coalescing system calls. We use a microbenchmark that invokes pread. We read data from files of different sizes using a constant number of work-items; more data is read per pread system call from the larger files. The x-axis shows the amount of data read and quantifies the latency per requested byte. We present two bars for each point on the x-axis, illustrating the average time needed to read one byte with the system call in the absence of coalescing and when up to eight system calls are coalesced. Coalescing is most beneficial when small amounts of data are read. When more data is read, the overhead reduction is negligible compared to the significantly longer time needed to process the system call itself.
⁵Each wavefront has 64 work-items. Thus, a 1024 work-item work-group has 16 wavefronts.
Figure 10. Implications of system call coalescing: GPU time per requested byte (ms) vs. size per syscall (B), uncoalesced versus coalesced (8 interrupts). The graph shows average and standard deviation of 20 runs.
VIII. CASE STUDIES

A. Memory Workload
We now assess the end-to-end performance of workloads that use GENESYS. Our first application requires memory management system calls. We studied miniAMR [48] and used the madvise system call directly from the GPU to better manage memory. MiniAMR performs 3D stencil calculations using adaptive mesh refinement and is a candidate for memory management because it varies its memory needs in a data-model-dependent manner. For example, when simulating regions experiencing turbulence, miniAMR needs to model them with higher resolution. However, if lower resolution is possible without compromising simulation accuracy, miniAMR reduces memory and computation usage, making it possible to free memory. While relinquishing excess memory in this way is not possible in traditional GPU programs without explicitly dividing the offload into multiple kernels interspersed with CPU code (see Figure 1), GENESYS permits this with direct GPU madvise invocations. We invoke madvise at work-group granularity with non-blocking invocation and weak ordering. We leverage GENESYS's generality by also using getrusage to read the application's resident set size (RSS). When the RSS exceeds a watermark, madvise relinquishes memory.
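The watermark policy can be expressed on the host with the same two system calls; this Linux-only sketch (the helper names and watermark are illustrative, and getrusage exposes the peak RSS that Linux reports in ru_maxrss) mirrors what the GPU-side GENESYS invocations do directly.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/resource.h>
#include <stddef.h>

/* Read the process resident set size in kilobytes via getrusage.
 * On Linux, ru_maxrss is the peak RSS in KB. */
long rss_kb(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_maxrss;
}

/* If RSS exceeds the watermark, hint that `len` bytes at `buf` are
 * unneeded so the kernel may reclaim them (MADV_DONTNEED). Returns
 * 0 on success or when below the watermark, -1 on failure. */
int relinquish_if_over(void *buf, size_t len, long watermark_kb)
{
    long rss = rss_kb();
    if (rss < 0)
        return -1;
    if (rss <= watermark_kb)
        return 0;              /* below the watermark: keep memory */
    return madvise(buf, len, MADV_DONTNEED);
}
```

For a private anonymous mapping, MADV_DONTNEED drops the pages immediately; subsequent touches refault zero-filled pages, which is why releasing memory too aggressively (the rss-3GB run below) costs performance.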
Figure 11. Memory footprint of miniAMR using getrusage and madvise to hint at unused memory: RSS (GB) vs. time (s) for the rss-3GB and rss-4GB watermarks.
We execute miniAMR with a dataset of 4.1GB, just beyond the hard limit we put on the physical memory available to the GPU. Without madvise, memory swapping increases so dramatically that it triggers GPU timeouts, causing the existing GPU device driver to terminate the application. Because of this, there is no baseline to compare to, as the baseline simply does not complete.
Figure 11 shows two results, one for a 3GB RSS watermark and one for 4GB. Not only does GENESYS enable miniAMR to complete, it also permits the programmer to balance memory usage and performance. While rss-3GB lowers memory utilization, it also worsens runtime compared to rss-4GB. This performance gap is expected: the more memory is released, the greater the likelihood that the GPU program suffers from page faults when the memory is touched again in the future, and the more frequently the madvise system call is invoked. Overall, Figure 11 illustrates that GENESYS allows programmers to use memory management system calls to trade memory usage for performance on GPUs, analogous to CPUs.
B. Workload Using Signals
Figure 12. Running time (s) of the CPU-GPU map-reduce workload: baseline versus GENESYS.
GENESYS also enables system calls that permit the GPU to send asynchronous notifications to the CPU. This is useful in many scenarios. We study one such scenario and implement a map-reduce application called signal-search. The application runs in two phases. The first phase performs a parallel lookup in a data array, while the second phase processes blocks of retrieved data and computes sha512 checksums on them. The first phase is a good fit for GPUs, since the highly parallel lookup fits their execution model, while the second phase is more appropriate for CPUs, many of which support performance acceleration of sha checksums via dedicated instructions.
Without support for signal invocation on GPUs, programmers would launch a kernel with the entire parallel lookup on the GPU and wait for it to conclude before proceeding with the sha512 checksums. GENESYS overcomes this restriction and permits a heterogeneous version of this code, where GPU work-groups can emit signals using rt_sigqueueinfo to the CPU, indicating per-block completions of the parallel search. As a result, the CPU can begin processing these blocks immediately, permitting overlap with GPU execution.
Operationally, rt_sigqueueinfo is a generic signal system call that allows the caller to fill a siginfo structure that is passed along with the signal. In our workload, we find that work-group level invocations perform best, so the GPU passes the identifier of the work-group through the siginfo structure. Figure 12 shows that using work-group invocation granularity and non-blocking invocation results in roughly a 14% performance speedup over the baseline.
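The signal-with-payload pattern can be sketched on the host with sigqueue, the libc wrapper over rt_sigqueueinfo. This self-signaling example (the function name is illustrative; in GENESYS the signal is raised on behalf of the GPU) shows a work-group identifier traveling in the siginfo payload:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <unistd.h>

/* Send a real-time signal carrying a work-group identifier in the
 * siginfo payload, then receive it synchronously and return the
 * identifier carried in si_value. */
int send_and_receive_wg_id(int wg_id)
{
    sigset_t set;
    siginfo_t info;
    union sigval val = { .sival_int = wg_id };

    sigemptyset(&set);
    sigaddset(&set, SIGRTMIN);
    /* Block the signal so it queues instead of running a handler. */
    if (sigprocmask(SIG_BLOCK, &set, NULL) != 0)
        return -1;
    if (sigqueue(getpid(), SIGRTMIN, val) != 0)   /* emit the signal */
        return -1;
    if (sigwaitinfo(&set, &info) != SIGRTMIN)     /* consume it      */
        return -1;
    return info.si_value.sival_int;   /* work-group id from siginfo */
}
```

In the actual workload the receiver is a CPU thread that starts checksumming the block named by the payload while the GPU search continues.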
C. Storage Workloads
We have also studied how to use GENESYS to support storage workloads. In some cases, GENESYS permits the implementation of workloads supported by GPUfs, but in more efficient ways.
Figure 13. a) Standard grep and OpenMP grep versus GENESYS work-item/work-group invocations (CPU-grep, GPU-GENESYS (WG), GPU-GENESYS (WI), GPU-GENESYS (WI)-suspend, CPU-openmp); b) running time of CPU, GPU with no system calls (GPU-NOSYS), and GPU-GENESYS implementations of wordcount. Graphs show average and standard deviation of 10 runs.
Supporting storage workloads more efficiently than GPUfs. We implement a workload that takes as input a list of words and a list of files. It then performs grep -F -l on the GPU, identifying which files contain any of the words on the list. As soon as these files are found, they are printed to the console. This workload cannot be supported by GPUfs without significant code refactoring because of its use of custom APIs. Instead, since GENESYS naturally adheres to the UNIX "everything is a file" philosophy, porting grep to GPUs requires only a few hours of programming.
Figure 13 shows the results of our GPU grep experiments. We compare a standard CPU implementation, a parallelized OpenMP CPU implementation, and two GPU implementations with GENESYS, with non-blocking system calls invoked at work-item (WI) and work-group (WG) granularities. Furthermore, since work-item invocations can sometimes achieve better performance using halt-resume (versus work-group and kernel invocations, which always achieve better performance with polling), we separate results for WI-polling and WI-halt-resume. GENESYS enables our GPU grep implementation to print output lines to the console or files using standard OS output. GENESYS achieves 20-23× speedups over an OpenMP version of grep.
Figure 13 also shows that GENESYS's flexibility with invocation granularity, blocking/non-blocking invocation, and strong/relaxed ordering can boost performance (see Section V). For our grep example, because a file only needs to be searched until the first instance of a matching word, a work-item can immediately invoke a write of the filename rather than waiting for all matching files. We find that WI-halt-resume outperforms both WG and WI-polling by roughly 3-4%.
Figure 14. Wordcount I/O throughput (MB/s) and CPU utilization (%) over time, reading from an SSD, for GPU-NOSYS, GPU-GENESYS, and CPU. Graphs show average and standard deviation of 10 runs.
Workload from prior work. GENESYS also supports a version of the workload evaluated in the original GPUfs work: traditional word count, using the open, read, and close system calls. Figure 13 shows our results. We compare the performance of a parallelized CPU version of the workload with OpenMP, a GPU version of the workload with no system calls, and a GPU version with GENESYS. Both CPU and GPU workloads are configured to search for occurrences of 64 strings. We found that with GENESYS, these system calls were best invoked at work-group granularity with blocking and weak-ordering semantics. All results are collected on a system with an SSD.
We found that GENESYS achieves nearly a 6× performance improvement over the CPU version. Without system call support, the GPU version is far worse than the CPU version. Figure 14 sheds light on these benefits. We plot traces for CPU and GPU utilization and disk I/O throughput. GENESYS extracts much higher throughput from the underlying storage device (up to 170MB/s, compared to the CPU's 30MB/s). Offloading search string processing to the GPU frees up the CPU to process system calls effectively. The change in CPU utilization between the GPU workload and the CPU workload reveals this trend. In addition, we found that the GPU's ability to launch more concurrent I/O requests enabled the I/O scheduler to make better scheduling decisions.
D. Network Workloads
We have studied the benefits of GENESYS for network I/O in the context of memcached. While this can technically be implemented using GPUnet, the performance implications are unclear because the original GPUnet paper used APIs that extracted performance using dedicated RDMA hardware. We make no such assumptions and focus on the two core commands, SET and GET. SET stores a key-value pair, and GET retrieves the value associated with a key if it is present. Our implementation supports a binary UDP memcached version with a fixed-size hash table as back-end storage. The hash table is shared between the CPU and GPU, and its bucket size and count are configurable. Further, this memcached
Figure 15. Latency (ms) and throughput (ops/s) of memcached for hits and misses, comparing CPU, GPU-GENESYS, and GPU-NOSYS. Graphs show average and standard deviation of 20 runs.
implementation enables concurrent operations: CPUs can handle SETs and GETs, while the GPU supports only GETs. Our GPU implementation parallelizes the hash computation, bucket lookup, and data copy. We use the sendto and recvfrom system calls for UDP network access. These system calls are invoked at work-group granularity with blocking and weak ordering, as this performs best.
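As a host-side illustration of the socket pattern above (not GENESYS code, and with the memcached framing omitted), the following sketch binds a UDP socket on the loopback interface and round-trips a datagram with the same sendto/recvfrom calls the GPU forwards through GENESYS:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind a UDP socket on 127.0.0.1, send `msg` to it from a second
 * socket with sendto, read it back with recvfrom into `out`
 * (capacity `outlen`), and return the number of bytes received. */
ssize_t udp_round_trip(const char *msg, size_t len,
                       char *out, size_t outlen)
{
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);
    if (rx < 0 || tx < 0)
        return -1;

    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                       /* let the kernel pick */
    socklen_t alen = sizeof(addr);

    ssize_t n = -1;
    if (bind(rx, (struct sockaddr *)&addr, alen) == 0 &&
        getsockname(rx, (struct sockaddr *)&addr, &alen) == 0 &&
        sendto(tx, msg, len, 0, (struct sockaddr *)&addr, alen) ==
            (ssize_t)len)
        n = recvfrom(rx, out, outlen, 0, NULL, NULL);
    close(tx);
    close(rx);
    return n;
}
```

In the workload, the datagram payload would be a binary memcached GET request, parsed and answered by GPU work-groups.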
Figure 15 compares the performance of a CPU version of this workload with GPU versions using GENESYS. GPUs accelerate memcached by parallelizing lookups on buckets with more elements. For example, Figure 15 shows speedups when there are 1024 elements per bucket (with 1KB data size). Without system calls, GPU performance lags behind CPU performance. However, GENESYS achieves 30-40% latency and throughput benefits over not just the CPU versions but also the GPU versions without direct system calls.
E. Device Control
Finally, we also used GENESYS to implement ioctl system calls. As an example, we used ioctl to query and control framebuffer settings. The implementation is straightforward: the GPU opens /dev/fb0 and issues a series of ioctl commands to query and set the settings of the active frame buffer. It then proceeds to mmap the framebuffer memory and fill it with data from a previously mmaped raster image. This results in the image being displayed on the computer screen. While not a critical GPGPU application, this ioctl example demonstrates the generality and flexibility of the OS interfaces implemented by GENESYS.
IX. DISCUSSION
Asynchronous system call handling. GENESYS enqueues the GPU system call's kernel task and processes it outside of the interrupt context. We do this because Linux is designed such that few operations can be processed in an interrupt handler. A potential concern with this design, however, is that it defers system call processing to a point potentially past the end of the lifetime of the GPU thread, and potentially of the process that created the GPU thread itself. This is an example of a more general problem with asynchronous system calls [49].
Figure 16. Raster image copied to the framebuffer by the GPU.
Our solution is to provide a new function call, invoked by the CPU, that ensures all GPU system calls have completed before the termination of the process.
Related work. Beyond work already discussed [25, 26, 28], the latest generations of C++AMP [23] and CUDA [24] provide access to the memory allocator. These efforts use a user-mode service thread on the CPU to proxy requests from the GPU [27]. Like GENESYS, system call requests are placed in a shared queue by the GPU. From there, however, the designs are different. Their user-mode thread polls this shared queue and "thunks" the request to the libc runtime or OS. This incurs added overhead for entering and leaving the OS kernel.
Some studies provide network access to GPU code [26, 50–52]. NVidia provides GPUDirect [52], used by several MPI libraries [53–55], which allows the NIC to bypass main memory and communicate directly with memory on the GPU itself. GPUDirect does not provide a programming interface for GPU-side code; the CPU must initiate communication with the NIC. Oden exposed the memory-mapped control interface of the NIC to the GPU and thereby allowed the GPU to directly communicate with the NIC [51]. This low-level interface, however, lacks the benefits of a traditional OS interface (e.g., protection, sockets, TCP).
X. CONCLUSIONS
We shed light on research questions fundamental to the idea of accessing OS services from accelerators by realizing an interface for generic POSIX system call support on GPUs. Enabling such support requires subtle changes to existing kernels. In particular, traditional OSes assume that system call processing occurs in the same context as the invoking thread, and this needs to change for accelerators. We have released GENESYS to make these benefits accessible for broader research on GPUs.
XI. ACKNOWLEDGMENTS
We thank the National Science Foundation, which partially supported this work through grants 1253700 and 1337147. We thank Guilherme Cox and Mark Silberstein for their feedback. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a registered trademark of PCI-SIG Corporation. ARM is a registered trademark of ARM Limited. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. © 2018 Advanced Micro Devices, Inc. All rights reserved.
REFERENCES
[1] S. Mittal and J. Vetter, "A Survey of CPU-GPU Heterogeneous Computing Techniques," ACM Computing Surveys, 2015.
[2] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. Lefohn, and T. Purcell, "A Survey of General-Purpose Computation on Graphics Hardware," Computer Graphics Forum, vol. 26, no. 1, pp. 80–113, 2007.
[3] D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006.
[4] J. Vesely, A. Basu, M. Oskin, G. Loh, and A. Bhattacharjee, "Observations and Opportunities in Architecting Shared Virtual Memory for Heterogeneous Systems," in International Symposium on Performance Analysis of Systems and Software (ISPASS), 2016.
[5] J. Obert, J. V. Waveran, and G. Sellers, "Virtual Texturing in Software and Hardware," in SIGGRAPH Courses, 2012.
[6] G. Cox and A. Bhattacharjee, "Efficient Address Translation with Multiple Page Sizes," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017.
[7] L. Olson, J. Power, M. Hill, and D. Wood, "Border Control: Sandboxing Accelerators," in International Symposium on Microarchitecture (MICRO), 2015.
[8] B. Pichai, L. Hsu, and A. Bhattacharjee, "Architectural Support for Address Translation on GPUs," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014.
[9] J. Power, M. Hill, and D. Wood, "Supporting x86-64 Address Translation for 100s of GPU Lanes," in International Symposium on High Performance Computer Architecture (HPCA), 2014.
[10] N. Agarwal, D. Nellans, E. Ebrahimi, T. Wenisch, J. Danskin, and S. Keckler, "Selective GPU Caches to Eliminate CPU-GPU HW Cache Coherence," in International Symposium on High Performance Computer Architecture (HPCA), 2016.
[11] HSA Foundation, "HSA Programmer's Reference Manual [Online]," 2015. http://www.hsafoundation.com/?ddownload=4945
[12] G. Kyriazis, "Heterogeneous System Architecture: A Technical Review," 2012.
[13] J. Power, A. Basu, J. Gu, S. Puthoor, B. Beckmann, M. Hill, S. Reinhardt, and D. Wood, "Heterogeneous System Coherence for Integrated CPU-GPU Systems," in International Symposium on Microarchitecture (MICRO), 2013.
[14] S. Sahar, S. Bergman, and M. Silberstein, "Active Pointers: A Case for Software Address Translation on GPUs," in International Symposium on Computer Architecture (ISCA), 2016.
[15] C. J. Thompson, S. Hahn, and M. Oskin, "Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis," in International Symposium on Microarchitecture (MICRO), 2002.
[16] J. Owens, U. Kapasi, P. Mattson, B. Towles, B. Serebrin, S. Rixner, and W. Dally, "Media Processing Applications on the Imagine Stream Processor," in International Conference on Computer Design (ICCD), 2002.
[17] J. Owens, B. Khailany, B. Towles, and W. Dally, "Comparing Reyes and OpenGL on a Stream Architecture," in SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2002.
[18] S. Rixner, W. Dally, U. Kapasi, B. Khailany, A. Lopez-Lagunas, P. Mattson, and J. Owens, "A Bandwidth-Efficient Architecture for Media Processing," in International Symposium on Microarchitecture (MICRO), 1998.
[19] B. Serebrin, J. Owens, B. Khailany, P. Mattson, U. Kapasi, C. Chen, J. Namkoong, S. Crago, S. Rixner, and W. Dally, "A Stream Processor Development Platform," in International Conference on Computer Design (ICCD), 2002.
[20] K Group ldquoOpenGLrdquo 1992 httpswwwkhronosorgopengl[21] Microsoft ldquoProgramming Guide for DirectX [Online]rdquo 2016
httpsmsdnmicrosoftcomen-uslibrarywindowsdesktopff476455(v=vs85)aspx
[22] K Group ldquoOpenCLrdquo 2009 httpswwwkhronosorgopencl[23] Microsoft ldquoC++ AMP (C++ Accelerated Massive Parallelism) [On-
line]rdquo 2016 httpsmsdnmicrosoftcomen-uslibraryhh265137aspx[24] NVidia ldquoCUDA Toolkit Documentationrdquo 2016 rdquohttpdocsnvidia
comcudardquo[25] M Silberstein B Ford I Keidar and E Witchel ldquoGPUfs integrating
file systems with GPUsrdquo in International Conference on Architec-tural Support for Programming Languages and Operating Systems(ASPLOS) 2013
[26] S Kim S Huh X Zhang Y Hu A Wated E Witchel and M Sil-berstein ldquoGPUnet Networking Abstractions for GPU Programsrdquo inUSENIX Symposium on Operating Systems Design and Implementation(OSDI) 2014
[27] J Owens J Stuart and M Cox ldquoGPU-to-CPU Callbacksrdquo inWorkshop on Unconventional High Performance Computing 2010
[28] S Bergman T Brokhman T Cohen and M Silberstein ldquoSPINSeamless Operating System Integration of Peer-to-Peer DMA BetweenSSDs and GPUsrdquo in Usenix Technical Conference (ATC) 2017
[29] AMD ldquoRadeon Open Compute [Online]rdquo 2017 httpsrocmgithubio
[30] AMD ldquoROCK syscallrdquo 2017 httpsgithubcomRadeonOpenComputeROCK syscall
[31] AMD ldquoROCT syscallrdquo 2017 httpsgithubcomRadeonOpenComputeROCT syscall
[32] AMD ldquoHCC syscallrdquo 2017 httpsgithubcomRadeonOpenComputeHCC syscall
[33] AMD ldquoGenesys syscall testrdquo 2017 httpsgithubcomRadeonOpenComputeGenesys syscall test
[34] R Draves B Bershad R Rashid and R Dean ldquoUsing Continuationsto Implement Thread Management and Communication in OperatingSystemsrdquo in Symposium on Operating Systems Principles (SOSP)1991
[35] O Shivers and M Might ldquoContinuations and Transducer Compositionrdquoin International Symposium on Programming Languages Design andImplementation (PLDI) 2006
[36] AMD ldquoGraphics Core Next 30 ISA [Online]rdquo 2013httpamd-devwpenginenetdna-cdncomwordpressmedia201307AMD GCN3 Instruction Set Architecturepdf
[37] Intel ldquoIntel Open Source HD Graphics Programmers Reference Manual[Online]rdquo 2014 https01orgsitesdefaultfilesdocumentationintel-gfx-prm-osrc-bdw-vol02b-commandreference-instructions 0 0pdf
[38] NVidia ldquoNVIDIA Compute PTX Parallel Thread Executionrdquo 2009httpswwwnvidiacomcontentCUDA-ptx isa 14pdf
[39] PCI-SIG ldquoAtomic Operations [Online]rdquo 2008 httpspcisigcomsitesdefaultfilesspecification documentsECN Atomic Ops 080417pdf
[40] T Sorensen A F Donaldson M Batty G Gopalakrishnan andZ Rakamaric ldquoPortable inter-workgroup barrier synchronisation forgpusrdquo in Proceedings of the 2016 ACM SIGPLAN InternationalConference on Object-Oriented Programming Systems Languagesand Applications OOPSLA 2016 (New York NY USA) pp 39ndash58ACM 2016
[41] T Rogers M OrsquoConnor and T Aamodt ldquoCache-Conscious Wave-front Schedulingrdquo in International Symposium on Microarchitecture(MICRO) 2012
[42] T Rogers M OrsquoConnor and T Aamodt ldquoDivergence-Aware WarpSchedulingrdquo in International Symposium on Microarchitecture (MI-CRO) 2013
[43] A ElTantawy and T Aamodt ldquoWarp Scheduling for Fine-GrainedSynchronizationrdquo in International Symposium on High PerformanceComputer Architecture (HPCA) 2018
[44] H Wang F Luo M Ibrahim O Kayiran and A Jog ldquoEfficient and
Fair Multi-programming in GPUs via Pattern-Based Effictive Band-width Managementrdquo in International Symposium on High PerformanceComputer Architecture (HPCA) 2018
[45] O Kayiran A Jog M Kandemir and C Das ldquoNeither More Nor LessOptimizing Thread-Level Parallelism for GPGPUsrdquo in InternationalConference on Parallel Architectures and Compilation Techniques(PACT) 2013
[46] X Tang A Pattnaik H Jiang O Kayiran A Jog S Pai M IbrahimM T Kandemir and C R Das ldquoControlled Kernel Launch forDynamic Parallelism in GPUsrdquo in International Symposium on HighPerformance Computer Architecture (HPCA) 2017
[47] AMD ldquoROCm a New Era in Open GPU Computing [Online]rdquo 2016httpsradeonopencomputegithubioindexhtml
[48] A Sasidharan and M Snir ldquoMiniAMR - A miniapp for AdaptiveMesh Refinementrdquo 2014 httphdlhandlenet214291046
[49] Z Brown ldquoAsynchronous System Callsrdquo in Ottawa Linux Symposium2007
[50] J A Stuart P Balaji and J D Owens ldquoExtending mpi to acceleratorsrdquoPACT 2011 Workshop Series Architectures and Systems for Big DataOct 2011
[51] L Oden ldquoDirect Communication Methods for Distributed GPUsrdquo inDissertation Thesis Ruprecht-Karls-Universitat Heidelberg 2015
[52] NVidia ldquoNVIDIA GPUDirect [Online]rdquo 2013 httpsdevelopernvidiacomgpudirect
[53] H Wang S Potluri D Bureddy C Rosales and D K Panda ldquoGpu-aware mpi on rdma-enabled clusters Design implementation andevaluationrdquo IEEE Transactions on Parallel and Distributed Systemsvol 25 pp 2595ndash2605 Oct 2014
[54] E Gabriel G E Fagg G Bosilca T Angskun J J Dongarra J MSquyres V Sahay P Kambadur B Barrett A Lumsdaine R HCastain D J Daniel R L Graham and T S Woodall ldquoOpen MPIGoals concept and design of a next generation MPI implementationrdquoin Proceedings 11th European PVMMPI Usersrsquo Group Meeting(Budapest Hungary) pp 97ndash104 September 2004
[55] IBM ldquoIBM Spectrum MPIrdquo 2017 httpwww-03ibmcomsystemsspectrum-computingproductsmpi
Table IV
PROFILED PERFORMANCE OF GPU ATOMIC OPERATIONS

Op        | cmp-swap | swap  | atomic-load | load
Time (µs) | 1.352    | 0.782 | 0.360       | 0.243
[Figure 7: Impact of system call invocation granularity. Both panels plot GPU time (s) against file size (32MB to 2048MB). Left: work-item (wi), work-group (wg), and kernel invocation granularities. Right: work-group sizes of 64 (wg64), 256 (wg256), 512 (wg512), and 1024 (wg1024) work-items. The graphs show average and standard deviation of 10 runs.]
meant, for example, that we preceded sys_write system calls with an L1 data cache flush.
VII MICROBENCHMARK EVALUATIONS
Invocation granularity: Figure 7 quantifies performance for a microbenchmark that uses pread on files in tmpfs4. The x-axis plots the file size, and the y-axis shows read time, with lower values being better. Within each cluster, we separate runtimes for different pread invocation granularities.
Figure 7 (left) shows that work-item invocation granularities tend to perform worst. This is not surprising, as it is the finest granularity of system call invocation and leads to a flood of individual system calls that overwhelm the CPU. On the other end of the granularity spectrum, kernel-level invocation is also challenging, as it generates a single system call and fails to leverage any potential parallelism in processing of system call requests. This is particularly pronounced at large file sizes (e.g., 2GB). A good compromise is to use work-group invocation granularities. It does not overwhelm the CPU with system call requests while still being able to exploit parallelism in servicing system calls.
When using work-group invocation, an important question is how many work-items should constitute a work-group. While Figure 7 (left) uses 64 work-items in a work-group, Figure 7 (right) quantifies the performance impact of pread system calls as we vary work-group sizes from 64 (wg64) to 1024 (wg1024) work-items. In general, larger work-group sizes enable better performance, as there are fewer unique system calls necessary to handle the same amount of work.
Blocking and ordering strategies: To quantify the impact of blocking/non-blocking with strong/relaxed ordering, we
4. Tmpfs is a filesystem without permanent backing storage; in other words, all structures and data are memory-resident.
[Figure 8: Performance implications of system call blocking and ordering semantics. GPU time per permutation (ms) versus compute permutations per block, for weak-block, weak-non-block, strong-block, and strong-non-block invocations. The graph shows average and standard deviation of 80 runs.]
designed a GPU microbenchmark that performs block permutation on an array, similar to the permutation steps performed in DES encryption. The input data array is preloaded with random values and divided into 8KB blocks. Work-groups, each executing 1024 work-items, independently permute blocks. The results are written to a file using pwrite at work-group granularity. The pwrite system calls for one block of data are overlapped with permutations on other blocks of data. To vary the amount of computation per system call, we permute multiple times before writing the result.
Figure 8 quantifies the impact of using blocking versus non-blocking system calls with strong and weak ordering. The x-axis plots the number of permutation iterations performed on each block by each work-group before writing the results. The y-axis plots execution time for one permutation (lower is better). Figure 8 shows that strongly ordered blocking invocations (strong-block) hurt performance. This is expected, as they require work-group scoped barriers to be placed before and after pwrite invocations. The GPU's hardware resources for work-groups are stalled waiting for the pwrite to complete. Not only does the inability to overlap GPU parallelism hurt strongly ordered blocking performance, it also means that GPU performance is heavily influenced by CPU-side performance vagaries like synchronization overheads, servicing other processes, etc. This is particularly true at iteration counts where system call overheads dominate GPU-side computation (below 15 compute iterations). Even when the application becomes GPU-compute bound, performance remains non-ideal.
Figure 8 shows that when pwrite is invoked in a non-blocking manner (with strong ordering), performance improves. This is because non-blocking pwrites permit the GPU to end the invoking work-group, freeing the GPU resources it was occupying. CPU-side pwrite processing can overlap with the execution of another work-group permuting a different block of data. Figure 8 shows that latencies generally drop by 30% compared to blocking calls at low iteration counts. At higher iteration counts (beyond 16), these benefits diminish because the latency to perform repeated
[Figure 9: Impact of polling on memory contention. CPU memcpy throughput (GB/s) versus the number of GPU atomically polled cachelines, on an AMD FX-9800P with DDR4-1066.]
permutations dominates any system call processing times.

For relaxed ordering with blocking (weak-block), the post-system-call work-group-scope barrier is eliminated. One out of every 16 wavefronts5 in the work-group waits for the blocking system call while others exit, freeing GPU resources. The GPU can use these freed resources to run other work-items to hide CPU-side system call latency. Consequently, the performance trends follow those of strong-non-block, with minor differences in performance arising from differences in GPU scheduling of the work-groups for execution. Finally, Figure 8 shows that system call latency is best hidden using relaxed and non-blocking approaches (weak-non-block).
Polling/halt-resume and memory contention: As previously discussed, polling at the work-item invocation granularity leads to memory reads of several thousands of per-work-item memory locations. We quantify the resulting memory contention in Figure 9, which showcases how the throughput of CPU accesses decreases as the number of polled GPU cache lines increases. Once the number of polled memory locations outstrips the GPU's L2 cache capacity (roughly 4096 cache lines on our platform), the GPU polls locations spilled to the DRAM. This causes contention on the memory controllers shared between GPUs and CPUs. We advocate using GENESYS with halt-resume approaches at such granularities of system call invocation.
Interrupt coalescing: Figure 10 shows the performance impact of coalescing system calls. We use a microbenchmark that invokes pread. We read data from files of different sizes using a constant number of work-items. More data is read per pread system call from the larger files. The x-axis shows the amount of data read and quantifies the latency per requested byte. We present two bars for each point on the x-axis, illustrating the average time needed to read one byte with the system call in the absence of coalescing and when up to eight system calls are coalesced. Coalescing is most beneficial when small amounts of data are read. When more data is read, the system call takes longer, and the overhead reduction is negligible compared to the significantly longer time to process the system call.
5. Each wavefront has 64 work-items. Thus, a 1024 work-item work-group has 16 wavefronts.
[Figure 10: Implications of system call coalescing. GPU time per requested byte (ms) versus size per syscall (B), comparing uncoalesced invocation with coalescing of up to eight interrupts. The graph shows average and standard deviation of 20 runs.]
VIII CASE STUDIES
A Memory Workload
We now assess the end-to-end performance of workloads that use GENESYS. Our first application requires memory management system calls. We studied miniAMR [48] and used the madvise system call directly from the GPU to better manage memory. MiniAMR performs 3D stencil calculations using adaptive mesh refinement and is a candidate for memory management because it varies its memory needs in a data-model-dependent manner. For example, when simulating regions experiencing turbulence, miniAMR needs to model with higher resolution. However, if lower resolution is possible without compromising simulation accuracy, miniAMR reduces memory and computation usage, making it possible to free memory. While relinquishing excess memory in this way is not possible in traditional GPU programs without explicitly dividing the offload into multiple kernels interspersed with CPU code (see Figure 1), GENESYS permits this with direct GPU madvise invocations. We invoke madvise using work-group granularities with non-blocking and weak ordering. We leverage GENESYS' generality by also using getrusage to read the application resident set size (RSS). When the RSS exceeds a watermark, madvise relinquishes memory.
[Figure 11: Memory footprint of miniAMR using getrusage and madvise to hint at unused memory. RSS (GB) versus time (s) for the rss-3GB and rss-4GB watermarks.]
We execute miniAMR with a dataset of 4.1GB, just beyond the hard limit we put on physical memory available to the GPU. Without using madvise, memory swapping increases so dramatically that it triggers GPU timeouts, causing the existing GPU device driver to terminate the application. Because of this, there is no baseline to compare to, as the baseline simply does not complete.
Figure 11 shows two results: one for a 3GB RSS watermark and one for 4GB. Not only does GENESYS enable miniAMR to complete, it also permits the programmer to balance memory usage and performance. While rss-3GB lowers memory utilization, it also worsens runtime compared to rss-4GB. This performance gap is expected: the more memory is released, the greater the likelihood that the GPU program suffers from page faults when the memory is touched again in the future, and the more frequently the madvise system call is invoked. Overall, Figure 11 illustrates that GENESYS allows programmers to perform memory allocation to trade memory usage and performance on GPUs, analogous to CPUs.
B Workload Using Signals
[Figure 12: Running time (s) of the CPU-GPU map-reduce workload, baseline versus GENESYS.]
GENESYS also enables system calls that permit the GPU to send asynchronous notifications to the CPU. This is useful in many scenarios. We study one such scenario and implement a map-reduce application called signal-search. The application runs in two phases. The first phase performs parallel lookup in a data array, while the second phase processes blocks of retrieved data and computes sha512 checksums on them. The first phase is a good fit for GPUs, since the highly parallel lookup fits its execution model, while the second phase is more appropriate for CPUs, many of which support performance acceleration of sha checksums via dedicated instructions.
Without support for signal invocation on GPUs, programmers would launch a kernel with the entire parallel lookup on the GPU and wait for it to conclude before proceeding with the sha512 checksums. GENESYS overcomes this restriction and permits a heterogeneous version of this code, where GPU work-groups can emit signals using rt_sigqueueinfo to the CPU, indicating per-block completions of the parallel search. As a result, the CPU can begin processing these blocks immediately, permitting overlap with GPU execution.
Operationally, rt_sigqueueinfo is a generic signal system call that allows the caller to fill a siginfo structure that is passed along with the signal. In our workload, we find that work-group level invocations perform best, so the GPU passes the identifier of this work-group through the siginfo structure. Figure 12 shows that using work-group invocation granularity and non-blocking invocation results in roughly a 14% performance speedup over the baseline.
C Storage Workloads
We have also studied how to use GENESYS to support storage workloads. In some cases, GENESYS permits the implementation of workloads supported by GPUfs, but in more efficient ways.
[Figure 13: a) Running time (s) of standard grep and OpenMP grep versus work-item/work-group invocations with GENESYS (CPU-grep, GPU-GENESYS (WG), GPU-GENESYS (WI), GPU-GENESYS (WI)-suspend, CPU-openmp). b) Running time (s) of CPU, GPU with no system calls (GPU-NOSYS), and GENESYS implementations of wordcount. Graphs show average and standard deviation of 10 runs.]
Supporting storage workloads more efficiently than GPUfs: We implement a workload that takes as input a list of words and a list of files. It then performs grep -F -l on the GPU, identifying which files contain any of the words on the list. As soon as these files are found, they are printed to the console. This workload cannot be supported by GPUfs without significant code refactoring because of its use of custom APIs. Instead, since GENESYS naturally adheres to the UNIX "everything is a file" philosophy, porting grep to GPUs requires only a few hours of programming.
Figure 13 shows the results of our GPU grep experiments. We compare a standard CPU implementation, a parallelized OpenMP CPU implementation, and two GPU implementations with GENESYS, with non-blocking system calls invoked at work-item (WI) and work-group (WG) granularities. Furthermore, since work-item invocations can sometimes achieve better performance using halt-resume (versus work-group and kernel invocations, which always achieve better performance with polling), we separate results for WI-polling and WI-halt-resume. GENESYS enables our GPU grep implementation to print output lines to the console or files using standard OS output. GENESYS achieves 20-23× speedups over an OpenMP version of grep.
Figure 13 also shows that GENESYS' flexibility with invocation granularity, blocking/non-blocking, and strong/relaxed ordering can boost performance (see Section V). For our grep example, because a file only needs to be searched until the first instance of a matching word, a work-item can immediately invoke a write of the filename rather than waiting for all matching files. We find that WI-halt-resume outperforms both WG and WI-polling by roughly 3-4%.
[Figure 14: Wordcount I/O and CPU utilization reading from SSD. Top: CPU utilization (%) versus time (s); bottom: I/O throughput (MB/s) versus time (s), for GPU-NOSYS, GPU-GENESYS, and CPU. Graphs show average and standard deviation of 10 runs.]
Workload from prior work: GENESYS also supports a version of the same workload that is evaluated in the original GPUfs work: traditional word count using open, read, and close system calls. Figure 13 shows our results. We compare the performance of a parallelized CPU version of the workload with OpenMP, a GPU version of the workload with no system calls, and a GPU version with GENESYS. Both CPU and GPU workloads are configured to search for occurrences of 64 strings. We found that with GENESYS, these system calls were best invoked at work-group granularity with blocking and weak-ordering semantics. All results are collected on a system with an SSD.
We found that GENESYS achieves nearly a 6× performance improvement over the CPU version. Without system call support, the GPU version is far worse than the CPU version. Figure 14 sheds light on these benefits. We plot traces for CPU and GPU utilization and disk I/O throughput. GENESYS extracts much higher throughput from the underlying storage device (up to 170 MB/s compared to the CPU's 30 MB/s). Offloading search string processing to the GPU frees up the CPU to process system calls effectively. The change in CPU utilization between the GPU workload and CPU workload reveals this trend. In addition, we found that the GPU's ability to launch more concurrent I/O requests enabled the I/O scheduler to make better scheduling decisions.
D Network Workloads
We have studied the benefits of GENESYS for network I/O in the context of memcached. While this can technically be implemented using GPUnet, the performance implications are unclear because the original GPUnet paper used APIs that extracted performance using dedicated RDMA hardware. We make no such assumptions and focus on the two core commands, SET and GET. SET stores a key-value pair, and GET retrieves a value associated with a key if it is present. Our implementation supports a binary UDP memcached version with a fixed-size hash table as back-end storage. The hash table is shared between CPU and GPU, and its bucket size and count are configurable. Further, this memcached
[Figure 15: Latency (ms) and throughput (ops/s) of memcached for hits and misses, comparing CPU, GPU-GENESYS, and GPU-NOSYS. Graphs show average and standard deviation of 20 runs.]
implementation enables concurrent operations: CPUs can handle SETs and GETs, while the GPU supports only GETs. Our GPU implementation parallelizes the hash computation, bucket lookup, and data copy. We use sendto and recvfrom system calls for UDP network access. These system calls are invoked at work-group granularity with blocking and weak ordering, as this performs best.
Figure 15 compares the performance of a CPU version of this workload with GPU versions using GENESYS. GPUs accelerate memcached by parallelizing lookups on buckets with more elements. For example, Figure 15 shows speedups when there are 1024 elements per bucket (with 1KB data size). Without system calls, GPU performance lags behind CPU performance. However, GENESYS achieves 30-40% latency and throughput benefits over not just CPU versions but also GPU versions without direct system calls.
E Device Control
Finally, we also used GENESYS to implement ioctl system calls. As an example, we used ioctl to query and control framebuffer settings. The implementation is straightforward: the GPU opens /dev/fb0 and issues a series of ioctl commands to query and set settings of the active frame buffer. It then proceeds to mmap the framebuffer memory and fill it with data from a previously mmaped raster image. This results in the image being displayed on the computer screen. While not a critical GPGPU application, this ioctl example demonstrates the generality and flexibility of the OS interfaces implemented by GENESYS.
IX DISCUSSION
Asynchronous system call handling: GENESYS enqueues the GPU system call's kernel task and processes it outside of the interrupt context. We do this because Linux is designed such that few operations can be processed in an interrupt handler. A potential concern with this design, however, is that it defers system call processing potentially past the end of the lifetime of the GPU thread, and potentially of the process that created the GPU thread itself. It is an example of a more general problem with asynchronous system calls [49].
[Figure 16: Raster image copied to the framebuffer by the GPU.]
Our solution is to provide a new function call, invoked by the CPU, that ensures all GPU system calls have completed before the termination of the process.
Related work: Beyond work already discussed [25, 26, 28], the latest generations of C++AMP [23] and CUDA [24] provide access to the memory allocator. These efforts use a user-mode service thread on the CPU to proxy requests from the GPU [27]. Like GENESYS, system call requests are placed in a shared queue by the GPU. From there, however, the designs are different. Their user-mode thread polls this shared queue and "thunks" the request to the libc runtime or OS. This incurs added overhead for entering and leaving the OS kernel.
Some studies provide network access to GPU code [26, 50-52]. NVidia provides GPUDirect [52], used by several MPI libraries [53-55], which allows the NIC to bypass main memory and communicate directly with memory on the GPU itself. GPUDirect does not provide a programming interface for GPU-side code; the CPU must initiate communication with the NIC. Oden exposed the memory-mapped control interface of the NIC to the GPU and thereby allowed the GPU to directly communicate with the NIC [51]. This low-level interface, however, lacks the benefits of a traditional OS interface (e.g., protection, sockets, TCP).
X CONCLUSIONS
We shed light on research questions fundamental to the idea of accessing OS services from accelerators by realizing an interface for generic POSIX system call support on GPUs. Enabling such support requires subtle changes to existing kernels. In particular, traditional OSes assume that system call processing occurs in the same context as the invoking thread, and this needs to change for accelerators. We have released GENESYS to make these benefits accessible for broader research on GPUs.
XI ACKNOWLEDGMENTS
We thank the National Science Foundation, which partially supported this work through grants 1253700 and 1337147. We thank Guilherme Cox and Mark Silberstein for their feedback. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a registered trademark of PCI-SIG Corporation. ARM is a registered trademark of ARM Limited. OpenCL is a trademark of Apple Inc. used by permission by Khronos. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. © 2018 Advanced Micro Devices, Inc. All rights reserved.
REFERENCES
[1] S. Mittal and J. Vetter, "A Survey of CPU-GPU Heterogeneous Computing Techniques," in ACM Computing Surveys, 2015.
[2] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. Lefohn, and T. Purcell, "A Survey of General-Purpose Computation on Graphics Hardware," in Computer Graphics Forum, Vol. 26, Num. 1, pp. 80-113, 2007.
[3] D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006.
[4] J. Vesely, A. Basu, M. Oskin, G. Loh, and A. Bhattacharjee, "Observations and Opportunities in Architecting Shared Virtual Memory for Heterogeneous Systems," in International Symposium on Performance Analysis of Systems and Software (ISPASS), 2016.
[47] AMD ldquoROCm a New Era in Open GPU Computing [Online]rdquo 2016httpsradeonopencomputegithubioindexhtml
[48] A Sasidharan and M Snir ldquoMiniAMR - A miniapp for AdaptiveMesh Refinementrdquo 2014 httphdlhandlenet214291046
[49] Z Brown ldquoAsynchronous System Callsrdquo in Ottawa Linux Symposium2007
[50] J A Stuart P Balaji and J D Owens ldquoExtending mpi to acceleratorsrdquoPACT 2011 Workshop Series Architectures and Systems for Big DataOct 2011
[51] L Oden ldquoDirect Communication Methods for Distributed GPUsrdquo inDissertation Thesis Ruprecht-Karls-Universitat Heidelberg 2015
[52] NVidia ldquoNVIDIA GPUDirect [Online]rdquo 2013 httpsdevelopernvidiacomgpudirect
[53] H Wang S Potluri D Bureddy C Rosales and D K Panda ldquoGpu-aware mpi on rdma-enabled clusters Design implementation andevaluationrdquo IEEE Transactions on Parallel and Distributed Systemsvol 25 pp 2595ndash2605 Oct 2014
[54] E Gabriel G E Fagg G Bosilca T Angskun J J Dongarra J MSquyres V Sahay P Kambadur B Barrett A Lumsdaine R HCastain D J Daniel R L Graham and T S Woodall ldquoOpen MPIGoals concept and design of a next generation MPI implementationrdquoin Proceedings 11th European PVMMPI Usersrsquo Group Meeting(Budapest Hungary) pp 97ndash104 September 2004
[55] IBM ldquoIBM Spectrum MPIrdquo 2017 httpwww-03ibmcomsystemsspectrum-computingproductsmpi
Figure 9: Impact of polling on memory contention (CPU memcpy throughput in GB/s versus the number of GPU atomically polled cache lines; AMD FX-9800P, DDR4-1066).
permutations dominates any system call processing time.

For relaxed ordering with blocking (weak-block), the post-system-call work-group-scope barrier is eliminated. One out of every 16 wavefronts⁵ in the work-group waits for the blocking system call while the others exit, freeing GPU resources. The GPU can use these freed resources to run other work-items and thereby hide CPU-side system call latency. Consequently, the performance trends follow those of strong-non-block, with minor differences arising from how the GPU schedules the work-groups for execution. Finally, Figure 8 shows that system call latency is best hidden using the relaxed, non-blocking approach (weak-non-block).
Polling/halt-resume and memory contention: As previously discussed, polling at work-item invocation granularity leads to memory reads of several thousand per-work-item memory locations. We quantify the resulting memory contention in Figure 9, which shows how the throughput of CPU memory accesses decreases as the number of polled GPU cache lines increases. Once the number of polled memory locations outstrips the GPU's L2 cache capacity (roughly 4096 cache lines on our platform), the GPU polls locations spilled to DRAM. This causes contention on the memory controllers shared between GPUs and CPUs. We therefore advocate using GENESYS with halt-resume at such granularities of system call invocation.
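To make the polling cost concrete, the following sketch (ours, not GENESYS code) models work-item-granularity completion polling: one thread spins over a per-work-item flag array while another marks completions. The slot count, a stand-in for the number of polled cache lines, is the knob that Figure 9 varies.

```python
# Illustrative sketch of work-item-granularity completion polling: a
# poller spins over one flag per "work-item" until a "CPU" thread has
# marked every slot complete. All names and counts here are ours.
import threading

def poll_completions(flags, done_event):
    # Spin until every per-work-item slot has been marked complete;
    # in hardware, each slot read touches a separate cache line.
    while not all(flags):
        pass
    done_event.set()

def complete_all(flags):
    # Stand-in for the CPU writing per-work-item completion records.
    for i in range(len(flags)):
        flags[i] = True

NUM_WORK_ITEMS = 1024  # hypothetical count; real invocations poll thousands
flags = [False] * NUM_WORK_ITEMS
done = threading.Event()
t = threading.Thread(target=poll_completions, args=(flags, done))
t.start()
complete_all(flags)
t.join(timeout=5)
```

Scaling NUM_WORK_ITEMS past the cache capacity is precisely what turns this loop from cache hits into DRAM traffic in the measurement above.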
Interrupt coalescing: Figure 10 shows the performance impact of coalescing system calls. We use a microbenchmark that invokes pread, reading data from files of different sizes using a constant number of work-items; more data is read per pread system call from the larger files. The x-axis shows the amount of data read, and the y-axis quantifies the latency per requested byte. We present two bars for each point on the x-axis, illustrating the average time needed to read one byte with the system call in the absence of coalescing and when up to eight system calls are coalesced. Coalescing is most beneficial when small amounts of data are read. When more data is read, the system call takes longer, and the overhead reduction from coalescing becomes negligible compared to the significantly longer time needed to process the system call itself.
⁵Each wavefront has 64 work-items; thus, a 1024 work-item work-group has 16 wavefronts.
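The microbenchmark's access pattern can be sketched on the CPU side as follows; the file contents, sizes, and call counts are illustrative stand-ins, and the GPU-side invocation machinery is not modeled.

```python
# Sketch of the coalescing microbenchmark's access pattern: read a file
# with a fixed number of pread calls, varying how many bytes each call
# requests, and report the time spent per requested byte.
import os, tempfile, time

def time_per_byte(fd, size_per_call, num_calls):
    start = time.perf_counter()
    for i in range(num_calls):
        data = os.pread(fd, size_per_call, i * size_per_call)
        assert len(data) == size_per_call
    elapsed = time.perf_counter() - start
    return elapsed / (size_per_call * num_calls)

# Build a 256 KB test file, then time 64 pread calls at several sizes.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * (4096 * 64))
    path = f.name
fd = os.open(path, os.O_RDONLY)
results = {size: time_per_byte(fd, size, 64) for size in (64, 1024, 4096)}
os.close(fd)
os.unlink(path)
```

The per-byte cost falls as the size per call grows, which is why coalescing matters most at the small-request end of Figure 10.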
Figure 10: Implications of system call coalescing (GPU time per requested byte in ms versus size per syscall in bytes, for uncoalesced and coalesced (8 interrupts) invocations). The graph shows the average and standard deviation of 20 runs.
VIII. CASE STUDIES

A. Memory Workload
We now assess the end-to-end performance of workloadsthat use GENESYS Our first application requires memorymanagement system calls We studied miniAMR [48] andused the madvise system call directly from the GPU to bettermanage memory MiniAMR performs 3D stencil calculationsusing adaptive mesh refinement and is a candidate for memorymanagement because it varies its memory needs in a data-model-dependent manner For example when simulatingregions experiencing turbulence miniAMR needs to modelwith higher resolution However if lower resolution is possi-ble without compromising simulation accuracy miniAMRreduces memory and computation usage making it possible tofree memory While relinquishing excess memory in this wayis not possible in traditional GPU programs without explicitlydividing the offload into multiple kernels interspersed withCPU code (see Figure 1) GENESYS permits this with directGPU madvise invocations We invoke madvise using work-group granularities with non-blocking and weak orderingWe leverage GENESYSrsquo generality by also using getrusageto read the application resident set size (RSS) When theRSS exceeds a watermark madvise relinquishes memory
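The watermark policy pairs the same two system calls; a minimal CPU-side sketch follows. The helper names and the 16 MB buffer are ours, getrusage here reports peak rather than instantaneous RSS, and the madvise step is guarded because mmap.madvise requires Python 3.8+ on a POSIX system.

```python
# Hedged sketch of the getrusage + madvise watermark policy: query RSS,
# and once it exceeds a watermark, hint an unused buffer back to the OS.
import mmap, resource

def rss_kb():
    # getrusage reports peak RSS (ru_maxrss, kilobytes on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def release_if_over(buf, watermark_kb):
    # Mimics miniAMR hinting that a de-refined region is no longer needed.
    if (rss_kb() > watermark_kb
            and hasattr(mmap, "MADV_DONTNEED")
            and hasattr(buf, "madvise")):
        buf.madvise(mmap.MADV_DONTNEED)  # tell the kernel the pages are unused
        return True
    return False

buf = mmap.mmap(-1, 16 * 1024 * 1024)    # 16 MB anonymous mapping
buf[:] = bytes(len(buf))                 # touch the pages so they become resident
released = release_if_over(buf, watermark_kb=1)  # tiny watermark: always over
```

In the paper's setting the GPU issues these calls directly; here the point is only the query-then-hint structure of the policy.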
Figure 11: Memory footprint of miniAMR using getrusage and madvise to hint at unused memory (RSS in GB over time in seconds, for the rss-3GB and rss-4GB watermarks).
We execute miniAMR with a dataset of 4.1GB – just beyond the hard limit we place on the physical memory available to the GPU. Without madvise, memory swapping increases so dramatically that it triggers GPU timeouts, causing the existing GPU device driver to terminate the application. Because of this, there is no baseline to compare against, as the baseline simply does not complete.
Figure 11 shows two results: one for a 3GB RSS watermark and one for 4GB. Not only does GENESYS enable miniAMR to complete, it also permits the programmer to balance memory usage and performance. While rss-3GB lowers memory utilization, it also worsens runtime compared to rss-4GB. This performance gap is expected: the more memory is released, the greater the likelihood that the GPU program suffers page faults when the memory is touched again in the future, and the more frequently the madvise system call is invoked. Overall, Figure 11 illustrates that GENESYS allows programmers to manage memory so as to trade memory usage for performance on GPUs, analogous to CPUs.
B. Workload Using Signals
Figure 12: Runtime of the CPU–GPU map-reduce workload (baseline versus GENESYS, running time in seconds).
GENESYS also enables system calls that permit the GPU to send asynchronous notifications to the CPU. This is useful in many scenarios. We study one such scenario and implement a map-reduce application called signal-search. The application runs in two phases. The first phase performs parallel lookup in a data array, while the second phase processes blocks of retrieved data and computes sha512 checksums on them. The first phase is a good fit for GPUs, since the highly parallel lookup suits their execution model, while the second phase is more appropriate for CPUs, many of which accelerate sha checksums via dedicated instructions.
Without support for signal invocation on GPUs, programmers would launch a kernel with the entire parallel lookup on the GPU and wait for it to conclude before proceeding with the sha512 checksums. GENESYS overcomes this restriction and permits a heterogeneous version of this code in which GPU work-groups emit signals to the CPU using rt_sigqueueinfo, indicating per-block completions of the parallel search. As a result, the CPU can begin processing these blocks immediately, permitting overlap with GPU execution.
Operationally, rt_sigqueueinfo is a generic signal system call that allows the caller to fill a siginfo structure that is passed along with the signal. In our workload, we find that work-group-level invocations perform best, so the GPU passes the identifier of the completing work-group through the siginfo structure. Figure 12 shows that using work-group invocation granularity with non-blocking invocation results in a roughly 14% performance speedup over the baseline.
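Python provides no rt_sigqueueinfo binding, so this POSIX-only sketch mirrors only the control flow: a stand-in for a GPU work-group raises SIGUSR1 as each block completes, and the consumer processes block ids as they arrive rather than waiting for the whole search. The block-id side channel is a queue instead of siginfo's sival field.

```python
# Sketch of the signal-per-block notification pattern (POSIX only).
# GENESYS would deliver the work-group id inside siginfo; here a queue
# carries it, since Python signal handlers cannot read sival payloads.
import os, queue, signal

completed_blocks = queue.Queue()
processed = []

def on_block_done(signum, frame):
    try:
        processed.append(completed_blocks.get_nowait())
    except queue.Empty:
        pass  # standard signals can coalesce; the drain below catches these

old = signal.signal(signal.SIGUSR1, on_block_done)

for block_id in range(4):                 # four "work-groups" finish a block each
    completed_blocks.put(block_id)
    os.kill(os.getpid(), signal.SIGUSR1)  # notify the "CPU" side immediately

while True:                               # drain anything a coalesced signal missed
    try:
        processed.append(completed_blocks.get_nowait())
    except queue.Empty:
        break
signal.signal(signal.SIGUSR1, old)
```

The coalescing caveat in the handler is real for standard signals; queued POSIX signals (the rt_ family the paper uses) avoid it, which is part of why rt_sigqueueinfo is the right primitive here.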
C. Storage Workloads
We have also studied how to use GENESYS to support storage workloads. In some cases, GENESYS permits the implementation of workloads supported by GPUfs, but in more efficient ways.
Figure 13: a) Standard grep and OpenMP grep versus work-item/work-group invocations with GENESYS (CPU-grep, GPU-GENESYS (WG), GPU-GENESYS (WI), GPU-GENESYS (WI)-suspend, CPU-openmp; running time in seconds). b) Performance of CPU, GPU with no system calls, and GENESYS implementations of wordcount (GPU-GENESYS, GPU-NOSYS, CPU; running time in seconds). Graphs show the average and standard deviation of 10 runs.
Supporting storage workloads more efficiently than GPUfs: We implement a workload that takes as input a list of words and a list of files, then performs grep -F -l on the GPU, identifying which files contain any of the words on the list. As soon as such files are found, their names are printed to the console. This workload cannot be supported by GPUfs without significant code refactoring because of GPUfs' use of custom APIs. In contrast, since GENESYS naturally adheres to the UNIX "everything is a file" philosophy, porting grep to GPUs required only a few hours of programming.
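The early-exit search that makes per-work-item write invocations profitable can be sketched as follows; the function and file names are ours.

```python
# Sketch of the grep -F -l pattern the paper ports to the GPU: report a
# file on its FIRST matching line and stop scanning it, mirroring how a
# work-item can invoke the write immediately instead of finishing the file.
import os, tempfile

def grep_l(paths, words):
    matched = []
    for path in paths:
        with open(path, errors="ignore") as f:
            for line in f:
                if any(w in line for w in words):
                    matched.append(path)  # report on first hit...
                    break                 # ...and abandon the rest of the file
    return matched

d = tempfile.mkdtemp()
a, b = os.path.join(d, "a.txt"), os.path.join(d, "b.txt")
with open(a, "w") as f:
    f.write("system calls on gpus\n")
with open(b, "w") as f:
    f.write("nothing relevant\n")
hits = grep_l([a, b], ["gpus", "syscall"])
```

On the GPU, the `break` corresponds to a work-item issuing its write the moment a match appears, rather than synchronizing with the rest of the work-group first.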
Figure 13 shows the results of our GPU grep experiments. We compare a standard CPU implementation, a parallelized OpenMP CPU implementation, and two GPU implementations using GENESYS with non-blocking system calls invoked at work-item (WI) and work-group (WG) granularities. Furthermore, since work-item invocations can sometimes achieve better performance using halt-resume (versus work-group and kernel invocations, which always perform better with polling), we separate results for WI-polling and WI-halt-resume. GENESYS enables our GPU grep implementation to print output lines to the console or to files using standard OS output. GENESYS achieves 20-23× speedups over the OpenMP version of grep.
Figure 13 also shows that GENESYS' flexibility with invocation granularity, blocking/non-blocking invocation, and strong/relaxed ordering can boost performance (see Section V). For our grep example, because a file only needs to be searched until the first instance of a matching word, a work-item can immediately invoke a write of the filename rather than waiting for all matching files. We find that WI-halt-resume outperforms both WG and WI-polling by roughly 3-4%.
Figure 14: Wordcount I/O throughput (MB/s) and CPU utilization (%) over time when reading from an SSD, for GPU-NOSYS, GPU-GENESYS, and CPU. Graphs show the average and standard deviation of 10 runs.
Workload from prior work: GENESYS also supports a version of the workload evaluated in the original GPUfs work: traditional word count, using the open, read, and close system calls. Figure 13 shows our results. We compare the performance of a parallelized CPU version of the workload using OpenMP, a GPU version with no system calls, and a GPU version with GENESYS. Both the CPU and GPU workloads are configured to search for occurrences of 64 strings. We found that with GENESYS these system calls were best invoked at work-group granularity with blocking and weak-ordering semantics. All results are collected on a system with an SSD.
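A minimal sketch of the wordcount pattern, using the same raw open/read/close sequence (via os-level calls) on the CPU; the chunk size and the sample strings are illustrative.

```python
# Sketch of the GPUfs-style wordcount: stream a file with raw
# open/read/close calls, then count occurrences of each search string.
import os, tempfile

def count_occurrences(path, needles):
    fd = os.open(path, os.O_RDONLY)
    chunks = []
    try:
        while True:
            buf = os.read(fd, 4096)   # illustrative chunk size
            if not buf:
                break
            chunks.append(buf)
    finally:
        os.close(fd)
    data = b"".join(chunks)
    return {n: data.count(n) for n in needles}

fd, path = tempfile.mkstemp()
os.write(fd, b"gpu cpu gpu os gpu")
os.close(fd)
counts = count_occurrences(path, [b"gpu", b"cpu", b"tpu"])
os.unlink(path)
```

The GPU version distributes the per-file loop across work-groups; the system-call sequence per file is the same.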
We found that GENESYS achieves nearly a 6× performance improvement over the CPU version. Without system call support, the GPU version is far worse than the CPU version. Figure 14 sheds light on these benefits: we plot traces of CPU and GPU utilization and of disk I/O throughput. GENESYS extracts much higher throughput from the underlying storage device (up to 170MB/s, compared to the CPU's 30MB/s). Offloading search-string processing to the GPU frees the CPU to process system calls effectively; the change in CPU utilization between the GPU and CPU workloads reveals this trend. In addition, we found that the GPU's ability to launch more concurrent I/O requests enabled the I/O scheduler to make better scheduling decisions.
D. Network Workloads
We have studied the benefits of GENESYS for network I/O in the context of memcached. While this could technically be implemented using GPUnet, the performance implications are unclear because the original GPUnet paper used APIs that extracted performance from dedicated RDMA hardware. We make no such assumptions and focus on the two core commands – SET and GET. SET stores a key-value pair, and GET retrieves the value associated with a key if it is present. Our implementation supports a binary UDP memcached version with a fixed-size hash table as back-end storage. The hash table is shared between the CPU and GPU, and its bucket size and count are configurable. Further, this memcached
Figure 15: Latency (ms) and throughput (ops/s) of memcached for hits and misses, comparing CPU, GPU-GENESYS, and GPU-NOSYS. Graphs show the average and standard deviation of 20 runs.
implementation enables concurrent operations: CPUs can handle SETs and GETs, while the GPU supports only GETs. Our GPU implementation parallelizes the hash computation, bucket lookup, and data copy. We use the sendto and recvfrom system calls for UDP network access. These system calls are invoked at work-group granularity with blocking and weak ordering, as this performs best.
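The GET path can be sketched with the same sendto/recvfrom pair over loopback UDP; the plain-text framing and the toy dictionary store are ours, not the binary memcached protocol the paper implements.

```python
# Sketch of a UDP GET using sendto/recvfrom, the same syscall pair the
# paper invokes from the GPU. The text protocol here is illustrative.
import socket, threading

store = {b"k1": b"v1"}   # toy stand-in for the shared hash table

def server(sock):
    msg, addr = sock.recvfrom(1024)          # e.g. b"GET k1"
    key = msg.split(b" ", 1)[1]
    sock.sendto(store.get(key, b"MISS"), addr)

srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
srv.bind(("127.0.0.1", 0))                   # ephemeral loopback port
threading.Thread(target=server, args=(srv,), daemon=True).start()

cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cli.sendto(b"GET k1", srv.getsockname())
value, _ = cli.recvfrom(1024)
cli.close()
srv.close()
```

In the GPU version, a whole work-group cooperates on the hash and copy between these two calls; the socket interface itself is unchanged.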
Figure 15 compares the performance of a CPU version of this workload with GPU versions using GENESYS. GPUs accelerate memcached by parallelizing lookups on buckets with many elements. For example, Figure 15 shows speedups when there are 1024 elements per bucket (with 1KB data size). Without system calls, GPU performance lags behind CPU performance. With GENESYS, however, we achieve 30-40% latency and throughput benefits over not only the CPU versions but also the GPU versions without direct system calls.
E. Device Control
Finally, we also used GENESYS to implement ioctl system calls. As an example, we used ioctl to query and control framebuffer settings. The implementation is straightforward: the GPU opens /dev/fb0 and issues a series of ioctl commands to query and set the settings of the active frame buffer. It then proceeds to mmap the framebuffer memory and fill it with data from a previously mmaped raster image, resulting in the image being displayed on the computer screen. While not a critical GPGPU application, this ioctl example demonstrates the generality and flexibility of the OS interfaces implemented by GENESYS.
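Since /dev/fb0 is often absent in test environments, this sketch shows the same query-style ioctl pattern using FIONREAD on a pipe instead: pack a buffer, issue the ioctl, and unpack the kernel's answer. The framebuffer case swaps in FBIOGET_VSCREENINFO and a larger result structure but follows the identical shape.

```python
# Device-independent sketch of the ioctl query pattern: ask the kernel
# how many bytes are pending in a pipe via the FIONREAD ioctl.
import fcntl, os, struct, termios

r, w = os.pipe()
os.write(w, b"hello")                      # 5 bytes now queued in the pipe
buf = struct.pack("i", 0)                  # int-sized result buffer
res = fcntl.ioctl(r, termios.FIONREAD, buf)
(pending,) = struct.unpack("i", res)       # bytes waiting to be read: 5
os.close(r)
os.close(w)
```

A GPU-side framebuffer query under GENESYS differs only in the device node, the request code, and the struct being filled.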
IX. DISCUSSION
Asynchronous system call handling: GENESYS enqueues the GPU system call's kernel task and processes it outside of the interrupt context. We do this because Linux is designed such that few operations can be processed in an interrupt handler. A potential concern with this design, however, is that it may defer system call processing past the end of the lifetime of the GPU thread, and potentially of the process that created the GPU thread itself. This is an instance of a more general problem with asynchronous system calls [49].
Figure 16: Raster image copied to the framebuffer by the GPU.
Our solution is to provide a new function call, invoked by the CPU, that ensures all GPU system calls have completed before the termination of the process.
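A minimal sketch of such a drain function, assuming a runtime-maintained count of in-flight GPU system calls; all names here are ours, not the GENESYS API.

```python
# Sketch of a pre-exit drain: a counter of in-flight GPU system calls
# plus a CPU-side drain() that blocks until every call has retired.
import threading

class SyscallTracker:
    def __init__(self):
        self._pending = 0
        self._cv = threading.Condition()

    def submit(self):
        # Called when a GPU system call request is enqueued.
        with self._cv:
            self._pending += 1

    def complete(self):
        # Called by the deferred kernel-side work once a call retires.
        with self._cv:
            self._pending -= 1
            self._cv.notify_all()

    def drain(self):
        # Block until every outstanding GPU system call has finished.
        with self._cv:
            self._cv.wait_for(lambda: self._pending == 0)

tracker = SyscallTracker()
tracker.submit()
th = threading.Thread(target=tracker.complete)
th.start()
tracker.drain()   # returns only once the in-flight call has retired
th.join()
```

Calling `drain()` from the process-teardown path closes the window in which deferred processing could outlive its invoking thread.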
Related work: Beyond the work already discussed [25, 26, 28], the latest generations of C++AMP [23] and CUDA [24] provide access to the memory allocator. These efforts use a user-mode service thread on the CPU to proxy requests from the GPU [27]. As in GENESYS, system call requests are placed in a shared queue by the GPU. From there, however, the designs differ: the user-mode thread polls this shared queue and "thunks" the request to the libc runtime or the OS, incurring added overhead for entering and leaving the OS kernel.
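The proxy design contrasted here can be sketched as a shared request queue consumed by a user-mode service thread that issues the real system call on the GPU's behalf; the request format and the sentinel shutdown are illustrative.

```python
# Sketch of the user-mode service-thread proxy used by prior work:
# producers (GPU threads, in the real design) enqueue requests, and a
# CPU thread pops each one and "thunks" it into an actual system call.
import os, queue, tempfile, threading

requests = queue.Queue()

def service_thread():
    while True:
        req = requests.get()
        if req is None:              # sentinel: shut the proxy down
            break
        fd, payload = req
        os.write(fd, payload)        # proxy the syscall on the CPU

t = threading.Thread(target=service_thread)
t.start()
fd, path = tempfile.mkstemp()
requests.put((fd, b"from-gpu\n"))    # a "GPU" producer would enqueue this
requests.put(None)
t.join()
os.close(fd)
```

Every request here pays a user-level hop through the proxy before entering the kernel, which is the overhead GENESYS' direct invocation path avoids.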
Some studies provide network access to GPU code [26, 50-52]. NVidia provides GPUDirect [52], used by several MPI libraries [53-55], which allows the NIC to bypass main memory and communicate directly with memory on the GPU itself. GPUDirect does not, however, provide a programming interface for GPU-side code; the CPU must initiate communication with the NIC. Oden exposed the memory-mapped control interface of the NIC to the GPU and thereby allowed the GPU to communicate directly with the NIC [51]. This low-level interface, however, lacks the benefits of a traditional OS interface (e.g., protection, sockets, TCP).
X. CONCLUSIONS
We shed light on research questions fundamental to the idea of accessing OS services from accelerators by realizing an interface for generic POSIX system call support on GPUs. Enabling such support requires subtle changes to existing kernels; in particular, traditional OSes assume that system call processing occurs in the same context as the invoking thread, and this assumption must change for accelerators. We have released GENESYS to make these benefits accessible for broader research on GPUs.
XI. ACKNOWLEDGMENTS
We thank the National Science Foundation, which partially supported this work through grants 1253700 and 1337147. We thank Guilherme Cox and Mark Silberstein for their feedback. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a registered trademark of PCI-SIG Corporation. ARM is a registered trademark of ARM Limited. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. © 2018 Advanced Micro Devices, Inc. All rights reserved.
REFERENCES
[1] S. Mittal and J. Vetter, "A Survey of CPU-GPU Heterogeneous Computing Techniques," in ACM Computing Surveys, 2015.
[2] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. Lefohn, and T. Purcell, "A Survey of General-Purpose Computation on Graphics Hardware," in Computer Graphics Forum, Vol. 26, Num. 1, pp. 80-113, 2007.
[3] D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006.
[4] J. Vesely, A. Basu, M. Oskin, G. Loh, and A. Bhattacharjee, "Observations and Opportunities in Architecting Shared Virtual Memory for Heterogeneous Systems," in International Symposium on Performance Analysis of Systems and Software (ISPASS), 2016.
[5] J. Obert, J. V. Waveran, and G. Sellers, "Virtual Texturing in Software and Hardware," in SIGGRAPH Courses, 2012.
[6] G. Cox and A. Bhattacharjee, "Efficient Address Translation with Multiple Page Sizes," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017.
[7] L. Olson, J. Power, M. Hill, and D. Wood, "Border Control: Sandboxing Accelerators," in International Symposium on Microarchitecture (MICRO), 2015.
[8] B. Pichai, L. Hsu, and A. Bhattacharjee, "Architectural Support for Address Translation on GPUs," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014.
[9] J. Power, M. Hill, and D. Wood, "Supporting x86-64 Address Translation for 100s of GPU Lanes," in International Symposium on High Performance Computer Architecture (HPCA), 2014.
[10] N. Agarwal, D. Nellans, E. Ebrahimi, T. Wenisch, J. Danskin, and S. Keckler, "Selective GPU Caches to Eliminate CPU-GPU HW Cache Coherence," in International Symposium on High Performance Computer Architecture (HPCA), 2016.
[11] HSA Foundation, "HSA Programmer's Reference Manual [Online]," 2015. http://www.hsafoundation.com/?ddownload=4945
[12] G. Kyriazis, "Heterogeneous System Architecture: A Technical Review," 2012.
[13] J. Power, A. Basu, J. Gu, S. Puthoor, B. Beckmann, M. Hill, S. Reinhardt, and D. Wood, "Heterogeneous System Coherence for Integrated CPU-GPU Systems," in International Symposium on Microarchitecture (MICRO), 2013.
[14] S. Sahar, S. Bergman, and M. Silberstein, "Active Pointers: A Case for Software Address Translation on GPUs," in International Symposium on Computer Architecture (ISCA), 2016.
[15] C. J. Thompson, S. Hahn, and M. Oskin, "Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis," in International Symposium on Microarchitecture (MICRO), 2002.
[16] J. Owens, U. Kapasi, P. Mattson, B. Towles, B. Serebrin, S. Rixner, and W. Dally, "Media Processing Applications on the Imagine Stream Processor," in International Conference on Computer Design (ICCD), 2002.
[17] J. Owens, B. Khailany, B. Towles, and W. Dally, "Comparing Reyes and OpenGL on a Stream Architecture," in SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2002.
[18] S. Rixner, W. Dally, U. Kapasi, B. Khailany, A. Lopez-Lagunas, P. Mattson, and J. Owens, "A Bandwidth-Efficient Architecture for Media Processing," in International Symposium on Microarchitecture (MICRO), 1998.
[19] B. Serebrin, J. Owens, B. Khailany, P. Mattson, U. Kapasi, C. Chen, J. Namkoong, S. Crago, S. Rixner, and W. Dally, "A Stream Processor Development Platform," in International Conference on Computer Design (ICCD), 2002.
[20] Khronos Group, "OpenGL," 1992. https://www.khronos.org/opengl
[21] Microsoft, "Programming Guide for DirectX [Online]," 2016. https://msdn.microsoft.com/en-us/library/windows/desktop/ff476455(v=vs.85).aspx
[22] Khronos Group, "OpenCL," 2009. https://www.khronos.org/opencl
[23] Microsoft, "C++ AMP (C++ Accelerated Massive Parallelism) [Online]," 2016. https://msdn.microsoft.com/en-us/library/hh265137.aspx
[24] NVidia, "CUDA Toolkit Documentation," 2016. http://docs.nvidia.com/cuda
[25] M. Silberstein, B. Ford, I. Keidar, and E. Witchel, "GPUfs: Integrating File Systems with GPUs," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.
[26] S. Kim, S. Huh, X. Zhang, Y. Hu, A. Wated, E. Witchel, and M. Silberstein, "GPUnet: Networking Abstractions for GPU Programs," in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
[27] J. Owens, J. Stuart, and M. Cox, "GPU-to-CPU Callbacks," in Workshop on Unconventional High Performance Computing, 2010.
[28] S. Bergman, T. Brokhman, T. Cohen, and M. Silberstein, "SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs," in USENIX Annual Technical Conference (ATC), 2017.
[29] AMD, "Radeon Open Compute [Online]," 2017. https://rocm.github.io
[30] AMD, "ROCK_syscall," 2017. https://github.com/RadeonOpenCompute/ROCK_syscall
[31] AMD, "ROCT_syscall," 2017. https://github.com/RadeonOpenCompute/ROCT_syscall
[32] AMD, "HCC_syscall," 2017. https://github.com/RadeonOpenCompute/HCC_syscall
[33] AMD, "Genesys_syscall_test," 2017. https://github.com/RadeonOpenCompute/Genesys_syscall_test
[34] R. Draves, B. Bershad, R. Rashid, and R. Dean, "Using Continuations to Implement Thread Management and Communication in Operating Systems," in Symposium on Operating Systems Principles (SOSP), 1991.
[35] O. Shivers and M. Might, "Continuations and Transducer Composition," in International Symposium on Programming Languages Design and Implementation (PLDI), 2006.
[36] AMD, "Graphics Core Next 3.0 ISA [Online]," 2013. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf
[37] Intel, "Intel Open Source HD Graphics Programmer's Reference Manual [Online]," 2014. https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-bdw-vol02b-commandreference-instructions_0_0.pdf
[38] NVidia, "NVIDIA Compute PTX: Parallel Thread Execution," 2009. https://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf
[39] PCI-SIG, "Atomic Operations [Online]," 2008. https://pcisig.com/sites/default/files/specification_documents/ECN_Atomic_Ops_080417.pdf
[40] T. Sorensen, A. F. Donaldson, M. Batty, G. Gopalakrishnan, and Z. Rakamaric, "Portable Inter-Workgroup Barrier Synchronisation for GPUs," in Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), pp. 39-58, ACM, 2016.
[41] T. Rogers, M. O'Connor, and T. Aamodt, "Cache-Conscious Wavefront Scheduling," in International Symposium on Microarchitecture (MICRO), 2012.
[42] T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," in International Symposium on Microarchitecture (MICRO), 2013.
[43] A. ElTantawy and T. Aamodt, "Warp Scheduling for Fine-Grained Synchronization," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[44] H. Wang, F. Luo, M. Ibrahim, O. Kayiran, and A. Jog, "Efficient and Fair Multi-programming in GPUs via Pattern-Based Effective Bandwidth Management," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[45] O. Kayiran, A. Jog, M. Kandemir, and C. Das, "Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs," in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013.
[46] X. Tang, A. Pattnaik, H. Jiang, O. Kayiran, A. Jog, S. Pai, M. Ibrahim, M. T. Kandemir, and C. R. Das, "Controlled Kernel Launch for Dynamic Parallelism in GPUs," in International Symposium on High Performance Computer Architecture (HPCA), 2017.
[47] AMD, "ROCm: a New Era in Open GPU Computing [Online]," 2016. https://radeonopencompute.github.io/index.html
[48] A. Sasidharan and M. Snir, "MiniAMR - A Miniapp for Adaptive Mesh Refinement," 2014. http://hdl.handle.net/2142/91046
[49] Z. Brown, "Asynchronous System Calls," in Ottawa Linux Symposium, 2007.
[50] J. A. Stuart, P. Balaji, and J. D. Owens, "Extending MPI to Accelerators," in PACT 2011 Workshop Series: Architectures and Systems for Big Data, Oct. 2011.
[51] L. Oden, "Direct Communication Methods for Distributed GPUs," Dissertation Thesis, Ruprecht-Karls-Universitat Heidelberg, 2015.
[52] NVidia, "NVIDIA GPUDirect [Online]," 2013. https://developer.nvidia.com/gpudirect
[53] H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, "GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation," IEEE Transactions on Parallel and Distributed Systems, vol. 25, pp. 2595-2605, Oct. 2014.
[54] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, "Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation," in Proceedings, 11th European PVM/MPI Users' Group Meeting, (Budapest, Hungary), pp. 97-104, September 2004.
[55] IBM, "IBM Spectrum MPI," 2017. http://www-03.ibm.com/systems/spectrum-computing/products/mpi
so dramatically that it triggers GPU timeouts causing theexisting GPU device driver to terminate the applicationBecause of this there is no baseline to compare to as thebaseline simply does not complete
Figure 11 shows two results one for a 3GB RSS water-mark and one for 4GB Not only does GENESYS enableminiAMR to complete it also permits the programmer tobalance memory usage and performance While rss-3GBlowers memory utilization it also worsens runtime comparedto rss-4GB This performance gap is expected the morememory is released the greater the likelihood that the GPUprogram suffers from page faults when the memory is touchedagain in the future and the more frequent the madvise systemcall is invoked Overall Figure 11 illustrates that GENESYSallows programmers to perform memory allocation to tradememory usage and performance on GPUs analogous to CPUs
B Workload Using Signals
0
2
4
6
8
10
12
Runnin
g T
ime (
s)
baseline genesys
Figure 12 Runtime of CPU- GPUmap reduce workload
GENESYS also enables sys-tem calls that permit theGPU to send asynchronousnotifications to the CPUThis is useful in manyscenarios We study onesuch scenario and imple-ment a map-reduce appli-cation called signal-searchThe application runs in twophases The first phase per-forms parallel lookup in a
data array while the second phase processes blocks ofretrieved data and computes sha512 checksums on themThe first phase is a good fit for GPUs since the highlyparallel lookup fits its execution model while the secondphase is more appropriate for CPUs many of which supportperformance acceleration of sha checksums via dedicatedinstructions
Without support for signal invocation on GPUs program-mers would launch a kernel with the entire parallel lookup onthe GPU and wait for it to conclude before proceeding withthe sha512 checksums GENESYS overcomes this restrictionand permits a heterogeneous version of this code whereGPU work-groups can emit signals using rt sigqueueinfo tothe CPU indicating per-block completions of the parallelsearch As a result the CPU can begin processing theseblocks immediately permitting overlap with GPU execution
Operationally rt sigqueueinfo is a generic signal systemcall that allows the caller to fill a siginfo structure thatis passed along with the signal In our workload we findthat work-group level invocations perform best so the GPUpasses the identifier of this work-group through the siginfostructure Figure 12 shows that using work-group invocationgranularity and non-blocking invocation results in roughly14 performance speedup over the baseline
C Storage Workloads
We have also studied how to use GENESYS to supportstorage workloads In some cases GENESYS permits theimplementation of workloads supported by GPUfs but inmore efficient ways
0
2
4
6
8
10
Runnin
g T
ime (
s)
a)
20
22
24
26
CPU-grepGPU-GENESYS (WG)GPU-GENESYS (WI)
GPU-GENESYS (WI)-suspendCPU-openmp
0
5
10
15
20
25
30
35
40
45
Runnin
g T
ime (
s)
b)
GPU-GENESYSGPU-NOSYS
CPU
Figure 13 a) Standard grep OpenMP grep versus work-itemwork-groupinvocations with GENESYS b) Comparing the performance of CPU GPUwith no system call and GENESYS implementations of wordcount Graphsshow average and standard deviation of 10 runs
Supporting storage workloads more efficiently thanGPUfs We implement a workload that takes as input alist of words and a list of files It then performs grep -F -lon the GPU identifying which files contain any of the wordson the list As soon as these files are found they are printedto the console This workload cannot be supported by GPUfswithout significant code refactoring because of its use ofcustom APIs Instead since GENESYS naturally adheres tothe UNIX ldquoeverything is a filerdquo philosophy porting grep toGPUs requires only a few hours of programming
Figure 13 shows the results of our GPU grep experiments. We compare a standard CPU implementation, a parallelized OpenMP CPU implementation, and two GPU implementations with GENESYS, with non-blocking system calls invoked at work-item (WI) and work-group (WG) granularities. Furthermore, since work-item invocations can sometimes achieve better performance using halt-resume (versus work-group and kernel invocations, which always achieve better performance with polling), we separate results for WI-polling and WI-halt-resume. GENESYS enables our GPU grep implementation to print output lines to the console or files using standard OS output. GENESYS achieves 20-23× speedups over an OpenMP version of grep.
Figure 13 also shows that GENESYS' flexibility with invocation granularity, blocking/non-blocking, and strong/relaxed ordering can boost performance (see Section V). For our grep example, because a file only needs to be searched until the first instance of a matching word, a work-item can immediately invoke a write of the filename rather than waiting for all matching files. We find that WI-halt-resume outperforms both WG and WI-polling by roughly 3-4%.
Figure 14: Wordcount I/O and CPU utilization reading from SSD. Graphs show average and standard deviation of 10 runs.
Workload from prior work: GENESYS also supports a version of the same workload that is evaluated in the original GPUfs work: traditional word count using open, read, and close system calls. Figure 13 shows our results. We compare the performance of a parallelized CPU version of the workload with OpenMP, a GPU version of the workload with no system calls, and a GPU version with GENESYS. Both CPU and GPU workloads are configured to search for occurrences of 64 strings. We found that with GENESYS, these system calls were best invoked at work-group granularity with blocking and weak-ordering semantics. All results are collected on a system with an SSD.
We found that GENESYS achieves nearly 6× performance improvement over the CPU version. Without system call support, the GPU version is far worse than the CPU version. Figure 14 sheds light on these benefits. We plot traces for CPU and GPU utilization and disk I/O throughput. GENESYS extracts much higher throughput from the underlying storage device (up to 170 MB/s, compared to the CPU's 30 MB/s). Offloading search-string processing to the GPU frees up the CPU to process system calls effectively. The change in CPU utilization between the GPU workload and CPU workload reveals this trend. In addition, we found that the GPU's ability to launch more concurrent I/O requests enabled the I/O scheduler to make better scheduling decisions.
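The wordcount access pattern uses only the three system calls named above: open, read, and close. A minimal sketch of that chunked read loop follows; the function names are ours, and the carry-over of boundary bytes (so a match straddling two read() chunks is not missed) assumes the search string is shorter than 128 bytes.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

// Count non-overlapping occurrences of `word` in a NUL-terminated buffer.
long count_in_buffer(const char *text, const char *word) {
    long count = 0;
    size_t wlen = strlen(word);
    for (const char *p = text; (p = strstr(p, word)) != NULL; p += wlen)
        count++;
    return count;
}

// Count occurrences in a file using open/read/close. After counting each
// chunk we keep the last wlen-1 bytes, so a match spanning a chunk
// boundary is seen once the next chunk arrives; a kept region shorter
// than wlen can never re-count a completed match.
long count_in_file(const char *path, const char *word) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    size_t wlen = strlen(word);
    char buf[4096 + 128];       // 4 KB chunk + room for carried bytes
    size_t len = 0;             // carried bytes from the previous chunk
    long total = 0;
    ssize_t n;
    while ((n = read(fd, buf + len, 4096)) > 0) {
        len += (size_t)n;
        buf[len] = '\0';
        total += count_in_buffer(buf, word);
        size_t keep = (wlen > 1 && len >= wlen - 1) ? wlen - 1 : 0;
        memmove(buf, buf + len - keep, keep);   // carry boundary bytes
        len = keep;
    }
    close(fd);
    return total;
}
```

In the GENESYS version these same calls are issued from the GPU at work-group granularity, letting many chunks be in flight concurrently.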
D. Network Workloads
We have studied the benefits of GENESYS for network I/O in the context of memcached. While this can technically be implemented using GPUnet, the performance implications are unclear because the original GPUnet paper used APIs that extracted performance using dedicated RDMA hardware. We make no such assumptions and focus on the two core commands, SET and GET. SET stores a key-value pair, and GET retrieves the value associated with a key if it is present. Our implementation supports a binary UDP memcached version with a fixed-size hash table as back-end storage. The hash table is shared between CPU and GPU, and its bucket size and count are configurable. Further, this memcached
Figure 15: Latency and throughput of memcached. Graphs show average and standard deviation of 20 runs.
implementation enables concurrent operations: CPUs can handle SETs and GETs, while the GPU supports only GETs. Our GPU implementation parallelizes the hash computation, bucket lookup, and data copy. We use sendto and recvfrom system calls for UDP network access. These system calls are invoked at work-group granularity with blocking and weak ordering, as this performs best.
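The sendto/recvfrom pattern used for UDP access can be exercised with a small loopback round-trip. This is a CPU-side illustration of the socket calls only, not the GENESYS path or the memcached wire format; the function name and the loopback setup are ours.

```c
#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

// Send `msg` over UDP to a socket bound on loopback, read it back with
// recvfrom, and NUL-terminate the result. Returns bytes received, or -1
// on error. Mirrors the two calls the memcached port issues per request.
ssize_t udp_roundtrip(const char *msg, char *out, size_t outlen) {
    int rx = socket(AF_INET, SOCK_DGRAM, 0);   // receiver ("server")
    int tx = socket(AF_INET, SOCK_DGRAM, 0);   // sender ("client")
    if (rx < 0 || tx < 0)
        return -1;

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                         // kernel picks a free port
    if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;
    socklen_t alen = sizeof(addr);
    getsockname(rx, (struct sockaddr *)&addr, &alen);  // learn the port

    sendto(tx, msg, strlen(msg), 0, (struct sockaddr *)&addr, alen);
    ssize_t n = recvfrom(rx, out, outlen - 1, 0, NULL, NULL);
    if (n >= 0)
        out[n] = '\0';
    close(rx);
    close(tx);
    return n;
}
```

In the GENESYS memcached, a GPU work-group would issue the recvfrom to pick up a GET request, perform the parallel bucket lookup, and answer with sendto.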
Figure 15 compares the performance of a CPU version of this workload with GPU versions using GENESYS. GPUs accelerate memcached by parallelizing lookups on buckets with more elements. For example, Figure 15 shows speedups when there are 1024 elements per bucket (with 1 KB data size). Without system calls, GPU performance lags behind CPU performance. However, GENESYS achieves 30-40% latency and throughput benefits over not just CPU versions but also GPU versions without direct system calls.
E. Device Control
Finally, we also used GENESYS to implement ioctl system calls. As an example, we used ioctl to query and control framebuffer settings. The implementation is straightforward: the GPU opens /dev/fb0 and issues a series of ioctl commands to query and set settings of the active frame buffer. It then proceeds to mmap the framebuffer memory and fill it with data from a previously mmaped raster image. This results in the image being displayed on the computer screen. While not a critical GPGPU application, this ioctl example demonstrates the generality and flexibility of the OS interfaces implemented by GENESYS.
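The open/ioctl/mmap sequence described here uses the stock Linux fbdev interfaces from linux/fb.h; the sketch below is a CPU-side rendering of that sequence (in GENESYS the same calls are issued from GPU code). It fills the visible area with one pixel value instead of a raster image, assumes a 32-bpp mode, and returns -1 when no framebuffer device is available (e.g., headless machines).

```c
#include <fcntl.h>
#include <linux/fb.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int fill_framebuffer(const char *dev, uint32_t pixel) {
    int fd = open(dev, O_RDWR);
    if (fd < 0)
        return -1;

    struct fb_var_screeninfo var;   // resolution, bits per pixel
    struct fb_fix_screeninfo fix;   // line stride, mapped memory length
    if (ioctl(fd, FBIOGET_VSCREENINFO, &var) < 0 ||
        ioctl(fd, FBIOGET_FSCREENINFO, &fix) < 0) {
        close(fd);
        return -1;
    }

    uint8_t *fb = mmap(NULL, fix.smem_len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    if (fb == MAP_FAILED) {
        close(fd);
        return -1;
    }

    // Write one pixel value across the visible area (assumes 32 bpp).
    for (uint32_t y = 0; y < var.yres; y++) {
        uint32_t *row = (uint32_t *)(fb + (size_t)y * fix.line_length);
        for (uint32_t x = 0; x < var.xres; x++)
            row[x] = pixel;
    }

    munmap(fb, fix.smem_len);
    close(fd);
    return 0;
}
```

Replacing the fill loop with a copy from a previously mmaped image file yields the behavior the section describes.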
IX. DISCUSSION
Asynchronous system call handling: GENESYS enqueues the GPU system call's kernel task and processes it outside of the interrupt context. We do this because Linux is designed such that few operations can be processed in an interrupt handler. A potential concern with this design, however, is that it defers system call processing potentially past the end of the lifetime of the GPU thread, and potentially of the process that created the GPU thread itself. It is an example of a more general problem with asynchronous system calls [49].
Figure 16: Raster image copied to the framebuffer by the GPU.
Our solution is to provide a new function call, invoked by the CPU, that ensures all GPU system calls have completed before the termination of the process.
Related work: Beyond work already discussed [25, 26, 28], the latest generations of C++AMP [23] and CUDA [24] provide access to the memory allocator. These efforts use a user-mode service thread on the CPU to proxy requests from the GPU [27]. Like GENESYS, system call requests are placed in a shared queue by the GPU. From there, however, the designs are different: their user-mode thread polls this shared queue and "thunks" the request to the libc runtime or OS. This incurs added overhead for entering and leaving the OS kernel.
Some studies provide network access to GPU code [26, 50-52]. NVidia provides GPUDirect [52], used by several MPI libraries [53-55], which allows the NIC to bypass main memory and communicate directly with memory on the GPU itself. GPUDirect does not provide a programming interface for GPU-side code; the CPU must initiate communication with the NIC. Oden exposed the memory-mapped control interface of the NIC to the GPU and thereby allowed the GPU to directly communicate with the NIC [51]. This low-level interface, however, lacks the benefits of a traditional OS interface (e.g., protection, sockets, TCP).
X. CONCLUSIONS
We shed light on research questions fundamental to the idea of accessing OS services from accelerators by realizing an interface for generic POSIX system call support on GPUs. Enabling such support requires subtle changes to existing kernels. In particular, traditional OSes assume that system call processing occurs in the same context as the invoking thread, and this needs to change for accelerators. We have released GENESYS to make these benefits accessible for broader research on GPUs.
XI. ACKNOWLEDGMENTS
We thank the National Science Foundation, which partially supported this work through grants 1253700 and 1337147. We thank Guilherme Cox and Mark Silberstein for their feedback. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a registered trademark of PCI-SIG Corporation. ARM is a registered trademark of ARM Limited. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. © 2018 Advanced Micro Devices, Inc. All rights reserved.
REFERENCES
[1] S. Mittal and J. Vetter, "A Survey of CPU-GPU Heterogeneous Computing Techniques," in ACM Computing Surveys, 2015.
[2] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. Lefohn, and T. Purcell, "A Survey of General-Purpose Computation on Graphics Hardware," in Computer Graphics Forum, Vol. 26, Num. 1, pp. 80-113, 2007.
[3] D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006.
[4] J. Vesely, A. Basu, M. Oskin, G. Loh, and A. Bhattacharjee, "Observations and Opportunities in Architecting Shared Virtual Memory for Heterogeneous Systems," in International Symposium on Performance Analysis of Systems and Software (ISPASS), 2016.
[5] J. Obert, J. V. Waveran, and G. Sellers, "Virtual Texturing in Software and Hardware," in SIGGRAPH Courses, 2012.
[6] G. Cox and A. Bhattacharjee, "Efficient Address Translation with Multiple Page Sizes," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017.
[7] L. Olson, J. Power, M. Hill, and D. Wood, "Border Control: Sandboxing Accelerators," in International Symposium on Microarchitecture (MICRO), 2007.
[8] B. Pichai, L. Hsu, and A. Bhattacharjee, "Architectural Support for Address Translation on GPUs," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014.
[9] J. Power, M. Hill, and D. Wood, "Supporting x86-64 Address Translation for 100s of GPU Lanes," in International Symposium on High Performance Computer Architecture (HPCA), 2014.
[10] N. Agarwal, D. Nellans, E. Ebrahimi, T. Wenisch, J. Danskin, and S. Keckler, "Selective GPU Caches to Eliminate CPU-GPU HW Cache Coherence," in International Symposium on High Performance Computer Architecture (HPCA), 2016.
[11] HSA Foundation, "HSA Programmer's Reference Manual [Online]," 2015. http://www.hsafoundation.com/?ddownload=4945
[12] G. Kyriazis, "Heterogeneous System Architecture: A Technical Review," 2012.
[13] J. Power, A. Basu, J. Gu, S. Puthoor, B. Beckmann, M. Hill, S. Reinhardt, and D. Wood, "Heterogeneous System Coherence for Integrated CPU-GPU Systems," in International Symposium on Microarchitecture (MICRO), 2013.
[14] S. Sahar, S. Bergman, and M. Silberstein, "Active Pointers: A Case for Software Address Translation on GPUs," in International Symposium on Computer Architecture (ISCA), 2016.
[15] C. J. Thompson, S. Hahn, and M. Oskin, "Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis," in International Symposium on Microarchitecture (MICRO), 2002.
[16] J. Owens, U. Kapasi, P. Mattson, B. Towles, B. Serebrin, S. Rixner, and W. Dally, "Media Processing Applications on the Imagine Stream Processor," in International Conference on Computer Design (ICCD), 2002.
[17] J. Owens, B. Khailany, B. Towles, and W. Dally, "Comparing Reyes and OpenGL on a Stream Architecture," in SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2002.
[18] S. Rixner, W. Dally, U. Kapasi, B. Khailany, A. Lopez-Lagunas, P. Mattson, and J. Owens, "A Bandwidth-Efficient Architecture for Media Processing," in International Symposium on Microarchitecture (MICRO), 1998.
[19] B. Serebrin, J. Owens, B. Khailany, P. Mattson, U. Kapasi, C. Chen, J. Namkoong, S. Crago, S. Rixner, and W. Dally, "A Stream Processor Development Platform," in International Conference on Computer Design (ICCD), 2002.
[20] Khronos Group, "OpenGL," 1992. https://www.khronos.org/opengl
[21] Microsoft, "Programming Guide for DirectX [Online]," 2016. https://msdn.microsoft.com/en-us/library/windows/desktop/ff476455(v=vs.85).aspx
[22] Khronos Group, "OpenCL," 2009. https://www.khronos.org/opencl
[23] Microsoft, "C++ AMP (C++ Accelerated Massive Parallelism) [Online]," 2016. https://msdn.microsoft.com/en-us/library/hh265137.aspx
[24] NVidia, "CUDA Toolkit Documentation," 2016. http://docs.nvidia.com/cuda
[25] M. Silberstein, B. Ford, I. Keidar, and E. Witchel, "GPUfs: Integrating File Systems with GPUs," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.
[26] S. Kim, S. Huh, X. Zhang, Y. Hu, A. Wated, E. Witchel, and M. Silberstein, "GPUnet: Networking Abstractions for GPU Programs," in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
[27] J. Owens, J. Stuart, and M. Cox, "GPU-to-CPU Callbacks," in Workshop on Unconventional High Performance Computing, 2010.
[28] S. Bergman, T. Brokhman, T. Cohen, and M. Silberstein, "SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs," in Usenix Technical Conference (ATC), 2017.
[29] AMD, "Radeon Open Compute [Online]," 2017. https://rocm.github.io
[30] AMD, "ROCK_syscall," 2017. https://github.com/RadeonOpenCompute/ROCK_syscall
[31] AMD, "ROCT_syscall," 2017. https://github.com/RadeonOpenCompute/ROCT_syscall
[32] AMD, "HCC_syscall," 2017. https://github.com/RadeonOpenCompute/HCC_syscall
[33] AMD, "Genesys_syscall_test," 2017. https://github.com/RadeonOpenCompute/Genesys_syscall_test
[34] R. Draves, B. Bershad, R. Rashid, and R. Dean, "Using Continuations to Implement Thread Management and Communication in Operating Systems," in Symposium on Operating Systems Principles (SOSP), 1991.
[35] O. Shivers and M. Might, "Continuations and Transducer Composition," in International Symposium on Programming Languages Design and Implementation (PLDI), 2006.
[36] AMD, "Graphics Core Next 3.0 ISA [Online]," 2013. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf
[37] Intel, "Intel Open Source HD Graphics Programmers Reference Manual [Online]," 2014. https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-bdw-vol02b-commandreference-instructions_0_0.pdf
[38] NVidia, "NVIDIA Compute PTX: Parallel Thread Execution," 2009. https://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf
[39] PCI-SIG, "Atomic Operations [Online]," 2008. https://pcisig.com/sites/default/files/specification_documents/ECN_Atomic_Ops_080417.pdf
[40] T. Sorensen, A. F. Donaldson, M. Batty, G. Gopalakrishnan, and Z. Rakamaric, "Portable Inter-Workgroup Barrier Synchronisation for GPUs," in Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), New York, NY, USA, pp. 39-58, ACM, 2016.
[41] T. Rogers, M. O'Connor, and T. Aamodt, "Cache-Conscious Wavefront Scheduling," in International Symposium on Microarchitecture (MICRO), 2012.
[42] T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," in International Symposium on Microarchitecture (MICRO), 2013.
[43] A. ElTantawy and T. Aamodt, "Warp Scheduling for Fine-Grained Synchronization," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[44] H. Wang, F. Luo, M. Ibrahim, O. Kayiran, and A. Jog, "Efficient and Fair Multi-programming in GPUs via Pattern-Based Effective Bandwidth Management," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[45] O. Kayiran, A. Jog, M. Kandemir, and C. Das, "Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs," in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013.
[46] X. Tang, A. Pattnaik, H. Jiang, O. Kayiran, A. Jog, S. Pai, M. Ibrahim, M. T. Kandemir, and C. R. Das, "Controlled Kernel Launch for Dynamic Parallelism in GPUs," in International Symposium on High Performance Computer Architecture (HPCA), 2017.
[47] AMD, "ROCm: a New Era in Open GPU Computing [Online]," 2016. https://radeonopencompute.github.io/index.html
[48] A. Sasidharan and M. Snir, "MiniAMR - A miniapp for Adaptive Mesh Refinement," 2014. http://hdl.handle.net/2142/91046
[49] Z. Brown, "Asynchronous System Calls," in Ottawa Linux Symposium, 2007.
[50] J. A. Stuart, P. Balaji, and J. D. Owens, "Extending MPI to Accelerators," PACT 2011 Workshop Series: Architectures and Systems for Big Data, Oct. 2011.
[51] L. Oden, "Direct Communication Methods for Distributed GPUs," Dissertation Thesis, Ruprecht-Karls-Universitat Heidelberg, 2015.
[52] NVidia, "NVIDIA GPUDirect [Online]," 2013. https://developer.nvidia.com/gpudirect
[53] H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, "GPU-aware MPI on RDMA-Enabled Clusters: Design, Implementation, and Evaluation," IEEE Transactions on Parallel and Distributed Systems, vol. 25, pp. 2595-2605, Oct. 2014.
[54] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, "Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation," in Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, pp. 97-104, September 2004.
[55] IBM, "IBM Spectrum MPI," 2017. http://www-03.ibm.com/systems/spectrum-computing/products/mpi
0
20
40
60
80
100
0 5 10 15 20 25 30 35 40 45
CPU
uti
lizati
on (
)
Time (s)
GPU-NOSYS GPU-GENESYS CPU
0
40
80
120
160
200
0 5 10 15 20 25 30 35 40 45IO T
hro
ughput
(MB
s)
Time (s)
GPU-NOSYS GPU-GENESYS CPU
Figure 14 Wordcount IO and CPU utilization reading from SSD Graphsshow average and standard deviation of 10 runs
Workload from prior work GENESYS also supports aversion of the same workload that is evaluated in the originalGPUfs work traditional word count where using open readand close system calls Figure 13 shows our results Wecompare the performance of a parallelized CPU versionof the workload with OpenMP a GPU version of theworkload with no system calls and a GPU version withGENESYS Both CPU and GPU workloads are configuredto search for occurrences of 64 strings We found that withGENESYS these system calls were best invoked at work-group granularity with blocking and weak-ordering semanticsAll results are collected on a system with an SSD
We found that GENESYS achieves nearly 6times performanceimprovement over the CPU version Without system callsupport the GPU version is far worse than the CPU versionFigure 14 sheds light on these benefits We plot traces forCPU and GPU utilization and disk IO throughput GENESYSextracts much higher throughput from the underlying storagedevice (up to 170MBs compared to the CPUrsquos 30MBs)Offloading search string processing to the GPU frees up theCPU to process system calls effectively The change in CPUutilization between the GPU workload and CPU workloadreveals this trend In addition we found that the GPUrsquosability to launch more concurrent IO requests enabled theIO scheduler to make better scheduling decisions
D Network Workloads
We have studied the benefits of GENESYS for network IOin the context of memcached While this can technically beimplemented using GPUnet the performance implicationsare unclear because the original GPUnet paper used APIsthat extracted performance using dedicated RDMA hardwareWe make no such assumptions and focus on the two corecommands ndash SET and GET SET stores a key-value pair andGET retrieves a value associated with a key if it is presentOur implementation supports a binary UDP memcachedversion with a fixed-size hash table as a back-end storage Thehash table is shared between CPU and GPU and its bucketsize and count are configurable Further this memcached
0
05
1
15
2
25
misses
hits
milliseconds
Latency
CPUGPU-GENESYS
GPU-NOSYS
0
10000
20000
30000
40000
50000
60000
misses
hits
Opss
Throughput
CPUGPU-GENESYS
GPU-NOSYS
Figure 15 Latency and throughput of memcached Graphs show averageand standard deviation of 20 runs
implementation enables concurrent operations CPUs canhandle SETs and GETs while the GPU supports only GETsOur GPU implementation parallelizes the hash computationbucket lookup and data copy We use sendto and recvfromsystem calls for UDP network access These system calls areinvoked at work-group granularity with blocking and weakordering as this performs best
Figure 15 compares the performance of a CPU versionof this workload with GPU versions using GENESYS GPUsaccelerate memcached by parallelizing lookups on bucketswith more elements For example Figure 15 shows speedupswhen there are 1024 elements per bucket (with 1KB datasize) Without system calls GPU performance lags behindCPU performance However GENESYS achieves 30-40latency and throughput benefits over not just CPU versionsbut also GPU versions without direct system calls
E Device Control
Finally we also used GENESYS to implement ioctl systemcalls As an example we used ioctl to query and controlframebuffer settings The implementation is straightforwardthe GPU opens devfb0 and issues a series of ioctl commandsto query and set settings of the active frame buffer It thenproceeds to mmap the framebuffer memory and fill it withdata from a previously mmaped raster image This resultsin the image displayed on computer screen While not acritical GPGPU application this ioctl example demonstratesthe generality and flexibility of OS interfaces implementedby GENESYS
IX DISCUSSION
Asynchronous system call handling GENESYS enqueuesthe GPU system callrsquos kernel task and processes it outside ofthe interrupt context We do this because Linux is designedsuch that few operations can be processed in an interrupthandler A potential concern with this design however isit defers the system call processing to potentially past theend of the life-time of the GPU thread and potentially theprocess that created the GPU thread itself It is an example ofa more general problem with asynchronous system calls [49]
Figure 16 Raster image copied to the framebuffer by the GPU
Our solution is to provide a new function call invoked bythe CPU that ensures all GPU system calls have completedbefore the termination of the process
Related work Beyond work already discussed [25 26 28]the latest generation of C++AMP [23] and CUDA [24]provide access to the memory allocator These efforts usea user-mode service thread on the CPU to proxy requestsfrom the GPU [27] Like GENESYS system call requests areplaced in a shared queue by the GPU From there howeverthe designs are different Their user-mode thread polls thisshared queue and ldquothunksrdquo the request to the libc runtimeor OS This incurs added overhead for entering and leavingthe OS kernel
Some studies provide network access to GPU code [26 50ndash52] NVidia provides GPUDirect [52] used by several MPIlibraries [53ndash55] that allows the NIC to bypass main memoryand communicate directly to memory on the GPU itselfGPUDirect does not provide a programming interface forGPU-side code The CPU must initiate communication withthe NIC Oden exposed the memory-mapped control interfaceof the NIC to the GPU and thereby allowed the GPU todirectly communicate with the NIC [51] This low-levelinterface however lacks the benefits of a traditional OSinterface (eg protection sockets TCP)
X CONCLUSIONS
We shed light on research questions fundamental to theidea of accessing OS services from accelerators by realizingan interface for generic POSIX system call support on GPUsEnabling such support requires subtle changes of existingkernels In particular traditional OSes assume that systemcall processing occurs in the same context as the invokingthread and this needs to change for accelerators We havereleased GENESYS to make these benefits accessible forbroader research on GPUs
XI ACKNOWLEDGMENTS
We thank the National Science Foundation which partiallysupported this work through grants 1253700 and 1337147We thank Guilherme Cox and Mark Silberstein for their
feedback AMD the AMD Arrow logo and combinationsthereof are trademarks of Advanced Micro Devices Inc PCIeis a registered trademark of PCI-SIG Corporation ARM is aregistered trademark of ARM Limited OpenCLTM is a trade-mark of Apple Inc used by permission by Khronos Otherproduct names used in this publication are for identificationpurposes only and may be trademarks of their respectivecompanies ccopy 2018 Advanced Micro Devices Inc All rightsreserved
REFERENCES
[1] S Mittal and J Vetter ldquoA Survey of CPU-GPU HeterogeneousComputing Techniquesrdquo in ACM Computing Surveys 2015
[2] J Owens D Luebke N Govindaraju M Harris J Kruger A Lefohnand T Purcell ldquoA Survey of General-Purpose Computation onGraphics Hardwarerdquo in Computer Graphics Forum Vol 26 Num1 pp 80-113 2007
[3] D Tarditi S Puri and J Oglesby ldquoAcelerator Using Data Paral-lelism to Program GPUs for General-Purpose Usesrdquo in InternationalConference on Architectural Support for Programming Languages andOperating Systems (ASPLOS) 2006
[4] J Vesely A Basu M Oskin G Loh and A Bhattacharjee ldquoObser-vations and Opportunities in Architecting Shared Virtual Memory forHeterogeneous Systemsrdquo in International Symposium on PerformanceAnalysis of Systems and Software (ISPASS) 2016
[5] J Obert J V Waveran and G Sellers ldquoVirtual Texturing in Softwareand Hardwarerdquo in SIGGRAPH Courses 2012
[6] G Cox and A Bhattacharjee ldquoEfficient Address Translation with Mul-tiple Page Sizesrdquo in International Conference on Architectural Supportfor Programming Languages and Operating Systems (ASPLOS) 2017
[7] L Olson J Power M Hill and D Wood ldquoBorder Control Sandbox-ing Acceleratorsrdquo in International Symposium on Microarchitecture(MICRO) 2007
[8] B Pichai L Hsu and A Bhattacharjee ldquoArchitectural Supportfor Address Translation on GPUsrdquo in International Conference onArchitectural Support for Programming Languages and OperatingSystems (ASPLOS) 2014
[9] J Power M Hill and D Wood ldquoSupporting x86-64 AddressTranslation for 100s of GPU Lanesrdquo in International Symposiumon High Performance Computer Architecture (HPCA) 2014
[10] N Agarwal D Nellans E Ebrahimi T Wenisch J Danskin andS Keckler ldquoSelective GPU Caches to Eliminate CPU-GPU HWCache Coherencerdquo in International Symposium on High PerformanceComputer Architecture (HPCA) 2016
[11] H Foundation ldquoHSA Programmerrsquos Reference Manual [Online]rdquo 2015httpwwwhsafoundationcomddownload=4945
[12] G Kyriazis ldquoHeterogeneous System Architecture A Technical Reviewrdquo2012
[13] J Power A Basu J Gu S Puthoor B Beckmann M HillS Reinhardt and D Wood ldquoHeterogeneous System Coherencefor Integrated CPU-GPU Systemsrdquo in International Symposium onMicroarchitecture (MICRO) 2013
[14] S Sahar S Bergman and M Silberstein ldquoActive Pointers A Case forSoftware Address Translation on GPUsrdquo in International Symposiumon Computer Architecture (ISCA) 2016
[15] C J Thompson S Hahn and M Oskin ldquoUsing Modern GraphicsArchitectures for General-Purpose Computing A Framework andAnalysisrdquo in International Symposium on Microarchitecture (MICRO)2002
[16] J Owens U Kapasi P Mattson B Towles B Serebrin S Rixnerand W Dally ldquoMedia Processing Applications on the Imagine StreamProcessorrdquo in International Conference on Computer Design (ICCD)2002
[17] J Owens B Khailany B Towles and W Dally ldquoComparingReyes and OpenGL on a Stream Architecturerdquo in SIGGRAPHEUROGRAPHICS Conference on Graphics Hardware 2002
[18] S Rixner W Dally U Kapasi B Khailany A Lopez-LagunasP Mattson and J Owens ldquoA Bandwidth-Efficient Architecture for
Media Processingrdquo in International Symposium on Microarchitecture(MICRO) 1998
[19] B Serebrin J Owens B Khailany P Mattson U Kapasi C ChenJ Namkoong S Crago S Rixner and W Dally ldquoA Stream ProcessorDevelopment Platformrdquo in International Conference on ComputerDesign (ICCD) 2002
[20] K Group ldquoOpenGLrdquo 1992 httpswwwkhronosorgopengl[21] Microsoft ldquoProgramming Guide for DirectX [Online]rdquo 2016
httpsmsdnmicrosoftcomen-uslibrarywindowsdesktopff476455(v=vs85)aspx
[22] K Group ldquoOpenCLrdquo 2009 httpswwwkhronosorgopencl[23] Microsoft ldquoC++ AMP (C++ Accelerated Massive Parallelism) [On-
line]rdquo 2016 httpsmsdnmicrosoftcomen-uslibraryhh265137aspx[24] NVidia ldquoCUDA Toolkit Documentationrdquo 2016 rdquohttpdocsnvidia
comcudardquo[25] M Silberstein B Ford I Keidar and E Witchel ldquoGPUfs integrating
file systems with GPUsrdquo in International Conference on Architec-tural Support for Programming Languages and Operating Systems(ASPLOS) 2013
[26] S Kim S Huh X Zhang Y Hu A Wated E Witchel and M Sil-berstein ldquoGPUnet Networking Abstractions for GPU Programsrdquo inUSENIX Symposium on Operating Systems Design and Implementation(OSDI) 2014
[27] J Owens J Stuart and M Cox ldquoGPU-to-CPU Callbacksrdquo inWorkshop on Unconventional High Performance Computing 2010
[28] S Bergman T Brokhman T Cohen and M Silberstein ldquoSPINSeamless Operating System Integration of Peer-to-Peer DMA BetweenSSDs and GPUsrdquo in Usenix Technical Conference (ATC) 2017
[29] AMD ldquoRadeon Open Compute [Online]rdquo 2017 httpsrocmgithubio
[30] AMD ldquoROCK syscallrdquo 2017 httpsgithubcomRadeonOpenComputeROCK syscall
[31] AMD ldquoROCT syscallrdquo 2017 httpsgithubcomRadeonOpenComputeROCT syscall
[32] AMD ldquoHCC syscallrdquo 2017 httpsgithubcomRadeonOpenComputeHCC syscall
[33] AMD ldquoGenesys syscall testrdquo 2017 httpsgithubcomRadeonOpenComputeGenesys syscall test
[34] R Draves B Bershad R Rashid and R Dean ldquoUsing Continuationsto Implement Thread Management and Communication in OperatingSystemsrdquo in Symposium on Operating Systems Principles (SOSP)1991
[35] O Shivers and M Might ldquoContinuations and Transducer Compositionrdquoin International Symposium on Programming Languages Design andImplementation (PLDI) 2006
[36] AMD ldquoGraphics Core Next 30 ISA [Online]rdquo 2013httpamd-devwpenginenetdna-cdncomwordpressmedia201307AMD GCN3 Instruction Set Architecturepdf
[37] Intel ldquoIntel Open Source HD Graphics Programmers Reference Manual[Online]rdquo 2014 https01orgsitesdefaultfilesdocumentationintel-gfx-prm-osrc-bdw-vol02b-commandreference-instructions 0 0pdf
[38] NVidia ldquoNVIDIA Compute PTX Parallel Thread Executionrdquo 2009httpswwwnvidiacomcontentCUDA-ptx isa 14pdf
Figure 16: Raster image copied to the framebuffer by the GPU.
Our solution is to provide a new function call, invoked by the CPU, that ensures all GPU system calls have completed before the termination of the process.
Related work: Beyond work already discussed [25, 26, 28], the latest generation of C++AMP [23] and CUDA [24] provide access to the memory allocator. These efforts use a user-mode service thread on the CPU to proxy requests from the GPU [27]. Like GENESYS, system call requests are placed in a shared queue by the GPU. From there, however, the designs are different. Their user-mode thread polls this shared queue and "thunks" the request to the libc runtime or OS. This incurs added overhead for entering and leaving the OS kernel.
Some studies provide network access to GPU code [26, 50-52]. NVidia provides GPUDirect [52], used by several MPI libraries [53-55], which allows the NIC to bypass main memory and communicate directly with memory on the GPU itself. GPUDirect does not provide a programming interface for GPU-side code; the CPU must initiate communication with the NIC. Oden exposed the memory-mapped control interface of the NIC to the GPU and thereby allowed the GPU to directly communicate with the NIC [51]. This low-level interface, however, lacks the benefits of a traditional OS interface (e.g., protection, sockets, TCP).
X. CONCLUSIONS
We shed light on research questions fundamental to the idea of accessing OS services from accelerators by realizing an interface for generic POSIX system call support on GPUs. Enabling such support requires subtle changes to existing kernels. In particular, traditional OSes assume that system call processing occurs in the same context as the invoking thread, and this needs to change for accelerators. We have released GENESYS to make these benefits accessible for broader research on GPUs.
XI. ACKNOWLEDGMENTS
We thank the National Science Foundation, which partially supported this work through grants 1253700 and 1337147. We thank Guilherme Cox and Mark Silberstein for their feedback. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a registered trademark of PCI-SIG Corporation. ARM is a registered trademark of ARM Limited. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. © 2018 Advanced Micro Devices, Inc. All rights reserved.
REFERENCES
[1] S. Mittal and J. Vetter, "A Survey of CPU-GPU Heterogeneous Computing Techniques," in ACM Computing Surveys, 2015.
[2] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. Lefohn, and T. Purcell, "A Survey of General-Purpose Computation on Graphics Hardware," in Computer Graphics Forum, Vol. 26, Num. 1, pp. 80-113, 2007.
[3] D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006.
[4] J. Vesely, A. Basu, M. Oskin, G. Loh, and A. Bhattacharjee, "Observations and Opportunities in Architecting Shared Virtual Memory for Heterogeneous Systems," in International Symposium on Performance Analysis of Systems and Software (ISPASS), 2016.
[5] J. Obert, J. V. Waveren, and G. Sellers, "Virtual Texturing in Software and Hardware," in SIGGRAPH Courses, 2012.
[6] G. Cox and A. Bhattacharjee, "Efficient Address Translation with Multiple Page Sizes," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017.
[7] L. Olson, J. Power, M. Hill, and D. Wood, "Border Control: Sandboxing Accelerators," in International Symposium on Microarchitecture (MICRO), 2015.
[8] B. Pichai, L. Hsu, and A. Bhattacharjee, "Architectural Support for Address Translation on GPUs," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014.
[9] J. Power, M. Hill, and D. Wood, "Supporting x86-64 Address Translation for 100s of GPU Lanes," in International Symposium on High Performance Computer Architecture (HPCA), 2014.
[10] N. Agarwal, D. Nellans, E. Ebrahimi, T. Wenisch, J. Danskin, and S. Keckler, "Selective GPU Caches to Eliminate CPU-GPU HW Cache Coherence," in International Symposium on High Performance Computer Architecture (HPCA), 2016.
[11] HSA Foundation, "HSA Programmer's Reference Manual [Online]," 2015. http://www.hsafoundation.com/?ddownload=4945
[12] G. Kyriazis, "Heterogeneous System Architecture: A Technical Review," 2012.
[13] J. Power, A. Basu, J. Gu, S. Puthoor, B. Beckmann, M. Hill, S. Reinhardt, and D. Wood, "Heterogeneous System Coherence for Integrated CPU-GPU Systems," in International Symposium on Microarchitecture (MICRO), 2013.
[14] S. Sahar, S. Bergman, and M. Silberstein, "Active Pointers: A Case for Software Address Translation on GPUs," in International Symposium on Computer Architecture (ISCA), 2016.
[15] C. J. Thompson, S. Hahn, and M. Oskin, "Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis," in International Symposium on Microarchitecture (MICRO), 2002.
[16] J. Owens, U. Kapasi, P. Mattson, B. Towles, B. Serebrin, S. Rixner, and W. Dally, "Media Processing Applications on the Imagine Stream Processor," in International Conference on Computer Design (ICCD), 2002.
[17] J. Owens, B. Khailany, B. Towles, and W. Dally, "Comparing Reyes and OpenGL on a Stream Architecture," in SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2002.
[18] S. Rixner, W. Dally, U. Kapasi, B. Khailany, A. Lopez-Lagunas, P. Mattson, and J. Owens, "A Bandwidth-Efficient Architecture for Media Processing," in International Symposium on Microarchitecture (MICRO), 1998.
[19] B. Serebrin, J. Owens, B. Khailany, P. Mattson, U. Kapasi, C. Chen, J. Namkoong, S. Crago, S. Rixner, and W. Dally, "A Stream Processor Development Platform," in International Conference on Computer Design (ICCD), 2002.
[20] Khronos Group, "OpenGL," 1992. https://www.khronos.org/opengl
[21] Microsoft, "Programming Guide for DirectX [Online]," 2016. https://msdn.microsoft.com/en-us/library/windows/desktop/ff476455(v=vs.85).aspx
[22] Khronos Group, "OpenCL," 2009. https://www.khronos.org/opencl
[23] Microsoft, "C++ AMP (C++ Accelerated Massive Parallelism) [Online]," 2016. https://msdn.microsoft.com/en-us/library/hh265137.aspx
[24] NVidia, "CUDA Toolkit Documentation," 2016. http://docs.nvidia.com/cuda
[25] M. Silberstein, B. Ford, I. Keidar, and E. Witchel, "GPUfs: Integrating File Systems with GPUs," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.
[26] S. Kim, S. Huh, X. Zhang, Y. Hu, A. Wated, E. Witchel, and M. Silberstein, "GPUnet: Networking Abstractions for GPU Programs," in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
[27] J. Owens, J. Stuart, and M. Cox, "GPU-to-CPU Callbacks," in Workshop on Unconventional High Performance Computing, 2010.
[28] S. Bergman, T. Brokhman, T. Cohen, and M. Silberstein, "SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs," in USENIX Annual Technical Conference (ATC), 2017.
[29] AMD, "Radeon Open Compute [Online]," 2017. https://rocm.github.io
[30] AMD, "ROCK_syscall," 2017. https://github.com/RadeonOpenCompute/ROCK_syscall
[31] AMD, "ROCT_syscall," 2017. https://github.com/RadeonOpenCompute/ROCT_syscall
[32] AMD, "HCC_syscall," 2017. https://github.com/RadeonOpenCompute/HCC_syscall
[33] AMD, "Genesys_syscall_test," 2017. https://github.com/RadeonOpenCompute/Genesys_syscall_test
[34] R. Draves, B. Bershad, R. Rashid, and R. Dean, "Using Continuations to Implement Thread Management and Communication in Operating Systems," in Symposium on Operating Systems Principles (SOSP), 1991.
[35] O. Shivers and M. Might, "Continuations and Transducer Composition," in International Symposium on Programming Languages Design and Implementation (PLDI), 2006.
[36] AMD, "Graphics Core Next 3.0 ISA [Online]," 2013. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf
[37] Intel, "Intel Open Source HD Graphics Programmers Reference Manual [Online]," 2014. https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-bdw-vol02b-commandreference-instructions_0_0.pdf
[38] NVidia, "NVIDIA Compute PTX: Parallel Thread Execution," 2009. https://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf
[39] PCI-SIG, "Atomic Operations [Online]," 2008. https://pcisig.com/sites/default/files/specification_documents/ECN_Atomic_Ops_080417.pdf
[40] T. Sorensen, A. F. Donaldson, M. Batty, G. Gopalakrishnan, and Z. Rakamaric, "Portable Inter-Workgroup Barrier Synchronisation for GPUs," in Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2016), (New York, NY, USA), pp. 39-58, ACM, 2016.
[41] T. Rogers, M. O'Connor, and T. Aamodt, "Cache-Conscious Wavefront Scheduling," in International Symposium on Microarchitecture (MICRO), 2012.
[42] T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," in International Symposium on Microarchitecture (MICRO), 2013.
[43] A. ElTantawy and T. Aamodt, "Warp Scheduling for Fine-Grained Synchronization," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[44] H. Wang, F. Luo, M. Ibrahim, O. Kayiran, and A. Jog, "Efficient and Fair Multi-programming in GPUs via Pattern-Based Effective Bandwidth Management," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[45] O. Kayiran, A. Jog, M. Kandemir, and C. Das, "Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs," in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013.
[46] X. Tang, A. Pattnaik, H. Jiang, O. Kayiran, A. Jog, S. Pai, M. Ibrahim, M. T. Kandemir, and C. R. Das, "Controlled Kernel Launch for Dynamic Parallelism in GPUs," in International Symposium on High Performance Computer Architecture (HPCA), 2017.
[47] AMD, "ROCm: a New Era in Open GPU Computing [Online]," 2016. https://radeonopencompute.github.io/index.html
[48] A. Sasidharan and M. Snir, "MiniAMR - A miniapp for Adaptive Mesh Refinement," 2014. http://hdl.handle.net/2142/91046
[49] Z. Brown, "Asynchronous System Calls," in Ottawa Linux Symposium, 2007.
[50] J. A. Stuart, P. Balaji, and J. D. Owens, "Extending MPI to Accelerators," in PACT 2011 Workshop Series: Architectures and Systems for Big Data, Oct. 2011.
[51] L. Oden, "Direct Communication Methods for Distributed GPUs," Dissertation Thesis, Ruprecht-Karls-Universitat Heidelberg, 2015.
[52] NVidia, "NVIDIA GPUDirect [Online]," 2013. https://developer.nvidia.com/gpudirect
[53] H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, "GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation," IEEE Transactions on Parallel and Distributed Systems, vol. 25, pp. 2595-2605, Oct. 2014.
[54] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, "Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation," in Proceedings, 11th European PVM/MPI Users' Group Meeting, (Budapest, Hungary), pp. 97-104, September 2004.
[55] IBM, "IBM Spectrum MPI," 2017. http://www-03.ibm.com/systems/spectrum-computing/products/mpi