Center for Computation & Technology
CCT Technical Report Series CCT-TR-2008-1

A General Relativistic Evolution Code on CUDA Architectures

Burkhard Zink
Center for Computation & Technology and Department of Physics & Astronomy
Louisiana State University, Baton Rouge, LA 70803 USA


Posted January 2008. cct.lsu.edu/CCT-TR/CCT-TR-2008-1
The author(s) retain all copyright privileges for this article and accompanying materials. Nothing may be copied or republished without the written consent of the author(s).


A general relativistic evolution code on CUDA architectures

Burkhard Zink
Center for Computation and Technology, and
Department of Physics and Astronomy,
Louisiana State University,
Baton Rouge, LA 70803, USA

ABSTRACT

I describe the implementation of a finite-differencing code for solving Einstein's field equations on a GPU, and measure speed-ups compared to a serial code on a CPU for different parallelization and caching schemes. Using the most efficient scheme, the (single precision) GPU code on an NVIDIA Quadro FX 5600 is shown to be up to 26 times faster than a serial CPU code running on an AMD Opteron 2.4 GHz. Even though the actual speed-ups in production codes will vary with the particular problem, the results obtained here indicate that future GPUs supporting double-precision operations can potentially be a very useful platform for solving astrophysical problems.

1. INTRODUCTION

The high parallel processing performance of graphics processing units (GPUs), with current models achieving peak performances of up to 350 GFlop/s (for single precision floating-point operations), has been used traditionally to transform, light and rasterize triangles in three-dimensional computer graphics applications. In recent architectures, however, the vectorized pipeline for processing triangles has been replaced by a unified scalar processing model based on a large set of stream processors. This change has initiated a consideration of GPUs for solving general purpose computing problems, and triggered the field of general-purpose computing on graphics processing units (GPGPU).

High-performance, massively parallel computing is one of the major tools for the scientific community to understand and quantify problems not amenable to traditional analytical techniques, and has led to ever-increasing hardware performance requirements for tackling more and more advanced questions. Therefore, GPGPU appears to be a natural target for scientific and engineering applications, many of which admit highly parallel algorithms which are already used on current-generation supercomputers based on multi-core CPUs.

In this technical report, I will describe an implementation of one of the most challenging problems in computational physics, solving Albert Einstein's field equations for general relativistic gravitation, on a graphics processing unit. The primary purpose is to make an estimate of potential performance gains in comparison to current CPUs, and gain an understanding of architectural requirements for middleware solutions serving the needs of the scientific community, most notably the Cactus Computational Toolkit [4, 3, 9].

The particular GPU used for these experiments is an NVIDIA G80 series card (Quadro FX 5600). NVIDIA has also released a software development kit called CUDA (compute unified device architecture) [8] for development of GPU code using an extension of the C language. As opposed to earlier attempts to program GPUs with the shader language supplied for graphics applications, this makes it easier to port existing general-purpose computation code to the target device.

Section 2 contains a description of the G80 hardware, the CUDA architectural model, and the performance considerations important for GPU codes. Then, in Section 3, we will turn to the particular problem of solving Einstein's field equations, which we will approach from a mostly algorithmic (as opposed to physical) point of view. Section 4 describes the structure and implementation of the code, and Section 5 discusses the benchmarking results obtained. I give a discussion of the results, and an outlook, in Section 6.

2. CUDA AND THE G80 ARCHITECTURE

2.1 The NVIDIA G80 architecture

The NVIDIA G80 hardware [7] is the foundation of the GeForce 8 series consumer graphics cards, the Quadro FX 4600 and 5600 workstation cards, and the new Tesla 870 set of GPGPU boards. G80 represents the third major architectural change for NVIDIA's line of graphics accelerators. Traditionally, GPU accelerators enhance the transformation and rendering of simple geometric shapes, usually triangles. The processing pipeline consists of transformation and lighting (now vertex shading), triangle setup, pixel shading, raster operations (blending, z-buffering, anti-aliasing) and the output to the frame buffer for scan-out to the display. First and second generation GPU architectures typically process these steps with special purpose hardware in a pipelined fashion.


Number of multiprocessors (MP)            16
Number of stream processors per MP        8
Warp size (see text)                      32
Parallel data cache                       16 kB
Number of banks in parallel data cache    16
Number of 32-bit registers per MP         8192
Clock frequency of each MP                1.35 GHz
Frame buffer memory type                  GDDR3
Frame buffer interface width              384 bits
Frame buffer size                         1.5 GB
Constants memory size                     64 kB
Clock frequency of the board              800 MHz
Host bus interface                        PCI Express

Table 1: Technical specifications of a Quadro FX 5600 GPU.

The increase in demand for programmability of illumination models and geometry modifiers, and also load-balancing requirements with respect to vertex and pixel shading operations, has led to more generality in GPU design. The current G80 architecture consists of a parallel set of stream processors with a full set of integer and (up to FP32) floating point instructions. When processing triangles and textures, the individual steps of the processing pipeline are dynamically mapped to these processors, and since the operations are highly parallelizable, the scheduler can consistently maintain a high load.

Physically, eight stream processors (SP) are arranged in a multiprocessor with texture filtering and addressing units, a texture cache, a set of registers, a cache for constants, and a parallel data cache. Each multiprocessor is operated by an instruction decoding unit which executes a particular command in a warp: the same command is executed on all SPs for a set of clock cycles (because the instruction units and the SP ALUs have different clock speeds). This constitutes a minimal unit of SIMD computation on the multiprocessor called the warp size, and will be important later when considering code efficiency on the architecture.

The GPU card contains a set of such multiprocessors, with the number depending on the particular model (e.g., one in the GeForce 8400M G, 12 in the GeForce 8800 GTS, and 16 in the GeForce 8800 GTX, Quadro FX 5600 and Tesla 870 models). The multiprocessors are operated by a thread scheduling unit with fast switching capabilities. In addition, the board has frame buffer memory and an extra set of memory for constants. The particular numbers for the Quadro FX 5600 are listed in Table 1, and Fig. 1 shows a diagram of the architecture (cf. also [7]).

The actual peak performance of the card depends on how many operations can be performed in one cycle. The stream processors technically support one MAD (multiply-add) and one MUL (multiply) per cycle, which would correspond to 1.35 GHz * 3 * 128 = 518.4 GFlop/s. However, not all execution units can be used simultaneously, so a conservative estimate is to assume one MAD per cycle, leading to a peak performance of 345.6 GFlop/s.

2.2 CUDA (compute unified device architecture)

Since the G80 is based on general purpose stream processors with a high peak performance, it appears to be a natural target for general purpose parallel computing, and in particular scientific applications. NVIDIA has recently released the CUDA SDK [8] for running parallel computations on the device hardware. While earlier attempts to use GPUs for general-purpose computing had to use the various shading languages, e.g. Microsoft's HLSL or NVIDIA's Cg, CUDA is based on an extension of the C language.

The SDK consists of a compiler (nvcc), host and device runtime libraries, and a driver API. Architecturally, the driver builds the primary layer on top of the device hardware, and provides interfaces which can be accessed either by the CUDA runtime or the application. Furthermore, the CUDA runtime can be accessed by the application and the service libraries (currently for BLAS and FFT).

CUDA’s parallelization model is a slight abstraction of theG80 hardware. Threads are arranged into blocks, whereeach block is executed on only one multiprocessor. There-fore, within a block, additional thread context and synchro-nization options exist (by use of the shared resources onthe chip), whereas no global synchronization is available be-tween blocks. A set of blocks constitute a SIMD computekernel. Kernel calls themselves are asynchronous to the hostCPU: they return immediately after issuance. Currentlyonly one kernel can be executed at any time on the device,which is a limitation of the driver software.

The thread-block model hides the particular number of multiprocessors and stream processors from the CUDA kernel insofar as the block grid is partially serialized into batches. Each multiprocessor can execute more than one block and more threads than the warp size per multiprocessor to hide device memory access latency. However, the limitations of this model are implicit performance characteristics: if the number of blocks is lower than the number of multiprocessors, or the number of threads is lower than the warp size, significant performance degradation will occur. Therefore, the physical configuration is only abstracted insofar as the kernel compiles and runs on different GPU configurations, but not in terms of achieving maximal performance.

Within the kernel thread code, the context provides a logical arrangement of blocks in one- or two-dimensional sets, and threads per block in one-, two-, and three-dimensional sets, which is convenient for the kind of numerical grid application we will present below since it avoids additional (costly) modulo operations inside the thread. Threads support local variables and access to the shared memory (which maps to the data parallel cache in G80), which is common for all threads within a block.

Figure 1: Simplified diagram of the G80 architecture. The GPU contains several multiprocessors (MPs) which are operated by a common thread scheduler with fast switching capabilities. Each multiprocessor can run several threads using stream processors (SPs) which share a common instruction unit, a set of registers and a data parallel cache. A memory bus connects the MPs to the frame buffer memory and a constants memory.

The kernel is called from the host code via a special syntax which specifies the block grid size and threads per block, and can be synchronized back to the host via a runtime function. Therefore, the host CPU can perform independent computations or issue additional kernels while the GPU operates. This may be interesting for delayed analysis and output processes performed on the data, although the bandwidth for transferring from the frame buffer (where grid variables will be stored for evolution) to the host memory has to be taken into account.
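As a concrete illustration of the calling pattern just described, a host-side launch with the CUDA runtime might look like the following sketch. The kernel name, its argument, and the grid dimensions are placeholders chosen for this example, not part of the actual code.

// Minimal placeholder kernel; only the launch and synchronization pattern matters here.
__global__ void evolve_kernel(float *g11) { }

void call_evolution(float *d_g11, int nx, int ny)
{
    dim3 threads(8, 8);                         // 64 threads per block
    dim3 blocks(nx / 8, ny / 8);                // two-dimensional block grid over the (x, y) plane
    evolve_kernel<<<blocks, threads>>>(d_g11);  // asynchronous: returns immediately to the host
    cudaThreadSynchronize();                    // block the host until the kernel has finished
}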

2.3 Performance optimization considerations

To achieve high performance of the GPU for a particular application problem, often approximately measured in terms of a speed-up compared to a serial code on the host CPU, the kernel would usually encompass those operations which are highly data-parallelizable due to the SIMD concept of CUDA.

An efficient port needs to maximize the ratio of arithmetic over device memory operations, the arithmetic intensity, since access to the frame buffer involves large latencies (up to 600 cycles on the G80) because it is not cached. One common strategy is to first copy selected parts of the device memory which are needed by all threads to the shared memory, then operate on them, and finally write the results back to device memory. At the same time, many threads can be started concurrently on the multiprocessors to hide the access latencies.
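A minimal sketch of this copy-compute-write pattern, for a generic one-dimensional array rather than the evolution kernel itself (the names, the tile size, and the trivial arithmetic are placeholders):

#define TILE 64

// Each block stages TILE values in the parallel data cache (shared memory),
// operates only on the cached copy, and writes the results back to device memory.
__global__ void staged_kernel(const float *in, float *out)
{
    __shared__ float cache[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;

    cache[threadIdx.x] = in[i];                // one device memory read per thread (base + thread id)
    __syncthreads();                           // make the staged data visible to the whole block

    float result = 2.0f * cache[threadIdx.x];  // arithmetic acts on the cached data only
    out[i] = result;                           // one device memory write per thread
}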

Each block issued to a multiprocessor reserves memory for the local variables of all threads in the block, and shared memory according to the allocation made in the kernel. The number of blocks and threads running concurrently on the multiprocessor is limited by the shared memory space (16 kB on the Quadro FX 5600), which implies a trade-off between hiding device memory access latency by multi-threading and effective cache size per thread.

Since threads are operated on in warps, i.e. each of the stream processors in a multiprocessor operates sequentially on batches of SIMD instructions, the resulting warp size is the minimum number of threads per block for efficient parallelization, and the actual number should be an integer multiple of it. In addition to the upper bound on the cache size by concurrent issuance of blocks, this gives a problem-dependent lower bound on the number of threads in a block. NVIDIA also states [8] that the optimal number of threads per block is actually 64 in the G80 architecture, to allow the nvcc compiler to reduce the number of register bank conflicts in an optimal way.

On a G80 card, actual FP32 operations like multiply (MUL) and multiply+add (MAD) cost 4 cycles per warp, i.e. one cycle per thread, reciprocals cost 16 cycles, and for certain mathematical functions like sine and cosine microcode is available which performs the evaluation in 32 cycles. Control flow instructions are to be avoided if possible, since in many cases they may have to be serialized.

Accessing shared memory by threads should avoid bank conflicts to ensure maximal bandwidth. This requires the threads of a warp to read and write to the shared memory in a certain way, since consecutive addresses (e.g. obtained from a base address and an offset given by the thread id) are distributed into several independent partial warps for maximum bandwidth. There is, however, also a broadcast mechanism which can distribute data from one bank to all threads in the warp.

Finally, accessing device memory can be optimized by coalescing operations, usually again in the form base address plus thread-dependent offset. In addition, the base address should be aligned to a multiple of the half-warp size times the type size in bytes for coalescing the read/write operation.

All these performance requirements come into play when optimizing a code for high speed-ups. The experiments below will demonstrate that the actual speed-up can be very sensitive to the particular implementation of the memory management. While non-optimized implementations for problems with high arithmetic intensity may already achieve a significant speed-up, a full use of the GPU's floating point performance requires some thought. We will see that the nature of the problem also imposes trade-offs between different performance requirements which strongly affect the maximum speed-up, so that e.g. even comparing implementations of different finite-difference problems can produce completely different speed-ups, since there is a trade-off between local cache space (i.e. number of grid variables) and the number of threads in a block.

3. THE ALGORITHM TO SOLVE EINSTEIN'S FIELD EQUATIONS

In its best-known form, Einstein's field equation for general relativity can be written as

Gμν = 8π Tμν    (1)

where μ, ν = 0 . . . 3. Gμν are the sixteen components of the Einstein tensor, and Tμν are the sixteen components of the energy-momentum tensor. Both functionally depend on the spacetime metric tensor gμν, which describes the local curvature of spacetime and is therefore the central object of general relativity, since curvature relates to physical effects like light bending and the attraction of bodies by gravitation.

We will, however, not be concerned with the physical interpretation of these equations here (for an introduction see [11]), but only with the requirements to formulate a finite-differencing evolution algorithm from them. The field equations as formulated in eqn. 1 can be transformed into a regular initial-boundary value problem (i.e., a set of partial differential equations in time and space) for twelve variables: the six components of the three-metric, gij, i, j = 1 . . . 3, and the six components of the extrinsic curvature, Kij. In addition, the equations contain the four free gauge functions α and βi as parameters, which are usually also treated as evolutionary variables.

The usual approach to solving these equations proceeds as follows:

1. Define a spatial domain to solve on, e.g. give a range of coordinates [xlow, xhigh] × [ylow, yhigh] × [zlow, zhigh] on which to solve the problem.

2. Discretize the domain in some appropriate way. We will only discuss uniform Cartesian discretizations here (also known as uni-grids), but there are supercomputer implementations using adaptive mesh refinement and multiple blocks, e.g. in the Cactus Computational Toolkit.

3. On each discrete cell, specify the initial data by setting values for the evolutionary variables. Usually, a set of cells directly outside of the domain, called ghost zones, is used for setting boundary conditions.

4. The evolution proper proceeds by a loop over time steps. In each time step, the right hand side of the equation ∂tA(x, y, z) = RHS needs to be evaluated, where A is a grid variable and RHS is a function which depends on A and other grid variables, their spatial derivatives, and potentially free parameters and the coordinate location.

5. Since the right hand side depends on the spatial derivatives, finite-differencing operations have to be performed on the grid. This usually involves some local stencil of variables around the cell, e.g. ∂xAi ≈ (Ai+1 − Ai−1)/(2Δx) for the second-order accurate central approximation to the first x derivative of A at position i (a code sketch follows this list).

6. Having obtained the right hand side, the set of evolution equations is advanced with a technique to discretely approximate ordinary differential equations, e.g. a Runge-Kutta step. This obtains the values of grid functions at time t + Δt from their values at t. The Runge-Kutta algorithm generates partial time approximations during the course of its operation, and therefore requires a set of scratch levels having the size of the main grid variables.

7. Finally, new boundary conditions need to be imposed after the time update. This usually involves operations on the ghost zones and their immediate neighbors, e.g. by extrapolation.
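To illustrate step 5, the second-order central x derivative can be written as a small helper function. This is a sketch only; the flattened index ordering i + nx*(j + ny*k) and the function name are assumptions, not taken from the actual code.

// Second-order accurate central difference of grid function A in the x direction,
// assuming a flattened Cartesian array with index i + nx*(j + ny*k) and spacing dx.
float diff_x(const float *A, int i, int j, int k, int nx, int ny, float dx)
{
    int idx = i + nx * (j + ny * k);
    return (A[idx + 1] - A[idx - 1]) / (2.0f * dx);
}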

The actual representation of Einstein's field equations used here is the so-called ADM formalism (after Arnowitt, Deser and Misner [1]). This is not a commonly used method nowadays, since it tends to produce numerical instabilities, but it will suffice for demonstrating how to port such a code to a CUDA environment, and it is a good choice for a prototype implementation due to its simplicity. More advanced schemes (NOK-BSSN [6, 10, 2] and the generalized harmonic formalism [5]) contain more variables, which has consequences for the available set of caching schemes (see below), but should fundamentally be portable to CUDA in a similar manner. The particular test cases below operate on a simple dynamical test problem, a gauge wave, and use static boundaries.

4. IMPLEMENTATION OF EINSTEIN'S FIELD EQUATIONS IN CUDA

While it is possible to develop a GPU code directly, there are advantages to first implementing a stand-alone CPU implementation in C, and then porting this to CUDA. In particular, debugging the host code is easier, and it will yield a fairer assessment of speed-up as compared to a device code in emulation mode (the CUDA C compiler nvcc can also compile code which actually executes on the host processor, mostly for debugging purposes).

This CPU code is a C language implementation of the algorithm mentioned in Section 3 for three spatial dimensions, with a second-order Runge-Kutta scheme for time integration and second-order accurate first and second central finite-differencing operators. The actual right hand side was extracted from a particular implementation of the ADM system in Cactus. The code performs these steps:

1. Allocate memory for all grid variables, and the same amount for the scratch space variables needed by the Runge-Kutta scheme.

2. Write initial data into the grid variables. The exact data is irrelevant unless it produces NaNs by discrete instabilities during the evolution. The data used here is a Gaussian gauge pulse in the x direction.

3. Perform the main loop (a schematic sketch follows this list):

(a) Swap the pointers between the grid variables and their scratch counterparts.

(b) Call the evolution function. This function loops over all grid cells (three nested loops, one for each direction), and in each cell (i) calculates all partial derivatives by finite differencing in second order, (ii) calculates a number of additional temporary variables only needed locally, and (iii) writes the evolved grid functions into the scratch space.

(c) The swapping and evolution are repeated, now for the second half step involved in the RK2 scheme.

(d) Output the evolved data to the disk.

4. Release allocated storage.
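A schematic of the loop structure in step 3. The helper names (swap_pointers, evolve, write_output) are placeholders for the routines described above, not the actual functions:

// Placeholder declarations for the routines described in the text.
void swap_pointers(float **a, float **b);
void evolve(const float *in, float *out, float dt);
void write_output(const float *data);

// Schematic main loop: two swap/evolve passes per iteration for the
// second-order Runge-Kutta scheme, followed by optional output.
void main_loop(float **grid, float **scratch, int n_iter, int out_every, float dt)
{
    for (int it = 0; it < n_iter; it++) {
        for (int half = 0; half < 2; half++) {
            swap_pointers(grid, scratch);   // (a) the previous result becomes the current data
            evolve(*grid, *scratch, dt);    // (b)/(c) finite differences, local temporaries,
                                            //         evolved grid functions written to scratch
        }
        if (it % out_every == 0)
            write_output(*scratch);         // (d) output the most recently evolved data
    }
}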

The only relevant target for parallelization is the evolution routine inside the main loop, since it consumes most of the wall clock time in almost all situations (unless there is 3D output at every time step), and because it naturally lends itself to parallel computation. Therefore, we will take a closer look at its operations.

Einstein's field equations are some of the most complex classical field equations in nature, and therefore codes solving them naturally have a high arithmetic intensity. For evaluating a single right hand side, hundreds of floating point operations act on only a few grid functions in device memory and their locally obtained partial derivatives. Therefore, as soon as the partial derivatives are obtained, parallelization can easily proceed by assigning each cell to a single thread and performing the evaluation locally. The resulting write operations to device memory only involve that particular cell and therefore do not produce concurrency conflicts (though they may produce bank conflicts).

For the finite difference evaluation, each cell accesses its immediate neighbors and reads grid variables from that cell, operates on them, and writes the result into local variables which are only defined inside the innermost loop. While it is possible to calculate all finite differences outside of the loop, in an extra step, to logically decouple the (semantically different) finite difference and evolution operations, this implies multiplying the required storage by more than a factor of four (three additional variables for each directional derivative, and storage for second derivatives for some variables). Translated to a CUDA device, this would involve additional operations on device memory and therefore reduce performance.

This approach can be translated directly into a GPU device code by applying the following changes to the CPU implementation:

1. Allocate the grid functions on the device memory (frame buffer) in addition to allocating them on the host memory. The host memory allocation is still useful for setting up initial data and output of data to the disk.

2. After writing initial data on the host memory, copy it to the device memory using runtime functions.

3. In the main loop, perform the swap operations on the device memory pointers.

4. Replace the evolution function by a CUDA kernel which is distributed to the device multiprocessors.

5. Since every evolution half-step depends on results from the previous one, synchronize with the CUDA device after calling the kernel. This time span could potentially also be used for asynchronous output.

6. If output is requested in a particular iteration, copy the relevant grid variables back to host memory, and then perform the output as usual.

4.1 Stage 1 parallelization

The CUDA kernel function could in principle be ported from the CPU code by using this scheme: divide any two-dimensional orthogonal projection of the three-dimensional Cartesian grid, i.e. any of the planes (x, y), (x, z) or (y, z), into independent blocks of the same size. Each thread in the block then operates on a one-dimensional slab in the remaining direction, i.e. the three nested loops in the evolution code are replaced by one loop over the remaining direction. The cell index calculation needed in the kernel is made easy by CUDA's support of two-dimensional block and thread indices, which directly correspond to the two-dimensional domain decomposition. With this algorithm, the problem is equally and independently parallelized, and were it not for the additional considerations imposed by the memory access latencies, the approach would be already optimal. For future reference, I will call it a Stage 1 approach, since it is a direct translation of the CPU code while reducing the appropriate number of nested loops, and since it does not take advantage of the data parallel cache of the G80 multiprocessors. Fig. 2 illustrates the parallelization technique.
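A sketch of such a Stage 1 kernel for a single grid function (illustrative only; the names, the index ordering, and the placeholder right hand side are assumptions, not the actual code):

// Stage 1 sketch: each thread walks one column in the z direction and reads from and
// writes to device memory directly; the data parallel cache is not used.
__global__ void evolve_stage1(const float *g11_in, float *g11_out,
                              int nx, int ny, int nz, float dz, float dt)
{
    // The two-dimensional block and thread indices map directly onto the (x, y) plane.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= nx || j >= ny)
        return;

    for (int k = 1; k < nz - 1; k++) {   // loop over the remaining direction
        int idx = i + nx * (j + ny * k);
        // A central z derivative stands in for the full finite differencing and
        // right hand side evaluation of the ADM equations.
        float dz_g11 = (g11_in[idx + nx * ny] - g11_in[idx - nx * ny]) / (2.0f * dz);
        g11_out[idx] = g11_in[idx] + dt * dz_g11;
    }
}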

A first efficiency improvement for the G80 can be obtained by adjusting the block size, i.e. the number of threads in each block, to a multiple of 64, which is a multiple of the warp size (a strong requirement for high performance) and also allows the compiler to avoid register conflicts. (This restricts the possible grid sizes to a multiple of 8 in each direction of the two-dimensional decomposition for optimal parallelization; we will later see that the best scheme actually only requires a multiple of 4.) Even without considering shared memory at this stage, this puts a significant constraint on how many registers are actually available per thread. On the G80, there are 8192 registers per multiprocessor, i.e. we can store 128 32-bit words locally per thread. The partial derivatives and other helper variables used in the ADM code already need about 100 per thread. More complicated codes may need to either (i) reduce the number of local helper variables at the cost of increased arithmetic operations, or (ii) reduce the block size. Usually, experiments need to be done to establish which option is preferable in practice. It would seem intuitive that reducing the number of threads to the warp size is better, but the high latency of device memory accesses can be hidden much more effectively by more threads, which can therefore easily outweigh the additional local arithmetic cost involved in implicitly repeating operations usually mapped to helper variables.

Instead of operating on one-dimensional columns, it is also possible to decompose the grid into cubes (as is usual for MPI parallelization schemes), and therefore have each thread evolve exactly one cell. Since the block decomposition by CUDA is logically either one- or two-dimensional, there are two options for doing this: (i) use the block decomposition as before, but start a thread for each cell separately, or (ii) let the kernel only operate on a grid slab of defined thickness and call it repeatedly from the host. Option (i) is impractical due to the limitations of available registers, so we are left with option (ii), which is illustrated in Fig. 3. From the outset, it is unclear how this compares to the column decomposition used before. However, the Stage 2 parallelization will need to operate on blocks as opposed to columns, as discussed below.

4.2 Stage 2 parallelization

So far, the kernel has not made use of the shared memory (or data parallel cache in terms of the G80 architecture). Since the frame buffer accesses are un-cached, it is necessary to implement a manual cache algorithm for this. Unfortunately, the shared memory is even more limited than the register space: 16 kB per multiprocessor on the G80, which translates into only 256 bytes per thread when using the recommended block size of 64.

The most obvious targets for a caching algorithm are the finite difference operations, since they repeatedly use the same data (the value of quantity A at cell (i, j, k) is accessed by six neighboring cells for second-order accurate first partial derivatives, and by an additional eight cells for second derivatives). To perform these operations from the cache while avoiding conditional statements in the kernel, a number of ghost zones outside the operated cube need to be stored as well. Generally, this number will be reduced for equally sized cubes, which suggests taking 4 × 4 × 4 as the interior cube and one ghost cell, resulting in 64 threads per block (double the warp size, and the suggested block size for efficient register usage) and 6 × 6 × 6 = 216 cells to store in the cache. Therefore, with this scheme, we can store at most 18 FP32 variables per cell. This is enough for the ADM grid functions (depending on the choice of gauge, there are 12 to 16), but not enough for more complicated schemes or more general systems. In those cases, the only option would be to reduce the size of the cube and the number of threads, which likely incurs a loss of performance.

Figure 2: Stage I column parallelization. The grid, here as an example of 16 × 16 × 16 cells (boundary ghost zones are not drawn), is decomposed in a plane into blocks, which correspond to the conceptual thread blocks in CUDA. Each block is 8 × 8 cells large to obtain a block size (number of threads per block) which is a multiple of the warp size and also optimized for reducing register bank conflicts. Each thread operates on a one-dimensional column, i.e. the kernel contains a loop in the remaining direction. As a Stage I scheme, the operations do not use the data parallel cache on the multiprocessors.

Figure 3: Stage I/II block parallelization. Instead of using columns, the grid is first decomposed into slabs of a certain height, here marked by blue lines, each slab is further distributed into blocks corresponding to the CUDA blocks, and finally each cube of size 4 × 4 × 4 is parallelized into 64 threads, which is optimal for the same reasons stated in Fig. 2. In contrast to the column scheme, the kernel does not contain a loop over cells for purposes of calculating the evolution, though a loop is used to copy data to the parallel cache for the Stage II parallelization (see text and Fig. 4). The kernel is called by the host repeatedly to operate on all slabs in the grid.

The transfer of data from the frame buffer to the parallel cache is handled by each thread separately. Were it only for the interior cube without ghost zones, an efficient scheme would be to copy all data for exactly the cell which is operated on by this thread. However, the ghost zones also need to be set up, for which the most obvious scheme is to use conditional statements which act on the boundary surfaces. This is obviously not very efficient, since (i) many threads will not participate in these operations, and (ii) conditional statements are often serialized into the minimal instruction size (the warp size) and will therefore reduce performance even more.

A better approach is this: recompute the three-dimensional thread index into a linear index, and then use integer division and subtraction operations to obtain the three-dimensional index corresponding to the cube including ghost zones; the two cache copy approaches are illustrated in Fig. 4. The operations for this are:

__shared__ CCTK_REAL g11S[SHARED_ARRAY_SIZE], ...

// Block index
bx = blockIdx.x;
by = blockIdx.y;

// Thread index
tx = threadIdx.x;
ty = threadIdx.y;
tz = threadIdx.z;

// Copy data to shared memory
iL = bx * THREADS_PER_BLOCK_X;
jL = by * THREADS_PER_BLOCK_Y;
kL = kstart - 1;
tid = tx + THREADS_PER_BLOCK_X * (ty + THREADS_PER_BLOCK_Y * tz);

// -- first read/write
o = tid / (SHARED_ARRAY_X * SHARED_ARRAY_Y);
res = tid - o * SHARED_ARRAY_X * SHARED_ARRAY_Y;
n = res / SHARED_ARRAY_X;
m = res - n * SHARED_ARRAY_X;
index = (iL+m) + GRID_NX * ((jL+n) + GRID_NY * (kL+o));
indexS = m + SHARED_ARRAY_X * (n + SHARED_ARRAY_Y * o);

g11S[indexS] = g11[index];
...

// -- second read/write
tid += THREADS_PER_BLOCK_X * THREADS_PER_BLOCK_Y * THREADS_PER_BLOCK_Z;

...

// -- third read/write
tid += THREADS_PER_BLOCK_X * THREADS_PER_BLOCK_Y * THREADS_PER_BLOCK_Z;

...

// -- fourth read/write
tid += THREADS_PER_BLOCK_X * THREADS_PER_BLOCK_Y * THREADS_PER_BLOCK_Z;

if (tid < SHARED_ARRAY_X * SHARED_ARRAY_Y * SHARED_ARRAY_Z) {

  o = tid / (SHARED_ARRAY_X * SHARED_ARRAY_Y);
  res = tid - o * SHARED_ARRAY_X * SHARED_ARRAY_Y;
  n = res / SHARED_ARRAY_X;
  m = res - n * SHARED_ARRAY_X;
  index = (iL+m) + GRID_NX * ((jL+n) + GRID_NY * (kL+o));
  indexS = m + SHARED_ARRAY_X * (n + SHARED_ARRAY_Y * o);

  g11S[indexS] = g11[index];
  ...
}

Here, g11 is the grid array for the variable g11, g11S is its shared memory equivalent, and the preprocessor constants describe the size of the shared array (SHARED_ARRAY_X/Y/Z = 6), the block size (THREADS_PER_BLOCK_X/Y/Z = 4), and the global grid size (GRID_NX/NY/NZ).

Now every thread copies data for and evolves different cells. Since there are more cells including ghost zones than threads, several such operations need to be performed (in the case of a 4 × 4 × 4 cube and a ghost size of 1, we need four cycles). Each time, the effective thread index tid is increased by the block size, so that the last read operation either (i) contains a conditional statement for excluding invalid indices, or (ii) the grid arrays are artificially enlarged to accommodate copying the data without segmentation faults. Experiments have shown that option (i) does not lead to a degradation in performance, and it is preferable due to simplicity and reduced memory usage.

To increase coalescence for device memory operations and reduce bank conflicts for shared memory writes in these operations, it is important to operate on consecutive addresses. This is most easily done by doing the block decomposition in the (x, y) plane, since then the index ordering scheme we use accesses addresses in the form base address plus index for the shared memory (avoiding bank conflicts), and within each x direction read access to the device memory can be coalesced. An even more efficient scheme could possibly be obtained by (i) ordering device memory in a way that more reads can be coalesced at once, and (ii) adjusting the base address to a multiple of the half warp size times the FP32 size. These have not been implemented here and would require non-trivial changes to the data structures.

Figure 4: Schemes for caching the frame buffer data in the data parallel cache of a GPU multiprocessor. The diagrams show two-dimensional slices of the 4 × 4 × 4 cubes for illustration purposes. Each thread is assigned to one interior cell, but it needs data from adjacent cells for calculating finite differences. The interior portion of the cube is delimited by red lines, while the cube including ghost cells is indicated by blue lines. A direct scheme would first have each thread copy data for its cell into the cached array, and then, by conditional statements, assign threads to the ghost cells. However, this forces the compiler to serialize parts of the copy operation. A more efficient scheme operates on the data parallel cache arrays in a linear addressing mode, such that threads do not necessarily copy their assigned cells. The linear scheme also reduces cache bank conflicts more efficiently.

After the partial derivatives have been obtained using the cached data, all local operations are performed (calculation of helper variables, right hand sides, and time update). The results can be written directly into device memory, but, as a final improvement, we can reuse the shared memory space by first writing the evolved variables into it, and then rewriting the results back to device memory in an extra set of grouped instructions. A direct write would look like

g11[index] = (local instructions)
g12[index] = (local instructions)
...
K33[index] = (local instructions)

whereas a delayed write is obtained by

g11S[indexS] = (local instructions)
g12S[indexS] = (local instructions)
...
K33S[indexS] = (local instructions)

// Coalesced write-out
g11[index] = g11S[indexS];
g12[index] = g12S[indexS];
...
K33[index] = K33S[indexS];

It would seem that the latter code contains more instructions and is therefore at a disadvantage to the former one; however, in the latter case, the compiler can coalesce the write instruction across all threads in a warp, and since the device memory operations are more expensive by orders of magnitude than all other operations, this can lead to a gain in performance.

In summary, the parallelization scheme on the CUDA device involves:

1. Identifying the most computationally intensive and parallelizable parts of the code. In the case discussed here, this is the evolution step inside the main loop.

2. Restructuring the parallel problem in units which correspond to the block/thread scheme used by CUDA, with particular note of the warp size and the register bank optimization requirements.

3. In the compute kernel, implementing a cache algorithm using the shared memory. Since the minimum effective number of threads is limited by the warp size, there may be trade-offs involved between the number of concurrent threads on a multiprocessor and the number of variables to cache.

4. Implementing an efficient scheme to copy the data from device memory to the cache in a thread-based order, with as few memory bank conflicts as possible, and by using coalesced device memory operations.

5. Synchronizing all threads in a block after the cache operation.

6. Performing all local operations, and writing the results back into shared memory.

7. Writing the shared memory data back into device memory in a coalesced operation.
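Putting steps 3 to 7 together, the Stage II kernel body has roughly the following shape. This is a schematic outline rather than the actual kernel; its main purpose is to show where the block-wide synchronization of step 5 sits relative to the code fragments given earlier (CCTK_REAL and SHARED_ARRAY_SIZE are as in those fragments).

__global__ void evolve_stage2(/* pointers to the grid functions in device memory */)
{
    __shared__ CCTK_REAL g11S[SHARED_ARRAY_SIZE];   // one shared array per cached grid function
    // ... further shared arrays ...

    // Step 4: copy the 6 x 6 x 6 cube (interior cells plus ghost zones) from device
    // memory into the shared arrays, using the linear addressing scheme shown above.
    // ...

    __syncthreads();   // Step 5: all cached data must be in place before differencing.

    // Step 6: finite differences from the cache, local helper variables, right hand
    // sides and the Runge-Kutta update; results are written back into the shared arrays.
    // ...

    // Step 7: coalesced write of the evolved variables from shared memory back to
    // device memory.
    // ...
}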

5. PERFORMANCE RESULTS

The performance measurements were all conducted on the NVIDIA Quadro Plex 1000 VCS cluster qp at NCSA. The cluster consists of one login node and 16 compute nodes with two dual-core AMD Opteron 2.4 GHz processors, 8 GB of main memory, and four NVIDIA Quadro FX 5600 GPUs with 1.5 GB frame buffer memory each, making a total of 6 GB of frame buffer memory per node.

For comparison purposes, we measure the speed-up provided by one Quadro GPU compared to a serial code running on the CPU (which consequently only uses one core of the CPU). Both codes use single-precision floating-point arithmetic, and are compiled using the GNU C compiler version 4.1.2 (revision 20070626) and the nvcc compiler provided by CUDA version 1.0. In both cases, level three optimizations have been performed.

The first benchmark is illustrated in Fig. 5. The codes operate on a 128 × 128 × 128 Cartesian grid (130 × 130 × 130 including boundary ghost zones), and perform 100 iterations. The boundary data is static, and analysis and output operations are not performed during the measuring cycle. Only the main evolution is measured, which excludes allocation, deallocation and device setup operations; those take roughly 3 seconds in total for CUDA and diminish in relation to the main loop for more iterations.

As the diagram shows, speed-ups of up to 23 can be achieved with a cached algorithm and coalesced write-out to the frame buffer. Not using the data-parallel cache results in many frame buffer accesses during the evaluation of the finite differences, inducing large latencies which can only partially be hidden by the thread scheduler due to the large register count needed by each thread.

The effect of the grid size is shown in Fig. 6. This demonstrates that larger local grid sizes lead to higher speed-ups. The frame buffer on the Quadro FX 5600 (1.5 GB) can store up to about 230 × 230 × 230 cells for this particular computational problem; for these problem sizes, speed-ups of about 26 can be achieved. This is a useful result for parallelizing larger problems across multiple nodes (via MPI), since the ratio of boundary ghost zones to interior cells diminishes with grid size, leading to a reduction of the communication overhead relative to computational operations in the interior.

As mentioned in Section 4, the Stage IIb parallelization scheme caches all local grid variables by combined read operations. If the available shared space is not sufficient to store all variables, one could either (i) reduce the block size, or (ii) keep the block size, but read the variables separately. The latter option is possible because the finite difference operators are decoupled, and the local operations only act on the results stored in the registers. Fig. 7 shows the effect of these choices on the GPU code performance. In comparison to the Stage IIb 128 × 128 × 128 reference point with a speed-up of 23, reducing the block size to the warp size (32) results in a performance reduction of about 45%, while the individual caching of variables reduces performance by about 32%. Above all else, this shows how sensitively the actual speed-up depends on the memory access scheme.

Figure 5: Speed-up of the general relativistic evolution code on an NVIDIA Quadro FX 5600 GPU, compared to the serial code running on an AMD Opteron 2.4 GHz. The code performs 100 iterations on a Cartesian grid with 128 × 128 × 128 cells. The left-most bar is the reference result on the host CPU, with a wall clock time of 430.3 seconds normalized to a speed-up of 1. The Stage Ia result is for the GPU code using a column decomposition (cf. Section 4), and the Stage Ib result is using a block decomposition, both without making use of the data parallel cache. Stage IIa is using the data parallel cache for read operations, and Stage IIb for read operations and a coalesced write operation.

Figure 6: Speed-up for different grid sizes. Each grid size is separately compared to the serial CPU code, which provides the normalization for comparison. The GPU parallelization is more efficient for larger grids, leading to speed-ups of up to 26.

Figure 7: Speed-up with different caching strategies to accommodate more grid variables than fit into the data parallel cache. The reference point is the Stage IIb result for a 128 × 128 × 128 grid. Reducing the block size, i.e. the number of threads per block, increases the available shared memory per thread and thus allows more variables to be stored. However, it also induces a significant performance hit, as shown in the figure. Another approach is to treat all variables separately for purposes of taking finite differences, instead of caching them all in advance. This still introduces an increased overhead, but performs better than the former option.

6. DISCUSSION AND OUTLOOK

The benchmark results in Section 5 clearly demonstrate that a GPU can potentially enhance the computation of finite difference codes solving Einstein's field equations by significant amounts. It should be noted that the architecture under review here is limited to single-precision floating point operations, and it is expected that actual speed-ups in future GPUs performing double-precision operations are lower. However, even a speed-up on the order of 10 is quite useful for practical purposes, since it increases productivity and reduces turn-around times for test problems by an order of magnitude. Also, future parallel supercomputers may include GPU hardware which should be taken advantage of.

The speed-ups measured here have compared a serial code to a single GPU parallel code. Clearly, the actual speed-up in a particular workstation setup would effectively compare CPU-parallelized (e.g. OpenMP, MPI) against GPU-parallelized situations, or even combinations where CPUs and GPUs are used at the same time. The ratio of CPU cores to GPUs is actually one in the cluster we have been using, so the general ratio of GPU vs CPU code should be of a similar order of magnitude, assuming the synchronization between the GPUs does not turn out to be overly expensive.

Another scenario is to use GPUs for the grid evolution code, while off-loading tasks which could potentially be done asynchronously (analysis, output) to the host CPUs. In this case, both resources could be used effectively at the same time, and the associated speed-ups compared to a pure CPU code would be even higher. All this requires copying data from the frame buffer to the main memory, however, which may turn out to be an additional bottleneck in the code's operation.

A problem with porting codes to GPUs is that a certain amount of expertise and experimentation is required from the researcher, and the efficiency of the parallel code is not always easy to predict. Also, there are tight memory constraints on the multiprocessors, which will become even more important for double-precision floating point operations. Since the performance of the code depends strongly on the memory access scheme as demonstrated above, and since limitations in memory determine the list of available caching schemes, there is a non-trivial influence of the particular finite-differencing problem on the solution algorithm.

The importance of middleware solutions which are able to efficiently use GPUs and clusters of combined CPUs/GPUs is therefore expected to increase, since they can hide most of the data structures and optimization details from application scientists or business users. The Cactus Computational Toolkit [4, 3, 9] is an example of a middleware which already provides abstractions for MPI-parallelized scientific codes and is being used widely in production work. It is planned to extend Cactus in a way that it can make use of combined CPU/GPU clusters with high performance, while providing the end user with a simple and unified interface to solve problems. In more general terms, a middleware like Cactus is useful to abstract different hardware architectures and even programming paradigms from the scientific problem at hand.

7. ACKNOWLEDGMENTS

The author would like to thank Gabrielle Allen, Daniel Katz, John Michalakes, and Erik Schnetter for discussion and comments. Calculations have been performed on the Quadro Plex 1000 VCS cluster at NCSA, with special thanks to Jeremy Enos for providing timely support with this machine.

8. REFERENCES

[1] R. Arnowitt, S. Deser, and C. W. Misner. The dynamics of general relativity. In L. Witten, editor, Gravitation: An introduction to current research, pages 227–265. John Wiley, New York, 1962.

[2] T. W. Baumgarte and S. L. Shapiro. On the numerical integration of Einstein's field equations. Phys. Rev. D, 59:024007, 1999.

[3] Cactus Computational Toolkit home page, http://www.cactuscode.org/.

[4] T. Goodale, G. Allen, G. Lanfermann, J. Masso, T. Radke, E. Seidel, and J. Shalf. The Cactus framework and toolkit: Design and applications. In Vector and Parallel Processing – VECPAR'2002, 5th International Conference, Lecture Notes in Computer Science, Berlin, 2003. Springer.

[5] L. Lindblom, M. A. Scheel, L. E. Kidder, R. Owen, and O. Rinne. A new generalized harmonic evolution system. Class. Quantum Grav., 23:S447–S462, 2006.

[6] T. Nakamura, K. Oohara, and Y. Kojima. General relativistic collapse to black holes and gravitational waves from black holes. Prog. Theor. Phys. Suppl., 90:1–218, 1987.

[7] NVIDIA. NVIDIA GeForce 8800 GPU Architecture Overview, 2006.

[8] NVIDIA. CUDA Programming Guide, Version 1.0, 2007.

[9] J. Shalf, E. Schnetter, G. Allen, and E. Seidel. Cactus as benchmarking platform. CCT Technical Report Series, CCT-TR-2006-3, 2006.

[10] M. Shibata and T. Nakamura. Evolution of three-dimensional gravitational waves: Harmonic slicing case. Phys. Rev. D, 52:5428, 1995.

[11] R. M. Wald. General relativity. The University of Chicago Press, Chicago, 1984.
