    Accelerating S3D: A GPGPU Case Study

Kyle Spafford(a,*), Jeremy Meredith(a), Jeffrey Vetter(a), Jacqueline Chen(b), Ray Grout(b), Ramanan Sankaran(a)

(a) Oak Ridge National Laboratory, 1 Bethel Valley Road MS 6173, Oak Ridge, TN 37831
(b) Combustion Research Facility, Sandia National Laboratories, Livermore, CA 94551

(*) Corresponding author. Email address: [email protected] (Kyle Spafford)

Preprint submitted to Parallel Computing, December 8, 2009

    Abstract

The graphics processor (GPU) has evolved into an appealing choice for high performance computing due to its superior memory bandwidth, raw processing power, and flexible programmability. As such, GPUs represent an excellent platform for accelerating scientific applications. This paper explores a methodology for identifying applications which present significant potential for acceleration. In particular, this work focuses on experiences from accelerating S3D, a high-fidelity turbulent reacting flow solver. The acceleration process is examined from a holistic viewpoint, and includes details that arise from different phases of the conversion. This paper also addresses the issue of floating point accuracy and precision on the GPU, a topic of immense importance to scientific computing. Several performance experiments are conducted, and results are presented from the NVIDIA Tesla C1060 GPU. We generalize from our experiences to provide a roadmap for deploying existing scientific applications on heterogeneous GPU platforms.

Keywords: Graphics Processors, Heterogeneous Computing, Computational Chemistry

    1. Introduction

Strong market forces from the gaming industry and increased demand for high definition, real-time 3D graphics have been the driving forces behind the GPU's incredible transformation. Over the past several years, increases in the memory bandwidth and the speed of floating point computation of GPUs have steadily outpaced those of CPUs. In a relatively short period of time, the GPU has evolved from an arcane, highly-specialized hardware component into a remarkably flexible and powerful parallel coprocessor.

    1.1. GPU Hardware

Originally, GPUs were designed to perform a limited collection of operations on a large volume of independent geometric data. These operations fell into only two main categories (vertex and fragment) and were highly parallel and computationally intense, resulting in a highly specialized design with multiple cores and small caches. As graphical tasks became more diverse, the demand for flexibility began to influence GPU designs. GPUs transitioned from a fixed function design, to one which allowed limited programmability of its two specialized pipelines, and eventually to an approach where all its cores were of a unified, more flexible type, supporting much greater control from the programmer.

The NVIDIA Tesla C1060 GPU was designed specifically for high performance computing. It boasts an impressive thirty streaming multiprocessors, each composed of eight stream processors, for a total of 240 processor cores running at 1.3 GHz. Each multiprocessor has 16 KB of shared memory, which can be accessed as quickly as a register if managed properly. The Tesla C1060 has 4 GB of global memory, as well as supplementary cached constant and texture memory. Perhaps the most exciting feature of the Tesla is its support for native double precision floating point operations, which are tremendously important for scientific computing. Single precision computations were sufficient for the graphics computations which GPUs were initially intended to solve, and double precision, a relatively new feature in GPUs, is dramatically slower than single precision. In order to achieve high performance on GPUs, one must use careful memory management and exploit hardware-specific features. This otherwise daunting task is simplified by CUDA.
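As a concrete illustration, the CUDA runtime can report these specifications directly. The following is a minimal sketch of ours (not from S3D) using the standard cudaGetDeviceProperties call:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // query device 0
        printf("%s: %d multiprocessors @ %.2f GHz\n",
               prop.name, prop.multiProcessorCount,
               prop.clockRate / 1.0e6);  // clockRate is reported in kHz
        printf("shared memory per block: %zu KB, global memory: %.1f GB\n",
               prop.sharedMemPerBlock / 1024,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        // Native double precision requires compute capability 1.3 or
        // higher, which the Tesla C1060 provides.
        printf("compute capability: %d.%d\n", prop.major, prop.minor);
        return 0;
    }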

    1.2. CUDA

The striking performance numbers of modern GPUs have resulted in a surge of interest in general-purpose computation on graphics processing units (GPGPU). GPGPU represents an inexpensive and power-efficient alternative to more traditional HPC platforms. In the past, there has been a substantial learning curve associated with GPGPU, and expert knowledge was required to attain impressive performance. This involved extensive modification of traditional approaches in order to effectively scale to the large number of cores per GPU. However, as the flexibility of the GPU has increased, there has been a welcome decrease in the learning curve associated with the porting process. In this study, we utilize NVIDIA's Compute Unified Device Architecture (CUDA), a parallel programming model and software environment. CUDA exposes the power of the GPU to the programmer through a set of high level language extensions, allowing existing scientific codes to be more easily transformed into GPU-compatible applications.

Figure 1: CUDA programming model. Image from the NVIDIA CUDA Programming Guide [1].

    1.2.1. Programming Model

While a full introduction to CUDA is beyond the scope of this paper, this section covers the basic concepts required to understand the scale of the parallelism involved. CUDA views the GPU as a highly parallel coprocessor. Functions called kernels are composed of a large number of threads, which are organized into blocks. A group of blocks is known as a grid; see Figure 1. Blocks contain a fast shared memory that is only available to threads which belong to the block, while grids have access to the global GPU memory. A typical kernel launch involves one grid, composed of hundreds or thousands of individual threads, a much higher degree of parallelism than normally occurs with traditional parallel approaches on the CPU. This high degree of parallelism and unique memory architecture have drastic consequences for performance, which will be explored in a later section.

    1.3. Domain and Algorithm Description

S3D is a massively parallel direct numerical simulation (DNS) solver for the fully compressible Navier-Stokes, total energy, species, and mass continuity equations coupled with detailed chemistry [2, 3]. It is based on a high-order accurate, non-dissipative numerical scheme solved on a three-dimensional structured Cartesian mesh. Spatial differentiation is achieved through eighth-order finite differences along with tenth-order filters to damp any spurious oscillations in the solution. The differentiation and filtering require nine- and eleven-point centered stencils, respectively. Time advancement is achieved through a six-stage, fourth-order explicit Runge-Kutta (R-K) method. Navier-Stokes characteristic boundary condition (NSCBC) treatment [4, 5, 6] is used on the boundaries.
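To sketch what a nine-point differentiation stencil looks like in practice, the kernel below applies the standard eighth-order central-difference weights to the interior points of a one-dimensional array. This is our simplified illustration; S3D's actual operators are three-dimensional and include boundary closures and the filtering step.

    // Eighth-order central first derivative at interior points; inv_h = 1/dx.
    __global__ void deriv8(const double *f, double *df, double inv_h, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < 4 || i >= n - 4) return;  // this sketch skips boundary closures
        df[i] = inv_h * ( (4.0 /   5.0) * (f[i + 1] - f[i - 1])
                        - (1.0 /   5.0) * (f[i + 2] - f[i - 2])
                        + (4.0 / 105.0) * (f[i + 3] - f[i - 3])
                        - (1.0 / 280.0) * (f[i + 4] - f[i - 4]) );
    }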

Fully coupled mass conservation equations for the different chemical species are solved as part of the simulation to obtain the chemical state of the system. Detailed chemical kinetics and molecular transport models are used. An optimized and fine-tuned library has been developed to compute the chemical reaction and species diffusion rates, based on Sandia's Chemkin package. While Chemkin-standard chemistry and transport models are readily usable with S3D, special attention is paid to the efficiency and performance of the chemical models. Reduced chemical and transport models that are fine-tuned to the target problem are developed as a pre-processing step.

S3D is written entirely in Fortran. It is parallelized using a three-dimensional domain decomposition and MPI communication. Each MPI process is responsible for a piece of the three-dimensional domain. All MPI processes have the same number of grid points and the same computational load. Inter-processor communication is only between nearest neighbors in a three-dimensional topology. A ghost zone is constructed at the processor boundaries by non-blocking MPI sends and receives among the nearest neighbors in the three-dimensional processor topology. Global communications are only required for monitoring and synchronization ahead of I/O.


S3D's performance has been studied and optimized, including I/O [7] and control flow [8]. Still, further improvements allow for increased grid size, more simulation timesteps, and more species equations. These are critical to the scientific goals of turbulent combustion simulations in that they help achieve higher Reynolds numbers, better statistics through larger ensembles, more complete temporal development of a turbulent flame, and the simulation of fuels with greater chemical complexity.

Here we assess S3D code performance and parallel scaling through simulation of a small-amplitude pressure wave propagating through the domain for a short period of time. The test is conducted with detailed ethylene-air (C2H4) chemistry consisting of twenty-two chemical species and a mixture-averaged molecular transport model. Due to the detailed chemical model, the code solves for twenty-two species equations in addition to the five fluid dynamic variables.

    1.4. Related Work

Recent work by a number of researchers has investigated GPGPU, with impressive results in a variety of domains. Owens et al. provide an excellent history of the GPU [9], chronicling its transformation in great detail. It is not uncommon to find researchers who achieve at least an order of magnitude improvement over reference implementations. GPUs have been used to accelerate a variety of application kernels, including more traditional operations like dense [10, 11, 12] and sparse [13] linear algebra as well as scatter-gather techniques [14]. The GPU has been successfully applied to a wide variety of fields including computational biophysics [15], molecular dynamics [16], and medical imaging [17, 18]. Our work takes a slightly higher level approach. While we do present performance measurements from an accelerated version of S3D, we examine the acceleration process as a whole, and endeavor to answer why certain applications perform so well on GPUs while others fail to achieve significant performance improvements.

    2. Identifying Candidates for Acceleration

2.1. Profiling

The first step in identifying a scientific application for acceleration is to identify the performance bottlenecks. The best case scenario involves a small number of computationally intense functions which comprise most of the runtime. This is a fairly basic requirement and is a direct consequence of Amdahl's law. The CPU-based profiling tool TAU identified S3D's getrates kernel as a major bottleneck [19]. This kernel calculates the rates of the chemical reactions occurring in the simulation at each point in space. This computation represents about half of the total runtime with the current chemistry model. As S3D's chemical model becomes more complex, we anticipate that the getrates kernel will more strongly dominate runtime. The larger the kernel's share of total runtime, the greater the potential for application speedup. Therefore, the first kernel to be examined should be the most time consuming.
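Hot spots can be found with a profiler such as TAU or, at a coarser level, with simple wall-clock timers around suspect routines. A generic host-side sketch (our own helper, not S3D's instrumentation):

    #include <chrono>

    // Time an arbitrary region of host code; returns elapsed seconds.
    template <typename F>
    double time_region(F &&region) {
        auto t0 = std::chrono::high_resolution_clock::now();
        region();
        auto t1 = std::chrono::high_resolution_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    }

    // Usage, with a hypothetical compute_rates() as the candidate kernel:
    //   double sec = time_region([&] { compute_rates(grid); });
    //   double fraction_accelerated = sec / total_runtime;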

    2.2. Parallelism and Data Dependency

One of the main advantages of the GPU is its high number of processors, so it follows that kernels must exhibit a high degree of parallelism to be successful on a heterogeneous GPU platform. While this can correspond to task-based parallelism, GPUs have primarily been used for data-parallel operations. This makes it difficult for GPUs to handle unstructured kernels, or those with intricate patterns of data dependency. Indeed, in situations with irregular control flow, individual threads can become serialized, which results in performance loss. Since the memory architecture of a GPU is dramatically different from that of most CPUs, memory access times can differ by several orders of magnitude based on access pattern and type of memory. For example, on the Tesla, an access to a block's shared memory is two orders of magnitude faster than an access to global memory. Therefore, kernels must often be chosen based on memory access pattern, or restructured such that memory access is more uniform in nature. In S3D, the getrates kernel operates on a regular three-dimensional mesh, so access patterns are fairly uniform, an easy case for the GPU.
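The magnitude of this difference is why staging data into shared memory pays off. The sketch below (illustrative, not from getrates) has each block read a tile from global memory once and then perform an entire reduction out of the fast on-chip store:

    // Block-wise sum: one global read per thread, then shared-memory
    // traffic only. Assumes a launch with 256 threads per block.
    __global__ void tile_sum(const float *in, float *out, int n) {
        __shared__ float tile[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // single global access
        __syncthreads();
        // Tree reduction entirely within shared memory.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                tile[threadIdx.x] += tile[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            out[blockIdx.x] = tile[0];
    }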

The following pseudocode outlines the general structure of the sequential getrates kernel. The outer three loops can be computed in parallel, since points in the mesh are independent.

    for x = 1 to length
        for y = 1 to length
            for z = 1 to length
                for n = 1 to nspecies
                    grid[x][y][z][n] = F(grid[x][y][z][1:nspecies])

where length refers to the length of an edge of the cube, nspecies refers to the number of chemical species involved, and the function F is an abstraction of the more complex chemical computations.

In addition to the innate parallelism of GPUs, the system's interconnection bus can also have serious performance consequences for accelerated applications. Discrete GPUs are often connected via a PCI-e bus, which introduces a substantial amount of latency into computations. This makes GPUs more effective at problems in which bandwidth is much more important than latency, or those which have a high ratio of computation to data. In these cases, the speedup in the calculations or the increased throughput is sufficient to overcome the performance costs associated with transferring data across the bus. In ideal cases, a large amount of data can saturate the bus and amortize the associated startup costs. In effect, this overlaps communication time with computation.
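One common way to realize this overlap is with pinned host memory and CUDA streams, so that the bus transfer for one chunk of data proceeds while the kernel for another chunk executes. A minimal sketch, assuming a device that supports copy/compute overlap and using a stand-in kernel named process:

    #include <cuda_runtime.h>

    __global__ void process(float *x, int n) {  // stand-in for real work
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = 2.0f * x[i] + 1.0f;
    }

    int main() {
        const int CHUNK = 1 << 20, NCHUNKS = 8;
        float *h_buf, *d_buf[2];
        // Pinned (page-locked) host memory enables asynchronous transfers.
        cudaHostAlloc(&h_buf, (size_t)NCHUNKS * CHUNK * sizeof(float),
                      cudaHostAllocDefault);
        cudaMalloc(&d_buf[0], CHUNK * sizeof(float));
        cudaMalloc(&d_buf[1], CHUNK * sizeof(float));
        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);
        for (int c = 0; c < NCHUNKS; ++c) {
            int k = c % 2;  // ping-pong between the two streams
            cudaMemcpyAsync(d_buf[k], h_buf + (size_t)c * CHUNK,
                            CHUNK * sizeof(float), cudaMemcpyHostToDevice, s[k]);
            process<<<(CHUNK + 255) / 256, 256, 0, s[k]>>>(d_buf[k], CHUNK);
            // While stream k is busy, the other stream can copy its next chunk.
        }
        cudaDeviceSynchronize();
        return 0;
    }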

    3. Results and Discussion

    3.1. Kernel Acceleration

Once a suitable portion of the application has been identified, the acceleration process can begin. Parallel programming is inherently more difficult than sequential programming, and developing high performance code for GPUs also incorporates complexity from architectural features. This memory-aware programming environment grants the programmer control over low level memory movement, but demands meticulous data orchestration to maximize performance.

For S3D, the mapping between the getrates kernel and CUDA concepts is fairly simple. Since getrates operates on a regular, three-dimensional mesh, each point in the mesh is handled by a single thread. A block is composed of a local region of the mesh. Block size varies between 64 and 128 threads, based on the available number of registers per GPU core, in order to maximize occupancy.
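The structure of such a mapping might look like the following sketch, where each thread owns one mesh point and F_rates is a dummy stand-in for the chemistry evaluation (the data layout here is illustrative, not S3D's):

    #define NSPECIES 22

    // Dummy placeholder for the per-point chemistry computation F.
    __device__ void F_rates(double *sp) {
        for (int n = 0; n < NSPECIES; ++n)
            sp[n] *= 0.99;
    }

    // One thread per mesh point; points are independent, so no
    // synchronization between them is needed.
    __global__ void getrates_like(double *grid, int npoints) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < npoints)
            F_rates(&grid[(size_t)idx * NSPECIES]);
    }

    // Launched with 64-128 threads per block, as described above, e.g.:
    //   getrates_like<<<(npoints + 127) / 128, 128>>>(d_grid, npoints);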

During the development of the accelerated version of the getrates kernel, the memory access pattern was the most important factor for performance. When neighboring threads read or write neighboring memory locations, CUDA coalesces the accesses into a single wide operation, which has a dramatic and beneficial effect on performance. The optimized versions of the getrates kernel also use batched memory transfers and exploit block shared memory. This attention to detail pays off: accelerated versions of the getrates kernel exhibit promising speedups over the serial CPU version, up to 31.4x for the single precision version and 17.0x for the double precision version for a single iteration of the kernel; see Figure 2. The serial CPU version was measured on a 2.3 GHz quad-core AMD Opteron processor with 16 GB of memory.
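Coalescing hinges on data layout. In the point-major layout of the previous sketch, adjacent threads read addresses 22 doubles apart; a species-major ("structure of arrays") layout lets adjacent threads touch adjacent addresses. The contrast, again as our own illustration rather than S3D's actual layout:

    // Species-major layout: for each species n, consecutive threads read
    // consecutive addresses, so each warp's reads coalesce into wide
    // memory transactions.
    __global__ void sum_species_soa(const double *grid, double *out, int npoints) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= npoints) return;
        double acc = 0.0;
        for (int n = 0; n < 22; ++n)
            acc += grid[(size_t)n * npoints + idx];  // stride-1 across the warp
        out[idx] = acc;
    }
    // With a point-major layout the same loop would read grid[idx * 22 + n],
    // scattering each warp's accesses and defeating coalescing.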

    Figure 2: Accelerated Kernel Results

    3.2. Accuracy

While the evolution of the GPU has been remarkable, architectural remnants of its original, specialized function remain. Perhaps the most relevant of these to the scientific community is the bias towards single precision floating point computation. Single precision arithmetic was sufficient for the GPU's original tasks (rasterization, etc.). GPU benchmarking traditionally involved only these single precision computations, and performance demands have clearly shaped the GPU's allocation of hardware resources. Many GPUs are incapable of double precision, and those that are capable typically pay a high performance cost. This cost generally arises from the differing number of floating point units, and it is almost always greater than the performance difference between single and double precision on a traditional CPU. In S3D, the cost can clearly be seen in the performance difference between the single and double precision versions of the getrates kernel.


From a performance standpoint, single precision computations are favorable compared to double precision, but the computations in scientific applications can be extremely sensitive to accuracy. Moreover, some double precision operations are not always equivalent on the CPU and GPU. GPUs may sacrifice fully IEEE-compliant floating point operations for greater performance. For example, scientific applications frequently make extensive use of transcendental functions (sin, cos, etc.), and the Tesla's hardware intrinsics for these functions are faster, but less accurate, than their CPU counterparts.
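CUDA makes this trade-off explicit in its single precision math library. A small sketch of the choice (nvcc's --use_fast_math flag maps the standard names onto the intrinsic forms globally):

    // Both compute sine in single precision; the intrinsic uses the special
    // function units and is faster but carries larger error bounds.
    __global__ void sines(const float *x, float *fast, float *accurate, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        fast[i]     = __sinf(x[i]);  // hardware intrinsic
        accurate[i] = sinf(x[i]);    // slower, more accurate library version
    }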

    3.2.1. Accuracy in S3D

In S3D, the reaction rates calculated by the getrates kernel are integrated over time as the simulation progresses, and error from inaccurate reaction rates compounds and propagates to other simulation variables. While this is the first comparison of double and single precision versions of S3D, the issue of accuracy has been previously studied, and some upper bounds for error are known. S3D has an internal monitor for the estimated error from integration, and can take smaller timesteps in an effort to improve accuracy. Figure 3 shows the estimated error from integration versus simulation time. In this graph, the CPU and GPU DP versions quickly begin to agree, while the single precision version is much more erratic. In both double precision versions, the internal mechanism for timestep control succeeds in settling on a timestep of appropriate size. The single precision version has a much weaker guarantee on accuracy, and the monitor has a difficult time controlling the timestep, oscillating between large timesteps with high error (sometimes beyond the acceptable bounds) and short timesteps with very low error. The increased number of timesteps required by the GPU single precision version will have consequences for performance, which will be explored in a later section.

The error from low precision can also be observed in simulation variables such as temperature (see Figure 4) or in chemical species such as H2O2 (see Figure 5). The current test essentially simulates a rapid ignition, and a relatively significant time gap can be seen between the rapid rise in temperature in the GPU single precision version and the other versions. On the sensitive time scale of ignition, this gap represents a serious error. In Figure 5, the error is much more pronounced, as the single precision version fails to predict the sudden decrease in H2O2 which occurs at roughly time 4.00E-04.

A similar trend can be observed throughout many different simulation variables in S3D. The CPU version tends to agree almost perfectly with the GPU double precision version, while the single precision version deviates substantially. Consequently, while the single precision version is much faster, it may be insufficient for sensitive simulations.

Figure 3: Estimated integrated error. 1.00E-03 is the upper bound on acceptable error. The GPU DP and CPU versions completely overlap beginning at roughly time 4.00E-04.

    3.3. S3D Performance Results

In an ideal setting, the chosen kernel would strongly dominate the runtime of the application. However, in S3D, the getrates kernel comprises roughly half of the total runtime, with some variation based on problem size. Table 1 shows how speedup in the getrates kernel translates to whole-code performance improvements. Amdahl's limit is the theoretical upper bound on speedup, 1/(1 - f_a), where f_a is the fraction of runtime that is accelerated.
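As a worked instance of this bound, using the 52.5% accelerated fraction from the last row of Table 1:

\[ \text{speedup}_{\max} = \frac{1}{1 - f_a} = \frac{1}{1 - 0.525} \approx 2.11 \]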

Table 1: Performance Results. S = Single Precision, D = Double Precision.

    Size   Kernel Speedup (S)   Kernel Speedup (D)   % of Total   Amdahl's Limit   Actual Speedup (S)   Actual Speedup (D)
    32     29.50x               14.98x               50.0%        2.00x            1.90x                1.84x
    48     31.44x               16.97x               51.0%        2.04x            1.91x                1.87x
    60     31.40x               16.08x               52.5%        2.11x            1.95x                1.90x


Figure 4: Simulation temperature. Note the time gap in the rise of temperature at roughly time 3.00E-04. This corresponds to a delay in the prediction of ignition time.

In S3D, there is a complex relationship between performance and accuracy. When inaccuracy is detected, the timestep size is reduced in an attempt to decrease error; see Figure 6. Since single precision is less accurate, one can see erratic timestep sizes. This means that, given the same number of timesteps, a highly accurate computation can simulate more time. In order to truly measure performance, it is important to normalize the wallclock time to account for this effect. In Table 2, normalized cost is the wallclock time it takes to simulate one nanosecond at one point in space. While the getrates kernel can be executed faster in single precision, the lack of accuracy causes the simulation to take very small timesteps. In some cases (typically very long simulations), the loss of accuracy in single precision calculations causes the total amount of simulated time to decrease, potentially eliminating any performance benefits.
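In symbols (our notation, not the paper's), the normalized cost reported in Table 2 is:

\[ C_{\text{norm}} = \frac{T_{\text{wallclock}}}{N_{\text{points}} \cdot \left( t_{\text{simulated}} / 1\,\text{ns} \right)} \]

This normalization charges the single precision version for the extra timesteps its lower accuracy forces.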

As mentioned in Section 1.3, S3D is distributed using MPI. The domain sizes listed in Table 1 are representative of the work done by a single node in a production-level simulation. As such, it is important to characterize the scaling behavior of the accelerated version. Figures 7 and 8 present parallel speedup and efficiency results from the Lens cluster. The Lens cluster is made up of 32 nodes, with each node containing four quad-core AMD Opteron processors, 64 GB of memory, and two GPUs: one Tesla C1060 and one GeForce 8800GTX. In our experiments, we do not utilize the GeForce 8800GTX because it lacks the ability to perform double precision operations. The accelerated version of S3D exhibited classic weak scaling, with parallel efficiency ranging between 84% and 98%.

Figure 5: Chemical species H2O2. The CPU and GPU DP versions completely agree, while the GPU SP version significantly deviates and fails to identify the dip at time 4.00E-04.

Table 2: Performance Results. Normalized cost is the average time it takes to simulate a single point in space for one nanosecond.

    Size   Normalized Cost (microseconds)
           CPU     GPU DP    GPU SP
    32     12.3    6.67      6.47
    48     12.9    7.30      6.98
    60     12.0    6.31      6.12

    4. Conclusions

Graphics processors are rapidly emerging as a viable platform for high performance scientific computing. Improvements in the programming environments and libraries for these devices are making them an appealing, cost-effective way to increase application performance. While the popularity of these devices has surged, GPUs may not be appropriate for all applications. They offer the greatest benefit to applications with well-structured, data-parallel kernels. Our study has described the strengths of GPUs and provided insights from our experience in accelerating S3D. We have also examined one of the most important aspects of GPUs for the scientific community: accuracy. The differences in accuracy between GPU and IEEE arithmetic resulted in drastic consequences for correctness in S3D. Despite this relative weakness, the heterogeneous GPU version of the kernel still manages to outperform the more traditional CPU version and produce high quality results in a real scientific application.

References

[1] NVIDIA, CUDA Programming Guide 2.3, downloaded June 1, 2009, www.nvidia.com/object/cudadevelop.html.

[2] E. R. Hawkes, R. Sankaran, J. C. Sutherland, J. H. Chen, Direct numerical simulation of turbulent combustion: fundamental insights towards predictive models, Journal of Physics: Conference Series 16 (2005) 65–79.

[3] J. C. Sutherland, Evaluation of mixing and reaction models for large-eddy simulation of nonpremixed combustion using direct numerical simulation, Ph.D. thesis, Dept. of Chemical and Fuels Engineering, University of Utah.

[4] T. J. Poinsot, S. K. Lele, Boundary conditions for direct simulations of compressible viscous flows, Journal of Computational Physics 101 (1992) 104–129.

[5] J. C. Sutherland, C. A. Kennedy, Improved boundary conditions for viscous, reacting, compressible flows, Journal of Computational Physics 191 (2003) 502–524.

[6] C. S. Yoo, Y. Wang, A. Trouve, H. G. Im, Characteristic boundary conditions for direct simulations of turbulent counterflow flames, Combustion Theory and Modelling 9 (2005) 617–646.


Figure 6: Timestep size. This graph shows the size of the timesteps taken as the rapid ignition simulation progressed. S3D reduces the timestep size when it detects integration inaccuracy. While the double precision versions take timesteps of roughly equivalent size, the single precision version quickly reduces timestep size in an attempt to preserve accuracy.

[7] W. Yu, J. Vetter, H. Oral, Performance characterization and optimization of parallel I/O on the Cray XT, in: IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2008), 2008, pp. 1–11.

[8] J. Mellor-Crummey, Harnessing the power of emerging petascale platforms, Journal of Physics: Conference Series 78 (1) (2007) 1248.

[9] J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, J. Phillips, GPU computing, Proceedings of the IEEE 96 (5) (2008) 879–899.

[10] S. Barrachina, M. Castillo, F. Igual, R. Mayo, Evaluation and tuning of the level 3 CUBLAS for graphics processors, in: Proceedings of the IEEE Symposium on Parallel and Distributed Processing (IPDPS), 2008, pp. 1–8.

[11] N. Fujimoto, Faster matrix-vector multiplication on GeForce 8800GTX, in: IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2008), 2008, pp. 1–8.


Figure 7: GPU parallel speedup. This graph characterizes the parallel scaling of the accelerated version of S3D. As the number of nodes increases, both the single and double precision versions exhibit proportional increases in performance.

[12] G. Cummins, R. Adams, T. Newell, Scientific computation through a GPU, in: IEEE Southeastcon 2008, 2008, pp. 244–246.

[13] J. Bolz, I. Farmer, E. Grinspun, P. Schröder, Sparse matrix solvers on the GPU: conjugate gradients and multigrid, in: SIGGRAPH '03: ACM SIGGRAPH 2003 Papers, 2003, pp. 917–924.

[14] B. He, N. K. Govindaraju, Q. Luo, B. Smith, Efficient gather and scatter operations on graphics processors, in: SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, 2007, pp. 1–12.

[15] J. E. Stone, J. C. Phillips, P. L. Freddolino, D. J. Hardy, L. G. Trabuco, K. Schulten, Accelerating molecular modeling applications with graphics processors, Journal of Computational Chemistry 28 (2005) 2618–2640.

[16] C. I. Rodrigues, D. J. Hardy, J. E. Stone, K. Schulten, W.-M. W. Hwu, GPU acceleration of cutoff pair potentials for molecular modeling applications, in: CF '08: Proceedings of the 2008 Conference on Computing Frontiers, 2008, pp. 273–282.


Figure 8: GPU parallel efficiency. This graph shows the parallel efficiency (parallel speedup divided by the number of processors) for the accelerated versions of S3D.

[17] J. Kruger, R. Westermann, Acceleration techniques for GPU-based volume rendering, in: IEEE Visualization (VIS 2003), 2003, pp. 287–292.

[18] K. Mueller, F. Xu, Practical considerations for GPU-accelerated CT, in: 3rd IEEE International Symposium on Biomedical Imaging: Nano to Macro, 2006, pp. 1184–1187.

[19] S. Shende, A. D. Malony, J. Cuny, P. Beckman, S. Karmesin, K. Lindlan, Portable profiling and tracing for parallel, scientific applications using C++, in: SPDT '98: Proceedings of the SIGMETRICS Symposium on Parallel and Distributed Tools, 1998, pp. 134–145.
