
Accelerating the Finite Difference Time Domain (FDTD) Method with CUDA

James F. Stack, Jr.

Remcom, Inc., State College, PA

[email protected]

Abstract: The following paper discusses the migration of a traditional C implementation of a three dimensional FDTD method to NVIDIA's CUDA architecture. The FDTD technique provides a highly accurate algorithm for solving electromagnetic problems; however, it comes at the cost of increased computational time. The advent of NVIDIA's CUDA technology allows this time to be reduced by over two orders of magnitude. A brief study of the architecture of modern Graphics Processor Units (GPUs) demonstrates the unique requirements of efficient parallel programming for the GPU. The most pressing GPU challenges are enumerated and techniques are discussed to overcome these difficulties. Throughput of the calculation, measured in millions of cells per second, is used as a performance benchmark. Timing comparisons between optimized CPU and GPU implementations of a commercial FDTD code are presented.

Keywords: Finite Difference, FDTD, GPU, CUDA, Parallel Algorithms

1. Introduction

The accuracy, flexibility, and simplicity of the FDTD method make it a popular choice for electromagnetic simulation. The algorithm is applicable to arbitrary shapes, works with dispersive and nonlinear materials, and accepts an arbitrary input signal. Unfortunately, this power comes at the cost of increased computational requirements. The traditional answer to this problem has been to split the problem domain over increasingly large numbers of CPUs, either through threading or the Message Passing Interface (MPI). The advent of NVIDIA's Compute Unified Device Architecture (CUDA) presents a new option.

This paper discusses the challenges and techniques involved in migrating the FDTD algorithm from a traditional C implementation to a form suitable for leveraging modern Graphics Processor Units (GPUs) through NVIDIA's CUDA framework. The GPU approach employs thousands of threads simultaneously, but this technique requires special consideration in program design in order to achieve maximum speed gains. Previous work [7][8][9] has concentrated on the challenges of implementing GPU-targeted software through the OpenGL API or the Cg language. With a proper understanding of CUDA, it is possible to achieve speedups of more than two orders of magnitude over traditional CPUs and to improve performance by a factor of two or more over previous GPU implementations.

2. Finite Difference Time Domain

The FDTD method has become a popular approach for solving a broad spectrum of electromagnetic problems since its original introduction by Kane Yee in 1966 [4]. FDTD is a first principles technique that directly implements Maxwell's curl equations (1) and (2) shown below.


\nabla \times \mathbf{E} = -\mu \frac{\partial \mathbf{H}}{\partial t} \qquad (1)

\nabla \times \mathbf{H} = \varepsilon \frac{\partial \mathbf{E}}{\partial t} + \sigma \mathbf{E} + \mathbf{J}_s \qquad (2)

The curls are separated into their vector components and discretized over time and space [5]. The resulting update equation for the Ex component is shown in (3) below. Here, J_s is omitted because the source is handled in a separate update. The other five components follow a similar form.

E_x^{n+1}(i,j,k) = C_a\,E_x^{n}(i,j,k) + C_b\left[\frac{H_z^{n+1/2}(i,j,k) - H_z^{n+1/2}(i,j-1,k)}{\Delta y} - \frac{H_y^{n+1/2}(i,j,k) - H_y^{n+1/2}(i,j,k-1)}{\Delta z}\right] \qquad (3)

where C_a = \frac{1 - \sigma\Delta t/2\varepsilon}{1 + \sigma\Delta t/2\varepsilon} and C_b = \frac{\Delta t/\varepsilon}{1 + \sigma\Delta t/2\varepsilon} are the per-material update coefficients [5].

The field components are arranged in a staggered manner as prescribed by the Yee cell, shown in Figure 1. Unique material identifiers may be assigned to each cell edge, and millions of these cells can be combined in a structured rectangular grid to form the computational domain.

Fig. #1: Yee Cell.

The electric and magnetic fields are staggered spatially by half of a cell edge, which allows their solution to be staggered in time by half a time step. The magnetic fields are held constant while each electric field component is computed from the four magnetic field components surrounding it; the process is then reversed to solve for the magnetic fields.
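As an illustration of this leapfrog scheme (a minimal sketch, not the paper's actual code; the function and variable names are purely illustrative), the outer time-stepping loop alternates a full magnetic field update with a full electric field update:

/* Leapfrog time stepping: H is advanced a half step, then E. */
for (int n = 0; n < nsteps; n++) {
    /* Advance all magnetic field components using the most recent E fields. */
    update_h(Hx, Hy, Hz, Ex, Ey, Ez);

    /* Advance all electric field components using the H fields just computed. */
    update_e(Ex, Ey, Ez, Hx, Hy, Hz);

    /* Sources, boundary conditions, and data saves are applied in
       separate updates, as noted above. */
}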

3. CUDA and Graphics Processor Units

Once used for the sole purpose of driving graphical displays, the GPU has evolved into a powerful computational device. The Tesla C1060 offers a peak performance of 933 GFLOPS and a memory bandwidth of 102 GB/s. This can yield significant performance gains over even a 2.66 GHz Intel Core 2 Quad processor which has a theoretical peak performance of 42.56 GFLOPS (using all four cores and full SSE2 optimizations) and a memory bandwidth of 10.7 GB/s. The soon-to-be-released Fermi GPU platform promises to extend these differences even further.

CUDA allows software developers to leverage this power without the need for special computer graphics knowledge [1][2]. The GPU is organized into a series of multiprocessors. Each of these multiprocessors contains a set of stream processors and a shared memory cache to facilitate thread cooperation.


CUDA compliant GPUs follow a Single Instruction Multiple Thread (SIMT) architecture [3]. In a SIMT model, threads are launched simultaneously in groups termed warps. The threads in a warp execute concurrently as long as all of them are performing the same instruction on different pieces of data. If the threads of a warp diverge through a conditional branch, thread execution must be serialized. The greatest speed gains are achieved by designing software that minimizes thread divergence within a warp.
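A hypothetical illustration of the divergence issue (not from the paper): a data-dependent branch forces the hardware to execute both paths serially for the warp, while an equivalent branch-free formulation keeps every thread on the same instruction.

__global__ void divergent(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Threads in the same warp may take different paths here, so the
    // hardware serializes the two branches.
    if (in[i] > 0.0f) out[i] = 2.0f * in[i];
    else              out[i] = 0.0f;
}

__global__ void uniform(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Same result expressed without a data-dependent branch, so all
    // threads in the warp issue identical instructions.
    out[i] = 2.0f * fmaxf(in[i], 0.0f);
}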

Algorithms such as FDTD perform small amounts of work on large amounts of data, so memory bandwidth tends to be a critical factor in application performance. The GPU offers another significant advantage in this area. GPU threads can switch contexts nearly instantaneously – without the need to store and restore thread state. This fact allows the GPU to hide memory latency by launching thousands of threads simultaneously and performing rapid context swapping while waiting for input data.

While context switching hides memory latency, choosing the right type of memory reduces it. CUDA devices offer several types of memory; the three most important for our purposes are shared, constant, and device (global) memory. Shared memory is low-latency, on-chip memory that can be accessed cooperatively by a group of threads, which makes it particularly useful for implementing user-defined caches. Constant memory is a cached, read-only region currently limited to 64 KB. Device (global) memory has relatively high latency but is currently available in amounts up to 4 GB per device.
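A brief sketch of how these memory spaces appear in CUDA code (the names and sizes here are illustrative, not taken from the paper):

// Constant memory: small, cached, read-only; a natural home for update coefficients.
__constant__ float d_coeffs[256];

__global__ void example_kernel(const float* d_field /* global (device) memory */)
{
    // Shared memory: low-latency, on-chip, visible to all threads in the
    // block; useful as a user-managed cache. Assumes a block of 256 threads.
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = d_field[i];   // stage a global read into shared memory
    __syncthreads();                  // make the tile visible to the whole block
    // ... cooperative work on tile[] ...
}

// Host side, once at start-up, coefficients are copied into constant memory:
// cudaMemcpyToSymbol(d_coeffs, h_coeffs, sizeof(h_coeffs));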

4. Implementation

GPU-targeted functions are implemented as kernels in CUDA. Kernels are written in a very similar manner to the C programming language with certain extensions. This makes it particularly easy to make an initial naïve attempt to transition from C to CUDA and perform iterative improvements to reach desired performance gains. Consider the Ex field update presented in Figure 2. Implementing this as a CUDA kernel is as simple as eliminating the triple for loops and placing the core functionality into the kernel as demonstrated in Figure 3.

Fields, material identifiers, material constants, and other arrays can be allocated in global device memory just as they would be allocated in system memory for CPU use. A completed implementation based on this approach yielded correct answers; however, the simulation took longer on the GPU than on the CPU. The CUDA profiler was instrumental in determining the cause of this poor performance: its output showed a very large number of uncoalesced global memory reads, and a thorough analysis revealed that over 300,000 GPU clock cycles were consumed by memory operations while fewer than 1,000 clock cycles were consumed by arithmetic operations.

Fig. #2: Example C Update Equation.
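The figure itself is not reproduced in this transcript. A representative C-style Ex update of the kind Figure 2 describes (array names, indexing macro, and loop bounds are illustrative assumptions, not the paper's actual code) would look like:

/* Triple loop over the grid; IDX maps (i,j,k) to a linear index and
   Ca/Cb are per-material update coefficients as in equation (3). */
#define IDX(i, j, k) ((i) + nx * ((j) + ny * (k)))

for (int k = 1; k < nz; k++) {
  for (int j = 1; j < ny; j++) {
    for (int i = 0; i < nx; i++) {
      int m = matEx[IDX(i, j, k)];      /* material id on this cell edge */
      Ex[IDX(i, j, k)] = Ca[m] * Ex[IDX(i, j, k)]
        + Cb[m] * ((Hz[IDX(i, j, k)] - Hz[IDX(i, j - 1, k)]) / dy
                 - (Hy[IDX(i, j, k)] - Hy[IDX(i, j, k - 1)]) / dz);
    }
  }
}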


Fig. #3: Example CUDA Update Equation.
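Again, the figure is not reproduced here. A naïve CUDA translation in the spirit of Figure 3 (thread-index mapping and names are illustrative) simply replaces the loops with the thread's grid coordinates:

__global__ void update_ex(float* Ex, const float* Hy, const float* Hz,
                          const int* matEx, const float* Ca, const float* Cb,
                          float dy, float dz, int nx, int ny, int nz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i >= nx || j < 1 || j >= ny || k < 1 || k >= nz) return;

    int idx = i + nx * (j + ny * k);     // same linear indexing as the CPU version
    int m   = matEx[idx];
    Ex[idx] = Ca[m] * Ex[idx]
            + Cb[m] * ((Hz[idx] - Hz[idx - nx]) / dy        // Hz(i, j-1, k)
                     - (Hy[idx] - Hy[idx - nx * ny]) / dz);  // Hy(i, j, k-1)
}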

A CUDA-capable GPU can access global memory in one of two ways. Coalesced operations allow the GPU to serve multiple threads with a single memory transaction of 32, 64, or 128 bytes; uncoalesced reads, on the other hand, must be serialized [3]. Each global memory operation requires between 400 and 600 clock cycles. This high memory latency meant that optimization had to focus on eliminating uncoalesced global memory operations [6].

The optimizations began by moving the update coefficient constants from global device memory to constant memory. This step alone yielded a 65% reduction in memory operations and provided the first actual speedup over the CPU. The next optimization was a technique the author refers to as “aligned array strides”: memory is over-allocated so that reads from adjacent cells can coalesce. Finally, a shared memory cache was implemented to eliminate redundant memory fetches. In total, memory operations were reduced to less than 14% of their original count.
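A sketch of the padding idea (a guess at the general technique, not Remcom's actual layout; names and sizes are illustrative): each x-row is over-allocated to a multiple of 16 floats (64 bytes) so that row starts fall on an alignment boundary and neighboring threads read consecutive, aligned addresses, while the per-material coefficients move into constant memory.

#include <cuda_runtime.h>

#define MAX_MATERIALS 256

// Per-material update coefficients in cached constant memory; one cached
// read serves many threads.
__constant__ float cCa[MAX_MATERIALS];
__constant__ float cCb[MAX_MATERIALS];

// Allocate a field array whose x-rows are padded to a multiple of 16 floats
// so that a half-warp's reads of adjacent cells can coalesce.
float* allocate_padded_field(int nx, int ny, int nz, int* nxPadded)
{
    *nxPadded = (nx + 15) & ~15;   // round the fastest-varying dimension up
    float* d = 0;
    cudaMalloc((void**)&d, (size_t)(*nxPadded) * ny * nz * sizeof(float));
    return d;
}

// Host side, once at start-up:
// cudaMemcpyToSymbol(cCa, hCa, numMaterials * sizeof(float));
// cudaMemcpyToSymbol(cCb, hCb, numMaterials * sizeof(float));

The CUDA runtime's cudaMallocPitch call performs a similar padded allocation automatically; a shared memory tile loaded once per block and reused by neighboring threads then removes the remaining redundant fetches of the H components.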

These fundamental optimizations, along with some others, were applied across the core functionality of the FDTD algorithm. The final implementation includes CUDA kernels for field updates, boundary conditions, convergence checks, and data I/O. Free space, PEC, and isotropic materials are supported, along with PML and PEC boundary conditions. A near-to-far-zone transformation for computing far-zone radiation patterns was also included; however, this capability was not employed in the tests below. The total functionality was combined into a library and integrated into Remcom's commercial product, XFdtd.

5. Test Problem and Setup

A variety of test cases and hardware platforms were used during the optimization process, but a single test case and platform were chosen for the final benchmarks due to time constraints. All tests were performed on the system shown in Table 1. Both the CPU and GPU versions of the software were shipping versions of the commercial product with appropriate optimizations.


Table 1: System Specifications.

Operating System: RedHat Linux 5

Processors: (2) Opteron 2218 (dual core, 2.6 GHz – 4 cores total)

Memory: 32 GB PC2-5300

GPU: (4) Tesla C1060s (4 GB GPU RAM each, 16 GB total GPU RAM)

The chosen test simulation was a modern cellular phone design. A leading manufacturer in cellular handsets provided a fully-featured CAD representation of one of their most popular products. The simulation included all major device components ranging from externally visible items such as the device case, keypad, and display to internal components such as PCBs, flexible connectors, and speakers. The actual shipping version of the phone's quad-band antenna was included as well.

The simulation parameters assumed a typical user setup for broadband antenna tuning. Accurate conductivity and permittivity values were used for all materials. A broadband Gaussian pulse excited a single port, and CPML boundaries terminated the space. Voltage and current were saved at the port location, and return loss and impedance vs. frequency were computed over the entire frequency range of the input waveform. Timing comparisons are based on the user experience, meaning they include all initialization, field updates, boundary condition updates, automatic convergence tests, and data I/O. The tests used a single thread (i.e., one core) for the CPU baseline and all four Tesla C1060s for the GPU implementation. Simulations of varying sizes (ranging from 1 GB to 4 GB) were created by employing increasingly fine cell resolution.

6. Results

Two performance metrics served as the basis of comparison. Throughput was computed by multiplying the total number of cells in the space by the number of time steps and dividing by the total execution time. Relative speedup was determined by dividing the CPU total simulation time by the GPU total simulation time. As shown in Figure 4, the GPU implementation consistently achieved a speedup of more than two orders of magnitude.

Fig. #4: Throughput of Single Opteron vs. 4 Tesla C1060s.
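For reference, the two metrics reduce to simple arithmetic; the sketch below uses illustrative function names only.

// Throughput in millions of cells per second (Mcells/s) and relative speedup.
double throughput_mcells(double numCells, double numTimesteps, double seconds)
{
    return numCells * numTimesteps / seconds / 1.0e6;
}

double relative_speedup(double cpuSeconds, double gpuSeconds)
{
    return cpuSeconds / gpuSeconds;
}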


Figure 5 shows that the specific gains ranged from approximately 175x to 225x depending on problem size. This difference arises because larger problems allow the GPU to hide memory latency more effectively.

Fig. #5: Relative Speedup of 4 Tesla C1060s vs. Single Opteron.

References

[1] Lindholm, E., Nickolls, J., Oberman, S., Montrym, J. “NVIDIA Tesla: A Unified Graphics and Computing Architecture,” IEEE Micro 28, pp. 39-55, March 2008

[2] Nickolls, J., Buck, I., Garland, M., Skadron, K., “Scalable Parallel Programming with CUDA,” Queue 6, pp. 40-53, March 2008

[3] “CUDA Programming Guide, 2.1,” NVIDIA

[4] Kane Yee, “Numerical Solution of Initial Boundary Value Problems Involving Maxwell’s Equations in Isotropic Media,” IEEE Transactions on Antennas and Propagation, vol. AP-14, no. 3, pp. 302–307, May 1966.

[5] Kunz, Karl S. and Raymond J. Luebbers, The Finite Difference Time Domain Method for Electromagnetics, Boca Raton, CRC Press, 1993.

[6] Micikevicius, Paulius, “3D Finite Difference Computation on GPUs using CUDA,” Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-2), 2009.

[7] M. J. Inman, A. Z. Elsherbeni, and C. E. Smith “GPU Programming for FDTD Calculations,” The Applied Computational Electromagnetics Society (ACES) Conference, Honolulu, Hawaii, 2005.

[8] M. J. Inman and A. Z. Elsherbeni, “3D FDTD Acceleration Using Graphical Processing Units,” The Applied Computational Electromagnetics Society (ACES) Conference, Miami, Florida, 2006.

[9] Adams, Samuel, Jason Payne, and Rajendra Boppana, “Finite Difference Time Domain (FDTD) Simulations Using Graphics Processors”, HPCMP Users Group Conference, 2007