
Audio Engineering Society
Convention Paper

Presented at the 130th Convention
2011 May 13-16, London, UK

The papers at this Convention have been selected on the basis of a submitted abstract and extended précis that have been peer reviewed by at least two qualified anonymous reviewers. This convention paper has been reproduced from the authors' advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

Virtual Room Acoustics: A Comparison of Techniques for Computing 3D-FDTD Schemes Using CUDA

Craig J. Webb¹ and Stefan Bilbao¹

¹Acoustics and Fluid Dynamics Group/Music, University of Edinburgh

    Correspondence should be addressed to Craig J. Webb ([email protected])

ABSTRACT

High fidelity virtual room acoustics can be approached through direct numerical simulation of wave propagation in a defined space. 3D Finite Difference Time Domain schemes can be employed, and adapt well to a parallel programming model. This paper examines the various approaches for calculating these schemes using the Nvidia CUDA architecture. We test the different possibilities for structuring computation, based on the available memory objects and thread-blocking model. A standard test simulation is computed at double precision under different arrangements. We find that a 2D extended tile blocking system, combined with shared memory usage, produces the fastest computation for our scheme. However, shared memory usage is only marginally faster than direct global memory access, using the latest FERMI GPUs.

    1. INTRODUCTION

Virtual room acoustics seeks to create auralisation by calculating the propagation of acoustic waves in a model of a 3D space. Finite Difference Time Domain (FDTD) schemes can be used to approximate the solutions to the 3D wave equation for this purpose. Whilst this can provide highly accurate simulations at audio sample rates, the level of computation is very high [1]. By exploiting the data independence of the calculations, parallel programming models can accelerate the computation times. The use of GPUs to perform such computation has become increasingly popular, and has applications across many disciplines. FDTD acceleration has been shown in light scattering simulations for nanoparticles [2], in wave propagation in seismology [3] [4], and in audio simulations of room acoustics [5]. See [6] for an overview of such techniques as applied to acoustic auralisation.

This paper examines the use of Nvidia's CUDA technology to compute a basic 3D FDTD scheme that simulates wave propagation with losses at the boundaries and due to the viscosity of air. Whilst 2D data is relatively straightforward to arrange within CUDA, our scheme uses 3D data sets of double precision floats. CUDA's memory types, and in particular the thread blocking system, allow for multiple architectural arrangements to compute the solution. However, the architectural design can have a major impact on the efficiency of the computation, as GPUs are highly sensitive to data and thread arrangement. We examine the different approaches in terms of execution times for a standard test problem, in a single GPU domain.

    2. THE FINITE DIFFERENCE SCHEME

The scheme being tested calculates propagation with losses at the boundaries, using a single reflection coefficient, and due to the viscosity of air. The difference equation used for this simulation is as follows:

u^{n+1}_{l,m,p} = \frac{1}{1 + \lambda\xi} \Big( (2 - K\lambda^2)\, u^{n}_{l,m,p} + \lambda^2 S^{n}_{l,m,p} - (1 - \lambda\xi)\, u^{n-1}_{l,m,p} + c \alpha T \, \nabla^2_d \big( u^{n}_{l,m,p} - u^{n-1}_{l,m,p} \big) \Big)    (1)

where

S^{n}_{l,m,p} = u^{n}_{l+1,m,p} + u^{n}_{l-1,m,p} + u^{n}_{l,m+1,p} + u^{n}_{l,m-1,p} + u^{n}_{l,m,p+1} + u^{n}_{l,m,p-1}    (2)

and \lambda = cT/X, where c is the wave speed in air, \xi is the coefficient for losses due to boundary reflections, and \alpha is the coefficient for losses due to the viscosity of air. T is the time step, and u^{n}_{l,m,p} is an approximation to the continuous function u(x,y,z,t), at times t = nT and locations x = lX, y = mX and z = pX, for integer l, m, p and n. \nabla^2_d is the seven-point discrete Laplacian. K is 6 in free space, 5 at a face, 4 at an edge and 3 at a corner. These integer values are defined in a 2D floor plan map, which is extruded into the 3rd dimension. This allows the definition of detailed boundary geometry, down to the level of the grid spacing X (13.5 mm at 44.1 kHz). Details of this scheme can be found in [7].

Computationally, at each time-step we update every point in u^{n+1} based on the grid value, and six neighbouring values from one and two time-steps ago, as shown in Figure 1.

    Fig. 1: 3D Data Grids

Thus, we require three arrays of 3D data values to compute the scheme, with a one-point halo around each grid. As every value of u^{n+1} is updated independently based on previously calculated values, we can use a parallel threading model at each time-step. A pseudo-code kernel would be:

    1. Calculate 3D position from thread ID

    2. Get K value from floor plan map

    3. IF NOT (K==0 OR Z==0 OR Z==Nz-1)

    3.1 Set K-1 at floor and ceiling

    3.2 Set loss coefficients at walls

    3.3 Process the update equation
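As an illustration only, a minimal CUDA sketch of such a kernel is given below. It assumes the 2D extended tile thread layout described later in Section 7.2 and the linear addressing of Section 5; the names (fdtdUpdate, Kmap, and the constant-memory coefficients) are ours, and the floor/ceiling and wall handling of steps 3.1 and 3.2 is collapsed into a single loss term rather than reproducing the authors' implementation.

// A minimal sketch, not the authors' code: 2D extended tile indexing,
// with steps 3.1 and 3.2 reduced to a single loss coefficient.
#define Nr 480
#define Nc 256
#define Nz 896

__constant__ double d_lambda2;   // lambda^2
__constant__ double d_loss;      // lambda * xi, the boundary loss term
__constant__ double d_visc;      // c * alpha * T, the viscosity term

__global__ void fdtdUpdate(double *u_next, const double *u,
                           const double *u_prev, const int *Kmap)
{
    // 1. Calculate the 3D position from thread and block IDs.
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    const int gx  = blockIdx.x * blockDim.x + threadIdx.x;
    const int z   = gx / Nc;
    const int col = gx - z * Nc;

    // 2. Get the K value from the extruded 2D floor plan map
    //    (K == 0 marks cells outside the enclosure).
    const int K = Kmap[row * Nc + col];

    // 3. Skip solid cells and the one-point halo at the domain edges.
    if (K == 0 || z == 0 || z == Nz - 1 ||
        row == 0 || row == Nr - 1 || col == 0 || col == Nc - 1)
        return;

    // Linear address of the centre point (row-major within each Z layer).
    const int cp = z * Nc * Nr + row * Nc + col;

    // S from Eq. (2): the six axial neighbours at the current time-step,
    // and the same sum one time-step further back for the viscosity term.
    const double S  = u[cp + 1] + u[cp - 1] + u[cp + Nc] + u[cp - Nc]
                    + u[cp + Nc * Nr] + u[cp - Nc * Nr];
    const double Sp = u_prev[cp + 1] + u_prev[cp - 1] + u_prev[cp + Nc]
                    + u_prev[cp - Nc] + u_prev[cp + Nc * Nr] + u_prev[cp - Nc * Nr];

    // Apply the single loss coefficient only at boundary cells (K < 6).
    const double loss = (K < 6) ? d_loss : 0.0;

    // Discrete Laplacians of u^n and u^(n-1), counting only valid neighbours.
    const double lap  = S  - (double)K * u[cp];
    const double lapP = Sp - (double)K * u_prev[cp];

    // 3.3 Process the update equation, Eq. (1).
    u_next[cp] = ((2.0 - (double)K * d_lambda2) * u[cp] + d_lambda2 * S
                - (1.0 - loss) * u_prev[cp] + d_visc * (lap - lapP))
               / (1.0 + loss);
}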

Once all of the values in u^{n+1} are computed, we simply update the input and output audio, and then swap the data memory pointers around. The process then continues at the audio sample rate.
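A sketch of the corresponding host-side loop is shown below. The names are assumptions rather than the authors' code; ioUpdate stands for the single-thread input/output kernel option discussed in Section 4.2.

// Hedged sketch of the per-sample host loop (names assumed).
// fdtdUpdate is the kernel sketched above; ioUpdate is assumed to add the
// source sample and record the output sample using one thread.
__global__ void ioUpdate(double *u, const double *input, double *output, int n);

void runSimulation(double *u, double *u_prev, double *u_next, const int *Kmap,
                   const double *d_input, double *d_output, int numSamples,
                   dim3 grid, dim3 block)
{
    for (int n = 0; n < numSamples; n++) {
        // Compute every point of u^(n+1) in parallel.
        fdtdUpdate<<<grid, block>>>(u_next, u, u_prev, Kmap);
        // Sum in the input and read off the output at single positions.
        ioUpdate<<<1, 1>>>(u_next, d_input, d_output, n);

        // Swap the data memory pointers: next -> current -> previous.
        double *tmp = u_prev;
        u_prev = u;
        u      = u_next;
        u_next = tmp;
    }
    cudaDeviceSynchronize();
}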

    3. TESTING FRAMEWORK

Different architectural arrangements were tested using a standard simulation. This computed 2,000 samples at 44.1 kHz using double precision floating-point arithmetic. The size of the three data arrays was chosen to use all 3 GB of global memory on a single Nvidia Tesla C2050 FERMI card. The C2050 has 448 CUDA cores, and a peak double precision performance of 515 Gflops. This is significantly better than previous incarnations of Tesla cards [8]. The dimensions Nr x Nc x Nz of the three data arrays were 480 x 256 x 896, which gives 110 million points in each. This corresponds to a room size of 6.5 m x 3.5 m x 12 m.
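As a rough check of these figures, three double precision arrays of this size occupy

3 \times 480 \times 256 \times 896 \times 8 \text{ bytes} \approx 2.6 \text{ GB},

which fits within the card's 3 GB alongside the floor plan map and audio buffers, and each dimension multiplied by the 13.5 mm grid spacing gives

480 \times 0.0135 \approx 6.5 \text{ m}, \quad 256 \times 0.0135 \approx 3.5 \text{ m}, \quad 896 \times 0.0135 \approx 12.1 \text{ m}.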

The dimensions are set as multiples of 32 for two reasons. Firstly, it ensures correct alignment of data to maximise memory coalescing, and simplifies the CUDA code; this is explained further in Section 5. Secondly, it allows the testing of variations in thread block size without concern for fractional dimensions. However, before going into details of 3D data and thread blocking, we first look at how the simulation is designed in CUDA memory.

    4. CUDA MEMORY MODEL

When writing serial CPU code in C, it is not necessary to think about memory other than in terms of the variables and arrays being used. In CUDA, this is not the case.

    4.1. CUDA memory types

Working with GPUs requires a mixture of host serial C code and device CUDA code. Whilst the host uses CPU memory, the device has multiple memory types, as shown in Figure 2.

    Fig. 2: CUDA Memory types [9]

The host transfers data via global and constant memory. Global is the largest, with 3 GB available on the Tesla FERMI C2050. CUDA threads make use of local register memory, and also have a small amount of shared memory per thread block. They can also communicate directly with the global and (read-only) constant memory. Each type has different access speeds, with global memory being the slowest, and local, shared and constant being fast.

There is also an additional type, texture memory. However, this is of no benefit here, as it cannot be used with double precision floats and is read-only from a thread [9]. Hence, the following sections detail the use of the global, constant, local and shared types only.

    4.2. Memory usage model

Initial decisions on memory layout are taken with regard to the first rule of CUDA coding: minimise the use of host-to-device transfers [10]. The three data arrays themselves need only reside on the device, and be initialised to zero, as the required output is an audio array read from a given point. This is transferred to the host at the end of the simulation. At setup, this leaves just the audio and map data for transfer to global memory on the device.

    Fig. 3: Memory layout

Calculated parameter constants are loaded into constant memory, which has fast access from the threads. This is useful for the difference equation coefficients, and reduces the number of variables passed to each thread when launched. With the aim of minimising register pressure, we can also eliminate non-calculated scheme constants with the use of #define. Parameters such as the array dimensions and input/output positions can simply be defined as integers throughout the code, which can significantly reduce the number of variables in use.
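A minimal sketch of this setup step is given below, assuming the constant-memory declarations from the kernel sketch in Section 2; the names and example positions are illustrative only.

// Hedged sketch: fixed parameters via #define, calculated coefficients
// copied once into constant memory at setup.
#define OUT_ROW 240   // example read-out position, fixed at compile time
#define OUT_COL 128
#define OUT_Z   448

void loadCoefficients(double lambda, double xi, double viscTerm)
{
    const double l2 = lambda * lambda;   // lambda^2
    const double ls = lambda * xi;       // boundary loss term
    // One small host-to-device transfer; kernels then read the coefficients
    // directly from fast constant memory, with no per-launch arguments.
    cudaMemcpyToSymbol(d_lambda2, &l2, sizeof(double));
    cudaMemcpyToSymbol(d_loss,    &ls, sizeof(double));
    cudaMemcpyToSymbol(d_visc,    &viscTerm, sizeof(double));
}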

On the subject of input and output, there are two methods for performing the updates of the audio arrays. The first is to use conditional statements within the thread kernel, to sum in the input and read off the output at individual positions. The other is to launch a single thread to perform this after the main kernel has finished. Testing for a single input and single output showed little difference in execution time, although the latter allows slightly neater code.

    5. WORKING WITH 3D ARRAYS

Having set out the physical locations of the data, the next question is how to actually use it in code. The fundamental issue is how to work with the three-dimensional arrays. In standard C code, one can declare and address a 3D array using notation such as double U[Nr][Nc][Nz]. Whilst this notation can be used to declare global memory, its use is limited by a lack of control: we cannot test whether there is enough memory space, and whether it is contiguous. The recommended method to declare global memory is using cudaMalloc() [9]. The problem is that this is addressed in a linear fashion, with one index, and so requires that our 3D arrays are decomposed into purely linear data. This can be achieved with a row-major format for each Z layer, placing these side-by-side as shown in Figure 4.

    Fig. 4: Linear decomposition of 3D data

There is also the cudaMalloc3D() option for declaring linear memory. This gives optimal padding to maximise coalesced access, but at the expense of more complex coding [9]. It was not used here, as we are using array dimensions that are multiples of 32, the warp size for FERMI cards. Aside from linear memory, CUDA also offers cudaArrays. However, these are for use with texture memory fetching, and so were not used for the reasons given in Section 4.1.

We are now left with single-index linear memory, which holds a decomposed 3D data set. This has to be addressed using:

Address = (depth * Nc * Nr) + (row * Nc) + column

Rather than having to calculate this each time an array is accessed, a more efficient method is to locate the linear position once at the start of the thread kernel. Access to the neighbouring points in 3D can then be performed with a single addition, as in linear terms the neighbours are all constant shifts away from the central point, CP. For example, the neighbour point beneath a given 3D position is accessed as [CP + Nc]. This method of array declaration and access was used throughout the following threading arrangements.
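A short sketch of this declaration and addressing pattern is given below (names assumed; error checking omitted).

// Hedged sketch: one contiguous linear allocation per 3D grid, zeroed on the device.
#define Nr 480
#define Nc 256
#define Nz 896

void allocateGrid(double **u)
{
    const size_t bytes = (size_t)Nr * Nc * Nz * sizeof(double);
    cudaMalloc((void **)u, bytes);
    cudaMemset(*u, 0, bytes);
}

// Linear address of point (row, col, depth), row-major within each Z layer:
//   cp = depth * Nc * Nr + row * Nc + col
// Neighbours are constant offsets from cp: +/-1 along a row, +/-Nc between
// rows, and +/-(Nc * Nr) between adjacent Z layers.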

    6. ITERATING OVER 2D

Unlike CPU hardware, GPUs consist of a large number of processor cores, and consequently threads are required to be grouped at launch. The CUDA thread blocking model defines this mapping. Individual threads are grouped into a block, which can be up to 3-dimensional in shape. There is a maximum of 1,024 threads per block (for FERMI cards), and each dimension should be a power of two. So 1,024 x 1 x 1, 16 x 16 x 1, and 8 x 8 x 8 would all be valid block sizes. To launch more than 1,024 threads, multiple blocks are used. These are arranged into a 2D grid.

Therein lies the main issue, as there are many permutations for fitting 3D data sets into this 2D grid system comprised of 1, 2 or 3D blocks. Add to this the variation in block size itself, and it is certainly not clear which permutation is going to produce the fastest execution time.

As the grid system is 2-dimensional, the simplest technique is to change the problem to a 2D one and iterate over the 3rd dimension with a loop. A square block size of 16 x 16 can be used to tile over each Z dimension layer, as shown in Figure 5.

    Fig. 5: Grid block system for 2D iteration


Deriving the 3D position of each thread is straightforward enough, using R = blockIdx.y * Bh + threadIdx.y for the row, similar code for the column, and the depth given by the loop.

There are still two possibilities for using the loop itself. Firstly, we can loop outside the kernel launch, dispatching Nr x Nc threads at each iteration. For the test simulation this produced an execution time of 174.5 seconds. The second option is to loop from within the kernel itself. We still dispatch Nr x Nc threads, but each of these threads then iterates over the Z dimension. This produced an execution time of 252.1 seconds, far slower than the outer-loop version.
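The faster, outer-loop arrangement corresponds to a launch pattern along these lines (a sketch with assumed names; updateLayer stands for a per-layer variant of the update kernel, and Nr, Nc, Nz are as defined earlier).

// Hedged sketch of the outer-loop arrangement: one kernel launch per Z layer,
// each covering an Nr x Nc tile with 16 x 16 thread blocks.
__global__ void updateLayer(double *u_next, const double *u,
                            const double *u_prev, const int *Kmap, int z);

void launchOuterLoop(double *u_next, const double *u, const double *u_prev,
                     const int *Kmap)
{
    dim3 block(16, 16);             // square tile
    dim3 grid(Nc / 16, Nr / 16);    // covers one Nr x Nc layer
    for (int z = 1; z < Nz - 1; z++)
        updateLayer<<<grid, block>>>(u_next, u, u_prev, Kmap, z);
}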

    7. MAXIMUM THREADING

In this section we look at maximum threading, where all 110 million threads are launched simultaneously. In order to do this, the thread blocks and grid need to cover the entire data set. Two arrangements were tested: firstly using 1D thread blocks, and then using 2D extended tile blocking.

    7.1. 1D thread blocking

The use of 1D thread blocks is perhaps the most natural fit when dealing with 3D data sets. The block is set to cover one dimension of the data, and the 2D grid is then formed over the remaining two. Deriving the 3D position from the thread and block IDs is simple. The choice of dimension for the thread block has a vast impact on timings, and clearly demonstrates the sensitivity to data arrangement patterns. Setting the block to cover the Z dimension of the data resulted in ten-minute execution times, even with the depth reduced to 256. As our 3D data is decomposed in a row-major format, the most efficient memory access is to set the block over the rows.

For the test simulation, which has a row length of 256, this produced a time of 129.5 seconds. Using 1D blocking has the limitation that the data width must be a power of two, whilst there are no restrictions on the other dimensions. The other possible variation is the use of a 1D thread block with a simple 1D block grid. However, as the maximum size of a dimension of the grid is 65,535 blocks, this does not allow the execution of all 110 million threads at once.
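A sketch of the row-oriented arrangement is given below (names assumed; Nr, Nc, Nz as defined earlier): each block of 256 threads spans one row exactly, and the 2D grid covers the row and depth indices.

// Hedged sketch: 1D thread blocks over the rows, 2D grid over rows x depth.
__global__ void rowBlockKernel(double *u_next, const double *u,
                               const double *u_prev, const int *Kmap)
{
    const int col = threadIdx.x;    // one thread per point along a row
    const int row = blockIdx.x;     // grid x covers the rows
    const int z   = blockIdx.y;     // grid y covers the Z layers
    const int cp  = z * Nc * Nr + row * Nc + col;
    // ... update u_next[cp] as in Section 2 ...
}

void launchRowBlocks(double *u_next, const double *u, const double *u_prev,
                     const int *Kmap)
{
    dim3 block(Nc, 1);   // 256 threads: the block matches the row length
    dim3 grid(Nr, Nz);   // 480 x 896 blocks cover the remaining dimensions
    rowBlockKernel<<<grid, block>>>(u_next, u, u_prev, Kmap);
}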

    7.2. 2D extended tile thread blocking

2D thread blocks can be used to cover the entire data set, at the expense of more complex coding. As with the 2D iteration approach, each Z layer is tiled with a square thread block. These layers are then arranged side by side in the 2D grid, as shown here:

    Fig. 6: Grid block system for 2D thread block

The complication is in deriving the 3D position for each thread, which requires an integer division for the Z position, using (where Bw is the block width):

    R = blockIdx.y * Bw + threadIdx.y

    Z = (blockIdx.x * Bw + threadIdx.x)/Nc

    C = (blockIdx.x * Bw + threadIdx.x)-(Z*Nc)
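The corresponding launch then covers the whole data set in a single 2D grid. A sketch for 16 x 16 blocks is shown below (assumed names, matching the kernel sketch in Section 2; Nr, Nc, Nz as defined earlier).

// Hedged sketch: blockIdx.x spans the flattened (Z, column) dimensions and
// blockIdx.y spans the rows, so one launch covers all 110 million points.
void launchExtendedTile(double *u_next, const double *u, const double *u_prev,
                        const int *Kmap)
{
    dim3 block(16, 16);
    dim3 grid((Nc * Nz) / 16, Nr / 16);   // 14,336 x 30 blocks
    fdtdUpdate<<<grid, block>>>(u_next, u, u_prev, Kmap);
}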

The test simulation was run for two different block sizes, producing the following results:

16 x 16 block size : 120.8 seconds
32 x 32 block size : 148.1 seconds

The 16 x 16 block size produces the fastest time due to reduced pressure on the registers. So far, all of the above methods have used direct access to global memory to read the data from one and two time-steps ago. The final stage in optimisation is to consider the use of shared memory, typically the most important factor in reducing memory access latency.


    8. SHARED MEMORY

Each block of threads has access to a small amount of fast shared memory. The concept of the optimisation is that, by getting threads to collaborate in loading to shared memory, the number of global memory accesses required is reduced. Our test simulation uses seven accesses to the data from one time-step ago, and seven from two time-steps ago. However, within a 2D thread block, data is re-used as neighbour points by threads on the same Z dimension layer. By using collaborative loading to shared memory, we can reduce the global accesses from 14 to 6 or 8 per thread.

The shared memory itself is declared in the kernel using standard 3-index array notation, and so is easy to address. However, the code is then complicated by the halo requirements. For a 2D thread block, the threads at the outer edges still need neighbouring data in the row and column dimensions. So, if a 16 x 16 block size is used, the shared memory array needs to be of size 18 x 18. The collaborative loading requires that threads along the edges load their own point, plus the necessary halo point. This leads to a four-part conditional statement within the kernel.
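A condensed sketch of the collaborative loading is shown below (names assumed; only the tile for the current time-step is shown, and the update itself is elided).

// Hedged sketch: an 18 x 18 shared tile for a 16 x 16 block, holding one Z
// layer of current time-step values plus a one-point halo in row and column.
#define BW 16   // block width; Nr, Nc as defined earlier

__global__ void sharedTileKernel(double *u_next, const double *u,
                                 const double *u_prev, const int *Kmap)
{
    __shared__ double tile[BW + 2][BW + 2];

    const int row = blockIdx.y * BW + threadIdx.y;
    const int gx  = blockIdx.x * BW + threadIdx.x;
    const int z   = gx / Nc;
    const int col = gx - z * Nc;
    const int cp  = z * Nc * Nr + row * Nc + col;

    // Local coordinates inside the tile, offset by 1 for the halo.
    const int tr = threadIdx.y + 1;
    const int tc = threadIdx.x + 1;

    // Each thread loads its own point...
    tile[tr][tc] = u[cp];

    // ...and edge threads also load the adjacent halo point
    // (the four-part conditional described in the text).
    if (threadIdx.x == 0      && col > 0)      tile[tr][0]      = u[cp - 1];
    if (threadIdx.x == BW - 1 && col < Nc - 1) tile[tr][BW + 1] = u[cp + 1];
    if (threadIdx.y == 0      && row > 0)      tile[0][tc]      = u[cp - Nc];
    if (threadIdx.y == BW - 1 && row < Nr - 1) tile[BW + 1][tc] = u[cp + Nc];

    __syncthreads();

    // The four in-layer neighbours are now read from shared memory:
    //   tile[tr][tc - 1], tile[tr][tc + 1], tile[tr - 1][tc], tile[tr + 1][tc]
    // The two out-of-layer neighbours, u[cp - Nc * Nr] and u[cp + Nc * Nr],
    // still come from global memory, and the update then proceeds as in
    // Eq. (1), guarded by the Kmap test.
}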

Despite the complications of the code, the timing results showed a small improvement over direct global access. Two arrangements were tested. Firstly, the 2D outer-kernel iteration was tried with a block size of 16 x 16. This produced a time of 170.1 seconds. The final arrangement used the 2D extended tile approach from Section 7.2. This was tested using two block sizes, with the following times:

16 x 16 block size : 117.0 seconds
32 x 32 block size : 143.2 seconds

Again, these are several seconds faster than the direct global access versions.

    9. RESULTS

Table 1 shows a summary of the test results for each arrangement, all computed on the Tesla FERMI C2050 using double precision floating-point arithmetic.

Arrangement         Version               Time (s)
------------------  --------------------  --------
2D iteration        Inner-kernel loop       252.1
                    Outer-kernel loop       174.5
                    With shared memory      170.1
1D thread block     256 row block           129.5
2D extended tile    16 x 16 block size      120.8
                    32 x 32 block size      148.1
  with shared mem   16 x 16 block size      117.0
                    32 x 32 block size      143.2

Table 1: Summary of timing results

The 2D iteration approach produced the slowest times, even when used with shared memory. For maximum threading, the 1D thread block can be efficient, but only when using the row dimension. Overall, the 2D extended tile produced the fastest times, when using a 16 x 16 thread block. The shared memory version was just 3% faster. Whilst much emphasis is placed on the use of shared memory optimisation, in this instance the gains were small.

    10. AUDIO SIMULATIONS

For the above testing, the simulations were limited to just 2,000 samples. Full audio tests were computed using the optimal 2D extended tile version. These were calculated for several seconds at 44.1 kHz, using the room size of 6.5 m x 3.5 m x 12 m. Computation time was 40 minutes for one second of audio output. Various audio and video examples are available at: www2.ph.ed.ac.uk/s0956654/Site/VirtualRoomAcoustics.html

The effect of the boundary definition using the floor plan can be seen in a plot of a single height layer, Figure 7.

Fig. 7: 1 kHz sine wave after 78, 164, and 268 samples.

Boundary reflection and diffraction effects around objects can be seen.


    11. CONCLUSIONS

This work clearly demonstrates the sensitivity of GPU performance to architectural arrangement. For 3D data sets, the multitude of options for computation can give markedly different results. Even for the efficient versions tested here, the difference between the best and worst performance is a factor greater than two. A 2D extended tile arrangement showed the best overall efficiency, and in all cases the differences between shared memory and non-shared memory versions are small. There are still other arrangements that could be considered, such as the use of a 3D thread block. However, the restrictions on thread block size would likely limit any benefits.

A further option is the possibility of using single precision computation. This would open up the use of cudaArray objects and texture memory, but only if single precision gave acceptable results from an audio perspective. The effect of propagating round-off errors on phasing and stability may render single precision unacceptable, depending on the type of FDTD scheme being used. Further work will also examine the performance of the architectural arrangements when using multiple GPUs for simultaneous computation.

    12. REFERENCES

[1] A. Southern, D. Murphy, and J. Wells, "Rendering walk-through auralisations using wave-based acoustical models," in Proc. of the 17th European Signal Processing Conf. (EUSIPCO 2009), Glasgow, UK, 2009.

[2] A. Balevic, L. Rockstroh, A. Tausendfreund, S. Patzelt, G. Goch, and S. Simon, "Accelerating simulations of light scattering based on finite-difference time-domain method with general purpose GPUs," in Proceedings of the 11th IEEE Int. Conference on Computational Science and Engineering, Sao Paulo, Brazil, 2008.

[3] D. Brandao, M. Zamith, E. Clua, and A. Montenegro, "Performance evaluation of optimized implementations of finite difference method for wave propagation problems on GPU architecture," in Proceedings of the 22nd International Symposium on Computer Architecture and HPC, Brazil, 2010.

[4] D. Michea and D. Komatitsch, "Accelerating a three-dimensional finite-difference wave propagation code using GPUs," Geophysical Journal International, vol. 182, pp. 389-402, 2010.

[5] L. Savioja, "Real-time 3D finite-difference time-domain simulations of low- and mid-frequency room acoustics," in Proceedings of the 13th Int. Conf. on Digital Audio Effects, Austria, Sept. 2010.

[6] L. Savioja, D. Manocha, and M. Lin, "Use of GPUs in room acoustic modeling and auralization," in Proceedings of the International Symposium on Room Acoustics (ISRA), Melbourne, 2010.

[7] C. J. Webb and S. Bilbao, "Computing room acoustics with CUDA - 3D FDTD schemes with boundary losses and viscosity," University of Edinburgh, Feb. 2011.

[8] Nvidia Corp., "Fermi Compute Architecture whitepaper," http://developer.nvidia.com, Oct. 2010.

[9] Nvidia Corp., "CUDA C Programming Guide, v3.2," http://developer.nvidia.com, Sep. 2010.

[10] Nvidia Corp., "CUDA C Best Practices Guide, v3.2," http://developer.nvidia.com, Aug. 2010.
