
Interactive Simulations with Navier-Stokes Equations on many-core Architectures

Stefan Bartels


Interdisciplinary Project

Chair of Scientific Computing, Technische Universität München, Germany

Table of Contents

1 Introduction
2 Navier-Stokes Equations in Fluid Simulation
    2.1 Navier-Stokes Equations
    2.2 Discretization and Algorithm
3 Basics of GPU Programming
4 Implementation Details
    4.1 Usage
    4.2 Implementation
    4.3 Kernels
5 Evaluation
    5.1 Validation
    5.2 Performance Evaluation
6 Conclusion and Future Directions
    6.1 Conclusion
    6.2 Future Directions


1 Introduction

The Navier-Stokes Equations are a mathematical model that describes the behaviour of fluids. They have proven to represent real fluid flows quite well and are the basis for many fluid simulations.

In order to exploit the performance provided by modern many-core systems, fluid simulation algorithms must be able to solve the Navier-Stokes Equations efficiently in parallel.

The aim of this interdisciplinary project was to implement a real-time fluid simulation by solving the Navier-Stokes Equations on a GPU. Additionally, a user should be able to interact with the simulation at run-time, which requires real-time visualization.

The sequential algorithm described by Michael Griebel [1] provides the basis for the implemented simulation. Within the project a solver for CPUs was implemented and ported to GPUs. It is capable of simulating a two-dimensional laminar flow and visualizing it in real time. Obstacles can be added at run-time by drawing with the cursor on the visualization window and have an immediate effect on the simulation.

The simulation framework and the CPU solver were written in C++, while OpenCL was used for the kernels of the GPU solver. The user interface and visualization are implemented with the Qt libraries and OpenGL.

2 Navier-Stokes Equations in Fluid Simulation

In this section the Navier-Stokes Equations for laminar, incompressible fluid simulation, as used for the implementation of the simulation software, are roughly outlined. It is essentially a quick summary of the first chapters of the book “Numerical Simulation in Fluid Dynamics” by Michael Griebel [1].

2.1 Navier-Stokes Equations

The Navier-Stokes Equations describe the flow of an incompressible fluid with a system of partial differential equations, the momentum equation and the continuity equation. They can be derived from mass and momentum conservation. The fluid particles are not handled individually, but are grouped and represented by a field of velocity vectors $\vec{u}$ and a pressure field $p$ at a time $t$. Because the fluid is incompressible, its density is assumed to be constant. The behaviour of a fluid is influenced by body forces $\vec{g}$, e.g. gravity, and the Reynolds number $Re$. The Reynolds number indicates the ratio of inertial to viscous forces that act on the fluid. A low Reynolds number is used for viscous fluids, e.g. honey, high Reynolds numbers for nearly inviscid fluids, e.g. air. [1]

Momentum equation:

$$\frac{\partial}{\partial t}\vec{u} + (\vec{u} \cdot \nabla)\vec{u} + \nabla p = \frac{1}{Re}\Delta\vec{u} + \vec{g} \tag{1}$$

Continuity equation:

$$\nabla \cdot \vec{u} = 0 \tag{2}$$

For the simulation in two dimensions, these equations can be rearranged and represented in the following component forms: [1]

Momentum equation:

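$$\frac{\partial u}{\partial t} + \frac{\partial p}{\partial x} = \frac{1}{Re}\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) - \frac{\partial (u^2)}{\partial x} - \frac{\partial (uv)}{\partial y} + g_x \tag{3a}$$

$$\frac{\partial v}{\partial t} + \frac{\partial p}{\partial y} = \frac{1}{Re}\left(\frac{\partial^2 v}{\partial x^2} + \frac{\partial^2 v}{\partial y^2}\right) - \frac{\partial (uv)}{\partial x} - \frac{\partial (v^2)}{\partial y} + g_y \tag{3b}$$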

Continuity equation:

$$\frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} = 0 \tag{4}$$

2.2 Discretization and Algorithm

Staggered Grid: The simulated area is partitioned into cells on a staggered grid. The pressure of the fluid inside each cell is stored, as well as the velocity of fluid particles travelling between these cells. Hence, the velocity values are evaluated on cell boundaries, while pressure values are evaluated at cell centres. The values are stored in three matrices, u and v for horizontal and vertical velocities and p for the pressure.

Around the fluid cells of the simulation domain, an additional layer of boundary cells is added for the application of boundary conditions. [1]

Fig. 1: Staggered grid: Pressure (P), horizontal (U) and vertical (V) velocities [1]
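The storage scheme described above could be sketched as follows. This is only an illustration under assumptions: the struct name `StaggeredGrid`, the helpers `idx` and `isFluid`, and the row-major layout with exactly one boundary layer are hypothetical choices, not taken from the project's source code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the staggered grid: nx * ny fluid cells
// surrounded by one layer of boundary cells, stored row by row.
struct StaggeredGrid {
    std::size_t nx, ny;          // number of fluid cells per direction
    std::vector<double> u, v, p; // horizontal velocity, vertical velocity, pressure

    StaggeredGrid(std::size_t nx_, std::size_t ny_)
        : nx(nx_), ny(ny_),
          u((nx_ + 2) * (ny_ + 2), 0.0),
          v((nx_ + 2) * (ny_ + 2), 0.0),
          p((nx_ + 2) * (ny_ + 2), 0.0) {}

    // Linear index of cell (i, j); i = 0 and i = nx + 1 are boundary columns.
    std::size_t idx(std::size_t i, std::size_t j) const {
        assert(i < nx + 2 && j < ny + 2);
        return j * (nx + 2) + i;
    }

    // A cell belongs to the fluid domain if it is not part of the boundary layer.
    bool isFluid(std::size_t i, std::size_t j) const {
        return i >= 1 && i <= nx && j >= 1 && j <= ny;
    }
};
```

Keeping all three fields in arrays of identical shape means that one index computation serves u, v and p alike, which is convenient when the same layout is later mirrored in device memory.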


Algorithm: The Navier-Stokes Equations can be discretized at these positions and additionally in time. The discrete momentum equations to compute the new velocity for a grid point at time n + 1 are defined as follows: [1]

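$$u^{(n+1)}_{i,j} = F^{(n)}_{i,j} - \frac{\delta t}{\delta x}\left(p^{(n+1)}_{i+1,j} - p^{(n+1)}_{i,j}\right) \tag{5a}$$

$$v^{(n+1)}_{i,j} = G^{(n)}_{i,j} - \frac{\delta t}{\delta y}\left(p^{(n+1)}_{i,j+1} - p^{(n+1)}_{i,j}\right) \tag{5b}$$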

F and G are the right-hand sides of the two momentum equations 3a and 3b (see [1] for the full discretization).
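Equations 5a and 5b translate almost directly into a loop over the fluid cells. The sketch below assumes a row-major layout with one boundary layer and the convention that u sits on the right edge and v on the top edge of cell (i, j); the function name `updateVelocities` and the alias `Field` are illustrative and not taken from the project.

```cpp
#include <cassert>
#include <vector>

// Row-major field with (nx + 2) columns; one boundary layer on each side.
using Field = std::vector<double>;

// Equations 5a/5b: correct the preliminary velocities F and G with the
// pressure gradient of the new time level.
void updateVelocities(Field& u, Field& v, const Field& F, const Field& G,
                      const Field& p, int nx, int ny,
                      double dt, double dx, double dy) {
    assert(dx > 0.0 && dy > 0.0);
    const int w = nx + 2;
    for (int j = 1; j <= ny; ++j) {
        for (int i = 1; i <= nx; ++i) {
            const int c = j * w + i;
            // Assumed layout: u on the right cell edge, so the last column
            // lies on the domain boundary and is left to the boundary conditions.
            if (i < nx) u[c] = F[c] - dt / dx * (p[c + 1] - p[c]);
            // Assumed layout: v on the top cell edge, same reasoning for the last row.
            if (j < ny) v[c] = G[c] - dt / dy * (p[c + w] - p[c]);
        }
    }
}
```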

The pressure at time n + 1 can be determined with the discrete Poisson equation (equations 6 and 7). This leads to a linear system of equations with a number of unknowns $p_{i,j}$ equal to the number of fluid cells. [1]

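$$\frac{p^{(n+1)}_{i+1,j} - 2p^{(n+1)}_{i,j} + p^{(n+1)}_{i-1,j}}{(\delta x)^2} + \frac{p^{(n+1)}_{i,j+1} - 2p^{(n+1)}_{i,j} + p^{(n+1)}_{i,j-1}}{(\delta y)^2} = rhs_{i,j} \tag{6}$$

$$rhs_{i,j} = \frac{1}{\delta t}\left(\frac{F^{(n)}_{i,j} - F^{(n)}_{i-1,j}}{\delta x} + \frac{G^{(n)}_{i,j} - G^{(n)}_{i,j-1}}{\delta y}\right) \tag{7}$$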
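Since F and G do not change during the pressure iteration, equation 7 can be evaluated once per time step. A minimal sketch, assuming the same row-major layout with one boundary layer as above (the name `computeRhs` is hypothetical):

```cpp
#include <cassert>
#include <vector>

using Field = std::vector<double>;

// Equation 7: right-hand side of the pressure equation for all fluid cells.
void computeRhs(Field& rhs, const Field& F, const Field& G,
                int nx, int ny, double dt, double dx, double dy) {
    assert(dt > 0.0 && dx > 0.0 && dy > 0.0);
    const int w = nx + 2; // row length including both boundary columns
    for (int j = 1; j <= ny; ++j) {
        for (int i = 1; i <= nx; ++i) {
            const int c = j * w + i;
            // Backward differences of F in x and of G in y, scaled by 1 / dt.
            rhs[c] = ((F[c] - F[c - 1]) / dx + (G[c] - G[c - w]) / dy) / dt;
        }
    }
}
```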

In order to obtain the pressure values for each cell, this linear system of equations has to be solved. The approach described in [1], which is also used in this project, is to solve the system iteratively with the Gauss-Seidel method and successive over-relaxation (SOR). This is done by evaluating equation 8 for each cell repeatedly, until the L2-norm of the residual (equation 9) drops below a threshold or a maximum number of iterations is reached. [1]

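$$p^{it+1}_{i,j} = (1 - \omega)\, p^{it}_{i,j} + \frac{\omega}{\frac{2}{(\delta x)^2} + \frac{2}{(\delta y)^2}} \cdot \left(\frac{p^{it}_{i+1,j} + p^{it}_{i-1,j}}{(\delta x)^2} + \frac{p^{it}_{i,j+1} + p^{it}_{i,j-1}}{(\delta y)^2} - rhs_{i,j}\right) \tag{8}$$

$$r^{it}_{i,j} = \frac{p^{it}_{i+1,j} - 2p^{it}_{i,j} + p^{it}_{i-1,j}}{(\delta x)^2} + \frac{p^{it}_{i,j+1} - 2p^{it}_{i,j} + p^{it}_{i,j-1}}{(\delta y)^2} - rhs_{i,j} \tag{9}$$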

Note that formulas 6, 8 and 9 are only valid if the pressure values of the boundary cells are set to the pressure values of the respective adjacent fluid cells after each SOR iteration. [1]
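The iteration described above (equation 8, the boundary copy and the residual of equation 9) can be sketched as a sequential loop. The function names, the in-place sweep order and the normalisation of the norm by the number of fluid cells are assumptions made for this illustration. Because the sweep updates p in place, the left and lower neighbours already carry the values of the current iteration, which is the Gauss-Seidel behaviour.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Field = std::vector<double>;

// Copy the pressure of the adjacent fluid cells into the boundary layer.
void copyPressureToBoundary(Field& p, int nx, int ny) {
    const int w = nx + 2;
    for (int i = 1; i <= nx; ++i) {
        p[i] = p[w + i];                     // bottom row
        p[(ny + 1) * w + i] = p[ny * w + i]; // top row
    }
    for (int j = 1; j <= ny; ++j) {
        p[j * w] = p[j * w + 1];             // left column
        p[j * w + nx + 1] = p[j * w + nx];   // right column
    }
}

// One in-place sweep of equation 8 over all fluid cells.
void sorSweep(Field& p, const Field& rhs, int nx, int ny,
              double dx, double dy, double omega) {
    const int w = nx + 2;
    const double idx2 = 1.0 / (dx * dx), idy2 = 1.0 / (dy * dy);
    const double factor = omega / (2.0 * idx2 + 2.0 * idy2);
    for (int j = 1; j <= ny; ++j)
        for (int i = 1; i <= nx; ++i) {
            const int c = j * w + i;
            p[c] = (1.0 - omega) * p[c]
                 + factor * ((p[c + 1] + p[c - 1]) * idx2
                           + (p[c + w] + p[c - w]) * idy2 - rhs[c]);
        }
}

// L2-norm of the residual (equation 9), normalised by the number of fluid cells.
double residualNorm(const Field& p, const Field& rhs, int nx, int ny,
                    double dx, double dy) {
    const int w = nx + 2;
    double sum = 0.0;
    for (int j = 1; j <= ny; ++j)
        for (int i = 1; i <= nx; ++i) {
            const int c = j * w + i;
            const double r = (p[c + 1] - 2.0 * p[c] + p[c - 1]) / (dx * dx)
                           + (p[c + w] - 2.0 * p[c] + p[c - w]) / (dy * dy)
                           - rhs[c];
            sum += r * r;
        }
    return std::sqrt(sum / (nx * ny));
}

// Iterate until the residual drops below eps or maxIter sweeps were done.
// Returns the number of sweeps that were performed.
int solvePressure(Field& p, const Field& rhs, int nx, int ny,
                  double dx, double dy, double omega, double eps, int maxIter) {
    assert(omega > 0.0 && omega < 2.0);
    int it = 0;
    copyPressureToBoundary(p, nx, ny);
    while (it < maxIter) {
        sorSweep(p, rhs, nx, ny, dx, dy, omega);
        copyPressureToBoundary(p, nx, ny);
        ++it;
        if (residualNorm(p, rhs, nx, ny, dx, dy) < eps) break;
    }
    return it;
}
```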


Time Step Restrictions: In order to avoid numerical instability and oscillations, particles may not travel further than the length or height of a fluid cell in one time step. The maximum time step length that is possible in a particular situation is determined by the cell dimensions and the current maximum velocities and may therefore vary from time step to time step. A safe time step can be chosen by applying a safety factor $\tau \in\, ]0, 1]$ to this value. [1]

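$$\delta t = \tau \cdot \min\left(\frac{Re}{2}\left(\frac{1}{\delta x^2} + \frac{1}{\delta y^2}\right)^{-1},\ \frac{\delta x}{|u_{max}|},\ \frac{\delta y}{|v_{max}|}\right) \tag{10}$$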
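Equation 10 can be evaluated directly once the maximum velocities are known. The sketch below is illustrative: the function name `computeTimeStep` and the handling of a fluid at rest (a zero maximum velocity is treated as imposing no convective limit) are assumptions, not details from the project.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Equation 10: largest stable time step, scaled by a safety factor tau in ]0, 1].
double computeTimeStep(double re, double dx, double dy,
                       double uMax, double vMax, double tau) {
    assert(tau > 0.0 && tau <= 1.0);
    const double inf = std::numeric_limits<double>::infinity();
    // Diffusive limit: Re / 2 * (1 / dx^2 + 1 / dy^2)^-1
    const double diffusive = re / 2.0 / (1.0 / (dx * dx) + 1.0 / (dy * dy));
    // Convective limits; a fluid at rest imposes none (avoids division by zero).
    const double convX = uMax != 0.0 ? dx / std::fabs(uMax) : inf;
    const double convY = vMax != 0.0 ? dy / std::fabs(vMax) : inf;
    return tau * std::min({diffusive, convX, convY});
}
```

For example, with Re = 100, δx = δy = 0.1, a maximum horizontal velocity of 2 and τ = 0.5, the convective limit δx / |u_max| = 0.05 is the smallest term, so the time step becomes 0.025.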

Boundary Conditions: The pressure and velocity values of boundary cells cannot be computed the same way as those of the fluid cells due to missing neighbour cells. Depending on how these values are chosen, different boundary conditions can be simulated: [1]

No-slip means that boundaries act like solid walls. Velocity values on the edge between a fluid and a boundary cell are set to zero, while velocities inside of boundaries (parallel to the wall surface) are set to the opposite value of the velocity of the nearest fluid cells which is parallel to the wall surface. This way friction is simulated.

Free-slip is similar to no-slip, but in this case the fluid can flow along walls without friction.

Outflow is equivalent to no wall. Velocities at boundary cells are copied from the nearest fluid cell.

Inflow requires fixed values for velocities at a boundary.

Obstacles inside of the simulation domain are excluded from the solving algorithm and treated as no-slip boundaries.
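The difference between the boundary types is easiest to see for a single wall. The following sketch applies three of them to the bottom row of the domain; it assumes a row-major layout with one boundary layer and that v of a bottom boundary cell lies exactly on the wall, and the names `Boundary` and `applyBottomBoundary` are hypothetical. Inflow is omitted because it only writes prescribed values.

```cpp
#include <cassert>
#include <vector>

using Field = std::vector<double>;

enum class Boundary { NoSlip, FreeSlip, Outflow };

// Apply one boundary type to the bottom wall (row j = 0) of the domain.
void applyBottomBoundary(Field& u, Field& v, int nx, Boundary type) {
    assert(nx > 0);
    const int w = nx + 2;
    for (int i = 1; i <= nx; ++i) {
        const int b = i;     // boundary cell (i, 0)
        const int f = w + i; // adjacent fluid cell (i, 1)
        switch (type) {
        case Boundary::NoSlip:
            v[b] = 0.0;   // no flow through the wall
            u[b] = -u[f]; // mirrored sign: tangential velocity averages to zero at the wall
            break;
        case Boundary::FreeSlip:
            v[b] = 0.0;   // no flow through the wall
            u[b] = u[f];  // same value: no friction along the wall
            break;
        case Boundary::Outflow:
            v[b] = v[f];  // both components copied from the fluid cell
            u[b] = u[f];
            break;
        }
    }
}
```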

3 Basics of GPU Programming

GPUs (graphics processing units) are many-core microprocessors which support multiple hardware threads. In contrast to CPUs they are not general-purpose processors and offer a limited set of functionality, but focus on throughput with high numbers of concurrently executed operations. [3]

Multiple work-items (threads in CUDA terminology) are grouped into work-groups (blocks). The hardware executes the work-items of a group in bundles (warps) that share one instruction unit. Thus, they perform the same operation at one time, but use different data (SIMD, single instruction, multiple data according to Flynn's taxonomy). [3]

Functions can be implemented as kernels and executed on a GPU device. The execution context creates a thread for each data item, groups them into blocks and submits them to the device. [2, 3]

Usually the bottleneck in GPU programming is the memory bandwidth. Concurrent access to global memory by hundreds of threads is costly and may reduce the throughput significantly. The memory model of GPUs aims at distributing memory accesses to several memories. All cores in one block share a cache, called local (or shared) memory. This local memory allows threads in one block to interact with each other. Additionally, a part of the global memory is defined as constant and may not be altered from the device, only from the host system. This enables faster read access without write locks. Each thread also has a small private memory. [2, 3]

From the host, only the global device memory is visible. Any data to be used by kernels must be copied to the global memory first, and results must be written back to this memory by the kernel threads. [2, 3]

Due to the high number of hardware threads, GPUs offer very high performance even at lower clock frequencies than CPUs. However, it is not always possible to exploit the capabilities of GPUs. Only certain problems are well suited for execution on GPUs, particularly algorithms that perform mathematical operations on a massive number of values.

To get optimal performance, branching inside of kernel functions should be kept to a minimum. Due to the shared instruction unit, all branches must be processed sequentially, and threads idle while branches they do not take are executed.

Further information about the setup of OpenCL can be found in the OpenCL Programming Guide [4].

4 Implementation Details

4.1 Usage

The simulation software developed during the project solves the Navier-Stokes Equations for a two-dimensional domain with the help of a GPU. Fluid parameters can be controlled via a configuration file which currently has to be provided at program launch. In this file a black and white image in PGM format can be specified, which may contain arbitrary obstacles for the simulation domain.

Simulation on the GPU can be disabled with the command line parameter -cpu, the real-time visualization can be disabled via -vtk interval time limit. In that case legacy VTK files with pressure and velocity information are written at the given interval until the specified simulation time is exceeded. This way the simulation can be used via command line and ssh connections, even if no X-server is available.

NavierStokesGPU [-vtk interval time limit] [-cpu] parameter file
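To illustrate what a legacy VTK file for one scalar field looks like, the sketch below serialises a field in the ASCII variant of the format. It is not the project's writer: the function name `toLegacyVtk`, the title line and the choice of a STRUCTURED_POINTS data set with point data are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Serialise a scalar field with nx * ny values as a legacy ASCII VTK file.
std::string toLegacyVtk(const std::vector<double>& values, int nx, int ny,
                        double dx, double dy, const std::string& name) {
    assert(nx > 0 && ny > 0);
    assert(values.size() == static_cast<std::size_t>(nx) * static_cast<std::size_t>(ny));
    std::ostringstream out;
    out << "# vtk DataFile Version 3.0\n"      // mandatory version line
        << "Navier-Stokes simulation output\n" // free-form title (hypothetical)
        << "ASCII\n"
        << "DATASET STRUCTURED_POINTS\n"
        << "DIMENSIONS " << nx << ' ' << ny << " 1\n"
        << "ORIGIN 0 0 0\n"
        << "SPACING " << dx << ' ' << dy << " 1\n"
        << "POINT_DATA " << values.size() << '\n'
        << "SCALARS " << name << " double 1\n"
        << "LOOKUP_TABLE default\n";
    for (double value : values) out << value << '\n'; // x index varies fastest
    return out.str();
}
```

Writing the returned string to a file with the extension .vtk yields a file that viewers such as ParaView can open.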

If the -vtk option is used, the simulation starts immediately, otherwise the GUI is displayed, providing a button for starting and pausing the simulation and an option for automatic rescaling of the visualization colors to the minimum and maximum values. The real-time viewer represents the velocity magnitude for each cell. By clicking and dragging inside of the viewer new obstacles can be added.

Fig. 2: User interface with a Karman Vortex Street

At run-time, the time required for the simulation and visualization of one time step and the resulting frames per second are displayed. Additionally, the number of SOR iterations is given. After exiting, more detailed statistics about the simulated and elapsed time and the ratio of computing time for simulation to visualization are printed to the console.

4.2 Implementation

The program is basically divided into two parts, the simulation and the visualization.

Two solver classes are available, one for simulation on a single CPU only and one for GPU usage. The OpenCL setup is handled by the class CLManager, which is currently used by the GPU solver only, but generally allows other classes to use the OpenCL context as well. It was originally added to enable the modification of OpenGL textures with OpenCL inside of viewer classes.

An input parser reads the command line arguments and the configuration file. All parameters are stored in a struct which is passed by reference to all objects requiring it.

In general, Qt was only used for the graphical user interface (GUI) and the real-time visualization. The only exception is the simulation class, which is derived from QThread. Solvers use C++ and OpenCL only.

Simulation: The class simulation triggers the initialization of the simulation domain and the computation of each time step, which is executed by the chosen solver. It collects statistical data, provides an interface for interaction with viewer objects and triggers visualization after a time step is completed. Thus, visualization and simulation are not yet interleaved, but sequential. To keep the GUI responsive, the simulation and visualization are executed in a separate thread.

Fig. 3: Class diagram (classes shown: MainWindow, InputParser, Simulation, NavierStokesSolver, NavierStokesCpu, NavierStokesGpu, CLManager, Viewer, GLViewer, VTKViewer, SimplePGMViewer)

Both solvers follow the algorithm described by Michael Griebel [1]. First, the maximum time step duration is evaluated and boundary conditions are applied. Then F and G and the right-hand side of the pressure equation (see formulas 5a, 5b and 7) are precomputed, as they are constant during the SOR iterations. During each SOR iteration, first the SOR step is solved, then the pressure boundary conditions are applied and the residual is calculated.

The GPU solver moves all these steps as kernels to the GPU. The entire simulation is performed on device memory; data is only copied to the host main memory for visualization. At start-up the kernels are created and values and memory objects are linked as parameters, so only kernel parameters that change during the simulation have to be updated.

Fig. 4: Basic algorithm of each time-step (flowchart: get time step; apply velocity boundary conditions; compute F and G from velocities; compute right-hand side of pressure equation; solve Poisson equation for pressure; apply pressure boundary conditions; compute residual; if residual < ε continue with visualization, otherwise repeat the pressure iteration)

Visualization and Interactivity: Three different viewers are implemented. The default viewer GLViewer, which is the only one supported by the GUI, is an OpenGL widget of Qt. It renders the magnitude of the velocity vector for each fluid cell in black and white and may rescale the palette for each time step if selected in the GUI.

Mouse press and release events are captured to detect user interaction inside of the flow field. Whenever the cursor is moved while the mouse button is pressed, a Qt signal is emitted which is connected via the simulation to a function of the solver. Here, the obstacle flags of the surrounding fluid cells are updated. To prevent unphysical behaviour, velocities and pressure inside the obstacles are set to zero.

Note that by converting fluid cells to boundary cells fluid mass is effectively removed. Because of this it is still possible to create situations with unexpected flow behaviour.
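The solver-side part of this interaction could look roughly like the following sketch. All names (`Domain`, `drawObstacle`, the `radius` parameter) are hypothetical, and the square brush shape is an assumption; the staggered placement of the velocities is ignored for brevity, so each cell simply clears the values stored at its own index.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical domain with one boundary layer around nx * ny fluid cells.
struct Domain {
    int nx, ny;
    std::vector<unsigned char> obstacle; // 1 = obstacle cell, 0 = fluid cell
    std::vector<double> u, v, p;

    Domain(int nx_, int ny_)
        : nx(nx_), ny(ny_),
          obstacle(static_cast<std::size_t>((nx_ + 2) * (ny_ + 2)), 0),
          u(obstacle.size(), 0.0), v(obstacle.size(), 0.0), p(obstacle.size(), 0.0) {}
};

// Turn the fluid cells in a square of half-width `radius` around (ci, cj)
// into obstacle cells and clear their velocity and pressure values.
void drawObstacle(Domain& d, int ci, int cj, int radius) {
    assert(radius >= 0);
    const int w = d.nx + 2;
    // Clamp the brush to the fluid cells so the boundary layer stays untouched.
    const int iMin = std::max(1, ci - radius), iMax = std::min(d.nx, ci + radius);
    const int jMin = std::max(1, cj - radius), jMax = std::min(d.ny, cj + radius);
    for (int j = jMin; j <= jMax; ++j)
        for (int i = iMin; i <= iMax; ++i) {
            const std::size_t c = static_cast<std::size_t>(j * w + i);
            d.obstacle[c] = 1;
            d.u[c] = d.v[c] = d.p[c] = 0.0;
        }
}
```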

The VTKViewer does not work together with the GUI. The third viewer forPGM files is no longer accessible without source code changes.

4.3 Kernels

The parts of the algorithm (compare figure 4) are implemented as kernels. The most interesting kernels are the time step and residual calculations, due to the parallel reduction, and the pressure solver.

Poisson SOR Kernels: The Gauss-Seidel method is hard to parallelize, as each cell depends on the values that the current SOR iteration has already computed for its upper and left neighbour cells. One possibility is to calculate the cell in the first corner first and continue with adjacent cells, leading to a wave-like pattern, but on GPUs this complex approach results in a massive number of idling threads and is therefore not an option.

Instead, a red-black Gauss-Seidel method with over-relaxation was implemented. This approach divides all cells into two groups (red and black) in a chessboard-like pattern. All cells with an even index sum ((i + j) mod 2 = 0) are classified as black and evaluated first. At this point all their neighbour cells still have the values from the previous SOR iteration. Afterwards, all red cells are evaluated; their neighbours all have updated values already. Initially, this approach was implemented without over-relaxation due to concerns regarding numerical stability, but no problems were encountered with SOR.

Compared to the original Gauss-Seidel SOR method, the red-black method converges more slowly towards the defined threshold.
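A sequential sketch of one red-black iteration is given below. It is written in plain C++ rather than as an OpenCL kernel, and the function names are illustrative. The point it demonstrates is that all cells of one colour only read cells of the other colour, so the inner loop body could be executed by independent threads in any order.

```cpp
#include <cassert>
#include <vector>

using Field = std::vector<double>;

// Update all fluid cells of one colour; parity 0 selects cells with even i + j.
void sorColourSweep(Field& p, const Field& rhs, int nx, int ny,
                    double dx, double dy, double omega, int parity) {
    assert(parity == 0 || parity == 1);
    const int w = nx + 2;
    const double idx2 = 1.0 / (dx * dx), idy2 = 1.0 / (dy * dy);
    const double factor = omega / (2.0 * idx2 + 2.0 * idy2);
    for (int j = 1; j <= ny; ++j)
        for (int i = 1; i <= nx; ++i) {
            if ((i + j) % 2 != parity) continue; // cell has the other colour
            const int c = j * w + i;
            // All four neighbours have the opposite colour, so none of them
            // is modified during this sweep.
            p[c] = (1.0 - omega) * p[c]
                 + factor * ((p[c + 1] + p[c - 1]) * idx2
                           + (p[c + w] + p[c - w]) * idy2 - rhs[c]);
        }
}

// One full iteration: black cells first, then red cells.
void redBlackIteration(Field& p, const Field& rhs, int nx, int ny,
                       double dx, double dy, double omega) {
    sorColourSweep(p, rhs, nx, ny, dx, dy, omega, 0); // black: even i + j
    sorColourSweep(p, rhs, nx, ny, dx, dy, omega, 1); // red: odd i + j
}
```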

Reduction Kernels: For the time step the maximum velocities of all cells are required, for the residual the sum of all residual values. Both require a loop over all cells, but yield only one result value.

To efficiently calculate these values in parallel on the GPU, a block of threads of the maximum work-group size is created, so that all threads have access to the same shared memory. Each thread iterates over several cells in such a way that all simultaneously accessed cells are adjacent in memory. This way the memory access is coalesced and redundant accesses are minimized (compare also the paragraph on general kernels below).

Each thread collects a private value (maximum or sum, respectively). Once each cell has been visited by one thread, the collected values of the threads are successively merged in a binary tree pattern. Finally, the last thread merges the last two values and writes the single result back to global memory.

(a) Exemplary assignment of cells to threads during the collection part of the algorithm.

(b) Merging of intermediate results in a binary tree structure.

Fig. 5: Parallel reduction algorithm with four threads (blue, green, yellow, red).
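The two phases of the reduction can be emulated sequentially to make the access pattern explicit. In the sketch below the outer loops stand in for the threads of one work-group and the vector `partial` for local memory; the function name and the restriction to a power-of-two thread count are assumptions of this illustration, not properties stated for the project's kernels.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Serial emulation of the work-group reduction for the maximum absolute value.
// numThreads must be a power of two.
double reduceMaxAbs(const std::vector<double>& cells, std::size_t numThreads) {
    assert(numThreads > 0 && (numThreads & (numThreads - 1)) == 0);
    std::vector<double> partial(numThreads, 0.0); // stands in for local memory

    // Collection phase: in step k, thread t reads cell k * numThreads + t, so
    // the cells read in the same step are adjacent in memory (coalesced access).
    for (std::size_t t = 0; t < numThreads; ++t)
        for (std::size_t c = t; c < cells.size(); c += numThreads)
            partial[t] = std::max(partial[t], std::fabs(cells[c]));

    // Merge phase: the number of active threads is halved in every step.
    for (std::size_t stride = numThreads / 2; stride > 0; stride /= 2)
        for (std::size_t t = 0; t < stride; ++t)
            partial[t] = std::max(partial[t], partial[t + stride]);

    return partial[0]; // the single value that would be written to global memory
}
```

On a real device a barrier is needed between the collection phase and each merge step, because the threads of a work-group do not advance in lockstep across warps. Replacing `std::max` by addition gives the sum needed for the residual.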

General Kernels: As threads are grouped into two-dimensional blocks and submitted to the device, a guard condition is added to ensure that all calculations are only executed for values inside of the fluid domain, not for boundaries or invalid coordinates. The definition of a block size is also an important prerequisite for memory access pattern optimizations: In many kernels the values of neighbour cells are required to determine a new cell value. Hence, each thread requires multiple values from global memory. By copying all values required within a block (a typical block size is 16 times 16 cells) into local memory, the number of global accesses is reduced. Furthermore, multiple subsequent values are loaded from device memory at once. To exploit this fact, it is important to take care of coalesced memory access: Instead of a chaotic access pattern (each thread loads its value, threads at block boundaries load an extra boundary value), the values should be copied to shared memory in the same order they are loaded from global memory. This feature is not fully implemented yet.

In some cases branches could be avoided by incorporating the conditions into a formula. Since false is equal to zero, a condition can be used as a factor to eliminate parts of a formula.
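A small example of this technique, with hypothetical function names, contrasts a branching update with an equivalent branch-free formula in which the condition acts as a factor of 1 or 0:

```cpp
#include <cassert>

// Branching version: threads of one group may take different paths.
double updateBranching(bool isFluid, double oldValue, double newValue) {
    if (isFluid) return newValue;
    return oldValue;
}

// Branch-free version: the condition converts to 1.0 or 0.0 and selects a term.
double updateBranchFree(bool isFluid, double oldValue, double newValue) {
    const double mask = static_cast<double>(isFluid);
    return mask * newValue + (1.0 - mask) * oldValue;
}
```

Both versions agree for finite inputs. The branch-free form evaluates both terms, so a NaN or infinite value in the deselected term would propagate into the result; the technique is therefore only safe where both operands are known to be finite.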

5 Evaluation

5.1 Validation

To test the correctness of the simulation results, three scenarios have been set up: moving lid, channel and a Karman Vortex Street. Visualization was done with ParaView from VTK files.

Moving Lid Scenario: In this scenario, one wall is moving with a fixed velocity and should cause the fluid in the simulation domain to rotate due to friction. [1]

Fig. 6: Velocity of moving lid scenario with Re = 1.

Fig. 7: Velocity of moving lid scenario with Re = 5000.

As can be seen in figures 6 and 7, the expected vortices are formed. For a low Reynolds number the viscous forces are dominant, and due to low inertial forces the fluid starts to rotate slowly. For a high Reynolds number such as 5000 the viscosity of the fluid is small and the formed vortex is more distinct. Additionally, high and low pressures can be observed in the corners at the moving wall. Qualitatively this scenario is simulated correctly.

Channel Scenario: In a rectangular channel with a constant inflow and an outflow on the opposite boundary, a linear drop of pressure and a parabolic velocity profile are expected due to the lower velocity near the boundaries.

Fig. 8: Channel scenario with homogeneous inflow: Pressure drop.

Fig. 9: Channel scenario with homogeneous inflow: Velocity profile.

Figures 8 and 9 show the profiles of pressure (along the channel) and velocity (cut through the channel). The results are close, but clearly not exactly as expected. The inflow uses the same velocity value for all boundary cells, independent of their vertical position. Hence, the velocity profile at the inflow is constant and not parabolic, but approaches a parabola with increasing distance to the inflow. The faster particles near the boundaries at the inflow are slowed down, increasing the pressure in this area.

To solve this problem, a parabolic inflow was implemented for the channel scenario. Despite a flaw in the parabola formula for the inflow that causes small pressure distortions, pressure drop and velocity profile are now consistent with the expectations (compare figures 10 and 11).


Fig. 10: Channel scenario with parabolic inflow: Pressure drop.

Fig. 11: Channel scenario with parabolic inflow: Velocity profile.

Karman Vortex Street: Finally, the flow of a fluid around an obstacle is tested. For the setup displayed in figure 12 the formation of a Karman Vortex Street is expected.

Fig. 12: Karman Vortex Street of flow around an obstacle (Re = 100).

5.2 Performance Evaluation

Depending on the grid size a frame rate of up to 60 frames per second is possible for all of the tested scenarios, especially if a stable state is reached and only few SOR iterations are required. While drawing, the time required per time step may increase due to fast velocity and pressure changes within few time steps.

Figure 14 shows a comparison of the total run-time of the CPU solver and the GPU solver. This measurement was done in vtk mode for the Karman Vortex scenario and different grid sizes. The grid size is given as the total number of cells.


Fig. 13: Karman Vortex Street of flow around an obstacle (Re = 500).

The GPU version is only slightly faster than the CPU version. A single SOR iteration does take less time when calculated on the GPU, but due to the slower convergence of the red-black method more SOR iterations are required to reach the same threshold. For example, for the Karman Vortex scenario with 300 times 75 fluid cells, the Gauss-Seidel SOR method on the CPU requires only 5 to 9 SOR iterations to converge and reaches about 60 frames per second. The GPU solver with the red-black pattern requires between 15 and 40 SOR iterations for convergence, with a frame rate between 40 and 60 fps.


Fig. 14: Run-time of CPU and GPU solver depending on grid cell number.


6 Conclusion and Future Directions

6.1 Conclusion

The current version of the simulation software is capable of solving the Navier-Stokes Equations with GPU acceleration for a two-dimensional domain and of visualizing the results in real time. A first interactive steering functionality is implemented, allowing the user to draw obstacles into the running simulation.

Nevertheless there are many possibilities to optimize and extend the implementation.

The GPU port itself provides a performance gain, but it is more or less negated by the slower convergence. Therefore, further optimizations of the GPU usage may be beneficial, but algorithmic optimizations are more urgent. Different SOR relaxation parameters may change the result slightly, but the overall performance of the red-black method seems to be limited.

A Jacobi solver could be worth a try and should be relatively easy to implement in the current software. However, experience with other partial differential equations, such as the simpler heat equation, has shown that Jacobi is perfectly parallelizable but converges much more slowly than Gauss-Seidel, even without SOR. Another possibility would be the conjugate gradient method.

Regardless of these limitations, the knowledge gained about fluid dynamics during this project was immense.

6.2 Future Directions

Some minor issues with the Navier-Stokes Equations remain which could be fixed, such as the parabolic inflow for the channel scenario or the possibility of unphysical behaviour in some situations after the interactive addition of obstacles.

Beyond that, there are a number of interesting extensions that could be applied to the software.

First of all, the visualization is still unsatisfactory. Instead of a plain grey scale plot of the velocity, different colour gradients, streamlines, arrows or similar could be used. Even better and more intuitive would be the insertion of particles, either as single particles or as a particle field.

Also, the interactivity could be extended. Currently, the removal of obstacles is not supported due to problems with the pressure in the respective cells. Furthermore, the direct manipulation of velocities via vectors drawn with the cursor could provide interesting possibilities, as well as fluid sources and sinks inside of the domain. Of course, alteration of simulation parameters and loading of configurations at run-time could be added as well.

Aside from other solving methods, as described in the previous section, the Navier-Stokes Equations themselves could be extended. A third dimension might be interesting, but it would require a totally different visualization and the complexity would increase significantly, making real-time simulations for adequately large domains hard to realize. Instead, free boundaries for multiple fluids or turbulence models are more promising.


References

1. Michael Griebel, Thomas Dornseifer, Tilmann Neunhoeffer: Numerical Simulation in Fluid Dynamics - A Practical Introduction. Society for Industrial and Applied Mathematics, Philadelphia 1998

2. Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry, Dana Schaa: Heterogeneous Computing with OpenCL, 2nd Edition. Morgan Kaufmann, 2012

3. Daniel Cremers, Martin Oswald, Frank Steinbrücker: GPU Programming in Computer Vision. Lecture at Informatics 9 - Chair of Computer Vision and Pattern Recognition (Prof. Cremers), Technische Universität München, 2013

4. Aaftab Munshi, Benedict Gaster, Timothy Mattson, James Fung, Dan Ginsburg: OpenCL Programming Guide. Addison-Wesley Professional, 2011
