Solution of the Transport Equation
using Graphical Processing Units
Gil Gonçalves Brandão
Master Thesis on
Aerospace Engineering
Jury
President: Prof. Fernando José Parracho Lau
Supervisor: Prof. José Carlos Pereira
Co-supervisor: Mst. Ricardo José Nunes dos Reis
Examiner: Prof. José Leonel Monteiro Fernandes
October 2009
Acknowledgements
I want to thank Professor José Carlos Pereira and Ricardo Reis for this great opportunity to work with them at LASEF, and for having helped me close a long chapter of my life.
I want to express that the fact that my parents always let me live in complete freedom, without ever even suggesting that I do something I did not agree with, is of immeasurable value to me, as is the value this brings to the closing of this chapter. In the same way, I want to explicitly express my deep gratitude for never having felt any pressure to finish the degree, in a world where time to market (and not happiness) seems to be the rule.
I also want to thank the person who, in recent years, knew how to keep all the plates balanced in the best sense, but also to create imbalances whenever necessary. Without her, this text would most likely never have been written.
More generally, I thank Rádio Zero: the means of access, the intervention, the experimentalism and the friends. And all those who, by contributing to Free Culture, show the World that it is possible to live together in harmony and progress without leaving anyone behind.
Resumo
Contradicting 30 years of progress in CPU speed, recent years have shown a saturation point in CPU clock rates. This fact conflicts with the ever-growing computational needs of the computational fluid dynamics scientific community. At the same time, GPUs have emerged as a high-performance parallel computational resource and an alternative to the CPU. This new technology is also cheaper than the traditional parallel approaches. This thesis investigates the computational paradigm associated with GPUs and its implementation, the use of this technology to solve the one-dimensional transport equation, and the gains compared with a CPU-based solution.
NVIDIA's CUDA technology is used as the platform for accessing the GPUs. Tests were implemented to acquire real knowledge of the technology. The computational routines needed to solve the transport equation were also implemented with this technology.
The results obtained in this work show that the technology, although new and little explored, is a very promising platform on which concrete gains in the domain of computational fluid dynamics can be achieved.
Keywords: parallel computing; GPU; high performance; computational fluid dynamics; finite differences; linear systems.
Abstract
Contradicting 30 years of progress in CPU speed, the last few years have shown a saturation point in clock rates. This fact collides with the ever growing demand for computational power from the CFD scientific community. At the same time, GPUs have emerged as a parallel, high performance alternative computational resource. This new technology is also cheaper than the traditional parallel approaches. This work investigates the computational paradigm attached to the GPUs, the usage of these devices to solve the one-dimensional transport equation and what performance gains exist when compared with traditional CPU usage.
The CUDA technology from NVIDIA is used as the platform to access the GPUs. Tests were implemented to acquire real knowledge of the technology. The routines necessary to solve the transport equation were also implemented.
The results obtained in this work show that this technology, albeit new and immature, is a very promising platform on which real speedups in the CFD domain can be achieved.
Keywords: parallel computing; GPU; high performance; speedup; CFD; finite differences;
linear systems.
Contents
1 Introduction
1.1 GPU, the Cluster on the Desktop
1.1.1 Historical Perspective
1.2 GPU Technologies
1.3 Objectives
1.4 Methodologies
1.5 Outline
2 CUDA Programming Overview
2.1 Parallel Computing Overview
2.1.1 Parallel Systems
2.1.2 Parallel Programming
2.2 GPU Hardware Model
2.3 CUDA Programming Model
3 CUDA Environment Tests
3.1 Test System
3.2 Metrics
3.3 Peak Throughput
3.4 Bandwidth
3.4.1 Host - device transfers
3.4.2 Device - device transfers
3.5 Stream benchmark
3.6 Summary
4 Burgers equation solver
4.1 Mathematical Model
4.2 Computational Model
4.2.1 Computational Methods
4.3 Implementation
4.3.1 Data structures
4.3.2 Program and algorithms
4.3.3 Computational Resources
4.4 Metrics
4.5 Simulation Results
4.5.1 Performance
4.5.2 Numeric errors
4.6 Summary
5 Conclusion
5.1 Summary
5.2 Conclusions
5.3 Future Work
A Additional Information
A.1 Properties of some GPUs
B Code Listings
B.1 Benchmarks
B.1.1 FLOP benchmark
B.1.2 Bandwidth
B.1.3 Stream
B.2 Burgers equation solver
B.2.1 Linear Algebra
B.2.2 Numerical Methods
B.2.3 Application
List of Figures
2.1 The von Neumann model
2.2 Parallel computing memory patterns
2.4 Parallel Speedup Evolution
2.5 Host Computer
2.6 A GPU
2.7 Asynchronous execution of the device
2.8 CUDA program example
3.1 FLOP test, total time of execution
3.2 FLOP performance
3.3 Time of the total transfer cycle
3.4 Details of the data transfers for two different sizes
3.5 Workload distribution for a vector of size 7 and 3 threads
3.6 Intra device memory transfers
3.7 Intra device memory transfers w/ cache
3.8 Stream benchmark
4.1 Main program
4.2 Problem of the row interchange order
4.3 Absolute speedup
4.4 Initialization ratio
4.5 Loop speedup
4.6 Initialization with the inverse computed on the GPU
4.7 Speedup with the inverse computed on the GPU
List of Tables
3.1 Host system hardware details
3.2 Device properties
3.3 Memory and operation accounting
3.4 Summary
3.5 Stream benchmark results
4.1 Resume of computational operations
4.2 Memory usage in floating point elements
A.1 Properties of several GPUs
Chapter 1
Introduction
Engineers doing Computational Fluid Dynamics (CFD) have always struggled for computing resources to solve their problems faster. And when they manage to solve one of those problems, there is always something bigger around the corner. The natural way to fulfil this demand (for speed and size) is to increase the machine's speed while maintaining its original sequential form. This sequential origin probably exists because the natural way for humans to express algorithms is as a sequence of simple actions. Until a few years ago, it was possible to just wait a few months and buy a new, faster machine. Moore's Law, stating that on-die transistor count doubles every two years, was granting this free ride, providing a steady increase in CPU speed. Unfortunately, current technology has run into a wall: the ever decreasing die size with increasing transistor density has made heat problems unbearable. The obvious sign is in clock rates: the brand new Intel Core i7 has a 3.3GHz clock rate, while Intel Xeon processors had already reached a maximum clock rate of 3.6GHz years earlier. The major consequence of it being impossible to keep increasing clock speeds is that the old sequential algorithms no longer get faster for free. The solution for growth is now, more than ever, parallel computing. Instead of adding capacity to one processing unit, the number of processing units is increased; metaphorically, to increase the flow rate while keeping the velocity constant, we can only increase the cross-sectional area. The CPU industry has already understood this: it has turned multi-core and chosen to embrace the parallel way of marching into the future. But parallel computing, like fluid mechanics, isn't linear, and the speedup is no longer free after the purchase of new hardware: there is an inherent complexity when compared to the serial way of thinking and, depending on the problem, the speedup can range from negligible to linear scaling with the number of processing units.
Meanwhile, the graphics card industry was pursuing its own path, using parallel, dedicated hardware from the start and driven by the rich market demand of gamers. Graphics Processing Units (GPUs) gained more capabilities and, in the last couple of years, toolkits and dedicated frameworks appeared, allowing the true start of the exploration of GPU power for general computing. The problem is, of course, that despite these higher level frameworks, this is still specialized hardware, and a thorough knowledge of its intricacies is needed to harness its full power. All the more so because the CPU world, albeit going multi-core and parallel, is tied to the need of answering very different kinds of requests simultaneously: a true Land Rover of the computing world, whereas GPUs are more in the class of high speed F1 machines.
This work focuses on dissecting and exploring the GPU for solving partial differential equation
problems. I have tried to carefully characterize the GPU, especially under the CUDA environment
from NVIDIA.
1.1 GPU, the Cluster on the Desktop
The GPU is the processing core of current graphics cards. The graphics card is the component of a computer responsible for outputting electrical signals to visual displays. Each pixel on the display needs to have its color and intensity computed before being transformed into display signals. The sequence of operations needed to process each pixel is called the graphics pipeline[12, chap. 1] and it is both computationally intense and highly parallel: computationally intense because there are many pixels in a frame and several frames per second; highly parallel because the state of each pixel is completely independent from its neighbors, so the computation of each pixel can be done at exactly the same time as any other pixel in the frame.
Mainly due to the demands of the electronic game market, the specifications of graphics cards have been growing every year, so that computer games could look and feel more realistic. At the same time, and for the same reasons, the degree of programmability of these devices has grown, and today the devices aren't only meant for graphics: they have become general purpose computing devices. In terms of raw computing power, compared with traditional CPU technologies, this power is only matched by large aggregates of computers called clusters. For example, the new Intel Core i7 (with four 3GHz cores) does nearly 70GFLOPS while the NVIDIA Tesla C1060 (with its two hundred and forty 1.3GHz cores) does approximately 900GFLOPS. This means that more than 10 Core i7 processors would be necessary to achieve the same raw performance.
Each solution has its strengths and weaknesses. For example, one clear disadvantage of the GPU is that its RAM is limited (and it can't be expanded using transparent swapping technologies). However, GPU memory is faster than the RAM of a computer (102GB/s on the NVIDIA Tesla C1060 vs 12GB/s with the most recent DDR3 memory) and the communication between the cores of the GPU is faster than the one on a cluster. Another advantage of the GPUs is cost, in two ways: primarily the cost of purchase, and secondly the cost of powering the devices. The price of a high end GPU device is, generally, the price of a single node of a cluster. A real case is the last expansion of the LASEF1 cluster: each node cost on the order of €2000 and 5 nodes were bought; at the same time, a GPU solution with more raw processing power than the whole cluster (including the old nodes) was acquired for less than €8000. In terms of electrical power, the NVIDIA Tesla C1060 consumes 200W, i.e., a fraction of what a cluster node would consume. Another big advantage of GPUs is manageability: the cost of managing a single card (and a single computer) isn't comparable to the cost of managing a whole cluster.
Last but not least, if truly big computing power is needed, there is always the possibility of making clusters of GPUs. Because of all the previous considerations, we are convinced that we need to investigate how to use this new kind of computing device.
1.1.1 Historical Perspective
The idea of using multiple co-processors to increase a workstation's performance in a (highly) parallel fashion is not completely new. An example of this approach was the Atari Transputer Workstation[1], presented at the 1987 COMDEX, which featured the so called "Farm Cards". These cards essentially consisted of more processors to be used by the operating system. In those days the ATW had significant parallel performance, but the technology didn't go mainstream and the product was soon discontinued.
Also for workstations, and since the Intel 8086 processor, Intel provided a co-processor family: the Intel x87. These co-processors were floating point units designed for high performance numerical applications (for example they featured, among many others, exponential and logarithmic functions). By the time of the i386 processor, they were IEEE 754 compliant and provided asynchronous operation (i.e., parallel to the CPU). By 1989, these units were incorporated into the i486DX processor. Another step in the parallel computing history of workstations was the introduction of the MMX units into the Pentium processor family (in 1997). Although not meant for numerical applications (since they were integer units oriented towards the multimedia field), they featured instruction pipelines with a SIMD philosophy, so that multiple data was processed with one single instruction. Since MMX, the use of similar technologies (integer and floating point) has never stopped growing.
When clock rates started to saturate, the CPU makers started to look at truly parallel approaches. In 2001 Intel released a technology called hyper-threading, which improves the performance of multi-threaded code. In 2005 the dual core (two processing units in the same chip) Intel Pentium D processor was released. Successive generations of CPUs have seen their core counts multiply (AMD produces the 3 core Phenom and the 6 core Opteron). In architectures other than x86 there are also multi-core processors, such as the IBM/Toshiba Cell processor.
1 LASEF - Laboratory and Simulation of Energy and Fluids, Instituto Superior Tecnico
Meanwhile, in the graphics scene, the logic was different. From the beginning, the idea was to offload graphics output functions from the CPU. With the video game industry in mind, in 1999 NVIDIA launched the first graphics processing unit: the GeForce 256. This card (with its Transform & Lighting technology) was the first to offload the whole 3D graphics pipeline[8]. By 2001, with the GeForce 3, the vertex shading process became programmable. The substitution of rigid components with programmable ones in the graphics pipeline didn't stop and, in 2002, NVIDIA released the Cg technology (C for graphics), a highly specialized language (and compiler) for graphics hardware that works on top of the OpenGL or DirectX libraries and opens the graphics hardware to the graphics developer. Since then, GPUs have been programmed to solve problems other than computer graphics: it was the beginning of General Purpose computing with Graphics Processing Units (GPGPU). A key to general purpose programming with this approach is mapping the algorithms to the highly specialized graphics pipeline methods[26]. In 2007 NVIDIA released the Compute Unified Device Architecture (CUDA), a new language (deeply based on C) that is fully oriented towards GPGPU: it allows any programmer to use the full power of the GPUs without needing to know anything about the graphics pipeline. The potential of GPU based technologies led to the rise of Apple's OpenCL (Open Computing Language) as an industry standard for GPGPU. It's worth mentioning that not only NVIDIA technologies exist: other projects such as the Brook project, LibSh or AMD Stream are also available to work with.
1.2 GPU Technologies
From the hardware side, there are mainly two device makers, NVIDIA and ATI, with their successive generations of devices. Generally, each new generation of devices significantly increases computing power and programmability.
On the software side, as said in section 1.1.1, there are two kinds of technology for computing data on GPUs: mapping the algorithms to the graphics pipeline, or using the more recent general languages. Currently, in the first approach, there are two big families: NVIDIA Cg2 (which includes the Microsoft High Level Shading Language3) and the OpenGL Shading Language4. In the general language field, there are NVIDIA CUDA5, OpenCL6 and the Brook7 family (which includes the AMD/ATI technology).
2 http://developer.nvidia.com/page/cg_main.html
3 http://msdn.microsoft.com/en-us/library/bb509561%28VS.85%29.aspx
4 http://www.opengl.org/documentation/glsl
Even if, at the current date, they are of less interest (because of their inherent additional complexity), there has been research on how to exploit the GPU potential using the graphics pipeline, i.e., on how to map a specific problem to the graphics pipeline. For example, in physically based simulation of fluid dynamics: a Navier-Stokes solver algorithm oriented to GPUs[30] using a solver based on the method of characteristics; a cloud simulator[20] using a Jacobi solver; and, in the fluid flow scientific domain, lattice-Boltzmann[10] and finite element[27] methods were studied. Work on the advection-diffusion problem using forward finite differences and the Crank-Nicolson method has been done[28]. In the linear algebra domain, broader work[22, 23] has been done on matrix multiplication[11].
With the release of the CUDA framework, the devices were completely opened to programmers and, since then, all kinds of computational applications have been released: video encoding, weather prediction, molecular simulation, fluid dynamics, computer vision, cryptography, etc. In the CFD context, lattice Boltzmann methods have been studied[32, 19, 37]. Apart from this, the Navier-Stokes equations have been solved using finite volume[7] and finite element[17] codes. In the linear algebra domain, several codes have been developed. NVIDIA released a CUDA version of the standard BLAS library (routines that compute simple operations, such as addition and multiplication, at the vector and matrix level). At a higher level (linear system solvers), two main orientations seem to exist at present. On one side, there is a big interest in maintaining the old LAPACK8 library interface, using the GPU as a high performance computational resource: the factorization algorithms (such as the LU, QR or Cholesky factorizations) are being implemented[34] and hybrid CPU-GPU approaches are being studied[33]. On the other side, a new generic approach to algebra algorithms is being developed[15] and the GPUs are being used as a test framework for this new approach[5]. Besides these two approaches, there is also work on sparse matrix algebra[16].
1.3 Objectives
The main objectives of the present work are:
• to investigate the concepts behind the technology, their implementation and what key mechanisms can lead to the best performance;
• to investigate the performance of GPU based computing in a class of CFD problems: solving the advection-diffusion transport equation using finite difference methods.
5 http://www.nvidia.com/object/cuda_home.html
6 http://www.khronos.org/opencl/
7 http://graphics.stanford.edu/projects/brookgpu
8 For complete information, search the LAPACK working notes at http://www.netlib.org
1.4 Methodologies
To achieve the stated objectives, the following was done:
• Port a state of the art benchmark to the CUDA environment to understand the programming paradigm and compare it with the CPU environment. The results are also compared with other works. This is necessary because the technology is completely new and, because of that, major errors can be made without notice.
• Implement tests that quantify the relative performance of the different memory access methods. Unlike the CPU paradigm, where two main performance aspects have to be considered (RAM and cache), the GPUs present a vast number of options in the hardware.
• Implement equivalent programs that solve the one-dimensional advection-diffusion equation in both the CPU and GPU environments, using compact schemes to compute the spatial derivatives and the Runge-Kutta method to do the time integration.
• Compare two direct dense solvers in each environment: an LU based solver and an inverse matrix based solver.
1.5 Outline
The present thesis contains 5 chapters, organized in the following way:
In Chapter 1, the problem that originated this work, as well as its objectives, are presented.
In Chapter 2, the concepts of parallel computing and their implementation on GPUs are discussed. The GPU programming paradigm is also presented.
In Chapter 3, a GPU is tested in depth to show that the new technology environment is understood, as the results are compared with other works. Some less documented aspects of the GPU are also clarified.
In Chapter 4, the one-dimensional advection-diffusion equation is solved using GPU technologies and the results are compared with an equivalent sequential approach.
Chapter 5 closes this work with the main conclusions drawn and some suggestions for future work.
Chapter 2
CUDA Programming Overview
CUDA itself is just a programming model that follows the underlying GPU hardware pattern. It's important to properly understand how GPUs work (even at the lower levels) in order to take advantage of them. The purpose of the current chapter is to present the basic concepts of parallel computing, the CUDA underlying hardware model and how this translates into the CUDA programming paradigm.
2.1 Parallel Computing Overview
To better grasp the concepts related to GPGPU programming, a short review of parallel computing concepts is first presented. Parallel computing can be understood as "a collection of processing elements that communicate and cooperate to solve large problems fast"[3]. This is of course incomplete: parallel strategies can also be pursued just to make the problem's resolution possible at all, e.g. for problems with memory requirements only met by aggregated machines. So, depending on the goals, distinct models of parallel computing can be found. The present work concerns itself with just one field of the parallel computing world: High Performance Computing.
2.1.1 Parallel Systems
The processing element is usually associated with a digital computer. This computer can be generally modelled by the von Neumann architecture, which is composed of a central processing unit (the CPU, which fetches instructions and data and processes them); a memory system (which holds the instructions and data); and an Input/Output system (the I/O system, used to communicate with the outside world). It is a sequential model, and it has one path connecting the memory system to the CPU[24] (see figure 2.1). In this architecture, the run of a program consists of the continuous iteration of the following cycle (the execution cycle): 1) the CPU fetches an instruction from memory; 2) it decodes the instruction; 3) it fetches the data needed to process the instruction; 4) it executes the instruction.
Figure 2.1: The von Neumann model (CPU, memory and I/O system)
A parallel computer can be thought of as a combination of several of these units1 that can be used together to fulfill a computational goal. The method of interconnection, the number of units and other details are matters of the particular system architecture. This poses an inherent extra difficulty: not only do we humans not think in parallel, but the parallel model (unlike the sequential one) is system dependent. However, patterns do exist, and some of them will be explained during the present chapter.
With respect to the configuration of the system's execution cycle, a method of classifying computer systems by their parallelism is the Flynn taxonomy[14]. It states that computer systems can be divided into four categories:
• Single Instruction, Single Data (SISD). This is the common sequential computer. There is only one instruction being executed at a time, operating on a single data stream.
• Single Instruction, Multiple Data (SIMD). In this architecture, there is one single instruction
running at a time, but there is a degree of parallelism of data streams.
• Multiple Instruction, Single Data (MISD). Multiple instructions running at the same time,
operating over the same data stream.
• Multiple Instruction, Multiple Data (MIMD). There are multiple instructions operating on
multiple data streams.
In the real world, parallel machines can in fact be combinations of these four models. For example, a single processor personal computer, which is considered a SISD system, can also be considered SIMD when using special instructions.
1 In fact, parallel computing isn't just based on the von Neumann architecture, as there are other models such as data flow computing, systolic arrays or neural networks[24, sec. 9.5]. But these architectures are out of the scope of this document.
Regarding the parallel computer's memory system, two main patterns exist: shared memory systems (figure 2.2a), where the memory of the system is directly accessible by all processors; and distributed memory systems (figure 2.2b), where each processor has its own private memory which isn't directly accessible by any other processor. This design issue has a major implication for cooperation between the processors: either the memory can be used to communicate (by using predefined shared locations to exchange information), or a message passing method has to be implemented on top of the I/O system to exchange information.
Figure 2.2: Parallel computing memory patterns: (a) shared memory system; (b) distributed memory system
As said, the processors need to communicate and cooperate. Because no communication can
be performed instantaneously, there are always latencies associated with the communication2.
Even in shared-memory systems, where the communication isn’t done through the I/O subsystem
(which usually is slower than the CPU and memory), the difference between a simultaneous and
a non-simultaneous memory access can mean the serialization of accesses (as memory components
aren’t elastic) and thus a loss of performance.
Lastly, if the ultimate goal of parallel computing is to solve large problems fast, the fundamental
metric used in comparisons between sequential and parallel systems is the speedup that’s defined
by equation 2.1, where Ts is the time of the sequential computation and Tp is the time of the
parallel computation.
S = Ts / Tp    (2.1)
2These latencies also exist in sequential systems but, as sequential systems are assumed to have only one
processor, they don’t play the central role that they do on parallel machines, where, depending on the number of
processors and how they communicate, the performance of an algorithm can differ significantly.
2.1.2 Parallel Programming
Parallel programming is a general term for programming on parallel systems. Since parallelism
is system dependent, parallel programming is also system dependent. Common concepts and
methodologies in parallel programming are now presented.
Four steps can be defined[9] in the process of coding a parallel program:
0. writing the sequential program;
1. decomposition of the program into computational tasks;
2. assignment of computational tasks to specific threads (here a thread is the minimal unit of
execution; depending on the context, this concept may be named process or thread — threads
are usually associated with shared memory contexts and processes with distributed memory
contexts. In the present document the word thread is used throughout);
3. orchestration of the threads, i.e., the set of operations needed by the threads to correctly
cooperate;
4. mapping of the threads to specific processors. This step is usually implemented by the
underlying platform (i.e., the programmer does not have to think about it).
Decomposition defines the degree of concurrency (concurrency being associated with paral-
lelism and the access to common resources). The number of tasks should balance maximizing
processor usage against minimizing management overhead, so that the resources spent on the
computation are actually bigger than the ones spent managing the concurrent environment. In
assignment, the most important aspect to consider is load balancing, i.e., the correct distribution
of computation, resources and communications between the threads. Orchestration is the imple-
mentation of the cooperation, i.e., the definition of the communication and synchronization
methods. Several concepts are important to present:
• Race condition. Whenever a resource is shared in a concurrent environment, a race condition
can occur. The problem that arises is one of coherency: if two threads read a shared memory
location and both try to update it at the same time, the result is unpredictable. Figure 2.3a
illustrates the case where two processors access a shared variable with an initial value
(a). At the same time they change that value and at the end an update is done. However,
depending on the order of the updates (which isn’t known) the result will differ, so the final
result is unpredictable.
• Synchronization. Incoherent states may be created during parallel thread execution. Syn-
chronization is the operation of ensuring that the incoherency is eliminated by communicating
with a main resource holder.
• Atomic operation. This is the tool to deal with race conditions. An atomic operation is
an operation that cannot be interrupted; the operated resource is made unavailable until the
operation is finished, so no other thread can access the resource and create an incoherent
state. For example, in figure 2.3b, access to the shared resource is denied to one CPU until
the other completes the sum operation.
• Starvation. The starvation condition happens when a resource is perpetually held unavailable
while a thread needs access to it. The execution of this thread is blocked forever.
These concepts are essential in shared memory systems, since the memory is directly available to
several processors. However, they also hold in distributed memory systems. For example, if a
thread needs information about a resource on another thread, it may starve waiting for that
resource to become available.
[Figure 2.3: (a) Race condition — two CPUs read the shared value a and update it concurrently
to a+b and a+c; the final value is unpredictable. (b) Atomic operation — the second CPU’s
access is blocked until the first update (a+b) completes, yielding a+b+c.]
Laws of parallel speedup
Two theoretical laws describe parallel speedup: Amdahl’s law[4] and Gustafson’s law[18].
Amdahl’s law reflects how much speedup can be achieved for the same problem size as the
number of available parallel processors increases. Algorithms always have a sequential, non-parallel
fraction, where a fixed amount of time ts is spent, and another fraction that can in fact be
parallelized (tp being the time spent in this fraction). Assuming perfect parallelization, a theoretical
value for the speedup can then be found using Amdahl’s law, expressed by equation 2.2. Its main
implication is that, even with an infinite number of processors (n → ∞), the maximum possible
speedup is limited by the fraction of the program that was parallelized (rp).
S = Ts / Tp = Ts / (ts + tp/n) = 1 / (rs + rp/n) = 1 / ((1 − rp) + rp/n)    (2.2)
On the other hand, Gustafson’s law reflects a concern for scalability, i.e., the expected behavior
of an algorithm when applied to an increasing problem size and machine computing power
(more processors available).
Gustafson’s Law states that the speedup achieved using a parallel system (of n processors) to
compute a parallelized algorithm, when compared to the use of a sequential system to compute
for the same algorithm is given by equation 2.3.
S = Ts / Tp = (ts + n · tp) / (ts + tp) = rs + n · rp = (1 − rp) + n · rp    (2.3)
From equation 2.3 it can be seen that the speedup can be proportional to the number of processors
for increasing problem sizes. The evolution of both laws is presented in figures 2.4a and 2.4b.
[Figure 2.4: Parallel speedup evolution as a function of the number of processors (25–200), for
tp = 30%, 60% and 90% — (a) Amdahl’s law; (b) Gustafson’s law.]
Parallel High Performance programming: state of the art
After several years, two main approaches have emerged as standards for parallel computing
in High Performance Computing: OpenMP, for shared memory machines, and MPI (Message
Passing Interface), targeted mainly at clusters (PVM, one of the first efforts to achieve a standard
model, has almost vanished from the High Performance Computing world). A steady rise in hybrid
OpenMP-MPI codes has also been happening because of the increasing number of single-node
multicore machines being incorporated into clusters.
OpenMP is a standard API for programming shared memory systems (for a deep read on the
technology, [6] is recommended), based on pragma directives — special preprocessor directives
used to inform the compiler about a particular portion of code. There are bindings for Fortran, C
and C++, and it is supported on multiple hardware platforms and operating systems. The
technology acts at two levels: the compiler level and the library level. The programmer uses special
compiler directives to inform the compiler of the parallel areas of the code; decomposition is
governed by environment variables or code directives, leaving the burden of thread management
to the compiler. Using the library-level routines (as well as other compiler directives), the
programmer does the assignment and orchestration of the program. Finally, the operating system
does the mapping of the processes to the hardware processors.
MPI is a standard specification for communication between computers. It is commonly used in
clusters (distributed memory systems) and operates over various networking protocols (the most
common being TCP over Ethernet). It requires the user to explicitly program the data transfers
and the synchronization between processes.
2.2 GPU Hardware Model
GPUs are expansion cards for use in a computer, i.e., they aren’t autonomous computer
systems (the general configuration of a computer with a GPU is shown in figure 2.5). GPUs
are usually the computing core of a graphics card, but there are GPUs without the graphics output
module. In the present document (and, generally, in the GPU lexicon) the computer hosting a GPU
is called the host and the GPU is called the device. The model described in this document is based
on NVIDIA GPUs, but most aspects are similar across vendors.
[Figure 2.5: Host computer — the CPU (host) connects through a bridge to the GPU (device)
and its memory.]
Each GPU is an aggregate of multi-core processors (multi-processors) sharing a global memory.
Multi-processors don’t have any I/O system to communicate between them and, as a consequence,
cannot cooperate through any message passing system. So the GPU, as a parallel system, is
essentially a shared memory system. Apart from the shared memory, the only available communi-
cation path is between the host and the device, and it is limited, since the host is the one controlling
it (if available, the GPU can also output to the display sub-system, and thus to a monitor).
Figure 2.6: A GPU
Each multi-processor is composed of: a number of scalar cores, which perform the computations
(these scalar cores are specialized in arithmetic instructions); an instruction unit, responsible for
delivering instructions to the scalar cores; and an on-chip shared memory that can be used for scalar
core communication (this memory isn’t accessible by the other multi-processors in the GPU). Each
multi-processor unit is thus itself a shared-memory system. See appendix A.1 for the property
values of particular devices.
The memory system of current NVIDIA GPUs is complex. There are two main groups
of memory: on-chip (memory located inside each multi-processor) and off-chip or global memory
(memory located on the GPU board and accessible by all multi-processors). Global memory
is organized into four types: linear memory, texture memory, constant memory and local memory.
The main implication of using each type is how multi-processors access the memory: any access
to linear memory means using the shared bus; texture and constant memories are cached, so
the shared bus isn’t used on every single memory access. These caches are read-only, so multi-
processors cannot write to them. Because the bus to global memory is shared and serialization of
accesses occurs, the GPU has the ability to coalesce certain access patterns (an access is said to be
coalesced when several requests are fulfilled with only one transaction). On-chip memory has
two additional types: the shared memory, which is directly accessible by any scalar core inside
each multi-processor; and the local registers, which are private to each scalar core. If a scalar core
needs more memory than is available in registers, it can also use the global memory while
maintaining the local scope (this is the local memory). In order to reduce serialization within the
multi-processor, the shared memory is divided into banks that can be accessed simultaneously
without loss of performance (it’s up to the programmer to ensure correct use of this possibility).
In terms of execution in a GPU environment, the minimal computing task is the thread.
These threads are created, managed (scheduled) and destroyed by the GPU, i.e., the threads
live in hardware space. This is one of the major differences from other common parallel
environments: for example, in a multi-tasking operating system (an operating system that can run
more than one process or thread simultaneously, such as the Unix family — Linux, FreeBSD,
MacOS — or Windows; for more information read [31]), all the processing units are scheduled in
software. It’s up to the operating system (not the hardware) to decide which process (or thread)
runs at what time and on which particular processor, and this is costly in terms of memory and
processing cycles. The hardware scheduling is responsible for the virtually null cost of creating and
scheduling threads, and for raising the practical number of threads into the thousands. However,
GPUs aren’t oriented towards general computing, only towards data processing. In GPUs the
threads are grouped into sets of up to 32 threads called warps. The warp is the scheduling unit.
In the Flynn taxonomy, GPUs best fit the SIMD category, since the instructions fed to
the scalar cores are the same while each thread can access different data. However, mainly because
the code running in each thread may automatically diverge (i.e., the programmer doesn’t have to
manually take care of “if” clauses, since branching is supported by the hardware), NVIDIA defined
a new category (Single Instruction, Multiple Thread — SIMT)[25, sec 4.1].
2.3 CUDA Programming Model
“CUDA extends C by allowing the programmer to define C functions, called kernels, that when
called are executed N times in parallel by N different CUDA threads, as opposed to only once like
regular C functions”[25].
As said, the CUDA software model is an extension of the C/C++ programming languages that
reflects the underlying GPU hardware. (Even though this model is a superset of the C/C++
languages, some C/C++ features, such as function recursion, cannot be used in device code.) The
main extensions are[25, sec 4.1]:
• function and variable qualifiers to specify whether the function or variable refers to the
host or to the device;
• a directive to configure and start a kernel.
A CUDA program is no more than a usual C/C++ program that makes calls to CUDA kernels
(figure 2.8). A function to be run on the device must have the device or the global qualifier
(line 2 in figure 2.8). The former defines functions to be called by code running on the
device; the latter defines kernels (additionally, kernel functions must be void-typed, i.e., they
cannot return any value). By default, functions with no qualifier are considered host functions.
In terms of variables, the environment defines their scope, i.e., in device functions the variables
belong to the device memory space and in host functions to the host memory space. In other
cases, qualifiers (similar to the function ones) are used. The device cannot directly access host
variables, nor can the host directly access the device’s. The only direct interface is the kernel call,
where the kernel parameters are automatically copied to the device’s memory. Memory management
(allocation, freeing and copies) is done by the host using dedicated functions (in figure 2.8: lines 20
and 21 for allocation; 29 and 41 for transfers; and 44 and 45 for freeing). The host holds the
locations of the device’s data in its own memory using traditional pointers (if the programmer
tries to dereference these locations on the CPU, the result is undefined and will likely produce a
segmentation fault).
To launch a kernel, the CUDA API defines a new directive. This directive contains the execution
configuration (number and arrangement of threads). Regarding the execution configuration, the
threads are organized in a matrix-like form called a block; each block is assigned to a multi-processor.
The blocks are in turn organized in a matrix-like form called a grid (lines 32 to 36 in the example
code). Within a multi-processor, each thread has built-in variables that can be used to assign
tasks to particular threads; three examples are threadIdx, blockIdx and blockDim, shown in line 3
of the example. After the kernel launch, the mapping of each thread to the multi-processors and
scalar cores is automatically done by the hardware.
The execution of the device’s threads is asynchronous with respect to the host, i.e., the host
can execute other, unrelated code while the device is processing the data. Figure 2.7 shows the
thread organization as well as the asynchronous execution. Synchronization is done using a
function (see line 38) on which the host program waits until all threads on the device have finished
their work.
Figure 2.7: Asynchronous execution of the device
1  // kernel implementation
2  __global__ void vector_scale( float *a, float *b, float k ) {
3      int n = threadIdx.x + blockDim.x * blockIdx.x;
4
5      a[n] = k * b[n];
6      return;
7  }
8
9
10 // main program
11 int main() {
12
13     float *d_a, *d_b;          // pointers to device's memory space
14     float a[64*64], b[64*64];  // host memory
15     float k = 2.0f;            // scale factor
16     int i;
17     dim3 grid, block;
18
19     // device memory allocation
20     if ( cudaMalloc( (void**)&d_a, 64*64*sizeof(float) ) != 0 ) return 1;
21     if ( cudaMalloc( (void**)&d_b, 64*64*sizeof(float) ) != 0 ) return 1;
22
23     // host data initialization
24     for ( i = 0; i < 64*64; i++ ) {
25         b[i] = 1.0f;
26     }
27
28     // data transfer: device <- host
29     cudaMemcpy( d_b, b, 64*64*sizeof(float), cudaMemcpyHostToDevice );
30
31     // execution environment configuration
32     grid.x  = 64;
33     block.x = 64;
34
35     // kernel call
36     vector_scale<<<grid, block>>>( d_a, d_b, k );
37     // kernel synchronization
38     cudaThreadSynchronize();
39
40     // data transfer: host <- device
41     cudaMemcpy( a, d_a, 64*64*sizeof(float), cudaMemcpyDeviceToHost );
42
43     // device memory free
44     cudaFree( d_a );
45     cudaFree( d_b );
46     return 0;
47 }
Figure 2.8: CUDA program example
To use all the memory access methods existing in hardware, the following methods exist in the
CUDA API:
• Linear memory. Access is completely transparent, as shown by the example code, where
a, b and k reside in linear global memory;
• Texture memory. Linear memory has to be bound to a texture (a special data layout, called
CUDA arrays, can lead to better performance than linear memory bound to textures). Special
functions are used within kernels to access this memory;
• Constant memory. It’s statically defined in the code with the constant qualifier. Specific
functions are used to copy data from the host to the device’s constant memory, but access
within the device is transparent;
• Local memory. It’s automatically managed by the device;
• Shared memory. It’s statically defined in the kernel code using the shared qualifier. Its
access is transparent within a kernel;
• Local registers. They are statically defined inside the kernel code. Their access is transparent,
as shown in the example (variable n inside the kernel code).
Lastly, a limited framework exists for orchestration. Threads within a block may use
the syncthreads function to ensure that every thread has reached a defined point. There are also
atomic functions that may be used on the device, but they carry a performance penalty. All the
memory in the GPU may constantly be in race condition, so it’s left to the programmer to ensure
that the code is correctly implemented and that the outcome will be correct regardless of
the thread execution order. As said before, there is also a function that synchronizes the host
execution with the device.
Chapter 3
CUDA Environment Tests
Some tests were performed to understand the capabilities of the available system. Under-
standing the real bandwidths of the device, as well as the configurations that lead to the best
performance, is a crucial task in High Performance Computing. The tests are based on two NVIDIA
benchmarks and, following [19], on a port of the Stream benchmark
(http://www.cs.virginia.edu/stream/). Some important points must be made clear:
• Floating point operations and memory transactions are accounted following the Stream[2]
project (table 3.3).
• As in Stream, the first (slower) iteration is ignored.
• Unlike in Stream, the average time (instead of the minimum time) is used, since the sustained
point of view (as opposed to a peak performance point of view) is more relevant to the present
document.
3.1 Test System
A Debian GNU/Linux 5.0.1 system (http://www.debian.org) was used, with a 2.6.26 Linux
kernel (the distribution’s stock, non-optimized kernel). Version 2.2 of the CUDA libraries is used.
All the codes were compiled using the NVCC compiler (version 2.2) or the GCC compiler (version
4.3.2). The C library used is version 2.7 of the GNU C Library, compiled by the Debian Project.
The hardware details are listed in table 3.1.
CPU Intel(R) Xeon(R) CPU E5420 @ 2.50GHz
Motherboard Intel R© Desktop Board D5400XS
chipset Intel R© 5400 Chipset
PCI express 1.1 4GB/s
RAM 4x4096MB 667MHz DDR2 5.3GB/s
Table 3.1: Host system hardware details
The GPU used is an NVIDIA Tesla C1060; its characteristics are listed in table 3.2.
Multiprocessors 30
Clock Rate 1.3GHz
Memory 4 GB
Memory Clock Rate 800MHz
Memory bus width 512 bit
Memory bandwidth 102 GB/s
Peak Performance 933 GFLOPS
Device Capability 1.3
Table 3.2: Device properties
3.2 Metrics
The metrics used in the benchmarks are the number of operations per second, the bandwidth,
and the time per byte (inverse of bandwidth). As in Stream, measurements are made using the
GNU libc version of the POSIX standard function gettimeofday. This function returns a structure
with two 64-bit integer fields: the number of seconds and of microseconds since the Unix Epoch.
The value is then converted to a double precision floating point number representing seconds. All
measurements are computed as the difference between the instant before the kernel is launched
and the instant after the return of a cudaThreadSynchronize() function call. The byte and
operation normalization is done using the values in table 3.3.
name kernel bytes/iter FLOPS/iter
COPY a(i) = b(i) 2*sizeof(word) 0
SCALE a(i) = q ∗ b(i) 2*sizeof(word) 1
SUM a(i) = b(i) + c(i) 3*sizeof(word) 1
TRIAD a(i) = b(i) + q ∗ c(i) 3*sizeof(word) 2
Table 3.3: Memory and operation accounting
One other concept important to this evaluation is the balance (equation 3.1). This ratio
indicates whether an algorithm is processor bounded or memory bounded.

Ba = memory transactions / operations    (3.1)
This concept can be applied both to the algorithm itself and to the hardware. The relation
between the two ratios (i.e., how well the algorithm’s balance fits the hardware’s balance) is
extremely difficult to obtain (if possible at all), since the exact time that a computation takes can
only be calculated in particularly simple situations. In the present document a simple, hardware-
oriented model is used: theoretically, the hardware can deliver 933 GFLOPS and has 102 GB/s of
memory bandwidth (25 Gwords/s in single precision), so the hardware balance is approximately
0.03. This point is considered neutral, i.e., the point where the time spent on memory transactions
equals the time spent processing data. Values higher than 0.03 are considered memory bounded,
and lower values processor bounded. These regimes are summarized in table 3.4.
           processor bounded   neutral   memory bounded
algorithm         0               1            ∞
GPU               0             0.03           ∞
Table 3.4: Summary
3.3 Peak Throughput
In order to know the peak throughput achievable with the GPU, an NVIDIA test is
used. It consists of an unrolled loop with a series of FMAD instructions (B.1) — a FMAD is a
hardware instruction that computes a multiplication and a sum, e.g., a · b + c. The nature of the
test gives a real measure of the raw processing potential of the GPU, as well as of the configurations
that lead to the best performance in processor bounded kernels (Ba = 0). The performance is
evaluated as a function of the number of blocks and the block configuration.
As in the NVIDIA original benchmark, a 10-iteration loop with a total of 2048 FMAD in-
structions is used. The configuration is done using the relation expressed in equation 3.2.

Nb = (Tmp / Tb) · Nmp    (3.2)
where:
• Nb is the number of blocks;
• Tmp is the number of threads per multiprocessor;
• Tb is the number of threads per block;
• Nmp is the number of multiprocessors.
Figure 3.1 shows the evolution of the execution time of the kernel. There is a staircase-like
evolution for 32 threads per block. The step width is 240 blocks, i.e., the number of scalar cores.
This clearly shows that while there are idle (thus available) processors the kernel time is of order
O(1), but when no more processors are available the processing is serialized and the behavior
becomes O(n), which leads to the global linear trend of the staircase. With the other configurations,
the step width is shorter because each new block adds 64, 128 or 512 new threads, so fewer blocks
are needed to reach the 240-core hardware limit. In the last case, 512 threads per block, each
new block adds more than 240 new threads, which implies that every new block is serialized, so
the evolution is linear. This serialization also explains why the slope is significantly higher for
blocks that contain more threads.
It’s not clear why the evolution isn’t a perfect staircase, but it may be related to the warp
(the scheduling unit) not being a multiple of the number of processors, and to other scheduling
related issues. No documentation of the thread scheduling was found, so it’s hard to understand
the real origins of this behaviour.
[Figure 3.1: FLOP test — total time of kernel execution (ms) vs number of blocks (0–2000), for
32, 64, 128 and 512 threads/block.]
In terms of FLOP performance, i.e., the number of floating point operations per second (figure
3.2), the maximum value achieved is 617 GFLOPS. This value differs from the value stated
by the hardware maker, 933 GFLOPS, because the highest value is only achievable under certain
special usage conditions (dual issue)[21]. This maximum value is not steadily attained from the
beginning for block sizes of 32, 64 and 128. Looking at the data, it is found that a total of
3840 threads leads to approximately 378 GFLOPS (60% of the peak performance). In terms of
threads per scalar core this is equivalent to 16, which is half a warp (or, from another perspective, 4
full warps per multi-processor). So it seems that, even when all the cores have work to do (i.e., more
than 240 threads), full performance is only achieved if the scheduler has 32 or more threads to take
care of. A simple model that describes the evolution seen in figure 3.2 could be given by the
following equation:

FLOP = k1 · T / (sT + k2 · tk)  →  (k1/k2) · (T/tk)   when sT ≪ k2 · tk

where sT is a constant scheduler time penalty, T is the number of threads, tk is the time the kernel
takes, and k1 and k2 are constants. With large configurations (which take longer), the scheduler
penalty becomes negligible; the total time tends to be approximately proportional to the total
number of threads because of the serialization (which means order O(n), thus proportionality).
The observable oscillations are a direct consequence of the staircase-like form previously described.
Even at steady state, some block configurations perform better than others: for the
present kernel, the configurations with more threads per block perform better.
[Figure 3.2: FLOP performance (GFLOPS) vs number of blocks (0–2000), for 32, 64, 128 and
512 threads/block.]
3.4 Bandwidth
3.4.1 Host - device transfers
A benchmark was implemented that copies blocks of data from the host to the device’s linear
memory and vice-versa using all the methods the API provides (i.e., standard malloc’ed memory,
page-locked and write-combined allocated memory, and mapped memory). In order to compare
the performance of mapped memory, an initialization loop (writing to main memory) and a final
read loop are included. The total time, normalized by the total transfer size (eq. 3.3), is evaluated
as a function of the number of elements transferred, as well as each of its parcels.

T = (twrite + thost→gpu + thost←gpu + tread) / bytes transferred    (3.3)
In the present test, host memory is allocated using the following functions: the system malloc
function, and the CUDA cudaHostAlloc function with its flags parameter equal to 0 (page-locked
memory), to cudaHostAllocWriteCombined, and to cudaHostAllocMapped. The transfers are done
using cudaMemcpy (except for mapped memory). The terms thost→gpu and thost←gpu are missing
in the mapped memory case because the data transfer is implicit, i.e., there is no explicit memcpy
command; still, a thread synchronization call is made after both the write and the read. The
initialization write and final read loops are traditional for loops, with no optimization.
Figure 3.3 shows the total time per byte (T in equation 3.3) as a function of the number of
elements transferred. For memory reserved with the usual malloc and for page-locked memory
there is a high performance penalty when transferring small quantities of data. For large transfers
(more than 10^6 elements) a stable value is reached. The write-combined mode is missing from
the figure because the comparison includes the read operation from the CPU, which is highly
expensive in this mode. Mapped memory had the best performance.
[Figure 3.3: Time of the total transfer cycle (ns/byte) vs number of floating point elements
(10^2–10^8), for malloc, page-locked and mapped memory.]
Analyzing the partial times in figure 3.4, it is understood that the PCI-express transfer is re-
sponsible for the big performance loss on small transfers. The time spent uploading data to
the device differs significantly from the downloading time for malloc’ed memory (downloading
from the device is slower). The values for the read and write operations in figure 3.4 do not
represent the system RAM bandwidth (the unit in the figure is time, but the bandwidth can be
obtained as B = 1/T; in this case B ≈ 1111 MB/s). Instead they represent just half of it, because
each loop iteration performs both a read and a write operation. Also, since a naive approach is used
in the loop (i.e., there are no explicit data transfer optimizations), just half of the bus width is
being used, due to the use of single precision. This explains the difference from the system RAM’s
theoretical bandwidth (table 3.1).
[Figure 3.4: Details of the data transfers (ns/byte) for two sizes — (a) 10^3 elements and
(b) 5·10^3 elements — split into write, host→device, device→host and read phases, for malloc,
page-lock, write-combine and mapped memory.]
3.4.2 Device - device transfers
The performance of the main intra-device transfers is evaluated, namely accesses from global
memory, texture memory and constant memory. In the first case, the benchmark is similar to
the host-device one (but only tdevice→device is considered). For the other cases, different versions
of a vector copy operation are implemented, using read operations from each one of the available
memory spaces and a write to the global memory, i.e.:
1. read from global memory and write to global memory;
2. read from texture memory and write to global memory;
3. read from constant memory and write to global memory.
Because the texture and constant memories are cached, a vector form of the sum reduction oper-
ation (eq. 3.4) is also implemented to take advantage of them; and because global memory is
not cached, a software-based cache is implemented with shared memory (no true cache
mechanism[24] is implemented; the code simply takes advantage of knowing beforehand which
memory locations will be used with higher frequency).
a(i) = Σ_{j=1}^{M} b(j),   i = 1 · · · N   (3.4)
The purpose is to cache the b vector: M consecutive reads are issued from the cache, making just
one write operation to the global memory.
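The caching idea can be sketched in serial C as follows (a minimal sketch, not the thesis code: `cached_reduction` and `TILE` are illustrative names, the local buffer plays the role of shared memory, and on the GPU the outer loop would be distributed over the thread grid):

```c
#include <stddef.h>

#define TILE 256  /* illustrative tile size; stands in for shared memory */

/* Software-cached sum reduction: a(i) = sum_{j=1..M} b(j) for i = 1..N.
 * b is staged tile-by-tile into a local buffer (the "cache"); each output
 * element is accumulated from cached reads and written to memory once. */
void cached_reduction(float *a, const float *b, size_t n, size_t m)
{
    float cache[TILE];
    for (size_t i = 0; i < n; i++) {
        float acc = 0.0f;
        for (size_t base = 0; base < m; base += TILE) {
            size_t len = (m - base < TILE) ? (m - base) : TILE;
            for (size_t k = 0; k < len; k++)   /* stage one tile into the cache */
                cache[k] = b[base + k];
            for (size_t k = 0; k < len; k++)   /* M consecutive cached reads */
                acc += cache[k];
        }
        a[i] = acc;  /* single write to global memory */
    }
}
```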
In this implementation the data is distributed in the following way: for a vector of size Nv
and Nt threads, each thread does Nv/Nt operations - or Nv/Nt + 1 if Nv is not a multiple of
Nt. Figure 3.5 is an illustrative example showing the workload distribution for Nv = 8 and
Nt = 3: threads 0 and 1 process 3 data elements each (8/3 + 1) and thread 2 only 2 data elements.
The numbers in the corners represent the loop sequence in each thread. This approach was chosen
because it respects the coalescing considerations made in the CUDA manual [25, sec. 5.1.2.1],
which reduce memory transactions7. The grid configuration is once again calculated using
equation 3.2 and the number of blocks is limited to 4096.
6 In fact, no true cache mechanism [24] was implemented. The code takes advantage of knowing beforehand which
memories will be used with higher frequency.
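The distribution above can be written compactly in C (a sketch; `count_for_thread` and `element_index` are hypothetical helper names, and the strided index is what makes consecutive threads touch consecutive addresses, the pattern the coalescing rules favour):

```c
#include <stddef.h>

/* Number of elements assigned to one thread when nv elements are split
 * among nt threads: the first (nv % nt) threads get one extra element. */
size_t count_for_thread(size_t tid, size_t nv, size_t nt)
{
    size_t base = nv / nt;
    return tid < nv % nt ? base + 1 : base;
}

/* Strided assignment: in loop iteration `loop`, thread `tid` handles
 * element loop * nt + tid, so threads access consecutive addresses. */
size_t element_index(size_t tid, size_t loop, size_t nt)
{
    return loop * nt + tid;
}
```

For the example in figure 3.5 (Nv = 8, Nt = 3), this gives 3, 3 and 2 elements for threads 0, 1 and 2.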
Figure 3.5: Workload distribution for a vector of size 8 and 3 threads
Figure 3.6a presents the evolution of the global memory performance for the float and float3
data types (normalized by the theoretical maximum value) relative to the simple vector copy
operation. Because of the limitations in the constant memory sizes and in texture addressing with
CUDA arrays (see section 2.2), a test with small vector sizes is implemented; the results are shown
in figure 3.6b, detailing all memory access methods.
[Figure: bandwidth as a percentage of the bus maximum versus number of vector elements. (a) General perspective: 0-80%, 10^3 to 10^8 elements, for float3 and float. (b) Small sizes: 0-8%, 0 to 20k elements, for global, texture and constant memory.]
Figure 3.6: Intra device memory transfers
The most important observation in the vector copy test is that the device memory performance
degrades severely for small vector dimensions (figures 3.6a and 3.6b). The theoretical
bandwidth is 102 GB/s but, for small sizes (N < 16k), only 7 GB/s or less are achieved. No
mention of this issue was found in the researched literature. This performance disruption seems
to be the reason why both curves in figure 3.6a intersect: the float3 performance was expected
to be always worse than the float case because of non-coalesced memory transactions. In fact,
the time taken by the float3 test is always greater than the float test - it is only faster because
there is more information (3 times more) per transaction. This behaviour seems analogous to the
FLOP test, where there was a minimum number of threads to achieve full performance. In this
case there seems to be some kind of barrier in the number of transactions per thread, but the
results do not show a clear number. Also, coalescing shouldn't be affected by the number of
transactions but only by the access pattern. Nevertheless, the test scales in performance and,
for sufficiently large vector sizes (N > 2^20), significant performance is achieved (B > 60 GB/s).
A maximum of 84 GB/s (83%) was reached with 64 bit data types (double precision floating point,
dual single precision float2, or long integer) and a vector size of 2^27 elements. The effect of
the non-coalesced memory transactions is the loss of performance, clearly shown in figure 3.6a
for large vectors as the gap between the two lines. In [35] higher performance (89%) is reported,
but on different devices.
7 Coalescing memory transactions depends not only on the access pattern but also on the device capability.
In terms of relative performance between each access type, the direct access to the global
memory and the access through texture cache perform identically, but access to the constant
memory is slower.
In the cached access test the same vector dimensions were used. The algorithm was implemented
with M = N in equation 3.4. The bandwidth shown is calculated as B = 4N(N+1)/∆T. The results
are significantly different from the previous test: even with small sets, and for all access
types, the performance exceeded the memory's bandwidth. Comparing the non-cached result (black
continuous line) with the previous test, only coalescing of the accesses to each b(j) (a
coordinated broadcast pattern) may explain the boost in performance (up to 126%). For the
cached accesses, the implemented software cache in shared memory presents by far the best
results for large sizes, attaining a peak of 287 GB/s (280%). Access through texture memory also
outperformed the global memory limit, by a factor of 178%.
With small sizes, the constant memory accesses can also be compared and, for this access
pattern, they revealed to be as fast as the shared memory, in opposition to the previous test,
which showed worse performance for constant memory.
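The effective-bandwidth formula follows directly from the access count: with M = N the reduction performs N² reads and N writes of 4-byte single precision elements. A minimal sketch (`effective_bandwidth` is an illustrative name):

```c
/* Effective bandwidth of the cached reduction with M = N:
 * 4 bytes per element, N*N reads plus N writes, over dt seconds. */
double effective_bandwidth(double n, double dt_seconds)
{
    return 4.0 * n * (n + 1.0) / dt_seconds;  /* bytes per second */
}
```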
[Figure: bandwidth as a percentage of the bus maximum (0-300%) versus number of vector elements. (a) General perspective: 10^3 to 10^7 elements, for global, shared and texture memory. (b) Small sizes: 0 to 20k elements, for global, shared, texture and constant memory.]
Figure 3.7: Intra device memory transfers w/ cache
3.5 Stream benchmark
The Stream benchmark consists of 4 vector operations: vector copy, product by scalar, vector
sum and vector sum plus scalar product (operations given in table 3.3). This set of simple opera-
tions allows us to evaluate the performance of bandwidth-bound algorithms in the GPU context
(balance > 0.03). The performance is evaluated as a function of the number of blocks and the
block configuration. The previous results showed completely different performances for small and
large problems, so two different vector sizes are evaluated.
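In serial C, the four Stream operations reduce to the following loops (a sketch of the standard kernel definitions; the CUDA version distributes the iterations among threads as described in section 3.4.2, and q denotes the scalar):

```c
#include <stddef.h>

/* The four Stream kernels on vectors of length n. */
void stream_copy(float *c, const float *a, size_t n)
{ for (size_t i = 0; i < n; i++) c[i] = a[i]; }

void stream_scale(float *b, const float *c, float q, size_t n)
{ for (size_t i = 0; i < n; i++) b[i] = q * c[i]; }

void stream_add(float *c, const float *a, const float *b, size_t n)
{ for (size_t i = 0; i < n; i++) c[i] = a[i] + b[i]; }

void stream_triad(float *a, const float *b, const float *c, float q, size_t n)
{ for (size_t i = 0; i < n; i++) a[i] = b[i] + q * c[i]; }
```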
Since the main results of the Stream copy were already presented in section 3.4.2, and since the
general results are similar, only the scale and triad tests are now presented. Figures 3.8a and
3.8b represent the evolution of the bandwidth for vector sizes of 2^15 and 2^27 elements. The
phenomenon of bad performance for small sizes persists: the peak performance depends on the
kernel configuration and varies between 13.7 GB/s (for block sizes 32, 64 and 512) and 14.6 GB/s
(for 128 and 256). One thing that representing the data as a function of grid size doesn't show
is that the peak has one parameter in common for all configurations: 256 threads per
multi-processor. With 256 threads per multi-processor there are exactly 32 threads per scalar
core, which is the size of the warp; so it seems a match of the hardware parameters with the
problem. This result is identical to the FLOP test, but now with memory transactions in play.
The drop in performance after the peak is also explained by the excessive number of threads:
for the 256 and 512 block configurations (even with 256 threads per multi-processor) there are
more threads than elements to compute, so it is guaranteed that some threads do nothing.
The logic of having 1 thread per element doesn't hold: for example, for 32 threads per block the
best performance is achieved with 4 elements per thread (this is expected because of the device
balance: theoretically, 33 floating point operations per memory transaction would be necessary
to achieve neutrality).
For large vector dimensions (in the scale test, figure 3.8b) a peak of 81 GB/s was achieved with
64 threads per block. The curves are essentially flat. At the beginning of the curves there are
too few threads: performances above 70 GB/s always have 7680 (i.e., 256 threads per
multi-processor) or more threads, which means 17500 or fewer elements per thread; at the end, the
excessive-thread condition is again responsible for the performance disruption. The flat shape
has an important implication for the assignment of tasks to threads: above a certain number of
threads there is no performance gain in launching more threads.
In the triad test for small sizes, better peak performances are achieved (19.6 - 20.8 GB/s), but
256 threads per multi-processor is again the configuration that performs best.
In the last test (triad with big vector sizes) there is an odd result with no explanation: the
worst configuration for the scale test (block size of 32 threads) is now the best.
[Figure: bandwidth (MB/s) versus number of blocks (10^1 to 10^5) for block sizes 32, 64, 128, 256 and 512. (a) Scale test, N = 2^15 (0-15k MB/s); (b) Scale test, N = 2^27 (0-100k MB/s); (c) Triad test, N = 2^15 (0-23k MB/s); (d) Triad test, N = 2^27 (0-100k MB/s).]
Figure 3.8: Stream benchmark
Finally, the host CPU performance and the device performance are compared (host-device transfers
are not considered). The results are presented in table 3.5.
                      N = 2 · 10^3                     N = 2 · 10^6
operation   cpu (MB/s)  gpu (MB/s)  speedup   cpu (MB/s)  gpu (MB/s)  speedup
copy           2314        2003       0.87       3245       74482       22.9
scale          2274        1973       0.87       3181       74814       23.5
add            3247        3004       0.93       3188       77136       24.2
triad          3247        3004       0.93       3310       76666       23.1
Table 3.5: Stream benchmark results
3.6 Summary
The two main objectives of the current chapter are to show if the new concepts and technologies
were successfully acquired and to clarify some aspects less documented. In the first section a raw
performance test was passed and the most important result is that, independently of the block
configuration, full performance isn’t achieved from using 240 threads (1 thread per scalar core)
but only from 8 full warps (16 times more threads than physical scalar cores) or 4800 threads.
Regarding the device ↔ host communication, it was concluded that one of two approaches are
recommend:ed either using mapped memory or making big transfers. By using mapped memory
it’s possible to completely hide the latency of the transfer but it should be carefully used to
avoid race conditions. By making big transfers, all the initial costs are diluted and maximum
performance is achieved. Within device transfers the most important result is that full performance
is only achieved by using a massive number of transfers. When memory transfers can benefit from
a cache, the choice between the available methods has to be oriented towards the actual problem.
Lastly the Stream benchmark results were presented and two remarks have to be made: first, the
small problems (in a GPU scale) are very sensitive to a correct number of threads because their
size is similar (and not massively bigger) than the GPU hardware parameters. Second, for large
problem sizes it’s irrelevant to launch more threads, because the process is fully serialized and full
performance is achieved. When the results are compared with the CPU, all can be summed in
one conclusion: if the problem is small it should be computed by the CPU, otherwise significant
speedups can be achieved with GPUs.
Chapter 4
Burgers equation solver
4.1 Mathematical Model
The model used to test the computational performance speedup of GPGPU programming is a
linearized version of the Burgers equation, or uni-directional transport equation:

∂u/∂t + U0 ∂u/∂x = ν ∂²u/∂x²   (4.1)

where U0 and ν are real constants and u is a continuous field.
4.2 Computational Model
To solve this equation for u, equation 4.1 is first rewritten in explicit form (equation 4.2):

F(t, x) = ∂u/∂t = ν ∂²u/∂x² − U0 ∂u/∂x   (4.2)
4.2.1 Computational Methods
Time Integration
As shown by equation 4.2, this is an initial value problem. To solve it, the classic 4th order
Runge-Kutta method[13] is used.
u^{n+1} = u^n + (∆t/6)(u′1 + 2u′2 + 2u′3 + u′4)   (4.3)
u′1 = f(t_n, u^n)   (4.4)
u′2 = f(t_n + ∆t/2, u^n + (∆t/2) u′1)   (4.5)
u′3 = f(t_n + ∆t/2, u^n + (∆t/2) u′2)   (4.6)
u′4 = f(t_n + ∆t, u^n + ∆t u′3)   (4.7)
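A minimal serial sketch of one scalar RK4 step (`rk4_step` is an illustrative name; the thesis applies the same scheme component-wise to the whole field u):

```c
/* One classic 4th order Runge-Kutta step for du/dt = f(t, u). */
double rk4_step(double (*f)(double, double), double t, double u, double dt)
{
    double k1 = f(t, u);
    double k2 = f(t + dt / 2.0, u + dt / 2.0 * k1);
    double k3 = f(t + dt / 2.0, u + dt / 2.0 * k2);
    double k4 = f(t + dt, u + dt * k3);
    return u + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4);
}
```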
Spatial Differentiation
The spatial derivatives (first and second) are calculated using 4th order finite difference compact
schemes [29]. These methods are a particular case of central difference schemes, and the derivatives
are calculated by solving a linear system Ax = b, where A is an N × N matrix and x and b are
N-element vectors. For the particular case, the compact scheme methods are best represented by
equations 4.8 and 4.9:

A1 u_x = B1 u   (4.8)
A2 u_xx = B2 u   (4.9)

The A and B matrices of the compact scheme methods are band matrices; in particular, the A
matrices of the 4th order compact schemes are pentadiagonal. However, the approach taken in the
present research is a dense algebra one.
In matrix form, the problem is now formulated as:

u′ ≈ ν A2^{-1} B2 u − U0 A1^{-1} B1 u   (4.10)
Computational Domain
The computational domain of the problem is defined by the following constraints:
• the domain of the spatial coordinate is normalized: x ∈ [0, 1];
• x is a uniform mesh of N points (hence ∆x = 1/(N − 1));
• the time step, ∆t, is constant and given by the Courant number (eq. 4.11);
• the simulation has NT time steps.
C = U0 ∆t / ∆x   (4.11)

The problem's constants:
• U0 is normalized (U0 = 1);
• the viscosity coefficient ν is calculated as a function of the grid Fourier number1 (eq. 4.12):

F = ν ∆t / ∆x²   (4.12)

Equations 4.11 and 4.12 define adimensional parameters derived from the finite difference
discretization applied to the transport equation. The imposed conditions are within the
limits that ensure numerical stability.
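Given the normalized domain and U0 = 1, the time step and viscosity follow directly from the Courant and Fourier numbers (a sketch; `derive_parameters` is a hypothetical helper, shown with the thesis values C = 0.3, F = 0.1 as an example):

```c
/* Derive dx, dt and nu from the Courant and Fourier numbers on a
 * uniform, normalized mesh of n points with U0 = 1. */
void derive_parameters(int n, double courant, double fourier,
                       double *dx, double *dt, double *nu)
{
    *dx = 1.0 / (n - 1);                     /* uniform mesh on [0, 1] */
    *dt = courant * (*dx);                   /* C = U0 * dt / dx, U0 = 1 */
    *nu = fourier * (*dx) * (*dx) / (*dt);   /* F = nu * dt / dx^2 */
}
```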
Boundary Conditions
As stated, the compact scheme methods are a particular case of central differences, so special
considerations have to be taken at both boundaries of the spatial domain (u(t, x = 0) and
u(t, x = 1)). On the right side, a null Dirichlet condition is imposed, i.e., u(t, x = 1) = 0. On
the opposite side, the order reduction presented in [29] was implemented, and at this boundary
the problem is represented by a forward 3rd order scheme.
4.3 Implementation
The implementation consists of two versions of the code: a C based serial (single processor)
version and a CUDA based version. The code of the two versions is kept as similar as possible.
The code is structured in the following layers (bottom first):
1. algebra operations;
2. numerical integration and differentiation;
3. simulation.
1 Even though calculating the physical constant isn't a natural practice in fluid mechanics problems - where the
objective is to calculate the flow for a given fluid - the focus of the current work is computational; so the
problem's parameters are computed as a function of computationally significant quantities (such as the number of
points and iterations) and of numerical stability.
For the algebra operations layer, the ATLAS2 implementation of the LAPACK and BLAS libraries
is used in the serial version. In the CUDA version there are calls to routines from the ATLAS
project and from the CuBLAS library.
For the numerical methods layer, a library was created. The design of this library follows an
object oriented philosophy. To save resources (memory and time), the current implementation
of both libraries is not thread safe (thread safety here refers to host threads): there are
non-reentrant functions (static variables are used). The simulation layer is a program that uses
both layers.
The memory is allocated during the initialization of the program, minimizing the number of
allocations and ensuring that the distinct allocator implementations (the host's malloc and the
device's malloc) and their implications interfere as little as possible.
4.3.1 Data structures
The data structures used reflect the first two layers. The main data structure for the algebra
operations is the single precision (32 bit) floating point array. Two data structures were
created to store the configuration data (orders, matrices, pivot arrays) for the Runge-Kutta and
compact scheme methods.
4.3.2 Program and algorithms
The core of the simulation is a simple sequential loop in the time variable. In each loop
iteration, the velocity field u is updated by the Runge-Kutta integration. All data is logged into
memory and dumped to a file at the end. The pseudocode for the program is shown in figure 4.1.
In the CUDA version, after the problem initialization, all the necessary data is copied to the
GPU memory. Only at the end is the data downloaded back into the main memory.
For the linear system solver, two direct methods were considered: an LU (with partial piv-
oting) solver and a matrix inversion solver. In both methods all constants are computed during
initialization, always using a serial CPU method: the LU solver computes the pivots, L and U
during initialization; the A^{-1} matrices are computed once (using the CPU) during
initialization. Using an explicit scheme was considered (replacing the compact schemes) but,
inside the main loop, it would be equivalent to the inverse approach in terms of computations.
2 http://math-atlas.sourceforge.net/
begin
init compact()
init RK4()
init u0()
for n := 0 to NT{
t = n ∗ dt;
u = RK4(t, u0, F (u));
store(u);
u0 = u;
}
dump();
where
proc F(u) ≡
ν ∗ derivative2(u) − U0 ∗ derivative1(u).
end
Figure 4.1: Main program
Solving an LU factorized linear system on GPU
The only routine inside the main loop that wasn't implemented by any CUDA based package
is the equivalent of the LAPACK sgetrs routine. This routine forms a pair with the sgetrf routine:
sgetrf computes the LU factorization with partial pivoting of a general M × N matrix, and sgetrs
solves a linear system AX = B with that previously factorized A matrix.
The netlib version of the sgetrs function is used as a guide to port the function to the CUDA
architecture. The netlib version of the routine makes calls to BLAS routines that are
available in the CuBLAS package, so those were used. It also calls a LAPACK internal routine,
slaswp, which had to be implemented.
The slaswp is a routine that applies a given permutation (in the form of row interchanges) to a
matrix. It receives an integer array with the indices of the rows that need to be permuted and
the matrix to operate on. The algorithm, as implemented by LAPACK, is inherently sequential
because the order of the row interchanges matters: in the LAPACK standard, the indices in the
pivot array returned by the sgetrf routine may be repeated. This leads to differences if the
interchanges are applied in different orders. An example is illustrated in figure 4.2: it's a
simple case where the pivot vector is full of ones. If the predetermined (sequential) order isn't
followed, the final output will differ, leading to an erroneous solution. If a naive
decomposition is done, the order of access isn't known and, for example, the solution of case 1
in the figure (where the thread responsible for interchanging the second row is the first to do
it, then the one responsible for the first row, and then the one for the third row) is different
from case 2. However, the columns of the solution matrix are completely independent, so a
decomposition may be done mapping each task to a column. For square matrices the order gets
reduced from O(N²) to O(N²/p)3. For the vector case (i.e., when the size of the matrix is N × 1)
there is no performance gain. Texture memory is used to access the pivot vector. This approach
was chosen mainly because of the degree of flexibility that it presents.
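A serial sketch of the column-oriented decomposition (a minimal illustration, not the thesis code: `slaswp_by_columns` is a hypothetical name, the outer column loop is the dimension that would be mapped to CUDA threads, and `ipiv` is taken 0-based here, unlike LAPACK's 1-based convention):

```c
#include <stddef.h>

/* Apply LAPACK-style row interchanges to an n x m column-major matrix.
 * Columns are independent, so the outer loop is the parallel dimension;
 * within one column the interchanges must follow the sequential order,
 * because repeated pivot indices make the order significant. */
void slaswp_by_columns(float *a, int n, int m, const int *ipiv)
{
    for (int col = 0; col < m; col++) {        /* parallelizable loop */
        float *c = a + (size_t)col * n;        /* start of this column */
        for (int i = 0; i < n; i++) {          /* sequential interchanges */
            int p = ipiv[i];
            if (p != i) { float t = c[i]; c[i] = c[p]; c[p] = t; }
        }
    }
}
```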
Initial condition: A = (a b c / d e f / g h i) and P = [1, 1, 1].
Correct solution (with predetermined order 3 → 2 → 1): A = (g h i / a b c / d e f).
1. sequence: 2 → 1 → 3:
(a b c / d e f / g h i) -(init,2)-> (d e f / a b c / g h i) -(2,1)-> (d e f / a b c / g h i) -(1,3)-> (g h i / a b c / d e f)
2. sequence: 1 → 3 → 2:
(a b c / d e f / g h i) -(init,1)-> (a b c / d e f / g h i) -(1,3)-> (g h i / d e f / a b c) -(3,2)-> (d e f / g h i / a b c)
Figure 4.2: Problem of the row interchange order
If, for example, the pivot vector were computed in a way that no repetitions exist, the order of
substitution wouldn't be relevant, the task decomposition could be row oriented, and the
algorithm further optimized. A drawback of this approach is the usage of more memory: there is
an inherent race condition, since the row interchange isn't an atomic operation and can lead to
incoherent states. To prevent this from happening, the values are updated on a distinct block of
memory.
3 p is the number of processors.
4.3.3 Computational Resources
Processing considerations
The Runge-Kutta method uses the following numbers of calculations:
• the field F(t, x) in equation 4.2 is computed 4 times;
• 6 scalar-vector products;
• 7 vector sums.
Each time equation 4.2 is computed, the program needs to compute two derivatives (each
derivative requires one vector-matrix product and one linear system solution, which depends on
the solver, LU or inverse) and:
• 2 scalar-vector products;
• 1 vector sum.
Table 4.1 shows all the operations done per time iteration.
Table 4.1: Summary of computational operations
Operation LU inverse
scalar-vector product 10 10
vector-matrix product 1 3
vector sum 9 9
LU solve 2 0
All the previous operations except the LU solve fit the massively parallel processing
paradigm, as shown in the previous chapter. The LU solver is detailed in section 4.3.2.
Memory considerations
Table 4.2 gives a model of the memory usage. This model doesn't account for the temporary
memory used within the algebra routines.
Table 4.2: memory usage in floating point elements
Data memory
A1,A2,B1,B2 4N2
temporary memory 5N
simulation log NT ×N
total N(4N + 5 +NT )
The current GPUs have at least 512MB of memory so there are no constraints: a simulation
with N = NT = 1000 is expected to use as little as 20MB.
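The model in table 4.2 can be evaluated directly (a sketch; `memory_bytes` is an illustrative name, and each single precision element occupies 4 bytes):

```c
/* Memory usage of the solver, in bytes, following table 4.2:
 * 4*N^2 elements for the A/B matrices, 5*N temporaries, NT*N for the log. */
double memory_bytes(double n, double nt)
{
    double elements = n * (4.0 * n + 5.0 + nt);
    return 4.0 * elements;  /* 4 bytes per single precision element */
}
```

For N = NT = 1000 this gives 20,020,000 bytes, i.e. the roughly 20 MB quoted above.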
4.4 Metrics
The essential metric in this work is the time ratio between the serial version and the CUDA
version (equation 4.13):

G = t_serial / t_cuda   (4.13)

Since the initialization is always done on the CPU, another important measure for the CUDA
version is the ratio between the initialization time and the total time (equation 4.14):

R = t_init / t_total   (4.14)
The time measurements of the main loop include the data transfer from the device at the end of
the program.
Since the hardware architectures (i.e., the Intel CPU and the NVIDIA GPU) differ, their imple-
mentations of the IEEE-754 floating point standard may differ. Moreover, for performance
reasons, GPUs don't implement all functions in a compliant manner [25, A2]. It's important to
check whether the differences between the two solutions of the same problem are negligible. To
measure the distance between both solutions, the mean square error is used. It is computed along
the line for a given time, and its average and maximum are taken, i.e., the average and maximum
of equation 4.15 (where C_i^n is the solution point for (t, x) = (n, i) computed with the CPU and
G_i^n the one computed with the GPU):

MSE_n = (1/N_x) Σ_{i=1}^{N_x} (C_i^n − G_i^n)²   (4.15)
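A minimal sketch of the per-time-step comparison (`mse_line` is an illustrative name; accumulation is done in double to avoid losing precision in the sum):

```c
/* Mean square error between the CPU solution c and the GPU solution g
 * along one line of nx points (equation 4.15). */
double mse_line(const float *c, const float *g, int nx)
{
    double acc = 0.0;
    for (int i = 0; i < nx; i++) {
        double d = (double)c[i] - (double)g[i];
        acc += d * d;
    }
    return acc / nx;
}
```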
4.5 Simulation Results
All the simulations respected the following constraints:
• Courant number: C = 0.3;
• Fourier number: F = 0.1;
• number of time steps: 500.
Performance is evaluated as a function of the problem size.
4.5.1 Performance
In figure 4.3 the evolution of the speedup of both methods is represented. The left axis shows
the scale for the inverse method, while the right axis shows the scale for the LU method.
The results are significantly different for each method. The speedup obtained with the CUDA
version of the inverse method is always greater than unity: the lowest value is 6.5 (N = 500)
and the maximum is 15.2 (N = 2000). With the LU method, a speedup of 1 was achieved in only one
case; in all other cases the performance was poorer when compared with the serial version.
Another important measure is the comparison of the best method on each platform, i.e., the
inverse method on the GPU against the LU method on the CPU (T_C,lu / T_G,inv), which is
represented in figure 4.3 by the dashed blue line and the left scale. The computation is now 3.5
to 10.8 times faster.
[Figure: speedup versus problem size (0 to 5000). Left axis (0-16): inverse solver and best-case comparison; right axis (0.4-1.2): LU solver.]
Figure 4.3: Absolute speedup
The evolution of the time ratio between the initialization part and the total simulation (done
on the GPU) is represented in figure 4.4. The left axis shows the scale for the inverse method
and the right axis the scale for the LU method. As can be seen, as the number of points grows,
the initial inversion becomes very significant. With N = 500 the ratio is about 30% of the
total; with N = 2000 (the speedup peak), the initialization takes 57% of the total; and for
N = 5000 it is 82%. With the LU method the evolution is completely different, as the
initialization fraction is negligible (< 3%). The same figure presents the ratio between the
initialization part of the inverse method done on the CPU and on the GPU
(T_C,init-inv / T_G,init-inv), on the left axis scale. This relation essentially represents the
weight of the PCI-Express transfer: the distance of the curve from 1 represents the difference
between the initialization runtime of each version (in the code, the only differences are the
memory allocation and the transfer to the device). As seen in the figure, for N = 500 it takes
around 25% of the time; for N ≥ 2000 it represents no more than 3%. This clearly explains the
performance disruption shown in figure 4.3: the initialization fraction becomes the predominant
one and, because only a very small fraction of the time is spent on bus transfers, the inversion
becomes the main computational problem.
[Figure: initialization fraction versus problem size (0 to 5000). Left axis (20%-100%): inverse solver and GPU/CPU inverse ratio; right axis (0.75%-2.75%): LU solver.]
Figure 4.4: Initialization ratio
The initialization is now known to be a big burden in the problem. Figure 4.5 represents what
would be expected if the initialization became insignificant. This could be achieved in two
ways: by using an explicit scheme for the derivatives, or by increasing the number of time
iterations (thus diminishing the sequential part of the problem). The speedup of the inverse
method increases steadily up to a maximum of 43.3. The LU method maintains the expected
behaviour.
[Figure: loop-only speedup versus problem size (0 to 5000). Left axis (0-50): inverse solver; right axis (0.4-1.2): LU solver.]
Figure 4.5: Loop speedup
Mixed Strategy Approach
From the previous results is known that, in the inverse method, performance is greatly affected
by the computation of the inverse itself. More, it’s also known that the LU method should perform
better on systems which the relation rows/columns of the solution matrix B is equal to 1 or less.
To compute the inverse of a matrix is to solve a particular linear system for which that relation is
exactly 1. And, in fact, the method used to compute the inverse matrix is the LU method: first the
matrix to be inverted is factorized into the L and U matrices and then, the system LUA−1 = I,
where I is the identity matrix. The A matrix is still computed on the CPU (using the LAPACK
routine) but the inverse computation is done in the GPU. So, when compared to the compact
scheme linear system, it should perform much better.
In the figure 4.7 the results from the new method are presented. The black continuous line
(referred to the left axis scale) refers to the new initialization fraction. This fraction now belongs
to the interval 14% to 60% (instead of 30% to 82% ). When comparing directly the initialization
speedups (blue-dashed line and right side scale), the best case is for N = 1500, where the gain is
nearly 4.1. Then, the gains continuously decreases (in the window of observation it goes to 3.1).
[Figure: GPU inverse initialization ratio (left axis, 10%-60%) and initialization speedup (right axis, 3-4.25) versus problem size (0 to 5000).]
Figure 4.6: Initialization with the inverse computed on the GPU
Finally, figure 4.7 presents the global speedup obtained: the left axis shows the speedup
compared with the case of computing the inverse matrix on the CPU (black continuous line). There
are two remarks to make (both a consequence of the initialization being a considerable
fraction):
1. the performance gain is always greater than 1, which means that, in the end, a performance
boost was achieved for every problem size;
2. the boost is continuously increasing - even if at a slow rate - which means that the
initialization burden was somewhat mitigated.
The updated best-method comparison is presented in the same figure with the blue dashed line
and the right axis scale: the results were boosted, as the black continuous line had shown.
The speedups are now between 4.3 and 18.8.
[Figure: speedup versus problem size (0 to 5000). Left axis (1.2-2.4): CPU inverse versus GPU inverse; right axis (4-20): CPU LU versus GPU inverse.]
Figure 4.7: Speedup with the inverse computed on the GPU
4.5.2 Numeric errors
All the solutions given by the GPU were compared with the CPU ones. While differences do exist,
they are negligible: all errors, accounted for by using equation 4.15, were less than 0.1%. The
only exceptions (where the code became unstable) were the inverse method on the CPU with sizes
N = 4500 and N = 5000.
However - even though it couldn't be determined precisely in which situations, and there is only
the guarantee, given the verification done, that the simulation results are not affected - the
implemented LU solver presents some instabilities for some A matrices.
4.6 Summary
The present chapter objective is to implement and present all the knowledge acquired during
this research in a test case. A brief presentation of the numeric methods behind the implemented
solution of the unidimensional transport equation were presented. Two direct methods for solv-
ing linear systems were implemented and compared with the sequential solution. An additional
method was used and the speedup was increased. The knowledge acquired from the previous chap-
ter was essential since it provided a practical framework of experience in terms of block number
and configuration and memory transfers (either host-device transfers or intra device transfers).
Because of its novelty, there is no known literature to compare results of direct dense linear solvers
45
using GPUs. However, the mixed approach 18.8 result is quite promising. It was also shown that
the old sequential best methods may not be as good in parallel approaches.
Chapter 5
Conclusion
5.1 Summary
The work developed in the present thesis investigates the potential of GPUs as scientific
computing devices and, in particular, the usage of GPUs in the solution of the uni dimensional
convection-diffusion problem. The motivation is clear: currently, the only way to significantly
increase performance in scientific problem solving - whether the objective is to solve more
problems in the same time or to solve larger problems, increasing the precision or the size of
the problem - is to go parallel. GPUs are a low cost solution when compared with the other
choices available on the market.
In the present work, the concepts related to parallel computing (as well as their implementation
on the devices) were studied.
The platform was tested in all major aspects: processing, communication with the host com-
puter and in-device memory transfers. The results were compared with equivalent operations done
on the CPU. The inherent complexity of parallel systems results in many configurations and pos-
sible strategies. The combinations that achieve higher performances are related with the hardware
itself.
Finally, a particular problem to be solved using GPU-based technologies was presented. Be-
cause the technology is new, there is not yet a software framework equivalent to the one existing
for serial computing. Two direct methods were studied in order to solve a linear system: an
LU-based method and the inverse matrix method. The LU-based method shows very poor per-
formance for linear systems whose solution matrix has many more rows than columns. The
inverse-based approach achieves significant speedups. To improve the inverse method's performance
further, the implemented LU method was used to invert the matrix. This approach resulted in
still better performance.
5.2 Conclusions
The main objective of this work was to investigate whether a class of problems in the compu-
tational fluid dynamics domain could benefit from the possibilities opened by GPU-based computing.
This objective was accomplished, as speedups between 4 and 18 were obtained.
The study of the parallel computing paradigm and of its influence on the device's programming
model stems from the fact that the use of GPUs for scientific problem solving is still a novelty.
Because of the device's design, the best performance is generally obtained with massive problems.
Below that scale there is a global serialization effect: even though the programming style is parallel,
whenever parallel-like behaviour is visible (i.e., whenever a clear order-of-magnitude reduction in
computation time appears), the code is not yet benefiting from the full potential of the GPU.
This size factor means that, with the hardware used, communication between the host and the
device becomes progressively less significant when compared with the traditional sequential access
pattern to the variables. This gap opens the possibility of completely hiding the bus transfer,
which is easily and transparently achieved by using mapped memory (one of CUDA's features).
However, using mapped memory (and, in a general sense, asynchronous operations)
may pose additional problems, as race conditions may occur (simultaneous accesses to the same
memory region by the host and the device).
With respect to the device's memory system, major differences from the host memory were
observed: in current (multi-core) computers the system memory bus is shared by 4 cores,
while in the GPU the equivalent bus is shared among at least 8 scalar cores (as a concrete example,
the GPU used in this work has 240 scalar cores). This implies that knowing
how to exploit the device's memory system has a significant impact on the results obtained. It
was verified that using memory access patterns the device is able to coalesce (serving the
memory requests of many threads with a single memory transaction) is crucial for achieving
maximum performance. The number of requests is also important, however, and the
access pattern becomes less important for a small number of requests. When repeated access to a
vector variable is needed, there are several ways to obtain cached (and therefore faster) access to it:
using the hardware texture and constant caches (read access only) or implementing a cache
mechanism with shared memory. Each of them has its benefits and limitations. Regarding
constant memory: it has to be statically defined and its size is relatively small (64 KB), so it is
impossible to use it on large problems. The cached access for the studied cases performed as well
as shared memory. Regarding texture memory: since it is possible to use linear memory
as a texture, it can be used in most cases. Access through the texture cache performed worse
than constant or shared memory, but was faster than the direct use of global memory. Lastly,
shared memory was the fastest memory but poses one big problem: implementing the cache
mechanism in software is not a trivial task, and because of bank conflicts the access to shared memory
can become serialized and thus slower. When compared with typical cache memories¹ (which
have hard-coded hardware strategies defining which data is kept in the cache), shared
memory presents the advantage that, even at a high cost, intelligent
cache strategies oriented towards the algorithm itself can be implemented.
Finally, to achieve the main goal of the present work, a one-dimensional convection-diffusion
transport equation solver was implemented. A particular finite difference scheme (the
compact scheme) was used for the spatial derivatives, and an explicit iterative method (fourth-order
Runge-Kutta) was used for the time derivative. The partial differential equation is thereby transformed
into a specific linear algebra problem: solving a linear system. Two direct methods for solving the linear
systems were compared: an LU method and the inverse method. Both methods allow the
initial constants to be computed on the host, transferred once, and then used exclusively on the device
during the main loop. The LU method (widely used and known to be a fast method on the
CPU), as standardized in LAPACK, is inherently sequential with respect to the rows but can be
parallelized with respect to the columns. The inverse method, on the other hand, is slower in CPU
implementations. The implemented LU method performs poorly on the device for this problem,
since its solution is just one column. The inverse method outperforms the LU method, which clearly
shows that how well an algorithm suits sequential computation is uncorrelated with its performance
in parallel computing.
The fact that matrix inversion on the CPU is an expensive computation means that the
method's scalability is compromised for problems with sizes larger than 2000. This cost also completely
hides the eventual latency added by data transfers to the device. The way the inverse is
computed on the CPU (using the same factorization as in the LU case) is the slowest part of
obtaining the inverse matrix. This fact, together with having a linear system solver already
implemented, led to using that solver to invert the matrix: the factorization is still done on
the CPU, but the process of obtaining the inverse matrix (a particular linear system
solving problem in which the number of right-hand-side columns equals the number of rows) is done on the GPU.
This strategy resulted in a performance increase by a factor of approximately 2 when compared with
the previous strategy of computing the inverse on the CPU.

¹ For example, the CPU caches and the constant and texture caches on each multiprocessor.
5.3 Future Work
The use of GPUs in scientific computing is a completely new world, and compared with
CPU approaches much remains to be done in many directions. The knowledge acquired in this
thesis opens up a range of optimizations to be applied to the code developed, but that is just a
small fraction of what could be done. Looking ahead, the following ideas are suggested:
• improve the knowledge of the device's scheduler, as a complete understanding of it will lead
to better performance;
• test with more devices. The empirical knowledge gained in this work should be confirmed
on other devices (of different sizes and capabilities);
• algorithms that constantly need to transfer memory to and from the device were not studied.
The literature mentions several benefits of mixed approaches, in the form of higher
performance or of improved precision;
• when solving the Burgers equation, a final massive data download is done. Even if the main
performance obstacle is the initialization fraction, higher performance could be achieved if
asynchronous (but smaller) data transfers were made within the loop itself, hiding this cost
completely;
• a general dense direct approach to solving the linear system was selected. Other
methods should therefore be studied, namely iterative and banded methods;
• the bottleneck in the LU system should be clearly identified. There is also ongoing work on factorizations [36],
so the performance can be significantly improved;
• the numeric instabilities of the LU solver should be studied and understood in depth, as the
implications can be of great importance if results obtained on the CPU are blindly mixed with
results obtained on the GPU;
• the nature of the LU solver is clearly suited to the 2.5D problem, in which one method
governs one dimension and another method is responsible for the other two di-
mensions (the transversal section). This problem leads to a right-hand side of AX = B in which the
number of columns of B is proportional to the area of the transversal section;
• in the current solution, improved performance could be obtained by computing the first and second
derivatives at the same time and, should the results make it pertinent, by using the CPU
and the GPU simultaneously so as to exploit the system as a whole;
• the use of multiple GPUs was not explored. This strategy poses the problem of sharing the PCI-
Express bus, which must be dealt with;
• clustering GPUs to solve even larger problems;
• due to time constraints, a similar solution using typical cluster technologies was not imple-
mented. It would have been important to compare both parallel solutions.
Bibliography
[1] Ram Meenakshisundaram's Transputer home page. http://www.classiccmp.org/transputer/atw800.htm.
[2] STREAM benchmark: counting of data transfers. http://www.cs.virginia.edu/stream/ref.html#counting.
[3] G. S. Almasi and A. Gottlieb. Highly Parallel Computing. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, USA, 1989.
[4] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. pages 79-81, 2000.
[5] Sergio Barrachina, Maribel Castillo, Francisco D. Igual, Rafael Mayo, Enrique S. Quintana-Ortí, and Gregorio Quintana-Ortí. Exploiting the capabilities of modern GPUs for dense matrix computations. Technical report, Universidad Jaime I, 2008.
[6] Barbara Chapman, Gabriele Jost, and Ruud van der Pas. Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation). The MIT Press, 2007.
[7] Jonathan Cohen and Michael Garland. Solving computational problems with GPU computing. Computing in Science and Engineering, 11(5):58-63, 2009.
[8] NVIDIA Corporation. Transform & lighting. Technical brief.
[9] David Culler, J. P. Singh, and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach (The Morgan Kaufmann Series in Computer Architecture and Design). Morgan Kaufmann, August 1998.
[10] Zhe Fan, Feng Qiu, Arie Kaufman, and Suzanne Yoakum-Stover. GPU cluster for high performance computing. In SC '04: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, pages 47+, Washington, DC, USA, 2004. IEEE Computer Society.
[11] K. Fatahalian, J. Sugerman, and P. Hanrahan. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. In HWWS '04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages 133-137, New York, NY, USA, 2004. ACM.
[12] Randima Fernando and Mark J. Kilgard. The Cg Tutorial: The Definitive Guide to Programmable Real-Time Graphics. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2003.
[13] Joel H. Ferziger and Milovan Perić. Computational Methods for Fluid Dynamics. Springer, 2nd edition, 1997.
[14] Michael J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computers, C-21(9):948-960, September 1972.
[15] Field G. Van Zee, Ernie Chan, Robert van de Geijn, Enrique S. Quintana-Ortí, and Gregorio Quintana-Ortí. Introducing: The libflame library for dense matrix computations. CiSE, page 9.
[16] Michael Garland. Sparse matrix computations on manycore GPUs. In DAC '08: Proceedings of the 45th Annual Design Automation Conference, pages 2-6, New York, NY, USA, 2008. ACM.
[17] Dominik Göddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Hilmar Wobker, Christian Becker, and Stefan Turek. Using GPUs to improve multigrid solver performance on a cluster. Int. J. Comput. Sci. Eng., 4(1):36-55, 2008.
[18] John L. Gustafson. Reevaluating Amdahl's law. Commun. ACM, 31(5):532-533, 1988.
[19] Johannes Habich. Performance evaluation of numeric compute kernels on NVIDIA GPUs. Master's thesis, Friedrich-Alexander-Universität, 2008.
[20] Mark Harris, William Baxter, Thorsten Scheuermann, and Anselmo Lastra. Simulation of cloud dynamics on graphics hardware. In Proc. Graphics Hardware, 2003.
[21] David Kanter. NVIDIA's GT200: Inside a parallel processor. Real World Technologies, http://realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=1, August 2008.
[22] Jens Krüger. Linear algebra on GPUs. In SIGGRAPH '05: ACM SIGGRAPH 2005 Courses, page 73, New York, NY, USA, 2005. ACM.
[23] Jens Krüger and Rüdiger Westermann. Linear algebra operators for GPU implementation of numerical algorithms. In SIGGRAPH '03: ACM SIGGRAPH 2003 Papers, pages 908-916, New York, NY, USA, 2003. ACM.
[24] Linda Null and Julia Lobur. Essentials of Computer Organization and Architecture. Jones and Bartlett Publishers, Inc., USA, 2003.
[25] NVIDIA. CUDA Programming Guide.
[26] Matt Pharr and Randima Fernando. GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation (GPU Gems). Addison-Wesley Professional, 2005.
[27] Martin Rumpf and Robert Strzodka. Using graphics cards for quantized FEM computations. In IASTED Visualization, Imaging and Image Processing Conference, pages 193-202, 2001.
[28] Allen R. Sanderson, Miriah D. Meyer, Robert M. Kirby, and Chris R. Johnson. A framework for exploring numerical solutions of advection-reaction-diffusion equations using a GPU-based approach. Comput. Vis. Sci., 12(4):155-170, 2009.
[29] Sanjiva K. Lele. Compact finite difference schemes with spectral-like resolution. Journal of Computational Physics, 103:16-42, 1992.
[30] Jos Stam. Stable fluids. In SIGGRAPH 99 Conference Proceedings, Annual Conference Series, pages 121-128, 1999.
[31] Andrew S. Tanenbaum. Modern Operating Systems. Prentice Hall Press, Upper Saddle River, NJ, USA, 2007.
[32] J. Tölke and M. Krafczyk. TeraFLOP computing on a desktop PC with GPUs for 3D CFD. Int. J. Comput. Fluid Dyn., 22(7):443-456, 2008.
[33] Stanimire Tomov, Jack Dongarra, and Marc Baboulin. Towards dense linear algebra for hybrid GPU accelerated manycore systems. Technical Report 210, LAPACK Working Note, October 2008.
[34] Vasily Volkov and James Demmel. LU, QR and Cholesky factorizations using vector capabilities of GPUs. Technical report, Electrical Engineering and Computer Sciences, University of California at Berkeley, 2008.
[35] Vasily Volkov and James W. Demmel. Benchmarking GPUs to tune dense linear algebra. In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1-11, Piscataway, NJ, USA, 2008. IEEE Press.
[36] Vasily Volkov and James W. Demmel. LU, QR and Cholesky factorizations using vector capabilities of GPUs. LAPACK Working Note 202, May 2008.
[37] Ye Zhao. Lattice Boltzmann based PDE solver on the GPU. Vis. Comput., 24(5):323-333, 2008.
Appendix A
Additional Information
A.1 Properties of some GPUs
                   Number of         Clock  Memory  Mem. Clock  Bus Width  Mem. Bandwidth
                   multiprocessors   (MHz)  (MB)    (MHz)       (bit)      (GB/s)
GeForce 8600 GT          4           1450    256      700         128        22.4
GeForce 8800 GT         14           1500    512      900         256        57.6
GeForce 9400 GT          2           1400    512      400         128        12.8
GeForce 9600 GT          8           1650    512      900         256        57.6
Quadro FX 1800           8           1400    768      800         192        38.4
Tesla C1060             30           1300   4096      800         512       102.0

Table A.1: Properties of several GPUs
Appendix B
Code Listings
B.1 Benchmarks
B.1.1 FLOP benchmark
Listing B.1: FLOP test

/*
 * Copyright 1993-2007 NVIDIA Corporation.  All rights reserved.
 *
 * NOTICE TO USER:
 *
 * This source code is subject to NVIDIA ownership rights under U.S. and
 * international Copyright laws.  Users and possessors of this source code
 * are hereby granted a nonexclusive, royalty-free license to use this code
 * in individual and commercial software.
 *
 * NVIDIA MAKES NO REPRESENTATION ABOUT THE SUITABILITY OF THIS SOURCE
 * CODE FOR ANY PURPOSE.  IT IS PROVIDED "AS IS" WITHOUT EXPRESS OR
 * IMPLIED WARRANTY OF ANY KIND.  NVIDIA DISCLAIMS ALL WARRANTIES WITH
 * REGARD TO THIS SOURCE CODE, INCLUDING ALL IMPLIED WARRANTIES OF
 * MERCHANTABILITY, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.
 * IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL,
 * OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS
 * OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE
 * OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE
 * OR PERFORMANCE OF THIS SOURCE CODE.
 *
 * U.S. Government End Users.  This source code is a "commercial item" as
 * that term is defined at 48 C.F.R. 2.101 (OCT 1995), consisting of
 * "commercial computer software" and "commercial computer software
 * documentation" as such terms are used in 48 C.F.R. 12.212 (SEPT 1995)
 * and is provided to the U.S. Government only as a commercial end item.
 * Consistent with 48 C.F.R. 12.212 and 48 C.F.R. 227.7202-1 through
 * 227.7202-4 (JUNE 1995), all U.S. Government End Users acquire the
 * source code with only those rights set forth herein.
 *
 * Any use of this source code in individual and commercial software must
 * include, in the user documentation and internal comments to the code,
 * the above Disclaimer and U.S. Government End Users Notice.
 */

/*
  This sample is intended to measure the peak computation rate of the GPU in
  GFLOPs (giga floating point operations per second).

  It executes a large number of multiply-add operations, writing the results to
  shared memory.  The loop is unrolled for maximum performance.

  Depending on the compiler and hardware it might not take advantage of all the
  computational resources of the GPU, so treat the results produced by this code
  with some caution.
*/

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#include <cutil.h>

#ifndef NUM_SMS
# define NUM_SMS (30)                 // 16
#endif
#ifndef NUM_THREADS_PER_SM
# define NUM_THREADS_PER_SM (1000)    // 384
#endif
#ifndef NUM_THREADS_PER_BLOCK
# define NUM_THREADS_PER_BLOCK (512)  // 192
#endif
#define NUM_BLOCKS ((NUM_THREADS_PER_SM / NUM_THREADS_PER_BLOCK) * NUM_SMS)
#define NUM_ITERATIONS 10
#if NUM_BLOCKS == 0
#define NUM_BLOCKS 1
#endif

// 128 MAD instructions
#define FMAD128(a, b) \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a;

__shared__ float result[NUM_THREADS_PER_BLOCK];

__global__ void gflops()
{
    float a = result[threadIdx.x];  // this ensures the mads don't get compiled out
    float b = 1.01f;

    for (int i = 0; i < NUM_ITERATIONS; i++)
    {
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
    }
    result[threadIdx.x] = a + b;
}

int
main(int argc, char** argv)
{
    CUT_DEVICE_INIT(argc, argv);
    unsigned int timer = 0;

    // warmup
    gflops<<<NUM_BLOCKS, NUM_THREADS_PER_BLOCK>>>();
    CUDA_SAFE_CALL(cudaThreadSynchronize());

    // execute kernel
    CUT_SAFE_CALL(cutCreateTimer(&timer));
    CUT_SAFE_CALL(cutStartTimer(timer));

    gflops<<<NUM_BLOCKS, NUM_THREADS_PER_BLOCK>>>();

    CUDA_SAFE_CALL(cudaThreadSynchronize());
    CUT_SAFE_CALL(cutStopTimer(timer));
    float time = cutGetTimerValue(timer);

    // output results
    fprintf(stderr, "#block th/sms grid flops/cycle Time(ms) flops(G)\n");
    fprintf(stdout, "%3d %5d %5d %10ld %7.3f %7.3f\n", NUM_THREADS_PER_BLOCK,
            NUM_THREADS_PER_SM, NUM_BLOCKS, 128 * 16 * 2 * NUM_ITERATIONS, time,
            128.0 * 16.0 * 2.0 * NUM_ITERATIONS * NUM_BLOCKS * NUM_THREADS_PER_BLOCK / time * 1e-6);

    CUT_EXIT(argc, argv);
}

/* vim: set ft=cpp: */
B.1.2 Bandwidth
Listing B.2: memory access

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <float.h>
#include <cuda.h>
#include <cuda_runtime.h>

#include "aux.h"

#if CUDART_VERSION < 2020
#error "This CUDART version does not support mapped memory!\n"
#endif

#define NMP 30
#define TpMP 1024
#define TpB 64

#define GLOBAL 1
#define TEX 2
#define CONST 3
#define GLOBAL_NC 4
#define GLOBAL_C 5
#define TEX_C 6
#define CONST_C 7

#ifndef N
# define N (1<<26)
//# define N 14720
#endif

#ifndef NROUNDS
# define NROUNDS 10
#endif

#ifndef DTYPE
# define DTYPE float
# define BpW 4
#endif

typedef struct __align__(16) { float a[3]; } f3;

#define SIZE (N*BpW)
#define mSIZE(x) (x*sizeof(DTYPE))

#if SIZE <= 1<<16
__constant__ DTYPE d_c[N];
#endif

__global__ void gpu_COPY(DTYPE *, DTYPE *, int);
__global__ void copy_tex(DTYPE *a, int);
__global__ void copy_cte(DTYPE *a, int);

void golden_COPY(DTYPE *, DTYPE *);

int N_E;
dim3 grid, block;
texture<float, 1, cudaReadModeElementType> tex;

int
check(DTYPE *x, DTYPE *y, int nn)
{
    int i;
    for (i = 0; i < nn; i++) {
        if (x[i] != y[i]) {
            fflush(stdout);
            fprintf(stderr, "error at index %d: (x,y)=(%f,%f)\n", i, x[i], y[i]);
            fflush(stderr);
        }
    }
    return 0;
}

extern "C" {
#include <sys/time.h>
}

double mclock()
{
    struct timeval t1;
    // struct timezone tz;
    gettimeofday(&t1, NULL);
    return (double) t1.tv_sec + (double) t1.tv_usec * 1e-6;
}

static void
output(dim3 g, dim3 b, double times[NROUNDS], size_t elements, char s[])
{
    double avgtime = 0, maxtime = 0, mintime = FLT_MAX;
    int i; size_t bytes = elements * sizeof(DTYPE);

    for (i = 1; i < NROUNDS; i++) {
        avgtime += times[i];
        maxtime = (maxtime > times[i]) ? maxtime : times[i];
        mintime = (mintime < times[i]) ? mintime : times[i];
    }
    avgtime /= (double)(NROUNDS - 1);

    printf("%5d %3d %10d %11d %8.2f %8.2f\n", g.x, b.x, elements, bytes,
           avgtime * 1e6, (bytes * 1e-6) / avgtime);
    fflush(stdout);
}

int
main(int argc, char **argv)
{
    DTYPE *h_a, *h_b;
    DTYPE *d_a, *dd;
    int i, j;
    size_t bytes, size;
#if N <= 8192
    cudaArray *d_b;
#elif SIZE <= (1<<16) && N > 8192
    DTYPE *d_b;
#else
    DTYPE *d_b;
    DTYPE *d_c = NULL;
#endif
    char *labels[] = {"global", "texture", "constant"};
    double times[NROUNDS];
    int op[] = {GLOBAL, TEX, CONST};

    cuda_init(argc, argv);

    block.x = TpB;

    cuda_error_e(cudaHostAlloc((void**) &h_a, SIZE, 0));
    cuda_error_e(cudaHostAlloc((void**) &h_b, SIZE, 0));

    cuda_error_e(cudaMalloc((void**) &d_a, N*sizeof(DTYPE)));
    cuda_error_e(cudaMalloc((void**) &dd, N*sizeof(DTYPE)));
    cudaMemset(dd, 0, N*sizeof(DTYPE));

    memset(h_a, 0, SIZE);
    cuda_error_e(cudaMemcpy(d_a, h_a, SIZE, cudaMemcpyHostToDevice));

#if N <= (8192)
    cuda_error_e(cudaMallocArray(&d_b, &tex.channelDesc, SIZE, 1));
    cuda_error_e(cudaMemcpyToArray(d_b, 0, 0, (void*) d_a, SIZE, cudaMemcpyDeviceToDevice));
    tex.normalized = false;
    cudaBindTextureToArray(tex, d_b);
#else
    cuda_error_e(cudaMalloc((void**) &d_b, SIZE));
    cuda_error_e(cudaMemcpy(d_b, d_a, SIZE, cudaMemcpyDeviceToDevice));
    tex.normalized = false;
    // cudaBindTexture(0, tex, d_b, cudaCreateChannelDesc(8*sizeof(DTYPE), 0, 0, 0, cudaChannelFormatKindFloat), SIZE);
    cudaBindTexture(0, tex, d_b, cudaCreateChannelDesc(24*sizeof(DTYPE), 0, 0, 0,
                    cudaChannelFormatKindFloat), SIZE);
#endif

#if SIZE <= (1<<16)
    cuda_error_e(cudaMemcpyToSymbol(d_c, h_a, SIZE));
#endif

    /* global */
    printf("#global\n");
    printf("%5s %3s %10s %11s %9s %9s\n", "grid", "blck", "points", "bytes", "avgtime", "bandwidth");
    for (size = 1<<10; size <= N; size = size<<1) {
        bytes = size * sizeof(DTYPE);
        if (size > NMP*TpMP)
            grid.x = (TpMP / block.x) * NMP;
        else
            grid.x = (size / block.x) + 1;

        for (i = 0; i < NROUNDS; i++) {
            times[i] = mclock();
            gpu_COPY<<<grid, block>>>(dd, d_a, size);
            cudaThreadSynchronize();
            times[i] = mclock() - times[i];
            // cudaMemcpy(h_b, dd, bytes, cudaMemcpyDeviceToHost);
            // check(h_a, h_b, size);
        }
        output(grid, block, times, 2*size, labels[0]);
    }

    /* texture */
    printf("#texture\n");
    printf("%5s %3s %10s %11s %9s %9s\n", "grid", "blck", "points", "bytes", "avgtime", "bandwidth");
    for (size = 1<<10; size <= N; size = size<<1) {
        bytes = size * sizeof(DTYPE);
        if (size > NMP*TpMP)
            grid.x = (TpMP / block.x) * NMP;
        else
            grid.x = (size / block.x) + 1;

        for (i = 0; i < NROUNDS; i++) {
            times[i] = mclock();
            copy_tex<<<grid, block>>>(dd, size);
            cudaThreadSynchronize();
            times[i] = mclock() - times[i];
            // cudaMemcpy(h_b, dd, bytes, cudaMemcpyDeviceToHost);
            // check(h_a, h_b, size);
        }
        output(grid, block, times, 2*size, labels[0]);
    }

#if SIZE <= (1<<16)
    /* constant */
    printf("#constant\n");
    printf("%5s %3s %10s %11s %9s %9s\n", "grid", "blck", "points", "bytes", "avgtime", "bandwidth");
    for (bytes = 1<<12; bytes <= SIZE; bytes += 1024) {
        // for (bytes = 1<<12; bytes <= SIZE; bytes = bytes<<1) {
        long size = bytes / sizeof(DTYPE);
        if (size > NMP*TpMP)
            grid.x = (TpMP / block.x) * NMP;
        else
            grid.x = (size / block.x) + 1;

        for (i = 0; i < NROUNDS; i++) {
            times[i] = mclock();
            copy_cte<<<grid, block>>>(dd, size);
            cudaThreadSynchronize();
            times[i] = mclock() - times[i];
            // cudaMemcpy(h_b, dd, bytes, cudaMemcpyDeviceToHost);
            // check(h_a, h_b, size);
        }
        output(grid, block, times, 2*bytes, labels[0]);
    }
#endif
    return 0;
}

__global__ void
gpu_COPY(DTYPE *a, DTYPE *b, int nn)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n;
    int delta;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < nn; n += delta) {
        a[n] = b[n];
    }

    return;
}

__global__ void
copy_tex(DTYPE *a, int NN)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n;
    int delta;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < NN; n += delta)
        a[n] = tex1Dfetch(tex, n);
    return;
}

#if SIZE <= (1<<16)
__global__ void
copy_cte(DTYPE *a, int NN)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n;
    int delta;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < NN; n += delta) {
        a[n] = d_c[n];
    }
    return;
297 }
298 #endif
299
300 void
301 g o l d e n _ C O P Y ( D T Y P E ∗ a , D T Y P E ∗ b )
302 {
303 int i ;
304
305 for ( i=0; i<N ; i++ ) {
306 a [ i ] = b [ i ] ;
307 }
308 return ;
309 }
310
311
312
313
314 /∗ v im : s e t f t =cpp : ∗/
315 /∗ EOF ∗/
Listing B.3: cached access
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <float.h>
#include <cuda.h>
#include <cuda_runtime.h>

#include "aux.h"

#if CUDART_VERSION < 2020
#error "This CUDART version does not support mapped memory!\n"
#endif

#define NMP 30
#define TpMP 4096
#define TpB 64

#define GLOBAL 1
#define TEX 2
#define CONST 3
#define GLOBAL_NC 4
#define GLOBAL_C 5
#define TEX_C 6
#define CONST_C 7

#ifndef N
# define N (1<<14)
#endif
#define SIZE (N*4)

#ifndef NROUNDS
# define NROUNDS 2
#endif

#ifndef DTYPE
# define DTYPE float
#endif

#if SIZE <= 1<<16
__constant__ DTYPE d_c[N];
#endif

__global__ void gpu_COPY_c(DTYPE *, DTYPE *, int, int);
__global__ void gpu_COPY_nc(DTYPE *, DTYPE *, int, int);
__global__ void copy_tex_c(DTYPE *a, int, int);
__global__ void copy_cte_c(DTYPE *a, int, int);

void golden_COPY(DTYPE *, DTYPE *);

int N_E;
dim3 grid, block;
texture<float, 1, cudaReadModeElementType> tex;
float *d_out;

int
check(DTYPE *x, DTYPE *y, int nn)
{
    int i;
    for (i = 0; i < nn; i++) {
        if (x[i] != y[i]) {
            fflush(stdout);
            fprintf(stderr, "error at index %d: (x,y)=(%f,%f)\n", i, x[i], y[i]);
            fflush(stderr);
        }
    }
    return 0;
}

extern "C" {
#include <sys/time.h>
}
double mclock()
{
    struct timeval t1;
    // struct timezone tz;
    gettimeofday(&t1, NULL);
    return (double) t1.tv_sec + (double) t1.tv_usec * 1e-6;
}


static void
output(dim3 g, dim3 b, double times[NROUNDS], size_t elements, char s[])
{
    double avgtime = 0, maxtime = 0, mintime = FLT_MAX;
    int i;
    size_t bytes = (1 + elements) * elements * sizeof(DTYPE);
    int t_mp = (g.x / NMP) * b.x;

    for (i = 1; i < NROUNDS; i++) {
        avgtime += times[i];
        maxtime = (maxtime > times[i]) ? maxtime : times[i];
        mintime = (mintime < times[i]) ? mintime : times[i];
    }
    avgtime /= (double)(NROUNDS - 1);


    printf("%5d %3d %4d %6d %10d %11d %12.2f %8.2f\n", g.x, b.x, t_mp, g.x*b.x, elements, bytes,
           avgtime*1e6, (bytes*1e-6)/avgtime);
    fflush(stdout);
}


int
main(int argc, char **argv)
{
    DTYPE *h_a, *h_b;
    DTYPE *d_a;
    int i;
    size_t bytes, size;
#if N <= 8192
    cudaArray *d_b;
#elif SIZE <= 1<<16 && N > 8192
    DTYPE *d_b;
#else
    DTYPE *d_b;
    DTYPE *d_c = NULL;
#endif
    char *labels[] = {"global-nc", "global-l", "texture-l", "constant-l"};
    double times[NROUNDS];
    int op[] = {GLOBAL_NC, GLOBAL_C, TEX_C, CONST_C};

    cuda_init(argc, argv);

    block.x = TpB;

    cuda_error_e(cudaHostAlloc((void**) &h_a, SIZE, 0));
    cuda_error_e(cudaHostAlloc((void**) &h_b, SIZE, 0));

    cuda_error_e(cudaMalloc((void**) &d_a, SIZE));
    cuda_error_e(cudaMalloc((void**) &d_out, SIZE));

    for (i = 0; i < N; i++) {
        h_b[i] = 1.0f;
    }
    cuda_error_e(cudaMemcpy(d_a, h_b, SIZE, cudaMemcpyHostToDevice));

#if N <= (8192)
    cuda_error_e(cudaMallocArray(&d_b, &tex.channelDesc, SIZE, 1));
    cuda_error_e(cudaMemcpyToArray(d_b, 0, 0, (void*) d_a, SIZE, cudaMemcpyDeviceToDevice));
    tex.normalized = false;
    cudaBindTextureToArray(tex, d_b);
#else
    cuda_error_e(cudaMalloc((void**) &d_b, SIZE));
    cuda_error_e(cudaMemcpy(d_b, d_a, SIZE, cudaMemcpyDeviceToDevice));
    tex.normalized = false;
    cudaBindTexture(0, tex, d_b, cudaCreateChannelDesc(8*sizeof(DTYPE), 0, 0, 0,
                    cudaChannelFormatKindFloat), SIZE);
#endif

#if SIZE <= 1<<16
    cuda_error_e(cudaMemcpyToSymbol(d_c, h_b, N*sizeof(DTYPE)));
#endif

    golden_COPY(h_a, h_b);

    /* global-nc */
    printf("#global\n");
    printf("%5s %3s %4s %6s %10s %11s %12s %8s\n", "grid", "blck", "TpMP", "threads", "points", "bytes",
           "avgtime", "bandwidth");
    for (size = 1<<10; size <= N; size = size<<1) {
        if (size > NMP*TpMP)
            grid.x = (TpMP / block.x) * NMP;
        else
            grid.x = (size / block.x) + 1;

        for (i = 0; i < NROUNDS; i++) {
            times[i] = mclock();
            gpu_COPY_nc<<<grid, block>>>(d_out, d_a, size, size);
            cudaThreadSynchronize();
            times[i] = mclock() - times[i];
            // cudaMemcpy(h_b, dd, bytes, cudaMemcpyDeviceToHost);
            // check(h_a, h_b, size);
        }
        output(grid, block, times, size, labels[0]);
    }

    /* global-c */
    printf("#global-c\n");
    printf("%5s %3s %4s %6s %10s %11s %12s %8s\n", "grid", "blck", "TpMP", "threads", "points", "bytes",
           "avgtime", "bandwidth");
    for (size = 1<<10; size <= N; size = size<<1) {
        if (size > NMP*TpMP)
            grid.x = (TpMP / block.x) * NMP;
        else
            grid.x = (size / block.x) + 1;

        for (i = 0; i < NROUNDS; i++) {
            times[i] = mclock();
            gpu_COPY_c<<<grid, block>>>(d_out, d_a, size, size);
            cudaThreadSynchronize();
            times[i] = mclock() - times[i];
            // cudaMemcpy(h_b, dd, bytes, cudaMemcpyDeviceToHost);
            // check(h_a, h_b, size);
        }
        output(grid, block, times, size, labels[0]);
    }

    /* tex-c */
    printf("#texture\n");
    printf("%5s %3s %4s %6s %10s %11s %12s %8s\n", "grid", "blck", "TpMP", "threads", "points", "bytes",
           "avgtime", "bandwidth");
    for (size = 1<<10; size <= N; size = size<<1) {
        if (size > NMP*TpMP)
            grid.x = (TpMP / block.x) * NMP;
        else
            grid.x = (size / block.x) + 1;

        for (i = 0; i < NROUNDS; i++) {
            times[i] = mclock();
            copy_tex_c<<<grid, block>>>(d_out, size, size);
            cudaThreadSynchronize();
            times[i] = mclock() - times[i];
            // cudaMemcpy(h_b, dd, bytes, cudaMemcpyDeviceToHost);
            // check(h_a, h_b, size);
        }
        output(grid, block, times, size, labels[0]);
    }

#if SIZE <= 1<<16
    /* cte-c */
    printf("#constant\n");
    printf("%5s %3s %4s %6s %10s %11s %12s %8s\n", "grid", "blck", "TpMP", "threads", "points", "bytes",
           "avgtime", "bandwidth");
    // for (bytes = 1<<12; bytes <= SIZE; bytes = bytes<<1) {
    for (size = 1<<10; size <= N; size = size<<1) {
        if (size > NMP*TpMP)
            grid.x = (TpMP / block.x) * NMP;
        else
            grid.x = (size / block.x) + 1;

        for (i = 0; i < NROUNDS; i++) {
            times[i] = mclock();
            copy_cte_c<<<grid, block>>>(d_out, size, size);
            cudaThreadSynchronize();
            times[i] = mclock() - times[i];
            // cudaMemcpy(h_b, dd, bytes, cudaMemcpyDeviceToHost);
            // check(h_a, h_b, size);
        }
        output(grid, block, times, size, labels[0]);
    }
#endif

    return 0;
}


/********** COPY KERNELs **********/
__global__ void
gpu_COPY_nc(DTYPE *a, DTYPE *b, int NN, int t)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n, k;
    int delta;
    DTYPE tmp;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < NN; n += delta) {
        tmp = 0.0f;
        for (k = 0; k < t; k++) {
            tmp += b[k];
        }
        a[n] = tmp;
    }
    return;
}

__global__ void
gpu_COPY_c(DTYPE *a, DTYPE *b, int NN, int t)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n, i, k;
    int delta;
#define BANK_SIZE 512
    __shared__ DTYPE sb[BANK_SIZE];
    // int dd = BANK_SIZE / (blockDim.x*blockDim.y);
    int dd = 16;
    DTYPE tmp;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < NN; n += delta) {
        tmp = 0.0f;
        for (i = 0; i < t; i += dd) {
            if (tid < dd) {
                sb[tid] = b[i + tid];
            }
            __syncthreads();
            for (k = 0; k < dd; k++) {
                if (i + k < t)
                    tmp += sb[k];
            }
        }
        a[n] = tmp;
    }

    return;
}


__global__ void
copy_tex_c(DTYPE *a, int NN, int t)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n, k;
    int delta;
    DTYPE tmp;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < NN; n += delta) {
        tmp = 0.0f;
        for (k = 0; k < t; k++)
            tmp += tex1Dfetch(tex, k);
        a[n] = tmp;
        // a[n] = tex1D(tex, (float) k);
    }
    return;
}

#if SIZE <= 1<<16
__global__ void
copy_cte_c(DTYPE *a, int NN, int t)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n, k;
    int delta;
    DTYPE tmp;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < NN; n += delta) {
        tmp = 0.0f;
        for (k = 0; k < t; k++)
            tmp += d_c[k];
        a[n] = tmp;
    }
    return;
}
#endif


void
golden_COPY(DTYPE *a, DTYPE *b)
{
    int i, j;
    float tmp;

    for (i = 0; i < N; i++) {
        tmp = 0.0f;
        for (j = 0; j < N; j++)
            tmp += b[j];
        a[i] = tmp;
    }
    return;
}

/* vim: set ft=cpp: */
/* EOF */
B.1.3 Stream
Listing B.4: Stream benchmark
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <float.h>
#include <cuda.h>
#include <cuda_runtime.h>

#include "aux.h"

#if CUDART_VERSION < 2020
#error "This CUDART version does not support mapped memory!\n"
#endif

#define COPY 0
#define SCALE 1
#define ADD 2
#define TRIAD 3

const char *labels[] = {"COPY ", "SCALE ", "ADD ",
                        "TRIAD "};

const int sbytes[] = {2, 2, 3, 3};
#ifndef DTYPE
# define DTYPE float
#endif

#if defined(NN)
# define N NN
#else
# define N (1<<15)
#endif

#ifndef NROUNDS
# define NROUNDS 10
#endif

#ifndef OPERATION
# define OPERATION COPY
#endif

#define SIZE (N*sizeof(DTYPE))
#define SIZEs (N*sizeof(DTYPE)*sbytes[OPERATION])

#define NMP 30
#define TB_SIZE 5
#define TMP_SIZE 10

typedef struct {
    DTYPE *a;
    DTYPE *b;
    DTYPE *c;
    DTYPE k;
    DTYPE size;
} stream;

__global__ void gpu_COPY(stream);
__global__ void gpu_ADD(stream s);
__global__ void gpu_SCALE(stream s);
__global__ void gpu_TRIAD(stream s);
void golden_COPY(stream s);
void golden_SCALE(stream s);
void golden_ADD(stream s);
void golden_TRIAD(stream s);

void
check(stream h, stream d)
{
    int i;
    DTYPE *a, *b, *c;

    a = (DTYPE *) malloc(SIZE);
    b = (DTYPE *) malloc(SIZE);
    c = (DTYPE *) malloc(SIZE);

    cuda_error_e(cudaMemcpy(a, d.a, SIZE, cudaMemcpyDeviceToHost));
    cuda_error_e(cudaMemcpy(b, d.b, SIZE, cudaMemcpyDeviceToHost));
    cuda_error_e(cudaMemcpy(c, d.c, SIZE, cudaMemcpyDeviceToHost));
    if (h.k != d.k) {
        fprintf(stderr, "k mismatch");
        exit(3);
    }

    for (i = 0; i < h.k; i++) {
        if (h.a[i] != a[i]) {
            fprintf(stderr, "not check: a on index %d", i);
            exit(3);
        }
        if (h.b[i] != b[i]) {
            fprintf(stderr, "not check: b on index %d", i);
            exit(3);
        }
        if (h.c[i] != c[i]) {
            fprintf(stderr, "not check: c on index %d", i);
            exit(3);
        }
    }
    free(a); free(b); free(c);
    return;
}

extern "C" {
#include <sys/time.h>
}
double mclock()
{
    struct timeval t1;
    gettimeofday(&t1, NULL);
    return (double) t1.tv_sec + (double) t1.tv_usec * 1e-6;
}


dim3 block, grid;


int
main(int argc, char **argv)
{
    int i, j;
    double times[TB_SIZE][TMP_SIZE];
    int tb_sizes[] = {32, 64, 128, 256, 512};
    int tmp_sizes[] = {30, 60, 120, 240, 480, 960, 1920, 3840, 7680, 15360};
    stream d_s, h_s;

    t_init(argc, argv);
    cuda_init(argc, argv);

    cuda_error_e(cudaHostAlloc((void**) &(h_s.a), SIZE, 0));
    cuda_error_e(cudaHostAlloc((void**) &(h_s.b), SIZE, 0));
    cuda_error_e(cudaHostAlloc((void**) &(h_s.c), SIZE, 0));

    cuda_error_e(cudaMalloc((void**) &d_s.a, SIZE));
    cuda_error_e(cudaMalloc((void**) &d_s.b, SIZE));
    cuda_error_e(cudaMalloc((void**) &d_s.c, SIZE));

    for (i = 0; i < N; i++) {
        h_s.a[i] = 1.0f;
        h_s.b[i] = 1.0f;
        h_s.c[i] = 0.0f;
    }
    d_s.size = h_s.size = N;
    d_s.k = h_s.k = 2.0f;
    printf("#operation(%d): %s vector size: %d, data size: %d\n", OPERATION, labels[OPERATION], N,
           SIZE);
    cuda_error_e(cudaMemcpy(d_s.a, h_s.a, SIZE, cudaMemcpyHostToDevice));
    cuda_error_e(cudaMemcpy(d_s.b, h_s.b, SIZE, cudaMemcpyHostToDevice));
    cuda_error_e(cudaMemcpy(d_s.c, h_s.c, SIZE, cudaMemcpyHostToDevice));
    for (i = 0; i < TB_SIZE; i++) {
        block.x = tb_sizes[i];
        for (j = 0; j < TMP_SIZE; j++) {
            grid.x = tmp_sizes[j];
            // grid.x = NMP * (tmp_sizes[j] / block.x);
            if (grid.x == 0) {
                times[i][j] = 0.0f;
                continue;
            }
            times[i][j] = mclock();

#if OPERATION == COPY
            gpu_COPY<<<grid, block>>>(d_s);
#elif OPERATION == SCALE
            gpu_SCALE<<<grid, block>>>(d_s);
#elif OPERATION == ADD
            gpu_ADD<<<grid, block>>>(d_s);
#elif OPERATION == TRIAD
            gpu_TRIAD<<<grid, block>>>(d_s);
#endif
            cuda_error_e(cudaThreadSynchronize());
            times[i][j] = mclock() - times[i][j];
        }
    }

#if OPERATION == COPY
    golden_COPY(h_s);
#elif OPERATION == SCALE
    golden_SCALE(h_s);
#elif OPERATION == ADD
    golden_ADD(h_s);
#elif OPERATION == TRIAD
    golden_TRIAD(h_s);
#endif

    check(h_s, d_s);
    printf("%6s ", "grid");
    for (i = 0; i < TB_SIZE; i++) {
        printf("%7s %8s %9d %9d", "t/mp", "threads", tb_sizes[i], tb_sizes[i]);
    }
    printf("\n");

    for (j = 0; j < TMP_SIZE; j++) {
        printf("%6d ", tmp_sizes[j]);
        for (i = 0; i < TB_SIZE; i++) {
            // size_t gd = (tmp_sizes[j] / tb_sizes[i]) * NMP;
            size_t gd = (tmp_sizes[j] / NMP) * tb_sizes[i];

            if (times[i][j] == 0.0f)
                printf("%7d %8d %9.2f %9s ", gd, tb_sizes[i]*tmp_sizes[j], (times[i][j]*1e6), "-");
            else
                printf("%7d %8d %9.2f %9.2f ", gd, tb_sizes[i]*tmp_sizes[j], (times[i][j]*1e6),
                       SIZEs/(times[i][j]*1e6));
        }
        printf("\n");
    }

    return 0;
}


/********** COPY KERNELs **********/

__global__ void
gpu_COPY(stream s)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n;
    int delta;
    int nt = s.size;
    float *a = s.a;
    float *b = s.b;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < nt; n += delta) {
        a[n] = b[n];
    }

    return;
}

void
golden_COPY(stream s)
{
    int i;

    for (i = 0; i < s.size; i++) {
        s.a[i] = s.b[i];
    }
    return;
}


void
golden_SCALE(stream s)
{
    int i;

    for (i = 0; i < s.size; i++) {
        s.c[i] = s.k * s.b[i];
    }
    return;
}

__global__ void
gpu_SCALE(stream s)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n;
    int delta;
    DTYPE lk = s.k;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < s.size; n += delta) {
        s.c[n] = lk * s.a[n];
    }

    return;
}

/********** ADD KERNELs **********/
__global__ void
gpu_ADD(stream s)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n;
    int delta;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < s.size; n += delta) {
        s.c[n] = s.a[n] + s.b[n];
    }

    return;
}

void
golden_ADD(stream s)
{
    int i;

    for (i = 0; i < N; i++) {
        s.c[i] = s.a[i] + s.b[i];
    }
    return;
}

__global__ void
gpu_TRIAD(stream s)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n;
    int delta;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < s.size; n += delta) {
        s.c[n] = s.a[n] + s.k * s.b[n];
    }

    return;
}

void
golden_TRIAD(stream s)
{
    int i;

    for (i = 0; i < N; i++) {
        s.c[i] = s.a[i] + s.k * s.b[i];
    }
    return;
}


/* vim: set ft=cpp: */
/* EOF */
B.2 Burgers equation solver
B.2.1 Linear Algebra
Listing B.5: sgetrs routine
#include "cuda_lapack.h"
#include "aux.h"


extern "C" int
cuda_sgetrs(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA,
            const int N, const int NRHS, const float *A, const int lda, const int *ipiv,
            float *B, const int ldb)
{
    char NOTRAN;
    int _nrows, _ncols;
    const float ONE = 1.0f;
    int info;


    _nrows = (Order == CblasRowMajor) ? N : lda;
    _ncols = (Order == CblasColMajor) ? lda : N;


    info = 0;
    NOTRAN = (TransA == CblasTrans) ? 0 : 1;
    if (TransA != CblasNoTrans && TransA != CblasTrans && TransA != CblasConjTrans) {
        info = -1;
    }
    else if (_ncols < 0) {
        info = -2;
    }
    else if (NRHS < 0) {
        info = -3;
    }
    else if (_nrows < max(1, _ncols)) {
        info = -5;
    }
    else if (ldb < max(1, _nrows)) {
        info = -8;
    }

    if (info != 0) {
        return info;
    }

    if (_nrows == 0 || NRHS == 0) {
        return info;
    }


    if (Order == CblasRowMajor) {
        if (NOTRAN) {
            cuda_strsm(Order, CblasLeft, CblasLower, CblasTrans, CblasNonUnit, _nrows, NRHS, ONE, A,
                       lda, B, ldb);
            cuda_strsm(Order, CblasLeft, CblasUpper, CblasTrans, CblasUnit, _nrows, NRHS, ONE, A, lda,
                       B, ldb);
            cuda_slaswp(CblasColMajor, NRHS, B, ldb, 0, _nrows-1, ipiv, -1);
        }
        else {
            cuda_slaswp(CblasColMajor, NRHS, B, ldb, 0, _nrows-1, ipiv, 1);
            cuda_strsm(Order, CblasLeft, CblasUpper, CblasNoTrans, CblasUnit, _nrows, NRHS, ONE, A, lda,
                       B, ldb);
            cuda_strsm(Order, CblasLeft, CblasLower, CblasNoTrans, CblasNonUnit, _nrows, NRHS, ONE, A,
                       lda, B, ldb);
        }
    }
    else {
        if (NOTRAN) {
            cuda_slaswp(CblasColMajor, NRHS, B, ldb, 0, _nrows-1, ipiv, 1);
            cuda_strsm(Order, CblasLeft, CblasLower, CblasNoTrans, CblasUnit, _nrows, NRHS, ONE, A, lda,
                       B, ldb);
            cuda_strsm(Order, CblasLeft, CblasUpper, CblasNoTrans, CblasNonUnit, _nrows, NRHS, ONE, A,
                       lda, B, ldb);
        }
        else {
            cuda_strsm(Order, CblasLeft, CblasUpper, CblasTrans, CblasNonUnit, _nrows, NRHS, ONE, A,
                       lda, B, ldb);
            cuda_strsm(Order, CblasLeft, CblasLower, CblasTrans, CblasUnit, _nrows, NRHS, ONE, A, lda,
                       B, ldb);
            cuda_slaswp(CblasColMajor, NRHS, B, ldb, 0, _nrows-1, ipiv, -1);
        }
    }
    return info;
}

/* vim: set ft=cpp tw=78 ts=4: */
/* EOF */
Listing B.6: sgetri routine
#include "cuda_lapack.h"
#include "aux.h"


void __global__
create_identity(float *A, int N)
{
    int d1 = blockDim.x*blockDim.y;
    int d2 = gridDim.x*gridDim.y*d1;
    int tid = threadIdx.x + blockIdx.x*d1;
    int n;

    for (n = tid; n < N; n += d2) {
        *(A + (N+1)*n) = 1.0f;
    }

    return;
}


extern "C" int
cuda_sgetri(const int N, float *A, int *ipiv)
{
    int info = 0;
    float *tA;
    dim3 block, grid;

    cudaMalloc((void**) &tA, N*N*sizeof(float));
    cudaMemcpy(tA, A, N*N*sizeof(float), cudaMemcpyDeviceToDevice);
    cudaMemset(A, 0, N*N*sizeof(float));

    block.x = 64;
    grid.x = N / block.x;
    if (grid.x == 0) grid.x++;

    create_identity<<<grid, block>>>(A, N);
    cudaThreadSynchronize();

    cuda_sgetrs(CblasRowMajor, CblasNoTrans, N, N, tA, N, ipiv, A, N);
    cudaFree(tA);

    return info;
}

/* vim: set ft=cpp tw=78 ts=4: */
/* EOF */
Listing B.7: slaswp routine
#include <stdio.h>
#include <stdlib.h>

#include "cuda_lapack.h"
#include "aux.h"


#define NB 64

/* CUDA HELPERS */

texture<int, 1, cudaReadModeElementType> tex;

// row major version
static __global__ void
_slaswp_d_rm(const int N, float *A, const int lda, const int K1, const int K2, const int *IPIV, const int incx)
{
    int d_j = blockDim.x*blockDim.y;
    int i, j, col, row;
    float tmp;
    int k1, k2, inc, ix, ix0;
    int tid = threadIdx.x + blockDim.x*threadIdx.y;

    if (incx > 0) {
        k1 = K1;
        k2 = K2 + 1;
        inc = 1;
        ix0 = k1;
    }
    else if (incx < 0) {
        k1 = K2;
        k2 = K1 - 1;
        inc = -1;
        ix0 = -K2*incx;
    }

    for (col = tid; col < lda; col += (d_j*blockIdx.x)) {
        /* DANGEROUS CONDITION! DO NOT BREAK: i != k2 */
        for (i = k1, ix = ix0; i != k2; i += inc, ix += incx) {
            row = IPIV[ix];
            // row = tex1Dfetch(tex, ix);
            if (row != i) {
                tmp = *(A + row*lda + col);
                *(A + row*lda + col) = *(A + i*lda + col);
                *(A + i*lda + col) = tmp;
            }
        }
    }

    return;
}

// column major version
static __global__ void
_slaswp_d_cm(const int N, float *A, const int lda, const int K1, const int K2, const int *IPIV, const int incx)
{
    int d_j = blockDim.x*blockDim.y;
    int i, j, col, row;
    float tmp;
    int k1, k2, inc, ix, ix0;

    if (incx > 0) {
        k1 = K1;
        k2 = K2 + 1;
        inc = 1;
        ix0 = k1;
    }
    else if (incx < 0) {
        k1 = K2;
        k2 = K1 - 1;
        inc = -1;
        ix0 = -K2*incx;
    }

    for (j = 0; j < N; j += d_j) {
        col = j + threadIdx.x + blockDim.x*threadIdx.y;
        if (col >= N) {
            return;
        }
        /* DANGEROUS CONDITION! DO NOT BREAK: i != k2 */
        for (i = k1, ix = ix0; i != k2; i += inc, ix += incx) {
            row = IPIV[ix];
            if (row != i) {
                tmp = *(A + col*lda + row);
                *(A + col*lda + row) = *(A + col*lda + i);
                *(A + col*lda + i) = tmp;
            }
        }
    }

    return;
}

extern "C" void
cuda_slaswp(const enum CBLAS_ORDER order, const int N, float *A, const int lda, const int K1, const int K2, const int *IPIV, int INCX)
{
    dim3 block_dim, grid;
    void (*_slaswp_d)(const int, float *, const int, const int, const int, const int *, const int);
    int row_major;
    int m, n;

    row_major = (order == CblasRowMajor) ? 1 : 0;
    n = (row_major) ? N : lda;
    m = (row_major) ? lda : N;
    _slaswp_d = (row_major) ? _slaswp_d_rm : _slaswp_d_cm;


    // block_dim.x = imin((m / 64 + 1) * 64, 512);
    block_dim.x = 64;
    grid.x = (m / block_dim.x);
    if (grid.x == 0) grid.x = 1;
    if (grid.x > 30*(4096 / block_dim.x)) grid.x = 30*(4096 / block_dim.x);


    if ((K1 < 0 || K2 >= n || K1 > K2) && INCX > 0) {
        fprintf(stderr, "[arg error] limits K1 or K2 out of bounds: (K1,K2)=(%d,%d)\n", K1, K2);
        return;
    }

    if ((K2 < 0 || K1 >= n || K1 > K2) && INCX < 0) {
        fprintf(stderr, "[arg error] limits K1 or K2 out of bounds: (K1,K2)=(%d,%d)\n", K1, K2);
        return;
    }
    // cudaBindTexture(0, tex, IPIV, cudaCreateChannelDesc(8*sizeof(int), 0, 0, 0,
    //                 cudaChannelFormatKindFloat), N*sizeof(int));
131 _ s l a s w p _ d <<<g r i d , b l o c k _ d i m >>>(N , A , l d a , K1 , K2 , I P I V , I N C X ) ;
132 c u d a _ e r r o r ( c u d a T h r e a d S y n c h r o n i z e ( ) ) ;
133 return ;
134 }
135
136
137
138
139 /∗ v im : s e t f t =cpp : ∗/
140 /∗ EOF ∗/
B.2.2 Numerical Methods
Listing B.8: compact schemes header file

#ifndef RK4_H
#define RK4_H



struct _rk4 {
    float dt;
    int (*F)(int, float*, float*);
};

typedef struct _rk4 RK4;

int rk4_init(RK4*, float, int (*F)(int, float*, float*));
int rk4_integrate(RK4*, int, float*, float*);


#endif
Listing B.9: RK4 header file

#ifndef RK4_H
#define RK4_H



struct _rk4 {
    float dt;
    int (*F)(int, float*, float*);
};

typedef struct _rk4 RK4;

int rk4_init(RK4*, float, int (*F)(int, float*, float*));
int rk4_integrate(RK4*, int, float*, float*);


#endif
Listing B.10: compact schemes CUDA implementation

#include <string.h>
#include <stdlib.h>


#include <clapack.h>
#include "mutil.h"
#include "compact_schemes_cuda.h"




#define FST_DER 0
#define SND_DER 5

#define ALFAC 0
#define BETAC 1
#define AC    2
#define BC    3
#define CC    4
#define DC    BETAC


#define ALFA2C 5
#define BETA2C 6
#define A2C    7
#define B2C    8
#define C2C    9
#define D2C    BETA2C
#define E2C    10

int _compact_calc_coef(Compact*, int, float[]);
int _compact_calc_coef2(Compact*, int, float[]);
int _compact_init_A(Compact*);
int _compact_init_A2(Compact*);
int _compact_init_B(Compact*);
int _compact_init_B2(Compact*);


/* Compact related functions */


#include <cuda.h>
#include <cuda_runtime.h>

#include <aux.h>
#include <cuda_blas.h>
#include <cuda_lapack.h>
int compact_init(Compact *self, float h, int N, int order, float *coef)
{
    float *f_p, *df_p, *A, *B, *A2, *B2;
    int *i_p, *di_p, *Apivots, *A2pivots;

    f_p = df_p = NULL;
    i_p = di_p = NULL;

    self->h = h;
    self->N = N;

    f_p = (float*) calloc(4*N*N, sizeof(float));
    i_p = (int*) calloc(2*N, sizeof(int));
    cuda_error_e(cudaMalloc((void**)&df_p, 4*N*N*sizeof(float)));
    cuda_error_e(cudaMalloc((void**)&di_p, 2*N*sizeof(int)));
    if (f_p == NULL || i_p == NULL || df_p == NULL || di_p == NULL) {
        return -1;
    }

    self->A  = f_p;
    self->B  = f_p + N*N;
    self->A2 = f_p + N*N*2;
    self->B2 = f_p + N*N*3;
    self->Apivots  = i_p;
    self->A2pivots = i_p + N;

    A  = df_p;
    B  = df_p + N*N;
    A2 = df_p + N*N*2;
    B2 = df_p + N*N*3;

    Apivots  = di_p;
    A2pivots = di_p + N;

    _compact_calc_coef(self, order, coef);
    _compact_init_A(self);

#ifdef INVERSE
    cuda_error_e(cudaMemcpy(A, self->A, N*N*sizeof(float), cudaMemcpyHostToDevice));
    cuda_error_e(cudaMemcpy(Apivots, self->Apivots, N*sizeof(int), cudaMemcpyHostToDevice));
    cuda_error_e(cuda_sgetri(N, A, Apivots));
#else
    cuda_error_e(cudaMemcpyAsync(A, self->A, N*N*sizeof(float), cudaMemcpyHostToDevice, 0));
    cuda_error_e(cudaMemcpyAsync(Apivots, self->Apivots, N*sizeof(int), cudaMemcpyHostToDevice, 0));
#endif

    _compact_init_B(self);
    cuda_error_e(cudaMemcpyAsync(B, self->B, N*N*sizeof(float), cudaMemcpyHostToDevice, 0));

    _compact_calc_coef2(self, order, &(coef[SND_DER]));
    _compact_init_A2(self);
#ifdef INVERSE
    cuda_error_e(cudaMemcpy(A2, self->A2, N*N*sizeof(float), cudaMemcpyHostToDevice));
    cuda_error_e(cudaMemcpy(A2pivots, self->A2pivots, N*sizeof(int), cudaMemcpyHostToDevice));
    cuda_error_e(cuda_sgetri(N, A2, A2pivots));
#else
    cuda_error_e(cudaMemcpyAsync(A2, self->A2, N*N*sizeof(float), cudaMemcpyHostToDevice, 0));
    cuda_error_e(cudaMemcpyAsync(A2pivots, self->A2pivots, N*sizeof(int), cudaMemcpyHostToDevice, 0));
#endif
    _compact_init_B2(self);
    cuda_error_e(cudaMemcpyAsync(B2, self->B2, N*N*sizeof(float), cudaMemcpyHostToDevice, 0));



    self->A  = df_p;
    self->B  = df_p + N*N;
    self->A2 = df_p + N*N*2;
    self->B2 = df_p + N*N*3;

    self->Apivots  = di_p;
    self->A2pivots = di_p + N;

    cuda_error_e(cudaThreadSynchronize());
    free(i_p);

    return 0;
}


/*
 * Calculates first derivative
 */
int
compact_derivative(Compact *self, float *f, float *df_b, float *f_b,
                   float *Y)
{
    static float *tmp1 = NULL;
    float alpha = 1.0f;
    float beta  = 0.0f;
    int solver_m = 1;
    int solver_n = self->N;
    int solver_info = 0;

    if (tmp1 == NULL) {
        cuda_error(cudaMalloc((void**)&tmp1, solver_n*sizeof(float)));
    }


    /* SOLVE */

    cuda_sgemv(CblasColMajor, CblasNoTrans, solver_n, solver_n, alpha,
               self->B, solver_n, f, solver_m, beta, Y, solver_m);  // tmp2, solver_m);
#ifdef INVERSE
    cuda_sgemv(CblasRowMajor, CblasNoTrans, solver_n, solver_n, alpha,
               self->A, solver_n, Y, solver_m, beta, Y, solver_m);  // tmp2, solver_m);
#else
    solver_info = cuda_sgetrs(CblasRowMajor, CblasNoTrans, solver_n, solver_m,
                              self->A, solver_n, self->Apivots, Y, solver_n);  // tmp2, solver_n);

    DPRINT("solver return value: %d\n", solver_info);
#endif

    return 0;
}
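In matrix terms, `compact_derivative` evaluates the implicit relation set up by `_compact_init_A` and `_compact_init_B`: the derivative vector is defined by

```latex
A\,f' = B\,f \qquad\Longrightarrow\qquad f' = A^{-1} B\,f
```

where `cuda_sgemv` first forms the right-hand side $g = Bf$, and `cuda_sgetrs` then back-substitutes through the LU factors of $A$ computed once by `clapack_sgetrf` at initialization (the `INVERSE` build instead multiplies by a precomputed $A^{-1}$ with a second `cuda_sgemv`).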




/*
 * Calculates the coefficients for the algorithm.
 * Receives:
 *  * the intended order of the error;
 *  * an input array of coefficients (all negative entries are
 *    considered uninitialized);
 *  * an output array for the resulting coefficients.
 */

int compact_get_coef(int order, float x[5], float y[5])
{

    float tmp[5];


    switch (order) {
    case 4:
        if (x[ALFAC] < 0.0f) {
            tmp[ALFAC] = 1./3.;
        } else { tmp[ALFAC] = x[ALFAC]; }
        if (x[BETAC] < 0.0f) {
            tmp[BETAC] = 0.0;
        } else { tmp[BETAC] = x[BETAC]; }
        if (x[CC] < 0.0f) {
            tmp[CC] = 0.0f;
        } else { tmp[CC] = x[CC]; }
        y[ALFAC] = tmp[ALFAC];
        y[BETAC] = tmp[BETAC];
        y[CC] = tmp[CC];
        y[BC] = 1./3. * (4.*tmp[ALFAC] - 1. +
                22.0*tmp[BETAC] - 8.0*tmp[CC]);
        y[AC] = 1./3. * (2.*tmp[ALFAC] + 4. +
                16.0*tmp[BETAC] - 5.0*tmp[CC]);
        break;
    case 6:
        if (x[ALFAC] < 1.0) {
            tmp[ALFAC] = 3.0f/8.0f;
        } else { tmp[ALFAC] = x[ALFAC]; }
        if (x[BETAC] < 1.0f) {
            tmp[BETAC] = 0.0f;
        } else { tmp[BETAC] = x[BETAC]; }
        y[ALFAC] = tmp[ALFAC];
        y[BETAC] = tmp[BETAC];
        y[CC] = 1.0f/10.0f * (1.0f - 3.0f*tmp[ALFAC] +
                12.0*tmp[BETAC]);
        y[BC] = 1.0f/15.0f * (-9.0f + 32.0f*tmp[ALFAC] +
                62.0*tmp[BETAC]);
        y[AC] = 1.0f/6.0f * (9.0f + tmp[ALFAC] -
                20.*tmp[BETAC]);
        break;
    default:
        return -1;
    }

    return 0;
}

int _compact_calc_coef(Compact *self, int order, float coef[5])
{
    float tmp;

    compact_get_coef(order, coef, self->coef);

    self->_coef[AC] = self->coef[AC] / (2.0*self->h);
    self->_coef[BC] = self->coef[BC] / (4.0*self->h);
    self->_coef[CC] = self->coef[CC] / (6.0*self->h);

    tmp = 3.0f;
    self->boundary_coef[ALFAC] = tmp;
    self->boundary_coef[AC] = -1.0f * (11.0f + 2.0f*tmp) / 6.0f;
    self->boundary_coef[BC] = (6.0f - tmp) / 2;
    self->boundary_coef[CC] = (2.0f*tmp - 3.0f) / 2.0f;
    self->boundary_coef[DC] = (2.0f - tmp) / 6.0f;

    return 0;
}


/*
 * Initializes the algorithm matrix A
 */
int _compact_init_A(Compact *self)
{
    int N = self->N;
    float *A = NULL;
    int *pivots = NULL;
    float coef[5] = {1.0f/4.0f, -1.0f, -1.0f, -1.0f, -1.0f};
    int i, status = -1;

    A = self->A;
    pivots = self->Apivots;


    /**** Boundary node: f'1 + alpha f'2 = a f1 + b f2 + c f3 + d f4 ***/
    MSET(A, N-1, N-1, N, 1.0f);
    MSET(A, N-1, N-2, N, self->boundary_coef[ALFAC]);

    /**** Boundary node: order reduction: tridiagonal w/ 4th order errors ***/
    compact_get_coef(4, coef, coef);

    MSET(A, 0, 0, N, 1.0f);
    MSET(A, 0, 1, N, coef[ALFAC]);

    MSET(A, 1, 0, N, coef[ALFAC]);
    MSET(A, 1, 1, N, 1.0f);
    MSET(A, 1, 2, N, coef[ALFAC]);
    MSET(A, N-2, N-3, N, coef[ALFAC]);
    MSET(A, N-2, N-2, N, 1.0f);
    MSET(A, N-2, N-1, N, coef[ALFAC]);

    /**** Boundary node: order reduction: tridiagonal w/ 6th order errors ***/
    coef[ALFAC] = 1.0f/3.0f;
    coef[BETAC] = -1.0f; coef[AC] = -1.0f; coef[BC] = -1.0f; coef[CC] = -1.0f;
    compact_get_coef(4, coef, coef);

    MSET(A, 1, 1, N, coef[ALFAC]);
    MSET(A, 2, 2, N, 1.0f);
    MSET(A, 2, 3, N, coef[ALFAC]);
    MSET(A, N-3, N-4, N, coef[ALFAC]);
    MSET(A, N-3, N-3, N, 1.0f);
    MSET(A, N-3, N-2, N, coef[ALFAC]);


    for (i = 3; i < N-3; i++) {
        MSET(A, i, i-2, N, self->coef[BETAC]);
        MSET(A, i, i-1, N, self->coef[ALFAC]);
        MSET(A, i, i,   N, 1.0f);
        //MSET(A, i, i+1, N, self->coef[ALFAC]);
        //MSET(A, i, i+2, N, self->coef[BETAC]);
    }
    status = clapack_sgetrf(CblasRowMajor, N, N, A, N, pivots);
    //DPRINT("factorization return value: %d\n", status);
    if (status != 0) {
        return status;
    }
    return 0;
}

int _compact_init_B(Compact *self)
{
    int N = self->N;
    int i;
    float *B = NULL;
    float coef[5] = {1.0f/4.0f, -1.0f, -1.0f, -1.0f, -1.0f};

    B = self->B;

    /**** Boundary node: f'1 + alpha f'2 = a f1 + b f2 + c f3 + d f4 ***/
    MSET(B, N-1, N-1, N, -self->boundary_coef[AC] / self->h);
    MSET(B, N-1, N-2, N, -self->boundary_coef[BC] / self->h);
    MSET(B, N-1, N-3, N, -self->boundary_coef[CC] / self->h);
    MSET(B, N-1, N-4, N, -self->boundary_coef[DC] / self->h);

    /**** Boundary node: order reduction: tridiagonal w/ 4th order errors ***/
    compact_get_coef(4, coef, coef);

    MSET(B, 0, 1, N, coef[AC] / (2.0f*self->h));
    MSET(B, 0, 0, N, 0.0f);

    MSET(B, 1, 0, N, -coef[AC] / (2.0f*self->h));
    MSET(B, 1, 1, N, 0.0f);
    MSET(B, 1, 2, N, coef[AC] / (2*self->h));
    MSET(B, N-2, N-3, N, -coef[AC] / (2.0f*self->h));
    MSET(B, N-2, N-2, N, 0.0f);
    MSET(B, N-2, N-1, N, coef[AC] / (2.0f*self->h));


    /**** Boundary node: order reduction: tridiagonal w/ 6th order errors ***/
    coef[ALFAC] = 1.0f/3.0f;
    coef[BETAC] = -1.0f; coef[AC] = -1.0f; coef[BC] = -1.0f; coef[CC] = -1.0f;
    compact_get_coef(4, coef, coef);
    MSET(B, 2, 0, N, -coef[BC] / (4.0f*self->h));
    MSET(B, 2, 1, N, -coef[AC] / (2.0f*self->h));
    MSET(B, 2, 2, N, 0.0f);
    MSET(B, 2, 3, N, coef[AC] / (2.0f*self->h));
    MSET(B, 2, 4, N, coef[BC] / (4.0f*self->h));
    MSET(B, N-3, N-5, N, -coef[BC] / (4.0f*self->h));
    MSET(B, N-3, N-4, N, -coef[AC] / (2.0f*self->h));
    MSET(B, N-3, N-3, N, 0.0f);
    MSET(B, N-3, N-2, N, coef[AC] / (2.0f*self->h));
    MSET(B, N-3, N-1, N, coef[BC] / (4.0f*self->h));

    for (i = 3; i < N-3; i++) {
        MSET(B, i, i-3, N, -self->_coef[CC]);
        MSET(B, i, i-2, N, -self->_coef[BC]);
        MSET(B, i, i-1, N, -self->_coef[AC]);
        MSET(B, i, i,   N, 0.0f);
        MSET(B, i, i+1, N, self->_coef[AC]);
        MSET(B, i, i+2, N, self->_coef[BC]);
        MSET(B, i, i+3, N, self->_coef[CC]);
    }


    return 0;
}




/************************* 2nd derivative stuff *************************/

/*
 * Calculates the coefficients for the algorithm (2nd derivative).
 * Receives:
 *  * the intended order of the error;
 *  * an input array of 5 coefficients (all negative entries are
 *    considered uninitialized);
 *  * an output array for the resulting coefficients.
 */


int compact_get_coef2(int order, float x[5], float y[5])
{
    float tmp[5];

    switch (order) {
    case 4:
        if (x[ALFAC] < 0.0f) {
            tmp[ALFAC] = 2./11.;
        } else { tmp[ALFAC] = x[ALFAC]; }
        if (x[BETAC] < 0.0f) {
            tmp[BETAC] = 0.0;
        } else { tmp[BETAC] = x[BETAC]; }
        if (x[CC] < 0.0f) {
            tmp[CC] = 0.0f;
        } else { tmp[CC] = x[CC]; }
        y[ALFAC] = tmp[ALFAC];
        y[BETAC] = tmp[BETAC];
        y[CC] = tmp[CC];

        y[AC] = 1./3. * (4.0f - 4.0f*tmp[ALFAC]
                - 40.0*tmp[BETAC] + 5.0*tmp[CC]);
        y[BC] = 1./3. * (-1.0f + 10.0f*tmp[ALFAC]
                + 46.0f*tmp[BETAC] - 8.0*tmp[CC]);

        break;
    case 6:
        if (x[ALFAC] < 1.0) {
            tmp[ALFAC] = 2.0f/11.0f;
        } else { tmp[ALFAC] = x[ALFAC]; }
        if (x[BETAC] < 1.0f) {
            tmp[BETAC] = 0.0f;
        } else { tmp[BETAC] = x[BETAC]; }
        y[ALFAC] = tmp[ALFAC];
        y[BETAC] = tmp[BETAC];

        y[AC] = (6.0f - 9.0f*tmp[ALFAC]
                - 12.0f*tmp[BETAC]) / 4.0f;
        y[BC] = (-3.0f + 24.0f*tmp[ALFAC]
                - 6.0*tmp[BETAC]) / 5.0f;
        y[CC] = (2.0f - 11.0f*tmp[ALFAC]
                + 124.0*tmp[BETAC]) / 20.0f;
        break;
    default:
        return -1;
    }
    return 0;
}

int _compact_calc_coef2(Compact *self, int order, float coef[5])
{
    float tmp;

    compact_get_coef2(order, coef, &(self->coef[5]));

    self->_coef[A2C] = self->coef[A2C] / (self->h * self->h);
    self->_coef[B2C] = self->coef[B2C] / (4.0f * self->h * self->h);
    self->_coef[C2C] = self->coef[C2C] / (9.0f * self->h * self->h);

    tmp = 0.0f;
    self->boundary_coef[ALFA2C] = tmp;
    self->boundary_coef[A2C] = (11.0f*tmp + 35.0f) / 12.0f;
    self->boundary_coef[B2C] = -(5.0f*tmp + 26.0f) / 3.0f;
    self->boundary_coef[C2C] = (tmp + 19.0f) / 2.0f;
    self->boundary_coef[D2C] = (tmp - 14.0f) / 3.0f;
    self->boundary_coef[E2C] = (11.0f - tmp) / 12.0f;

    return 0;
}



/*
 * Initializes the algorithm matrix A
 */
int _compact_init_A2(Compact *self)
{
    int N = self->N;
    float *A = NULL;
    int *pivots = NULL;
    float coef[5] = {1.0f/10.0f, -1.0f, -1.0f, -1.0f, -1.0f};
    int i, status = -1;

    A = self->A2;
    pivots = self->A2pivots;

    /**** Boundary node: f'1 + alpha f'2 = a f1 + b f2 + c f3 + d f4 ***/
    MSET(A, N-1, N-1, N, 1.0f);
    MSET(A, N-1, N-2, N, self->boundary_coef[ALFA2C]);

    /**** Boundary node: order reduction: tridiagonal w/ 4th order errors ***/
    compact_get_coef2(4, coef, coef);

    MSET(A, 0, 0, N, 1.0f);
    MSET(A, 0, 1, N, coef[ALFAC]);

    MSET(A, 1, 0, N, coef[ALFAC]);
    MSET(A, 1, 1, N, 1.0f);
    MSET(A, 1, 2, N, coef[ALFAC]);
    MSET(A, N-2, N-3, N, coef[ALFAC]);
    MSET(A, N-2, N-2, N, 1.0f);
    MSET(A, N-2, N-1, N, coef[ALFAC]);

    /**** Boundary node: order reduction: tridiagonal w/ 6th order errors ***/
    coef[ALFAC] = 2.0f/11.0f;
    coef[BETAC] = -1.0f; coef[AC] = -1.0f; coef[BC] = -1.0f; coef[CC] = -1.0f;
    compact_get_coef(4, coef, coef);

    MSET(A, 2, 1, N, coef[ALFAC]);
    MSET(A, 2, 2, N, 1.0f);
    MSET(A, 2, 3, N, coef[ALFAC]);
    MSET(A, N-3, N-4, N, coef[ALFAC]);
    MSET(A, N-3, N-3, N, 1.0f);
    MSET(A, N-3, N-2, N, coef[ALFAC]);


    for (i = 3; i < N-3; i++) {
        MSET(A, i, i-2, N, self->coef[BETA2C]);
        MSET(A, i, i-1, N, self->coef[ALFA2C]);
        MSET(A, i, i,   N, 1.0f);
        MSET(A, i, i+1, N, self->coef[ALFA2C]);
        MSET(A, i, i+2, N, self->coef[BETA2C]);
    }

    status = clapack_sgetrf(CblasRowMajor, N, N, A, N, pivots);
    if (status != 0) {
        return status;
    }


    return 0;
}

int _compact_init_B2(Compact *self)
{
    int N = self->N;
    int i;
    float *B = NULL;
    float coef[5] = {1.0f/10.0f, -1.0f, -1.0f, -1.0f, -1.0f};
    const float h2 = self->h * self->h;

    B = self->B2;

    /**** Boundary node: f'1 + alpha f'2 = a f1 + b f2 + c f3 + d f4 ***/
    MSET(B, N-1, N-1, N, -self->boundary_coef[A2C] / h2);
    MSET(B, N-1, N-2, N, -self->boundary_coef[B2C] / h2);
    MSET(B, N-1, N-3, N, -self->boundary_coef[C2C] / h2);
    MSET(B, N-1, N-4, N, -self->boundary_coef[D2C] / h2);
    MSET(B, N-1, N-5, N, -self->boundary_coef[E2C] / h2);

    /**** Boundary node: order reduction: tridiagonal w/ 4th order errors ***/
    compact_get_coef(4, coef, coef);

    MSET(B, 0, 0, N, -2.0f*coef[AC] / h2);
    MSET(B, 0, 1, N, coef[AC] / h2);

    MSET(B, 1, 0, N, coef[AC] / h2);
    MSET(B, 1, 1, N, -2.0f*coef[AC] / h2);
    MSET(B, 1, 2, N, coef[AC] / h2);
    MSET(B, N-2, N-3, N, coef[AC] / h2);
    MSET(B, N-2, N-2, N, -2.0f*coef[AC] / h2);
    MSET(B, N-2, N-1, N, coef[AC] / h2);


    /**** Boundary node: order reduction: tridiagonal w/ 6th order errors ***/
    coef[ALFAC] = 2.0f/11.0f;
    coef[BETAC] = -1.0f; coef[AC] = -1.0f; coef[BC] = -1.0f; coef[CC] = -1.0f;
    compact_get_coef(4, coef, coef);
    MSET(B, 2, 0, N, coef[BC] / (4.0f*h2));
    MSET(B, 2, 1, N, coef[AC] / h2);
    MSET(B, 2, 2, N, -2.0f * (coef[AC] + coef[BC]/4.0f) / h2);
    MSET(B, 2, 3, N, coef[AC] / h2);
    MSET(B, 2, 4, N, coef[BC] / (4.0f*h2));
    MSET(B, N-3, N-5, N, coef[BC] / (4.0f*h2));
    MSET(B, N-3, N-4, N, coef[AC] / h2);
    MSET(B, N-3, N-3, N, -2.0f * (coef[AC] + coef[BC]/4.0f) / h2);
    MSET(B, N-3, N-2, N, coef[AC] / h2);
    MSET(B, N-3, N-1, N, coef[BC] / (4.0f*h2));

    /* TODO: test if B is NULL */
    for (i = 3; i < N-3; i++) {
        MSET(B, i, i-3, N, self->_coef[C2C]);
        MSET(B, i, i-2, N, self->_coef[B2C]);
        MSET(B, i, i-1, N, self->_coef[A2C]);
        MSET(B, i, i,   N, -2.0f/h2 * (self->coef[A2C] + self->coef[B2C]/4.0f + self->coef[C2C]/9.0f));
        MSET(B, i, i+1, N, self->_coef[A2C]);
        MSET(B, i, i+2, N, self->_coef[B2C]);
        MSET(B, i, i+3, N, self->_coef[C2C]);
    }

    return 0;
}





/*
 * Calculates second derivative
 */
int compact_derivative2(Compact *self, float *f, float *df_b, float *f_b, float *Y)
{
    static float *tmp1 = NULL;
    float alpha = 1.0f;
    float beta  = 0.0f;
    int solver_m = 1;
    int solver_n = self->N;
    int solver_info = 0;

    if (tmp1 == NULL) {
        cuda_error(cudaMalloc((void**)&tmp1, solver_n*sizeof(float)));
    }

    /* SOLVE */
    cuda_sgemv(CblasColMajor, CblasNoTrans, solver_n, solver_n, alpha,
               self->B2, solver_n, f, solver_m, beta, Y, solver_m);  // tmp2, solver_m);
#ifdef INVERSE
    cuda_sgemv(CblasRowMajor, CblasNoTrans, solver_n, solver_n, alpha,
               self->A2, solver_n, Y, solver_m, beta, Y, solver_m);  // tmp2, solver_m);
#else
    solver_info = cuda_sgetrs(CblasRowMajor, CblasNoTrans, solver_n, solver_m,
                              self->A2, solver_n, self->A2pivots, Y, solver_n);  // tmp2, solver_n);


    DPRINT("solver return value: %d\n", solver_info);
#endif

    return 0;
}




/* vi: set foldmethod=syntax tw=100: */
/* EOF */
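The interior stencils assembled by `_compact_init_A`/`_compact_init_B` and their second-derivative counterparts follow the classical compact (Padé) finite-difference relations, which is what makes the divisions by $2h$, $4h$, $6h$ in `_compact_calc_coef` and by $h^2$, $4h^2$, $9h^2$ in `_compact_calc_coef2` explicit. For an interior node $i$:

```latex
% First derivative:
\beta f'_{i-2} + \alpha f'_{i-1} + f'_i + \alpha f'_{i+1} + \beta f'_{i+2}
  = a\,\frac{f_{i+1}-f_{i-1}}{2h}
  + b\,\frac{f_{i+2}-f_{i-2}}{4h}
  + c\,\frac{f_{i+3}-f_{i-3}}{6h}

% Second derivative:
\beta f''_{i-2} + \alpha f''_{i-1} + f''_i + \alpha f''_{i+1} + \beta f''_{i+2}
  = a\,\frac{f_{i+1}-2f_i+f_{i-1}}{h^2}
  + b\,\frac{f_{i+2}-2f_i+f_{i-2}}{4h^2}
  + c\,\frac{f_{i+3}-2f_i+f_{i-3}}{9h^2}
```

The diagonal of $B_2$ set inside the interior loop of `_compact_init_B2`, $-\tfrac{2}{h^2}(a + b/4 + c/9)$, is exactly the sum of the $-2f_i$ terms of the second relation.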
Listing B.11: RK4 CUDA implementation

#include "rk4_cuda.h"

#include <cuda.h>
#include <cuda_runtime.h>

#include <aux.h>
#include <cuda_blas.h>
#include <cuda_lapack.h>

int rk4_init(RK4 *self, float dt, int (*F)(int, float*, float*))
{
    if (F == NULL) {
        return -1;
    }
    if (dt <= 0.0f) {
        return -2;
    }
    self->dt = dt;
    self->F = F;
    return 0;
}



int rk4_integrate(RK4 *self, int n, float *input, float *output)
{
    static float *tmp_y = NULL;
    static float *tmp_x = NULL;

    if (tmp_x == NULL) {
        cuda_error(cudaMalloc((void**)&tmp_x, n*sizeof(float)));
    }
    else {
        cuda_error(cudaMemset((void*)tmp_x, 0, n*sizeof(float)));
    }
    if (tmp_y == NULL) {
        cuda_error(cudaMalloc((void**)&tmp_y, n*sizeof(float)));
    }
    else {
        cuda_error(cudaMemset((void*)tmp_y, 0, n*sizeof(float)));
    }


    /* initial stage (k1) */
    self->F(n, input, tmp_y);
    cuda_error(cudaMemcpy(output, tmp_y, n*sizeof(float),
               cudaMemcpyDeviceToDevice));              /* output = k1 */

    /* first middle stage (k2) */
    cuda_error(cudaMemcpy(tmp_x, input, n*sizeof(float),
               cudaMemcpyDeviceToDevice));              /* x0 */
    cublasSaxpy(n, self->dt/2.0f, tmp_y, 1, tmp_x, 1);  /* tmp_x = dt/2 * k1 + x0 */
    self->F(n, tmp_x, tmp_y);                           /* k2 */
    cublasSaxpy(n, 2.0f, tmp_y, 1, output, 1);          /* output = k1 + 2*k2 */


    /* second middle stage */
    cuda_error(cudaMemcpy(tmp_x, input, n*sizeof(float),
               cudaMemcpyDeviceToDevice));              /* x0 */
    cublasSaxpy(n, self->dt/2.0f, tmp_y, 1, tmp_x, 1);  /* tmp_x = dt/2 * k2 + x0 */
    self->F(n, tmp_x, tmp_y);                           /* k3 */
    cublasSaxpy(n, 2.0f, tmp_y, 1, output, 1);          /* output = k1 + 2*k2 + 2*k3 */


    /* last middle stage */
    cuda_error(cudaMemcpy(tmp_x, input, n*sizeof(float),
               cudaMemcpyDeviceToDevice));              /* x0 */
    cublasSaxpy(n, self->dt, tmp_y, 1, tmp_x, 1);       /* tmp_x = dt * k3 + x0 */
    self->F(n, tmp_x, tmp_y);                           /* k4 */
    cublasSaxpy(n, 1.0f, tmp_y, 1, output, 1);          /* output = k1 + 2*k2 + 2*k3 + k4 */


    /* averaging step */
    cuda_error(cudaMemcpy(tmp_x, input, n*sizeof(float),
               cudaMemcpyDeviceToDevice));              /* x0 */
    cublasSswap(n, tmp_x, 1, output, 1);
    cublasSaxpy(n, self->dt/6.0f, tmp_x, 1, output, 1); /* output = x0 + dt/6 * (k1+2*k2+2*k3+k4) */

    return 0;
}




/* EOF */
B.2.3 Application
Listing B.12: Simulation implementation

#define _XOPEN_SOURCE 500

#include <stdio.h>
#include <math.h>
#include <time.h>
#include <malloc.h>
#include <string.h>

#include <mutil.h>
#include <mNumeric.h>

#include <cuda.h>
#include <cuda_runtime.h>

#include <aux.h>
#include <cuda_blas.h>
#include <cuda_lapack.h>

#define pi M_PI
#define BENCH_FNAME "./tmp/gbench_burgers.log"
#define LOG_FNAME "./tmp/gburgers.log"
#define X_MIN 0.0
#define X_MAX 1.0

#define K2 0.1
#define K1 0.3

long int NX;
int NT = 500;
float nu;
const float a = -10.0f;
Compact *CA;
float *df_b, *f_b;

/*
 * linspace(start, stop, num=50, endpoint, retstep)
 */

int linspace(float start, float stop, int num, int endpoint, float *step, float *Y)
{
    float dx;
    int i, n;

    if (endpoint <= 0) {
        n = num + 1;

    }
    else {
        n = num;
    }

    dx = (stop - start) / (n - 1);
    for (i = 0; i < num; i++) {
        Y[i] = start + i*dx;
    }

    *step = dx;
    return 0;
}
64 int F ( int nx , f loat∗ x , f loat∗ y )
65 {
66 stat ic f loat ∗ t m p 1 = N U L L ;
67 const int i n c x = 1;
68
69 i f ( t m p 1 == N U L L ) {
70 c u d a _ e r r o r ( c u d a M a l l o c ( ( void∗∗)&t m p 1 , n x ∗ s izeo f ( f loat ) ) ) ;
71 }
72 else {
73 c u d a _ e r r o r ( c u d a M e m s e t ( ( void∗) t m p 1 , 0 .0 f , n x ∗ s izeo f ( f loat ) ) ) ;
74 }
75
76 c o m p a c t _ d e r i v a t i v e 2 ( CA , x , d f _ b , f _ b , t m p 1 ) ;
77 c o m p a c t _ d e r i v a t i v e ( CA , x , d f _ b , f _ b , y ) ;
78 c u b l a s S s c a l ( nx , a , y , i n c x ) ;
79 c u b l a s S a x p y ( nx , nu , t m p 1 , i n c x , y , i n c x ) ;
80 return 0 ;
81 }
82
83 int f _ u 0 ( int nx , f loat∗x , f loat∗ y )
84 {
85 int i ;
86 f loat t m p ;
87
88 goto s i n u s o i d a l ;
89 for ( i=0; i<n x ; i++){
90 i f ( i == n x /10){
91 y [ i ] = 1 .0 f ;
92 }
93 else{
94 y [ i ] = 0 .0 f ;
95 }
96 }
97 goto e n d ;
98
99 s i n u s o i d a l :
100 for ( i=0; i<n x ; i++){
101 t m p = x [ i ] ;
102 i f ( t m p >= 0.05 && t m p <0.15){
103 y [ i ] = 0 .5 f∗ s i n f ( ( t m p −0.85) ∗2.0∗ p i /0 .1 ) ;
104 }
105 else{
106 y [ i ] = 0 .0 f ;
107 }
108 }
109 goto e n d ;
110 e n d :
111 return 0 ;
112 }
113
114
115
116 #include ”aux . h”
117
118
119 int m a i n ( int a r g c , char ∗ a r g v [ ] )
120 {
121 f loat ∗ xx , ∗ u 0 ;
122 f loat∗ L O G , ∗ h _ l o g , ∗ d _ u 0 ;
123 f loat dx , d t ;
124 int i , j ;
125 C o m p a c t C A _ [ 1 ] ;
126 R K 4 R K [ 1 ] ;
127 f loat c o e f [ ] = {−1.0 f ,−1.0 f ,−1.0 f ,−1.0 f ,−1.0 f , −1.0 f ,−1.0 f ,−1.0 f ,−1.0 f ,−1.0 f } ;
128 F I L E ∗ l o g _ f i l e ;
91
129 c l o c k _ t t i m e s [ N T +10] ;
130
131 t i m e s [ 0 ] = c l o c k ( ) ;
132 m a l l o p t ( M _ M M A P _ M A X , 0) ;
133 c u d a _ i n i t ( a r g c , a r g v ) ;
134
135 i f ( a r g c > 1){
136 N X = s t r t o l ( a r g v [ 1 ] , N U L L , 10) ;
137 i f ( N X==L O N G _ M I N | | N X== L O N G _ M A X ){
138 p e r r o r ( ”Argument e r r o r ” ) ;
139 }
140 }
141 else N X = 512;
142 /∗ i n i t domain ∗/
143 x x = ( f loat ∗) c a l l o c ( NX , s izeo f ( f loat ) ) ;
144 u 0 = ( f loat ∗) c a l l o c ( NX , s izeo f ( f loat ) ) ;
145 f _ b = ( f loat ∗) c a l l o c (2 , s izeo f ( f loat ) ) ;
146 d f _ b =( f loat ∗) c a l l o c (2 , s izeo f ( f loat ) ) ;
147
148
149 l i n s p a c e ( X _ M I N , X _ M A X , NX , 1 , &dx , x x ) ;
150
151 d t = f a b s f ( ( K 1 ∗ d x ) / a ) ;
152 n u = ( K 2 ∗ d x ∗ d x ) / d t ;
153
154 C A = C A _ ;
155 c o m p a c t _ i n i t ( CA , dx , NX , 4 , c o e f ) ;
156
157 f _ u 0 ( NX , xx , u 0 ) ;
158 r k 4 _ i n i t ( RK , dt , F ) ;
159
160
161 c u d a _ e r r o r ( c u d a M a l l o c ( ( void∗∗)&d _ u 0 , N X ∗ s izeo f ( f loat ) ) ) ;
162 c u d a _ e r r o r ( c u d a M a l l o c ( ( void∗∗)&L O G , N T ∗ N X ∗ s izeo f ( f loat ) ) ) ;
163 h _ l o g = ( f loat ∗) m a l l o c ( N X ∗ N T ∗ s izeo f ( f loat ) ) ;
164
165 D P R I N T ( ”x domain : (x m , x M , dx )=(%1.2 f ,%1.2 f ,%1.2 f \n” , x x [ 0 ] , x x [ NX −1] , d x ) ;
166 D P R I N T ( ” t domain : ( t m , x m , dx )=(%2.5 f ,%2.5 f ,% f \n” , 0 .0 f , ( NT−1)∗ dt , d t ) ;
167
168 /∗ i n i t i a l c o n d i t i o n ∗/
169 c u d a _ e r r o r ( c u d a M e m c p y ( d _ u 0 , u0 , N X ∗ s izeo f ( f loat ) , c u d a M e m c p y H o s t T o D e v i c e ) ) ;
170 c u d a _ e r r o r ( c u d a M e m c p y ( L O G , d _ u 0 , N X ∗ s izeo f ( f loat ) , c u d a M e m c p y D e v i c e T o D e v i c e ) ) ;
171 m e m c p y ( h _ l o g , u0 , N X ∗ s izeo f ( f loat ) ) ;
172
173 /∗ main l o o p ∗/
174
175 t i m e s [ 1 ] = c l o c k ( ) ;
176 for ( i=1; i<N T ; i++){
177 r k 4 _ i n t e g r a t e ( RK , NX , d _ u 0 , L O G+i∗ N X ) ;
178 c u d a _ e r r o r ( c u d a M e m c p y ( d _ u 0 , L O G+i∗ NX , N X ∗ s izeo f ( f loat ) , c u d a M e m c p y D e v i c e T o D e v i c e ) ) ;
179 }
180
181 c u d a _ e r r o r ( c u d a M e m c p y ( h _ l o g , L O G , N T ∗ N X ∗ s izeo f ( f loat ) , c u d a M e m c p y D e v i c e T o H o s t ) ) ;
182 t i m e s [ 2 ] = c l o c k ( ) ;
183
184 l o g _ f i l e = f o p e n ( B E N C H _ F N A M E , ”a” ) ;
185 // o u t p u t f o rm a t NX NT t i t l
186 f p r i n t f ( l o g _ f i l e , ”%04d %03d %0∗ ld %0∗ ld\n” , ( int ) NX , NT , 8 ,
187 ( t i m e s [1]− t i m e s [ 0 ] ) /1000 , 8 , ( t i m e s [2]− t i m e s [ 1 ] ) /1000) ;
188 f c l o s e ( l o g _ f i l e ) ;
189 f p r i n t f ( s t d e r r , ”DONE\n” ) ;
190 f f l u s h ( N U L L ) ;
191 /∗ OUTPUT ∗/
192
193 l o g _ f i l e = f o p e n ( L O G _ F N A M E , ”w” ) ;
194
195 for ( i=0; i<N X ; i++){
196 f p r i n t f ( l o g _ f i l e , ”%+1.5 f ” , x x [ i ] ) ;
197 for ( j=0; j<NT −1; j++){
198 f p r i n t f ( l o g _ f i l e , ”%+2.5 f ” , ∗( h _ l o g+j∗ N X+i ) ) ;
199 }
92
200 f p r i n t f ( l o g _ f i l e , ”%+2.5 f \n” , ∗( h _ l o g +( NT−1)∗ N X+i ) ) ;
201 }
202 f c l o s e ( l o g _ f i l e ) ;
203
204 return 0 ;
205 }
206
207
208
209 /∗ v i : s e t f o l d m e t h o d= s y n t a x : ∗/
210 /∗EOF∗/
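In `main`, the time step and viscosity are derived from the mesh spacing through the constants `K1` and `K2`. Reading them as convective and diffusive stability numbers (an interpretation, not stated in the listing itself), the two assignments correspond to

```latex
\Delta t = K_1\,\frac{\Delta x}{|a|},
\qquad
\nu = K_2\,\frac{\Delta x^2}{\Delta t},
```

so that the CFL number $|a|\,\Delta t/\Delta x = K_1 = 0.3$ and the diffusion number $\nu\,\Delta t/\Delta x^2 = K_2 = 0.1$ are held fixed as the grid is refined.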