Solution of the Transport Equation
using Graphical Processing Units
Gil Gonçalves Brandão
Master Thesis on
Aerospace Engineering
Jury
President: Prof. Fernando José Parracho Lau
Supervisor: Prof. José Carlos Pereira
Co-supervisor: Mst. Ricardo José Nunes dos Reis
Examiner: Prof. José Leonel Monteiro Fernandes
October 2009
Acknowledgements
I want to thank Professor José Carlos Pereira and Ricardo Reis for this great opportunity to work with them at LASEF, and for having helped me close a long chapter of my life.
I want to express that the fact that my parents always let me live in complete freedom, without ever even suggesting that I do something I did not agree with, is of immeasurable value to me, as is the value this brings to the closing of this chapter. In the same way, I want to explicitly express my deep gratitude for never having felt any pressure to finish the degree, in a world where time to market (and not happiness) seems to be the rule.
I also want to thank the person who, in recent years, knew how to keep all the plates balanced in the best sense, but also to create imbalances whenever necessary. Without her, this text would most likely never have been written.
More generally, I thank Rádio Zero: the means of access, the intervention, the experimentalism and the friends. And all those who, by contributing to Free Culture, show the World that it is possible to live together in harmony and progress without leaving anyone behind.
Resumo
Contradicting 30 years of progress in CPU speed, recent years have shown a saturation point in CPU clock rates. This fact conflicts with the ever-growing computational needs of the computational fluid dynamics scientific community. At the same time, GPUs have emerged as a high-performance parallel computational resource and an alternative to the CPU. This new technology is also cheaper than the traditional parallel approaches. This thesis investigates the computational paradigm associated with GPUs and its implementation, the use of this technology to solve the one-dimensional transport equation, and the gains compared with a CPU-based solution.
NVIDIA's CUDA technology is used as the platform for accessing the GPUs. Tests were implemented to acquire real knowledge of the technology. The computational routines needed to solve the transport equation were also implemented with this technology.
The results obtained in this work show that the technology, although new and little explored, is a very promising platform on which concrete gains in the domain of computational fluid dynamics can be achieved.
Keywords: parallel computing; GPU; high performance; computational fluid dynamics; finite differences; linear systems.
Abstract
Contradicting 30 years of progress in CPU speed, the last few years have shown a saturation point in clock rates. This fact collides with the ever growing demand for computational power from the CFD scientific community. At the same time, GPUs have emerged as a parallel, high performance alternative computational resource. This new technology is also cheaper than the traditional parallel approaches. This work investigates the computational paradigm attached to the GPUs, the usage of these devices to solve the one-dimensional transport equation and what performance gains exist when compared with traditional CPU usage.
The CUDA technology from NVIDIA is used as the platform to access the GPUs. Tests were implemented to acquire real knowledge of the technology. The routines necessary to solve the transport equation were also implemented.
The results obtained in this work show that this technology, albeit new and immature, is a very promising platform on which real speedups in the CFD domain can be achieved.
Keywords: parallel computing; GPU; high performance; speedup; CFD; finite differences;
linear systems.
Contents
1 Introduction
1.1 GPU, the Cluster on the Desktop
1.1.1 Historical Perspective
1.2 GPU Technologies
1.3 Objectives
1.4 Methodologies
1.5 Outline
2 CUDA Programming Overview
2.1 Parallel Computing Overview
2.1.1 Parallel Systems
2.1.2 Parallel Programming
2.2 GPU Hardware Model
2.3 CUDA Programming Model
3 CUDA Environment Tests
3.1 Test System
3.2 Metrics
3.3 Peak Throughput
3.4 Bandwidth
3.4.1 Host - device transfers
3.4.2 Device - device transfers
3.5 Stream benchmark
3.6 Summary
4 Burgers equation solver
4.1 Mathematical Model
4.2 Computational Model
4.2.1 Computational Methods
4.3 Implementation
4.3.1 Data structures
4.3.2 Program and algorithms
4.3.3 Computational Resources
4.4 Metrics
4.5 Simulation Results
4.5.1 Performance
4.5.2 Numeric errors
4.6 Summary
5 Conclusion
5.1 Summary
5.2 Conclusions
5.3 Future Work
A Additional Information
A.1 Properties of some GPUs
B Code Listings
B.1 Benchmarks
B.1.1 FLOP benchmark
B.1.2 Bandwidth
B.1.3 Stream
B.2 Burgers equation solver
B.2.1 Linear Algebra
B.2.2 Numerical Methods
B.2.3 Application
List of Figures
2.1 The von Neumann model
2.2 Parallel computing memory patterns
2.4 Parallel Speedup Evolution
2.5 Host Computer
2.6 A GPU
2.7 Asynchronous execution of the device
2.8 CUDA program example
3.1 FLOP test, total time of execution
3.2 FLOP performance
3.3 Time of the total transfer cycle
3.4 Details of the data transfers for two different sizes
3.5 Workload distribution for a vector of size 7 and 3 threads
3.6 Intra device memory transfers
3.7 Intra device memory transfers w/ cache
3.8 Stream benchmark
4.1 Main program
4.2 Problem of the row interchange order
4.3 Absolute speedup
4.4 Initialization ratio
4.5 Loop speedup
4.6 Initialization with the inverse computed on the GPU
4.7 Speedup with the inverse computed on the GPU
List of Tables
3.1 Host system hardware details
3.2 Device properties
3.3 Memory and operation accounting
3.4 Summary
3.5 Stream benchmark results
4.1 Resume of computational operations
4.2 Memory usage in floating point elements
A.1 Properties of several GPUs
Chapter 1
Introduction
Engineers doing Computational Fluid Dynamics (CFD) have always struggled for computing resources to solve their problems faster. And when they manage to solve one of those problems, there is always something bigger around the corner. The natural way to fulfil this demand (for speed and size) is to increase the machine's speed while maintaining its original sequential form. This sequential origin probably exists because the natural way for humans to express algorithms is as a sequence of simple actions. Until a few years ago, it was possible to just wait a few months and buy a new, faster machine. Moore's Law, stating that on-die transistor count doubles every two years, was granting this free ride, providing a steady increase in CPU speed. Unfortunately, current technology has run into a wall: the ever decreasing die size with increasing transistor density has made heat problems unbearable. The obvious sign is in clock rates: the brand new Intel Core i7 has a 3.3GHz clock rate, while Intel Xeon processors had already reached a maximum clock rate of 3.6GHz years earlier. The major consequence of it being impossible to keep increasing clock speeds is that the old sequential algorithms no longer get faster for free. The solution for growth is now, more than ever, parallel computing. Instead of adding capacity to one processing unit, the number of processing units is increased; metaphorically, to increase the flow rate while keeping the velocity constant, we can only increase the cross-sectional area. The CPU industry has already understood this: it has turned multi-core and chosen to embrace the parallel way of marching into the future. But parallel computing, like fluid mechanics, isn't linear, and the speedup is no longer free after the purchase of new hardware: there is an inherent complexity when compared to the serial way of thinking and, depending on the problem, the speedup can range from negligible to linear scaling with the number of processing units.
Meanwhile, the graphics card industry was pursuing its own path, using parallel, dedicated hardware from the start and driven by the rich market demand of gamers. Graphics Processing Units (GPUs) gained more capabilities and, in the last couple of years, toolkits and dedicated frameworks appeared, allowing the true start of the exploration of GPU power for general computing. The problem is, of course, that despite these higher level frameworks, this is still specialized hardware, and a thorough knowledge of its intricacies is needed to harness its full power. All the more so because the CPU world, albeit going multi-core and parallel, is tied to the need of answering very different kinds of requests simultaneously: a true Land Rover of the computing world, whereas GPUs are more in the class of high speed F1 machines.
This work focuses on dissecting and exploring the GPU for solving partial differential equation
problems. I have tried to carefully characterize the GPU, especially under the CUDA environment
from NVIDIA.
1.1 GPU, the Cluster on the Desktop
The GPU is the processing core of current graphics cards. The graphics card is the component of a computer responsible for outputting electrical signals to visual displays. Each pixel on the display needs to have its color and intensity computed before being transformed into display signals. The sequence of operations needed to process each pixel is called the graphics pipeline[12, chap. 1] and it is both computationally intense and highly parallel: computationally intense because there are many pixels in a frame and several frames per second; highly parallel because the state of each pixel is completely independent from its neighbors, so the computation of each pixel can be done at exactly the same time as any other pixel in the frame.
Mainly due to the demands of the electronic game market, the specifications of graphics cards have been growing every year, so that computer games could look and feel more realistic. At the same time, and for the same reasons, the degree of programmability of these devices has grown, and today the devices aren't only meant for graphics: they have become general purpose computing devices. In terms of raw computing power, compared with traditional CPU technologies, this power is only matched by large aggregates of computers called clusters. For example, the new Intel Core i7 (with four 3GHz cores) does nearly 70GFLOPS while the NVIDIA Tesla C1060 (with its two hundred and forty 1.3GHz cores) does approximately 900GFLOPS. This means that more than 10 Core i7 processors would be necessary to achieve the same raw performance.
Each solution has its strengths and weaknesses. For example, one clear disadvantage of the GPU is that its RAM is limited (and it can't be expanded using transparent swapping technologies). However, GPU memory is faster than the RAM of a computer (102GB/s on the NVIDIA Tesla C1060 vs 12GB/s with the most recent DDR3 memory) and the communication between the cores of the GPU is faster than the one on a cluster. Another advantage of the GPUs is cost, in two ways: primarily the cost of purchase, and secondly the cost of powering the devices. The price of a high end GPU device is, generally, the price of a single node of a cluster. A real case is the last expansion of the LASEF1 cluster: each node cost on the order of €2000 and 5 nodes were bought; at the same time, a GPU solution with more raw processing power than the whole cluster (including the old nodes) was acquired for less than €8000. In terms of electrical power, the NVIDIA Tesla C1060 consumes 200W, i.e., a fraction of what a cluster node would consume. Another big advantage of GPUs is manageability: the cost of managing a single card (and a single computer) isn't comparable to the cost of managing a whole cluster.
Last but not least, if truly big computing power is needed, there is always the possibility of making clusters of GPUs. Because of all the previous considerations, we are convinced that we need to investigate how to use this new kind of computing device.
1.1.1 Historical Perspective
The idea of using multiple co-processors to increase a workstation's performance in a (highly) parallel fashion is not completely new. An example of this approach was the Atari Transputer Workstation[1], presented at the 1987 COMDEX, which featured the so called "Farm Cards". These cards essentially consisted of more processors to be used by the operating system. In those days the ATW had significant parallel performance, but the technology didn't go mainstream and the product was soon discontinued.
Also for workstations, and since the Intel 8086 processor, Intel provided a co-processor family: the Intel x87. These co-processors were floating point units designed for high performance numerical applications (for example they featured, among many others, exponential and logarithmic functions). By the time of the i386 processor, they were IEEE 754 compliant and provided asynchronous operation (i.e., parallel to the CPU). By 1989, these units were incorporated into the i486DX processor. Another step in the parallel computing history of workstations was the introduction of the MMX units into the Pentium processor family (in 1997). Although not meant for numerical applications (since they were integer units oriented towards the multimedia field), they featured instruction pipelines with a SIMD philosophy, so that multiple data was processed with one single instruction. Since MMX, the use of similar technologies (integer and floating point) has never stopped growing.
When clock rates started to saturate, the CPU makers started to look at truly parallel approaches. In 2001 Intel released a technology called hyper-threading, which improves the performance of multi-threaded code. In 2005 the dual core (two processing units in the same chip) Intel Pentium D processor was released. Successive generations of CPUs have seen their core counts multiply (AMD produces the 3 core Phenom and the 6 core Opteron). In architectures other than x86 there are also multi-core processors, such as the IBM/Toshiba Cell processor.
1 LASEF - Laboratory and Simulation of Energy and Fluids, Instituto Superior Tecnico
Meanwhile, in the graphics scene, the logic was different. From the beginning, the idea was to offload graphics output functions from the CPU. With the video game industry in mind, in 1999 NVIDIA launched the first graphics processing unit: the GeForce 256. This card (with its Transform & Lighting technology) was the first to offload the whole 3D graphics pipeline[8]. By 2001, with the GeForce 3, the vertex shading process became programmable. The substitution of rigid components with programmable ones in the graphics pipeline didn't stop and, in 2002, NVIDIA released the Cg technology (C for graphics), a highly specialized language (and compiler) for graphics hardware that works on top of the OpenGL or DirectX libraries and opens the graphics hardware to the graphics developer. Since then, GPUs have been programmed to solve problems other than computer graphics: it was the beginning of General Purpose computing with Graphics Processing Units (GPGPU). A key to general purpose programming with this approach is mapping the algorithms to the highly specialized graphics pipeline methods[26]. In 2007 NVIDIA released the Compute Unified Device Architecture (CUDA), a new language (deeply based on C) that is fully oriented towards GPGPU: it allows any programmer to use the full power of the GPUs without needing to know anything about the graphics pipeline. The potential of GPU based technologies led to the rise of Apple's OpenCL (Open Computing Language) as an industry standard for GPGPU. It's worth mentioning that not only NVIDIA technologies exist: other projects such as the Brook project, LibSh or AMD Stream are also available to work with.
1.2 GPU Technologies
From the hardware side, there are mainly two device makers, NVIDIA and ATI, with their successive generations of devices. Generally, each new generation of devices significantly increases computing power and programmability.
On the software side, as said in section 1.1.1, there are two kinds of technology for computing data on GPUs: mapping the algorithms to the graphics pipeline, or using the more recent general languages. Currently, in the first approach, there are two big families: NVIDIA Cg2 (which includes the Microsoft High Level Shading Language3) and the OpenGL Shading Language4. In the general language field, there are NVIDIA CUDA5, OpenCL6 and the Brook7 family (which includes the AMD/ATI technology).
2 http://developer.nvidia.com/page/cg_main.html
3 http://msdn.microsoft.com/en-us/library/bb509561%28VS.85%29.aspx
4 http://www.opengl.org/documentation/glsl
Even if, at the current date, they are of less interest (because of their inherent additional complexity), there has been research on how to exploit the GPU potential using the graphics pipeline, i.e., on how to map a specific problem to the graphics pipeline. For example, in physically based simulation of fluid dynamics: a Navier-Stokes solver algorithm oriented to GPUs[30] using a solver based on the method of characteristics; a cloud simulator[20] using a Jacobi solver; and, in the fluid flow scientific domain, lattice-Boltzmann[10] and finite element[27] methods were studied. Work on the advection-diffusion problem using forward finite differences and the Crank-Nicolson method has been done[28]. In the linear algebra domain, broader work[22, 23] has been done on matrix multiplication[11].
With the release of the CUDA framework, the devices were completely opened to programmers and, since then, all kinds of computational applications have been released: video encoding, weather prediction, molecular simulation, fluid dynamics, computer vision, cryptography, etc. In the CFD context, lattice Boltzmann methods have been studied[32, 19, 37]. Apart from this, the Navier-Stokes equations have been solved using finite volume[7] and finite element[17] codes. In the linear algebra domain, several codes have been developed. NVIDIA released a CUDA version of the standard BLAS library (routines that compute simple operations, such as addition and multiplication, at the vector and matrix level). At a higher level (linear system solvers), two main orientations seem to exist at present. On one side, there is a big interest in maintaining the old LAPACK8 library interface, using the GPU as a high performance computational resource: the factorization algorithms (such as the LU, QR or Cholesky factorizations) are being implemented[34] and hybrid CPU-GPU approaches are being studied[33]. On the other side, a new generic approach to algebra algorithms is being developed[15] and the GPUs are being used as a test framework for this new approach[5]. Besides these two approaches, there is also work on sparse matrix algebra[16].
1.3 Objectives
The main objectives of the present work are:
• to investigate the concepts behind the technology, their implementation and what key mechanisms can lead to the best performance;
• to investigate the performance of GPU based computing in a class of CFD problems: solving the advection-diffusion transport equation using finite difference methods.
5 http://www.nvidia.com/object/cuda_home.html
6 http://www.khronos.org/opencl/
7 http://graphics.stanford.edu/projects/brookgpu
8 For complete information, search the LAPACK working notes at http://www.netlib.org
1.4 Methodologies
To achieve the stated objectives, the following was done:
• Port a state of the art benchmark to the CUDA environment to understand the programming paradigm and compare it with the CPU environment. The results are also compared with other works. This is necessary because the technology is completely new and, because of that, major errors can be made without notice.
• Implement tests that quantify the relative performance of the different memory access methods. Unlike the CPU paradigm, where two main performance aspects have to be considered (RAM and cache), the GPUs present a vast number of options in the hardware.
• Implement equivalent programs that solve the one-dimensional advection-diffusion equation in both the CPU and GPU environments, using compact schemes to compute the spatial derivatives and the Runge-Kutta method to do the time integration.
• Compare two direct dense solvers in each environment: an LU based solver and an inverse matrix based solver.
1.5 Outline
The present thesis contains 5 chapters, organized in the following way:
In Chapter 1, the problem that originated this work, as well as its objectives, are presented.
In Chapter 2, the concepts of parallel computing and their implementation on GPUs are discussed. The GPU programming paradigm is also presented.
In Chapter 3, a GPU is tested in depth to show that the new technology environment is understood, as the results are compared with other works. Some less documented aspects of the GPU are also clarified.
In Chapter 4, the one-dimensional advection-diffusion equation is solved using GPU technologies and the results are compared with an equivalent sequential approach.
Chapter 5 closes this work with the main conclusions drawn and some suggestions for future work.
Chapter 2
CUDA Programming Overview
CUDA itself is just a programming model that follows the underlying GPU hardware pattern. It's important to properly understand how GPUs work (even at the lower levels) in order to take advantage of them. The purpose of the current chapter is to present the basic concepts of parallel computing, the CUDA underlying hardware model and how this translates into the CUDA programming paradigm.
2.1 Parallel Computing Overview
To better grasp the concepts related to GPGPU programming, a short review of parallel computing concepts is first presented. Parallel computing can be understood as "a collection of processing elements that communicate and cooperate to solve large problems fast"[3]. This is of course incomplete: parallel strategies can also be pursued just to make the problem's resolution possible at all, e.g. for problems with memory requirements only met by aggregated machines. So, depending on the goals, distinct models of parallel computing can be found. The present work concerns itself with just one field of the parallel computing world: High Performance Computing.
2.1.1 Parallel Systems
The processing element is usually associated with a digital computer. This computer can be generally modelled by the von Neumann architecture, which is composed of a central processing unit (the CPU, which fetches instructions and data and processes them); a memory system (which holds the instructions and data); and an Input/Output system (the I/O system, used to communicate with the outside world). It is a sequential model, and it has one path connecting the memory system to the CPU[24] (see figure 2.1). In this architecture, the run of a program consists of the continuous iteration of the following cycle (the execution cycle): 1) the CPU fetches an instruction from memory; 2) it decodes the instruction; 3) it fetches the data needed to process the instruction; 4) it executes the instruction.
Figure 2.1: The von Neumann model (CPU, memory and I/O system)
A parallel computer can be thought of as a combination of several of these units1 that can be used together to fulfill a computational goal. The method of interconnection, the number of units and other details are matters of the particular system architecture. This poses an inherent extra difficulty: not only do we humans not think in parallel, but the parallel model (unlike the sequential one) is system dependent. However, patterns do exist, and some of them will be explained during the present chapter.
With respect to the configuration of the system's execution cycle, a method of classifying computer systems by their parallelism is the Flynn taxonomy[14]. It states that computer systems can be divided into four categories:
• Single Instruction, Single Data (SISD). This is the common sequential computer. There is only one instruction being executed at a time, operating on a single data stream.
• Single Instruction, Multiple Data (SIMD). In this architecture, there is one single instruction
running at a time, but there is a degree of parallelism of data streams.
• Multiple Instruction, Single Data (MISD). Multiple instructions running at the same time,
operating over the same data stream.
• Multiple Instruction, Multiple Data (MIMD). There are multiple instructions operating on
multiple data streams.
In the real world, parallel machines can in fact be combinations of these four models. For example, a single processor personal computer, which is considered a SISD system, can also be considered SIMD when using special instructions.
1 In fact, parallel computing isn't just based on the von Neumann architecture, as there are other models such as data flow computing, systolic arrays or neural networks[24, sec. 9.5]. But these architectures are out of the scope of this document.
Regarding the parallel computer's memory system, two main patterns exist: shared memory systems (figure 2.2a), where the memory of the system is directly accessible by all processors; and distributed memory systems (figure 2.2b), where each processor has its own private memory which isn't directly accessible by any other processor. This design issue has a major implication for cooperation between the processors: either the memory can be used to communicate (by using predefined shared locations to exchange information), or a message passing method has to be implemented on top of the I/O system to exchange information.
Figure 2.2: Parallel computing memory patterns: (a) shared memory system; (b) distributed memory system
As said, the processors need to communicate and cooperate. Because no communication can
be performed instantaneously, there are always latencies associated with the communication2.
Even in shared-memory systems, where the communication isn’t done through the I/O subsystem
(which usually is slower than the CPU and memory), the difference between a simultaneous and
a non-simultaneous memory access can mean the serialization of accesses (as memory components
aren’t elastic) and thus a loss of performance.
Lastly, if the ultimate goal of parallel computing is to solve large problems fast, the fundamental
metric used in comparisons between sequential and parallel systems is the speedup that’s defined
by equation 2.1, where Ts is the time of the sequential computation and Tp is the time of the
parallel computation.
S = Ts / Tp    (2.1)
2These latencies also exist in sequential systems but, as sequential systems are assumed to have only one
processor, they don’t play the central role that they do on parallel machines, where, depending on the number of
processors and how they communicate, the performance of an algorithm can differ significantly.
2.1.2 Parallel Programming
Parallel programming is a general term for programming on parallel systems. Since parallelism
is system dependent, parallel programming is also system dependent. Common concepts and
methodologies in parallel programming are now presented.
Four steps can be defined[9] in the process of coding a parallel program:
0. writing the sequential program;
1. decomposition of the program into computational tasks;
2. assignment of computational tasks to specific threads (here a thread is the minimal unit of
execution; depending on the context, this concept may be named process or thread — threads
are usually associated with shared memory contexts and processes with distributed memory
contexts. In the present document the word thread is used throughout);
3. orchestration of the threads, i.e., the set of operations needed by the threads to correctly
cooperate;
4. mapping of the threads to specific processors. This step is usually implemented by the
underlying platform (i.e., the programmer does not have to think about it).
Decomposition defines the degree of concurrency (concurrency being associated with paral-
lelism and the access to common resources). The number of tasks should balance maximizing
processor usage against minimizing management overhead, so that the resources spent on the
computation are actually bigger than the ones spent managing the concurrent environment. In
assignment, the most important aspect to consider is load balancing, i.e., the correct distribution
of computation, resources and communications between the threads. Orchestration is the imple-
mentation of the cooperation, i.e., the definition of the communication and synchronization
methods. Several concepts are important to present:
• Race condition. Whenever a resource is shared in a concurrent environment, a race condition
can occur. The problem that arises is one of coherency: if two threads read a shared memory
location and both try to update it at the same time, the result is unpredictable. Figure 2.3a
illustrates the case where two processors access a shared variable with an initial value
(a). At the same time they change that value and at the end an update is done. However,
depending on the order of the updates (which isn’t known) the result will differ, so the final
result is unpredictable.
• Synchronization. Incoherent states may be created during parallel thread execution. Syn-
chronization is the operation of ensuring that the incoherency is eliminated by communicating
with a main resource holder.
• Atomic operation. This is the tool to deal with race conditions. An atomic operation is
an operation that cannot be interrupted; the operated resource is made unavailable until the
operation is finished, so no other thread can access the resource and create an incoherent
state. For example, in figure 2.3b, access to the shared resource is denied to one CPU until
the other completes the sum operation.
• Starvation. The starvation condition happens when a resource is perpetually held unavailable
while a thread needs access to it. The execution of this thread is blocked forever.
These concepts are essential in shared memory systems, since the memory is directly available to
several processors. However, they also hold in distributed memory systems. For example, if a
thread needs information about a resource on another thread, it may starve waiting for that
resource to become available.
[Figure 2.3: (a) Race condition — two CPUs read the shared value a and update it concurrently
to a+b and a+c; the final value is unpredictable. (b) Atomic operation — the second CPU’s
access is blocked until the first update (a+b) completes, yielding a+b+c.]
Laws of parallel speedup
Two theoretical laws describe parallel speedup: Amdahl’s law[4] and Gustafson’s law[18].
Amdahl’s law reflects how much speedup can be achieved for the same problem size as the
number of available parallel processors increases. Algorithms always have a sequential, non-parallel
fraction, where a fixed amount of time ts is spent, and another fraction that can in fact be
parallelized (tp being the time spent in this fraction). Assuming perfect parallelization, a theoretical
value for the speedup can then be found using Amdahl’s law, expressed by equation 2.2. Its main
implication is that, even with an infinite number of processors (n → ∞), the maximum possible
speedup is limited by the fraction of the program that was parallelized (rp).
S = Ts / Tp = Ts / (ts + tp/n) = 1 / (rs + rp/n) = 1 / ((1 − rp) + rp/n)    (2.2)
On the other hand, Gustafson’s law reflects a concern for scalability, i.e., the expected behavior
of an algorithm when applied to an increasing problem size and machine computing power
(more processors available).
Gustafson’s Law states that the speedup achieved using a parallel system (of n processors) to
compute a parallelized algorithm, when compared to the use of a sequential system to compute
for the same algorithm is given by equation 2.3.
S = Ts / Tp = (ts + n · tp) / (ts + tp) = rs + n · rp = (1 − rp) + n · rp    (2.3)
From equation 2.3 it can be seen that the speedup can be proportional to the number of processors
for increasing problem sizes. The evolution of both laws is presented in figures 2.4a and 2.4b.
[Figure 2.4: Parallel speedup evolution as a function of the number of processors (25–200), for
tp = 30%, 60% and 90% — (a) Amdahl’s law; (b) Gustafson’s law.]
Parallel High Performance programming: state of the art
After several years, two main approaches have emerged as standards for parallel computing
in High Performance Computing: OpenMP, for shared memory machines, and MPI (Message
Passing Interface), targeted mainly at clusters (PVM, one of the first efforts to achieve a standard
model, has almost vanished from the High Performance Computing world). A steady rise in hybrid
OpenMP-MPI codes has also been happening because of the increasing number of single-node
multicore machines being incorporated into clusters.
OpenMP is a standard API for programming shared memory systems (for a deep read on the
technology, [6] is recommended), based on pragma directives — special preprocessor directives
used to inform the compiler about a particular portion of code. There are bindings for Fortran, C
and C++, and it is supported on multiple hardware platforms and operating systems. The
technology acts at two levels: the compiler level and the library level. The programmer uses special
compiler directives to inform the compiler of the parallel areas of the code; decomposition is
governed by environment variables or code directives, leaving the burden of thread management
to the compiler. Using the library-level routines (as well as other compiler directives), the
programmer does the assignment and orchestration of the program. Finally, the operating system
does the mapping of the processes to the hardware processors.
MPI is a standard specification for communication between computers. It is commonly used in
clusters (distributed memory systems) and operates over various networking protocols (the most
common being TCP over Ethernet). It requires the user to explicitly program the data transfers
and the synchronization between processes.
2.2 GPU Hardware Model
GPUs are expansion cards for use in a computer, i.e., they aren’t autonomous computer
systems (the general configuration of a computer with a GPU is shown in figure 2.5). GPUs
are usually the computing core of a graphics card, but there are GPUs without the graphics output
module. In the present document (and, generally, in the GPU lexicon) the computer hosting a GPU
is called the host and the GPU is called the device. The model described in this document is based
on NVIDIA GPUs, but most aspects are similar across vendors.
[Figure 2.5: Host computer — the CPU (host) connects through a bridge to the GPU (device)
and its memory.]
Each GPU is an aggregate of multi-core processors (multi-processors) sharing a global memory.
Multi-processors don’t have any I/O system to communicate between them and, as a consequence,
cannot cooperate through any message passing system. So the GPU, as a parallel system, is
essentially a shared memory system. Apart from the shared memory, the only available communi-
cation path is between the host and the device, and it is limited, since the host is the one controlling
it (if available, the GPU can also output to the display sub-system, and thus to a monitor).
Figure 2.6: A GPU
Each multi-processor is composed of: a number of scalar cores, which perform the computations
(these scalar cores are specialized in arithmetic instructions); an instruction unit, responsible for
delivering instructions to the scalar cores; and an on-chip shared memory that can be used for scalar
core communication (this memory isn’t accessible by the other multi-processors in the GPU). Each
multi-processor unit is thus itself a shared-memory system. See appendix A.1 for the property
values of particular devices.
The memory system of current NVIDIA GPUs is complex. There are two main groups
of memory: on-chip (memory located inside each multi-processor) and off-chip or global memory
(memory located on the GPU board and accessible by all multi-processors). Global memory
is organized into four types: linear memory, texture memory, constant memory and local memory.
The main implication of using each type is how multi-processors access the memory: any access
to linear memory means using the shared bus; texture and constant memories are cached, so
the shared bus isn’t used on every single memory access. These caches are read-only, so multi-
processors cannot write to them. Because the bus to global memory is shared and serialization of
accesses occurs, the GPU has the ability to coalesce certain access patterns (an access is said to be
coalesced when several requests are fulfilled with only one transaction). On-chip memory has
two additional types: the shared memory, which is directly accessible by any scalar core inside
each multi-processor; and the local registers, which are private to each scalar core. If a scalar core
needs more memory than is available in registers, it can also use the global memory while
maintaining the local scope (this is the local memory). In order to reduce serialization within the
multi-processor, the shared memory is divided into banks that can be accessed simultaneously
without loss of performance (it’s up to the programmer to ensure correct use of this possibility).
In terms of execution in a GPU environment, the minimal computing task is the thread.
These threads are created, managed (scheduled) and destroyed by the GPU, i.e., the threads
live in hardware space. This is one of the major differences from other common parallel
environments: for example, in a multi-tasking operating system (an operating system that can run
more than one process or thread simultaneously, such as the Unix family — Linux, FreeBSD,
MacOS — or Windows; for more information read [31]), all the processing units are scheduled in
software. It’s up to the operating system (not the hardware) to decide which process (or thread)
runs at what time and on which particular processor, and this is costly in terms of memory and
processing cycles. The hardware scheduling is responsible for the virtually null cost of creating and
scheduling threads, and for raising the practical number of threads into the thousands. However,
GPUs aren’t oriented towards general computing, only towards data processing. In GPUs the
threads are grouped into sets of up to 32 threads called warps. The warp is the scheduling unit.
In the Flynn taxonomy, GPUs best fit the SIMD category, since the instructions fed to
the scalar cores are the same while each thread can access different data. However, mainly because
the code running in each thread may automatically diverge (i.e., the programmer doesn’t have to
manually take care of “if” clauses, since branching is supported by the hardware), NVIDIA defined
a new category (Single Instruction, Multiple Thread — SIMT)[25, sec 4.1].
2.3 CUDA Programming Model
“CUDA extends C by allowing the programmer to define C functions, called kernels, that when
called are executed N times in parallel by N different CUDA threads, as opposed to only once like
regular C functions”[25].
As said, the CUDA software model is an extension of the C/C++ programming languages that
reflects the underlying GPU hardware. (Even though this model is a superset of the C/C++
languages, some C/C++ features, such as function recursion, cannot be used in device code.) The
main extensions are[25, sec 4.1]:
• function and variable qualifiers to specify whether the function or variable refers to the
host or to the device;
• a directive to configure and start a kernel.
A CUDA program is no more than a usual C/C++ program that makes calls to CUDA kernels
(figure 2.8). A function to be run on the device must have the device or the global qualifier
(line 2 in figure 2.8). The former defines functions to be called by code running on the
device; the latter defines kernels (additionally, kernel functions must be void-typed, i.e., they
cannot return any value). By default, functions with no qualifier are considered host functions.
In terms of variables, the environment defines their scope, i.e., in device functions the variables
belong to the device memory space and in host functions to the host memory space. In other
cases, qualifiers (similar to the function ones) are used. The device cannot directly access host
variables, nor can the host directly access the device’s. The only direct interface is the kernel call,
where the kernel parameters are automatically copied to the device’s memory. Memory management
(allocation, freeing and copies) is done by the host using dedicated functions (in figure 2.8: lines 20
and 21 for allocation; 29 and 41 for transfers; and 44 and 45 for freeing). The host holds the
locations of the device’s data in its own memory using traditional pointers (if the programmer
tries to dereference these locations on the CPU, the result is undefined and will likely produce a
segmentation fault).
To launch a kernel, the CUDA API defines a new directive. This directive contains the execution
configuration (number and arrangement of threads). Regarding the execution configuration, the
threads are organized in a matrix-like form called a block; each block is assigned to a multi-processor.
The blocks are in turn organized in a matrix-like form called a grid (lines 32 to 36 in the example
code). Within a multi-processor, each thread has built-in variables that can be used to assign
tasks to particular threads; three examples are threadIdx, blockIdx and blockDim, shown in line 3
of the example. After the kernel launch, the mapping of each thread to the multi-processors and
scalar cores is automatically done by the hardware.
The execution of the device’s threads is asynchronous with respect to the host, i.e., the host
can execute other, unrelated code while the device is processing the data. Figure 2.7 shows the
thread organization as well as the asynchronous execution. Synchronization is done using a
function (see line 38) on which the host program waits until all threads on the device have finished
their work.
Figure 2.7: Asynchronous execution of the device
1  // kernel implementation
2  __global__ void vector_scale( float *a, float *b, float k ) {
3      int n = threadIdx.x + blockDim.x * blockIdx.x;
4
5      a[n] = k * b[n];
6      return;
7  }
8
9
10 // main program
11 int main() {
12
13     float *d_a, *d_b;          // pointers to device's memory space
14     float a[64*64], b[64*64];  // host memory
15     float k = 2.0f;            // scale factor
16     int i;
17     dim3 grid, block;
18
19     // device memory allocation
20     if ( cudaMalloc( (void**)&d_a, 64*64*sizeof(float) ) != 0 ) return 1;
21     if ( cudaMalloc( (void**)&d_b, 64*64*sizeof(float) ) != 0 ) return 1;
22
23     // host data initialization
24     for ( i = 0; i < 64*64; i++ ) {
25         b[i] = 1.0f;
26     }
27
28     // data transfer: device <- host
29     cudaMemcpy( d_b, b, 64*64*sizeof(float), cudaMemcpyHostToDevice );
30
31     // execution environment configuration
32     grid.x  = 64;
33     block.x = 64;
34
35     // kernel call
36     vector_scale<<<grid, block>>>( d_a, d_b, k );
37     // kernel synchronization
38     cudaThreadSynchronize();
39
40     // data transfer: host <- device
41     cudaMemcpy( a, d_a, 64*64*sizeof(float), cudaMemcpyDeviceToHost );
42
43     // device memory free
44     cudaFree( d_a );
45     cudaFree( d_b );
46     return 0;
47 }
Figure 2.8: CUDA program example
To use all the memory access methods existing in hardware, the following methods exist in the
CUDA API:
• Linear memory. Access is completely transparent, as shown by the example code, where
a, b and k reside in linear global memory;
• Texture memory. Linear memory has to be bound to a texture (a special data layout, called
CUDA arrays, can lead to better performance than linear memory bound to textures). Special
functions are used within kernels to access this memory;
• Constant memory. It’s statically defined in the code with the constant qualifier. Specific
functions are used to copy data from the host to the device’s constant memory, but access
within the device is transparent;
• Local memory. It’s automatically managed by the device;
• Shared memory. It’s statically defined in the kernel code using the shared qualifier. Its
access is transparent within a kernel;
• Local registers. They are statically defined inside the kernel code. Their access is transparent,
as shown in the example (variable n inside the kernel code).
Lastly, a limited framework exists for orchestration. Threads within a block may use
the syncthreads function to ensure that every thread has reached a defined point. There are also
atomic functions that may be used on the device, but they carry a performance penalty. All the
memory in the GPU may constantly be in race condition, so it’s left to the programmer to ensure
that the code is correctly implemented and that the outcome will be correct regardless of
the thread execution order. As said before, there is also a function that synchronizes the host
execution with the device.
Chapter 3
CUDA Environment Tests
Some tests were performed to understand the capabilities of the available system. Under-
standing the real bandwidths of the device, as well as the configurations that lead to the best
performance, is a crucial task in High Performance Computing. The tests are based on two NVIDIA
benchmarks and, following [19], on a port of the Stream benchmark
(http://www.cs.virginia.edu/stream/). Some important points must be made clear:
• Floating point operations and memory transactions are accounted following the Stream[2]
project (table 3.3).
• As in Stream, the first (slower) iteration is ignored.
• Unlike in Stream, the average time (instead of the minimum time) is used, since the sustained
point of view (as opposed to a peak performance point of view) is more relevant to the present
document.
3.1 Test System
A Debian GNU/Linux 5.0.1 system (http://www.debian.org) was used, with a 2.6.26 Linux
kernel (the distribution’s stock, non-optimized kernel). Version 2.2 of the CUDA libraries is used.
All the codes were compiled using the NVCC compiler (version 2.2) or the GCC compiler (version
4.3.2). The C library used is version 2.7 of the GNU C Library, compiled by the Debian Project.
The hardware details are listed in table 3.1.
CPU Intel(R) Xeon(R) CPU E5420 @ 2.50GHz
Motherboard Intel R© Desktop Board D5400XS
chipset Intel R© 5400 Chipset
PCI express 1.1 4GB/s
RAM 4x4096MB 667MHz DDR2 5.3GB/s
Table 3.1: Host system hardware details
The GPU used is an NVIDIA Tesla C1060; its characteristics are listed in table 3.2.
Multiprocessors 30
Clock Rate 1.3GHz
Memory 4 GB
Memory Clock Rate 800MHz
Memory bus width 512 bit
Memory bandwidth 102 GB/s
Peak Performance 933 GFLOPS
Device Capability 1.3
Table 3.2: Device properties
3.2 Metrics
The metrics used in the benchmarks are the number of operations per second, the bandwidth,
and the time per byte (inverse of bandwidth). As in Stream, measurements are made using the
GNU libc version of the POSIX standard function gettimeofday. This function returns a structure
with two 64-bit integer fields: the number of seconds and of microseconds since the Unix Epoch.
The value is then converted to a double precision floating point number representing seconds. All
measurements are computed as the difference between the instant before the kernel is launched
and the instant after the return of a cudaThreadSynchronize() function call. The byte and
operation normalization is done using the values in table 3.3.
name kernel bytes/iter FLOPS/iter
COPY a(i) = b(i) 2*sizeof(word) 0
SCALE a(i) = q ∗ b(i) 2*sizeof(word) 1
SUM a(i) = b(i) + c(i) 3*sizeof(word) 1
TRIAD a(i) = b(i) + q ∗ c(i) 3*sizeof(word) 2
Table 3.3: Memory and operation accounting
One other concept important to this evaluation is the balance (equation 3.1). This ratio
indicates whether an algorithm is processor bounded or memory bounded.

Ba = memory transactions / operations    (3.1)
This concept can be applied both to the algorithm itself and to the hardware. The relation
between the two ratios (i.e., how well the algorithm’s balance fits the hardware’s balance) is
extremely difficult to obtain (if possible at all), since the exact time that a computation takes can
only be calculated in particularly simple situations. In the present document a simple, hardware-
oriented model is used: theoretically, the hardware can deliver 933 GFLOPS and has 102 GB/s of
memory bandwidth (25 Gwords/s in single precision), so the hardware balance is approximately
0.03. This point is considered neutral, i.e., the point where the time spent on memory transactions
equals the time spent processing data. Values higher than 0.03 are considered memory bounded,
and lower values processor bounded. These regimes are summarized in table 3.4.
           processor bounded   neutral   memory bounded
algorithm         0               1            ∞
GPU               0             0.03           ∞
Table 3.4: Summary
3.3 Peak Throughput
In order to know the peak throughput achievable with the GPU, an NVIDIA test is
used. It consists of an unrolled loop with a series of FMAD instructions (B.1) — a FMAD is a
hardware instruction that computes a multiplication and a sum, e.g., a · b + c. The nature of the
test gives a real measure of the raw processing potential of the GPU, as well as of the configurations
that lead to the best performance in processor bounded kernels (Ba = 0). The performance is
evaluated as a function of the number of blocks and the block configuration.
As in the NVIDIA original benchmark, a 10-iteration loop with a total of 2048 FMAD in-
structions is used. The configuration is done using the relation expressed in equation 3.2.

Nb = (Tmp / Tb) · Nmp    (3.2)
where:
• Nb is the number of blocks;
• Tmp is the number of threads per multiprocessor;
• Tb is the number of threads per block;
• Nmp is the number of multiprocessors.
Figure 3.1 shows the evolution of the execution time of the kernel. There is a staircase-like
evolution for 32 threads per block. The step width is 240 blocks, i.e., the number of scalar cores.
This clearly shows that while there are idle (thus available) processors the kernel time is of order
O(1), but when no more processors are available the processing is serialized and the behavior
becomes O(n), which leads to the global linear trend of the staircase. With the other configurations,
the step width is shorter because each new block adds 64, 128 or 512 new threads, so fewer blocks
are needed to reach the 240-core hardware limit. In the last case, 512 threads per block, each
new block adds more than 240 new threads, which implies that every new block is serialized, so
the evolution is linear. This serialization also explains why the slope is significantly higher for
blocks that contain more threads.
It’s not clear why the evolution isn’t a perfect staircase, but it may be related to the warp
(the scheduling unit) not being a multiple of the number of processors, and to other scheduling
related issues. No documentation of the thread scheduling was found, so it’s hard to understand
the real origins of this behaviour.
[Figure 3.1: FLOP test — total time of kernel execution (ms) vs number of blocks (0–2000), for
32, 64, 128 and 512 threads/block.]
In terms of FLOP performance, i.e., the number of floating point operations per second (figure
3.2), the maximum value achieved is 617 GFLOPS. This value differs from the value stated
by the hardware maker, 933 GFLOPS, because the highest value is only achievable under certain
special usage conditions (dual issue)[21]. This maximum value is not steadily attained from the
beginning for block sizes of 32, 64 and 128. Looking at the data, it is found that a total of
3840 threads leads to approximately 378 GFLOPS (60% of the peak performance). In terms of
threads per scalar core this is equivalent to 16, which is half a warp (or, from another perspective, 4
full warps per multi-processor). So it seems that, even when all the cores have work to do (i.e., more
than 240 threads), full performance is only achieved if the scheduler has 32 or more threads to take
care of. A simple model that describes the evolution seen in figure 3.2 could be given by the
following equation:

FLOP = k1 · T / (sT + k2 · tk)  →  (k1/k2) · (T/tk)   when sT ≪ k2 · tk

where sT is a constant scheduler time penalty, T is the number of threads, tk is the time the kernel
takes, and k1 and k2 are constants. With large configurations (which take longer), the scheduler
penalty becomes negligible; the total time tends to be approximately proportional to the total
number of threads because of the serialization (which means order O(n), thus proportionality).
The observable oscillations are a direct consequence of the staircase-like form previously described.
Even at steady state, some block configurations perform better than others: for the
present kernel, the configurations with more threads per block perform better.
[Figure 3.2: FLOP performance (GFLOPS) vs number of blocks (0–2000), for 32, 64, 128 and
512 threads/block.]
3.4 Bandwidth
3.4.1 Host - device transfers
A benchmark was implemented that copies blocks of data from the host to the device’s linear
memory and vice-versa using all the methods the API provides (i.e., standard malloc’ed memory,
page-locked and write-combined allocated memory, and mapped memory). In order to compare
the performance of mapped memory, an initialization loop (writing to main memory) and a final
read loop are included. The total time, normalized by the total transfer size (eq. 3.3), is evaluated
as a function of the number of elements transferred, as well as each of its parcels.

T = (twrite + thost→gpu + thost←gpu + tread) / bytes transferred    (3.3)
In the present test, host memory is allocated using the following functions: the system malloc
function, and the CUDA cudaHostAlloc function with its flags parameter equal to 0 (page-locked
memory), to cudaHostAllocWriteCombined, and to cudaHostAllocMapped. The transfers are done
using cudaMemcpy (except for mapped memory). The terms thost→gpu and thost←gpu are missing
in the mapped memory case because the data transfer is implicit, i.e., there is no explicit memcpy
command; still, a thread synchronization call is made after both the write and the read. The
initialization write and final read loops are traditional for loops, with no optimization.
Figure 3.3 shows the total time per byte (T in equation 3.3) as a function of the number of
elements transferred. For memory reserved with the usual malloc and for page-locked memory
there is a high performance penalty when transferring small quantities of data. For large transfers
(more than 10^6 elements) a stable value is reached. The write-combined mode is missing from
the figure because the comparison includes the read operation from the CPU, which is highly
expensive in this mode. Mapped memory had the best performance.
[Figure 3.3: Time of the total transfer cycle (ns/byte) vs number of floating point elements
(10^2–10^8), for malloc, page-locked and mapped memory.]
Analyzing the partial times in figure 3.4, it is understood that the PCI-express transfer is re-
sponsible for the big performance loss on small transfers. The time spent uploading data to
the device differs significantly from the downloading time for malloc’ed memory (downloading
from the device is slower). The values for the read and write operations in figure 3.4 do not
represent the system RAM bandwidth (the unit in the figure is time, but the bandwidth can be
obtained as B = 1/T; in this case B ≈ 1111 MB/s). Instead they represent just half of it, because
each loop iteration performs both a read and a write operation. Also, since a naive approach is used
in the loop (i.e., there are no explicit data transfer optimizations), just half of the bus width is
being used, due to the use of single precision. This explains the difference from the system RAM’s
theoretical bandwidth (table 3.1).
[Figure 3.4: Details of the data transfers (ns/byte) for two sizes — (a) 10^3 elements and
(b) 5·10^3 elements — split into write, host→device, device→host and read phases, for malloc,
page-lock, write-combine and mapped memory.]
3.4.2 Device - device transfers
The performance of the main intra-device transfers is evaluated, namely accesses from global
memory, texture memory and constant memory. In the first case, the benchmark is similar to
the host-device one (but only tdevice→device is considered). For the other cases, different versions
of a vector copy operation are implemented, using read operations from each one of the available
memory spaces and a write to the global memory, i.e.:
1. read from global memory and write to global memory;
2. read from texture memory and write to global memory;
3. read from constant memory and write to global memory.
Because the texture and constant memories are cached, a vector form of the sum reduction oper-
ation (eq. 3.4) is also implemented to take advantage of them; and because global memory is
not cached, a software-based cache is implemented with shared memory (no true cache
mechanism[24] is implemented; the code simply takes advantage of knowing beforehand which
memory locations will be used with higher frequency).
a(i) = Σ_{j=1}^{M} b(j),   i = 1 · · · N   (3.4)
The purpose is to cache the b vector: M consecutive reads are issued from the cache, making just
one write operation to the global memory.
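The caching idea can be sketched in serial C as follows (a minimal sketch, not the thesis code: `cached_reduction` and `TILE` are illustrative names, the local buffer plays the role of shared memory, and on the GPU the outer loop would be distributed over the thread grid):

```c
#include <stddef.h>

#define TILE 256  /* illustrative tile size; stands in for shared memory */

/* Software-cached sum reduction: a(i) = sum_{j=1..M} b(j) for i = 1..N.
 * b is staged tile-by-tile into a local buffer (the "cache"); each output
 * element is accumulated from cached reads and written to memory once. */
void cached_reduction(float *a, const float *b, size_t n, size_t m)
{
    float cache[TILE];
    for (size_t i = 0; i < n; i++) {
        float acc = 0.0f;
        for (size_t base = 0; base < m; base += TILE) {
            size_t len = (m - base < TILE) ? (m - base) : TILE;
            for (size_t k = 0; k < len; k++)   /* stage one tile into the cache */
                cache[k] = b[base + k];
            for (size_t k = 0; k < len; k++)   /* M consecutive cached reads */
                acc += cache[k];
        }
        a[i] = acc;  /* single write to global memory */
    }
}
```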
In this implementation the data is distributed in the following way: for a vector of size Nv
and Nt threads, each thread does Nv/Nt operations - or Nv/Nt + 1 if Nv is not a multiple of
Nt. Figure 3.5 is an illustrative example showing the workload distribution for Nv = 8 and
Nt = 3: threads 0 and 1 process 3 data elements each (8/3 + 1) and thread 2 only 2 data elements.
The numbers in the corners represent the loop sequence in each thread. This approach was chosen
because it respects the coalescing considerations made in the CUDA manual [25, sec. 5.1.2.1],
which reduce memory transactions7. The grid configuration is once again calculated using
equation 3.2 and the number of blocks is limited to 4096.
6 In fact, no true cache mechanism [24] was implemented. The code takes advantage of knowing beforehand which
memories will be used with higher frequency.
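The distribution above can be written compactly in C (a sketch; `count_for_thread` and `element_index` are hypothetical helper names, and the strided index is what makes consecutive threads touch consecutive addresses, the pattern the coalescing rules favour):

```c
#include <stddef.h>

/* Number of elements assigned to one thread when nv elements are split
 * among nt threads: the first (nv % nt) threads get one extra element. */
size_t count_for_thread(size_t tid, size_t nv, size_t nt)
{
    size_t base = nv / nt;
    return tid < nv % nt ? base + 1 : base;
}

/* Strided assignment: in loop iteration `loop`, thread `tid` handles
 * element loop * nt + tid, so threads access consecutive addresses. */
size_t element_index(size_t tid, size_t loop, size_t nt)
{
    return loop * nt + tid;
}
```

For the example in figure 3.5 (Nv = 8, Nt = 3), this gives 3, 3 and 2 elements for threads 0, 1 and 2.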
Figure 3.5: Workload distribution for a vector of size 8 and 3 threads
Figure 3.6a presents the evolution of the global memory performance for the float and float3
data types (normalized by the theoretical maximum value) relative to the simple vector copy
operation. Because of the limitations in the constant memory sizes and in texture addressing with
CUDA arrays (see section 2.2), a test with small vector sizes is implemented; the results are shown
in figure 3.6b, detailing all memory access methods.
[Figure: bandwidth as a percentage of the bus maximum versus number of vector elements. (a) General perspective: 0-80%, 10^3 to 10^8 elements, for float3 and float. (b) Small sizes: 0-8%, 0 to 20k elements, for global, texture and constant memory.]
Figure 3.6: Intra device memory transfers
The most important observation in the vector copy test is that the device memory performance
degrades severely for small vector dimensions (figures 3.6a and 3.6b). The theoretical
bandwidth is 102 GB/s but, for small sizes (N < 16k), only 7 GB/s or less are achieved. No
mention of this issue was found in the researched literature. This performance disruption seems
to be the reason why both curves in figure 3.6a intersect: the float3 performance was expected
to be always worse than the float case because of non-coalesced memory transactions. In fact,
the time taken by the float3 test is always greater than the float test - it is only faster because
there is more information (3 times more) per transaction. This behaviour seems analogous to the
FLOP test, where there was a minimum number of threads to achieve full performance. In this
case there seems to be some kind of barrier in the number of transactions per thread, but the
results do not show a clear number. Also, coalescing shouldn't be affected by the number of
transactions but only by the access pattern. Nevertheless, the test scales in performance and,
for sufficiently large vector sizes (N > 2^20), significant performance is achieved (B > 60 GB/s).
A maximum of 84 GB/s (83%) was reached with 64 bit data types (double precision floating point,
dual single precision float2, or long integer) and a vector size of 2^27 elements. The effect of
the non-coalesced memory transactions is the loss of performance, clearly shown in figure 3.6a
for large vectors as the gap between the two lines. In [35] higher performance (89%) is reported,
but on different devices.
7 Coalescing memory transactions depends not only on the access pattern but also on the device capability.
In terms of relative performance between each access type, the direct access to the global
memory and the access through texture cache perform identically, but access to the constant
memory is slower.
In the cached access test the same vector dimensions were used. The algorithm was implemented
with M = N in equation 3.4. The bandwidth shown is calculated as B = 4N(N+1)/∆T. The results
are significantly different from the previous test: even with small sets, and for all access
types, the performance exceeded the memory's bandwidth. Comparing the non-cached result (black
continuous line) with the previous test, only coalescing of the accesses to each b(j) (a
coordinated broadcast pattern) may explain the boost in performance (up to 126%). For the
cached accesses, the implemented software cache in shared memory presents by far the best
results for large sizes, attaining a peak of 287 GB/s (280%). Access through texture memory also
outperformed the global memory limit, by a factor of 178%.
With small sizes, the constant memory accesses can also be compared and, for this access
pattern, they revealed to be as fast as the shared memory, in opposition to the previous test,
which showed worse performance for constant memory.
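The effective-bandwidth formula follows directly from the access count: with M = N the reduction performs N² reads and N writes of 4-byte single precision elements. A minimal sketch (`effective_bandwidth` is an illustrative name):

```c
/* Effective bandwidth of the cached reduction with M = N:
 * 4 bytes per element, N*N reads plus N writes, over dt seconds. */
double effective_bandwidth(double n, double dt_seconds)
{
    return 4.0 * n * (n + 1.0) / dt_seconds;  /* bytes per second */
}
```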
[Figure: bandwidth as a percentage of the bus maximum (0-300%) versus number of vector elements. (a) General perspective: 10^3 to 10^7 elements, for global, shared and texture memory. (b) Small sizes: 0 to 20k elements, for global, shared, texture and constant memory.]
Figure 3.7: Intra device memory transfers w/ cache
3.5 Stream benchmark
The Stream benchmark consists of 4 vector operations: vector copy, product by scalar, vector
sum and vector sum plus scalar product (operations given in table 3.3). This set of simple opera-
tions allows us to evaluate the performance of bandwidth-bound algorithms in the GPU context
(balance > 0.03). The performance is evaluated as a function of the number of blocks and the
block configuration. The previous results showed completely different performances for small and
large problems, so two different vector sizes are evaluated.
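In serial C, the four Stream operations reduce to the following loops (a sketch of the standard kernel definitions; the CUDA version distributes the iterations among threads as described in section 3.4.2, and q denotes the scalar):

```c
#include <stddef.h>

/* The four Stream kernels on vectors of length n. */
void stream_copy(float *c, const float *a, size_t n)
{ for (size_t i = 0; i < n; i++) c[i] = a[i]; }

void stream_scale(float *b, const float *c, float q, size_t n)
{ for (size_t i = 0; i < n; i++) b[i] = q * c[i]; }

void stream_add(float *c, const float *a, const float *b, size_t n)
{ for (size_t i = 0; i < n; i++) c[i] = a[i] + b[i]; }

void stream_triad(float *a, const float *b, const float *c, float q, size_t n)
{ for (size_t i = 0; i < n; i++) a[i] = b[i] + q * c[i]; }
```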
Since the main results of the Stream copy were already presented in section 3.4.2, and since the
general results are similar, only the scale and triad tests are now presented. Figures 3.8a and
3.8b represent the evolution of the bandwidth for vector sizes of 2^15 and 2^27 elements. The
phenomenon of bad performance for small sizes persists: the peak performance depends on the
kernel configuration and varies between 13.7 GB/s (for block sizes 32, 64 and 512) and 14.6 GB/s
(for 128 and 256). One thing that representing the data as a function of grid size doesn't show
is that the peak has one parameter in common for all configurations: 256 threads per
multi-processor. With 256 threads per multi-processor there are exactly 32 threads per scalar
core, which is the size of the warp; so it seems a match of the hardware parameters with the
problem. This result is identical to the FLOP test, but now with memory transactions in play.
The drop in performance after the peak is also explained by the excessive number of threads:
for the 256 and 512 block configurations (even with 256 threads per multi-processor) there are
more threads than elements to compute, so it is guaranteed that some threads do nothing.
The logic of having 1 thread per element doesn't hold: for example, for 32 threads per block the
best performance is achieved with 4 elements per thread (this is expected because of the device
balance: theoretically, 33 floating point operations per memory transaction would be necessary
to achieve neutrality).
For large vector dimensions (in the scale test, figure 3.8b) a peak of 81 GB/s was achieved with
64 threads per block. The curves are essentially flat. At the beginning of the curves there are
too few threads: performances above 70 GB/s always have 7680 (i.e., 256 threads per
multi-processor) or more threads, which means 17500 or fewer elements per thread; at the end, the
excessive-thread condition is again responsible for the performance disruption. The flat shape
has an important implication for the assignment of tasks to threads: above a certain number of
threads there is no performance gain in launching more threads.
In the triad test for small sizes, better peak performances are achieved (19.6 - 20.8 GB/s), but
256 threads per multi-processor is again the configuration that performs best.
In the last test (triad with big vector sizes) there is an odd result with no explanation: the
worst configuration for the scale test (block size of 32 threads) is now the best.
[Figure: bandwidth (MB/s) versus number of blocks (10^1 to 10^5) for block sizes 32, 64, 128, 256 and 512. (a) Scale test, N = 2^15 (0-15k MB/s); (b) Scale test, N = 2^27 (0-100k MB/s); (c) Triad test, N = 2^15 (0-23k MB/s); (d) Triad test, N = 2^27 (0-100k MB/s).]
Figure 3.8: Stream benchmark
Finally, the host CPU performance and the device performance are compared (host-device transfers
are not considered). The results are presented in table 3.5.
                      N = 2 · 10^3                     N = 2 · 10^6
operation   cpu (MB/s)  gpu (MB/s)  speedup   cpu (MB/s)  gpu (MB/s)  speedup
copy           2314        2003       0.87       3245       74482       22.9
scale          2274        1973       0.87       3181       74814       23.5
add            3247        3004       0.93       3188       77136       24.2
triad          3247        3004       0.93       3310       76666       23.1
Table 3.5: Stream benchmark results
3.6 Summary
The two main objectives of the current chapter are to show if the new concepts and technologies
were successfully acquired and to clarify some aspects less documented. In the first section a raw
performance test was passed and the most important result is that, independently of the block
configuration, full performance isn’t achieved from using 240 threads (1 thread per scalar core)
but only from 8 full warps (16 times more threads than physical scalar cores) or 4800 threads.
Regarding the device ↔ host communication, it was concluded that one of two approaches are
recommend:ed either using mapped memory or making big transfers. By using mapped memory
it’s possible to completely hide the latency of the transfer but it should be carefully used to
avoid race conditions. By making big transfers, all the initial costs are diluted and maximum
performance is achieved. Within device transfers the most important result is that full performance
is only achieved by using a massive number of transfers. When memory transfers can benefit from
a cache, the choice between the available methods has to be oriented towards the actual problem.
Lastly the Stream benchmark results were presented and two remarks have to be made: first, the
small problems (in a GPU scale) are very sensitive to a correct number of threads because their
size is similar (and not massively bigger) than the GPU hardware parameters. Second, for large
problem sizes it’s irrelevant to launch more threads, because the process is fully serialized and full
performance is achieved. When the results are compared with the CPU, all can be summed in
one conclusion: if the problem is small it should be computed by the CPU, otherwise significant
speedups can be achieved with GPUs.
Chapter 4
Burgers equation solver
4.1 Mathematical Model
The model used to test the computational performance speedup of GPGPU programming is a
linearized version of the Burgers equation, or uni-directional transport equation:

∂u/∂t + U0 ∂u/∂x = ν ∂²u/∂x²   (4.1)

where U0 and ν are real constants and u is a continuous field.
4.2 Computational Model
To solve this equation for u, equation 4.1 is first rewritten in explicit form (equation 4.2):

F(t, x) = ∂u/∂t = ν ∂²u/∂x² − U0 ∂u/∂x   (4.2)
4.2.1 Computational Methods
Time Integration
As shown by equation 4.2, this is an initial value problem. To solve it, the classic 4th order
Runge-Kutta method[13] is used.
u^{n+1} = u^n + (∆t/6)(u′1 + 2u′2 + 2u′3 + u′4)   (4.3)
u′1 = f(t_n, u^n)   (4.4)
u′2 = f(t_n + ∆t/2, u^n + (∆t/2) u′1)   (4.5)
u′3 = f(t_n + ∆t/2, u^n + (∆t/2) u′2)   (4.6)
u′4 = f(t_n + ∆t, u^n + ∆t u′3)   (4.7)
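A minimal serial sketch of one scalar RK4 step (`rk4_step` is an illustrative name; the thesis applies the same scheme component-wise to the whole field u):

```c
/* One classic 4th order Runge-Kutta step for du/dt = f(t, u). */
double rk4_step(double (*f)(double, double), double t, double u, double dt)
{
    double k1 = f(t, u);
    double k2 = f(t + dt / 2.0, u + dt / 2.0 * k1);
    double k3 = f(t + dt / 2.0, u + dt / 2.0 * k2);
    double k4 = f(t + dt, u + dt * k3);
    return u + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4);
}
```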
Spatial Differentiation
The spatial derivatives (first and second) are calculated using 4th order finite difference compact
schemes [29]. These methods are a particular case of central difference schemes, and the derivatives
are calculated by solving a linear system Ax = b, where A is an N × N matrix and x and b are
N-element vectors. For the particular case, the compact scheme methods are best represented by
equations 4.8 and 4.9:

A1 u_x = B1 u   (4.8)
A2 u_xx = B2 u   (4.9)

The A and B matrices of the compact scheme methods are band matrices; in particular, the A
matrices of the 4th order compact schemes are pentadiagonal. However, the approach taken in the
present research is a dense algebra one.
In matrix form, the problem is now formulated as:

u′ ≈ ν A2^{-1} B2 u − U0 A1^{-1} B1 u   (4.10)
Computational Domain
The computational domain of the problem is defined by the following constraints:
• the domain of the spatial coordinate is normalized: x ∈ [0, 1];
• x is a uniform mesh of N points (hence ∆x = 1/(N − 1));
• the time step, ∆t, is constant and given by the Courant number (eq. 4.11);
• the simulation has NT time steps.
C = U0 ∆t / ∆x   (4.11)

The problem's constants:
• U0 is normalized (U0 = 1);
• the viscosity coefficient ν is calculated as a function of the grid Fourier number1 (eq. 4.12):

F = ν ∆t / ∆x²   (4.12)

Equations 4.11 and 4.12 define adimensional parameters derived from the finite difference
discretization applied to the transport equation. The imposed conditions are within the
limits that ensure numerical stability.
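Given the normalized domain and U0 = 1, the time step and viscosity follow directly from the Courant and Fourier numbers (a sketch; `derive_parameters` is a hypothetical helper, shown with the thesis values C = 0.3, F = 0.1 as an example):

```c
/* Derive dx, dt and nu from the Courant and Fourier numbers on a
 * uniform, normalized mesh of n points with U0 = 1. */
void derive_parameters(int n, double courant, double fourier,
                       double *dx, double *dt, double *nu)
{
    *dx = 1.0 / (n - 1);                     /* uniform mesh on [0, 1] */
    *dt = courant * (*dx);                   /* C = U0 * dt / dx, U0 = 1 */
    *nu = fourier * (*dx) * (*dx) / (*dt);   /* F = nu * dt / dx^2 */
}
```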
Boundary Conditions
As stated, the compact scheme methods are a particular case of central differences, so special
considerations have to be taken at both boundaries of the spatial domain (u(t, x = 0) and
u(t, x = 1)). On the right side, a null Dirichlet condition is imposed, i.e., u(t, x = 1) = 0. On
the opposite side, the order reduction presented in [29] was implemented, and at this boundary
the problem is represented by a forward 3rd order scheme.
4.3 Implementation
The implementation consists of two versions of the code: a C based serial (single processor)
version and a CUDA based version. The code of the two versions is kept as similar as possible.
The code is structured in the following layers (bottom first):
1. algebra operations;
2. numerical integration and differentiation;
3. simulation.
1 Even though calculating the physical constant isn't a natural practice in fluid mechanics problems - where the
objective is to calculate the flow for a given fluid - the focus of the current work is computational; so the
problem's parameters are computed as a function of computationally significant quantities (such as the number of
points and iterations) and of numerical stability.
For the algebra operations layer, the ATLAS2 implementation of the LAPACK and BLAS libraries
is used in the serial version. In the CUDA version there are calls to routines from the ATLAS
project and from the CuBLAS library.
For the numerical methods layer, a library was created. The design of this library follows an
object oriented philosophy. To save resources (memory and time), the current implementation
of both libraries is not thread safe (thread safety here refers to host threads): there are
non-reentrant functions (static variables are used). The simulation layer is a program that uses
both layers.
The memory is allocated during the initialization of the program, minimizing the number of
allocations and ensuring that the distinct allocator implementations (the host's malloc and the
device's malloc) and their implications interfere as little as possible.
4.3.1 Data structures
The data structures used reflect the first two layers. The main data structure for the algebra
operations is the single precision (32 bit) floating point array. Two data structures were
created to store the configuration data (orders, matrices, pivot arrays) for the Runge-Kutta and
compact scheme methods.
4.3.2 Program and algorithms
The core of the simulation is a simple sequential loop in the time variable. In each loop
iteration, the velocity field u is updated by the Runge-Kutta integration. All data is logged into
memory and dumped to a file at the end. The pseudocode for the program is shown in figure 4.1.
In the CUDA version, after the problem initialization, all the necessary data is copied to the
GPU memory. Only at the end is the data downloaded back into the main memory.
For the linear system solver, two direct methods were considered: an LU (with partial piv-
oting) solver and a matrix inversion solver. In both methods all constants are computed during
initialization, always using a serial CPU method: the LU solver computes the pivots, L and U
during initialization; the A^{-1} matrices are computed once (using the CPU) during
initialization. Using an explicit scheme was considered (replacing the compact schemes) but,
inside the main loop, it would be equivalent to the inverse approach in terms of computations.
2 http://math-atlas.sourceforge.net/
begin
init compact()
init RK4()
init u0()
for n := 0 to NT{
t = n ∗ dt;
u = RK4(t, u0, F (u));
store(u);
u0 = u;
}
dump();
where
proc F(u) ≡
ν ∗ derivative2(u) − U0 ∗ derivative1(u).
end
Figure 4.1: Main program
Solving an LU factorized linear system on GPU
The only routine inside the main loop that wasn't implemented by any CUDA based package
is the equivalent of the LAPACK sgetrs routine. This routine forms a pair with the sgetrf routine:
sgetrf computes the LU factorization with partial pivoting of a general M × N matrix, and sgetrs
solves a linear system AX = B with that previously factorized A matrix.
The netlib version of the sgetrs function is used as a guide to port the function to the CUDA
architecture. The netlib version of the routine makes calls to BLAS routines that are
available in the CuBLAS package, so those were used. It also calls a LAPACK internal routine,
slaswp, which had to be implemented.
The slaswp is a routine that applies a given permutation (in the form of row interchanges) to a
matrix. It receives an integer array with the indices of the rows that need to be permuted and
the matrix to operate on. The algorithm, as implemented by LAPACK, is inherently sequential
because the order of the row interchanges matters: in the LAPACK standard, the indices in the
pivot array returned by the sgetrf routine may be repeated. This leads to differences if the
interchanges are applied in different orders. An example is illustrated in figure 4.2: it's a
simple case where the pivot vector is full of ones. If the predetermined (sequential) order isn't
followed, the final output will differ, leading to an erroneous solution. If a naive
decomposition is done, the order of access isn't known and, for example, the solution of case 1
in the figure (where the thread responsible for interchanging the second row is the first to do
it, then the one responsible for the first row, and then the one for the third row) is different
from case 2. However, the columns of the solution matrix are completely independent, so a
decomposition may be done mapping each task to a column. For square matrices the order gets
reduced from O(N²) to O(N²/p)3. For the vector case (i.e., when the size of the matrix is N × 1)
there is no performance gain. Texture memory is used to access the pivot vector. This approach
was chosen mainly because of the degree of flexibility that it presents.
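A serial sketch of the column-oriented decomposition (a minimal illustration, not the thesis code: `slaswp_by_columns` is a hypothetical name, the outer column loop is the dimension that would be mapped to CUDA threads, and `ipiv` is taken 0-based here, unlike LAPACK's 1-based convention):

```c
#include <stddef.h>

/* Apply LAPACK-style row interchanges to an n x m column-major matrix.
 * Columns are independent, so the outer loop is the parallel dimension;
 * within one column the interchanges must follow the sequential order,
 * because repeated pivot indices make the order significant. */
void slaswp_by_columns(float *a, int n, int m, const int *ipiv)
{
    for (int col = 0; col < m; col++) {        /* parallelizable loop */
        float *c = a + (size_t)col * n;        /* start of this column */
        for (int i = 0; i < n; i++) {          /* sequential interchanges */
            int p = ipiv[i];
            if (p != i) { float t = c[i]; c[i] = c[p]; c[p] = t; }
        }
    }
}
```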
Initial condition: A = (a b c / d e f / g h i) and P = [1, 1, 1].
Correct solution (with predetermined order 3 → 2 → 1): A = (g h i / a b c / d e f).
1. sequence: 2 → 1 → 3:
(a b c / d e f / g h i) -(init,2)-> (d e f / a b c / g h i) -(2,1)-> (d e f / a b c / g h i) -(1,3)-> (g h i / a b c / d e f)
2. sequence: 1 → 3 → 2:
(a b c / d e f / g h i) -(init,1)-> (a b c / d e f / g h i) -(1,3)-> (g h i / d e f / a b c) -(3,2)-> (d e f / g h i / a b c)
Figure 4.2: Problem of the row interchange order
If, for example, the pivot vector were computed in a way that no repetitions exist, the order of
substitution wouldn't be relevant, the task decomposition could be row oriented, and the
algorithm further optimized. A drawback of this approach is the usage of more memory: there is
an inherent race condition, since the row interchange isn't an atomic operation and can lead to
incoherent states. To prevent this from happening, the values are updated on a distinct block of
memory.
3 p is the number of processors.
4.3.3 Computational Resources
Processing considerations
The Runge-Kutta method uses the following numbers of calculations:
• the field F(t, x) in equation 4.2 is computed 4 times;
• 6 scalar-vector products;
• 7 vector sums.
Each time equation 4.2 is computed, the program needs to compute two derivatives (each
derivative requires one vector-matrix product and one linear system solution, which depends on
the solver, LU or inverse) and:
• 2 scalar-vector products;
• 1 vector sum.
Table 4.1 shows all the operations done per time iteration.
Table 4.1: Summary of computational operations
Operation LU inverse
scalar-vector product 10 10
vector-matrix product 1 3
vector sum 9 9
LU solve 2 0
All the previous operations except the LU solve fit the massively parallel processing
paradigm, as shown in the previous chapter. The LU solver is detailed in section 4.3.2.
Memory considerations
Table 4.2 gives a model of the memory usage. This model doesn't account for the temporary
memory used within the algebra routines.
Table 4.2: memory usage in floating point elements
Data memory
A1,A2,B1,B2 4N2
temporary memory 5N
simulation log NT ×N
total N(4N + 5 +NT )
The current GPUs have at least 512MB of memory so there are no constraints: a simulation
with N = NT = 1000 is expected to use as little as 20MB.
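The model in table 4.2 can be evaluated directly (a sketch; `memory_bytes` is an illustrative name, and each single precision element occupies 4 bytes):

```c
/* Memory usage of the solver, in bytes, following table 4.2:
 * 4*N^2 elements for the A/B matrices, 5*N temporaries, NT*N for the log. */
double memory_bytes(double n, double nt)
{
    double elements = n * (4.0 * n + 5.0 + nt);
    return 4.0 * elements;  /* 4 bytes per single precision element */
}
```

For N = NT = 1000 this gives 20,020,000 bytes, i.e. the roughly 20 MB quoted above.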
4.4 Metrics
The essential metric in this work is the time ratio between the serial version and the CUDA
version (equation 4.13):

G = t_serial / t_cuda   (4.13)

Since the initialization is always done on the CPU, another important measure for the CUDA
version is the ratio between the initialization time and the total time (equation 4.14):

R = t_init / t_total   (4.14)
The time measurements of the main loop include the data transfer from the device at the end of
the program.
Since the hardware architectures (i.e., the Intel CPU and the NVIDIA GPU) differ, their imple-
mentations of the IEEE-754 floating point standard may differ. Moreover, for performance
reasons, GPUs don't implement all functions in a compliant manner [25, A2]. It's important to
check whether the differences between the two solutions of the same problem are negligible. To
measure the distance between both solutions, the mean square error is used. It is computed along
the line for a given time, and its average and maximum are taken, i.e., the average and maximum
of equation 4.15 (where C_i^n is the solution point for (t, x) = (n, i) computed with the CPU and
G_i^n the one computed with the GPU):

MSE_n = (1/N_x) Σ_{i=1}^{N_x} (C_i^n − G_i^n)²   (4.15)
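A minimal sketch of the per-time-step comparison (`mse_line` is an illustrative name; accumulation is done in double to avoid losing precision in the sum):

```c
/* Mean square error between the CPU solution c and the GPU solution g
 * along one line of nx points (equation 4.15). */
double mse_line(const float *c, const float *g, int nx)
{
    double acc = 0.0;
    for (int i = 0; i < nx; i++) {
        double d = (double)c[i] - (double)g[i];
        acc += d * d;
    }
    return acc / nx;
}
```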
4.5 Simulation Results
All the simulations respected the following constraints:
• Courant number: C = 0.3;
• Fourier number: F = 0.1;
• number of time steps: 500.
Performance is evaluated as a function of the problem size.
4.5.1 Performance
In figure 4.3 the evolution of the speedup of both methods is represented. The left axis shows
the scale for the inverse method, while the right axis shows the scale for the LU method.
The results are significantly different for each method. The speedup obtained with the CUDA
version of the inverse method is always greater than unity: the lowest value is 6.5 (N = 500)
and the maximum is 15.2 (N = 2000). With the LU method, a speedup of 1 was achieved in only one
case; in all other cases the performance was poorer when compared with the serial version.
Another important measure is the comparison of the best method on each platform, i.e., the
inverse method on the GPU against the LU method on the CPU (T_C,lu / T_G,inv), which is
represented in figure 4.3 by the dashed blue line and the left scale. The computation is now 3.5
to 10.8 times faster.
[Figure: speedup versus problem size (0 to 5000). Left axis (0-16): inverse solver and best-case comparison; right axis (0.4-1.2): LU solver.]
Figure 4.3: Absolute speedup
The evolution of the time ratio between the initialization part and the total simulation (done
on the GPU) is represented in figure 4.4. The left axis shows the scale for the inverse method
and the right axis the scale for the LU method. As can be seen, as the number of points grows,
the initial inversion becomes very significant. With N = 500 the ratio is about 30% of the
total; with N = 2000 (the speedup peak), the initialization takes 57% of the total; and for
N = 5000 it is 82%. With the LU method the evolution is completely different, as the
initialization fraction is negligible (< 3%). The same figure presents the ratio between the
initialization part of the inverse method done on the CPU and on the GPU
(T_C,init-inv / T_G,init-inv), on the left axis scale. This relation essentially represents the
weight of the PCI-Express transfer: the distance of the curve from 1 represents the difference
between the initialization runtime of each version (in the code, the only differences are the
memory allocation and the transfer to the device). As seen in the figure, for N = 500 it takes
around 25% of the time; for N ≥ 2000 it represents no more than 3%. This clearly explains the
performance disruption shown in figure 4.3: the initialization fraction becomes the predominant
one and, because only a very small fraction of the time is spent on bus transfers, the inversion
becomes the main computational problem.
[Figure: initialization fraction versus problem size (0 to 5000). Left axis (20%-100%): inverse solver and GPU/CPU inverse ratio; right axis (0.75%-2.75%): LU solver.]
Figure 4.4: Initialization ratio
The initialization is now known to be a big burden in the problem. Figure 4.5 represents what
would be expected if the initialization became insignificant. This could be achieved in two
ways: by using an explicit scheme for the derivatives, or by increasing the number of time
iterations (thus diminishing the sequential part of the problem). The speedup of the inverse
method increases steadily up to a maximum of 43.3. The LU method maintains the expected
behaviour.
[Figure: loop-only speedup versus problem size (0 to 5000). Left axis (0-50): inverse solver; right axis (0.4-1.2): LU solver.]
Figure 4.5: Loop speedup
Mixed Strategy Approach
From the previous results is known that, in the inverse method, performance is greatly affected
by the computation of the inverse itself. More, it’s also known that the LU method should perform
better on systems which the relation rows/columns of the solution matrix B is equal to 1 or less.
To compute the inverse of a matrix is to solve a particular linear system for which that relation is
exactly 1. And, in fact, the method used to compute the inverse matrix is the LU method: first the
matrix to be inverted is factorized into the L and U matrices and then, the system LUA−1 = I,
where I is the identity matrix. The A matrix is still computed on the CPU (using the LAPACK
routine) but the inverse computation is done in the GPU. So, when compared to the compact
scheme linear system, it should perform much better.
In the figure 4.7 the results from the new method are presented. The black continuous line
(referred to the left axis scale) refers to the new initialization fraction. This fraction now belongs
to the interval 14% to 60% (instead of 30% to 82% ). When comparing directly the initialization
speedups (blue-dashed line and right side scale), the best case is for N = 1500, where the gain is
nearly 4.1. Then, the gains continuously decreases (in the window of observation it goes to 3.1).
[Figure: GPU inverse initialization ratio (left axis, 10%-60%) and initialization speedup (right axis, 3-4.25) versus problem size (0 to 5000).]
Figure 4.6: Initialization with the inverse computed on the GPU
Finally, figure 4.7 presents the global speedup obtained: the left axis shows the speedup
compared with the case of computing the inverse matrix on the CPU (black continuous line). There
are two remarks to make (both a consequence of the initialization being a considerable
fraction):
1. the performance gain is always greater than 1, which means that, in the end, a performance
boost was achieved for every problem size;
2. the boost is continuously increasing - even if at a slow rate - which means that the
initialization burden was somewhat mitigated.
The updated best-method comparison is presented in the same figure with the blue dashed line
and the right axis scale: the results were boosted, as the black continuous line had shown.
The speedups are now between 4.3 and 18.8.
[Figure: speedup versus problem size (0 to 5000). Left axis (1.2-2.4): CPU inverse versus GPU inverse; right axis (4-20): CPU LU versus GPU inverse.]
Figure 4.7: Speedup with the inverse computed on the GPU
4.5.2 Numeric errors
All the solutions given by the GPU were compared with the CPU ones. While differences do exist,
they are negligible: all errors, accounted for by using equation 4.15, were less than 0.1%. The
only exceptions (where the code became unstable) were the inverse method on the CPU with sizes
N = 4500 and N = 5000.
However - even though it couldn't be determined precisely in which situations, and there is only
the guarantee, given the verification done, that the simulation results are not affected - the
implemented LU solver presents some instabilities for some A matrices.
4.6 Summary
The present chapter objective is to implement and present all the knowledge acquired during
this research in a test case. A brief presentation of the numeric methods behind the implemented
solution of the unidimensional transport equation were presented. Two direct methods for solv-
ing linear systems were implemented and compared with the sequential solution. An additional
method was used and the speedup was increased. The knowledge acquired from the previous chap-
ter was essential since it provided a practical framework of experience in terms of block number
and configuration and memory transfers (either host-device transfers or intra device transfers).
Because of its novelty, there is no known literature to compare results of direct dense linear solvers
45
using GPUs. However, the mixed approach 18.8 result is quite promising. It was also shown that
the old sequential best methods may not be as good in parallel approaches.
Chapter 5
Conclusion
5.1 Summary
The work developed in the present thesis investigates the potential of GPUs as scientific
computing devices and, in particular, the usage of GPUs in the solution of the uni dimensional
convection-diffusion problem. The motivation is clear: currently, the only way to significantly
increase performance in scientific problem solving - whether the objective is to solve more
problems in the same time or to solve larger problems, increasing the precision or the size of
the problem - is to go parallel. GPUs are a low cost solution when compared with the other
choices available on the market.
In the present work, the concepts related to parallel computing (as well as their implementation
on the devices) were studied.
The platform was tested in all major aspects: processing, communication with the host com-
puter and in-device memory transfers. The results were compared with equivalent operations done
on the CPU. The inherent complexity of parallel systems results in many configurations and pos-
sible strategies. The combinations that achieve higher performances are related with the hardware
itself.
Finally, a particular problem to be solved using GPU-based technologies was presented. Be-
cause the technology is new, there is not yet a software framework equivalent to the one existing
for serial computing. Two direct methods were studied in order to solve a linear system: an
LU-based method and the inverse matrix method. The LU-based method shows very poor per-
formance for linear systems whose solution matrix has many more rows than columns. The
inverse-based approach achieves significant speedups. To improve the inverse method's performance
further, the implemented LU method was used to invert the matrix. This approach resulted in
still better performance.
5.2 Conclusions
The main objective of this work was to investigate whether a class of problems in the compu-
tational fluid dynamics domain could benefit from the possibilities opened by GPU-based computing.
This objective was accomplished, as speedups between 4 and 18 were obtained.
The study of the parallel computing paradigm and of its influence on the device's programming
model stems from the fact that the use of GPUs for scientific problem solving is still a novelty.
Because of the device's design, the best performance is generally obtained with massive problems.
Below that scale there is a global serialization effect: even though the programming style is parallel,
whenever parallel-like behaviour is visible (i.e., whenever a clear order-of-magnitude reduction in
computation time appears), the code is not yet benefiting from the full potential of the GPU.
This size factor means that, with the hardware used, communication between the host and the
device becomes progressively less significant when compared with the traditional sequential access
pattern to the variables. This gap opens the possibility of completely hiding the bus transfer,
which is easily and transparently achieved by using mapped memory (one of CUDA's features).
However, using mapped memory (and, in a general sense, asynchronous operations)
may pose additional problems, as race conditions may occur (simultaneous accesses to the same
memory region by the host and the device).
With respect to the device's memory system, major differences from the host memory were
observed: in current (multi-core) computers the system memory bus is shared by 4 cores,
while in the GPU the equivalent bus is shared among at least 8 scalar cores (as a concrete example,
the GPU used in this work has 240 scalar cores). This implies that knowing
how to exploit the device's memory system has a significant impact on the results obtained. It
was verified that using memory access patterns the device is able to coalesce (serving the
memory requests of many threads with a single memory transaction) is crucial for achieving
maximum performance. The number of requests is also important, however, and the
access pattern becomes less important for a small number of requests. When repeated access to a
vector variable is needed, there are several ways to obtain cached (and therefore faster) access to it:
using the hardware texture and constant caches (read access only) or implementing a cache
mechanism with shared memory. Each of them has its benefits and limitations. Regarding
constant memory: it has to be statically defined and its size is relatively small (64 KB), so it is
impossible to use it on large problems. The cached access for the studied cases performed as well
as shared memory. Regarding texture memory: since it is possible to use linear memory
as a texture, it can be used in most cases. Access through the texture cache performed worse
than constant or shared memory, but was faster than the direct use of global memory. Lastly,
shared memory was the fastest memory but poses one big problem: implementing the cache
mechanism in software is not a trivial task, and because of bank conflicts the access to shared memory
can become serialized and thus slower. When compared with typical cache memories¹ (which
have hard-coded hardware strategies defining which data is kept in the cache), shared
memory presents the advantage that, even at a high cost, intelligent
cache strategies oriented towards the algorithm itself can be implemented.
Finally, to achieve the main goal of the present work, a one-dimensional convection-diffusion
transport equation solver was implemented. A particular finite difference scheme (the
compact scheme) was used for the spatial derivatives, and an explicit iterative method (fourth-order
Runge-Kutta) was used for the time derivative. The partial differential equation is thereby transformed
into a specific linear algebra problem: solving a linear system. Two direct methods for solving the linear
systems were compared: an LU method and the inverse method. Both methods allow the
initial constants to be computed on the host, transferred once, and then used exclusively on the device
during the main loop. The LU method (widely used and known to be a fast method on the
CPU), as standardized in LAPACK, is inherently sequential with respect to the rows but can be
parallelized with respect to the columns. The inverse method, on the other hand, is slower in CPU
implementations. The implemented LU method performs poorly on the device for this problem,
since its solution is just one column. The inverse method outperforms the LU method, which clearly
shows that how well an algorithm suits sequential computation is uncorrelated with its performance
in parallel computing.
The fact that matrix inversion on the CPU is an expensive computation means that the
method's scalability is compromised for problems with sizes larger than 2000. This cost also completely
hides the eventual latency added by data transfers to the device. The way the inverse is
computed on the CPU (using the same factorization as in the LU case) is the slowest part of
obtaining the inverse matrix. This fact, together with having a linear system solver already
implemented, led to using that solver to invert the matrix: the factorization is still done on
the CPU, but the process of obtaining the inverse matrix (a particular linear system
solving problem in which the number of right-hand-side columns equals the number of rows) is done on the GPU.
This strategy resulted in a performance increase by a factor of approximately 2 when compared with
the previous strategy of computing the inverse on the CPU.

¹ For example, the CPU caches and the constant and texture caches on each multiprocessor.
5.3 Future Work
The use of GPUs in scientific computing is a completely new world, and compared with
CPU approaches much remains to be done in many directions. The knowledge acquired in this
thesis opens up a range of optimizations to be applied to the code developed, but that is just a
small fraction of what could be done. Looking ahead, the following ideas are suggested:
• improve the knowledge of the device's scheduler, as a complete understanding of it will lead
to better performance;
• test with more devices. The empirical knowledge gained in this work should be confirmed
on other devices (of different sizes and capabilities);
• algorithms that constantly need to transfer memory to and from the device were not studied.
The literature mentions several benefits of mixed approaches, in the form of higher
performance or of improved precision;
• when solving the Burgers equation, a final massive data download is done. Even if the main
performance obstacle is the initialization fraction, higher performance could be achieved if
asynchronous (but smaller) data transfers were made within the loop itself, hiding this cost
completely;
• a general dense direct approach to solving the linear system was selected. Other
methods should therefore be studied, namely iterative and banded methods;
• the bottleneck in the LU system should be clearly identified. There is also ongoing work on factorizations [36],
so the performance can be significantly improved;
• the numeric instabilities of the LU solver should be studied and understood in depth, as the
implications can be of great importance if results obtained on the CPU are blindly mixed with
results obtained on the GPU;
• the nature of the LU solver is clearly suited to the 2.5D problem, in which one method
governs one dimension and another method is responsible for the other two di-
mensions (the transversal section). This problem leads to a right-hand side of AX = B in which the
number of columns of B is proportional to the area of the transversal section;
• in the current solution, improved performance could be obtained by computing the first and second
derivatives at the same time and, should the results make it pertinent, by using the CPU
and the GPU simultaneously so as to exploit the system as a whole;
• the use of multiple GPUs was not explored. This strategy poses the problem of sharing the PCI-
Express bus, which must be dealt with;
• clustering GPUs to solve even larger problems;
• due to time constraints, a similar solution using typical cluster technologies was not imple-
mented. It would have been important to compare both parallel solutions.
Bibliography
[1] Ram Meenakshisundaram's Transputer home page. http://www.classiccmp.org/transputer/atw800.htm.
[2] STREAM benchmark: counting of data transfers. http://www.cs.virginia.edu/stream/ref.html#counting.
[3] G. S. Almasi and A. Gottlieb. Highly Parallel Computing. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, USA, 1989.
[4] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. pages 79-81, 2000.
[5] Sergio Barrachina, Maribel Castillo, Francisco D. Igual, Rafael Mayo, Enrique S. Quintana-Ortí, and Gregorio Quintana-Ortí. Exploiting the capabilities of modern GPUs for dense matrix computations. Technical report, Universidad Jaime I, 2008.
[6] Barbara Chapman, Gabriele Jost, and Ruud van der Pas. Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation). The MIT Press, 2007.
[7] Jonathan Cohen and Michael Garland. Solving computational problems with GPU computing. Computing in Science and Engineering, 11(5):58-63, 2009.
[8] NVIDIA Corporation. Transform & lighting. Technical brief.
[9] David Culler, J. P. Singh, and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach (The Morgan Kaufmann Series in Computer Architecture and Design). Morgan Kaufmann, August 1998.
[10] Zhe Fan, Feng Qiu, Arie Kaufman, and Suzanne Yoakum-Stover. GPU cluster for high performance computing. In SC '04: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, pages 47+, Washington, DC, USA, 2004. IEEE Computer Society.
[11] K. Fatahalian, J. Sugerman, and P. Hanrahan. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. In HWWS '04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages 133-137, New York, NY, USA, 2004. ACM.
[12] Randima Fernando and Mark J. Kilgard. The Cg Tutorial: The Definitive Guide to Programmable Real-Time Graphics. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2003.
[13] Joel H. Ferziger and Milovan Perić. Computational Methods for Fluid Dynamics. Springer, 2nd edition, 1997.
[14] Michael J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computers, C-21(9):948-960, September 1972.
[15] Field G. Van Zee, Ernie Chan, Robert van de Geijn, Enrique S. Quintana-Ortí, and Gregorio Quintana-Ortí. Introducing: The libflame library for dense matrix computations. CiSE, page 9.
[16] Michael Garland. Sparse matrix computations on manycore GPUs. In DAC '08: Proceedings of the 45th Annual Design Automation Conference, pages 2-6, New York, NY, USA, 2008. ACM.
[17] Dominik Göddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Hilmar Wobker, Christian Becker, and Stefan Turek. Using GPUs to improve multigrid solver performance on a cluster. Int. J. Comput. Sci. Eng., 4(1):36-55, 2008.
[18] John L. Gustafson. Reevaluating Amdahl's law. Commun. ACM, 31(5):532-533, 1988.
[19] Johannes Habich. Performance evaluation of numeric compute kernels on NVIDIA GPUs. Master's thesis, Friedrich-Alexander-Universität, 2008.
[20] Mark Harris, William Baxter, Thorsten Scheuermann, and Anselmo Lastra. Simulation of cloud dynamics on graphics hardware. In Proc. Graphics Hardware, 2003.
[21] David Kanter. NVIDIA's GT200: Inside a parallel processor. Real World Technologies, http://realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=1, August 2008.
[22] Jens Krüger. Linear algebra on GPUs. In SIGGRAPH '05: ACM SIGGRAPH 2005 Courses, page 73, New York, NY, USA, 2005. ACM.
[23] Jens Krüger and Rüdiger Westermann. Linear algebra operators for GPU implementation of numerical algorithms. In SIGGRAPH '03: ACM SIGGRAPH 2003 Papers, pages 908-916, New York, NY, USA, 2003. ACM.
[24] Linda Null and Julia Lobur. Essentials of Computer Organization and Architecture. Jones and Bartlett Publishers, Inc., USA, 2003.
[25] NVIDIA. CUDA Programming Guide.
[26] Matt Pharr and Randima Fernando. GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation (GPU Gems). Addison-Wesley Professional, 2005.
[27] Martin Rumpf and Robert Strzodka. Using graphics cards for quantized FEM computations. In IASTED Visualization, Imaging and Image Processing Conference, pages 193-202, 2001.
[28] Allen R. Sanderson, Miriah D. Meyer, Robert M. Kirby, and Chris R. Johnson. A framework for exploring numerical solutions of advection-reaction-diffusion equations using a GPU-based approach. Comput. Vis. Sci., 12(4):155-170, 2009.
[29] Sanjiva K. Lele. Compact finite difference schemes with spectral-like resolution. Journal of Computational Physics, 103:16-42, 1992.
[30] Jos Stam. Stable fluids. In SIGGRAPH 99 Conference Proceedings, Annual Conference Series, pages 121-128, 1999.
[31] Andrew S. Tanenbaum. Modern Operating Systems. Prentice Hall Press, Upper Saddle River, NJ, USA, 2007.
[32] J. Tölke and M. Krafczyk. TeraFLOP computing on a desktop PC with GPUs for 3D CFD. Int. J. Comput. Fluid Dyn., 22(7):443-456, 2008.
[33] Stanimire Tomov, Jack Dongarra, and Marc Baboulin. Towards dense linear algebra for hybrid GPU accelerated manycore systems. Technical Report 210, LAPACK Working Note, October 2008.
[34] Vasily Volkov and James Demmel. LU, QR and Cholesky factorizations using vector capabilities of GPUs. Technical report, Electrical Engineering and Computer Sciences, University of California at Berkeley, 2008.
[35] Vasily Volkov and James W. Demmel. Benchmarking GPUs to tune dense linear algebra. In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1-11, Piscataway, NJ, USA, 2008. IEEE Press.
[36] Vasily Volkov and James W. Demmel. LU, QR and Cholesky factorizations using vector capabilities of GPUs. LAPACK Working Note 202, May 2008.
[37] Ye Zhao. Lattice Boltzmann based PDE solver on the GPU. Vis. Comput., 24(5):323-333, 2008.
Appendix A
Additional Information
A.1 Properties of some GPUs
                   Number of         Clock  Memory  Mem. Clock  Bus Width  Mem. Bandwidth
                   multiprocessors   (MHz)  (MB)    (MHz)       (bit)      (GB/s)
GeForce 8600 GT          4           1450    256      700         128        22.4
GeForce 8800 GT         14           1500    512      900         256        57.6
GeForce 9400 GT          2           1400    512      400         128        12.8
GeForce 9600 GT          8           1650    512      900         256        57.6
Quadro FX 1800           8           1400    768      800         192        38.4
Tesla C1060             30           1300   4096      800         512       102.0

Table A.1: Properties of several GPUs
Appendix B
Code Listings
B.1 Benchmarks
B.1.1 FLOP benchmark
Listing B.1: FLOP test

/*
 * Copyright 1993-2007 NVIDIA Corporation.  All rights reserved.
 *
 * NOTICE TO USER:
 *
 * This source code is subject to NVIDIA ownership rights under U.S. and
 * international Copyright laws.  Users and possessors of this source code
 * are hereby granted a nonexclusive, royalty-free license to use this code
 * in individual and commercial software.
 *
 * NVIDIA MAKES NO REPRESENTATION ABOUT THE SUITABILITY OF THIS SOURCE
 * CODE FOR ANY PURPOSE.  IT IS PROVIDED "AS IS" WITHOUT EXPRESS OR
 * IMPLIED WARRANTY OF ANY KIND.  NVIDIA DISCLAIMS ALL WARRANTIES WITH
 * REGARD TO THIS SOURCE CODE, INCLUDING ALL IMPLIED WARRANTIES OF
 * MERCHANTABILITY, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.
 * IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL,
 * OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS
 * OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE
 * OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE
 * OR PERFORMANCE OF THIS SOURCE CODE.
 *
 * U.S. Government End Users.  This source code is a "commercial item" as
 * that term is defined at 48 C.F.R. 2.101 (OCT 1995), consisting of
 * "commercial computer software" and "commercial computer software
 * documentation" as such terms are used in 48 C.F.R. 12.212 (SEPT 1995)
 * and is provided to the U.S. Government only as a commercial end item.
 * Consistent with 48 C.F.R. 12.212 and 48 C.F.R. 227.7202-1 through
 * 227.7202-4 (JUNE 1995), all U.S. Government End Users acquire the
 * source code with only those rights set forth herein.
 *
 * Any use of this source code in individual and commercial software must
 * include, in the user documentation and internal comments to the code,
 * the above Disclaimer and U.S. Government End Users Notice.
 */

/*
  This sample is intended to measure the peak computation rate of the GPU in
  GFLOPs (giga floating point operations per second).

  It executes a large number of multiply-add operations, writing the results to
  shared memory.  The loop is unrolled for maximum performance.

  Depending on the compiler and hardware it might not take advantage of all the
  computational resources of the GPU, so treat the results produced by this code
  with some caution.
*/

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#include <cutil.h>

#ifndef NUM_SMS
# define NUM_SMS (30)                 // 16
#endif
#ifndef NUM_THREADS_PER_SM
# define NUM_THREADS_PER_SM (1000)    // 384
#endif
#ifndef NUM_THREADS_PER_BLOCK
# define NUM_THREADS_PER_BLOCK (512)  // 192
#endif
#define NUM_BLOCKS ((NUM_THREADS_PER_SM / NUM_THREADS_PER_BLOCK) * NUM_SMS)
#define NUM_ITERATIONS 10
#if NUM_BLOCKS == 0
#define NUM_BLOCKS 1
#endif

// 128 MAD instructions
#define FMAD128(a, b) \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a; \
    a = b * a + b; a = b * a + b; a = b * a + b; a = b * a + b; \
    b = a * b + a; b = a * b + a; b = a * b + a; b = a * b + a;

__shared__ float result[NUM_THREADS_PER_BLOCK];

__global__ void gflops()
{
    float a = result[threadIdx.x];  // this ensures the mads don't get compiled out
    float b = 1.01f;

    for (int i = 0; i < NUM_ITERATIONS; i++)
    {
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
    }
    result[threadIdx.x] = a + b;
}

int
main(int argc, char** argv)
{
    CUT_DEVICE_INIT(argc, argv);
    unsigned int timer = 0;

    // warmup
    gflops<<<NUM_BLOCKS, NUM_THREADS_PER_BLOCK>>>();
    CUDA_SAFE_CALL(cudaThreadSynchronize());

    // execute kernel
    CUT_SAFE_CALL(cutCreateTimer(&timer));
    CUT_SAFE_CALL(cutStartTimer(timer));

    gflops<<<NUM_BLOCKS, NUM_THREADS_PER_BLOCK>>>();

    CUDA_SAFE_CALL(cudaThreadSynchronize());
    CUT_SAFE_CALL(cutStopTimer(timer));
    float time = cutGetTimerValue(timer);

    // output results
    fprintf(stderr, "#block th/sms grid flops/cycle Time(ms) flops(G)\n");
    fprintf(stdout, "%3d %5d %5d %10ld %7.3f %7.3f\n", NUM_THREADS_PER_BLOCK,
            NUM_THREADS_PER_SM, NUM_BLOCKS, 128 * 16 * 2 * NUM_ITERATIONS, time,
            128.0 * 16.0 * 2.0 * NUM_ITERATIONS * NUM_BLOCKS * NUM_THREADS_PER_BLOCK / time * 1e-6);

    CUT_EXIT(argc, argv);
}

/* vim: set ft=cpp: */
B.1.2 Bandwidth
Listing B.2: memory access

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <float.h>
#include <cuda.h>
#include <cuda_runtime.h>

#include "aux.h"

#if CUDART_VERSION < 2020
#error "This CUDART version does not support mapped memory!\n"
#endif

#define NMP 30
#define TpMP 1024
#define TpB 64

#define GLOBAL 1
#define TEX 2
#define CONST 3
#define GLOBAL_NC 4
#define GLOBAL_C 5
#define TEX_C 6
#define CONST_C 7

#ifndef N
# define N (1<<26)
//# define N 14720
#endif

#ifndef NROUNDS
# define NROUNDS 10
#endif

#ifndef DTYPE
# define DTYPE float
# define BpW 4
#endif

typedef struct __align__(16) { float a[3]; } f3;

#define SIZE (N*BpW)
#define mSIZE(x) (x*sizeof(DTYPE))

#if SIZE <= 1<<16
__constant__ DTYPE d_c[N];
#endif

__global__ void gpu_COPY(DTYPE *, DTYPE *, int);
__global__ void copy_tex(DTYPE *a, int);
__global__ void copy_cte(DTYPE *a, int);

void golden_COPY(DTYPE *, DTYPE *);

int N_E;
dim3 grid, block;
texture<float, 1, cudaReadModeElementType> tex;

int
check(DTYPE *x, DTYPE *y, int nn)
{
    int i;
    for (i = 0; i < nn; i++) {
        if (x[i] != y[i]) {
            fflush(stdout);
            fprintf(stderr, "error at index %d: (x,y)=(%f,%f)\n", i, x[i], y[i]);
            fflush(stderr);
        }
    }
    return 0;
}

extern "C" {
#include <sys/time.h>
}

double mclock()
{
    struct timeval t1;
    // struct timezone tz;
    gettimeofday(&t1, NULL);
    return (double) t1.tv_sec + (double) t1.tv_usec * 1e-6;
}

static void
output(dim3 g, dim3 b, double times[NROUNDS], size_t elements, char s[])
{
    double avgtime = 0, maxtime = 0, mintime = FLT_MAX;
    int i; size_t bytes = elements * sizeof(DTYPE);

    for (i = 1; i < NROUNDS; i++) {
        avgtime += times[i];
        maxtime = (maxtime > times[i]) ? maxtime : times[i];
        mintime = (mintime < times[i]) ? mintime : times[i];
    }
    avgtime /= (double)(NROUNDS - 1);

    printf("%5d %3d %10d %11d %8.2f %8.2f\n", g.x, b.x, elements, bytes,
           avgtime * 1e6, (bytes * 1e-6) / avgtime);
    fflush(stdout);
}

int
main(int argc, char **argv)
{
    DTYPE *h_a, *h_b;
    DTYPE *d_a, *dd;
    int i, j;
    size_t bytes, size;
#if N <= 8192
    cudaArray *d_b;
#elif SIZE <= (1<<16) && N > 8192
    DTYPE *d_b;
#else
    DTYPE *d_b;
    DTYPE *d_c = NULL;
#endif
    char *labels[] = {"global", "texture", "constant"};
    double times[NROUNDS];
    int op[] = {GLOBAL, TEX, CONST};

    cuda_init(argc, argv);

    block.x = TpB;

    cuda_error_e(cudaHostAlloc((void**) &h_a, SIZE, 0));
    cuda_error_e(cudaHostAlloc((void**) &h_b, SIZE, 0));

    cuda_error_e(cudaMalloc((void**) &d_a, N*sizeof(DTYPE)));
    cuda_error_e(cudaMalloc((void**) &dd, N*sizeof(DTYPE)));
    cudaMemset(dd, 0, N*sizeof(DTYPE));

    memset(h_a, 0, SIZE);
    cuda_error_e(cudaMemcpy(d_a, h_a, SIZE, cudaMemcpyHostToDevice));

#if N <= (8192)
    cuda_error_e(cudaMallocArray(&d_b, &tex.channelDesc, SIZE, 1));
    cuda_error_e(cudaMemcpyToArray(d_b, 0, 0, (void*) d_a, SIZE, cudaMemcpyDeviceToDevice));
    tex.normalized = false;
    cudaBindTextureToArray(tex, d_b);
#else
    cuda_error_e(cudaMalloc((void**) &d_b, SIZE));
    cuda_error_e(cudaMemcpy(d_b, d_a, SIZE, cudaMemcpyDeviceToDevice));
    tex.normalized = false;
    // cudaBindTexture(0, tex, d_b, cudaCreateChannelDesc(8*sizeof(DTYPE), 0, 0, 0, cudaChannelFormatKindFloat), SIZE);
    cudaBindTexture(0, tex, d_b, cudaCreateChannelDesc(24*sizeof(DTYPE), 0, 0, 0,
                    cudaChannelFormatKindFloat), SIZE);
#endif

#if SIZE <= (1<<16)
    cuda_error_e(cudaMemcpyToSymbol(d_c, h_a, SIZE));
#endif

    /* global */
    printf("#global\n");
    printf("%5s %3s %10s %11s %9s %9s\n", "grid", "blck", "points", "bytes", "avgtime", "bandwidth");
    for (size = 1<<10; size <= N; size = size<<1) {
        bytes = size * sizeof(DTYPE);
        if (size > NMP*TpMP)
            grid.x = (TpMP / block.x) * NMP;
        else
            grid.x = (size / block.x) + 1;

        for (i = 0; i < NROUNDS; i++) {
            times[i] = mclock();
            gpu_COPY<<<grid, block>>>(dd, d_a, size);
            cudaThreadSynchronize();
            times[i] = mclock() - times[i];
            // cudaMemcpy(h_b, dd, bytes, cudaMemcpyDeviceToHost);
            // check(h_a, h_b, size);
        }
        output(grid, block, times, 2*size, labels[0]);
    }

    /* texture */
    printf("#texture\n");
    printf("%5s %3s %10s %11s %9s %9s\n", "grid", "blck", "points", "bytes", "avgtime", "bandwidth");
    for (size = 1<<10; size <= N; size = size<<1) {
        bytes = size * sizeof(DTYPE);
        if (size > NMP*TpMP)
            grid.x = (TpMP / block.x) * NMP;
        else
            grid.x = (size / block.x) + 1;

        for (i = 0; i < NROUNDS; i++) {
            times[i] = mclock();
            copy_tex<<<grid, block>>>(dd, size);
            cudaThreadSynchronize();
            times[i] = mclock() - times[i];
            // cudaMemcpy(h_b, dd, bytes, cudaMemcpyDeviceToHost);
            // check(h_a, h_b, size);
        }
        output(grid, block, times, 2*size, labels[0]);
    }

#if SIZE <= (1<<16)
    /* constant */
    printf("#constant\n");
    printf("%5s %3s %10s %11s %9s %9s\n", "grid", "blck", "points", "bytes", "avgtime", "bandwidth");
    for (bytes = 1<<12; bytes <= SIZE; bytes += 1024) {
        // for (bytes = 1<<12; bytes <= SIZE; bytes = bytes<<1) {
        long size = bytes / sizeof(DTYPE);
        if (size > NMP*TpMP)
            grid.x = (TpMP / block.x) * NMP;
        else
            grid.x = (size / block.x) + 1;

        for (i = 0; i < NROUNDS; i++) {
            times[i] = mclock();
            copy_cte<<<grid, block>>>(dd, size);
            cudaThreadSynchronize();
            times[i] = mclock() - times[i];
            // cudaMemcpy(h_b, dd, bytes, cudaMemcpyDeviceToHost);
            // check(h_a, h_b, size);
        }
        output(grid, block, times, 2*bytes, labels[0]);
    }
#endif
    return 0;
}

__global__ void
gpu_COPY(DTYPE *a, DTYPE *b, int nn)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n;
    int delta;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < nn; n += delta) {
        a[n] = b[n];
    }

    return;
}

__global__ void
copy_tex(DTYPE *a, int NN)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n;
    int delta;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < NN; n += delta)
        a[n] = tex1Dfetch(tex, n);
    return;
}

#if SIZE <= (1<<16)
__global__ void
copy_cte(DTYPE *a, int NN)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n;
    int delta;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < NN; n += delta) {
        a[n] = d_c[n];
    }
    return;
297 }
298 #endif
299
300 void
301 g o l d e n _ C O P Y ( D T Y P E ∗ a , D T Y P E ∗ b )
302 {
303 int i ;
304
305 for ( i=0; i<N ; i++ ) {
306 a [ i ] = b [ i ] ;
307 }
308 return ;
309 }
310
311
312
313
314 /∗ v im : s e t f t =cpp : ∗/
315 /∗ EOF ∗/
Listing B.3: cached access
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <float.h>
#include <cuda.h>
#include <cuda_runtime.h>

#include "aux.h"

#if CUDART_VERSION < 2020
#error "This CUDART version does not support mapped memory!\n"
#endif

#define NMP 30
#define TpMP 4096
#define TpB 64

#define GLOBAL 1
#define TEX 2
#define CONST 3
#define GLOBAL_NC 4
#define GLOBAL_C 5
#define TEX_C 6
#define CONST_C 7

#ifndef N
# define N (1<<14)
#endif
#define SIZE (N*4)

#ifndef NROUNDS
# define NROUNDS 2
#endif

#ifndef DTYPE
# define DTYPE float
#endif

#if SIZE <= 1<<16
__constant__ DTYPE d_c[N];
#endif

__global__ void gpu_COPY_c(DTYPE *, DTYPE *, int, int);
__global__ void gpu_COPY_nc(DTYPE *, DTYPE *, int, int);
__global__ void copy_tex_c(DTYPE *a, int, int);
__global__ void copy_cte_c(DTYPE *a, int, int);

void golden_COPY(DTYPE *, DTYPE *);

int N_E;
dim3 grid, block;
texture<float, 1, cudaReadModeElementType> tex;
float *d_out;

int
check(DTYPE *x, DTYPE *y, int nn)
{
    int i;
    for (i = 0; i < nn; i++) {
        if (x[i] != y[i]) {
            fflush(stdout);
            fprintf(stderr, "error at index %d: (x,y)=(%f,%f)\n", i, x[i], y[i]);
            fflush(stderr);
        }
    }
    return 0;
}

extern "C" {
#include <sys/time.h>
}
double mclock()
{
    struct timeval t1;
    // struct timezone tz;
    gettimeofday(&t1, NULL);
    return (double) t1.tv_sec + (double) t1.tv_usec * 1e-6;
}


static void
output(dim3 g, dim3 b, double times[NROUNDS], size_t elements, char s[])
{
    double avgtime = 0, maxtime = 0, mintime = FLT_MAX;
    int i;
    size_t bytes = (1 + elements) * elements * sizeof(DTYPE);
    int t_mp = (g.x / NMP) * b.x;

    for (i = 1; i < NROUNDS; i++) {
        avgtime += times[i];
        maxtime = (maxtime > times[i]) ? maxtime : times[i];
        mintime = (mintime < times[i]) ? mintime : times[i];
    }
    avgtime /= (double)(NROUNDS - 1);


    printf("%5d %3d %4d %6d %10d %11d %12.2f %8.2f\n", g.x, b.x, t_mp, g.x*b.x, elements, bytes,
           avgtime*1e6, (bytes*1e-6)/avgtime);
    fflush(stdout);
}


int
main(int argc, char **argv)
{
    DTYPE *h_a, *h_b;
    DTYPE *d_a;
    int i;
    size_t bytes, size;
#if N <= 8192
    cudaArray *d_b;
#elif SIZE <= 1<<16 && N > 8192
    DTYPE *d_b;
#else
    DTYPE *d_b;
    DTYPE *d_c = NULL;
#endif
    char *labels[] = {"global-nc", "global-l", "texture-l", "constant-l"};
    double times[NROUNDS];
    int op[] = {GLOBAL_NC, GLOBAL_C, TEX_C, CONST_C};

    cuda_init(argc, argv);

    block.x = TpB;

    cuda_error_e(cudaHostAlloc((void**) &h_a, SIZE, 0));
    cuda_error_e(cudaHostAlloc((void**) &h_b, SIZE, 0));

    cuda_error_e(cudaMalloc((void**) &d_a, SIZE));
    cuda_error_e(cudaMalloc((void**) &d_out, SIZE));

    for (i = 0; i < N; i++) {
        h_b[i] = 1.0f;
    }
    cuda_error_e(cudaMemcpy(d_a, h_b, SIZE, cudaMemcpyHostToDevice));

#if N <= (8192)
    cuda_error_e(cudaMallocArray(&d_b, &tex.channelDesc, SIZE, 1));
    cuda_error_e(cudaMemcpyToArray(d_b, 0, 0, (void*) d_a, SIZE, cudaMemcpyDeviceToDevice));
    tex.normalized = false;
    cudaBindTextureToArray(tex, d_b);
#else
    cuda_error_e(cudaMalloc((void**) &d_b, SIZE));
    cuda_error_e(cudaMemcpy(d_b, d_a, SIZE, cudaMemcpyDeviceToDevice));
    tex.normalized = false;
    cudaBindTexture(0, tex, d_b, cudaCreateChannelDesc(8*sizeof(DTYPE), 0, 0, 0,
                    cudaChannelFormatKindFloat), SIZE);
#endif

#if SIZE <= 1<<16
    cuda_error_e(cudaMemcpyToSymbol(d_c, h_b, N*sizeof(DTYPE)));
#endif

    golden_COPY(h_a, h_b);

    /* global-nc */
    printf("#global\n");
    printf("%5s %3s %4s %6s %10s %11s %12s %8s\n", "grid", "blck", "TpMP", "threads", "points", "bytes",
           "avgtime", "bandwidth");
    for (size = 1<<10; size <= N; size = size<<1) {
        if (size > NMP*TpMP)
            grid.x = (TpMP / block.x) * NMP;
        else
            grid.x = (size / block.x) + 1;

        for (i = 0; i < NROUNDS; i++) {
            times[i] = mclock();
            gpu_COPY_nc<<<grid, block>>>(d_out, d_a, size, size);
            cudaThreadSynchronize();
            times[i] = mclock() - times[i];
            // cudaMemcpy(h_b, dd, bytes, cudaMemcpyDeviceToHost);
            // check(h_a, h_b, size);
        }
        output(grid, block, times, size, labels[0]);
    }

    /* global-c */
    printf("#global-c\n");
    printf("%5s %3s %4s %6s %10s %11s %12s %8s\n", "grid", "blck", "TpMP", "threads", "points", "bytes",
           "avgtime", "bandwidth");
    for (size = 1<<10; size <= N; size = size<<1) {
        if (size > NMP*TpMP)
            grid.x = (TpMP / block.x) * NMP;
        else
            grid.x = (size / block.x) + 1;

        for (i = 0; i < NROUNDS; i++) {
            times[i] = mclock();
            gpu_COPY_c<<<grid, block>>>(d_out, d_a, size, size);
            cudaThreadSynchronize();
            times[i] = mclock() - times[i];
            // cudaMemcpy(h_b, dd, bytes, cudaMemcpyDeviceToHost);
            // check(h_a, h_b, size);
        }
        output(grid, block, times, size, labels[0]);
    }

    /* tex-c */
    printf("#texture\n");
    printf("%5s %3s %4s %6s %10s %11s %12s %8s\n", "grid", "blck", "TpMP", "threads", "points", "bytes",
           "avgtime", "bandwidth");
    for (size = 1<<10; size <= N; size = size<<1) {
        if (size > NMP*TpMP)
            grid.x = (TpMP / block.x) * NMP;
        else
            grid.x = (size / block.x) + 1;

        for (i = 0; i < NROUNDS; i++) {
            times[i] = mclock();
            copy_tex_c<<<grid, block>>>(d_out, size, size);
            cudaThreadSynchronize();
            times[i] = mclock() - times[i];
            // cudaMemcpy(h_b, dd, bytes, cudaMemcpyDeviceToHost);
            // check(h_a, h_b, size);
        }
        output(grid, block, times, size, labels[0]);
    }

#if SIZE <= 1<<16
    /* cte-c */
    printf("#constant\n");
    printf("%5s %3s %4s %6s %10s %11s %12s %8s\n", "grid", "blck", "TpMP", "threads", "points", "bytes",
           "avgtime", "bandwidth");
    // for (bytes = 1<<12; bytes <= SIZE; bytes = bytes<<1) {
    for (size = 1<<10; size <= N; size = size<<1) {
        if (size > NMP*TpMP)
            grid.x = (TpMP / block.x) * NMP;
        else
            grid.x = (size / block.x) + 1;

        for (i = 0; i < NROUNDS; i++) {
            times[i] = mclock();
            copy_cte_c<<<grid, block>>>(d_out, size, size);
            cudaThreadSynchronize();
            times[i] = mclock() - times[i];
            // cudaMemcpy(h_b, dd, bytes, cudaMemcpyDeviceToHost);
            // check(h_a, h_b, size);
        }
        output(grid, block, times, size, labels[0]);
    }
#endif

    return 0;
}


/********** COPY KERNELs **********/
__global__ void
gpu_COPY_nc(DTYPE *a, DTYPE *b, int NN, int t)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n, k;
    int delta;
    DTYPE tmp;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < NN; n += delta) {
        tmp = 0.0f;
        for (k = 0; k < t; k++) {
            tmp += b[k];
        }
        a[n] = tmp;
    }
    return;
}

__global__ void
gpu_COPY_c(DTYPE *a, DTYPE *b, int NN, int t)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n, i, k;
    int delta;
#define BANK_SIZE 512
    __shared__ DTYPE sb[BANK_SIZE];
    // int dd = BANK_SIZE / (blockDim.x*blockDim.y);
    int dd = 16;
    DTYPE tmp;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < NN; n += delta) {
        tmp = 0.0f;
        for (i = 0; i < t; i += dd) {
            if (tid < dd) {
                sb[tid] = b[i + tid];
            }
            __syncthreads();
            for (k = 0; k < dd; k++) {
                if (i + k < t)
                    tmp += sb[k];
            }
        }
        a[n] = tmp;
    }

    return;
}


__global__ void
copy_tex_c(DTYPE *a, int NN, int t)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n, k;
    int delta;
    DTYPE tmp;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < NN; n += delta) {
        tmp = 0.0f;
        for (k = 0; k < t; k++)
            tmp += tex1Dfetch(tex, k);
        a[n] = tmp;
        // a[n] = tex1D(tex, (float) k);
    }
    return;
}

#if SIZE <= 1<<16
__global__ void
copy_cte_c(DTYPE *a, int NN, int t)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n, k;
    int delta;
    DTYPE tmp;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < NN; n += delta) {
        tmp = 0.0f;
        for (k = 0; k < t; k++)
            tmp += d_c[k];
        a[n] = tmp;
    }
    return;
}
#endif


void
golden_COPY(DTYPE *a, DTYPE *b)
{
    int i, j;
    float tmp;

    for (i = 0; i < N; i++) {
        tmp = 0.0f;
        for (j = 0; j < N; j++)
            tmp += b[j];
        a[i] = tmp;
    }
    return;
}

/* vim: set ft=cpp: */
/* EOF */
B.1.3 Stream
Listing B.4: Stream benchmark
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <float.h>
#include <cuda.h>
#include <cuda_runtime.h>

#include "aux.h"

#if CUDART_VERSION < 2020
#error "This CUDART version does not support mapped memory!\n"
#endif

#define COPY 0
#define SCALE 1
#define ADD 2
#define TRIAD 3

const char *labels[] = {"COPY ", "SCALE ", "ADD ",
                        "TRIAD "};

const int sbytes[] = {2, 2, 3, 3};
#ifndef DTYPE
# define DTYPE float
#endif

#if defined(NN)
# define N NN
#else
# define N (1<<15)
#endif

#ifndef NROUNDS
# define NROUNDS 10
#endif

#ifndef OPERATION
# define OPERATION COPY
#endif

#define SIZE (N*sizeof(DTYPE))
#define SIZEs (N*sizeof(DTYPE)*sbytes[OPERATION])

#define NMP 30
#define TB_SIZE 5
#define TMP_SIZE 10

typedef struct {
    DTYPE *a;
    DTYPE *b;
    DTYPE *c;
    DTYPE k;
    DTYPE size;
} stream;

__global__ void gpu_COPY(stream);
__global__ void gpu_ADD(stream s);
__global__ void gpu_SCALE(stream s);
__global__ void gpu_TRIAD(stream s);
void golden_COPY(stream s);
void golden_SCALE(stream s);
void golden_ADD(stream s);
void golden_TRIAD(stream s);

void
check(stream h, stream d)
{
    int i;
    DTYPE *a, *b, *c;

    a = (DTYPE *) malloc(SIZE);
    b = (DTYPE *) malloc(SIZE);
    c = (DTYPE *) malloc(SIZE);

    cuda_error_e(cudaMemcpy(a, d.a, SIZE, cudaMemcpyDeviceToHost));
    cuda_error_e(cudaMemcpy(b, d.b, SIZE, cudaMemcpyDeviceToHost));
    cuda_error_e(cudaMemcpy(c, d.c, SIZE, cudaMemcpyDeviceToHost));
    if (h.k != d.k) {
        fprintf(stderr, "k mismatch");
        exit(3);
    }

    for (i = 0; i < h.k; i++) {
        if (h.a[i] != a[i]) {
            fprintf(stderr, "not check: a on index %d", i);
            exit(3);
        }
        if (h.b[i] != b[i]) {
            fprintf(stderr, "not check: b on index %d", i);
            exit(3);
        }
        if (h.c[i] != c[i]) {
            fprintf(stderr, "not check: c on index %d", i);
            exit(3);
        }
    }
    free(a); free(b); free(c);
    return;
}

extern "C" {
#include <sys/time.h>
}
double mclock()
{
    struct timeval t1;
    gettimeofday(&t1, NULL);
    return (double) t1.tv_sec + (double) t1.tv_usec * 1e-6;
}


dim3 block, grid;


int
main(int argc, char **argv)
{
    int i, j;
    double times[TB_SIZE][TMP_SIZE];
    int tb_sizes[] = {32, 64, 128, 256, 512};
    int tmp_sizes[] = {30, 60, 120, 240, 480, 960, 1920, 3840, 7680, 15360};
    stream d_s, h_s;

    t_init(argc, argv);
    cuda_init(argc, argv);

    cuda_error_e(cudaHostAlloc((void**) &(h_s.a), SIZE, 0));
    cuda_error_e(cudaHostAlloc((void**) &(h_s.b), SIZE, 0));
    cuda_error_e(cudaHostAlloc((void**) &(h_s.c), SIZE, 0));

    cuda_error_e(cudaMalloc((void**) &d_s.a, SIZE));
    cuda_error_e(cudaMalloc((void**) &d_s.b, SIZE));
    cuda_error_e(cudaMalloc((void**) &d_s.c, SIZE));

    for (i = 0; i < N; i++) {
        h_s.a[i] = 1.0f;
        h_s.b[i] = 1.0f;
        h_s.c[i] = 0.0f;
    }
    d_s.size = h_s.size = N;
    d_s.k = h_s.k = 2.0f;
    printf("#operation(%d): %s vector size: %d, data size: %d\n", OPERATION, labels[OPERATION], N,
           SIZE);
    cuda_error_e(cudaMemcpy(d_s.a, h_s.a, SIZE, cudaMemcpyHostToDevice));
    cuda_error_e(cudaMemcpy(d_s.b, h_s.b, SIZE, cudaMemcpyHostToDevice));
    cuda_error_e(cudaMemcpy(d_s.c, h_s.c, SIZE, cudaMemcpyHostToDevice));
    for (i = 0; i < TB_SIZE; i++) {
        block.x = tb_sizes[i];
        for (j = 0; j < TMP_SIZE; j++) {
            grid.x = tmp_sizes[j];
            // grid.x = NMP * (tmp_sizes[j] / block.x);
            if (grid.x == 0) {
                times[i][j] = 0.0f;
                continue;
            }
            times[i][j] = mclock();

#if OPERATION == COPY
            gpu_COPY<<<grid, block>>>(d_s);
#elif OPERATION == SCALE
            gpu_SCALE<<<grid, block>>>(d_s);
#elif OPERATION == ADD
            gpu_ADD<<<grid, block>>>(d_s);
#elif OPERATION == TRIAD
            gpu_TRIAD<<<grid, block>>>(d_s);
#endif
            cuda_error_e(cudaThreadSynchronize());
            times[i][j] = mclock() - times[i][j];
        }
    }

#if OPERATION == COPY
    golden_COPY(h_s);
#elif OPERATION == SCALE
    golden_SCALE(h_s);
#elif OPERATION == ADD
    golden_ADD(h_s);
#elif OPERATION == TRIAD
    golden_TRIAD(h_s);
#endif

    check(h_s, d_s);
    printf("%6s ", "grid");
    for (i = 0; i < TB_SIZE; i++) {
        printf("%7s %8s %9d %9d", "t/mp", "threads", tb_sizes[i], tb_sizes[i]);
    }
    printf("\n");

    for (j = 0; j < TMP_SIZE; j++) {
        printf("%6d ", tmp_sizes[j]);
        for (i = 0; i < TB_SIZE; i++) {
            // size_t gd = (tmp_sizes[j] / tb_sizes[i]) * NMP;
            size_t gd = (tmp_sizes[j] / NMP) * tb_sizes[i];

            if (times[i][j] == 0.0f)
                printf("%7d %8d %9.2f %9s ", gd, tb_sizes[i]*tmp_sizes[j], (times[i][j]*1e6), "-");
            else
                printf("%7d %8d %9.2f %9.2f ", gd, tb_sizes[i]*tmp_sizes[j], (times[i][j]*1e6),
                       SIZEs/(times[i][j]*1e6));
        }
        printf("\n");
    }

    return 0;
}


/********** COPY KERNELs **********/

__global__ void
gpu_COPY(stream s)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n;
    int delta;
    int nt = s.size;
    float *a = s.a;
    float *b = s.b;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < nt; n += delta) {
        a[n] = b[n];
    }

    return;
}

void
golden_COPY(stream s)
{
    int i;

    for (i = 0; i < s.size; i++) {
        s.a[i] = s.b[i];
    }
    return;
}


void
golden_SCALE(stream s)
{
    int i;

    for (i = 0; i < s.size; i++) {
        s.c[i] = s.k * s.b[i];
    }
    return;
}

__global__ void
gpu_SCALE(stream s)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n;
    int delta;
    DTYPE lk = s.k;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < s.size; n += delta) {
        s.c[n] = lk * s.a[n];
    }

    return;
}

/********** ADD KERNELs **********/
__global__ void
gpu_ADD(stream s)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n;
    int delta;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < s.size; n += delta) {
        s.c[n] = s.a[n] + s.b[n];
    }

    return;
}

void
golden_ADD(stream s)
{
    int i;

    for (i = 0; i < N; i++) {
        s.c[i] = s.a[i] + s.b[i];
    }
    return;
}

__global__ void
gpu_TRIAD(stream s)
{
    int bid = gridDim.x*blockIdx.y + blockIdx.x;
    int tid = blockDim.x*threadIdx.y + threadIdx.x;
    int n;
    int delta;

    delta = blockDim.x*blockDim.y*gridDim.x*gridDim.y;

    for (n = tid + bid*blockDim.x; n < s.size; n += delta) {
        s.c[n] = s.a[n] + s.k * s.b[n];
    }

    return;
}

void
golden_TRIAD(stream s)
{
    int i;

    for (i = 0; i < N; i++) {
        s.c[i] = s.a[i] + s.k * s.b[i];
    }
    return;
}


/* vim: set ft=cpp: */
/* EOF */
B.2 Burgers equation solver
B.2.1 Linear Algebra
Listing B.5: sgetrs routine
#include "cuda_lapack.h"
#include "aux.h"


extern "C" int
cuda_sgetrs(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA,
            const int N, const int NRHS, const float *A, const int lda, const int *ipiv,
            float *B, const int ldb)
{
    char NOTRAN;
    int _nrows, _ncols;
    const float ONE = 1.0f;
    int info;


    _nrows = (Order == CblasRowMajor) ? N : lda;
    _ncols = (Order == CblasColMajor) ? lda : N;


    info = 0;
    NOTRAN = (TransA == CblasTrans) ? 0 : 1;
    if (TransA != CblasNoTrans && TransA != CblasTrans && TransA != CblasConjTrans) {
        info = -1;
    }
    else if (_ncols < 0) {
        info = -2;
    }
    else if (NRHS < 0) {
        info = -3;
    }
    else if (_nrows < max(1, _ncols)) {
        info = -5;
    }
    else if (ldb < max(1, _nrows)) {
        info = -8;
    }

    if (info != 0) {
        return info;
    }

    if (_nrows == 0 || NRHS == 0) {
        return info;
    }


    if (Order == CblasRowMajor) {
        if (NOTRAN) {
            cuda_strsm(Order, CblasLeft, CblasLower, CblasTrans, CblasNonUnit, _nrows, NRHS, ONE, A,
                       lda, B, ldb);
            cuda_strsm(Order, CblasLeft, CblasUpper, CblasTrans, CblasUnit, _nrows, NRHS, ONE, A, lda,
                       B, ldb);
            cuda_slaswp(CblasColMajor, NRHS, B, ldb, 0, _nrows-1, ipiv, -1);
        }
        else {
            cuda_slaswp(CblasColMajor, NRHS, B, ldb, 0, _nrows-1, ipiv, 1);
            cuda_strsm(Order, CblasLeft, CblasUpper, CblasNoTrans, CblasUnit, _nrows, NRHS, ONE, A, lda,
                       B, ldb);
            cuda_strsm(Order, CblasLeft, CblasLower, CblasNoTrans, CblasNonUnit, _nrows, NRHS, ONE, A,
                       lda, B, ldb);
        }
    }
    else {
        if (NOTRAN) {
            cuda_slaswp(CblasColMajor, NRHS, B, ldb, 0, _nrows-1, ipiv, 1);
            cuda_strsm(Order, CblasLeft, CblasLower, CblasNoTrans, CblasUnit, _nrows, NRHS, ONE, A, lda,
                       B, ldb);
            cuda_strsm(Order, CblasLeft, CblasUpper, CblasNoTrans, CblasNonUnit, _nrows, NRHS, ONE, A,
                       lda, B, ldb);
        }
        else {
            cuda_strsm(Order, CblasLeft, CblasUpper, CblasTrans, CblasNonUnit, _nrows, NRHS, ONE, A,
                       lda, B, ldb);
            cuda_strsm(Order, CblasLeft, CblasLower, CblasTrans, CblasUnit, _nrows, NRHS, ONE, A, lda,
                       B, ldb);
            cuda_slaswp(CblasColMajor, NRHS, B, ldb, 0, _nrows-1, ipiv, -1);
        }
    }
    return info;
}

/* vim: set ft=cpp tw=78 ts=4: */
/* EOF */
Listing B.6: sgetri routine
#include "cuda_lapack.h"
#include "aux.h"


void __global__
create_identity(float *A, int N)
{
    int d1 = blockDim.x*blockDim.y;
    int d2 = gridDim.x*gridDim.y*d1;
    int tid = threadIdx.x + blockIdx.x*d1;
    int n;

    for (n = tid; n < N; n += d2) {
        *(A + (N+1)*n) = 1.0f;
    }

    return;
}


extern "C" int
cuda_sgetri(const int N, float *A, int *ipiv)
{
    int info = 0;
    float *tA;
    dim3 block, grid;

    cudaMalloc((void**) &tA, N*N*sizeof(float));
    cudaMemcpy(tA, A, N*N*sizeof(float), cudaMemcpyDeviceToDevice);
    cudaMemset(A, 0, N*N*sizeof(float));

    block.x = 64;
    grid.x = N / block.x;
    if (grid.x == 0) grid.x++;

    create_identity<<<grid, block>>>(A, N);
    cudaThreadSynchronize();

    cuda_sgetrs(CblasRowMajor, CblasNoTrans, N, N, tA, N, ipiv, A, N);
    cudaFree(tA);

    return info;
}

/* vim: set ft=cpp tw=78 ts=4: */
/* EOF */
Listing B.7: slaswp routine
#include <stdio.h>
#include <stdlib.h>

#include "cuda_lapack.h"
#include "aux.h"


#define NB 64

/* CUDA HELPERS */

texture<int, 1, cudaReadModeElementType> tex;

// row major version
static __global__ void
_slaswp_d_rm(const int N, float *A, const int lda, const int K1, const int K2, const int *IPIV, const int incx)
{
    int d_j = blockDim.x*blockDim.y;
    int i, j, col, row;
    float tmp;
    int k1, k2, inc, ix, ix0;
    int tid = threadIdx.x + blockDim.x*threadIdx.y;

    if (incx > 0) {
        k1 = K1;
        k2 = K2 + 1;
        inc = 1;
        ix0 = k1;
    }
    else if (incx < 0) {
        k1 = K2;
        k2 = K1 - 1;
        inc = -1;
        ix0 = -K2*incx;
    }

    for (col = tid; col < lda; col += (d_j*blockIdx.x)) {
        /* DANGEROUS CONDITION! DO NOT BREAK: i != k2 */
        for (i = k1, ix = ix0; i != k2; i += inc, ix += incx) {
            row = IPIV[ix];
            // row = tex1Dfetch(tex, ix);
            if (row != i) {
                tmp = *(A + row*lda + col);
                *(A + row*lda + col) = *(A + i*lda + col);
                *(A + i*lda + col) = tmp;
            }
        }
    }

    return;
}

// column major version
static __global__ void
_slaswp_d_cm(const int N, float *A, const int lda, const int K1, const int K2, const int *IPIV, const int incx)
{
    int d_j = blockDim.x*blockDim.y;
    int i, j, col, row;
    float tmp;
    int k1, k2, inc, ix, ix0;

    if (incx > 0) {
        k1 = K1;
        k2 = K2 + 1;
        inc = 1;
        ix0 = k1;
    }
    else if (incx < 0) {
        k1 = K2;
        k2 = K1 - 1;
        inc = -1;
        ix0 = -K2*incx;
    }

    for (j = 0; j < N; j += d_j) {
        col = j + threadIdx.x + blockDim.x*threadIdx.y;
        if (col >= N) {
            return;
        }
        /* DANGEROUS CONDITION! DO NOT BREAK: i != k2 */
        for (i = k1, ix = ix0; i != k2; i += inc, ix += incx) {
            row = IPIV[ix];
            if (row != i) {
                tmp = *(A + col*lda + row);
                *(A + col*lda + row) = *(A + col*lda + i);
                *(A + col*lda + i) = tmp;
            }
        }
    }

    return;
}

extern "C" void
cuda_slaswp(const enum CBLAS_ORDER order, const int N, float *A, const int lda, const int K1, const int K2, const int *IPIV, int INCX)
{
    dim3 block_dim, grid;
    void (*_slaswp_d)(const int, float *, const int, const int, const int, const int *, const int);
    int row_major;
    int m, n;

    row_major = (order == CblasRowMajor) ? 1 : 0;
    n = (row_major) ? N : lda;
    m = (row_major) ? lda : N;
    _slaswp_d = (row_major) ? _slaswp_d_rm : _slaswp_d_cm;


    // block_dim.x = imin((m / 64 + 1) * 64, 512);
    block_dim.x = 64;
    grid.x = (m / block_dim.x);
    if (grid.x == 0) grid.x = 1;
    if (grid.x > 30*(4096 / block_dim.x)) grid.x = 30*(4096 / block_dim.x);


    if ((K1 < 0 || K2 >= n || K1 > K2) && INCX > 0) {
        fprintf(stderr, "[arg error] limits K1 or K2 out of bounds: (K1,K2)=(%d,%d)\n", K1, K2);
        return;
    }

    if ((K2 < 0 || K1 >= n || K1 > K2) && INCX < 0) {
        fprintf(stderr, "[arg error] limits K1 or K2 out of bounds: (K1,K2)=(%d,%d)\n", K1, K2);
        return;
    }
    // cudaBindTexture(0, tex, IPIV, cudaCreateChannelDesc(8*sizeof(int), 0, 0, 0,
    //                 cudaChannelFormatKindFloat), N*sizeof(int));
131 _ s l a s w p _ d <<<g r i d , b l o c k _ d i m >>>(N , A , l d a , K1 , K2 , I P I V , I N C X ) ;
132 c u d a _ e r r o r ( c u d a T h r e a d S y n c h r o n i z e ( ) ) ;
133 return ;
134 }
135
136
137
138
139 /∗ v im : s e t f t =cpp : ∗/
140 /∗ EOF ∗/
B.2.2 Numerical Methods
Listing B.8: compact schemes header file

#ifndef RK4_H
#define RK4_H



struct _rk4 {
    float dt;
    int (*F)(int, float*, float*);
};

typedef struct _rk4 RK4;

int rk4_init(RK4*, float, int (*F)(int, float*, float*));
int rk4_integrate(RK4*, int, float*, float*);


#endif
Listing B.9: RK4 header file

#ifndef RK4_H
#define RK4_H



struct _rk4 {
    float dt;
    int (*F)(int, float*, float*);
};

typedef struct _rk4 RK4;

int rk4_init(RK4*, float, int (*F)(int, float*, float*));
int rk4_integrate(RK4*, int, float*, float*);


#endif
Listing B.10: compact schemes CUDA implementation

#include <string.h>
#include <stdlib.h>


#include <clapack.h>
#include "mutil.h"
#include "compact_schemes_cuda.h"




#define FST_DER 0
#define SND_DER 5

#define ALFAC 0
#define BETAC 1
#define AC    2
#define BC    3
#define CC    4
#define DC    BETAC


#define ALFA2C 5
#define BETA2C 6
#define A2C    7
#define B2C    8
#define C2C    9
#define D2C    BETA2C
#define E2C    10

int _compact_calc_coef(Compact*, int, float[]);
int _compact_calc_coef2(Compact*, int, float[]);
int _compact_init_A(Compact*);
int _compact_init_A2(Compact*);
int _compact_init_B(Compact*);
int _compact_init_B2(Compact*);


/* Compact related functions */


#include <cuda.h>
#include <cuda_runtime.h>

#include <aux.h>
#include <cuda_blas.h>
#include <cuda_lapack.h>
int compact_init(Compact *self, float h, int N, int order, float *coef)
{
    float *f_p, *df_p, *A, *B, *A2, *B2;
    int *i_p, *di_p, *Apivots, *A2pivots;

    f_p = df_p = NULL;
    i_p = di_p = NULL;

    self->h = h;
    self->N = N;

    f_p = (float*) calloc(4*N*N, sizeof(float));
    i_p = (int*) calloc(2*N, sizeof(int));
    cuda_error_e(cudaMalloc((void**)&df_p, 4*N*N*sizeof(float)));
    cuda_error_e(cudaMalloc((void**)&di_p, 2*N*sizeof(int)));
    if (f_p == NULL || i_p == NULL || df_p == NULL || di_p == NULL) {
        return -1;
    }

    self->A  = f_p;
    self->B  = f_p + N*N;
    self->A2 = f_p + N*N*2;
    self->B2 = f_p + N*N*3;
    self->Apivots  = i_p;
    self->A2pivots = i_p + N;

    A  = df_p;
    B  = df_p + N*N;
    A2 = df_p + N*N*2;
    B2 = df_p + N*N*3;

    Apivots  = di_p;
    A2pivots = di_p + N;

    _compact_calc_coef(self, order, coef);
    _compact_init_A(self);

#ifdef INVERSE
    cuda_error_e(cudaMemcpy(A, self->A, N*N*sizeof(float), cudaMemcpyHostToDevice));
    cuda_error_e(cudaMemcpy(Apivots, self->Apivots, N*sizeof(int), cudaMemcpyHostToDevice));
    cuda_error_e(cuda_sgetri(N, A, Apivots));
#else
    cuda_error_e(cudaMemcpyAsync(A, self->A, N*N*sizeof(float), cudaMemcpyHostToDevice, 0));
    cuda_error_e(cudaMemcpyAsync(Apivots, self->Apivots, N*sizeof(int), cudaMemcpyHostToDevice, 0));
#endif

    _compact_init_B(self);
    cuda_error_e(cudaMemcpyAsync(B, self->B, N*N*sizeof(float), cudaMemcpyHostToDevice, 0));

    _compact_calc_coef2(self, order, &(coef[SND_DER]));
    _compact_init_A2(self);
#ifdef INVERSE
    cuda_error_e(cudaMemcpy(A2, self->A2, N*N*sizeof(float), cudaMemcpyHostToDevice));
    cuda_error_e(cudaMemcpy(A2pivots, self->A2pivots, N*sizeof(int), cudaMemcpyHostToDevice));
    cuda_error_e(cuda_sgetri(N, A2, A2pivots));
#else
    cuda_error_e(cudaMemcpyAsync(A2, self->A2, N*N*sizeof(float), cudaMemcpyHostToDevice, 0));
    cuda_error_e(cudaMemcpyAsync(A2pivots, self->A2pivots, N*sizeof(int), cudaMemcpyHostToDevice, 0));
#endif
    _compact_init_B2(self);
    cuda_error_e(cudaMemcpyAsync(B2, self->B2, N*N*sizeof(float), cudaMemcpyHostToDevice, 0));



    self->A  = df_p;
    self->B  = df_p + N*N;
    self->A2 = df_p + N*N*2;
    self->B2 = df_p + N*N*3;

    self->Apivots  = di_p;
    self->A2pivots = di_p + N;

    cuda_error_e(cudaThreadSynchronize());
    free(i_p);

    return 0;
}


/*
 * Calculates first derivative
 */
int
compact_derivative(Compact *self, float *f, float *df_b, float *f_b,
                   float *Y)
{
    static float *tmp1 = NULL;
    float alpha = 1.0f;
    float beta  = 0.0f;
    int solver_m = 1;
    int solver_n = self->N;
    int solver_info = 0;

    if (tmp1 == NULL) {
        cuda_error(cudaMalloc((void**)&tmp1, solver_n*sizeof(float)));
    }


    /* SOLVE */

    cuda_sgemv(CblasColMajor, CblasNoTrans, solver_n, solver_n, alpha,
               self->B, solver_n, f, solver_m, beta, Y, solver_m);  // tmp2, solver_m);
#ifdef INVERSE
    cuda_sgemv(CblasRowMajor, CblasNoTrans, solver_n, solver_n, alpha,
               self->A, solver_n, Y, solver_m, beta, Y, solver_m);  // tmp2, solver_m);
#else
    solver_info = cuda_sgetrs(CblasRowMajor, CblasNoTrans, solver_n, solver_m,
                              self->A, solver_n, self->Apivots, Y, solver_n);  // tmp2, solver_n);

    DPRINT("solver return value: %d\n", solver_info);
#endif

    return 0;
}
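In matrix terms, `compact_derivative` evaluates the implicit relation set up by `_compact_init_A` and `_compact_init_B`: the derivative vector is defined by

```latex
A\,f' = B\,f \qquad\Longrightarrow\qquad f' = A^{-1} B\,f
```

where `cuda_sgemv` first forms the right-hand side $g = Bf$, and `cuda_sgetrs` then back-substitutes through the LU factors of $A$ computed once by `clapack_sgetrf` at initialization (the `INVERSE` build instead multiplies by a precomputed $A^{-1}$ with a second `cuda_sgemv`).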




/*
 * Calculates the coefficients for the algorithm.
 * Receives:
 *  * the intended order of the error;
 *  * an input array of coefficients (all negative entries are
 *    considered uninitialized);
 *  * an output array for the resulting coefficients.
 */

int compact_get_coef(int order, float x[5], float y[5])
{

    float tmp[5];


    switch (order) {
    case 4:
        if (x[ALFAC] < 0.0f) {
            tmp[ALFAC] = 1./3.;
        } else { tmp[ALFAC] = x[ALFAC]; }
        if (x[BETAC] < 0.0f) {
            tmp[BETAC] = 0.0;
        } else { tmp[BETAC] = x[BETAC]; }
        if (x[CC] < 0.0f) {
            tmp[CC] = 0.0f;
        } else { tmp[CC] = x[CC]; }
        y[ALFAC] = tmp[ALFAC];
        y[BETAC] = tmp[BETAC];
        y[CC] = tmp[CC];
        y[BC] = 1./3. * (4.*tmp[ALFAC] - 1. +
                22.0*tmp[BETAC] - 8.0*tmp[CC]);
        y[AC] = 1./3. * (2.*tmp[ALFAC] + 4. +
                16.0*tmp[BETAC] - 5.0*tmp[CC]);
        break;
    case 6:
        if (x[ALFAC] < 1.0) {
            tmp[ALFAC] = 3.0f/8.0f;
        } else { tmp[ALFAC] = x[ALFAC]; }
        if (x[BETAC] < 1.0f) {
            tmp[BETAC] = 0.0f;
        } else { tmp[BETAC] = x[BETAC]; }
        y[ALFAC] = tmp[ALFAC];
        y[BETAC] = tmp[BETAC];
        y[CC] = 1.0f/10.0f * (1.0f - 3.0f*tmp[ALFAC] +
                12.0*tmp[BETAC]);
        y[BC] = 1.0f/15.0f * (-9.0f + 32.0f*tmp[ALFAC] +
                62.0*tmp[BETAC]);
        y[AC] = 1.0f/6.0f * (9.0f + tmp[ALFAC] -
                20.*tmp[BETAC]);
        break;
    default:
        return -1;
    }

    return 0;
}

int _compact_calc_coef(Compact *self, int order, float coef[5])
{
    float tmp;

    compact_get_coef(order, coef, self->coef);

    self->_coef[AC] = self->coef[AC] / (2.0*self->h);
    self->_coef[BC] = self->coef[BC] / (4.0*self->h);
    self->_coef[CC] = self->coef[CC] / (6.0*self->h);

    tmp = 3.0f;
    self->boundary_coef[ALFAC] = tmp;
    self->boundary_coef[AC] = -1.0f * (11.0f + 2.0f*tmp) / 6.0f;
    self->boundary_coef[BC] = (6.0f - tmp) / 2;
    self->boundary_coef[CC] = (2.0f*tmp - 3.0f) / 2.0f;
    self->boundary_coef[DC] = (2.0f - tmp) / 6.0f;

    return 0;
}


/*
 * Initializes the algorithm matrix A
 */
int _compact_init_A(Compact *self)
{
    int N = self->N;
    float *A = NULL;
    int *pivots = NULL;
    float coef[5] = {1.0f/4.0f, -1.0f, -1.0f, -1.0f, -1.0f};
    int i, status = -1;

    A = self->A;
    pivots = self->Apivots;


    /**** Boundary node: f'1 + alpha f'2 = a f1 + b f2 + c f3 + d f4 ***/
    MSET(A, N-1, N-1, N, 1.0f);
    MSET(A, N-1, N-2, N, self->boundary_coef[ALFAC]);

    /**** Boundary node: order reduction: tridiagonal w/ 4th order errors ***/
    compact_get_coef(4, coef, coef);

    MSET(A, 0, 0, N, 1.0f);
    MSET(A, 0, 1, N, coef[ALFAC]);

    MSET(A, 1, 0, N, coef[ALFAC]);
    MSET(A, 1, 1, N, 1.0f);
    MSET(A, 1, 2, N, coef[ALFAC]);
    MSET(A, N-2, N-3, N, coef[ALFAC]);
    MSET(A, N-2, N-2, N, 1.0f);
    MSET(A, N-2, N-1, N, coef[ALFAC]);

    /**** Boundary node: order reduction: tridiagonal w/ 6th order errors ***/
    coef[ALFAC] = 1.0f/3.0f;
    coef[BETAC] = -1.0f; coef[AC] = -1.0f; coef[BC] = -1.0f; coef[CC] = -1.0f;
    compact_get_coef(4, coef, coef);

    MSET(A, 1, 1, N, coef[ALFAC]);
    MSET(A, 2, 2, N, 1.0f);
    MSET(A, 2, 3, N, coef[ALFAC]);
    MSET(A, N-3, N-4, N, coef[ALFAC]);
    MSET(A, N-3, N-3, N, 1.0f);
    MSET(A, N-3, N-2, N, coef[ALFAC]);


    for (i = 3; i < N-3; i++) {
        MSET(A, i, i-2, N, self->coef[BETAC]);
        MSET(A, i, i-1, N, self->coef[ALFAC]);
        MSET(A, i, i,   N, 1.0f);
        //MSET(A, i, i+1, N, self->coef[ALFAC]);
        //MSET(A, i, i+2, N, self->coef[BETAC]);
    }
    status = clapack_sgetrf(CblasRowMajor, N, N, A, N, pivots);
    //DPRINT("factorization return value: %d\n", status);
    if (status != 0) {
        return status;
    }
    return 0;
}

int _compact_init_B(Compact *self)
{
    int N = self->N;
    int i;
    float *B = NULL;
    float coef[5] = {1.0f/4.0f, -1.0f, -1.0f, -1.0f, -1.0f};

    B = self->B;

    /**** Boundary node: f'1 + alpha f'2 = a f1 + b f2 + c f3 + d f4 ***/
    MSET(B, N-1, N-1, N, -self->boundary_coef[AC] / self->h);
    MSET(B, N-1, N-2, N, -self->boundary_coef[BC] / self->h);
    MSET(B, N-1, N-3, N, -self->boundary_coef[CC] / self->h);
    MSET(B, N-1, N-4, N, -self->boundary_coef[DC] / self->h);

    /**** Boundary node: order reduction: tridiagonal w/ 4th order errors ***/
    compact_get_coef(4, coef, coef);

    MSET(B, 0, 1, N, coef[AC] / (2.0f*self->h));
    MSET(B, 0, 0, N, 0.0f);

    MSET(B, 1, 0, N, -coef[AC] / (2.0f*self->h));
    MSET(B, 1, 1, N, 0.0f);
    MSET(B, 1, 2, N, coef[AC] / (2*self->h));
    MSET(B, N-2, N-3, N, -coef[AC] / (2.0f*self->h));
    MSET(B, N-2, N-2, N, 0.0f);
    MSET(B, N-2, N-1, N, coef[AC] / (2.0f*self->h));


    /**** Boundary node: order reduction: tridiagonal w/ 6th order errors ***/
    coef[ALFAC] = 1.0f/3.0f;
    coef[BETAC] = -1.0f; coef[AC] = -1.0f; coef[BC] = -1.0f; coef[CC] = -1.0f;
    compact_get_coef(4, coef, coef);
    MSET(B, 2, 0, N, -coef[BC] / (4.0f*self->h));
    MSET(B, 2, 1, N, -coef[AC] / (2.0f*self->h));
    MSET(B, 2, 2, N, 0.0f);
    MSET(B, 2, 3, N, coef[AC] / (2.0f*self->h));
    MSET(B, 2, 4, N, coef[BC] / (4.0f*self->h));
    MSET(B, N-3, N-5, N, -coef[BC] / (4.0f*self->h));
    MSET(B, N-3, N-4, N, -coef[AC] / (2.0f*self->h));
    MSET(B, N-3, N-3, N, 0.0f);
    MSET(B, N-3, N-2, N, coef[AC] / (2.0f*self->h));
    MSET(B, N-3, N-1, N, coef[BC] / (4.0f*self->h));

    for (i = 3; i < N-3; i++) {
        MSET(B, i, i-3, N, -self->_coef[CC]);
        MSET(B, i, i-2, N, -self->_coef[BC]);
        MSET(B, i, i-1, N, -self->_coef[AC]);
        MSET(B, i, i,   N, 0.0f);
        MSET(B, i, i+1, N, self->_coef[AC]);
        MSET(B, i, i+2, N, self->_coef[BC]);
        MSET(B, i, i+3, N, self->_coef[CC]);
    }


    return 0;
}




/************************* 2nd derivative stuff *************************/

/*
 * Calculates the coefficients for the algorithm (2nd derivative).
 * Receives:
 *  * the intended order of the error;
 *  * an input array of 5 coefficients (all negative entries are
 *    considered uninitialized);
 *  * an output array for the resulting coefficients.
 */


int compact_get_coef2(int order, float x[5], float y[5])
{
    float tmp[5];

    switch (order) {
    case 4:
        if (x[ALFAC] < 0.0f) {
            tmp[ALFAC] = 2./11.;
        } else { tmp[ALFAC] = x[ALFAC]; }
        if (x[BETAC] < 0.0f) {
            tmp[BETAC] = 0.0;
        } else { tmp[BETAC] = x[BETAC]; }
        if (x[CC] < 0.0f) {
            tmp[CC] = 0.0f;
        } else { tmp[CC] = x[CC]; }
        y[ALFAC] = tmp[ALFAC];
        y[BETAC] = tmp[BETAC];
        y[CC] = tmp[CC];

        y[AC] = 1./3. * (4.0f - 4.0f*tmp[ALFAC]
                - 40.0*tmp[BETAC] + 5.0*tmp[CC]);
        y[BC] = 1./3. * (-1.0f + 10.0f*tmp[ALFAC]
                + 46.0f*tmp[BETAC] - 8.0*tmp[CC]);

        break;
    case 6:
        if (x[ALFAC] < 1.0) {
            tmp[ALFAC] = 2.0f/11.0f;
        } else { tmp[ALFAC] = x[ALFAC]; }
        if (x[BETAC] < 1.0f) {
            tmp[BETAC] = 0.0f;
        } else { tmp[BETAC] = x[BETAC]; }
        y[ALFAC] = tmp[ALFAC];
        y[BETAC] = tmp[BETAC];

        y[AC] = (6.0f - 9.0f*tmp[ALFAC]
                - 12.0f*tmp[BETAC]) / 4.0f;
        y[BC] = (-3.0f + 24.0f*tmp[ALFAC]
                - 6.0*tmp[BETAC]) / 5.0f;
        y[CC] = (2.0f - 11.0f*tmp[ALFAC]
                + 124.0*tmp[BETAC]) / 20.0f;
        break;
    default:
        return -1;
    }
    return 0;
}

int _compact_calc_coef2(Compact *self, int order, float coef[5])
{
    float tmp;

    compact_get_coef2(order, coef, &(self->coef[5]));

    self->_coef[A2C] = self->coef[A2C] / (self->h * self->h);
    self->_coef[B2C] = self->coef[B2C] / (4.0f * self->h * self->h);
    self->_coef[C2C] = self->coef[C2C] / (9.0f * self->h * self->h);

    tmp = 0.0f;
    self->boundary_coef[ALFA2C] = tmp;
    self->boundary_coef[A2C] = (11.0f*tmp + 35.0f) / 12.0f;
    self->boundary_coef[B2C] = -(5.0f*tmp + 26.0f) / 3.0f;
    self->boundary_coef[C2C] = (tmp + 19.0f) / 2.0f;
    self->boundary_coef[D2C] = (tmp - 14.0f) / 3.0f;
    self->boundary_coef[E2C] = (11.0f - tmp) / 12.0f;

    return 0;
}



/*
 * Initializes the algorithm matrix A
 */
int _compact_init_A2(Compact *self)
{
    int N = self->N;
    float *A = NULL;
    int *pivots = NULL;
    float coef[5] = {1.0f/10.0f, -1.0f, -1.0f, -1.0f, -1.0f};
    int i, status = -1;

    A = self->A2;
    pivots = self->A2pivots;

    /**** Boundary node: f'1 + alpha f'2 = a f1 + b f2 + c f3 + d f4 ***/
    MSET(A, N-1, N-1, N, 1.0f);
    MSET(A, N-1, N-2, N, self->boundary_coef[ALFA2C]);

    /**** Boundary node: order reduction: tridiagonal w/ 4th order errors ***/
    compact_get_coef2(4, coef, coef);

    MSET(A, 0, 0, N, 1.0f);
    MSET(A, 0, 1, N, coef[ALFAC]);

    MSET(A, 1, 0, N, coef[ALFAC]);
    MSET(A, 1, 1, N, 1.0f);
    MSET(A, 1, 2, N, coef[ALFAC]);
    MSET(A, N-2, N-3, N, coef[ALFAC]);
    MSET(A, N-2, N-2, N, 1.0f);
    MSET(A, N-2, N-1, N, coef[ALFAC]);

    /**** Boundary node: order reduction: tridiagonal w/ 6th order errors ***/
    coef[ALFAC] = 2.0f/11.0f;
    coef[BETAC] = -1.0f; coef[AC] = -1.0f; coef[BC] = -1.0f; coef[CC] = -1.0f;
    compact_get_coef(4, coef, coef);

    MSET(A, 2, 1, N, coef[ALFAC]);
    MSET(A, 2, 2, N, 1.0f);
    MSET(A, 2, 3, N, coef[ALFAC]);
    MSET(A, N-3, N-4, N, coef[ALFAC]);
    MSET(A, N-3, N-3, N, 1.0f);
    MSET(A, N-3, N-2, N, coef[ALFAC]);


    for (i = 3; i < N-3; i++) {
        MSET(A, i, i-2, N, self->coef[BETA2C]);
        MSET(A, i, i-1, N, self->coef[ALFA2C]);
        MSET(A, i, i,   N, 1.0f);
        MSET(A, i, i+1, N, self->coef[ALFA2C]);
        MSET(A, i, i+2, N, self->coef[BETA2C]);
    }

    status = clapack_sgetrf(CblasRowMajor, N, N, A, N, pivots);
    if (status != 0) {
        return status;
    }


    return 0;
}

int _compact_init_B2(Compact *self)
{
    int N = self->N;
    int i;
    float *B = NULL;
    float coef[5] = {1.0f/10.0f, -1.0f, -1.0f, -1.0f, -1.0f};
    const float h2 = self->h * self->h;

    B = self->B2;

    /**** Boundary node: f'1 + alpha f'2 = a f1 + b f2 + c f3 + d f4 ***/
    MSET(B, N-1, N-1, N, -self->boundary_coef[A2C] / h2);
    MSET(B, N-1, N-2, N, -self->boundary_coef[B2C] / h2);
    MSET(B, N-1, N-3, N, -self->boundary_coef[C2C] / h2);
    MSET(B, N-1, N-4, N, -self->boundary_coef[D2C] / h2);
    MSET(B, N-1, N-5, N, -self->boundary_coef[E2C] / h2);

    /**** Boundary node: order reduction: tridiagonal w/ 4th order errors ***/
    compact_get_coef(4, coef, coef);

    MSET(B, 0, 0, N, -2.0f*coef[AC] / h2);
    MSET(B, 0, 1, N, coef[AC] / h2);

    MSET(B, 1, 0, N, coef[AC] / h2);
    MSET(B, 1, 1, N, -2.0f*coef[AC] / h2);
    MSET(B, 1, 2, N, coef[AC] / h2);
    MSET(B, N-2, N-3, N, coef[AC] / h2);
    MSET(B, N-2, N-2, N, -2.0f*coef[AC] / h2);
    MSET(B, N-2, N-1, N, coef[AC] / h2);


    /**** Boundary node: order reduction: tridiagonal w/ 6th order errors ***/
    coef[ALFAC] = 2.0f/11.0f;
    coef[BETAC] = -1.0f; coef[AC] = -1.0f; coef[BC] = -1.0f; coef[CC] = -1.0f;
    compact_get_coef(4, coef, coef);
    MSET(B, 2, 0, N, coef[BC] / (4.0f*h2));
    MSET(B, 2, 1, N, coef[AC] / h2);
    MSET(B, 2, 2, N, -2.0f * (coef[AC] + coef[BC]/4.0f) / h2);
    MSET(B, 2, 3, N, coef[AC] / h2);
    MSET(B, 2, 4, N, coef[BC] / (4.0f*h2));
    MSET(B, N-3, N-5, N, coef[BC] / (4.0f*h2));
    MSET(B, N-3, N-4, N, coef[AC] / h2);
    MSET(B, N-3, N-3, N, -2.0f * (coef[AC] + coef[BC]/4.0f) / h2);
    MSET(B, N-3, N-2, N, coef[AC] / h2);
    MSET(B, N-3, N-1, N, coef[BC] / (4.0f*h2));

    /* TODO: test if B is NULL */
    for (i = 3; i < N-3; i++) {
        MSET(B, i, i-3, N, self->_coef[C2C]);
        MSET(B, i, i-2, N, self->_coef[B2C]);
        MSET(B, i, i-1, N, self->_coef[A2C]);
        MSET(B, i, i,   N, -2.0f/h2 * (self->coef[A2C] + self->coef[B2C]/4.0f + self->coef[C2C]/9.0f));
        MSET(B, i, i+1, N, self->_coef[A2C]);
        MSET(B, i, i+2, N, self->_coef[B2C]);
        MSET(B, i, i+3, N, self->_coef[C2C]);
    }

    return 0;
}





/*
 * Calculates second derivative
 */
int compact_derivative2(Compact *self, float *f, float *df_b, float *f_b, float *Y)
{
    static float *tmp1 = NULL;
    float alpha = 1.0f;
    float beta  = 0.0f;
    int solver_m = 1;
    int solver_n = self->N;
    int solver_info = 0;

    if (tmp1 == NULL) {
        cuda_error(cudaMalloc((void**)&tmp1, solver_n*sizeof(float)));
    }

    /* SOLVE */
    cuda_sgemv(CblasColMajor, CblasNoTrans, solver_n, solver_n, alpha,
               self->B2, solver_n, f, solver_m, beta, Y, solver_m);  // tmp2, solver_m);
#ifdef INVERSE
    cuda_sgemv(CblasRowMajor, CblasNoTrans, solver_n, solver_n, alpha,
               self->A2, solver_n, Y, solver_m, beta, Y, solver_m);  // tmp2, solver_m);
#else
    solver_info = cuda_sgetrs(CblasRowMajor, CblasNoTrans, solver_n, solver_m,
                              self->A2, solver_n, self->A2pivots, Y, solver_n);  // tmp2, solver_n);


    DPRINT("solver return value: %d\n", solver_info);
#endif

    return 0;
}




/* vi: set foldmethod=syntax tw=100: */
/* EOF */
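The interior stencils assembled by `_compact_init_A`/`_compact_init_B` and their second-derivative counterparts follow the classical compact (Padé) finite-difference relations, which is what makes the divisions by $2h$, $4h$, $6h$ in `_compact_calc_coef` and by $h^2$, $4h^2$, $9h^2$ in `_compact_calc_coef2` explicit. For an interior node $i$:

```latex
% First derivative:
\beta f'_{i-2} + \alpha f'_{i-1} + f'_i + \alpha f'_{i+1} + \beta f'_{i+2}
  = a\,\frac{f_{i+1}-f_{i-1}}{2h}
  + b\,\frac{f_{i+2}-f_{i-2}}{4h}
  + c\,\frac{f_{i+3}-f_{i-3}}{6h}

% Second derivative:
\beta f''_{i-2} + \alpha f''_{i-1} + f''_i + \alpha f''_{i+1} + \beta f''_{i+2}
  = a\,\frac{f_{i+1}-2f_i+f_{i-1}}{h^2}
  + b\,\frac{f_{i+2}-2f_i+f_{i-2}}{4h^2}
  + c\,\frac{f_{i+3}-2f_i+f_{i-3}}{9h^2}
```

The diagonal of $B_2$ set inside the interior loop of `_compact_init_B2`, $-\tfrac{2}{h^2}(a + b/4 + c/9)$, is exactly the sum of the $-2f_i$ terms of the second relation.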
Listing B.11: RK4 CUDA implementation

#include "rk4_cuda.h"

#include <cuda.h>
#include <cuda_runtime.h>

#include <aux.h>
#include <cuda_blas.h>
#include <cuda_lapack.h>

int rk4_init(RK4 *self, float dt, int (*F)(int, float*, float*))
{
    if (F == NULL) {
        return -1;
    }
    if (dt <= 0.0f) {
        return -2;
    }
    self->dt = dt;
    self->F = F;
    return 0;
}



int rk4_integrate(RK4 *self, int n, float *input, float *output)
{
    static float *tmp_y = NULL;
    static float *tmp_x = NULL;

    if (tmp_x == NULL) {
        cuda_error(cudaMalloc((void**)&tmp_x, n*sizeof(float)));
    }
    else {
        cuda_error(cudaMemset((void*)tmp_x, 0, n*sizeof(float)));
    }
    if (tmp_y == NULL) {
        cuda_error(cudaMalloc((void**)&tmp_y, n*sizeof(float)));
    }
    else {
        cuda_error(cudaMemset((void*)tmp_y, 0, n*sizeof(float)));
    }


    /* initial stage (k1) */
    self->F(n, input, tmp_y);
    cuda_error(cudaMemcpy(output, tmp_y, n*sizeof(float),
               cudaMemcpyDeviceToDevice));              /* output = k1 */

    /* first middle stage (k2) */
    cuda_error(cudaMemcpy(tmp_x, input, n*sizeof(float),
               cudaMemcpyDeviceToDevice));              /* x0 */
    cublasSaxpy(n, self->dt/2.0f, tmp_y, 1, tmp_x, 1);  /* tmp_x = dt/2 * k1 + x0 */
    self->F(n, tmp_x, tmp_y);                           /* k2 */
    cublasSaxpy(n, 2.0f, tmp_y, 1, output, 1);          /* output = k1 + 2*k2 */


    /* second middle stage */
    cuda_error(cudaMemcpy(tmp_x, input, n*sizeof(float),
               cudaMemcpyDeviceToDevice));              /* x0 */
    cublasSaxpy(n, self->dt/2.0f, tmp_y, 1, tmp_x, 1);  /* tmp_x = dt/2 * k2 + x0 */
    self->F(n, tmp_x, tmp_y);                           /* k3 */
    cublasSaxpy(n, 2.0f, tmp_y, 1, output, 1);          /* output = k1 + 2*k2 + 2*k3 */


    /* last middle stage */
    cuda_error(cudaMemcpy(tmp_x, input, n*sizeof(float),
               cudaMemcpyDeviceToDevice));              /* x0 */
    cublasSaxpy(n, self->dt, tmp_y, 1, tmp_x, 1);       /* tmp_x = dt * k3 + x0 */
    self->F(n, tmp_x, tmp_y);                           /* k4 */
    cublasSaxpy(n, 1.0f, tmp_y, 1, output, 1);          /* output = k1 + 2*k2 + 2*k3 + k4 */


    /* averaging step */
    cuda_error(cudaMemcpy(tmp_x, input, n*sizeof(float),
               cudaMemcpyDeviceToDevice));              /* x0 */
    cublasSswap(n, tmp_x, 1, output, 1);
    cublasSaxpy(n, self->dt/6.0f, tmp_x, 1, output, 1); /* output = x0 + dt/6 * (k1+2*k2+2*k3+k4) */

    return 0;
}




/* EOF */
B.2.3 Application
Listing B.12: Simulation implementation

#define _XOPEN_SOURCE 500

#include <stdio.h>
#include <math.h>
#include <time.h>
#include <malloc.h>
#include <string.h>

#include <mutil.h>
#include <mNumeric.h>

#include <cuda.h>
#include <cuda_runtime.h>

#include <aux.h>
#include <cuda_blas.h>
#include <cuda_lapack.h>

#define pi M_PI
#define BENCH_FNAME "./tmp/gbench_burgers.log"
#define LOG_FNAME "./tmp/gburgers.log"
#define X_MIN 0.0
#define X_MAX 1.0

#define K2 0.1
#define K1 0.3

long int NX;
int NT = 500;
float nu;
const float a = -10.0f;
Compact *CA;
float *df_b, *f_b;

/*
 * linspace(start, stop, num=50, endpoint, retstep)
 */

int linspace(float start, float stop, int num, int endpoint, float *step, float *Y)
{
    float dx;
    int i, n;

    if (endpoint <= 0) {
        n = num + 1;

    }
    else {
        n = num;
    }

    dx = (stop - start) / (n - 1);
    for (i = 0; i < num; i++) {
        Y[i] = start + i*dx;
    }

    *step = dx;
    return 0;
}
64 int F ( int nx , f loat∗ x , f loat∗ y )
65 {
66 stat ic f loat ∗ t m p 1 = N U L L ;
67 const int i n c x = 1;
68
69 i f ( t m p 1 == N U L L ) {
70 c u d a _ e r r o r ( c u d a M a l l o c ( ( void∗∗)&t m p 1 , n x ∗ s izeo f ( f loat ) ) ) ;
71 }
72 else {
73 c u d a _ e r r o r ( c u d a M e m s e t ( ( void∗) t m p 1 , 0 .0 f , n x ∗ s izeo f ( f loat ) ) ) ;
74 }
75
76 c o m p a c t _ d e r i v a t i v e 2 ( CA , x , d f _ b , f _ b , t m p 1 ) ;
77 c o m p a c t _ d e r i v a t i v e ( CA , x , d f _ b , f _ b , y ) ;
78 c u b l a s S s c a l ( nx , a , y , i n c x ) ;
79 c u b l a s S a x p y ( nx , nu , t m p 1 , i n c x , y , i n c x ) ;
80 return 0 ;
81 }
82
83 int f _ u 0 ( int nx , f loat∗x , f loat∗ y )
84 {
85 int i ;
86 f loat t m p ;
87
88 goto s i n u s o i d a l ;
89 for ( i=0; i<n x ; i++){
90 i f ( i == n x /10){
91 y [ i ] = 1 .0 f ;
92 }
93 else{
94 y [ i ] = 0 .0 f ;
95 }
96 }
97 goto e n d ;
98
99 s i n u s o i d a l :
100 for ( i=0; i<n x ; i++){
101 t m p = x [ i ] ;
102 i f ( t m p >= 0.05 && t m p <0.15){
103 y [ i ] = 0 .5 f∗ s i n f ( ( t m p −0.85) ∗2.0∗ p i /0 .1 ) ;
104 }
105 else{
106 y [ i ] = 0 .0 f ;
107 }
108 }
109 goto e n d ;
110 e n d :
111 return 0 ;
112 }
113
114
115
116 #include ”aux . h”
117
118
119 int m a i n ( int a r g c , char ∗ a r g v [ ] )
120 {
121 f loat ∗ xx , ∗ u 0 ;
122 f loat∗ L O G , ∗ h _ l o g , ∗ d _ u 0 ;
123 f loat dx , d t ;
124 int i , j ;
125 C o m p a c t C A _ [ 1 ] ;
126 R K 4 R K [ 1 ] ;
127 f loat c o e f [ ] = {−1.0 f ,−1.0 f ,−1.0 f ,−1.0 f ,−1.0 f , −1.0 f ,−1.0 f ,−1.0 f ,−1.0 f ,−1.0 f } ;
128 F I L E ∗ l o g _ f i l e ;
91
129 c l o c k _ t t i m e s [ N T +10] ;
130
131 t i m e s [ 0 ] = c l o c k ( ) ;
132 m a l l o p t ( M _ M M A P _ M A X , 0) ;
133 c u d a _ i n i t ( a r g c , a r g v ) ;
134
135 i f ( a r g c > 1){
136 N X = s t r t o l ( a r g v [ 1 ] , N U L L , 10) ;
137 i f ( N X==L O N G _ M I N | | N X== L O N G _ M A X ){
138 p e r r o r ( ”Argument e r r o r ” ) ;
139 }
140 }
141 else N X = 512;
142 /∗ i n i t domain ∗/
143 x x = ( f loat ∗) c a l l o c ( NX , s izeo f ( f loat ) ) ;
144 u 0 = ( f loat ∗) c a l l o c ( NX , s izeo f ( f loat ) ) ;
145 f _ b = ( f loat ∗) c a l l o c (2 , s izeo f ( f loat ) ) ;
146 d f _ b =( f loat ∗) c a l l o c (2 , s izeo f ( f loat ) ) ;
147
148
149 l i n s p a c e ( X _ M I N , X _ M A X , NX , 1 , &dx , x x ) ;
150
151 d t = f a b s f ( ( K 1 ∗ d x ) / a ) ;
152 n u = ( K 2 ∗ d x ∗ d x ) / d t ;
153
154 C A = C A _ ;
155 c o m p a c t _ i n i t ( CA , dx , NX , 4 , c o e f ) ;
156
157 f _ u 0 ( NX , xx , u 0 ) ;
158 r k 4 _ i n i t ( RK , dt , F ) ;
159
160
161 c u d a _ e r r o r ( c u d a M a l l o c ( ( void∗∗)&d _ u 0 , N X ∗ s izeo f ( f loat ) ) ) ;
162 c u d a _ e r r o r ( c u d a M a l l o c ( ( void∗∗)&L O G , N T ∗ N X ∗ s izeo f ( f loat ) ) ) ;
163 h _ l o g = ( f loat ∗) m a l l o c ( N X ∗ N T ∗ s izeo f ( f loat ) ) ;
164
165 D P R I N T ( ”x domain : (x m , x M , dx )=(%1.2 f ,%1.2 f ,%1.2 f \n” , x x [ 0 ] , x x [ NX −1] , d x ) ;
166 D P R I N T ( ” t domain : ( t m , x m , dx )=(%2.5 f ,%2.5 f ,% f \n” , 0 .0 f , ( NT−1)∗ dt , d t ) ;
167
168 /∗ i n i t i a l c o n d i t i o n ∗/
169 c u d a _ e r r o r ( c u d a M e m c p y ( d _ u 0 , u0 , N X ∗ s izeo f ( f loat ) , c u d a M e m c p y H o s t T o D e v i c e ) ) ;
170 c u d a _ e r r o r ( c u d a M e m c p y ( L O G , d _ u 0 , N X ∗ s izeo f ( f loat ) , c u d a M e m c p y D e v i c e T o D e v i c e ) ) ;
171 m e m c p y ( h _ l o g , u0 , N X ∗ s izeo f ( f loat ) ) ;
172
173 /∗ main l o o p ∗/
174
175 t i m e s [ 1 ] = c l o c k ( ) ;
176 for ( i=1; i<N T ; i++){
177 r k 4 _ i n t e g r a t e ( RK , NX , d _ u 0 , L O G+i∗ N X ) ;
178 c u d a _ e r r o r ( c u d a M e m c p y ( d _ u 0 , L O G+i∗ NX , N X ∗ s izeo f ( f loat ) , c u d a M e m c p y D e v i c e T o D e v i c e ) ) ;
179 }
180
181 c u d a _ e r r o r ( c u d a M e m c p y ( h _ l o g , L O G , N T ∗ N X ∗ s izeo f ( f loat ) , c u d a M e m c p y D e v i c e T o H o s t ) ) ;
182 t i m e s [ 2 ] = c l o c k ( ) ;
183
184 l o g _ f i l e = f o p e n ( B E N C H _ F N A M E , ”a” ) ;
185 // o u t p u t f o rm a t NX NT t i t l
186 f p r i n t f ( l o g _ f i l e , ”%04d %03d %0∗ ld %0∗ ld\n” , ( int ) NX , NT , 8 ,
187 ( t i m e s [1]− t i m e s [ 0 ] ) /1000 , 8 , ( t i m e s [2]− t i m e s [ 1 ] ) /1000) ;
188 f c l o s e ( l o g _ f i l e ) ;
189 f p r i n t f ( s t d e r r , ”DONE\n” ) ;
190 f f l u s h ( N U L L ) ;
191 /∗ OUTPUT ∗/
192
193 l o g _ f i l e = f o p e n ( L O G _ F N A M E , ”w” ) ;
194
195 for ( i=0; i<N X ; i++){
196 f p r i n t f ( l o g _ f i l e , ”%+1.5 f ” , x x [ i ] ) ;
197 for ( j=0; j<NT −1; j++){
198 f p r i n t f ( l o g _ f i l e , ”%+2.5 f ” , ∗( h _ l o g+j∗ N X+i ) ) ;
199 }
92
200 f p r i n t f ( l o g _ f i l e , ”%+2.5 f \n” , ∗( h _ l o g +( NT−1)∗ N X+i ) ) ;
201 }
202 f c l o s e ( l o g _ f i l e ) ;
203
204 return 0 ;
205 }
206
207
208
209 /∗ v i : s e t f o l d m e t h o d= s y n t a x : ∗/
210 /∗EOF∗/
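In `main`, the time step and viscosity are derived from the mesh spacing through the constants `K1` and `K2`. Reading them as convective and diffusive stability numbers (an interpretation, not stated in the listing itself), the two assignments correspond to

```latex
\Delta t = K_1\,\frac{\Delta x}{|a|},
\qquad
\nu = K_2\,\frac{\Delta x^2}{\Delta t},
```

so that the CFL number $|a|\,\Delta t/\Delta x = K_1 = 0.3$ and the diffusion number $\nu\,\Delta t/\Delta x^2 = K_2 = 0.1$ are held fixed as the grid is refined.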