Parallel Computing: Perspectives for more efficient hydrological modeling


DESCRIPTION

A presentation that introduces the basic concepts of parallel computing and gives some details on General Purpose GPU computing using the CUDA architecture.

TRANSCRIPT

Page 1: Parallel Computing: Perspectives for more efficient hydrological modeling

Parallel Computing: Perspectives for more efficient hydrological modeling

Grigorios Anagnostopoulos

Internal Seminar, 11.10.2011

Page 2: Parallel Computing: Perspectives for more efficient hydrological modeling


What is parallel computing?

Simultaneous use of multiple computing resources to solve a single computational problem.

The computing resources can be:

A single computer with multiple processors.

A number of computers connected to a network.

A combination of both.

Benefits of parallel computing:

The computational load is broken apart into discrete pieces of work that can be processed simultaneously.

The total simulation time is much shorter when multiple computing resources are used.

Page 3: Parallel Computing: Perspectives for more efficient hydrological modeling

Parallel Computer Models Classification

Flynn’s taxonomy: A widely used classification

Classify along two independent dimensions:

Instruction and Data.

Each dimension can have two possible states:

Single or Multiple.

SISD: Single Instruction, Single Data.

SIMD: Single Instruction, Multiple Data.

MISD: Multiple Instruction, Single Data.

MIMD: Multiple Instruction, Multiple Data.

Page 4: Parallel Computing: Perspectives for more efficient hydrological modeling

MIMD: Multiple Instruction, Multiple Data

The most common type of parallel computer (most modern computers fall into this category).

Consists of a collection of fully independent processing units or cores, each having its own control unit and its own ALU.

Execution can be synchronous or asynchronous, as the processors can operate at their own pace.

There are two principal types of MIMD systems: shared-memory systems and distributed-memory systems. In a shared-memory system a collection of autonomous processors is connected to a memory system via an interconnection network, and each processor can access each memory location. In a shared-memory system, the processors usually communicate implicitly by accessing shared data structures. In a distributed-memory system, each processor is paired with its own private memory, and the processor-memory pairs communicate over an interconnection network. So in distributed-memory systems the processors usually communicate explicitly by sending messages or by using special functions that provide access to the memory of another processor. See Figures 2.3 and 2.4.

[Figure 2.3: A shared-memory system — several CPUs sharing one memory through an interconnect.]

[Figure 2.4: A distributed-memory system — CPU-memory pairs communicating through an interconnect.]

Shared-memory systems

The most widely available shared-memory systems use one or more multicore processors. A multicore processor has multiple CPUs or cores on a single chip. Typically, the cores have private level 1 caches, while other caches may or may not be shared between the cores.

Page 5: Parallel Computing: Perspectives for more efficient hydrological modeling

Parallelism: An everyday example

Task parallelism: the ability to execute different tasks within a problem at the same time.

Data parallelism: the ability to execute parts of the same task on different data at the same time.

As an analogy, think about a farmer who hires workers to pick apples from his trees:

Worker = hardware (processing element).

Trees = tasks.

Apples = data.

Page 6: Parallel Computing: Perspectives for more efficient hydrological modeling

Sequential approach

The sequential approach would be to have one worker pick all of the apples from each tree.

Page 7: Parallel Computing: Perspectives for more efficient hydrological modeling

Parallelism: More workers

Data parallel hardware: more workers on the same tree, which allows each task to be completed more quickly.

How many workers should work per tree?

What if some trees have few apples, while others have many?

Page 8: Parallel Computing: Perspectives for more efficient hydrological modeling

Parallelism: More workers

Task parallelism: each worker picks apples from a different tree.

Although each task takes the same time as in the sequential version, many tasks are accomplished in parallel.

What if there are only a few densely populated trees?

Page 9: Parallel Computing: Perspectives for more efficient hydrological modeling

Algorithm Decomposition

Most engineering problems are non-trivial, and it is crucial to have more formal concepts for determining parallelism.

The concept of decomposition

Task decomposition: dividing the algorithm into individual tasks, which are functionally independent.

Data decomposition: dividing a data set into discrete chunks that can be processed in parallel.

Task decomposition reduces an algorithm to functionally independent parts, but tasks may have dependencies on other tasks:

If the input of task B depends on the output of task A, then task B is dependent on task A.

Tasks that don't have dependencies (or whose dependencies are completed) can be executed at any time to achieve parallelism.

Task dependency graphs are used to describe the relationships between tasks.

[Diagram: two example dependency graphs. In the first, B is dependent on A. In the second, A and B are independent of each other, while C is dependent on both A and B.]
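In CUDA, one way to honor such a dependency graph is with streams and events. The sketch below is purely illustrative (it is not part of the presentation, and taskA/taskB/taskC are placeholder kernels): A and B are launched in separate streams so they may overlap, and C is held back until both have finished.

    // Illustrative sketch: enforce "A and B independent; C depends on both".
    __global__ void taskA(float *d) { /* work of task A */ }
    __global__ void taskB(float *d) { /* work of task B */ }
    __global__ void taskC(float *d) { /* work of task C */ }

    void runDependencyGraph(float *d)
    {
        cudaStream_t sA, sB;
        cudaEvent_t doneB;
        cudaStreamCreate(&sA);
        cudaStreamCreate(&sB);
        cudaEventCreate(&doneB);

        taskA<<<128, 256, 0, sA>>>(d);      // independent tasks in separate
        taskB<<<128, 256, 0, sB>>>(d);      // streams: they may run concurrently
        cudaEventRecord(doneB, sB);         // mark the completion of B

        cudaStreamWaitEvent(sA, doneB, 0);  // stream sA now also waits for B
        taskC<<<128, 256, 0, sA>>>(d);      // C runs only after A and B finish

        cudaStreamSynchronize(sA);
        cudaEventDestroy(doneB);
        cudaStreamDestroy(sA);
        cudaStreamDestroy(sB);
    }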

Page 10: Parallel Computing: Perspectives for more efficient hydrological modeling

Why GPU Programming?

A quiet revolution and potential build-up:

Calculation: TFLOPS (GPU) vs. 100 GFLOPS (CPU).

Memory bandwidth: ~10x.

A GPU in every PC: massive volume and potential impact.

[Figure 1.1: the enlarging performance gap between many-core GPUs and multi-core CPUs. Courtesy: John Owens.]

Parallel programming is easier than ever because it can be done on relatively low-end PCs.

Cards such as the Nvidia Tesla C1060 (based on the GT200 chip) contain 240 cores, each of which is highly multithreaded.

10 / 20

Parallel Computing: Perspectives for more e�cient hydrological modeling

Page 11: Parallel Computing: Perspectives for more efficient hydrological modeling

General Concepts GPU Programming CA Parallel implementation

GPU vs CPU

GPU: Few instructions but very fast execution. Uses very fast GDDR3 RAM. Most die area is used for ALUs and the caches are relatively small.

CPU: Lots of instructions but slower execution. Uses slower DDR2 or DDR3 RAM (but it has direct access to more memory than GPUs). Most die area is used for memory cache and there are relatively few transistors for ALUs.

Page 12: Parallel Computing: Perspectives for more efficient hydrological modeling

GPU is fast

Page 13: Parallel Computing: Perspectives for more efficient hydrological modeling

CUDA: Compute Unified Device Architecture

CUDA Program: Consists of phases that are executed on either the host (CPU) or a device (GPU).

No data parallelism = the code is executed on the host.

Data parallelism = the code is executed on the device.

Data-parallel portions of an application are expressed as device kernels, which run on the device.

GPU kernels are written using the Single Program Multiple Data (SPMD) programming model.

SPMD executes multiple instances of the same program independently, where each program works on a different portion of the data.

Arrays of parallel threads: a CUDA kernel is executed by an array of threads. All threads run the same code (SPMD); each thread has an ID (threadID = 0, 1, 2, ...) that it uses to compute memory addresses and make control decisions:

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
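As a concrete sketch (an illustration under assumed names, not code from the talk), a complete kernel computes its global threadID from the built-in block and thread indices, and the host launches enough blocks to cover the data:

    // Illustrative SPMD kernel: one thread per array element.
    __global__ void squareKernel(const float *input, float *output, int n)
    {
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadID < n)                          // guard threads past the end
            output[threadID] = input[threadID] * input[threadID];
    }

    // Host-side launch for n elements, 256 threads per block:
    //   squareKernel<<<(n + 255) / 256, 256>>>(d_input, d_output, n);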

Page 14: Parallel Computing: Perspectives for more efficient hydrological modeling

CUDA: Compute Unified Device Architecture

A CUDA kernel is executed by an array of threads. Each thread has an ID, which is used to compute memory addresses and make control decisions.

CUDA threads are organized into multiple blocks. Threads within a block cooperate via shared memory, atomic operations and barrier synchronization.
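A minimal sketch of this cooperation (an assumed example, not from the talk): each block stages data in shared memory, synchronizes at a barrier, and then threads read elements loaded by other threads of the same block.

    #define TILE 256

    // Each block reverses its own tile of the array in place.
    __global__ void reverseTile(float *data)
    {
        __shared__ float tile[TILE];              // visible to the whole block
        int i = blockIdx.x * TILE + threadIdx.x;
        tile[threadIdx.x] = data[i];              // every thread loads one element
        __syncthreads();                          // barrier: wait for the full tile
        data[i] = tile[TILE - 1 - threadIdx.x];   // read another thread's element
    }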

[Figure 2-1 (CUDA Programming Guide, Version 2.3): Grid of Thread Blocks.]

Memory hierarchy: CUDA threads may access data from multiple memory spaces during their execution. Each thread has a private local memory. Each thread block has a shared memory visible to all threads of the block and with the same lifetime as the block. Finally, all threads have access to the same global memory.

There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages (see Sections 5.1.2.1, 5.1.2.3, and 5.1.2.4). Texture memory also offers different addressing modes, as well as data filtering, for some specific data formats (see Section 3.2.4).

The global, constant, and texture memory spaces are persistent across kernel launches by the same application.


Page 15: Parallel Computing: Perspectives for more efficient hydrological modeling

CUDA memory types

Global memory: low bandwidth but large space. Fastest read/write calls if they are coalesced.

Texture memory: cache optimized for 2D spatial access patterns.

Constant memory: slow, but cached (8 KB).

Shared memory: fast, but usable only by the threads of the same block.

Registers: 32,768 32-bit registers per multiprocessor.
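A compact sketch of how these spaces appear in CUDA C (assumed example code with hypothetical names, not from the talk):

    __constant__ float params[16];            // constant memory: small, cached,
                                              // set by the host via cudaMemcpyToSymbol

    __global__ void staged(const float *in, float *out)
    {
        __shared__ float buf[128];            // shared memory: per-block scratchpad
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x = in[i];                      // global memory read; coalesced when
                                              // neighboring threads touch neighboring addresses
        buf[threadIdx.x] = x * params[0];     // scalars such as x and i live in registers
        __syncthreads();
        out[i] = buf[threadIdx.x];
    }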

[Figure 4-2 (CUDA Programming Guide, Version 2.3): Hardware Model — a set of SIMT multiprocessors with on-chip shared memory.]

Multiple devices: The use of multiple GPUs as CUDA devices by an application running on a multi-GPU system is only guaranteed to work if these GPUs are of the same type.

When the system is in SLI mode, all GPUs are accessible via the CUDA driver and runtime as separate devices, but there are special considerations as described below.

First, an allocation in one CUDA device on one GPU will consume memory on other GPUs. Because of this, allocations may fail earlier than otherwise expected.


Page 16: Parallel Computing: Perspectives for more efficient hydrological modeling

CA Parallel implementation

A parallel version of the Cellular Automata algorithm for variably saturated flow in soils was developed using the CUDA API.

The infiltration experiment of Vauclin et al. (1979) was chosen as a benchmark test for the accuracy and the speed of the algorithm.

[Figure: simulated water depth (m) vs. distance (0-3 m) at t = 2, 3, 4 and 8 hrs, plotted against the experimental data.]

Page 17: Parallel Computing: Perspectives for more efficient hydrological modeling

Why is parallel code important?

In real-case scenarios, where the 3-D simulation of large areas is needed, the grid sizes are excessively large.

In natural hazards assessment the simulations must be fast in order to be useful (the prediction should come before the actual event!).

Fast simulations allow us to calibrate the model parameters more easily and to investigate the physical phenomena more efficiently.

The natural parallelism inherent in the CA concept makes the parallel implementation of the algorithm easier.

Page 18: Parallel Computing: Perspectives for more efficient hydrological modeling

Technical details

Difficulties

The most challenging issue was the irregular geometry of the domain, which made it harder to exploit locality in the thread computations and to use the shared memory.

The cell values were stored in a 1D array, and for each cell the indices of its neighboring cells were also stored (a sketch of such a layout follows).
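A minimal sketch of such a storage scheme (assumed layout and names; the talk does not show its exact code): values in a flat array plus an explicit neighbor-index table, with a sentinel for boundary cells.

    #define MAX_NEIGHBORS 6          // assumed upper bound per cell

    typedef struct {
        float *value;                // cell states, length n_cells
        int   *neighbor;             // indices, length n_cells * MAX_NEIGHBORS;
                                     // -1 marks a missing neighbor
        int    n_cells;
    } CAGrid;

    // Typical access pattern inside a kernel, for cell c:
    //   for (int k = 0; k < MAX_NEIGHBORS; ++k) {
    //       int nb = grid.neighbor[c * MAX_NEIGHBORS + k];
    //       if (nb >= 0) flux += K * (grid.value[nb] - grid.value[c]);
    //   }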

Code structure

Simulation constants are stored in the constant memory.

Soil properties for each soil class are stored in the texture memory.

Atomic operations are used in order to check for convergence at every iteration.

The shared memory is used to accelerate the atomic operations and the block's memory accesses (one possible scheme is sketched below).
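As a sketch of one way to combine these two points (an assumed scheme, not necessarily the talk's exact one): each block first folds its threads' convergence tests into a single shared-memory flag, so that only one global atomic operation is issued per block.

    // Per-block reduction in shared memory, then one global atomic per block.
    // The host clears *notConverged before each launch.
    __global__ void checkConvergence(const float *delta, int n,
                                     int *notConverged, float tol)
    {
        __shared__ int blockFlag;                 // one flag for the whole block
        if (threadIdx.x == 0) blockFlag = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && fabsf(delta[i]) > tol)
            atomicOr(&blockFlag, 1);              // cheap shared-memory atomic

        __syncthreads();
        if (threadIdx.x == 0 && blockFlag)
            atomicOr(notConverged, 1);            // global atomic, once per block
    }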

Page 19: Parallel Computing: Perspectives for more efficient hydrological modeling

Results of the numerical tests

Nvidia Quadro 2000:

192 CUDA cores.

1 GB of GDDR5 memory.

1"

10"

100"

1000"

10000"

100000"

1000" 10000" 100000" 1000000" 10000000"

Speed%(%cells/sec%)%

Number%of%Cells%

CPU" GPU"

0"

10"

20"

30"

40"

50"

60"

70"

80"

90"

1000" 10000" 100000" 1000000" 10000000"

Speed%Up%

Number%of%Cells%


Page 20: Parallel Computing: Perspectives for more efficient hydrological modeling

Thanks for your attention!