Parallel Computing: Perspectives for more efficient hydrological modeling
Grigorios Anagnostopoulos
Internal Seminar, 11.10.2011
Outline: General Concepts · GPU Programming · CA Parallel implementation
What is parallel computing?
Simultaneous use of multiple computing resources to solve a single computational problem.
The computing resources can be:
A single computer with multiple processors.
A number of computers connected to a network.
A combination of both.
Benefits of parallel computing:
The computational load is broken into discrete pieces of work that can be processed simultaneously.
The total simulation time is much shorter when multiple computing resources are used.
Parallel Computer Models Classification
Flynn's taxonomy: a widely used classification.
Classify along two independent dimensions: Instruction and Data.
Each dimension can have two possible states: Single or Multiple.
SISD: Single Instruction, Single Data.
SIMD: Single Instruction, Multiple Data.
MISD: Multiple Instruction, Single Data.
MIMD: Multiple Instruction, Multiple Data.
MIMD: Multiple Instruction, Multiple Data
The most common type of parallel computer (most modern computers fall into this category).
Consists of a collection of fully independent processing units or cores, each having its own control unit and its own ALU.
Execution can be synchronous or asynchronous, as the processors can operate at their own pace.
As we noted in Chapter 1, there are two principal types of MIMD systems: shared-memory systems and distributed-memory systems. In a shared-memory system a collection of autonomous processors is connected to a memory system via an interconnection network, and each processor can access each memory location. In a shared-memory system, the processors usually communicate implicitly by accessing shared data structures. In a distributed-memory system, each processor is paired with its own private memory, and the processor-memory pairs communicate over an interconnection network. So in distributed-memory systems the processors usually communicate explicitly by sending messages or by using special functions that provide access to the memory of another processor. See Figures 2.3 and 2.4.
[Figure 2.3: A shared-memory system; several CPUs connected to a single memory through an interconnect.]
[Figure 2.4: A distributed-memory system; CPU-memory pairs connected through an interconnect.]
Shared-memory systems
The most widely available shared-memory systems use one or more multicore processors. As we discussed in Chapter 1, a multicore processor has multiple CPUs or cores on a single chip. Typically, the cores have private level 1 caches, while other caches may or may not be shared between the cores.
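In a shared-memory system, cores can cooperate simply by reading and writing the same data structure. As a minimal sketch of this implicit communication (plain C++ threads, not the MPI or CUDA APIs discussed elsewhere in the talk), the function below splits an array sum across several threads; each thread writes only its own slot of a shared partial-sum array, so no locking is needed:

```cpp
#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

// Sum `data` using `nthreads` threads in a shared-memory style:
// every thread reads the same array, but writes only its own partial sum.
double parallel_sum(const std::vector<double>& data, unsigned nthreads) {
    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> workers;
    std::size_t chunk = (data.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            std::size_t begin = t * chunk;
            std::size_t end = std::min(begin + chunk, data.size());
            for (std::size_t i = begin; i < end; ++i)
                partial[t] += data[i];  // one slot per thread: no write conflicts
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

In a distributed-memory system the same computation would instead exchange partial sums as explicit messages between processes.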
Parallelism: An everyday example
Task parallelism: the ability to execute different tasks within a problem at the same time.
Data parallelism: the ability to execute parts of the same task on different data at the same time.
As an analogy, think about a farmer who hires workers to pick apples from his trees:
Worker = hardware (processing element).
Trees = task.
Apples = data.
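The two styles can be made concrete in code (a C++ sketch, not part of the talk; the function names are illustrative): data parallelism applies the same operation to different slices of one array, while task parallelism runs different operations concurrently on the same data:

```cpp
#include <algorithm>
#include <future>
#include <utility>
#include <vector>

// Data parallelism: the SAME operation (doubling) on DIFFERENT halves of the data.
std::vector<int> double_all(std::vector<int> v) {
    std::size_t mid = v.size() / 2;
    auto lo = std::async(std::launch::async, [&] {
        for (std::size_t i = 0; i < mid; ++i) v[i] *= 2;
    });
    auto hi = std::async(std::launch::async, [&] {
        for (std::size_t i = mid; i < v.size(); ++i) v[i] *= 2;
    });
    lo.get(); hi.get();
    return v;
}

// Task parallelism: DIFFERENT operations run at the same time on the same data.
std::pair<int, int> min_and_max(const std::vector<int>& v) {
    auto mn = std::async(std::launch::async,
                         [&] { return *std::min_element(v.begin(), v.end()); });
    auto mx = std::async(std::launch::async,
                         [&] { return *std::max_element(v.begin(), v.end()); });
    return {mn.get(), mx.get()};
}
```

In the apple-picking analogy, `double_all` is several workers on one tree; `min_and_max` is workers assigned to different trees.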
Sequential approach
The sequential approach would be to have one worker pick all of the apples from each tree.
Parallelism: More workers
Data parallel hardware: several workers on the same tree, which allows each task to be completed more quickly.
How many workers should work per tree?
What if some trees have few apples, while others have many?
Parallelism: More workers
Task parallelism: each worker picks apples from a different tree.
Although each task takes the same time as in the sequential version, many tasks are accomplished in parallel.
What if there are only a few densely populated trees?
Algorithm Decomposition
Most engineering problems are non-trivial, and it is crucial to have more formal concepts for determining parallelism.
The concept of decomposition
Task decomposition: dividing the algorithm into individual tasks, which are functionally independent. Tasks which don't have dependencies (or whose dependencies are completed) can be executed at any time to achieve parallelism.
Data decomposition: dividing a data set into discrete chunks that can be processed in parallel.
Task decomposition reduces an algorithm to functionally independent parts, but tasks may have dependencies on other tasks:
If the input of task B depends on the output of task A, then task B is dependent on task A.
Task dependency graphs are used to describe the relationships between tasks.
[Task dependency graph examples: in the first graph, B is dependent on A; in the second, A and B are independent of each other, while C is dependent on both A and B.]
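A dependency graph of this kind maps directly onto futures: independent tasks A and B can be launched concurrently, and C starts only once both results are ready. A small C++ sketch (the tasks and their return values are illustrative, not from the talk):

```cpp
#include <future>

// Tasks A and B are independent of each other; task C consumes both
// of their outputs, so C may only run after A and B have completed.
int run_dependency_graph() {
    auto a = std::async(std::launch::async, [] { return 2; });  // task A
    auto b = std::async(std::launch::async, [] { return 3; });  // task B
    int ra = a.get();  // wait for A
    int rb = b.get();  // wait for B
    return ra * rb;    // task C: runs only after both dependencies resolve
}
```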
Why GPU Programming?
A quiet revolution and potential build-up:
Calculation: TFLOPS vs. 100 GFLOPS.
Memory bandwidth: roughly 10x higher.
GPU in every PC: massive volume and potential impact.
[Figure 1.1: Enlarging performance gap between GPUs and CPUs; multi-core CPU vs. many-core GPU. Courtesy: John Owens.]
Parallel programming is easier than ever because it can be done on relatively low-end PCs.
Cards such as the Nvidia Tesla C1060 (built on the GT200 chip) contain 240 cores, each of which is highly multithreaded.
GPU vs CPU
GPU: Few instructions but very fast execution. Uses very fast GDDR3 RAM. Most die area is used for ALUs and the caches are relatively small.
CPU: Lots of instructions but slower execution. Uses slower DDR2 or DDR3 RAM (but it has direct access to more memory than GPUs). Most die area is used for memory cache and there are relatively few transistors for ALUs.
GPU is fast
CUDA: Compute Unified Device Architecture
CUDA Program: Consists of phases that are executed on either the host (CPU) or a device (GPU).
No data parallelism: the code is executed on the host.
Data parallelism: the code is executed on the device.
Data-parallel portions of an application are expressed as device kernels which run on the device.
GPU kernels are written using the Single Program Multiple Data (SPMD) programming model.
SPMD executes multiple instances of the same program independently, where each program works on a different portion of the data.
Arrays of Parallel Threads
A CUDA kernel is executed by an array of threads: all threads run the same code (SPMD), and each thread has an ID that it uses to compute memory addresses and make control decisions.
[Diagram: threads with threadID 0 through 7, each executing: float x = input[threadID]; float y = func(x); output[threadID] = y;]
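The per-thread snippet in the diagram can be emulated on the host to make the SPMD idea concrete: every "thread" runs exactly the same body, and only its threadID determines which data element it touches (a CPU sketch, not actual CUDA; `func` is an arbitrary illustrative operation):

```cpp
#include <cmath>
#include <vector>

// CPU emulation of the SPMD pattern: the same code runs once per threadID,
// and the ID alone selects which element each instance reads and writes.
std::vector<float> launch_emulated(const std::vector<float>& input) {
    std::vector<float> output(input.size());
    auto func = [](float x) { return std::sqrt(x); };  // illustrative per-element op
    for (std::size_t threadID = 0; threadID < input.size(); ++threadID) {
        float x = input[threadID];   // same code...
        float y = func(x);
        output[threadID] = y;        // ...different data, chosen by threadID
    }
    return output;
}
```

On a GPU the loop disappears: each iteration becomes one hardware thread, all running concurrently.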
CUDA: Compute Unified Device Architecture
A CUDA kernel is executed by an array of threads. Each thread has an ID, which is used to compute memory addresses and make control decisions.
CUDA threads are organized into multiple blocks. Threads within a block cooperate via shared memory, atomic operations and barrier synchronization.
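Because threads come grouped in blocks, a kernel typically derives a unique global index from its block and thread IDs, conventionally blockIdx.x * blockDim.x + threadIdx.x in CUDA. The arithmetic is easy to check on the host; this sketch enumerates a 1-D grid the same way the hardware numbers its threads:

```cpp
#include <vector>

// Host-side check of CUDA's usual global-index formula:
// globalID = blockIdx * blockDim + threadIdx.
std::vector<int> enumerate_grid(int gridDim, int blockDim) {
    std::vector<int> ids;
    for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx)        // each block
        for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx) // each thread
            ids.push_back(blockIdx * blockDim + threadIdx);
    return ids;
}
```

Every (block, thread) pair maps to a distinct global index, so each thread can safely own one element of a global array.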
[Figure 2-1: Grid of thread blocks.]
Memory Hierarchy
CUDA threads may access data from multiple memory spaces during their execution, as illustrated by Figure 2-2. Each thread has a private local memory. Each thread block has a shared memory visible to all threads of the block and with the same lifetime as the block. Finally, all threads have access to the same global memory.
There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages (see Sections 5.1.2.1, 5.1.2.3, and 5.1.2.4). Texture memory also offers different addressing modes, as well as data filtering, for some specific data formats (see Section 3.2.4).
The global, constant, and texture memory spaces are persistent across kernel launches by the same application.
[Diagram: a grid of thread blocks, Block (0, 0) through Block (2, 1); each block contains a 2-D array of threads, Thread (0, 0) through Thread (3, 2).]
Parallel Computing: Perspectives for more e�cient hydrological modeling
General Concepts GPU Programming CA Parallel implementation
CUDA memory types
Global memory: Low bandwidth but large space. Fastest read/write calls if they are coalesced.
Texture memory: Cache optimized for 2D spatial patterns.
Constant memory: Slow, but with cache (8 KB).
Shared memory: Fast, but it can be used only by the threads of the same block.
Registers: 32768 32-bit registers per multiprocessor.
[Figure 4-2 (CUDA Programming Guide, Hardware Implementation): Hardware model; a set of SIMT multiprocessors with on-chip shared memory.]
Multiple Devices
The use of multiple GPUs as CUDA devices by an application running on a multi-GPU system is only guaranteed to work if these GPUs are of the same type. When the system is in SLI mode, all GPUs are accessible via the CUDA driver and runtime as separate devices, but there are special considerations. First, an allocation in one CUDA device on one GPU will consume memory on other GPUs. Because of this, allocations may fail earlier than otherwise expected.
[Diagram: a device contains multiprocessors 1 through N; each multiprocessor has an instruction unit, shared memory, processors 1 through M with their registers, a constant cache and a texture cache, all backed by device memory.]
Parallel Computing: Perspectives for more e�cient hydrological modeling
General Concepts GPU Programming CA Parallel implementation
CA Parallel implementation
A parallel version of the Cellular Automata algorithm for variably saturated flow in soils was developed in the CUDA API.
The infiltration experiment of Vauclin et al. (1979) was chosen as a benchmark test for the accuracy and the speed of the algorithm.
[Figure: simulated water-depth profiles (m) versus distance (0-3 m) at t = 2, 3, 4 and 8 hrs, compared with the experimental data.]
Parallel Computing: Perspectives for more e�cient hydrological modeling
General Concepts GPU Programming CA Parallel implementation
Why is parallel code important?
In real-case scenarios, where the 3-D simulation of large areas is needed, the grid sizes are excessively large.
In natural hazards assessment the simulations should be fast in order to be useful (the prediction should come before the actual event!).
Fast simulations allow us to calibrate the model parameters more easily and to investigate the physical phenomena more efficiently.
The inherent natural parallelism of the CA concept makes the parallel implementation of the algorithm easier.
Technical details
Difficulties
The most challenging issue was the irregular geometry of the domain, which made it more difficult to exploit locality in the thread computations and to use the shared memory.
The cell values were stored in a 1D array and, for each cell, the indexes of its neighboring cells were also stored.
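One common way to realize such a layout (a sketch consistent with the description above, not the talk's actual code; the struct and field names are hypothetical) is a flat value array plus a fixed-width neighbor-index table, with -1 marking missing neighbors on the irregular boundary:

```cpp
#include <array>
#include <vector>

// Flat storage for an irregular CA grid: all cell values in one 1D array,
// plus, for each cell, the indexes of its (up to 4) neighbors; -1 = no neighbor.
struct CAGrid {
    std::vector<double> value;
    std::vector<std::array<int, 4>> neighbor;

    // Average over a cell's existing neighbors, skipping -1 entries;
    // the kind of local stencil a CA update rule evaluates per cell.
    double neighbor_mean(int cell) const {
        double sum = 0.0;
        int n = 0;
        for (int idx : neighbor[cell])
            if (idx >= 0) { sum += value[idx]; ++n; }
        return n ? sum / n : 0.0;
    }
};
```

Because each cell carries its own neighbor list, one GPU thread per cell can update an arbitrarily shaped domain without any assumption of a regular grid.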
Code structure
Simulation constants are stored in the constant memory.
Soil properties for each soil class are stored in the texture memory.
Atomic operations are used in order to check for convergence at every iteration.
The shared memory is used to accelerate the atomic operations and the block's memory accesses.
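The convergence check can be pictured with a single shared flag: every worker compares its local change against a tolerance and atomically clears the flag if its cells have not yet converged. Below is a host-side sketch of that pattern, with std::atomic standing in for CUDA's atomic operations (an illustration of the idea, not the talk's kernel):

```cpp
#include <algorithm>
#include <atomic>
#include <cmath>
#include <thread>
#include <vector>

// Emulated convergence test: workers atomically clear `converged` if any
// of their cells still changed by more than `tol` during this iteration.
bool all_converged(const std::vector<double>& delta, double tol,
                   unsigned nthreads = 4) {
    std::atomic<bool> converged{true};
    std::vector<std::thread> workers;
    std::size_t chunk = (delta.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            std::size_t begin = t * chunk;
            std::size_t end = std::min(begin + chunk, delta.size());
            for (std::size_t i = begin; i < end; ++i)
                if (std::fabs(delta[i]) > tol)
                    converged.store(false);  // atomic: safe from all threads
        });
    }
    for (auto& w : workers) w.join();
    return converged.load();
}
```

Only writes of the same value ever race, so the flag needs no lock; in CUDA the same trick uses a global flag updated with atomic operations, staged through shared memory to reduce global traffic.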
Results of the numerical tests
Nvidia Quadro 2000:
192 CUDA cores.
1 GB of GDDR5 RAM.
[Charts: computation speed (cells/sec, log scale) for CPU and GPU versus number of cells (10^3 to 10^7), and the resulting speed-up (0 to 90x) versus number of cells.]
Thanks for your attention!