Parallel Programming Methodology
Parallel and Distributed Computing
Department of Computer Science and Engineering (DEI), Instituto Superior Técnico
October 4, 2012
CPD (DEI / IST) Parallel and Distributed Computing – 6 2012-10-04 1 / 26
Outline
Parallel programming
Dependency graphs
Overheads
influence on programming of shared- vs distributed-memory systems
Foster’s design methodology
Parallel Programming
Steps:
Identify work that can be done in parallel
Partition work and perhaps data among tasks
Manage data access, communication and synchronization
Dependency Graphs
Programs can be modeled as directed graphs:
Nodes: at the finest granularity level, nodes are instructions
⇒ to reduce complexity, nodes may be an arbitrary sequence of statements
Edges: data dependency constraints among instructions in the nodes
Data Dependency Graphs
Dependency Graphs
read(A, B);
x = initX(A, B);
y = initY(A, B);
z = initZ(A, B);

for (i = 0; i < N_ENTRIES; i++)
    x[i] = compX(y[i], z[i]);

for (i = 1; i < N_ENTRIES; i++) {
    x[i] = solveX(x[i-1]);
    z[i] = x[i] + y[i];
}

finalize1(&x, &y, &z);
finalize2(&x, &y, &z);
finalize3(&x, &y, &z);
[Diagram: data dependency graph of the code above]
Types of Parallelism
[Diagrams: Data Parallelism, Functional Parallelism, Pipeline Parallelism]
Overheads
Task creation/finish
Data transfer
Communication (synchronization)
Load balancing
Shared vs Distributed Memory Systems
Overheads very different depending on type of architecture!
              Start/Finish   Data   Load   Comm
Shared             H           H      =      N
Distributed        N           N      =      H
Shared vs Distributed Memory Systems
Tasks
SM: more dynamic creation of tasks, hence these can be more fine-grained.
DM: typically all tasks are active until the end, hence requiring more coarse-grained tasks.

Data
SM: data partition is not an issue when defining tasks; however, use caution when accessing shared data: avoid races using mutual-exclusive regions.
DM: data partition is critical for the performance of the application.

In both SM and DM:
minimize synchronization points
be careful about load balancing
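The caution about races on shared data can be made concrete with a minimal sketch, assuming POSIX threads (names such as `run_counters` and `counter_worker` are illustrative, not part of any standard API). Without the lock/unlock pair, concurrent read-modify-write sequences on the counter could interleave and lose updates:

```c
#include <pthread.h>

// Shared counter protected by a mutex: the increment is a critical section
// (a mutual-exclusive region in the slide's terminology).
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *counter_worker(void *arg) {
    long iters = *(long *)arg;
    for (long i = 0; i < iters; i++) {
        pthread_mutex_lock(&lock);    /* enter the mutual-exclusive region */
        counter++;
        pthread_mutex_unlock(&lock);  /* leave it */
    }
    return NULL;
}

/* Spawn nthreads (up to 64 in this sketch), each incrementing `iters`
   times, and return the final count. */
long run_counters(int nthreads, long iters) {
    pthread_t tid[64];
    counter = 0;
    for (int t = 0; t < nthreads; t++)
        pthread_create(&tid[t], NULL, counter_worker, &iters);
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    return counter;
}
```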
Shared Memory Systems
Typical diagram of a parallel application under shared memory:
[Diagram: a master thread repeatedly forks a team of other threads and later joins them; time flows downward]
Fork / Join Parallelism
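The fork/join pattern in the diagram can be sketched directly with POSIX threads (an assumed implementation choice; helper names like `fork_join_fill` are made up for this example):

```c
#include <pthread.h>

/* Each forked thread fills its own slice of the output array. */
typedef struct { double *out; int lo, hi; } slice_t;

static void *slice_worker(void *arg) {
    slice_t *s = (slice_t *)arg;
    for (int i = s->lo; i < s->hi; i++)
        s->out[i] = 2.0 * i;          /* stand-in for real per-element work */
    return NULL;
}

/* Master thread forks `nthreads` workers (up to 32 here), then joins them
   all: sequential execution resumes only after every worker is done. */
void fork_join_fill(double *out, int n, int nthreads) {
    pthread_t tid[32];
    slice_t s[32];
    for (int t = 0; t < nthreads; t++) {          /* fork */
        s[t].out = out;
        s[t].lo = t * n / nthreads;
        s[t].hi = (t + 1) * n / nthreads;
        pthread_create(&tid[t], NULL, slice_worker, &s[t]);
    }
    for (int t = 0; t < nthreads; t++)            /* join */
        pthread_join(tid[t], NULL);
}
```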
Shared Memory Systems
Application is typically a single program, with directives to handle parallelism:
fork / join
parallel loops
private vs shared variables
critical sections
Distributed Memory Systems
Cannot use fine granularity!
Each processor gets assigned a (large) task:
static scheduling: all tasks start at the beginning of computation
dynamic scheduling: tasks start as needed
Application is typically also a single program!
⇒ the identification number of each task indicates what its job is.
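The single-program idea, where the task's identification number selects its job, can be sketched as follows (here the id and the number of processes are plain parameters; in a real distributed run they would come from the runtime, and the partial results would be combined by messages):

```c
/* SPMD sketch: every process runs this same function; only `id` differs.
   Process `id` of `nprocs` sums its own block of the range 0..n-1. */
long my_partial_sum(int id, int nprocs, long n) {
    long lo = id * n / nprocs;        /* first index owned by this task   */
    long hi = (id + 1) * n / nprocs;  /* one past the last index it owns  */
    long sum = 0;
    for (long i = lo; i < hi; i++)
        sum += i;
    return sum;
}
```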
Task / Channel Model
Parallel programming for distributed memory systems uses:
Task / Channel Model
Parallel computation is represented as a set of tasks that may interact with each other by sending messages through channels.
Task: program + local memory + I/O ports
Channel: message queue that connects one task’s output port with another task’s input port
All tasks start simultaneously, and the finishing time is determined by the time the last task stops its execution.
Messages in the Task / Channel Model
ordering of data in the channel is maintained
receiving task blocks until a value is available at the receiver
sender never blocks, even if previous messages have not yet been delivered
In the task / channel model
receiving is a synchronous operation
sending is an asynchronous operation
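These semantics can be sketched as a small channel built on POSIX threads (a sketch under that assumption, not a definitive implementation): an unbounded FIFO queue where receiving blocks while the queue is empty and sending always returns immediately.

```c
#include <pthread.h>
#include <stdlib.h>

/* A task/channel "channel": an ordered message queue. */
typedef struct node { int value; struct node *next; } node_t;

typedef struct {
    node_t *head, *tail;
    pthread_mutex_t mtx;
    pthread_cond_t nonempty;
} channel_t;

void chan_init(channel_t *c) {
    c->head = c->tail = NULL;
    pthread_mutex_init(&c->mtx, NULL);
    pthread_cond_init(&c->nonempty, NULL);
}

/* Asynchronous send: enqueue and return at once; never blocks. */
void chan_send(channel_t *c, int value) {
    node_t *n = malloc(sizeof *n);
    n->value = value;
    n->next = NULL;
    pthread_mutex_lock(&c->mtx);
    if (c->tail) c->tail->next = n; else c->head = n;
    c->tail = n;
    pthread_cond_signal(&c->nonempty);
    pthread_mutex_unlock(&c->mtx);
}

/* Synchronous receive: block until a value is available; FIFO order
   guarantees that the ordering of data in the channel is maintained. */
int chan_receive(channel_t *c) {
    pthread_mutex_lock(&c->mtx);
    while (c->head == NULL)
        pthread_cond_wait(&c->nonempty, &c->mtx);
    node_t *n = c->head;
    c->head = n->next;
    if (c->head == NULL) c->tail = NULL;
    pthread_mutex_unlock(&c->mtx);
    int v = n->value;
    free(n);
    return v;
}
```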
Foster’s Design Methodology
Development of scalable parallel algorithms by delaying machine-dependent decisions to later stages.
Four steps:
partitioning
communication
agglomeration
mapping
Foster’s Design Methodology
[Diagram: a Problem is divided by Partitioning into Primitive Tasks; Communication channels between them are identified; Agglomeration groups tasks into larger ones; Mapping assigns them to processors]
Foster’s Design Methodology: Partitioning
Partitioning
Process of dividing the computation and data into many small primitive tasks.
Strategies: (no single universal recipe...)
data decomposition
functional decomposition
recursive decomposition
Checklist:
more than 10 × P primitive tasks for P processors
minimize redundant computations and redundant data storage
primitive tasks are roughly the same size
number of tasks grows naturally with the problem size
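One common way to satisfy the "roughly the same size" item is a balanced block decomposition; the index formulas below are a standard textbook convention (assumed here), giving each of p tasks either ⌊n/p⌋ or ⌈n/p⌉ consecutive items:

```c
/* Balanced block decomposition of n items among p tasks: task i owns
   indices [block_low, block_high]; sizes differ by at most one item. */
long block_low(long i, long p, long n)  { return (i * n) / p; }
long block_high(long i, long p, long n) { return ((i + 1) * n) / p - 1; }
long block_size(long i, long p, long n) {
    return block_high(i, p, n) - block_low(i, p, n) + 1;
}
```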
Recursive Decomposition
Suitable for problems solvable using divide-and-conquer
Steps:
decompose a problem into a set of sub-problems
recursively decompose each sub-problem
stop decomposition when minimum desired granularity reached
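The steps above can be sketched for a simple reduction (`GRAIN` and `rec_sum` are illustrative names; in a parallel version the two recursive calls would become two tasks):

```c
/* Divide-and-conquer sum: split the range until the minimum desired
   granularity (GRAIN) is reached, then solve sub-problems directly. */
#define GRAIN 4

long rec_sum(const long *a, long lo, long hi) {   /* sums a[lo..hi) */
    if (hi - lo <= GRAIN) {                       /* small enough: stop */
        long s = 0;
        for (long i = lo; i < hi; i++)
            s += a[i];
        return s;
    }
    long mid = lo + (hi - lo) / 2;                /* decompose into halves */
    return rec_sum(a, lo, mid) + rec_sum(a, mid, hi);
}
```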
Data Decomposition
Appropriate data partitioning is critical to parallel performance
Steps:
identify the data on which computations are performed
partition the data across various tasks
Decomposition can be based on
input data
output data
input + output data
intermediate data
Input Data Decomposition
Applicable if each output is computed as a function of the input
May be the only natural decomposition if output is unknown
finding the minimum in a set, or other reductions
sorting a vector
Associate a task with each input data partition
each task performs computation on its part of the data
subsequent processing combines partial results from earlier tasks
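A minimal sketch of input-data decomposition for the minimum-reduction example (function names are illustrative; in a distributed setting the combine stage would gather partial results by messages):

```c
#include <limits.h>

/* One task's local work: minimum of its own input partition a[lo..hi). */
int partial_min(const int *a, long lo, long hi) {
    int m = INT_MAX;
    for (long i = lo; i < hi; i++)
        if (a[i] < m) m = a[i];
    return m;
}

/* Partition the input among ntasks, then combine the partial minima. */
int min_reduce(const int *a, long n, int ntasks) {
    int best = INT_MAX;
    for (int t = 0; t < ntasks; t++) {
        int m = partial_min(a, t * n / ntasks, (t + 1) * n / ntasks);
        if (m < best) best = m;
    }
    return best;
}
```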
Output Data Decomposition
Applicable if each element of the output can be computed independently
algorithm is based on one-to-one or many-to-one functions
Partition the output data across tasks
Have each task perform the computation for its outputs
Example:
Matrix-vector multiplication
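A sketch of output-data decomposition for this example: each entry of y = A·x is an independent dot product, so the output vector is partitioned across tasks and each task computes only its own rows (row-major storage of A is assumed):

```c
/* One task's share of y = A x: compute output rows lo..hi-1 of the
   n-by-n matrix A. Other tasks handle the remaining rows independently. */
void matvec_rows(const double *A, const double *x, double *y,
                 int n, int lo, int hi) {
    for (int i = lo; i < hi; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += A[i * n + j] * x[j];   /* dot product for y[i] */
        y[i] = s;
    }
}
```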
Foster’s Design Methodology: Communication
Communication
Identification of the communication pattern among primitive tasks.
local communication: values shared by a small number of tasks
draw a channel from the producing task to the consumer tasks
global communication: values are required by a significant number of tasks
while important, not useful to represent in the task/channel model
Checklist:
communication balanced among tasks
each task communicates with a small number of tasks
tasks can perform their communication concurrently
tasks can perform their computations concurrently
Foster’s Design Methodology: Agglomeration
Agglomeration
Process of grouping primitive tasks into larger tasks.
Strategies:
group tasks that have high communication with each other
group sender tasks and group receiving tasks
group tasks to allow re-use of sequential code
Checklist:
locality has been maximized
replicated computations take less time than the communications they replace
amount of replicated data is small enough to allow the algorithm to scale
tasks are balanced in terms of computation and communication
number of tasks grows naturally with problem size
number of tasks is small, but at least as great as P
cost of modifications to sequential code is minimized
Foster’s Design Methodology: Mapping
Mapping
Process of assigning tasks to processors.
Strategies:
maximize processor utilization (average % of time processors are active)
⇒ even load distribution
minimize interprocessor communication
⇒ map tasks with channels between them to the same processor
⇒ take into account network topology
Review
Parallel programming
Dependency graphs
Overheads
influence on programming of shared- vs distributed-memory systems
Foster’s design methodology
Next Class
OpenMP