Parallel Programming Methodology
Parallel and Distributed Computing
Department of Computer Science and Engineering (DEI), Instituto Superior Técnico
October 4, 2012
CPD (DEI / IST) Parallel and Distributed Computing – 6 2012-10-04 1 / 26
Outline
Parallel programming
Dependency graphs
Overheads
influence on programming of shared- vs distributed-memory systems
Foster’s design methodology
Parallel Programming
Steps:
Identify work that can be done in parallel
Partition work and perhaps data among tasks
Manage data access, communication and synchronization
Dependency Graphs
Programs can be modeled as directed graphs:
Nodes: at the finest granularity level, nodes are instructions
⇒ to reduce complexity, nodes may be an arbitrary sequence of statements
Edges: data dependency constraints among instructions in the nodes
Data Dependency Graphs
Dependency Graphs
read(A, B);
x = initX(A, B);
y = initY(A, B);
z = initZ(A, B);

for (i = 0; i < N_ENTRIES; i++)
    x[i] = compX(y[i], z[i]);

for (i = 1; i < N_ENTRIES; i++) {
    x[i] = solveX(x[i-1]);
    z[i] = x[i] + y[i];
}

finalize1(&x, &y, &z);
finalize2(&x, &y, &z);
finalize3(&x, &y, &z);
[Diagram: data dependency graph of the code above]
Types of Parallelism
[Diagrams: Data Parallelism, Functional Parallelism, Pipeline Parallelism]
Overheads
Task creation/finish
Data transfer
Communication (synchronization)
Load balancing
Shared vs Distributed Memory Systems
Overheads very different depending on type of architecture!
              Start/Finish   Data   Load   Comm
Shared             H           H      =      N
Distributed        N           N      =      H
Shared vs Distributed Memory Systems
Tasks
SM: more dynamic creation of tasks, hence these can be more fine-grained.
DM: typically all tasks are active until the end, hence requiring more coarse-grained tasks.

Data
SM: data partition is not an issue when defining tasks; however, use caution when accessing shared data: avoid races using mutual-exclusive regions.
DM: data partition is critical for the performance of the application.

In both SM and DM:
minimize synchronization points
be careful about load balancing
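The caution about races on shared data can be made concrete with a minimal sketch, assuming POSIX threads (names such as `run_counters` and `counter_worker` are illustrative, not part of any standard API). Without the lock/unlock pair, concurrent read-modify-write sequences on the counter could interleave and lose updates:

```c
#include <pthread.h>

// Shared counter protected by a mutex: the increment is a critical section
// (a mutual-exclusive region in the slide's terminology).
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *counter_worker(void *arg) {
    long iters = *(long *)arg;
    for (long i = 0; i < iters; i++) {
        pthread_mutex_lock(&lock);    /* enter the mutual-exclusive region */
        counter++;
        pthread_mutex_unlock(&lock);  /* leave it */
    }
    return NULL;
}

/* Spawn nthreads (up to 64 in this sketch), each incrementing `iters`
   times, and return the final count. */
long run_counters(int nthreads, long iters) {
    pthread_t tid[64];
    counter = 0;
    for (int t = 0; t < nthreads; t++)
        pthread_create(&tid[t], NULL, counter_worker, &iters);
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    return counter;
}
```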
Shared Memory Systems
Typical diagram of a parallel application under shared memory:
[Diagram: a master thread repeatedly forks a team of other threads and later joins them; time flows downward]
Fork / Join Parallelism
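The fork/join pattern in the diagram can be sketched directly with POSIX threads (an assumed implementation choice; helper names like `fork_join_fill` are made up for this example):

```c
#include <pthread.h>

/* Each forked thread fills its own slice of the output array. */
typedef struct { double *out; int lo, hi; } slice_t;

static void *slice_worker(void *arg) {
    slice_t *s = (slice_t *)arg;
    for (int i = s->lo; i < s->hi; i++)
        s->out[i] = 2.0 * i;          /* stand-in for real per-element work */
    return NULL;
}

/* Master thread forks `nthreads` workers (up to 32 here), then joins them
   all: sequential execution resumes only after every worker is done. */
void fork_join_fill(double *out, int n, int nthreads) {
    pthread_t tid[32];
    slice_t s[32];
    for (int t = 0; t < nthreads; t++) {          /* fork */
        s[t].out = out;
        s[t].lo = t * n / nthreads;
        s[t].hi = (t + 1) * n / nthreads;
        pthread_create(&tid[t], NULL, slice_worker, &s[t]);
    }
    for (int t = 0; t < nthreads; t++)            /* join */
        pthread_join(tid[t], NULL);
}
```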
Shared Memory Systems
Application is typically a single program, with directives to handle parallelism:
fork / join
parallel loops
private vs shared variables
critical sections
Distributed Memory Systems
Cannot use fine granularity!
Each processor gets assigned a (large) task:
static scheduling: all tasks start at the beginning of computation
dynamic scheduling: tasks start as needed
Application is typically also a single program!
⇒ the identification number of each task indicates what its job is.
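The single-program idea, where the task's identification number selects its job, can be sketched as follows (here the id and the number of processes are plain parameters; in a real distributed run they would come from the runtime, and the partial results would be combined by messages):

```c
/* SPMD sketch: every process runs this same function; only `id` differs.
   Process `id` of `nprocs` sums its own block of the range 0..n-1. */
long my_partial_sum(int id, int nprocs, long n) {
    long lo = id * n / nprocs;        /* first index owned by this task   */
    long hi = (id + 1) * n / nprocs;  /* one past the last index it owns  */
    long sum = 0;
    for (long i = lo; i < hi; i++)
        sum += i;
    return sum;
}
```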
Task / Channel Model
Parallel programming for distributed memory systems uses:
Task / Channel Model
Parallel computation is represented as a set of tasks that may interact with each other by sending messages through channels.
Task: program + local memory + I/O ports
Channel: message queue that connects one task’s output port with another task’s input port
All tasks start simultaneously, and the finishing time is determined by the time the last task stops its execution.
Messages in the Task / Channel Model
ordering of data in the channel is maintained
receiving task blocks until a value is available at the receiver
sender never blocks, even if previous messages have not yet been delivered
In the task / channel model
receiving is a synchronous operation
sending is an asynchronous operation
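These semantics can be sketched as a small channel built on POSIX threads (a sketch under that assumption, not a definitive implementation): an unbounded FIFO queue where receiving blocks while the queue is empty and sending always returns immediately.

```c
#include <pthread.h>
#include <stdlib.h>

/* A task/channel "channel": an ordered message queue. */
typedef struct node { int value; struct node *next; } node_t;

typedef struct {
    node_t *head, *tail;
    pthread_mutex_t mtx;
    pthread_cond_t nonempty;
} channel_t;

void chan_init(channel_t *c) {
    c->head = c->tail = NULL;
    pthread_mutex_init(&c->mtx, NULL);
    pthread_cond_init(&c->nonempty, NULL);
}

/* Asynchronous send: enqueue and return at once; never blocks. */
void chan_send(channel_t *c, int value) {
    node_t *n = malloc(sizeof *n);
    n->value = value;
    n->next = NULL;
    pthread_mutex_lock(&c->mtx);
    if (c->tail) c->tail->next = n; else c->head = n;
    c->tail = n;
    pthread_cond_signal(&c->nonempty);
    pthread_mutex_unlock(&c->mtx);
}

/* Synchronous receive: block until a value is available; FIFO order
   guarantees that the ordering of data in the channel is maintained. */
int chan_receive(channel_t *c) {
    pthread_mutex_lock(&c->mtx);
    while (c->head == NULL)
        pthread_cond_wait(&c->nonempty, &c->mtx);
    node_t *n = c->head;
    c->head = n->next;
    if (c->head == NULL) c->tail = NULL;
    pthread_mutex_unlock(&c->mtx);
    int v = n->value;
    free(n);
    return v;
}
```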
Foster’s Design Methodology
Development of scalable parallel algorithms by delaying machine-dependent decisions to later stages.
Four steps:
partitioning
communication
agglomeration
mapping
Foster’s Design Methodology
[Diagram: a Problem is divided by Partitioning into Primitive Tasks; Communication channels between them are identified; Agglomeration groups tasks into larger ones; Mapping assigns them to processors]
Foster’s Design Methodology: Partitioning
Partitioning
Process of dividing the computation and data into many small primitive tasks.
Strategies: (no single universal recipe...)
data decomposition
functional decomposition
recursive decomposition
Checklist:
more than 10 × P primitive tasks for P processors
minimize redundant computations and redundant data storage
primitive tasks are roughly the same size
number of tasks grows naturally with the problem size
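One common way to satisfy the "roughly the same size" item is a balanced block decomposition; the index formulas below are a standard textbook convention (assumed here), giving each of p tasks either ⌊n/p⌋ or ⌈n/p⌉ consecutive items:

```c
/* Balanced block decomposition of n items among p tasks: task i owns
   indices [block_low, block_high]; sizes differ by at most one item. */
long block_low(long i, long p, long n)  { return (i * n) / p; }
long block_high(long i, long p, long n) { return ((i + 1) * n) / p - 1; }
long block_size(long i, long p, long n) {
    return block_high(i, p, n) - block_low(i, p, n) + 1;
}
```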
Recursive Decomposition
Suitable for problems solvable using divide-and-conquer
Steps:
decompose a problem into a set of sub-problems
recursively decompose each sub-problem
stop decomposition when minimum desired granularity reached
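The steps above can be sketched for a simple reduction (`GRAIN` and `rec_sum` are illustrative names; in a parallel version the two recursive calls would become two tasks):

```c
/* Divide-and-conquer sum: split the range until the minimum desired
   granularity (GRAIN) is reached, then solve sub-problems directly. */
#define GRAIN 4

long rec_sum(const long *a, long lo, long hi) {   /* sums a[lo..hi) */
    if (hi - lo <= GRAIN) {                       /* small enough: stop */
        long s = 0;
        for (long i = lo; i < hi; i++)
            s += a[i];
        return s;
    }
    long mid = lo + (hi - lo) / 2;                /* decompose into halves */
    return rec_sum(a, lo, mid) + rec_sum(a, mid, hi);
}
```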
Data Decomposition
Appropriate data partitioning is critical to parallel performance
Steps:
identify the data on which computations are performed
partition the data across various tasks
Decomposition can be based on
input data
output data
input + output data
intermediate data
Input Data Decomposition
Applicable if each output is computed as a function of the input
May be the only natural decomposition if output is unknown
finding the minimum in a set, or other reductions
sorting a vector
Associate a task with each input data partition
each task performs computation on its part of the data
subsequent processing combines partial results from earlier tasks
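A minimal sketch of input-data decomposition for the minimum-reduction example (function names are illustrative; in a distributed setting the combine stage would gather partial results by messages):

```c
#include <limits.h>

/* One task's local work: minimum of its own input partition a[lo..hi). */
int partial_min(const int *a, long lo, long hi) {
    int m = INT_MAX;
    for (long i = lo; i < hi; i++)
        if (a[i] < m) m = a[i];
    return m;
}

/* Partition the input among ntasks, then combine the partial minima. */
int min_reduce(const int *a, long n, int ntasks) {
    int best = INT_MAX;
    for (int t = 0; t < ntasks; t++) {
        int m = partial_min(a, t * n / ntasks, (t + 1) * n / ntasks);
        if (m < best) best = m;
    }
    return best;
}
```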
Output Data Decomposition
Applicable if each element of the output can be computed independently
algorithm is based on one-to-one or many-to-one functions
Partition the output data across tasks
Have each task perform the computation for its outputs
Example:
Matrix-vector multiplication
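A sketch of output-data decomposition for this example: each entry of y = A·x is an independent dot product, so the output vector is partitioned across tasks and each task computes only its own rows (row-major storage of A is assumed):

```c
/* One task's share of y = A x: compute output rows lo..hi-1 of the
   n-by-n matrix A. Other tasks handle the remaining rows independently. */
void matvec_rows(const double *A, const double *x, double *y,
                 int n, int lo, int hi) {
    for (int i = lo; i < hi; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += A[i * n + j] * x[j];   /* dot product for y[i] */
        y[i] = s;
    }
}
```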
Foster’s Design Methodology: Communication
Communication
Identification of the communication pattern among primitive tasks.
local communication: values shared by a small number of tasks
draw a channel from the producing task to the consumer tasks
global communication: values are required by a significant number of tasks
while important, not useful to represent in the task/channel model
Checklist:
communication balanced among tasks
each task communicates with a small number of tasks
tasks can perform their communication concurrently
tasks can perform their computations concurrently
Foster’s Design Methodology: Agglomeration
Agglomeration
Process of grouping primitive tasks into larger tasks.
Strategies:
group tasks that have high communication with each other
group sender tasks and group receiving tasks
group tasks to allow re-use of sequential code
Checklist:
locality has been maximized
replicated computations take less time than the communications they replace
amount of replicated data is small enough to allow the algorithm to scale
tasks are balanced in terms of computation and communication
number of tasks grows naturally with problem size
number of tasks is small, but at least as great as P
cost of modifications to sequential code is minimized
Foster’s Design Methodology: Mapping
Mapping
Process of assigning tasks to processors.
Strategies:
maximize processor utilization (average % of time processors are active)
⇒ even load distribution
minimize interprocessor communication
⇒ map tasks with channels between them to the same processor
⇒ take into account network topology
Review
Parallel programming
Dependency graphs
Overheads
influence on programming of shared- vs distributed-memory systems
Foster’s design methodology
Next Class
OpenMP