Para Algo Design - C-DAC

Parallel Algorithm Design: Decomposition Techniques and Load Balancing Techniques
PCOPP-2002, Day 4 Classroom Lecture
June 03-06, 2002, C-DAC, Pune
Copyright C-DAC, 2002

TRANSCRIPT


Page 2: Para Algo Design - C-DAC


Parallel Algorithmic Design

Lecture Outline: the following topics will be discussed

• Data parallelism; Task parallelism; Combination of Data and Task parallelism
• Decomposition Techniques
• Static and Dynamic Load Balancing
• Mapping for load balancing
• Minimizing Interaction
• Overheads in parallel algorithm design
• Data Sharing Overheads

Page 3: Para Algo Design - C-DAC


Layers of Abstraction in Parallel Computers

[Figure: layered stack, from top to bottom]
• Parallel applications: CAD, Database, Scientific modeling, Multiprogramming
• Programming models: Shared address, Message passing, Data parallel
  (user/system boundary)
• Communication abstraction: compilation or library; operating systems support
  (hardware/software boundary)
• Communication hardware
• Physical communication medium

• Critical layers of abstraction lie between the application program and the actual hardware.

Page 4: Para Algo Design - C-DAC


Parallel Algorithms and Design

Questions to be answered

• How do we partition the data?
• Which data is going to be partitioned?
• How many types of concurrency are there?
• What are the key principles of designing parallel algorithms?
• What are the overheads in the algorithm design?
• How is the mapping done to balance the load effectively?

Page 5: Para Algo Design - C-DAC


Parallel Algorithmic Design

Principles of the design of parallel algorithms: two key steps

• Discuss methods for mapping the tasks to processors so that the processors are efficiently utilized.
• Different decompositions and mappings may yield good performance on different computers for a given problem.

It is therefore crucial for programmers to understand the relationship between the underlying machine model and the parallel program in order to develop efficient programs.

(Contd…)

Page 6: Para Algo Design - C-DAC


Types of Parallelism

• Data parallelism
• Task parallelism
• Combination of Data and Task parallelism
• Stream parallelism

Page 7: Para Algo Design - C-DAC


Types of Parallelism: Data Parallelism (Contd…)

Data parallelism

• Identical operations applied concurrently to different data items constitute data parallelism.
• It applies the SAME OPERATION in parallel to different elements of a data set.
• It uses a simpler model and reduces the programmer's work.

Examples

• The problem of adding n x n matrices.
• Structured grid computations in CFD.
• Genetic algorithms.
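As a minimal illustration of data parallelism (a hypothetical sketch, not from the original lecture), the matrix-addition example can be written so that each pair of rows is an independent data item and the same elementwise operation is applied to all of them concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

def add_rows(row_a, row_b):
    # The SAME operation (elementwise +) is applied to every data item.
    return [x + y for x, y in zip(row_a, row_b)]

def parallel_matrix_add(a, b, workers=4):
    # Each row pair is independent, so all pairs can be processed concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(add_rows, a, b))

result = parallel_matrix_add([[1, 2], [3, 4]], [[10, 20], [30, 40]])
# result == [[11, 22], [33, 44]]
```

The degree of parallelism here grows with the matrix size, matching the observation that data parallelism scales with the problem.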

Page 8: Para Algo Design - C-DAC


Types of Parallelism: Data Parallelism (Contd…)

Data parallelism

• For most application problems, the degree of data parallelism increases with the size of the problem.
• More processors can be used to solve larger problems.
• f90 and HPF are data parallel languages.

Programmer's responsibility

• Specifying the distribution of data structures.

Compiler's responsibility

• Generating the necessary code for communication and synchronization.

Page 9: Para Algo Design - C-DAC


Types of Parallelism: Task Parallelism (Contd…)

Task parallelism

• The concurrent execution of many distinct tasks is called task parallelism.
• This can be visualized as a task graph. In this graph, each node represents a task to be executed, and edges represent the dependencies between the tasks.
• A task in the task graph can be executed as soon as all of its preceding tasks have been completed.
• The programmer defines the different types of processes. These processes communicate and synchronize with each other through MPI or other mechanisms.
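A task graph of this kind can be executed by repeatedly launching every task whose predecessors have all finished. The sketch below is a hypothetical illustration using Python threads; the graph `deps` and the trivial `run_task` body are invented for the example:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical task graph: task name -> list of predecessor tasks.
deps = {"T0": [], "T1": ["T0"], "T2": ["T0"], "T3": ["T1", "T2"]}

def run_task(name):
    # Stand-in for the real work of a task.
    return "done:" + name

def execute_task_graph(deps):
    # A task is ready once all of its predecessors have completed;
    # all ready tasks in a wave run concurrently.
    finished, results = set(), {}
    with ThreadPoolExecutor() as pool:
        while len(finished) < len(deps):
            ready = [t for t in deps
                     if t not in finished and all(p in finished for p in deps[t])]
            for t, r in zip(ready, pool.map(run_task, ready)):
                results[t] = r
                finished.add(t)
    return results

results = execute_task_graph(deps)
```

Here T1 and T2 execute concurrently once T0 completes, exactly as the edges dictate.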

Page 10: Para Algo Design - C-DAC


Types of Parallelism: Task Parallelism

[Figure: a task graph with root T0, children T1 and T2, their children T3-T6, and leaves T7-T10. T0 and T1 are themselves expanded into sub-task graphs (T00-T08 for T0; T10-T12 for T1).]

Page 11: Para Algo Design - C-DAC


Types of Parallelism: Task Parallelism (Contd…)

Task parallelism

• Example: a vehicle relational database processing the following query

  (MODEL = "-------" AND YEAR = "-------") AND (COLOR = "Green" OR COLOR = "Black")

Programmer's responsibility

The programmer must deal explicitly with process creation, communication and synchronization.
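The query above decomposes naturally into tasks: the two color selections and the model/year selection are independent leaf tasks, and the combining step depends on their results. A minimal sketch (the table, and the model/year values standing in for the elided fields, are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical vehicle table: (id, model, year, color).
vehicles = [(1, "Civic", 2001, "Green"), (2, "Civic", 2001, "Red"),
            (3, "Civic", 2000, "Black"), (4, "Accord", 2001, "Black")]

def select(pred):
    # One selection task: scan the table and keep matching row ids.
    return {v[0] for v in vehicles if pred(v)}

def query(model, year):
    with ThreadPoolExecutor() as pool:
        # Independent selection tasks run concurrently (leaves of the task graph).
        green = pool.submit(select, lambda v: v[3] == "Green")
        black = pool.submit(select, lambda v: v[3] == "Black")
        my = pool.submit(select, lambda v: v[1] == model and v[2] == year)
        # The combining steps depend on the leaves (edges of the task graph).
        return my.result() & (green.result() | black.result())
```

Each intermediate table in the figure on the next slide corresponds to one such task.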

Page 12: Para Algo Design - C-DAC


Types of Parallelism: Task Parallelism (Contd…)

[Figure: the different tables generated by processing the query, and their dependencies.]

Page 13: Para Algo Design - C-DAC


Types of Parallelism: Task Parallelism (Contd…)

• Parallel Unstructured Adaptive Mesh / Mesh Repartitioning methods.
• HPF / automatic compiler techniques may not yield good performance for unstructured mesh computations.
• Choosing the right algorithm together with message passing is a good approach for partitioning (decomposing) an unstructured mesh (graph) onto processors. Task parallelism is well suited to obtaining concurrency here.

[Figure: finite element unstructured mesh for the Lake Superior region]

Page 14: Para Algo Design - C-DAC


Types of Parallelism: Data and Task Parallelism

Remarks: Data and Task Parallelism

• Unfortunately, the data parallel model's simplicity makes it less suitable for applications that do not fit the model.
• In particular, applications that use irregular data structures often do not match the model.
• These applications impose difficulties on both the language designer and the compiler writer.
• Because task and data parallelism each have strengths and weaknesses, integrating both in one model is attractive.

Page 15: Para Algo Design - C-DAC


Types of Parallelism: Data and Task Parallelism (Contd…)

Integration of Task and Data Parallelism

• Two approaches
  • Add task parallel constructs to data parallel constructs.
  • Add data parallel constructs to task parallel constructs.
• Approaches to integration
  • Language based approaches.
  • Library based approaches.

Page 16: Para Algo Design - C-DAC


Types of Parallelism: Data and Task Parallelism (Contd…)

Example

• Multidisciplinary optimization application for aircraft design.
• Need for supporting task parallel constructs and communication between data parallel modules.
• An optimizer initiates and monitors the application's execution until the results satisfy some objective function (such as minimal aircraft weight).

Page 17: Para Algo Design - C-DAC


Types of Parallelism: Data and Task Parallelism (Contd…)

[Figure: task and data parallelism example for aircraft design]

Page 18: Para Algo Design - C-DAC


Types of Parallelism: Data and Task Parallelism (Contd…)

Advantages

• Generality.
• Ability to increase scalability by exploiting both forms of parallelism in an application.
• Ability to coordinate multidisciplinary applications.

Problems

• Differences in parallel program structure.
• Address space organization.
• Language implementation.

Page 19: Para Algo Design - C-DAC


Types of Parallelism: Stream Parallelism

Stream Parallelism

• Stream parallelism refers to the simultaneous execution of different programs on a data stream. It is also referred to as pipelining.
• The computation is parallelized by executing a different program at each processor and sending intermediate results to the next processor.
• The result is a pipeline of data flow between processors.
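A pipeline of this kind can be sketched with one thread per stage connected by queues; each stage runs its own "program" on arriving items and forwards intermediate results to the next stage. This is an illustrative sketch, not the lecture's code:

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the data stream

def stage(fn, inbox, outbox):
    # Each stage repeatedly applies its own program to items flowing
    # through the pipeline and forwards results downstream.
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(fn(item))

def run_pipeline(data, fns):
    # One queue between every pair of adjacent stages.
    qs = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [threading.Thread(target=stage, args=(f, qs[i], qs[i + 1]))
               for i, f in enumerate(fns)]
    for t in threads:
        t.start()
    for x in data:
        qs[0].put(x)
    qs[0].put(DONE)
    out = []
    while True:
        item = qs[-1].get()
        if item is DONE:
            break
        out.append(item)
    for t in threads:
        t.join()
    return out

out = run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 2])
# out == [4, 6, 8]
```

Once the pipeline is full, all stages work on different items of the stream at the same time.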

Page 20: Para Algo Design - C-DAC


Types of Parallelism: Stream Parallelism (Contd…)

Pipelined execution of instructions: higher execution rates and better hardware utilization can be achieved.

Page 21: Para Algo Design - C-DAC


Types of Parallelism: Stream Parallelism (Contd…)

Stream Parallelism

Remarks:

• If each stage executes in a single clock cycle, the entire instruction execution can be accomplished in five clock cycles. In a sequential execution of the instructions, most of the processor hardware idles for a majority of the time.
• A compiler feature that takes advantage of the fact that many operations can be divided into stages which can proceed concurrently, in an attempt to utilize 100% of the CPU resources.
• In pipelining, this can be exploited by assigning a different physical unit to each phase.

Page 22: Para Algo Design - C-DAC


Types of Parallelism: Stream Parallelism (Contd…)

Observations

• Many problems exhibit a combination of data, task and stream parallelism.
• The amount of stream parallelism available in a problem is usually independent of the size of the problem.
• The amount of data and task parallelism in a problem usually increases with the size of the problem.
• Combinations of task and data parallelism often allow us to combine the coarse granularity inherent in task parallelism with the fine granularity of data parallelism to effectively utilize a large number of processors.

Page 23: Para Algo Design - C-DAC


Decomposition Techniques

The process of splitting the computations in a problem into a set of concurrent tasks is referred to as decomposition.

• Decomposing a problem effectively is of paramount importance in parallel computing.
• Without a good decomposition, we may not be able to achieve a high degree of concurrency.
• Decomposing a problem must also ensure good load balance.

Page 24: Para Algo Design - C-DAC


Decomposition Techniques (Contd…)

What is meant by a good decomposition?

• It should lead to a high degree of concurrency.
• The interaction among tasks should be as little as possible. These objectives often conflict with each other.
• Experience in parallel algorithm design has helped in the formulation of certain heuristics for decomposition.

Page 25: Para Algo Design - C-DAC


Decomposition Techniques (Contd…)

Decomposition techniques

• Recursive decomposition
• Data decomposition
• Exploratory decomposition
• Speculative decomposition
• Hybrid decomposition

Page 26: Para Algo Design - C-DAC


Recursive Decomposition

Recursive decomposition

• A method for inducing concurrency in problems that can be solved using the "divide and conquer" paradigm.
• A problem is solved by first dividing it into a set of independent sub-problems.
• Each of these sub-problems is solved recursively, and the solution to the original problem is obtained by combining the results of the sub-problems.
• Recursive decomposition can be applied to extract concurrency in such problems.
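Quicksort is the classic instance of recursive decomposition: the divide step creates independent sub-problems that can be sorted concurrently. A minimal Python sketch (not from the original slides; parallelism is limited to the top levels of the recursion tree to keep the thread count bounded):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_quicksort(seq, pool, depth=2):
    # Base case of the recursion.
    if len(seq) <= 1:
        return list(seq)
    # Divide: split the list around a pivot into independent sub-problems.
    pivot, rest = seq[0], seq[1:]
    smaller = [x for x in rest if x < pivot]
    greater = [x for x in rest if x >= pivot]
    if depth <= 0:
        # Deeper levels run serially; only the top of the tree is parallel.
        return (parallel_quicksort(smaller, pool, 0) + [pivot] +
                parallel_quicksort(greater, pool, 0))
    # Conquer: sort the two sub-problems concurrently, then combine.
    left = pool.submit(parallel_quicksort, smaller, pool, depth - 1)
    right = pool.submit(parallel_quicksort, greater, pool, depth - 1)
    return left.result() + [pivot] + right.result()

with ThreadPoolExecutor(max_workers=8) as pool:
    result = parallel_quicksort([3, 1, 4, 1, 5, 9, 2, 6], pool)
# result == [1, 1, 2, 3, 4, 5, 6, 9]
```

The recursion tree of submitted tasks is exactly the quicksort recursion tree shown on the next slide.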

Page 27: Para Algo Design - C-DAC


Recursive Decomposition (Contd…)

[Figure: the quicksort recursion tree for sorting a sequence of 12 numbers. Each shaded box corresponds to the pivot element used to split each particular list.]

Page 28: Para Algo Design - C-DAC


Data Decomposition

Data decomposition

• Data decomposition is a powerful method for deriving concurrency in algorithms that operate on large data structures.
• Decomposition of the computations is done in two steps:
  • First step: the data (or the domain) on which the computations are performed is partitioned.
  • Second step: this data partitioning is used to induce a partitioning of the computations into tasks, which leads to data-level parallelism.

Page 29: Para Algo Design - C-DAC


Data Decomposition (Contd…)

[Figure: decomposition of input, intermediate and output data. The algorithm is shown as a sequence of stages (stage 1, stage 2, stage 3); partitioning can be applied to the input data, to the intermediate data passed between stages, or to the output data.]

Page 30: Para Algo Design - C-DAC


Data Decomposition: Partitioning output data

• Any partitioning of the output suggests a decomposition of the overall computation performed by the algorithm.
• Inducing the decomposition from a partitioning of the output is a powerful method for finding concurrency.
• This decomposition leads to parallel algorithms that require little or no interaction among the various concurrent tasks.

Examples:

• Adding square matrices A and B to give matrix C = A + B (partition the computations by partitioning the output matrix into groups, each containing an equal number of matrix elements).
• Multiplying square matrices A and B to give matrix C = A * B.
• The problem of rendering a scene using the ray-tracing algorithm.
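For the matrix-multiplication example, partitioning the output means each task owns a group of rows of C and computes them from the shared, read-only inputs A and B, so the tasks never interact. A hypothetical sketch:

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_rows(a, b, rows):
    # One task "owns" a group of rows of the output C and computes only those.
    k, n = len(b), len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(n)]
            for i in rows]

def parallel_matmul(a, b, ntasks=2):
    # Partition the OUTPUT matrix into contiguous row groups, one per task.
    n = len(a)
    step = (n + ntasks - 1) // ntasks
    groups = [range(s, min(s + step, n)) for s in range(0, n, step)]
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda g: matmul_rows(a, b, g), groups))
    # Tasks share no output entries, so results concatenate without interaction.
    return [row for part in parts for row in part]
```

The same row-group partitioning applied to C = A + B gives the matrix-addition example above.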

Page 31: Para Algo Design - C-DAC


Data Decomposition: Partitioning output data (Contd…)

[Figure: rendering a scene using the ray-tracing algorithm. The amount of computation performed for each ray is different.]

Page 32: Para Algo Design - C-DAC


Data Decomposition: Partitioning output data (Contd…)

Example: Ray-tracing algorithm

• The output of the ray-tracing algorithm is a two-dimensional array corresponding to the size of the image.
• Each entry of this array stores the value of the corresponding pixel, computed by a ray passing through the pixel.
• Since the computation of the color value for each pixel is independent of the computations for the other pixels, we can decompose the computations performed by the algorithm by partitioning the array of pixels.

Page 33: Para Algo Design - C-DAC


Data Decomposition: Partitioning intermediate data

• In many algorithms, the computation of some output elements depends upon the computation of other output elements.
• In such cases, the computation of different output elements cannot be performed concurrently.
• Sometimes these computations can be restructured as a multi-stage computation such that the output elements of any one stage can be computed independently of each other.

Examples:

• Parallel Gauss elimination method to solve the linear system of matrix equations AX = B.
• LU factorization to solve the linear system of matrix equations AX = B.
• Adaptive repartitioning of finite element meshes: dynamic load balancing methods.

Page 34: Para Algo Design - C-DAC


Data Decomposition: Partitioning intermediate data (Contd…)

Example: Gauss elimination to solve the matrix system Ax = b

For k varying from 0 to n-1, the Gaussian elimination procedure systematically eliminates variable x[k] from equations k+1 to n-1, so that the matrix of coefficients becomes upper triangular.
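The forward-elimination loop can be sketched as below. Within each stage k, the updates of rows k+1 to n-1 are mutually independent, which is exactly the intermediate-data partitioning a parallel version would exploit. This serial sketch is for illustration only (no pivoting):

```python
def gaussian_eliminate(A, b):
    # Forward elimination: stage k eliminates x[k] from equations k+1..n-1,
    # leaving the coefficient matrix upper triangular.
    n = len(A)
    for k in range(n - 1):
        for i in range(k + 1, n):
            # The updates of rows k+1..n-1 within this stage are independent
            # of one another, so a parallel version assigns them to tasks.
            factor = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= factor * A[k][j]
            b[i] -= factor * b[k]
    return A, b

U, c = gaussian_eliminate([[2.0, 1.0], [4.0, 3.0]], [4.0, 10.0])
# U == [[2.0, 1.0], [0.0, 1.0]], c == [4.0, 2.0]
```

The stages themselves are sequential (stage k+1 needs stage k's results), which is why the decomposition partitions the intermediate data rather than the final output.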

Page 35: Para Algo Design - C-DAC


Data Decomposition: Partitioning input data

Remark

• Partitioning the output data can be performed only if each output can be naturally computed as a function of the input. In many algorithms, it is not possible or desirable to partition the output data.
• In such cases, it is possible to decompose the computation by partitioning the input array and assigning each input partition to a different task.

Example: Partitioning input data: partitioning a list around a pivot element

• For a sorting algorithm, the individual elements of the output cannot be efficiently determined in isolation.
• In such cases, it is sometimes possible to partition the input data and then use this partitioning to induce concurrency.

Page 36: Para Algo Design - C-DAC


Data Decomposition: Partitioning input data (Contd…)

Example: Sorting algorithm

• Partition the input data around a pivot element to obtain concurrency.

[Figure: the input list is divided among four tasks. Each task separates its elements into those smaller than the pivot and those greater than the pivot and counts each group. Prefix sums of the "smaller" counts and of the "greater" counts give each task an offset into the output array, so all four tasks can copy their elements into the smaller-than-pivot and greater-than-pivot regions of the output concurrently. Recursive decomposition is then applied to each region.]
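The partitioning step in the figure can be sketched as follows: each task splits its own chunk around the pivot, and exclusive prefix sums of the per-task counts give every task a conflict-free write offset into the output. The sketch below computes the offsets serially for clarity; in a parallel version, phases 1 and 3 run concurrently across tasks:

```python
def parallel_partition(chunks, pivot):
    # Phase 1: each task splits its own chunk (these run independently).
    smaller = [[x for x in c if x < pivot] for c in chunks]
    greater = [[x for x in c if x >= pivot] for c in chunks]
    # Phase 2: running (prefix) sums of the counts give every task a private
    # write offset, so phase 3 could copy into the output with no conflicts.
    out = [None] * sum(len(c) for c in chunks)
    n_smaller = sum(len(s) for s in smaller)
    s_off, g_off = 0, n_smaller
    for s, g in zip(smaller, greater):
        # Phase 3: each task writes its elements at its own offsets.
        out[s_off:s_off + len(s)] = s
        out[g_off:g_off + len(g)] = g
        s_off += len(s)
        g_off += len(g)
    return out, n_smaller

halves, split = parallel_partition([[2, 6], [1, 8], [3, 5]], pivot=4)
# halves == [2, 1, 3, 6, 8, 5]; everything before index split is < pivot
```

Recursive decomposition is then applied to `halves[:split]` and `halves[split:]`.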

Page 37: Para Algo Design - C-DAC


Data Decomposition: Partitioning input data (Contd…)

Partitioning the input data

Example: Data mining and database applications

• Consider the problem of computing the frequency of a set of itemsets in a transaction database (association rules).
• In this problem, we are given a set of n transactions T and a set of m itemsets I (I = I1, I2, ..., Im).
• Each transaction and itemset contains a small number of items, out of a possible set of items M.
• Our goal is to find, for each itemset Ii, the number of transactions in which it appears.
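Partitioning the input here means splitting the transactions among tasks while replicating the itemsets; each task produces partial counts, which a final reduction sums. A hypothetical sketch with toy data:

```python
from concurrent.futures import ThreadPoolExecutor

def count_itemsets(transactions, itemsets):
    # One task scans only its own partition of the transaction database.
    return [sum(1 for t in transactions if iset <= t) for iset in itemsets]

def parallel_itemset_frequency(transactions, itemsets, ntasks=2):
    # Partition the INPUT (the transactions); the itemsets are replicated.
    step = (len(transactions) + ntasks - 1) // ntasks
    parts = [transactions[i:i + step] for i in range(0, len(transactions), step)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda p: count_itemsets(p, itemsets), parts))
    # Reduction: add the partial counts from every partition.
    return [sum(col) for col in zip(*partials)]

freqs = parallel_itemset_frequency(
    [{1, 2, 3}, {1, 3}, {2, 3}, {1, 2}],  # transactions as item sets
    [{1}, {2, 3}])                        # itemsets to count
# freqs == [3, 2]
```

Unlike output partitioning, the tasks here interact once, in the final reduction step.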

Page 38: Para Algo Design - C-DAC


Data Decomposition (Contd…)

Summary

• We must look at which of the possible partitionings will yield a good (easy) computation partitioning.
• In using data decomposition, one must explore and evaluate all possible ways of partitioning the data.

The Owner-Computes Rule:

• All these examples use a widely applied method of inducing a computational partitioning from a data partitioning, called the owner-computes rule.
• Depending on the nature of the data, or the type of data used to perform the data partitioning, the owner-computes rule may mean different things.

Page 39: Para Algo Design - C-DAC


Exploratory Decomposition

• Used to decompose problems whose underlying computations correspond to a search of a space for solutions.
• Partition the search space into smaller parts and search each of these parts concurrently until the desired solutions are found.

Example: A large company wants to award ten prizes to ten of its randomly selected highly valued customers (a highly valued customer is one that buys over $100,000 worth of products annually).

Page 40: Para Algo Design - C-DAC


Exploratory Decomposition (Contd…)

• The task at hand is to select these ten customers. One way of solving this problem is to first randomly permute the customer records and then select the first records that correspond to highly valued customers.
• Given this problem, how can we decompose it so it will run in parallel?

Solution:

One way is to use data decomposition and split the database into ten equal parts. Then, in each of these ten smaller databases, we can apply a serial algorithm to find one highly valued customer.

Page 41: Para Algo Design - C-DAC


Exploratory Decomposition (Contd…)

Example: Consider the problem of searching a state space (such as finding a solution to a puzzle problem)

Steps:

• The 15-puzzle problem is typically solved using tree search techniques. Starting from the initial configuration, all possible successors are generated.
• One way of decomposing this computation is to split it into tasks, each one searching a different portion of the search space.
• The task of finding the shortest path from the initial to the final configuration now translates to finding a path from one of these newly generated nodes to the final configuration.

Page 42: Para Algo Design - C-DAC


Exploratory Decomposition (Contd…)

[Figure: a 15-puzzle problem instance: (a) initial configuration; (b)-(e) a sequence of moves leading from the initial to the final configuration; (f) final configuration.]

Page 43: Para Algo Design - C-DAC


Hybrid Decomposition

• These decomposition techniques are not exclusive, and can often be combined to obtain a higher degree of concurrency than that obtained by applying any one decomposition.
• In fact, for many problems it is necessary to combine different decomposition techniques to obtain an acceptable degree of concurrency. At different stages, different types of decomposition have to be applied to derive concurrency in a parallel algorithm.

Examples:

• Parallel quicksort algorithm: the sample sort algorithm.
• In many state-space search problems, the computation associated with generating successors itself requires a significant amount of work, and additional concurrency can be obtained by decomposing the task of generating the successors.

Page 44: Para Algo Design - C-DAC


Speculative Decomposition

• Speculative decomposition leads to effective parallel algorithms when the speculative tasks are later found to be needed for the solution of the problem.
• However, in situations in which a large fraction of these speculative tasks are found to be unnecessary, the parallel algorithm ends up performing a lot of non-essential computation.

Examples:

• Optimistic parallel discrete event simulation algorithms.
• Randomized algorithms in graph partitioning algorithms.

Page 45: Para Algo Design - C-DAC


Load Balancing Techniques

Objectives:

• The amount of computation assigned to each processor is balanced, so that some processors do not idle while others are executing tasks.
• The interactions among the different processors are minimized, so that the processors spend most of their time doing useful work.
• To balance the load among processors, it may be necessary to assign tasks that interact heavily to different processors.

Remark: These objectives conflict with each other, so the problem of finding a good mapping becomes harder.

Page 46: Para Algo Design - C-DAC


Load Balancing Techniques (Contd…)

• Static load balancing
  • Distribute the work among processors prior to the execution of the algorithm.
  • Example: matrix-matrix computation.
  • Easy to design and implement.
• Dynamic load balancing
  • Distribute the work among processors during the execution of the algorithm.
  • Algorithms that require dynamic load balancing are somewhat more complicated (parallel graph partitioning and adaptive finite element computations).

Page 47: Para Algo Design - C-DAC


Characteristics for Suitability of Load Balancing

• The concurrency and interdependency of the tasks available in a parallel program are fully specified by its task graph. Two characteristics determine the suitability of static or dynamic load balancing:
  • The first characteristic is whether or not the tasks that are required to be solved are available beforehand, or are dynamically created during the solution of the problem.
  • The second characteristic that influences the choice of load balancing scheme is knowledge of the task size, i.e., the amount of time required to solve each task.
• There are problems in which the amount of computation per task is not known a priori and can be significantly different for different data.

Page 48: Para Algo Design - C-DAC


Schemes for Static Load Balancing

Array distribution schemes: block distributions of a matrix; one-dimensional (strip); two-dimensional (checkerboard); block-cyclic distributions; randomized block distributions.

[Figure: using the block-cyclic distribution shown in (b) to distribute the computations performed on array (a) leads to load imbalance for a sparse matrix in Gaussian elimination.]
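The block-cyclic mapping itself is a simple formula: rows are grouped into blocks, and blocks are dealt out to the processors round-robin. A sketch (the block size and processor count are arbitrary example values):

```python
def block_cyclic_owner(i, block_size, nprocs):
    # Row i belongs to block i // block_size; blocks are dealt out
    # cyclically (round-robin) among the processors.
    return (i // block_size) % nprocs

# With 2 processors and block size 2, rows 0..7 map as 0,0,1,1,0,0,1,1.
assignment = [block_cyclic_owner(i, 2, 2) for i in range(8)]
```

Because every processor receives rows from all parts of the matrix, the shrinking active region of Gaussian elimination stays spread across processors, unlike a plain block distribution; the figure's point is that this can still break down when the matrix is sparse.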

Page 49: Para Algo Design - C-DAC


Schemes for Static Load Balancing (Contd…)

Data decomposition: when data decomposition is used to derive concurrency, a suitable decomposition of the data can itself be used to balance the load and minimize interactions.

Remark: Different decompositions require different amounts of data.

• Dense matrix-matrix multiplication
• Matrix-matrix addition

Page 50: Para Algo Design - C-DAC


Schemes for Dynamic Load Balancing (Contd…)

Dynamic partitioning

• There are problems in which we cannot statically partition the work among the processors.
• In these problems, a static work partitioning is either impossible (e.g., the first class below) or can potentially lead to serious load imbalance problems (e.g., the second and third classes).
• The only way to develop efficient message passing programs for these classes of problems is to allow dynamic load balancing.
• Thus, during the execution of the program, work is dynamically transferred from the processors that have a lot of work to the processors that have little or no work.


• Three general classes of problems fall into this category:

• The first class consists of problems in which all the tasks are available at the beginning of the computation, but the amount of time required by each task is different and cannot be determined a priori

• The second class consists of problems in which the tasks are available at the beginning, but as the computation progresses, the amount of time required by each task changes.

• The third class consists of problems in which tasks are generated dynamically

Schemes for Dynamic Load Balancing (Contd…)


• Self-scheduling methods

• Chunk-scheduling methods

• Master-slave paradigm

• Randomization techniques

• Round-robin methods
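A minimal sketch of the self-scheduling / master-slave idea, using a shared task queue to stand in for the master (illustrative only; `run_self_scheduled`, the worker count, and the task granularity are assumptions, not part of the lecture):

```python
# Sketch of self-scheduling: idle workers pull the next task from a
# shared queue, so load balances dynamically even when task costs differ.
import queue
import threading

def run_self_scheduled(tasks, n_workers, work_fn):
    task_q = queue.Queue()
    for t in tasks:
        task_q.put(t)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                t = task_q.get_nowait()   # self-schedule onto remaining work
            except queue.Empty:
                return
            r = work_fn(t)
            with lock:                    # protect the shared result map
                results[t] = r

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# Tasks with very different costs still spread evenly over the workers.
out = run_self_scheduled(range(8), n_workers=3, work_fn=lambda t: t * t)
print(sorted(out.items()))
```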

Schemes for Dynamic Load Balancing (Contd…)


• Refining a mesh in parallel often requires dynamic load-balancing algorithms

• Use diffusive and non-diffusive algorithms

• The adapted graph is partitioned from scratch using state-of-the-art multilevel graph partitioning

• Incremental modification of re-partitioned graphs

• Cut-and-paste re-partitioning algorithms

Examples : Adaptive load balancing of unstructured meshes

Schemes for Dynamic Load Balancing (Contd…)


Random distribution of elements in a two dimensional region

Graph partitioning of a two dimensional region

Graph Partitioning Algorithms


• Elements (evenly distributed)

• Processor-level nodes (determined by the elements on each processor)

• Global nodes (aligned with processor-level nodes as much as possible)

Mesh Distribution for Finite Element Computations


Example for coloring a graph on three processors

Coloring a Sparse Graph


Example : Rendering a screen using the ray tracing algorithm

• The amount of computation performed for each ray is different.

• The output of the ray-tracing algorithm is a two-dimensional array corresponding to the size of the image.

Dynamic Load Balancing (Contd…)


Example : Ray tracing algorithm

• Each entry of this array stores the value of the corresponding pixel, computed by a ray passing through the pixel.

• Even though the tasks are known statically, the amount of time required by each task is not known and can vary significantly.

• In general, the amount of computation associated with each ray depends on the type of the objects that it hits, their surface properties, and whether or not they are in direct view of the different light sources.

• Some rays hit the background, requiring very little processing

• It is not possible to get an accurate estimate of the work associated with each task; thus, assigning an equal number of rays to each processor may potentially lead to significant load imbalances.
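One way to cope with this, following the chunk-scheduling method listed earlier, is to let idle processors grab the next unprocessed chunk of pixels. A sketch, with `trace_ray` as a hypothetical stand-in for the per-pixel work and all other names assumed:

```python
# Sketch: dynamic chunk scheduling of pixels for ray tracing.
# `trace_ray` is a hypothetical placeholder for the per-pixel computation.
import itertools
import queue
import threading

def render(width, height, trace_ray, n_workers=4, chunk=16):
    image = [[None] * width for _ in range(height)]
    pixels = list(itertools.product(range(height), range(width)))
    chunks = queue.Queue()
    for i in range(0, len(pixels), chunk):
        chunks.put(pixels[i:i + chunk])   # a chunk of pixel coordinates

    def worker():
        while True:
            try:
                c = chunks.get_nowait()   # grab the next unprocessed chunk
            except queue.Empty:
                return
            for y, x in c:                # rays in one chunk may cost very
                image[y][x] = trace_ray(x, y)  # different amounts of work

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return image

img = render(8, 8, trace_ray=lambda x, y: x + y)
print(img[7][7])  # 14
```

Each worker writes to distinct pixels, so no lock is needed on the image itself; the chunk size trades scheduling overhead against balance.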

Dynamic Load Balancing (Contd…)


A 15-puzzle problem instance: (a) initial configuration; (b)-(e) a sequence of moves leading from the initial to the final configuration.

In this problem, it is not possible to get an accurate estimate of the work associated with each task, as the size of the tree rooted at any node can vary significantly.

[Figure: six 4 x 4 board configurations, (a)-(f), showing the sequence of tile moves]

Example: 15-puzzle problem

Dynamic Load Balancing (Contd…)


Overheads in algorithm design

• The time that is required by the parallel algorithm but is not essential to the serial algorithm is referred to as overhead due to parallelization.

• Minimizing these overheads is critical in developing efficient parallel algorithms.

• Parallel algorithms can potentially incur three types of overheads:

• The first is due to idling

• The second is due to data sharing and synchronization

• The third is due to extraneous computations

Overheads in Algorithm Design


Idling overheads:

Processors can become idle for three main reasons:

• The first reason is load imbalance

• The second is computational dependencies.

• The third is interprocessor communication, information exchange, and synchronization.

Overheads in Algorithm Design (Contd…)


Data sharing overheads

• In most non-trivial parallel programs, the tasks executed by different processors require access to the same data.

• For example, in matrix-vector multiplication y = Ax, the tasks that compute individual entries of y all require access to the entire vector x. (Here A may be a dense or a sparse matrix.)

• Some tasks require data that are generated dynamically by earlier tasks.
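A single-process sketch of this data-sharing pattern for y = Ax with a row-block decomposition, where each processor's block of x must be gathered before the local computation can proceed (the `allgather` helper is an illustrative stand-in for the collective operation, not a real library call):

```python
# Sketch: data sharing in y = A x with a 1-D row-block decomposition.
# Each "processor" owns a block of rows of A and the matching block of x;
# computing its rows of y requires the ENTIRE vector x, so the blocks of x
# must first be shared (an all-gather). Single-process simulation.

def allgather(blocks):
    """Stand-in for a collective all-gather: concatenate every block."""
    full = []
    for b in blocks:
        full.extend(b)
    return full

def local_matvec(A_rows, x_full):
    return [sum(a * b for a, b in zip(row, x_full)) for row in A_rows]

A = [[1, 0, 2, 0],
     [0, 1, 0, 3],
     [4, 0, 0, 1],
     [0, 2, 1, 0]]
x_blocks = [[1, 2], [3, 4]]     # x = [1, 2, 3, 4] split over 2 "processors"
A_blocks = [A[0:2], A[2:4]]     # rows split the same way

x_full = allgather(x_blocks)    # the data-sharing step
y = local_matvec(A_blocks[0], x_full) + local_matvec(A_blocks[1], x_full)
print(y)  # [7, 14, 8, 7]
```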

Overheads in Algorithm Design (Contd…)


Data sharing overheads

• Data sharing can take different forms, which depend on the underlying architecture.

• On distributed-memory machines, data is initially partitioned among the processors.

• On shared-address-space machines, data sharing is often performed implicitly.

Overheads in Algorithm Design (Contd…)


Data sharing overheads

• Developing algorithms that minimize data-sharing overheads often involves:

• Intelligent task decomposition

• Task distribution

• Initial data distribution

• Proper scheduling of the data-sharing operations

• Collective data-sharing operations.

Overheads in Algorithm Design (Contd…)


Point-to-Point data sharing operations

• Collective Data Sharing Operations

  • Collective Data Transfers: Broadcast, Gather, Scatter, All-to-All

  • Collective Computations: Reduction

  • Collective Synchronization: Barrier
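As a sketch of how a collective computation such as a reduction is typically organized, the tree-structured combination below finishes in ceil(log2 p) steps rather than p - 1 (illustrative simulation; a plain list stands in for one partial value per process, and `tree_reduce` is an assumed name):

```python
# Sketch: a tree-structured reduction (collective computation).
# With p processes, partial values are combined pairwise so that the
# reduction completes in ceil(log2 p) communication steps.

def tree_reduce(values, op):
    vals = list(values)
    steps = 0
    stride = 1
    while stride < len(vals):
        # at each step, "process" i receives from "process" i + stride
        for i in range(0, len(vals) - stride, 2 * stride):
            vals[i] = op(vals[i], vals[i + stride])
        stride *= 2
        steps += 1
    return vals[0], steps

total, steps = tree_reduce([3, 1, 4, 1, 5, 9, 2, 6], lambda a, b: a + b)
print(total, steps)  # 31 3  (sum of 8 values in 3 steps)
```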

Overheads in Algorithm Design (Contd…)


Cost of Data Sharing Operations

• Structured data sharing: dense matrix-matrix multiplication

• Unstructured data sharing: sparse matrix-vector multiplication

Overheads in Algorithm Design (Contd…)


Multiplication of a sparse matrix A and a vector b

P1 needs to receive elements {8, 9, 10} from P2.
P2 needs to receive only elements {4, 6} from P1.
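The required communication can be derived from the column indices of each processor's local nonzeros. The sketch below uses a hypothetical partition and hypothetical index sets chosen to reproduce the element sets quoted above; it is not the actual matrix from the slide:

```python
# Sketch: which vector elements each processor must receive, derived from
# the column indices of its local nonzeros. Partition and index sets are
# illustrative, not the actual matrix from the slide.

def needed_remote(local_cols, owned):
    """Columns referenced by local nonzeros that another processor owns."""
    return sorted(set(local_cols) - set(owned))

# P1 owns b[0..6], P2 owns b[7..13] (hypothetical partition)
p1_owned, p2_owned = range(0, 7), range(7, 14)
p1_cols = [0, 2, 8, 3, 9, 10, 5]     # columns of P1's local nonzeros
p2_cols = [7, 4, 11, 6, 8, 13]       # columns of P2's local nonzeros

print(needed_remote(p1_cols, p1_owned))  # [8, 9, 10] -- from P2
print(needed_remote(p2_cols, p2_owned))  # [4, 6]     -- from P1
```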

Sparse Matrix A and Vector Multiplication


Extraneous Computations

• Parallel algorithms can potentially perform two types of extraneous computations.

• The first type consists of non-essential computations that are performed by the parallel algorithm but are not required by the serial one.

• In parallel algorithms, non-essential computations often arise because of speculative execution.

• The second type of extraneous computation happens when more than one processor ends up performing the same computation.

Example: Parallel search of a sparse graph.

Overheads in Algorithm Design (Contd…)


Methods for containing interaction overheads

• Data sharing overheads

• Static vs. dynamic data-sharing interactions

Example: Sparse matrix-vector multiplication

• Regular vs. irregular interactions

Remark: One must strive to minimize the interaction overheads in the same way as one minimizes the overheads due to idling and extraneous computations.

• Two general schemes for dealing with interaction overheads:

• Perform useful work while the interaction takes place

• Try to reduce the interactions in parallel programs.
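A sketch of the first scheme, overlapping useful work with an in-flight interaction: a background thread plays the role of a non-blocking receive, and all names and timings here are illustrative assumptions:

```python
# Sketch: overlapping computation with an interaction. A background
# thread stands in for a non-blocking receive; the main loop computes on
# the current block while the next one is "in flight".
import threading
import time

def fetch_block(i, out):
    time.sleep(0.01)                 # stand-in for communication latency
    out[i] = list(range(i * 4, i * 4 + 4))

def process(block):
    return sum(block)

incoming = {}
total = 0
fetch_block(0, incoming)             # get the first block up front
for i in range(3):
    if i + 1 < 3:                    # start the NEXT transfer...
        t = threading.Thread(target=fetch_block, args=(i + 1, incoming))
        t.start()
    total += process(incoming[i])    # ...while computing on this one
    if i + 1 < 3:
        t.join()                     # wait only if the transfer is slower
print(total)  # sum of 0..11 = 66
```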

Overheads in Algorithm Design (Contd…)


General Techniques for Choosing the Right Parallel Algorithm

• Maximize data locality

• Minimize volume of data

• Minimize frequency of interactions

• Overlap computations with interactions.

• Data replication

• Minimize contention and hot spots.

• Use highly optimized collective interaction operations.

• Collective data transfers and computations

• Maximize concurrency.

Overheads in Algorithm Design (Contd…)


Conclusions

A five-step process for the design of parallel programs; the steps are as follows:

• Identification of the type of parallelism

• Decomposition to expose concurrency

• Mapping and interaction cost

• Extraneous computations

• Handling overheads

Overheads in Algorithm Design (Contd…)
