Lecture 4: Principles of Parallel Algorithm Design (part 4) zxu2/acms60212-40212-s12/lec-05-3.pdf ...

Post on 09-Apr-2020


Lecture 4: Principles of Parallel Algorithm Design (part 4)


Mapping Technique for Load Balancing

• Sources of overhead:
  – Inter-process interaction
  – Idling
• Goals:
  – Reduce interaction time
  – Reduce the total time that processes spend idle
  – Remark: these two goals often conflict
• Classes of mapping:
  – Static
  – Dynamic


Schemes for Static Mapping

• Mapping based on data partitioning
• Task graph partitioning
• Hybrid strategies


Mapping Based on Data Partitioning

• By the owner-computes rule, mapping the relevant data onto processes is equivalent to mapping tasks onto processes
• Arrays or matrices
  – Block distributions
  – Cyclic and block-cyclic distributions
• Irregular data
  – Example: data associated with an unstructured mesh
  – Graph partitioning


1D Block Distribution


Example. Distribute the rows or the columns of a matrix to different processes
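A minimal sketch of a 1D block distribution of rows. The function name `block_range` and the choice to give the leftover rows to the lowest-ranked processes when p does not divide n are assumptions made for illustration, not something the slide specifies.

```python
def block_range(n, p, rank):
    """Return the half-open row range [lo, hi) owned by process `rank`
    when n rows are split into p contiguous blocks."""
    base, extra = divmod(n, p)
    # Assumption: the first `extra` processes each take one additional row.
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


# Row ranges for 10 rows on 4 processes.
ranges = [block_range(10, 4, r) for r in range(4)]
```

The same function distributes columns if the indices are read as column numbers.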

Multi-D Block Distribution


Example. Distribute blocks of a matrix to different processes
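A small sketch of a 2D block distribution, assuming an n × n matrix on a q × q process grid with q dividing n. The name `block_owner_2d` and the row-major rank numbering are illustrative assumptions.

```python
def block_owner_2d(i, j, n, q):
    """Grid coordinates (row, col) of the process that owns entry (i, j)
    of an n x n matrix split into q x q contiguous blocks (q divides n)."""
    b = n // q  # block edge length
    return i // b, j // b


def grid_rank(coords, q):
    """Assumed row-major numbering of the q x q process grid."""
    row, col = coords
    return row * q + col
```

For example, with n = 8 and q = 2, entry (5, 2) lies in the lower-left block, which is grid position (1, 0).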

Load-Balance for Block Distribution

Example. $n \times n$ dense matrix multiplication $C = A \times B$ using $p$ processes
– Decomposition based on output data.
– Each entry of $C$ requires the same amount of computation.
– Either a 1D or a 2D block distribution can be used:
  • 1D distribution: $n/p$ rows are assigned to each process
  • 2D distribution: a block of size $(n/\sqrt{p}) \times (n/\sqrt{p})$ is assigned to each process
– A multi-D distribution allows a higher degree of concurrency.
– A multi-D distribution can also help to reduce interactions.
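A rough count of the input data each process must access suggests why the 2D distribution reduces interaction. This is a sketch under the assumption that a process needs every entry of $A$ and $B$ that its block of $C$ depends on:

$$
\text{1D: } \underbrace{\frac{n}{p}\cdot n}_{\text{rows of } A} + \underbrace{n^2}_{\text{all of } B} = \frac{n^2}{p} + n^2,
\qquad
\text{2D: } \underbrace{\frac{n}{\sqrt{p}}\cdot n}_{\text{rows of } A} + \underbrace{n\cdot\frac{n}{\sqrt{p}}}_{\text{columns of } B} = \frac{2n^2}{\sqrt{p}}.
$$

On concurrency: a 1D distribution can use at most $n$ processes (one row each), while a 2D distribution can use up to $n^2$ (one entry each).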


Cyclic and Block Cyclic Distributions

• If the amount of work differs for different entries of a matrix, a block distribution can lead to load imbalances.
• Example. Doolittle's method of LU factorization of a dense matrix


Doolittle's method of LU factorization

$$
A =
\begin{pmatrix}
a_{11} & a_{12} & \dots & a_{1n} \\
a_{21} & a_{22} & \dots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \dots & a_{nn}
\end{pmatrix}
= LU =
\begin{pmatrix}
1 & 0 & \dots & 0 \\
l_{21} & 1 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
l_{n1} & l_{n2} & \dots & 1
\end{pmatrix}
\begin{pmatrix}
u_{11} & u_{12} & \dots & u_{1n} \\
0 & u_{22} & \dots & u_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & u_{nn}
\end{pmatrix}
$$

By matrix-matrix multiplication:

$u_{1j} = a_{1j}, \quad j = 1, 2, \dots, n$ (1st row of $U$)

$l_{j1} = a_{j1} / u_{11}, \quad j = 1, 2, \dots, n$ (1st column of $L$)

For $i = 2, 3, \dots, n-1$ do

$\quad u_{ii} = a_{ii} - \sum_{t=1}^{i-1} l_{it} u_{ti}$

$\quad u_{ij} = a_{ij} - \sum_{t=1}^{i-1} l_{it} u_{tj}$ for $j = i+1, \dots, n$ ($i$th row of $U$)

$\quad l_{ji} = \dfrac{a_{ji} - \sum_{t=1}^{i-1} l_{jt} u_{ti}}{u_{ii}}$ for $j = i+1, \dots, n$ ($i$th column of $L$)

End

$u_{nn} = a_{nn} - \sum_{t=1}^{n-1} l_{nt} u_{tn}$
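The formulas above translate almost line by line into code. The following is a minimal sketch that assumes no pivoting is needed (every $u_{ii}$ is nonzero) and uses 0-based indices; the function name `doolittle_lu` is chosen here for illustration.

```python
def doolittle_lu(a):
    """Factor a square matrix (list of lists) as A = L U, with ones on the
    diagonal of L. Assumes no pivoting is required (u[i][i] != 0)."""
    n = len(a)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # i-th row of U: u_ij = a_ij - sum_{t<i} l_it * u_tj, for j >= i
        for j in range(i, n):
            U[i][j] = a[i][j] - sum(L[i][t] * U[t][j] for t in range(i))
        # i-th column of L: l_ji = (a_ji - sum_{t<i} l_jt * u_ti) / u_ii, for j > i
        for j in range(i + 1, n):
            L[j][i] = (a[j][i] - sum(L[j][t] * U[t][i] for t in range(i))) / U[i][i]
    return L, U


L, U = doolittle_lu([[4.0, 3.0], [6.0, 3.0]])
```

Note that the sums get longer as i grows, which is the source of the uneven work per entry discussed on the following slides.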

Serial Column-Based LU


• Remark: Matrices L and U share space with A
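The slide's algorithm listing did not survive extraction. As a stand-in, here is a sketch of one common in-place, column-oriented LU variant without pivoting, in which the strict lower triangle of the array ends up holding L (its unit diagonal is implicit) and the upper triangle holds U. It is not necessarily the exact listing from the slide.

```python
def lu_in_place(a):
    """Overwrite the square matrix `a` (list of lists of floats) with its
    LU factors. Assumes no pivoting is required."""
    n = len(a)
    for k in range(n):
        # Scale column k below the diagonal: these become entries of L.
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
        # Update the trailing submatrix, column by column.
        for j in range(k + 1, n):
            for i in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return a


factors = lu_in_place([[4.0, 3.0], [6.0, 3.0]])
```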

Work used to compute Entries of L and U


• Block distribution of LU factorization tasks leads to load imbalance.
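The imbalance can be made concrete with a simple cost model. In Doolittle's formulas, the sum for entry (i, j) has min(i, j) terms with 0-based indices, so this sketch charges min(i, j) multiply-adds per entry and ignores the division. The model and the helper names are assumptions for illustration; it compares a 1D block distribution of rows with a cyclic one.

```python
def lu_work(i, j):
    """Assumed cost model: multiply-adds needed for entry (i, j), 0-based."""
    return min(i, j)


def load_per_process(n, p, owner):
    """Total work per process when row i is assigned to process owner(i)."""
    loads = [0] * p
    for i in range(n):
        for j in range(n):
            loads[owner(i)] += lu_work(i, j)
    return loads


n, p = 16, 4
block_loads = load_per_process(n, p, lambda i: i // (n // p))  # contiguous rows
cyclic_loads = load_per_process(n, p, lambda i: i % p)         # round-robin rows
```

Under this model the last process in the block distribution does several times the work of the first, while the cyclic distribution spreads the work far more evenly.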


Block-Cyclic Distribution

• A variation of block distribution that can be used to alleviate load imbalance.
• Steps
  1. Partition the array into many more blocks than the number of available processes.
  2. Assign blocks to processes in a round-robin manner so that each process gets several non-adjacent blocks.


(a) The rows of the array are grouped into blocks each consisting of two rows, resulting in eight blocks of rows. These blocks are distributed to four processes in a wraparound fashion.

(b) The matrix is blocked into 16 blocks each of size 4×4, and it is mapped onto a 2×2 grid of processes in a wraparound fashion.

• Cyclic distribution: the special case where the block size is 1
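The two cases (a) and (b) can be sketched as owner functions. The function names are illustrative, and the 2D version assumes a square q × q process grid with square blocks.

```python
def block_cyclic_owner_1d(row, block_size, p):
    """Process that owns `row` when blocks of `block_size` rows are dealt
    out to p processes in round-robin order."""
    return (row // block_size) % p


def block_cyclic_owner_2d(i, j, block_size, q):
    """Grid coordinates of the owner of entry (i, j) on a q x q process
    grid with square blocks assigned in wraparound fashion."""
    return (i // block_size) % q, (j // block_size) % q


# Case (a): 16 rows, blocks of two rows, four processes.
owners_a = [block_cyclic_owner_1d(r, 2, 4) for r in range(16)]
```

Setting `block_size` to 1 gives the cyclic distribution.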

Graph Partitioning

• Assign an equal number of nodes (or cells) to each process
• Minimize the number of edges that cross between partitions (the edge count of the partition)


[Figure panels: random partitioning; partitioning for minimizing edge count]
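The quantity being minimized can be computed directly. This sketch uses a hypothetical six-node ring rather than the mesh from the figure, and compares a contiguous partition with an interleaved one; both assign the same number of nodes to each process.

```python
def edge_cut(edges, part):
    """Number of edges whose endpoints are assigned to different processes.
    `part[v]` is the process that owns node v."""
    return sum(1 for u, v in edges if part[u] != part[v])


# Hypothetical example graph: a ring of six nodes.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
contiguous_cut = edge_cut(ring, [0, 0, 0, 1, 1, 1])
interleaved_cut = edge_cut(ring, [0, 1, 0, 1, 0, 1])
```

Every cut edge corresponds to data that must be exchanged between two processes, so the contiguous partition needs less interaction here.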

Mappings Based on Task Partitioning

• Mapping based on task partitioning can be used when the computation is naturally expressed as a static task-dependency graph whose tasks have known sizes.
• Finding an optimal mapping that minimizes both idle time and interaction time is NP-complete
• Heuristic solutions exist for many structured graphs


Mapping a Binary Tree Task-Dependency Graph

• Finding min.


• Mapping the tree graph onto 8 processes
• The mapping minimizes the interaction overhead by mapping interdependent tasks onto the same process (i.e., process 0) and the others onto processes only one communication link away from each other
• Idling exists. This is inherent in the graph
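One way to write down such a mapping is sketched below. It assumes a complete binary tree with p leaves, p a power of two, and processes arranged so that ranks differing in a single bit are directly linked (a hypercube); that topology is an assumption, since the slide's figure is not available here.

```python
def tree_mapping(p):
    """For each tree level, leaves first, list the processes that execute a
    task at that level. Assumes p is a power of two."""
    levels = []
    stride = 1
    while stride <= p:
        # At this level only every `stride`-th process is active; for
        # stride > 1 each active process r combines its own partial result
        # with the one from process r + stride // 2.
        levels.append(list(range(0, p, stride)))
        stride *= 2
    return levels


levels = tree_mapping(8)
active_counts = [len(level) for level in levels]
```

Process 0 executes one task at every level, and the number of active processes halves from level to level, which is the idling the last bullet refers to.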

Mapping a Sparse Graph

Example. Sparse matrix-vector multiplication using 3 processes

• Arrow distribution
• Partitioning the task-interaction graph to reduce interaction overhead
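The interaction overhead of a given mapping can be estimated by counting the vector entries each process has to fetch. The sketch below assumes that row i of the matrix and entry x[i] of the vector live on the same process. The sparsity pattern is a hypothetical tridiagonal example on 3 processes, not the matrix from the slide.

```python
def remote_entries(nonzero_cols, owner, p):
    """For y = A x, return for each process the set of x indices it must
    fetch from other processes. `nonzero_cols[i]` lists the columns with
    nonzeros in row i; `owner[i]` owns both row i and x[i]."""
    needed = [set() for _ in range(p)]
    for i, cols in nonzero_cols.items():
        for j in cols:
            if owner[j] != owner[i]:
                needed[owner[i]].add(j)
    return needed


# Hypothetical 6 x 6 tridiagonal sparsity pattern.
pattern = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2, 3],
           3: [2, 3, 4], 4: [3, 4, 5], 5: [4, 5]}
contiguous = remote_entries(pattern, [0, 0, 1, 1, 2, 2], 3)
interleaved = remote_entries(pattern, [0, 1, 2, 0, 1, 2], 3)
```

Both mappings give each process two rows, but the contiguous one needs far fewer remote entries, which is the effect of partitioning the task-interaction graph.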


Schemes for Dynamic Mapping

• When static mapping results in a highly imbalanced distribution of work among processes, or when the task-dependency graph is dynamic, use dynamic mapping
• The primary goal is to balance load (dynamic load balancing)
  – Example: dynamic load balancing for AMR
• Types
  – Centralized
  – Distributed


Centralized Dynamic Mapping

• Processes
  – Master: manages a pool of available tasks
  – Slave: depends on the master to obtain work
• Idea
  – When a slave process has no work, it takes a portion of the available work from the master
  – When a new task is generated, it is added to the pool of tasks in the master process
• Potential problem
  – When many processes are used, the master process may become a bottleneck
• Solution
  – Chunk scheduling: every time a process runs out of work, it gets a group of tasks.
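Chunk scheduling can be sketched with threads standing in for processes. Here a lock-protected index plays the role of the master's task pool, and each worker thread (the slave role) grabs `chunk_size` tasks at a time. All names are illustrative, and the sketch assumes a fixed task list, so it does not model tasks generated at run time.

```python
import threading


def run_with_chunks(tasks, num_workers, chunk_size, work):
    """Apply `work` to every task, handing out tasks in chunks."""
    next_index = 0
    lock = threading.Lock()  # guards the shared pool (the master's role)
    results = [None] * len(tasks)

    def get_chunk():
        nonlocal next_index
        with lock:
            lo = next_index
            next_index = min(lo + chunk_size, len(tasks))
            return lo, next_index

    def worker():
        while True:
            lo, hi = get_chunk()
            if lo == hi:  # pool is empty
                return
            for i in range(lo, hi):
                results[i] = work(tasks[i])

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


squares = run_with_chunks(list(range(10)), 3, 4, lambda x: x * x)
```

A larger `chunk_size` means fewer trips to the shared pool but a coarser balance of work at the end of the run.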


Distributed Dynamic Mapping

• All processes are peers. Tasks are distributed among processes, which exchange tasks at run time to balance work
• Each process can send work to or receive work from other processes
  – How are sending and receiving processes paired together?
  – Is the work transfer initiated by the sender or the receiver?
  – How much work is transferred?
  – When is the work transfer performed?
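The following round-based simulation picks one answer to each of the four questions, purely as an illustration: pairing is by round-robin polling, the transfer is initiated by the receiver, half of the victim's tasks are moved, and the transfer happens when a process runs out of work. These policies are assumptions made for the sketch, not choices stated on the slide.

```python
from collections import deque


def simulate_work_stealing(initial_loads):
    """Return the number of rounds until all unit-time tasks are done."""
    queues = [deque(range(load)) for load in initial_loads]
    p = len(queues)
    rounds = 0
    while any(queues):
        # Phase 1: each idle process polls the others in round-robin order.
        for r in range(p):
            if not queues[r]:
                for k in range(1, p):
                    victim = queues[(r + k) % p]
                    if len(victim) > 1:
                        # Take half of the victim's tasks.
                        for _ in range(len(victim) // 2):
                            queues[r].append(victim.pop())
                        break
        # Phase 2: every process that has work executes one task.
        for q in queues:
            if q:
                q.popleft()
        rounds += 1
    return rounds


rounds_needed = simulate_work_stealing([8, 0, 0, 0])
```

Without any transfers, the process holding all eight tasks would need eight rounds.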


Techniques to Minimize Interaction Overheads

• Maximize data locality
  – Maximize the reuse of recently accessed data
  – Minimize the volume of data exchange
    • Use a high-dimensional distribution. Example: 2D block distribution for matrix multiplication
  – Minimize the frequency of interactions
    • Restructure the algorithm so that shared data are accessed and used in large pieces.
    • Combine messages between the same source-destination pair


Techniques to Minimize Interaction Overheads

• Minimize contention and hot spots
  – Contention occurs when multiple tasks try to access the same resource concurrently, e.g., multiple processes sending messages to the same process, or multiple simultaneous accesses to the same memory block


• Using $C_{i,j} = \sum_{k=0}^{\sqrt{p}-1} A_{i,k} B_{k,j}$ causes contention. For example, $C_{0,0}, C_{0,1}, \dots, C_{0,\sqrt{p}-1}$ all attempt to read $A_{0,0}$ at once.

• A contention-free alternative is to use:

$$C_{i,j} = \sum_{k=0}^{\sqrt{p}-1} A_{i,(i+j+k)\,\%\,\sqrt{p}} \; B_{(i+j+k)\,\%\,\sqrt{p},\,j}$$

All tasks $P_{*,j}$ that work on the same row of $C$ access block $A_{i,(i+j+k)\,\%\,\sqrt{p}}$, which is different for each task.
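The claim can be checked by enumerating the block indices. In this sketch q stands for $\sqrt{p}$, and the helper names are made up for illustration. It verifies that at every step k the tasks in one row of C read pairwise different blocks of A, and that each task still visits every block index exactly once over all steps, so the sum is unchanged.

```python
def a_block_column(i, j, k, q):
    """Column index of the A block read by task (i, j) at step k."""
    return (i + j + k) % q


def is_contention_free(q):
    """True if, at every step, the tasks in each row of C read distinct A blocks."""
    for i in range(q):
        for k in range(q):
            blocks = {a_block_column(i, j, k, q) for j in range(q)}
            if len(blocks) != q:
                return False
    return True


def covers_all_terms(q):
    """True if every task (i, j) uses each block index exactly once over all steps."""
    return all(
        sorted(a_block_column(i, j, k, q) for k in range(q)) == list(range(q))
        for i in range(q) for j in range(q)
    )
```

In the original ordering, every task in row i reads block $A_{i,k}$ at step k, so all q tasks in the row hit the same block at once.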

Techniques to Minimize Interaction Overheads

• Overlap computations with interactions
  – Use non-blocking communication
• Replicate data or computations
  – Replicate a copy of shared data on each process if possible, so that interaction is needed only initially, during replication.
• Use collective interaction operations
• Overlap interactions with other interactions
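The idea of overlapping computation with interaction can be sketched without a message-passing library by letting a background thread stand in for a non-blocking receive. The callables `fetch_block` and `compute` are hypothetical placeholders; in a real message-passing program the prefetch would be a non-blocking communication call instead.

```python
from concurrent.futures import ThreadPoolExecutor


def process_blocks(num_blocks, fetch_block, compute):
    """Compute on block b while block b + 1 is already being fetched."""
    if num_blocks == 0:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_block, 0)  # start the first transfer
        for b in range(num_blocks):
            data = pending.result()            # wait until block b has arrived
            if b + 1 < num_blocks:
                # Start the next transfer before computing on the current block.
                pending = pool.submit(fetch_block, b + 1)
            results.append(compute(data))      # runs while the transfer proceeds
    return results


totals = process_blocks(3, lambda b: [b] * 2, sum)
```

The overlap only pays off when the computation on one block takes roughly as long as fetching the next.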


Parallel Algorithm Models

• Data parallel
  – Each task performs similar operations on different data
  – Tasks are typically mapped to processes statically
• Task graph
  – Use the task-dependency graph to promote locality or reduce interactions
• Master-slave
  – One or more master processes generate tasks
  – Tasks are allocated to slave processes
  – Allocation may be static or dynamic
• Pipeline/producer-consumer
  – Pass a stream of data through a sequence of processes
  – Each process performs some operation on it
• Hybrid
  – Apply multiple models hierarchically, or apply multiple models in sequence to different phases
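As one illustration of these models, the pipeline/producer-consumer pattern can be sketched with one thread per stage connected by queues. Threads stand in for processes, and `None` is used as an end-of-stream marker, so the sketch assumes the data items themselves are never `None`.

```python
import queue
import threading


def stage(func, inbox, outbox):
    """One pipeline stage: apply `func` to each item and pass it on."""
    while True:
        item = inbox.get()
        if item is None:      # end of stream: tell the next stage and stop
            outbox.put(None)
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Stream `items` through the stages given by `funcs`, in order."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[s], queues[s + 1]))
        for s, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for x in items:
        queues[0].put(x)
    queues[0].put(None)
    out = []
    while True:
        y = queues[-1].get()
        if y is None:
            break
        out.append(y)
    for t in threads:
        t.join()
    return out


piped = run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 2])
```

Because each stage is a single thread reading from a FIFO queue, items leave the pipeline in the order they entered.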
