Lecture 4: Principles of Parallel Algorithm Design (part 4)
Mapping Technique for Load Balancing
• Sources of overheads:
  – Inter-process interaction
  – Idling
• Goals to achieve:
  – Reduce interaction time
  – Reduce the total amount of time some processes are idle
  – Remark: these two goals often conflict
• Classes of mapping:
  – Static
  – Dynamic
Schemes for Static Mapping
• Mapping Based on Data Partitioning
• Task-Graph Partitioning
• Hybrid Strategies
Mapping Based on Data Partitioning
• By the owner-computes rule, mapping the relevant data onto processes is equivalent to mapping tasks onto processes.
• Arrays or matrices
  – Block distributions
  – Cyclic and block-cyclic distributions
• Irregular data
  – Example: data associated with an unstructured mesh
  – Graph partitioning
1D Block Distribution
Example. Distribute rows or columns of a matrix to different processes.
Multi-D Block Distribution
Example. Distribute blocks of a matrix to different processes.
Load-Balance for Block Distribution
Example. n × n dense matrix multiplication C = A × B using p processes
– Decomposition based on output data.
– Each entry of C uses the same amount of computation.
– Either a 1D or a 2D block distribution can be used (see the sketch below):
  • 1D distribution: n/p rows are assigned to each process
  • 2D distribution: a block of size (n/√p) × (n/√p) is assigned to each process
– A multi-D distribution allows a higher degree of concurrency.
– A multi-D distribution can also help to reduce interactions.
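To make the two layouts concrete, here is a minimal Python sketch (the function names `block_1d` and `block_2d` are ours, not from the lecture) that computes which rows, or which sub-block, of an n × n matrix a given process owns. It assumes p divides n and, for the 2D case, that p is a perfect square.

```python
import math

def block_1d(rank, n, p):
    """1D block distribution: process `rank` owns n/p consecutive rows."""
    rows = n // p                      # assumes p divides n
    return range(rank * rows, (rank + 1) * rows)

def block_2d(rank, n, p):
    """2D block distribution on a sqrt(p) x sqrt(p) process grid:
    process `rank` owns one (n/sqrt(p)) x (n/sqrt(p)) sub-block."""
    q = math.isqrt(p)                  # assumes p is a perfect square
    b = n // q                         # block edge length
    pi, pj = divmod(rank, q)           # position in the process grid
    return (range(pi * b, (pi + 1) * b),   # owned row indices
            range(pj * b, (pj + 1) * b))   # owned column indices

# Example: n = 8, p = 4
print(list(block_1d(1, 8, 4)))     # rows 2..3
rows, cols = block_2d(3, 8, 4)
print(list(rows), list(cols))      # rows 4..7, cols 4..7
```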
Cyclic and Block-Cyclic Distributions
• If the amount of work differs for different entries of a matrix, a block distribution can lead to load imbalance.
• Example. Doolittle's method of LU factorization of a dense matrix
Doolittle's Method of LU Factorization
$$
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n}\\
a_{21} & a_{22} & \cdots & a_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix}
= LU =
\begin{pmatrix}
1 & 0 & \cdots & 0\\
l_{21} & 1 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
l_{n1} & l_{n2} & \cdots & 1
\end{pmatrix}
\begin{pmatrix}
u_{11} & u_{12} & \cdots & u_{1n}\\
0 & u_{22} & \cdots & u_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & u_{nn}
\end{pmatrix}
$$

By matrix-matrix multiplication:

$u_{1j} = a_{1j}$ for $j = 1, 2, \ldots, n$  (1st row of $U$)
$l_{i1} = a_{i1}/u_{11}$ for $i = 1, 2, \ldots, n$  (1st column of $L$)
For $k = 2, 3, \ldots, n-1$ do
  $u_{kk} = a_{kk} - \sum_{t=1}^{k-1} l_{kt} u_{tk}$
  $u_{kj} = a_{kj} - \sum_{t=1}^{k-1} l_{kt} u_{tj}$ for $j = k+1, \ldots, n$  ($k$th row of $U$)
  $l_{ik} = \left( a_{ik} - \sum_{t=1}^{k-1} l_{it} u_{tk} \right) / u_{kk}$ for $i = k+1, \ldots, n$  ($k$th column of $L$)
End
$u_{nn} = a_{nn} - \sum_{t=1}^{n-1} l_{nt} u_{tn}$
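A minimal serial Python sketch of Doolittle's method following the equations above (the function name `doolittle_lu` is ours; the first row/column and final entry are folded into a single loop over k):

```python
def doolittle_lu(a):
    """Doolittle LU factorization: A = L U with unit diagonal in L.
    `a` is an n x n list of lists; returns (L, U). No pivoting, so it
    assumes all leading principal minors are nonsingular."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):
        L[k][k] = 1.0
        # k-th row of U: u_kj = a_kj - sum_{t<k} l_kt * u_tj
        for j in range(k, n):
            U[k][j] = a[k][j] - sum(L[k][t] * U[t][j] for t in range(k))
        # k-th column of L: l_ik = (a_ik - sum_{t<k} l_it * u_tk) / u_kk
        for i in range(k + 1, n):
            L[i][k] = (a[i][k] - sum(L[i][t] * U[t][k] for t in range(k))) / U[k][k]
    return L, U

L, U = doolittle_lu([[4.0, 3.0], [6.0, 3.0]])
print(L)  # [[1.0, 0.0], [1.5, 1.0]]
print(U)  # [[4.0, 3.0], [0.0, -1.5]]
```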
Serial Column-Based LU
• Remark: Matrices L and U share space with A.
Work Used to Compute Entries of L and U
• A block distribution of the LU factorization tasks leads to load imbalance: the work per entry grows with the elimination step, so processes owning the upper-left blocks finish early while those owning the lower-right blocks carry most of the computation.
Block-Cyclic Distribution
• A variation of the block distribution that can be used to alleviate the load-imbalance problem.
• Steps:
  1. Partition the array into many more blocks than the number of available processes.
  2. Assign blocks to processes in a round-robin manner so that each process gets several non-adjacent blocks (see the sketch below).
(a) The rows of the array are grouped into blocks, each consisting of two rows, resulting in eight blocks of rows. These blocks are distributed to four processes in a wraparound fashion.
(b) The matrix is blocked into 16 blocks, each of size 4×4, and mapped onto a 2×2 grid of processes in a wraparound fashion.
• Cyclic distribution: the special case where the block size is 1.
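A small Python sketch of the round-robin assignment in step 2 (the function name `block_cyclic_owner` is ours, not from the lecture); it maps each block index to a process and shows that every process receives several non-adjacent blocks:

```python
def block_cyclic_owner(block_index, p):
    """Round-robin (wraparound) assignment of blocks to p processes."""
    return block_index % p

# Example from panel (a): 8 row-blocks distributed to 4 processes.
n_blocks, p = 8, 4
for proc in range(p):
    owned = [b for b in range(n_blocks) if block_cyclic_owner(b, p) == proc]
    print(f"process {proc} owns row-blocks {owned}")
# process 0 owns row-blocks [0, 4], process 1 owns [1, 5], ...
```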
Graph Partitioning
• Assign an equal number of nodes (or cells) to each process
• Minimize the edge count of the graph partition (sketched below)
(Figure: random partitioning vs. partitioning for minimizing edge count)
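The second objective is easy to state in code. A minimal Python sketch (the names are ours) that counts the edges cut by a given node-to-process assignment:

```python
def edge_cut(edges, part):
    """Count edges whose endpoints are assigned to different processes.
    `edges` is a list of (u, v) pairs; `part[v]` is v's process."""
    return sum(1 for u, v in edges if part[u] != part[v])

# Example: a 4-node path graph 0-1-2-3 split between 2 processes.
edges = [(0, 1), (1, 2), (2, 3)]
print(edge_cut(edges, {0: 0, 1: 0, 2: 1, 3: 1}))  # 1 (good split)
print(edge_cut(edges, {0: 0, 1: 1, 2: 0, 3: 1}))  # 3 (bad split)
```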
Mappings Based on Task Partitioning
• Mapping based on task partitioning can be used when the computation is naturally expressed as a static task-dependency graph with known task sizes.
• Finding an optimal mapping that minimizes idle time and interaction time is NP-complete.
• Heuristic solutions exist for many structured graphs.
Mapping a Binary Tree Task-Dependency Graph
• Example: the binary-tree task-dependency graph that arises in finding the minimum of a list of numbers.
• The tree graph is mapped onto 8 processes.
• The mapping minimizes interaction overhead by placing independent tasks on the same process (i.e., process 0) and others on processes only one communication link away from each other.
• Idling exists; it is inherent in the graph.
Mapping a Sparse Graph
Example. Sparse matrix-vector multiplication using 3 processes
• A row distribution
• Partitioning the task-interaction graph to reduce interaction overhead
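For reference, a compact Python sketch of the computation being mapped (CSR storage; the function name is ours): a process that owns a set of rows computes the corresponding entries of y = Ab, and needs only the entries of b that its rows reference, which is what the partitioning tries to keep local.

```python
def spmv_rows(indptr, indices, data, b, rows):
    """y[i] = sum_j A[i][j] * b[j] for the rows this process owns.
    (indptr, indices, data) is the CSR form of the sparse matrix."""
    y = {}
    for i in rows:
        y[i] = sum(data[k] * b[indices[k]]
                   for k in range(indptr[i], indptr[i + 1]))
    return y

# 3x3 sparse matrix [[2, 0, 1], [0, 3, 0], [4, 0, 5]] in CSR form.
indptr, indices, data = [0, 2, 3, 5], [0, 2, 1, 0, 2], [2, 1, 3, 4, 5]
b = [1.0, 1.0, 1.0]
print(spmv_rows(indptr, indices, data, b, rows=[0, 1]))  # {0: 3.0, 1: 3.0}
```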
Schemes for Dynamic Mapping
• When static mapping results in a highly imbalanced distribution of work among processes, or when the task-dependency graph is dynamic, use dynamic mapping.
• The primary goal is to balance load (dynamic load balancing).
  – Example: dynamic load balancing for AMR (adaptive mesh refinement)
• Types:
  – Centralized
  – Distributed
Centralized Dynamic Mapping
• Processes:
  – Master: manages the pool of available tasks
  – Slaves: depend on the master to obtain work
• Idea:
  – When a slave process runs out of work, it takes a portion of the available work from the master.
  – When a new task is generated, it is added to the pool of tasks in the master process.
• Potential problem:
  – When many processes are used, the master process may become a bottleneck.
• Solution:
  – Chunk scheduling: every time a process runs out of work, it gets a group of tasks (see the sketch below).
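A minimal shared-memory sketch of the centralized scheme with chunk scheduling, using Python threads and a thread-safe queue as stand-ins for the master's task pool and the slave processes (all names and the chunk size are illustrative assumptions, not from the lecture):

```python
import queue, threading

tasks = queue.Queue()          # master's pool of available tasks
for t in range(100):
    tasks.put(t)

CHUNK = 8                      # chunk scheduling: grab several tasks per request

def slave(rank, results):
    while True:
        chunk = []
        try:
            for _ in range(CHUNK):              # one request fetches a chunk,
                chunk.append(tasks.get_nowait())  # reducing contention at the master
        except queue.Empty:
            pass
        if not chunk:
            return                              # no work left anywhere
        for t in chunk:
            results.append((rank, t * t))       # the "work": square the task id

results = []
workers = [threading.Thread(target=slave, args=(r, results)) for r in range(4)]
for w in workers: w.start()
for w in workers: w.join()
print(len(results))            # 100: every task completed exactly once
```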
Distributed Dynamic Mapping
• All processes are peers. Tasks are distributed among the processes, which exchange tasks at run time to balance work.
• Each process can send work to or receive work from other processes:
  – How are sending and receiving processes paired together?
  – Is the work transfer initiated by the sender or the receiver?
  – How much work is transferred?
  – When is the work transfer performed?
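One common receiver-initiated answer to these questions is work stealing: an idle process picks a peer at random and takes half of its remaining work. A minimal single-threaded Python simulation of that policy (the pairing, initiation, and half-transfer choices here are illustrative assumptions, not prescriptions from the lecture):

```python
import random
from collections import deque

random.seed(0)
p = 4
queues = [deque(range(20)) if r == 0 else deque() for r in range(p)]
done = [0] * p                 # tasks completed per process

while any(queues):
    for r in range(p):         # round-robin "time steps"
        if queues[r]:
            queues[r].popleft()        # execute one local task
            done[r] += 1
        else:                          # idle: receiver-initiated transfer
            victim = random.choice([v for v in range(p) if v != r])
            steal = len(queues[victim]) // 2   # transfer half the victim's work
            for _ in range(steal):
                queues[r].append(queues[victim].pop())

print(done, sum(done))         # load spread across processes; total is 20
```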
Techniques to Minimize Interaction Overheads
• Maximize data locality
  – Maximize the reuse of recently accessed data
  – Minimize the volume of data exchange
    • Use a high-dimensional distribution. Example: 2D block distribution for matrix multiplication
  – Minimize the frequency of interactions
    • Restructure the algorithm so that shared data are accessed and used in large pieces
    • Combine messages between the same source-destination pair
Techniques to Minimize Interaction Overheads
• Minimize contention and hot spots
  – Contention occurs when multiple tasks try to access the same resource concurrently: multiple processes sending messages to the same process, or multiple simultaneous accesses to the same memory block
• Using $C_{i,j} = \sum_{k=0}^{\sqrt{p}-1} A_{i,k} B_{k,j}$ causes contention. For example, $C_{0,0}, C_{0,1}, \ldots, C_{0,\sqrt{p}-1}$ all attempt to read $A_{0,0}$ at once.
• A contention-free alternative is to use
  $C_{i,j} = \sum_{k=0}^{\sqrt{p}-1} A_{i,\,(i+j+k)\,\%\,\sqrt{p}} \; B_{(i+j+k)\,\%\,\sqrt{p},\,j}$
  All tasks $T_{i,*}$ that work on the same row of $C$ access block $A_{i,\,(i+j+k)\,\%\,\sqrt{p}}$, which is different for each task.
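A small Python sketch of the shifted access pattern (block indices only; the function name is ours): at step k it lists the block column of A touched by each task on row i, showing that the contention-free ordering makes them all distinct.

```python
import math

def a_blocks_touched(i, k, p):
    """Block column of A read by each task T[i][j] at step k,
    under the contention-free ordering (i + j + k) % sqrt(p)."""
    q = math.isqrt(p)
    return [(i + j + k) % q for j in range(q)]

p = 16  # 4 x 4 task grid
print(a_blocks_touched(0, 0, p))   # [0, 1, 2, 3]: all distinct, no contention
# Under the naive ordering, every task on row 0 would read A[0][k] at step k.
```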
Techniques to Minimize Interaction Overheads
• Overlap computations with interactions
  – Use non-blocking communication (see the mpi4py sketch below)
• Replicate data or computations
  – Replicate a copy of shared data on each process if possible, so that the only interaction is the initial one during replication.
• Use collective interaction operations
• Overlap interactions with other interactions
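A minimal sketch of the first technique, overlapping computation with communication via non-blocking calls, using mpi4py (this assumes mpi4py is installed and two ranks, e.g. `mpiexec -n 2 python overlap.py`, where the file name is ours):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    req = comm.isend(list(range(1000)), dest=1, tag=7)  # start send, don't wait
    local = sum(x * x for x in range(10**6))            # useful work meanwhile
    req.wait()                                          # complete the send
    print("rank 0 computed", local)
elif rank == 1:
    req = comm.irecv(source=0, tag=7)                   # start receive
    local = sum(x * x for x in range(10**6))            # overlap computation
    data = req.wait()                                   # now block for the data
    print("rank 1 got", len(data), "items; computed", local)
```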
Parallel Algorithm Models
• Data parallel
  – Each task performs similar operations on different data
  – Tasks are typically mapped statically to processes
• Task graph
  – Use the task-dependency graph to promote locality or reduce interactions
• Master-slave
  – One or more master processes generate tasks
  – Tasks are allocated to slave processes
  – Allocation may be static or dynamic
• Pipeline / producer-consumer
  – Pass a stream of data through a sequence of processes
  – Each process performs some operation on the stream (see the sketch below)
• Hybrid
  – Apply multiple models hierarchically, or apply multiple models in sequence to different phases
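As one small illustration, a Python sketch of the pipeline / producer-consumer model using threads and queues as stand-ins for processes (all names are ours): a stream of items passes through two stages, each applying one operation.

```python
import queue, threading

q01 = queue.Queue()   # producer -> stage 1
q12 = queue.Queue()   # stage 1 -> stage 2
DONE = object()       # end-of-stream marker

def producer():
    for x in range(5):
        q01.put(x)
    q01.put(DONE)

def stage(qin, qout, f):
    while (x := qin.get()) is not DONE:
        qout.put(f(x))         # operate on the item, pass it downstream
    qout.put(DONE)

out = queue.Queue()
threads = [threading.Thread(target=producer),
           threading.Thread(target=stage, args=(q01, q12, lambda x: x + 1)),
           threading.Thread(target=stage, args=(q12, out, lambda x: x * 2))]
for t in threads: t.start()
for t in threads: t.join()
print([out.get() for _ in range(5)])   # [2, 4, 6, 8, 10]
```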