Definitions
• A synchronous application is one where all processes must reach certain points before execution continues.
• Local synchronization is a requirement that a subset of processes (usually neighbors) reach a synchronous point before execution continues.
• A barrier is the basic message passing mechanism for synchronizing processes.
• Deadlock occurs when groups of processors are permanently waiting for messages that cannot be satisfied because the sending processes are also permanently waiting for messages.
Barrier Illustration
[Figure: processes reach the barrier at different times; the early arrivals wait while the rest are still executing.]
C: MPI_Barrier(MPI_COMM_WORLD);
Processor code will reach barrier points at different times. This leads to idle time and load imbalance.
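For reference, a minimal C sketch of how the MPI_Barrier call above is typically used (standard MPI C API; the printf calls simply mark where each rank is):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("rank %d reached the barrier\n", rank);  /* ranks arrive at different times */
    MPI_Barrier(MPI_COMM_WORLD);                    /* idle until the slowest rank arrives */
    printf("rank %d released\n", rank);             /* printed only after every rank has arrived */

    MPI_Finalize();
    return 0;
}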
Counter (Linear) Barrier: Implementation

Master processor, O(P) steps:
  FOR (i = 0; i < P; i++)   // Entry phase
    Receive null message from any processor
  FOR (i = 0; i < P; i++)   // Release phase
    Send null message to release slaves

Slave processors:
  Send null message to enter barrier
  Receive null message for barrier release

Barriers consist of two phases: an entry phase and a release (departure) phase.
Note: The two-phase design keeps a processor from entering the next barrier before every processor has been released from the current one.
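A minimal sketch of the counter barrier above using MPI point-to-point calls, assuming rank 0 plays the master and does not message itself (so it counts P-1 arrivals); the null messages are zero-byte sends:

#include <mpi.h>

void counter_barrier(MPI_Comm comm)
{
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);

    if (rank == 0) {
        for (int i = 1; i < P; i++)   /* Entry phase: count arrivals from the slaves */
            MPI_Recv(NULL, 0, MPI_BYTE, MPI_ANY_SOURCE, 0, comm, MPI_STATUS_IGNORE);
        for (int i = 1; i < P; i++)   /* Release phase: free every slave */
            MPI_Send(NULL, 0, MPI_BYTE, i, 1, comm);
    } else {
        MPI_Send(NULL, 0, MPI_BYTE, 0, 0, comm);                     /* enter the barrier */
        MPI_Recv(NULL, 0, MPI_BYTE, 0, 1, comm, MPI_STATUS_IGNORE);  /* wait for the release */
    }
}

Using different tags for the entry and release messages keeps a fast processor's next entry message from being mistaken for a release acknowledgement.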
Tree (Non-Linear) Barrier

[Figure: eight processors P0–P7 combine pairwise up a binary tree in the entry phase; the release fans back down the inverse tree.]
Note: Implementation logic is similar to divide and conquer
The release phase uses the inverse tree construction; the entry and release phases each require O(lg P) steps
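A hedged sketch of the tree barrier: ranks fold pairwise up a binary tree in the entry phase, and the release fans back down the inverse tree, each in O(lg P) steps. The rank arithmetic below (the lowest set bit decides when a rank reports to its parent) is one possible layout, not the only one:

#include <mpi.h>

void tree_barrier(MPI_Comm comm)
{
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);

    /* Entry phase: each rank collects its children, then reports to its parent. */
    for (int step = 1; step < P; step *= 2) {
        if (rank % (2 * step) == 0) {
            if (rank + step < P)   /* child exists: wait for it to arrive */
                MPI_Recv(NULL, 0, MPI_BYTE, rank + step, 0, comm, MPI_STATUS_IGNORE);
        } else if (rank % (2 * step) == step) {
            MPI_Send(NULL, 0, MPI_BYTE, rank - step, 0, comm);  /* report arrival upward */
            break;                                              /* this rank is done climbing */
        }
    }

    /* Release phase: the inverse tree, the root's release fans back down. */
    int top = 1;
    while (top * 2 < P)
        top *= 2;
    for (int step = top; step >= 1; step /= 2) {
        if (rank % (2 * step) == step)
            MPI_Recv(NULL, 0, MPI_BYTE, rank - step, 1, comm, MPI_STATUS_IGNORE);  /* wait for parent */
        else if (rank % (2 * step) == 0 && rank + step < P)
            MPI_Send(NULL, 0, MPI_BYTE, rank + step, 1, comm);  /* release the child */
    }
}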
Butterfly Barrier
– Stage 1: p0↔p1; p2↔p3; p4↔p5; p6↔p7
– Stage 2: p0↔p2; p1↔p3; p4↔p6; p5↔p7
– Stage 3: p0↔p4; p1↔p5; p2↔p6; p3↔p7
[Figure: processors P0–P7 pair up and exchange at each of the three stages.]
Advantages:
• Requires only a single send() and receive() pair per processor at each stage
• Completes in only O(lg P) steps
Note: At stage s, processor p synchronizes with processor (p + 2^(s-1)) mod P
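A minimal sketch of one butterfly barrier, assuming P is a power of two; each stage pairs a processor with the partner shown in the stage listing above (its rank with one bit flipped), and the paired MPI_Sendrecv keeps each exchange deadlock free:

#include <mpi.h>

void butterfly_barrier(MPI_Comm comm)   /* assumes P is a power of two */
{
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);

    for (int stage = 1; stage < P; stage *= 2) {
        int partner = rank ^ stage;   /* the pairing shown in the stage listing above */
        /* One send()/receive() pair per processor per stage; O(lg P) stages in total. */
        MPI_Sendrecv(NULL, 0, MPI_BYTE, partner, 0,
                     NULL, 0, MPI_BYTE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}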
Local Synchronization
• Even-numbered processors (i)
  Send null message to processor i-1
  Receive null message from processor i-1
  Send null message to processor i+1
  Receive null message from processor i+1
• Odd-numbered processors (i)
  Receive null message from processor i+1
  Send null message to processor i+1
  Receive null message from processor i-1
  Send null message to processor i-1
• Notes:
  – Local synchronization is an incomplete barrier: processors exit after receiving messages from their neighbors
  – Reminder: Deadlock can occur with incorrect message passing orders. MPI_Sendrecv() and MPI_Sendrecv_replace() are deadlock free
Synchronize with neighbors before proceeding
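A minimal sketch of local synchronization using the deadlock-free call mentioned above: each processor exchanges null messages only with its immediate neighbors, and MPI_PROC_NULL turns the missing neighbor at either end into a no-op:

#include <mpi.h>

void local_sync(MPI_Comm comm)
{
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);

    int left  = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < P - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Null (zero-byte) messages: nothing is transferred, the calls only order execution. */
    MPI_Sendrecv(NULL, 0, MPI_BYTE, left,  0,
                 NULL, 0, MPI_BYTE, right, 0, comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(NULL, 0, MPI_BYTE, right, 0,
                 NULL, 0, MPI_BYTE, left,  0, comm, MPI_STATUS_IGNORE);
}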
Local Synchronization Example
• Heat Distribution Problem
  – Goal: determine the final temperature at each point of an n x n grid
  – Initial boundary condition: the temperatures at designated points (e.g., the outer rim or an internal heat sink) are known
  – Cannot proceed to the next iteration until local synchronization completes

DO
  Average each grid point with its neighbors
UNTIL temperature changes are small enough

New value = (Σ neighbors) / 4
Sequential Heat Distribution Code

Initialize rows 0, n and columns 0, n of g and h
iteration = 0
DO
  FOR (i = 1; i < n; i++)
    FOR (j = 1; j < n; j++)
      IF (iteration % 2) h[i][j] = (g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1]) / 4
      ELSE               g[i][j] = (h[i-1][j] + h[i+1][j] + h[i][j-1] + h[i][j+1]) / 4
  iteration++
UNTIL max(|g[i][j] - h[i][j]|) < tolerance or iteration > MAX

Notes
• Even iterations update the g array from h; odd iterations update the h array from g
• Recall: odd/even sort
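A C rendering of the sequential pseudocode above is sketched below; the grid size, tolerance, and iteration cap are illustrative values, and the two buffers g and h are alternated exactly as in the pseudocode:

#include <math.h>
#include <string.h>

#define N    8        /* interior is N x N; rows/columns 0 and N+1 hold the boundary */
#define TOL  0.01
#define MAX  1000

void heat(double g[N + 2][N + 2])
{
    static double h[N + 2][N + 2];
    memcpy(h, g, sizeof h);   /* h starts with the same boundary and initial values */

    for (int iteration = 0; iteration < MAX; iteration++) {
        /* Odd iterations update h from g; even iterations update g from h. */
        double (*src)[N + 2] = (iteration % 2) ? g : h;
        double (*dst)[N + 2] = (iteration % 2) ? h : g;
        double maxdiff = 0.0;

        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++) {
                dst[i][j] = (src[i - 1][j] + src[i + 1][j] +
                             src[i][j - 1] + src[i][j + 1]) / 4.0;
                maxdiff = fmax(maxdiff, fabs(dst[i][j] - src[i][j]));
            }

        if (maxdiff < TOL)   /* temperature changes are small enough */
            break;
    }
}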
Block or Strip Partitioning

[Figure: a 4 x 4 block decomposition assigns p0–p15 to square sub-grids; a column-strip decomposition assigns p0–p7 to vertical strips.]
• Block partitioning (allocate in squares)
  – Eight messages exchanged at each iteration
  – Data exchanged per message is n/sqrt(P)
• Strip partitioning
  – Four messages exchanged at each iteration
  – Data exchanged per message is n
• Question: Which is better?

Assign portions of the grid to processors in the topology
Strip versus Block Partitioning
• Characteristics
  – Strip partitioning: generally more data per message, fewer messages
  – Block partitioning: generally less data per message, more messages
  – Choice: low latency favors block; high latency favors strip
• Example: grid is 64 x 64, P = 16
  – Strip partitioning: strips are 4 x 64; 4 x 64 cells transferred per iteration per processor
  – Block partitioning: blocks are 16 x 16; 8 x 16 cells transferred per iteration per processor
• Example: grid is 64 x 64, P = 4
  – Strip partitioning: strips are 16 x 64; 4 x 64 cells transferred per iteration per processor
  – Block partitioning: blocks are 32 x 32; 8 x 32 cells transferred per iteration per processor
Parallel Implementation

Modifications to the sequential algorithm:
• Declare "ghost" rows to hold adjacent data (declare a 10 x 10 array for an 8 x 8 block)
• Exchange data with neighbor processors
• Perform the calculation for the local grid cells

[Figure: processor Pi holds its own block plus ghost copies of the neighboring cells to the north, south, east, and west.]
Heat Distribution Partitioning

SendRcv(row, col, point)
  IF (row, col) is not local
    IF myrank is even
      Send(point, p[row,col]); Recv(point, p[row,col])
    ELSE
      Recv(point, p[row,col]); Send(point, p[row,col])

Main logic
FOR each iteration
  FOR each point, compute its new temperature
  SendRcv(row-1, col, point)
  SendRcv(row+1, col, point)
  SendRcv(row, col-1, point)
  SendRcv(row, col+1, point)
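A hedged sketch of the ghost exchange for the simpler strip (row) case: each rank owns rows 1..local_n of width `width` and keeps ghost rows 0 and local_n+1 for its neighbors' boundary rows; MPI_PROC_NULL lets the top and bottom ranks skip the missing neighbor:

#include <mpi.h>

void exchange_ghost_rows(double *block, int local_n, int width, MPI_Comm comm)
{
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);

    int up   = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < P - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my first real row up; receive the lower neighbor's top row into my bottom ghost row. */
    MPI_Sendrecv(&block[1 * width],             width, MPI_DOUBLE, up,   0,
                 &block[(local_n + 1) * width], width, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);

    /* Send my last real row down; receive the upper neighbor's bottom row into my top ghost row. */
    MPI_Sendrecv(&block[local_n * width],       width, MPI_DOUBLE, down, 1,
                 &block[0],                     width, MPI_DOUBLE, up,   1,
                 comm, MPI_STATUS_IGNORE);
}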
Full Synchronization
• Data Parallel Computations
  – Simultaneously apply the same operation to different data
  – This approach models many numerical computations
  – They are easy to program and scale well to large data sets
• Sequential code
  for (i = 0; i < n; i++) a[i] = someFunction(a[i]);
• Shared memory code
  forall (i = 0; i < n; i++) { bodyOfInstructions }
  – Note: the forall loop semantics imply a natural barrier
• Distributed processing
  for local a[i] { someFunction(a[i]) }; barrier();
Data Parallel Example

[Figure: processors p0, p1, …, pn each update their own element in lock step: A[0] += k, A[1] += k, …, A[n-1] += k.]

• All processors execute instructions in "lock step"
• forall (i = 0; i < n; i++) a[i] += k
• Note: Multi-computers partition the data into coarse-grain blocks
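One possible shared-memory rendering of the forall above in C with OpenMP (an assumption; the slides' forall is generic pseudocode). The implicit barrier at the end of the parallel for plays the role of the "natural barrier" noted earlier:

#include <omp.h>

void add_k(double a[], int n, double k)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] += k;   /* the same operation applied to different data */
    /* Implicit barrier here: no thread proceeds until every a[i] has been updated. */
}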
Prefix-Based Operations
• Definition: Given a set of n values a1, a2, …, an and an associative operation, the operation is applied to each value and all of its predecessors
• Prefix sum: {2, 7, 9, 4} → {2, 9, 18, 22}
• Application: radix sort
• Solution by doubling: an algorithm whose operations proceed in increasing powers of 2
• Example: 1, 2, 4, 8, etc. (each iteration doubles the step)
Prefix Sum by Doubling
• Overview
  1. Each data[i] is added to data[i+1]
  2. Each data[i] is added to data[i+2]
  3. Each data[i] is added to data[i+4]
  4. Each data[i] is added to data[i+8]
  5. Etc.
• Note: Skip the operation if i + increment > array length
Prefix Sum Example

[Table omitted: worked doubling example; * marks a sum that is not added at the next step.]
Sequential time: O(n); parallel time: O((n/P) lg (n/P))
Prefix Sum Parallel Implementation
• Sequential code
  for (j = 0; j < lg(n); j++)
    for (i = 0; i < n - 2^j; i++)
      a[i] += a[i + 2^j];
• Parallel shared-memory fine-grain logic
  for (j = 0; j < lg(n); j++)
    forall (i = 0; i < n - 2^j; i++)
      a[i + 2^j] += a[i];
• Parallel distributed coarse-grain logic
  for (j = 1; j <= lg(n); j++)
    if (myrank >= 2^(j-1))
      receive(sum, myrank - 2^(j-1))
      add sum to processor's data
    else
      send(processor's data, myrank + 2^(j-1))
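A hedged coarse-grain sketch of the distributed logic above, assuming one value per rank: at each step a rank forwards its current partial sum 2^(j-1) ranks up and adds the partial sum arriving from 2^(j-1) ranks below; MPI_Sendrecv with MPI_PROC_NULL partners keeps the pattern deadlock free:

#include <mpi.h>

double prefix_sum(double my_value, MPI_Comm comm)
{
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);

    double sum = my_value;
    for (int d = 1; d < P; d *= 2) {
        double incoming = 0.0;
        int to   = (rank + d < P)  ? rank + d : MPI_PROC_NULL;   /* partner 2^(j-1) above */
        int from = (rank - d >= 0) ? rank - d : MPI_PROC_NULL;   /* partner 2^(j-1) below */

        /* Forward my current partial sum upward; receive one from below in the same call. */
        MPI_Sendrecv(&sum, 1, MPI_DOUBLE, to, 0,
                     &incoming, 1, MPI_DOUBLE, from, 0,
                     comm, MPI_STATUS_IGNORE);
        if (from != MPI_PROC_NULL)
            sum += incoming;   /* add the sum from the processor 2^(j-1) below */
    }
    return sum;   /* after lg(P) steps: the sum over ranks 0..rank */
}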
Synchronous Iteration
• Processes synchronize at each iteration step
• Example: simulation of natural processes
• Shared memory code
  for (j = 0; j < n; j++)
    forall (i = 0; i < N; i++) algorithm(i);
• Distributed memory code
  for (j = 0; j < n; j++) { algorithm(myRank); barrier(); }
Example: n equations, n unknowns

a_{n-1,0} x_0 + a_{n-1,1} x_1 + … + a_{n-1,n-1} x_{n-1} = b_{n-1}
∙∙∙
a_{k,0} x_0 + a_{k,1} x_1 + … + a_{k,n-1} x_{n-1} = b_k
∙∙∙
a_{1,0} x_0 + a_{1,1} x_1 + … + a_{1,n-1} x_{n-1} = b_1
a_{0,0} x_0 + a_{0,1} x_1 + … + a_{0,n-1} x_{n-1} = b_0

• Or we can rewrite each equation as follows:
  x_k = (b_k - a_{k,0} x_0 - … - a_{k,k-1} x_{k-1} - a_{k,k+1} x_{k+1} - … - a_{k,n-1} x_{n-1}) / a_{k,k}
      = (b_k - Σ_{j≠k} a_{k,j} x_j) / a_{k,k}
Jacobi Iteration
• Pseudocode
  xnew_i = initial guess
  DO
    x_i = xnew_i
    xnew_i = calculated next guess
  UNTIL Σ_i |xnew_i - x_i| < tolerance
• Jacobi iteration always converges if:
  a_{k,k} > Σ_{i≠k} a_{i,k}
  (the diagonal value dominates the column sum)
[Figure: x_i approaches the solution; the error shrinks from iteration i to iteration i+1.]
A numerical algorithm to solve N equations with N unknowns
Traditional solutions are O(N^3), or O(N^2) for special cases
Parallel Jacobi Code

xnew_i = b_i
DO for each i
  x_i = xnew_i
  sum = -a[i][i] * x_i
  FOR (j = 0; j < n; j++) sum += a[i][j] * x_j   // loop runs over all j; the diagonal term cancels
  xnew_i = (b_i - sum) / a[i][i]
  allgather(xnew_i)
  barrier()
UNTIL iterations > MAX or Σ_i |xnew_i - x_i| < tolerance
[Figure: allgather() collects xnew_0, xnew_1, …, xnew_{n-1} from the processors into every local copy of x.]
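A hedged MPI sketch of the parallel Jacobi step, assuming one unknown per rank (P = n) and that rank i holds row i of a and the entry b_i. MPI_Allgather distributes the new values (and synchronizes, standing in for the barrier); the global convergence test uses MPI_Allreduce, which goes slightly beyond the pseudocode:

#include <mpi.h>
#include <math.h>

void jacobi(const double a_row[], double b_i, double x[], int n,
            int max_iter, double tol, MPI_Comm comm)
{
    int i;                               /* this rank's unknown index */
    MPI_Comm_rank(comm, &i);

    for (int iter = 0; iter < max_iter; iter++) {
        double sum = -a_row[i] * x[i];   /* cancel the diagonal term ...        */
        for (int j = 0; j < n; j++)
            sum += a_row[j] * x[j];      /* ... so this loop can run over all j */
        double xnew = (b_i - sum) / a_row[i];

        double diff = fabs(xnew - x[i]), total;
        MPI_Allgather(&xnew, 1, MPI_DOUBLE, x, 1, MPI_DOUBLE, comm);   /* xnew -> everyone's x */
        MPI_Allreduce(&diff, &total, 1, MPI_DOUBLE, MPI_SUM, comm);    /* total change         */
        if (total < tol)
            break;
    }
}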
Additional Jacobi Notes
• If P (processor count) < n, allocate blocks of variables to processors
• Block allocation: allocate consecutive x_i to processors
• Cyclic allocation
  – Allocate x_0, x_P, … to p0
  – Allocate x_1, x_{P+1}, … to p1, etc.
• Question: Which allocation scheme is better?
Jacobi Performance

[Figure: execution time versus number of processors (4 to 24), with separate curves for computation and communication.]
Cellular Automata

Definition
• The system has a finite grid of cells
• Each cell can assume a finite number of states
• Cells change state according to a well-defined rule set
• All cell state changes occur simultaneously
• The system iterates through a number of generations

Serious Applications
• Fluid and gas dynamics
• Biological growth
• Airplane wing airflow
• Erosion modeling
• Groundwater pollution

Fun Applications
• Game of Life
• Sharks and Fishes
• Foxes and Rabbits
• Gaming applications

Note: Animations of these systems can lead to interesting insights
Conway's Game of Life
• The grid (world) is a two-dimensional array of cells
• Note: The grid ends can optionally wrap around (like a torus)
• Each cell
  – Can hold one "organism"
  – Has eight neighbor cells: north, northeast, east, southeast, south, southwest, west, northwest
• Rules (run the simulation over many generations)
  1. An organism dies (loneliness) if fewer than 2 organisms live in neighboring cells
  2. An organism survives if 2 or 3 organisms live in neighboring cells
  3. An empty cell with exactly 3 living neighbors gives birth to an organism in that cell
  4. An organism dies (overpopulation) if 4 or more organisms live in neighboring cells
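A minimal sketch of one generation on a small wrap-around grid, following the rules above; every new state is computed from the old grid so all changes take effect simultaneously:

#define N 16   /* illustrative grid size */

void next_generation(int grid[N][N], int next[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int live = 0;
            /* Count the eight neighbors, wrapping around like a torus. */
            for (int di = -1; di <= 1; di++)
                for (int dj = -1; dj <= 1; dj++) {
                    if (di == 0 && dj == 0)
                        continue;
                    live += grid[(i + di + N) % N][(j + dj + N) % N];
                }
            if (grid[i][j])
                next[i][j] = (live == 2 || live == 3);   /* survives, otherwise dies */
            else
                next[i][j] = (live == 3);                /* birth in this cell       */
        }
}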
Sharks and Fishes
• The grid (ocean) is modeled by a three-dimensional array
• Note: The grid ends can optionally wrap around (like a torus)
• Each cell
  – Can hold either a fish or a shark, but not both
  – Has twenty-six adjacent cubic cells
• Rules for fish
  1. Fish move randomly to empty adjacent cells
  2. If there are no empty adjacent cells, fish stay put
  3. Fish of breeding age leave a baby fish in the vacated cell
  4. Fish die after some fixed (or random) number of generations
• Rules for sharks
  1. Sharks move randomly to adjacent cells that don't contain sharks
  2. If they enter a cell containing a fish, they eat the fish
  3. Sharks stay put when all adjacent cells contain sharks
  4. Sharks of breeding age leave a baby shark in the vacated cell
  5. Sharks die (starvation) if they don't eat a fish for some fixed (or random) number of generations