Design of parallel algorithms
Matrix operations
J. Porras
Matrix x vector
• Sequential approach MAT_VECT(A,x,y)
for (i = 0; i < n; i++) {
    y[i] = 0;
    for (j = 0; j < n; j++) {
        y[i] = y[i] + A[i,j] * x[j];
    }
}
• Work = Θ(n2)
Parallelization of matrix operations
Matrix x vector
• Three ways to implement
– rowwise striping
– columnwise striping
– checkerboarding
• DRAW each of these approaches !
Rowwise striping
• n x n matrix is distributed onto n processors (one row each)
• n x 1 vector is distributed onto n processors (one element each)
• All processors need the whole vector so all-to-all broadcast is required
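The rowwise scheme can be simulated sequentially to check the idea (a Python sketch of my own, not from the slides): each "processor" owns one row, and the all-to-all broadcast is modeled by giving every processor a copy of the full vector.

```python
# Sequential simulation of rowwise striping: "processor" i owns row A[i]
# and vector element x[i]; the all-to-all broadcast gives everyone all of x.
def matvec_rowwise(A, x):
    n = len(A)
    full_x = list(x)  # state after the all-to-all broadcast
    # each processor i now computes its own y[i] in Theta(n) time
    return [sum(A[i][j] * full_x[j] for j in range(n)) for i in range(n)]
```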
Rowwise striping
• All-to-all broadcast requires Θ(n) time
• One row takes Θ(n) time for multiplications
• Rows are calculated in parallel, thus the total time is Θ(n) and the work Θ(n2)
– Algorithm is cost-optimal
Block striping
• Assume that p < n and the matrix is partitioned by using block striping
• All processors contain n/p rows and n/p elements of the vector
• All processors require the whole vector thus all-to-all broadcast is required (message size n/p)
Block striping in hypercube
• All-to-all broadcast in hypercube with n/p-sized messages takes
ts log p + tw(n/p)(p-1)
• If p is considered large enough:
ts log p + tw n
• Multiplication requires n2/p time (n/p rows to multiply with the vector)
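The block-striped variant can also be sketched sequentially (my illustration, assuming p divides n): each of the p "processors" owns n/p rows and an n/p-sized piece of the vector, and the all-to-all broadcast reassembles the full vector.

```python
# Hypothetical sketch of block striping with p < n: processor "proc" owns
# rows proc*b .. (proc+1)*b - 1 of A and the matching n/p vector elements.
def matvec_block_striped(A, x, p):
    n = len(A)
    b = n // p  # rows and vector elements per processor (p must divide n)
    # all-to-all broadcast of the n/p-sized vector pieces
    full_x = []
    for proc in range(p):
        full_x.extend(x[proc * b:(proc + 1) * b])
    # each processor multiplies its n/p rows with the vector: n^2/p work each
    return [sum(A[i][j] * full_x[j] for j in range(n)) for i in range(n)]
```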
Block striping in hypercube
• Parallel execution time TP = n2/p + ts log p + tw n
• Cost pTP = n2 + ts p log p + tw n p
• Algorithm is cost-optimal if p = O(n)
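As a numerical sanity check (my sketch with illustrative constants ts and tw), the cost pTP expands to n2 + ts p log p + tw n p:

```python
import math

# The slide's cost model for block striping on a hypercube:
# TP = n^2/p + ts*log2(p) + tw*n, cost = p*TP.
def t_parallel(n, p, ts=1.0, tw=1.0):
    return n * n / p + ts * math.log2(p) + tw * n

def cost(n, p, ts=1.0, tw=1.0):
    # expands term-by-term to n^2 + ts*p*log2(p) + tw*n*p
    return p * t_parallel(n, p, ts, tw)
```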
Block striping in mesh
• All-to-all broadcast in mesh with wraparound takes 2ts(√p - 1) + tw(n/p)(p-1)
• Parallel execution requires TP = n2/p + 2ts(√p - 1) + tw n
Scalability of block striping
• Overhead (T0 = pTP – W)
T0 = ts p log p + tw n p
• Isoefficiency (W = K T0) for hypercube
W = K ts p log p
W = K tw n p
• Since W = n2, the tw term gives n = K tw p, thus
W = K2 tw2 p2
Scalability of block striping
• Because p = O(n): n = Ω(p), n2 = Ω(p2), so W = Ω(p2)
• This equation gives the highest asymptotic rate at which the problem size must increase with the number of processors to maintain fixed efficiency
Scalability of block striping
• Isoefficiency in hypercube is Θ(p2).
• Similar analysis for the mesh architecture gives the same value Θ(p2).
• Thus, with striped partitioning, scalability is no better on a hypercube than on a mesh
Checkerboard
• n x n matrix is partitioned onto n2 processors (one element per processor)
• n x 1 vector is located on the last column (or on the diagonal)
• Vector is distributed to the corresponding processors
• Calculate multiplications in parallel and collect results with single-node accumulation into the last processor
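The fine-grain checkerboard steps can be simulated sequentially (my sketch, one "processor" per matrix element):

```python
# Fine-grain checkerboard simulation: processor (i, j) holds A[i][j].
def matvec_checkerboard(A, x):
    n = len(A)
    # step 1: one-to-one sends put x[j] on the diagonal processor (j, j)
    diag = {j: x[j] for j in range(n)}
    # step 2: one-to-all broadcast down each column j: every (i, j) gets x[j]
    local = [[diag[j] for j in range(n)] for i in range(n)]
    # each processor (i, j) performs its single multiplication
    prod = [[A[i][j] * local[i][j] for j in range(n)] for i in range(n)]
    # step 3: single-node accumulation along each row into the last column
    return [sum(prod[i]) for i in range(n)]
```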
Checkerboard
• Three communication steps are required
– One-to-one communication to send the vector onto the diagonal
– One-to-all broadcast to distribute the elements of the vector
– Single-node accumulation to sum the partial results
Checkerboard
• Mesh requires Θ(n) time for all the operations (SF) and hypercube Θ(log n)
• Multiplication happens in constant time
• Parallel execution time is Θ(n) in mesh and Θ(log n) in hypercube architecture
• Cost is Θ(n3) for the mesh and Θ(n2 log n) for the hypercube
• Algorithms are not cost-optimal
Checkerboard p < n2
• Cost-optimality can be achieved if the granularity is increased
• Consider a two-dimensional mesh of p processors in which each processor stores an (n/√p) x (n/√p) block of the matrix
• Similarly, the vector is distributed in pieces of n/√p elements
Checkerboard p < n2
• Vector elements are sent to the diagonal processors
• Vector elements are broadcast to the other processors in the column
• Each processor performs n2/p multiplications and sums them into n/√p partial results
• Partial sums are collected with single-node accumulation
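These steps can be simulated sequentially (my sketch, assuming √p divides n): the √p x √p grid of block "processors" each multiplies its block with the broadcast vector piece, and the partial sums are accumulated per row.

```python
# Checkerboard with p < n^2: a sqrt(p) x sqrt(p) grid of block processors,
# each holding an (n/sqrt(p)) x (n/sqrt(p)) block of A.
def matvec_checkerboard_blocks(A, x, p):
    n = len(A)
    q = int(round(p ** 0.5))  # sqrt(p) processors per mesh dimension
    b = n // q                # block edge length (q must divide n)
    y = [0] * n
    for r in range(q):
        for c in range(q):
            xs = x[c * b:(c + 1) * b]  # vector piece after the column broadcast
            for i in range(r * b, (r + 1) * b):
                # n^2/p multiply-adds per processor, summed into partial results
                y[i] += sum(A[i][c * b + k] * xs[k] for k in range(b))
    return y  # models the single-node accumulation of partial sums
```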
Scalability of checkerboard p < n2
• Assume that the processors are connected in a two-dimensional √p x √p cut-through routing mesh (no wraparound)
• The send to the diagonal takes
ts + tw n/√p + th √p
• One-to-all broadcast in columns takes
(ts + tw n/√p) log √p + th √p
Scalability of checkerboard p < n2
• Single-node accumulation takes
(ts + tw n/√p) log √p + th √p
• Multiplications in each processor take n2/p time
• Thus
TP = n2/p + ts log p + (tw n/√p) log p + 3th √p
• T0 = pTP - W gives for the overhead:
T0 = ts p log p + tw n √p log p + 3th p3/2
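The overhead expression can be checked numerically (my sketch with illustrative constants): pTP - n2 should equal ts p log p + tw n √p log p + 3th p3/2 term by term.

```python
import math

# Parallel time for the blocked checkerboard scheme from the slides.
def T_P(n, p, ts, tw, th):
    return (n * n / p + ts * math.log2(p)
            + (tw * n / math.sqrt(p)) * math.log2(p)
            + 3 * th * math.sqrt(p))

# Claimed overhead T0 = p*TP - W with W = n^2.
def T_0(n, p, ts, tw, th):
    return (ts * p * math.log2(p)
            + tw * n * math.sqrt(p) * math.log2(p)
            + 3 * th * p * math.sqrt(p))
```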
Scalability of checkerboard p < n2
• Isoefficiency for ts:
W = K ts p log p
• Isoefficiency for tw:
W = n2 = K tw n √p log p
n = K tw √p log p
n2 = K2 tw2 p log2 p
W = K2 tw2 p log2 p
• Isoefficiency for th:
W = 3K th p3/2
Scalability of checkerboard p < n2
• If p = O(n2), then n2 = Ω(p) and W = Ω(p)
• The tw and th terms dominate the ts term
Scalability of checkerboard p < n2
• Concentrate on the th term Θ(p3/2) and the tw term Θ(p log2 p)
• Because p3/2 > p log2 p only for p > 65536, either term can dominate
• Assume that the Θ(p log2 p) term dominates
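The crossover point can be verified directly (my sketch): p3/2 exceeds p log2 p exactly when √p > log2 p, which first happens beyond p = 65536.

```python
import math

def th_term(p):
    # p^(3/2), written as p * sqrt(p) to stay exact for powers of two
    return p * math.sqrt(p)

def tw_term(p):
    # p * (log2 p)^2
    return p * math.log2(p) ** 2
```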
Scalability of checkerboard p < n2
• The maximum number of processors that can be used cost-optimally for problem size W is determined by
p log2 p = O( n2 )
• Taking logarithms:
log p + 2 log log p = O( log n )
log p = O( log n )
Scalability of checkerboard p < n2
• Substitute log n for log p:
p log2 n = O( n2 )
p = O( n2 / log2 n )
• This p gives the upper limit for the number of processors that can be used cost-optimally
SF and CT
• Parallel execution takes n2/p + 2ts√p + 3tw n time on a p-processor mesh with SF routing (isoefficiency Θ(p2) due to tw)
• CT routing performs much better
• Note that this is true for cases with several elements per processor
• HOW about the fine-grain case?
Striped and checkerboard
• Comparison shows that checkerboarding is faster than the striped approach with the same number of processors
• If p > n, striped approach is not available
• How about the effect of architecture ?
• Scalability ?
• Isoefficiency ?
Sequential matrix multiplication
• Procedure MAT_MULT(A,B,C)
for i := 0 to n-1 do
  for j := 0 to n-1 do
    C[i,j] := 0;
    for k := 0 to n-1 do
      C[i,j] := C[i,j] + A[i,k] B[k,j]
• Θ(n3) work (Strassen's algorithm has better complexity)
Block approach
• n/q x n/q submatrices
• Procedure BLOCK_MAT_MULT(A,B,C)
for i := 0 to q-1 do
  for j := 0 to q-1 do
    Initialize Ci,j to zero
    for k := 0 to q-1 do
      Ci,j := Ci,j + Ai,k Bk,j
• Same complexity Θ(n3)
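The block procedure can be checked against the element-wise triple loop (a Python sketch of my own, assuming q divides n):

```python
# Element-wise sequential matrix multiplication (MAT_MULT).
def mat_mult(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Blocked multiplication (BLOCK_MAT_MULT) with q x q blocks of edge b = n/q.
def block_mat_mult(A, B, q):
    n = len(A)
    b = n // q
    C = [[0] * n for _ in range(n)]
    for bi in range(q):
        for bj in range(q):
            for bk in range(q):  # C(bi,bj) += A(bi,bk) * B(bk,bj) blockwise
                for i in range(bi * b, (bi + 1) * b):
                    for j in range(bj * b, (bj + 1) * b):
                        C[i][j] += sum(A[i][bk * b + t] * B[bk * b + t][j]
                                       for t in range(b))
    return C
```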
Simple parallel approach
• Matrices A and B are partitioned into p blocks of size (n/√p) x (n/√p)
• Map onto a √p x √p mesh
• Processors P0,0 ... P√p-1,√p-1
• Pi,j stores Ai,j and Bi,j and computes Ci,j
• Ci,j requires Ai,k and Bk,j for all k
• A blocks need to be communicated within rows
• B blocks are communicated within columns
Performance on hypercube
• Requires 2 all-to-all broadcasts (rows and columns)
• Message size n2/p
• tc = 2(ts log √p + tw(n2/p)(√p - 1))
• tm = √p (n/√p)3 = n3/p
• TP = n3/p + ts log p + 2tw n2/√p, for p >> 1
Performance on mesh
• Store-and-forward routing
• tc = 2(ts√p + tw n2/√p)
• tm = √p (n/√p)3 = n3/p
• TP = n3/p + 2ts√p + 2tw n2/√p
Cannon's algorithm
• Partition into blocks as usual
• Processors P0,0 ... P√p-1,√p-1
• Pi,j contains Ai,j and Bi,j
• Rotate the blocks!
• A blocks to the left
• B blocks upwards
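A sequential simulation of Cannon's algorithm on a q x q block grid, with q = √p (my sketch, assuming q divides n): after the initial skew, each of the q rotation steps multiplies aligned blocks and shifts A left and B up.

```python
# Cannon's algorithm simulated sequentially on a q x q grid of blocks.
def cannon(A, B, q):
    n = len(A)
    b = n // q
    def block(M, r, c):  # extract the b x b block at grid position (r, c)
        return [[M[r * b + i][c * b + j] for j in range(b)] for i in range(b)]
    Ab = [[block(A, r, c) for c in range(q)] for r in range(q)]
    Bb = [[block(B, r, c) for c in range(q)] for r in range(q)]
    # initial alignment: shift A's row r left by r, B's column c up by c
    Ab = [[Ab[r][(c + r) % q] for c in range(q)] for r in range(q)]
    Bb = [[Bb[(r + c) % q][c] for c in range(q)] for r in range(q)]
    C = [[0] * n for _ in range(n)]
    for _ in range(q):
        for r in range(q):
            for c in range(q):
                for i in range(b):  # local block multiply-accumulate
                    for j in range(b):
                        C[r * b + i][c * b + j] += sum(
                            Ab[r][c][i][t] * Bb[r][c][t][j] for t in range(b))
        # rotate: A blocks one step left, B blocks one step up
        Ab = [[Ab[r][(c + 1) % q] for c in range(q)] for r in range(q)]
        Bb = [[Bb[(r + 1) % q][c] for c in range(q)] for r in range(q)]
    return C
```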
Fox’s algorithm
• Partition to blocks as usual
• Pi,j contains Ai,j and Bi,j
• Uses one-to-all broadcasts, √p iterations
• (1) broadcast the selected A block to its row
• (2) multiply by B
• (3) send B upwards
• (4) select Ai,(j+1) mod √p
DNS
• Dekel, Nassimi and Sahni
• n3 processors available
• Uses a 3D structure
• Pi,j,k computes A[i,k] x B[k,j]
• C[i,j] is the sum of the partial products in Pi,j,0 ... Pi,j,n-1
• Θ(log n) time
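The DNS idea can be sketched sequentially (my illustration): with n3 logical "processors", processor (i, j, k) forms one product A[i][k]·B[k][j], and the products are accumulated along the k dimension.

```python
# DNS scheme simulated sequentially with n^3 logical processors.
def dns(A, B):
    n = len(A)
    # after the replication steps, (i, j, k) holds one partial product
    prod = [[[A[i][k] * B[k][j] for k in range(n)] for j in range(n)]
            for i in range(n)]
    # single-node accumulation along k (Theta(log n) time in parallel)
    return [[sum(prod[i][j]) for j in range(n)] for i in range(n)]
```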
DNS for hypercube
• The 3D structure is mapped onto a hypercube with n3 = 23d processors
• Processor Pi,j,0 contains A[i,j] and B[i,j]
• 3 steps
• (1) move A & B to correct plane
• (2) replicate on each plane
• (3) single node accumulation
DNS with p < n3 processors
• Processors p = q3, q < n
• Partition matrices into (n/q) x (n/q) blocks
• Matrices then consist of q x q submatrices
• Since 1 <= q <= n, p ranges from 1 to n3