05 products
TRANSCRIPT
Parallel Numerical Algorithms
Chapter 5: Vector and Matrix Products
Prof. Michael T. Heath
Department of Computer Science
University of Illinois at Urbana-Champaign
CSE 512 / CS 554
Michael T. Heath, Parallel Numerical Algorithms, 1 / 79
Outline
1 Inner Product
2 Outer Product
3 Matrix-Vector Product
4 Matrix-Matrix Product
Basic Linear Algebra Subprograms
Basic Linear Algebra Subprograms (BLAS) are building
blocks for many other matrix computations
BLAS encapsulate basic operations on vectors and
matrices so they can be optimized for a particular computer
architecture while the high-level routines that call them remain
portable
BLAS offer good opportunities for optimizing utilization of
memory hierarchy
Generic BLAS are available from netlib, and many
computer vendors provide custom versions optimized for
their particular systems
Examples of BLAS
Level   Work     Examples   Function
1       O(n)     saxpy      Scalar × vector + vector
                 sdot       Inner product
                 snrm2      Euclidean vector norm
2       O(n²)    sgemv      Matrix-vector product
                 strsv      Triangular solution
                 sger       Rank-one update
3       O(n³)    sgemm      Matrix-matrix product
                 strsm      Multiple triangular solutions
                 ssyrk      Rank-k update
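As a quick illustration, the three BLAS levels map naturally onto array operations; the sketch below uses NumPy stand-ins for the single-precision routines named above (NumPy here is an assumption for illustration, not part of the original slides).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
alpha = 2.0
x, y = rng.standard_normal(n), rng.standard_normal(n)
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))

# Level 1, O(n) work: axpy (alpha*x + y), dot product, 2-norm
z = alpha * x + y                 # saxpy
s = x @ y                         # sdot
r = np.linalg.norm(x)             # snrm2

# Level 2, O(n^2) work: matrix-vector product (sgemv)
v = A @ x

# Level 3, O(n^3) work: matrix-matrix product (sgemm)
C = A @ B
```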
Simplifying Assumptions
For problem of dimension n using p processes, assume p (or in some cases √p) divides n
For 2-D mesh, assume p is perfect square and mesh is √p × √p
For hypercube, assume p is power of two
Assume matrices are square, n × n, not rectangular
Dealing with general cases where these assumptions do not hold is straightforward but tedious, and complicates notation
Caveat: your mileage may vary, depending on
assumptions about target system, such as level of
concurrency in communication
Inner Product
Inner product of two n-vectors x and y given by

    x^T y = Σ_{i=1}^{n} x_i y_i

Computation of inner product requires n multiplications and n − 1 additions

For simplicity, model serial time as

    T1 = tc n

where tc is time for one scalar multiply-add operation
Parallel Algorithm
Partition
For i = 1, . . . , n, fine-grain task i stores xi and yi, and
computes their product xi yi
Communicate
Sum reduction over n fine-grain tasks
(Figure: 1-D array of fine-grain tasks, task i holding product x_i y_i, i = 1, . . . , 9)
Fine-Grain Parallel Algorithm
z = x_i y_i                   { local scalar product }
reduce z across all tasks     { sum reduction }
Agglomeration and Mapping
Agglomerate
Combine k components of both x and y to form each coarse-grain task, which computes inner product of these subvectors
Communication becomes sum reduction over n/k coarse-grain tasks
Map
Assign (n/k)/p coarse-grain tasks to each of p processes, for total of n/p components of x and y per process
(Figure: fine-grain tasks combined into coarse-grain tasks, each summing its local products x_i y_i)
Coarse-Grain Parallel Algorithm
z = x[i]^T y[i]                   { local inner product }
reduce z across all processes     { sum reduction }

x[i] means subvector of x assigned to process i by mapping
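A serial sketch of this coarse-grain algorithm, with the "processes" simulated by array slicing; block mapping with p dividing n is assumed, as in the slides.

```python
import numpy as np

# Split x and y into p contiguous subvectors x[i], y[i] (block mapping),
# let each simulated process form its local inner product, then "reduce"
# by summing the partial results.
rng = np.random.default_rng(1)
n, p = 12, 4                      # assume p divides n, as in the slides
b = n // p
x, y = rng.standard_normal(n), rng.standard_normal(n)

partials = [x[i * b:(i + 1) * b] @ y[i * b:(i + 1) * b] for i in range(p)]
z = sum(partials)                 # sum reduction across processes

assert np.isclose(z, x @ y)
```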
Performance
Time for computation phase is
Tcomp = tc n/p
regardless of network
Depending on network, time for communication phase is
    1-D mesh:  Tcomm = (ts + tw) (p − 1)
    2-D mesh:  Tcomm = (ts + tw) 2(√p − 1)
    hypercube: Tcomm = (ts + tw) log p
For simplicity, ignore cost of additions in reduction, which is
usually negligible
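The three communication models can be compared directly; the parameter values tc, ts, tw below are illustrative assumptions, not measurements from any particular machine.

```python
from math import log2, sqrt

# Evaluate the modeled total time Tp = Tcomp + Tcomm for the inner
# product on each network.
def inner_product_time(n, p, tc=1e-9, ts=1e-6, tw=1e-8):
    comp = tc * n / p
    return {
        "1-D mesh":  comp + (ts + tw) * (p - 1),
        "2-D mesh":  comp + (ts + tw) * 2 * (sqrt(p) - 1),
        "hypercube": comp + (ts + tw) * log2(p),
    }

times = inner_product_time(n=10**6, p=64)
# Richer networks reduce the reduction cost: hypercube < 2-D mesh < 1-D mesh
assert times["hypercube"] < times["2-D mesh"] < times["1-D mesh"]
```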
Scalability for 1-D Mesh
For 1-D mesh, total time is

    Tp = tc n/p + (ts + tw) (p − 1)

To determine isoefficiency function, set

    T1 ≥ E (p Tp)
    tc n ≥ E (tc n + (ts + tw) p (p − 1))

which holds if n = Θ(p²), so isoefficiency function is Θ(p²), since T1 = Θ(n)
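A quick numerical check of the isoefficiency claim, with illustrative (assumed) machine parameters: growing n as a constant times p² keeps the efficiency E = T1/(p Tp) roughly constant instead of decaying to zero.

```python
# On a 1-D mesh, if n grows as c * p^2, the efficiency approaches a constant.
tc, ts, tw, c = 1e-9, 1e-6, 1e-8, 10**4   # illustrative parameters

def efficiency(n, p):
    T1 = tc * n
    Tp = tc * n / p + (ts + tw) * (p - 1)
    return T1 / (p * Tp)

effs = [efficiency(c * p * p, p) for p in (4, 16, 64, 256)]
# Efficiency settles near tc*c / (tc*c + ts + tw)
limit = tc * c / (tc * c + ts + tw)
assert all(abs(e - limit) < 0.1 for e in effs)
```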
Scalability for 2-D Mesh
For 2-D mesh, total time is

    Tp = tc n/p + (ts + tw) 2(√p − 1)

To determine isoefficiency function, set

    tc n ≥ E (tc n + (ts + tw) 2p (√p − 1))

which holds if n = Θ(p^{3/2}), so isoefficiency function is Θ(p^{3/2}), since T1 = Θ(n)
Scalability for Hypercube
For hypercube, total time is

    Tp = tc n/p + (ts + tw) log p

To determine isoefficiency function, set

    tc n ≥ E (tc n + (ts + tw) p log p)

which holds if n = Θ(p log p), so isoefficiency function is Θ(p log p), since T1 = Θ(n)
Optimality for 1-D Mesh
To determine optimal number of processes for given n, take p to be continuous variable and minimize Tp with respect to p

For 1-D mesh,

    dTp/dp = d/dp [ tc n/p + (ts + tw) (p − 1) ]
           = −tc n/p² + (ts + tw) = 0

implies that optimal number of processes is

    p = √( tc n / (ts + tw) )
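The closed form can be sanity-checked by brute force; tc, ts, tw, and n below are assumed values for illustration.

```python
from math import sqrt

# Verify the closed-form optimum p* = sqrt(tc*n/(ts+tw)) for the 1-D mesh
# by brute-force minimization of Tp over integer p.
tc, ts, tw, n = 1e-9, 1e-7, 1e-8, 10**8

def Tp(p):
    return tc * n / p + (ts + tw) * (p - 1)

p_star = sqrt(tc * n / (ts + tw))
p_best = min(range(1, 10**4), key=Tp)
assert abs(p_best - p_star) <= 1.0   # discrete optimum sits next to p*
```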
Optimality for 1-D Mesh
If n < (ts + tw)/tc , then only one process should be used
Substituting optimal p into formula for Tp shows that optimal time to compute inner product grows as √n with increasing n on 1-D mesh
Optimality for Hypercube
For hypercube,

    dTp/dp = d/dp [ tc n/p + (ts + tw) log p ]
           = −tc n/p² + (ts + tw)/p = 0

implies that optimal number of processes is

    p = tc n / (ts + tw)

and optimal time grows as log n with increasing n
Outer Product
Outer product of two n-vectors x and y is n × n matrix Z = x y^T whose (i, j) entry is z_ij = x_i y_j

For example,

    [x1]   [y1]T   [x1y1  x1y2  x1y3]
    [x2]   [y2]  = [x2y1  x2y2  x2y3]
    [x3]   [y3]    [x3y1  x3y2  x3y3]

Computation of outer product requires n² multiplications, so model serial time as

    T1 = tc n²

where tc is time for one scalar multiplication
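In code, the entrywise definition z_ij = x_i y_j matches numpy.outer directly:

```python
import numpy as np

# The outer product Z = x y^T entrywise: z_ij = x_i * y_j.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

Z = np.outer(x, y)                # n x n result from two n-vectors
assert Z.shape == (3, 3)
assert Z[1, 2] == x[1] * y[2]     # z_23 = x_2 * y_3 in the text's 1-based indices
```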
Parallel Algorithm
Partition
For i, j = 1, . . . , n, fine-grain task (i, j) computes and stores z_ij = x_i y_j, yielding 2-D array of n² fine-grain tasks

Assuming no replication of data, at most 2n fine-grain tasks store components of x and y, say either
    for some j, task (i, j) stores x_i and task (j, i) stores y_i, or
    task (i, i) stores both x_i and y_i, i = 1, . . . , n

Communicate

For i = 1, . . . , n, task that stores x_i broadcasts it to all other tasks in ith task row
For j = 1, . . . , n, task that stores y_j broadcasts it to all other tasks in jth task column
Fine-Grain Tasks and Communication
(Figure: 6 × 6 array of fine-grain tasks, task (i, j) holding x_i y_j, with x broadcast along task rows and y along task columns)
Fine-Grain Parallel Algorithm
broadcast x_i to tasks (i, k), k = 1, . . . , n    { horizontal broadcast }
broadcast y_j to tasks (k, j), k = 1, . . . , n    { vertical broadcast }
z_ij = x_i y_j                                     { local scalar product }
Agglomeration
Agglomerate
With n × n array of fine-grain tasks, natural strategies are

2-D: Combine k × k subarray of fine-grain tasks to form each coarse-grain task, yielding (n/k)² coarse-grain tasks
1-D column: Combine n fine-grain tasks in each column into coarse-grain task, yielding n coarse-grain tasks
1-D row: Combine n fine-grain tasks in each row into coarse-grain task, yielding n coarse-grain tasks
2-D Agglomeration
Each task that stores portion of x must broadcast its
subvector to all other tasks in its task row
Each task that stores portion of y must broadcast its
subvector to all other tasks in its task column
2-D Agglomeration
(Figure: 2-D agglomeration of the 6 × 6 task array into coarse-grain tasks)
1-D Agglomeration
If either x or y stored in one task, then broadcast required
to communicate needed values to all other tasks
If either x or y distributed across tasks, then multinode
broadcast required to communicate needed values to other
tasks
1-D Column Agglomeration
(Figure: 1-D column agglomeration of the 6 × 6 task array)
1-D Row Agglomeration
(Figure: 1-D row agglomeration of the 6 × 6 task array)
Mapping
Map
2-D: Assign (n/k)²/p coarse-grain tasks to each of p processes using any desired mapping in each dimension, treating target network as 2-D mesh

1-D: Assign n/p coarse-grain tasks to each of p processes using any desired mapping, treating target network as 1-D mesh
2-D Agglomeration with Block Mapping
(Figure: 2-D agglomeration with block mapping of coarse-grain tasks to processes)
1-D Column Agglomeration with Block Mapping
(Figure: 1-D column agglomeration with block mapping)
1-D Row Agglomeration with Block Mapping
(Figure: 1-D row agglomeration with block mapping)
Scalability
Time for computation phase is

    Tcomp = tc n²/p

regardless of network or agglomeration scheme

For 2-D agglomeration on 2-D mesh, communication time is at least

    Tcomm = (ts + tw n/√p) (√p − 1)

assuming broadcasts can be overlapped
Scalability for 2-D Mesh
Total time for 2-D mesh is at least

    Tp = tc n²/p + (ts + tw n/√p) (√p − 1)
       ≈ tc n²/p + ts √p + tw n

To determine isoefficiency function, set

    tc n² ≥ E (tc n² + ts p^{3/2} + tw n p)

which holds for large p if n = Θ(p), so isoefficiency function is Θ(p²), since T1 = Θ(n²)
Scalability for Hypercube
Total time for hypercube is at least

    Tp = tc n²/p + (ts + tw n/√p) (log p)/2
       = tc n²/p + ts (log p)/2 + tw n (log p)/(2√p)

To determine isoefficiency function, set

    tc n² ≥ E (tc n² + ts p (log p)/2 + tw n √p (log p)/2)

which holds for large p if n = Θ(√p log p), so isoefficiency function is Θ(p (log p)²), since T1 = Θ(n²)
Scalability for 1-D Agglomeration

Depending on network, time for communication phase with 1-D agglomeration is at least

    1-D mesh:  Tcomm = (ts + tw n/p) (p − 1)
    2-D mesh:  Tcomm = (ts + tw n/p) 2(√p − 1)
    hypercube: Tcomm = (ts + tw n/p) log p

assuming broadcasts can be overlapped
Scalability for 1-D Mesh
For 1-D mesh, total time is at least

    Tp = tc n²/p + (ts + tw n/p) (p − 1)
       ≈ tc n²/p + ts p + tw n

To determine isoefficiency function, set

    tc n² ≥ E (tc n² + ts p² + tw n p)

which holds if n = Θ(p), so isoefficiency function is Θ(p²), since T1 = Θ(n²)
Memory Requirements
With either 1-D or 2-D algorithm, straightforward
broadcasting of x or y could require as much total memory
as replication of entire vector in all processes
Memory requirements can be reduced by circulating
portions of x or y through processes in ring fashion, with
each process using each portion as it passes through, so
that no process need store entire vector at once
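A serial sketch of this ring circulation for the outer product, assuming a 1-D row-block distribution (each simulated process holds one block of x and sees each block of y exactly once; no process ever stores all of y):

```python
import numpy as np

# Process i owns x-block i (rows i*b..(i+1)*b of Z) and initially y-block i.
# At each step every process multiplies against the y-block currently
# visiting it, then the y-blocks shift one position around the ring.
rng = np.random.default_rng(2)
n, p = 8, 4
b = n // p
x, y = rng.standard_normal(n), rng.standard_normal(n)

x_blocks = [x[i * b:(i + 1) * b] for i in range(p)]
y_blocks = [y[i * b:(i + 1) * b] for i in range(p)]

Z = np.zeros((n, n))
for step in range(p):
    for i in range(p):
        j = (i + step) % p            # y-block currently visiting process i
        Z[i * b:(i + 1) * b, j * b:(j + 1) * b] = np.outer(x_blocks[i], y_blocks[j])

assert np.allclose(Z, np.outer(x, y))
```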
Matrix-Vector Product
Consider matrix-vector product

    y = A x

where A is n × n matrix and x and y are n-vectors

Components of vector y are given by

    y_i = Σ_{j=1}^{n} a_ij x_j ,   i = 1, . . . , n

Each of n components requires n multiply-add operations, so model serial time as

    T1 = tc n²
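The definition translates directly into a double loop with n² multiply-adds, matching the serial model:

```python
import numpy as np

# y_i = sum_j a_ij * x_j, computed with an explicit double loop and
# checked against numpy's matrix-vector product.
rng = np.random.default_rng(3)
n = 5
A, x = rng.standard_normal((n, n)), rng.standard_normal(n)

y = np.zeros(n)
for i in range(n):
    for j in range(n):
        y[i] += A[i, j] * x[j]    # one multiply-add per (i, j) pair

assert np.allclose(y, A @ x)
```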
Parallel Algorithm
Partition
For i, j = 1, . . . , n, fine-grain task (i, j) stores a_ij and computes a_ij x_j, yielding 2-D array of n² fine-grain tasks

Assuming no replication of data, at most 2n fine-grain tasks store components of x and y, say either
    for some j, task (j, i) stores x_i and task (i, j) stores y_i, or
    task (i, i) stores both x_i and y_i, i = 1, . . . , n

Communicate

For j = 1, . . . , n, task that stores x_j broadcasts it to all other tasks in jth task column
For i = 1, . . . , n, sum reduction over ith task row gives y_i
Fine-Grain Tasks and Communication
(Figure: 6 × 6 array of fine-grain tasks, task (i, j) holding a_ij x_j; x broadcast along task columns, y_i formed by sum reduction along task rows)
Fine-Grain Parallel Algorithm
broadcast x_j to tasks (k, j), k = 1, . . . , n     { vertical broadcast }
y_i = a_ij x_j                                      { local scalar product }
reduce y_i across tasks (i, k), k = 1, . . . , n    { horizontal sum reduction }
Agglomeration
Agglomerate
With n × n array of fine-grain tasks, natural strategies are

2-D: Combine k × k subarray of fine-grain tasks to form each coarse-grain task, yielding (n/k)² coarse-grain tasks
1-D column: Combine n fine-grain tasks in each column into coarse-grain task, yielding n coarse-grain tasks
1-D row: Combine n fine-grain tasks in each row into coarse-grain task, yielding n coarse-grain tasks
2-D Agglomeration
Subvector of x broadcast along each task column
Each task computes local matrix-vector product of submatrix of A with subvector of x
Sum reduction along each task row produces subvector of result y
2-D Agglomeration
(Figure: 2-D agglomeration of the matrix-vector task array)
1-D Agglomeration
1-D column agglomeration
    Each task computes product of its component of x with its column of matrix, with no communication required
    Sum reduction across tasks then produces y

1-D row agglomeration
    If x stored in one task, then broadcast required to communicate needed values to all other tasks
    If x distributed across tasks, then multinode broadcast required to communicate needed values to other tasks
    Each task computes inner product of its row of A with entire vector x to produce its component of y
1-D Column Agglomeration
(Figure: 1-D column agglomeration of the matrix-vector task array)
1-D Row Agglomeration
(Figure: 1-D row agglomeration of the matrix-vector task array)
1-D Agglomeration
Column and row algorithms are dual to each other

Column algorithm begins with communication-free local saxpy computations followed by sum reduction

Row algorithm begins with broadcast followed by communication-free local sdot computations
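The duality can be seen in serial form: the column algorithm is a sequence of saxpy updates, the row algorithm a sequence of dot products, and both yield the same y.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
A, x = rng.standard_normal((n, n)), rng.standard_normal(n)

# Column algorithm: n saxpy operations, y += x_j * (column j of A)
y_col = np.zeros(n)
for j in range(n):
    y_col += x[j] * A[:, j]

# Row algorithm: n sdot operations, y_i = dot(row i of A, x)
y_row = np.array([A[i, :] @ x for i in range(n)])

assert np.allclose(y_col, y_row)
assert np.allclose(y_col, A @ x)
```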
Mapping
Map
2-D: Assign (n/k)²/p coarse-grain tasks to each of p processes using any desired mapping in each dimension, treating target network as 2-D mesh

1-D: Assign n/p coarse-grain tasks to each of p processes using any desired mapping, treating target network as 1-D mesh
2-D Agglomeration with Block Mapping
(Figure: 2-D agglomeration with block mapping for matrix-vector product)
1-D Column Agglomeration with Block Mapping
(Figure: 1-D column agglomeration with block mapping for matrix-vector product)
1-D Row Agglomeration with Block Mapping
(Figure: 1-D row agglomeration with block mapping for matrix-vector product)
Coarse-Grain Parallel Algorithm
broadcast x[j] to jth process column     { vertical broadcast }
y[i] = A[i][j] x[j]                      { local matrix-vector product }
reduce y[i] across ith process row       { horizontal sum reduction }
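A serial simulation of this coarse-grain algorithm on an assumed √p × √p process grid: the broadcast and reduction become block indexing and a sum over the process row.

```python
import numpy as np

# A[i][j] is a block of A, x[j] a subvector; q = sqrt(p) process
# rows/columns, with q dividing n (block mapping assumed).
rng = np.random.default_rng(5)
n, q = 6, 3
b = n // q
A, x = rng.standard_normal((n, n)), rng.standard_normal(n)

xb = [x[j * b:(j + 1) * b] for j in range(q)]     # x[j] per process column
Ab = [[A[i * b:(i + 1) * b, j * b:(j + 1) * b] for j in range(q)]
      for i in range(q)]                          # A[i][j] per process

# local products, then sum reduction across each process row gives y[i]
y = np.concatenate([sum(Ab[i][j] @ xb[j] for j in range(q)) for i in range(q)])

assert np.allclose(y, A @ x)
```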
Scalability
Time for computation phase is

    Tcomp = tc n²/p

regardless of network or agglomeration scheme

For 2-D agglomeration on 2-D mesh, each of two communication phases requires time

    (ts + tw n/√p) (√p − 1) ≈ ts √p + tw n

so total time is

    Tp ≈ tc n²/p + 2(ts √p + tw n)
Scalability for 2-D Mesh
To determine isoefficiency function, set

    tc n² ≥ E (tc n² + 2(ts p^{3/2} + tw n p))

which holds if n = Θ(p), so isoefficiency function is Θ(p²), since T1 = Θ(n²)
Scalability for Hypercube
Total time for hypercube is

    Tp = tc n²/p + (ts + tw n/√p) log p
       = tc n²/p + ts log p + tw n (log p)/√p

To determine isoefficiency function, set

    tc n² ≥ E (tc n² + ts p log p + tw n √p log p)

which holds for large p if n = Θ(√p log p), so isoefficiency function is Θ(p (log p)²), since T1 = Θ(n²)
Scalability for 1-D Mesh
Depending on network, time for communication phase with 1-D agglomeration is at least

    1-D mesh:  Tcomm = (ts + tw n/p) (p − 1)
    2-D mesh:  Tcomm = (ts + tw n/p) 2(√p − 1)
    hypercube: Tcomm = (ts + tw n/p) log p

For 1-D agglomeration on 1-D mesh, total time is at least

    Tp = tc n²/p + (ts + tw n/p) (p − 1) ≈ tc n²/p + ts p + tw n
Scalability for 1-D Mesh
To determine isoefficiency function, set

    tc n² ≥ E (tc n² + ts p² + tw n p)

which holds if n = Θ(p), so isoefficiency function is Θ(p²), since T1 = Θ(n²)
Matrix-Matrix Product
Consider matrix-matrix product

    C = A B

where A, B, and result C are n × n matrices

Entries of matrix C are given by

    c_ij = Σ_{k=1}^{n} a_ik b_kj ,   i, j = 1, . . . , n

Each of n² entries of C requires n multiply-add operations, so model serial time as

    T1 = tc n³
Matrix-Matrix Product
Matrix-matrix product can be viewed as
n² inner products, or
sum of n outer products, or
n matrix-vector products
and each viewpoint yields different algorithm
One way to derive parallel algorithms for matrix-matrix
product is to apply parallel algorithms already developed
for inner product, outer product, or matrix-vector product
We will develop parallel algorithms for this problem directly,
however
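The three viewpoints can be written out directly; all produce the same C.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))

# (1) n^2 inner products: c_ij = dot(row i of A, column j of B)
C_inner = np.array([[A[i, :] @ B[:, j] for j in range(n)] for i in range(n)])

# (2) sum of n outer products: C = sum_k (column k of A)(row k of B)
C_outer = sum(np.outer(A[:, k], B[k, :]) for k in range(n))

# (3) n matrix-vector products: column j of C is A times column j of B
C_matvec = np.column_stack([A @ B[:, j] for j in range(n)])

assert np.allclose(C_inner, C_outer)
assert np.allclose(C_outer, C_matvec)
assert np.allclose(C_matvec, A @ B)
```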
Parallel Algorithm
Partition
For i, j, k = 1, . . . , n, fine-grain task (i, j, k) computes product a_ik b_kj, yielding 3-D array of n^3 fine-grain tasks

Assuming no replication of data, at most 3n^2 fine-grain tasks store entries of A, B, or C: say task (i, j, j) stores a_ij, task (i, j, i) stores b_ij, and task (i, j, k) stores c_ij for i, j = 1, . . . , n and some fixed k
[Figure: 3-D array of fine-grain tasks with axes i, j, k]
We refer to subsets of tasks along i, j, and k dimensions as rows, columns, and layers, respectively, so kth column of A and kth row of B are stored in kth layer of tasks
Parallel Algorithm
Communicate
Broadcast entries of jth column of A horizontally along each task row in jth layer

Broadcast entries of ith row of B vertically along each task column in ith layer

For i, j = 1, . . . , n, result c_ij is given by sum reduction over tasks (i, j, k), k = 1, . . . , n
Fine-Grain Algorithm
broadcast a_ik to tasks (i, q, k), q = 1, . . . , n   { horizontal broadcast }
broadcast b_kj to tasks (q, j, k), q = 1, . . . , n   { vertical broadcast }
c_ij = a_ik b_kj   { local scalar product }
reduce c_ij across tasks (i, j, q), q = 1, . . . , n   { lateral sum reduction }
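The fine-grain algorithm can be simulated serially in a few lines of pure Python (an illustration, not the slides' implementation): task (i, j, k) holds the single product a_ik b_kj, and the lateral sum reduction over the k dimension produces c_ij.

```python
# Serial simulation of the fine-grain algorithm; matrix values
# are arbitrary illustrative data.

def fine_grain_matmul(A, B):
    n = len(A)
    # after the horizontal and vertical broadcasts, task (i,j,k)
    # can form its local scalar product a_ik * b_kj
    t = [[[A[i][k] * B[k][j] for k in range(n)]
          for j in range(n)] for i in range(n)]
    # lateral sum reduction across tasks (i,j,q), q = 1, ..., n
    return [[sum(t[i][j]) for j in range(n)] for i in range(n)]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = fine_grain_matmul(A, B)    # [[19.0, 22.0], [43.0, 50.0]]
```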
Agglomeration
Agglomerate
With n × n × n array of fine-grain tasks, natural strategies are

3-D: Combine q × q × q subarray of fine-grain tasks

2-D: Combine q × q × n subarray of fine-grain tasks, eliminating sum reductions

1-D column: Combine n × 1 × n subarray of fine-grain tasks, eliminating vertical broadcasts and sum reductions

1-D row: Combine 1 × n × n subarray of fine-grain tasks, eliminating horizontal broadcasts and sum reductions
Mapping
Map
Corresponding mapping strategies are
3-D: Assign (n/q)^3/p coarse-grain tasks to each of p processes using any desired mapping in each dimension, treating target network as 3-D mesh

2-D: Assign (n/q)^2/p coarse-grain tasks to each of p processes using any desired mapping in each dimension, treating target network as 2-D mesh

1-D: Assign n/p coarse-grain tasks to each of p processes using any desired mapping, treating target network as 1-D mesh
Agglomeration with Block Mapping
[Figure: block agglomeration of the fine-grain task array for the 1-D row, 1-D column, 2-D, and 3-D schemes]
Coarse-Grain 3-D Parallel Algorithm
broadcast A[i][k] to ith process row   { horizontal broadcast }
broadcast B[k][j] to jth process column   { vertical broadcast }
C[i][j] = A[i][k] B[k][j]   { local matrix product }
reduce C[i][j] across process layers   { lateral sum reduction }
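To make the coarse-grain 3-D scheme concrete, here is a hedged, serial pure-Python sketch (not from the slides, data values illustrative): on a q × q × q process grid, process (i, j, k) forms the single local block product A[i][k] B[k][j], and C[i][j] is the sum reduction over the k layers.

```python
# Serial simulation of the coarse-grain 3-D algorithm on a q x q x q
# process grid (here q = 2, p = 8); matrix values are illustrative.

def get_block(M, i, j, b):
    return [row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]]

def block_matmul(X, Y):
    b = len(X)
    return [[sum(X[r][k] * Y[k][c] for k in range(b)) for c in range(b)]
            for r in range(b)]

def coarse_3d_matmul(A, B, q):
    n = len(A); b = n // q
    # each process (i,j,k) forms its local block product
    local = {(i, j, k): block_matmul(get_block(A, i, k, b),
                                     get_block(B, k, j, b))
             for i in range(q) for j in range(q) for k in range(q)}
    # lateral sum reduction across the k layers into C[i][j]
    C = [[0.0] * n for _ in range(n)]
    for (i, j, k), P in local.items():
        for r in range(b):
            for c in range(b):
                C[i*b + r][j*b + c] += P[r][c]
    return C

n = 4
A = [[float(i + 2 * j) for j in range(n)] for i in range(n)]
I = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
assert coarse_3d_matmul(A, I, 2) == A   # A times identity is A
```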
Agglomeration with Block Mapping
For 2 × 2 block partitioning, each scheme computes the same block product

[ A11  A12 ] [ B11  B12 ]   [ A11 B11 + A12 B21   A11 B12 + A12 B22 ]
[ A21  A22 ] [ B21  B22 ] = [ A21 B11 + A22 B21   A21 B12 + A22 B22 ]

shown in the original figure for the 1-D column, 1-D row, and 2-D agglomerations, which differ only in how the blocks of A, B, and C are assigned to processes
Coarse-Grain 2-D Parallel Algorithm
all-to-all bcast A[i][j] in ith process row   { horizontal broadcast }
all-to-all bcast B[i][j] in jth process column   { vertical broadcast }
C[i][j] = O
for k = 1, . . . , √p
    C[i][j] = C[i][j] + A[i][k] B[k][j]   { sum local products }
end
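A hedged serial sketch of the 2-D scheme in pure Python (data values illustrative): on a 2 × 2 process grid (p = 4), the process owning C[i][j] ends up holding block row i of A and block column j of B after the all-to-all broadcasts, and sums √p local block products.

```python
# Serial simulation of the coarse-grain 2-D algorithm on a
# sqrt(p) x sqrt(p) process grid; matrix values are illustrative.

def get_block(M, i, j, b):
    return [row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]]

def block_matmul(X, Y):
    b = len(X)
    return [[sum(X[r][k] * Y[k][c] for k in range(b)) for c in range(b)]
            for r in range(b)]

def coarse_2d_matmul(A, B, q):          # q = sqrt(p) process rows/columns
    n = len(A); b = n // q
    C = [[0.0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):              # process (i,j) computes C[i][j]
            for k in range(q):          # sum of sqrt(p) local products
                P = block_matmul(get_block(A, i, k, b),
                                 get_block(B, k, j, b))
                for r in range(b):
                    for c in range(b):
                        C[i*b + r][j*b + c] += P[r][c]
    return C

n = 4
A = [[float(i * n + j) for j in range(n)] for i in range(n)]
I = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
assert coarse_2d_matmul(A, I, 2) == A   # A times identity is A
```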
Fox Algorithm
Algorithm just described requires excessive memory, since each process accumulates √p blocks of both A and B

One way to reduce memory requirements is to

broadcast blocks of A successively across process rows, and

circulate blocks of B in ring fashion vertically along process columns

step by step, so that each block of B comes in conjunction with appropriate block of A broadcast at that same step

This algorithm is due to Fox et al.
Cannon Algorithm
Another approach, due to Cannon, is to circulate blocks of
B vertically and blocks of A horizontally in ring fashion
Blocks of both matrices must be initially aligned using
circular shifts so that correct blocks meet as needed
Requires even less memory than Fox algorithm, but trickier
to program because of shifts required
Performance and scalability of Fox and Cannon algorithms
are not significantly different from that of previous 2-D
algorithm, but memory requirements are much less
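The shifting pattern is easiest to see in a small serial sketch. This pure-Python illustration (not the slides' implementation) runs Cannon's algorithm with block size 1, so each "process" holds a single entry: the initial skew shifts row i of A left by i and column j of B up by j, and each of q steps multiplies aligned entries and circularly shifts A left and B up.

```python
# Serial simulation of Cannon's algorithm with block size 1;
# A and B values are arbitrary illustrative data.

def cannon_matmul(A, B):
    q = len(A)
    # initial alignment: circularly shift row i of A left by i,
    # and column j of B up by j
    A = [[A[i][(j + i) % q] for j in range(q)] for i in range(q)]
    B = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[0.0] * q for _ in range(q)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                C[i][j] += A[i][j] * B[i][j]   # local multiply-add
        # circular shifts: A one step left, B one step up
        A = [[A[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        B = [[B[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = cannon_matmul(A, B)    # [[19.0, 22.0], [43.0, 50.0]]
```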
Scalability for 3-D Agglomeration
For 3-D agglomeration, computing each of p blocks C[i][j] requires matrix-matrix product of two (n/p^{1/3}) × (n/p^{1/3}) blocks, so

T_comp = t_c (n/p^{1/3})^3 = t_c n^3/p

On 3-D mesh, each broadcast or reduction takes time

(t_s + t_w (n/p^{1/3})^2)(p^{1/3} − 1) ≈ t_s p^{1/3} + t_w n^2/p^{1/3}

Total time is therefore

T_p = t_c n^3/p + 3 t_s p^{1/3} + 3 t_w n^2/p^{1/3}
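Plugging illustrative numbers into this model shows how the terms trade off. The constants tc, ts, tw below are arbitrary stand-ins, not measured machine parameters.

```python
# Evaluating the 3-D time model Tp = tc*n^3/p + 3*ts*p^(1/3)
# + 3*tw*n^2/p^(1/3) with illustrative (not measured) constants.

def t_serial(n, tc):
    return tc * n**3

def t_parallel_3d(n, p, tc, ts, tw):
    return tc * n**3 / p + 3 * ts * p**(1/3) + 3 * tw * n**2 / p**(1/3)

tc, ts, tw = 1e-9, 1e-6, 1e-8           # illustrative constants
n, p = 1024, 64
Tp = t_parallel_3d(n, p, tc, ts, tw)
E = t_serial(n, tc) / (p * Tp)          # efficiency E = T1 / (p * Tp)
assert 0.0 < E < 1.0                    # communication keeps E below 1
```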
Scalability for 3-D Agglomeration
To determine isoefficiency function, set
t_c n^3 ≥ E (t_c n^3 + 3 t_s p^{4/3} + 3 t_w n^2 p^{2/3})

which holds for large p if n = Θ(p^{2/3}), so isoefficiency function is Θ(p^2), since T1 = Θ(n^3)

For hypercube, total time becomes

T_p = t_c n^3/p + t_s log p + t_w n^2 (log p)/p^{2/3}

which leads to isoefficiency function of Θ(p (log p)^3)
Scalability for 2-D Agglomeration
For 2-D agglomeration, computation of each block C[i][j] requires √p matrix-matrix products of (n/√p) × (n/√p) blocks, so

T_comp = t_c √p (n/√p)^3 = t_c n^3/p

For 2-D mesh, communication time for broadcasts along rows and columns is

T_comm = (t_s + t_w n^2/p)(√p − 1) ≈ t_s √p + t_w n^2/√p

assuming horizontal and vertical broadcasts can overlap (multiply by two otherwise)
Scalability for 2-D Agglomeration
Total time for 2-D mesh is
T_p ≈ t_c n^3/p + t_s √p + t_w n^2/√p

To determine isoefficiency function, set

t_c n^3 ≥ E (t_c n^3 + t_s p^{3/2} + t_w n^2 √p)

which holds for large p if n = Θ(√p), so isoefficiency function is Θ(p^{3/2}), since T1 = Θ(n^3)
Scalability for 1-D Agglomeration
For 1-D agglomeration on 1-D mesh, total time is

T_p = t_c n^3/p + (t_s + t_w n^2/p)(p − 1) ≈ t_c n^3/p + t_s p + t_w n^2

To determine isoefficiency function, set

t_c n^3 ≥ E (t_c n^3 + t_s p^2 + t_w n^2 p)

which holds for large p if n = Θ(p), so isoefficiency function is Θ(p^3), since T1 = Θ(n^3)
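The isoefficiency claim can be sanity-checked numerically: with illustrative constants, scaling n in proportion to p in the 1-D model keeps the efficiency roughly constant, as n = Θ(p) predicts.

```python
# Numerical check, with illustrative (not measured) constants, that
# for the 1-D model Tp = tc*n^3/p + ts*p + tw*n^2 efficiency stays
# roughly constant when n grows in proportion to p.

def eff_1d(n, p, tc=1e-9, ts=1e-6, tw=1e-8):
    Tp = tc * n**3 / p + ts * p + tw * n**2
    return (tc * n**3) / (p * Tp)       # E = T1 / (p * Tp)

effs = [eff_1d(100 * p, p) for p in (4, 16, 64, 256)]   # scale n = 100*p
assert max(effs) - min(effs) < 0.1      # efficiency roughly constant
```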
References
R. C. Agarwal, F. G. Gustavson, and M. Zubair, A high-performance matrix multiplication algorithm on a distributed memory parallel computer using overlapped communication, IBM J. Res. Dev. 38:673-681, 1994

D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice Hall, 1989

J. Berntsen, Communication efficient matrix multiplication on hypercubes, Parallel Computing 12:335-342, 1989

J. W. Demmel, M. T. Heath, and H. A. van der Vorst, Parallel numerical linear algebra, Acta Numerica 2:111-197, 1993
R. Dias da Cunha, A benchmark study based on the parallel computation of the vector outer-product A = u v^T operation, Concurrency: Practice and Experience 9:803-819, 1997

G. C. Fox, S. W. Otto, and A. J. G. Hey, Matrix algorithms on a hypercube I: matrix multiplication, Parallel Computing 4:17-31, 1987

S. L. Johnsson, Communication efficient basic linear algebra computations on hypercube architectures, J. Parallel Distrib. Comput. 4(2):133-172, 1987

O. McBryan and E. F. Van de Velde, Matrix and vector operations on hypercube parallel processors, Parallel Computing 5:117-126, 1987