05 Products



Parallel Numerical Algorithms
Chapter 5: Vector and Matrix Products

    Prof. Michael T. Heath

Department of Computer Science
University of Illinois at Urbana-Champaign

    CSE 512 / CS 554


    Outline

    1 Inner Product

    2 Outer Product

    3 Matrix-Vector Product

    4 Matrix-Matrix Product


    Basic Linear Algebra Subprograms

Basic Linear Algebra Subprograms (BLAS) are building blocks for many other matrix computations

BLAS encapsulate basic operations on vectors and matrices so they can be optimized for particular computer architecture while high-level routines that call them remain portable

BLAS offer good opportunities for optimizing utilization of memory hierarchy

Generic BLAS are available from netlib, and many computer vendors provide custom versions optimized for their particular systems


    Examples of BLAS

Level  Work    Examples  Function
1      O(n)    saxpy     Scalar × vector + vector
               sdot      Inner product
               snrm2     Euclidean vector norm
2      O(n²)   sgemv     Matrix-vector product
               strsv     Triangular solution
               sger      Rank-one update
3      O(n³)   sgemm     Matrix-matrix product
               strsm     Multiple triangular solutions
               ssyrk     Rank-k update
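For concreteness, here is a minimal sketch of calling Level-1 and Level-2 routines through the C CBLAS interface (an assumption about the build environment: a CBLAS implementation such as the Netlib reference BLAS or OpenBLAS providing cblas.h; the d-prefixed double-precision variants are used in place of the s-prefixed single-precision names above):

    #include <stdio.h>
    #include <cblas.h>

    int main(void) {
        double x[3] = {1.0, 2.0, 3.0};
        double y[3] = {4.0, 5.0, 6.0};

        /* Level 1: inner product x^T y (sdot/ddot) */
        double dot = cblas_ddot(3, x, 1, y, 1);

        /* Level 2: y = alpha*A*x + beta*y (sgemv/dgemv), A = 3x3 identity */
        double A[9] = {1, 0, 0,  0, 1, 0,  0, 0, 1};
        cblas_dgemv(CblasRowMajor, CblasNoTrans, 3, 3,
                    1.0, A, 3, x, 1, 0.0, y, 1);   /* y now equals x */

        printf("x^T y = %g, y[0] = %g\n", dot, y[0]);
        return 0;
    }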


    Simplifying Assumptions

For problem of dimension n using p processes, assume p (or in some cases √p) divides n

For 2-D mesh, assume p is perfect square and mesh is √p × √p

For hypercube, assume p is power of two

Assume matrices are square, n × n, not rectangular

Dealing with general cases where these assumptions do not hold is straightforward but tedious, and complicates notation

Caveat: your mileage may vary, depending on assumptions about target system, such as level of concurrency in communication


    Inner Product

Inner product of two n-vectors x and y given by

    x^T y = Σ_{i=1}^n xi yi

Computation of inner product requires n multiplications and n − 1 additions

For simplicity, model serial time as

    T1 = tc n

where tc is time for one scalar multiply-add operation


    Parallel Algorithm

    Partition

For i = 1, …, n, fine-grain task i stores xi and yi, and computes their product xi yi

    Communicate

    Sum reduction over n fine-grain tasks

[Figure: row of nine fine-grain tasks computing x1 y1, …, x9 y9, combined by sum reduction]


    Fine-Grain Parallel Algorithm

z = xi yi                       { local scalar product }
reduce z across all tasks       { sum reduction }


    Agglomeration and Mapping

    Agglomerate

Combine k components of both x and y to form each coarse-grain task, which computes inner product of these subvectors

Communication becomes sum reduction over n/k coarse-grain tasks

Map

Assign (n/k)/p coarse-grain tasks to each of p processes, for total of n/p components of x and y per process

[Figure: the nine fine-grain tasks agglomerated in groups of three; each coarse-grain task sums its local products xi yi before the global sum reduction]


    Coarse-Grain Parallel Algorithm

z = x[i]^T y[i]                     { local inner product }
reduce z across all processes       { sum reduction }

x[i] means subvector of x assigned to process i by mapping
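In MPI terms, this coarse-grain algorithm is one local loop plus one reduction (a minimal sketch, assuming the n/p components of x and y are already distributed; MPI_Allreduce is used so every process receives z):

    #include <mpi.h>
    #include <stdio.h>

    /* Local inner product of the n/p components owned by this process,
       followed by a global sum reduction. */
    double parallel_dot(const double *x, const double *y, int n_local,
                        MPI_Comm comm) {
        double z_local = 0.0, z = 0.0;
        for (int i = 0; i < n_local; i++)
            z_local += x[i] * y[i];                 /* local inner product */
        MPI_Allreduce(&z_local, &z, 1, MPI_DOUBLE, MPI_SUM, comm);  /* sum reduction */
        return z;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* Toy data: each process owns 2 components equal to 1,
           so the global result is 2p. */
        double x[2] = {1.0, 1.0}, y[2] = {1.0, 1.0};
        double z = parallel_dot(x, y, 2, MPI_COMM_WORLD);
        if (rank == 0) printf("x^T y = %g\n", z);
        MPI_Finalize();
        return 0;
    }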


    Performance

Time for computation phase is

    Tcomp = tc n/p

regardless of network

Depending on network, time for communication phase is

    1-D mesh:  Tcomm = (ts + tw)(p − 1)
    2-D mesh:  Tcomm = (ts + tw) · 2(√p − 1)
    hypercube: Tcomm = (ts + tw) log p

For simplicity, ignore cost of additions in reduction, which is usually negligible


    Scalability for 1-D Mesh

For 1-D mesh, total time is

    Tp = tc n/p + (ts + tw)(p − 1)

To determine isoefficiency function, set T1 ≥ E (p Tp), i.e.,

    tc n ≥ E(tc n + (ts + tw) p (p − 1))

which holds if n = Θ(p²), so isoefficiency function is Θ(p²), since T1 = Θ(n)
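To see why (a sketch of the algebra, treating the desired efficiency E as a fixed constant): substituting n = c p² into the efficiency ratio gives

    E(p) = T1 / (p Tp)
         = tc n / (tc n + (ts + tw) p (p − 1))
         = tc c p² / (tc c p² + (ts + tw) p (p − 1))
         → tc c / (tc c + ts + tw)   as p → ∞

so efficiency approaches a constant that can be made as close to 1 as desired by increasing c, whereas any slower growth of n lets the communication term dominate.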


    Scalability for 2-D Mesh

For 2-D mesh, total time is

    Tp = tc n/p + (ts + tw) · 2(√p − 1)

To determine isoefficiency function, set

    tc n ≥ E(tc n + (ts + tw) p · 2(√p − 1))

which holds if n = Θ(p^{3/2}), so isoefficiency function is Θ(p^{3/2}), since T1 = Θ(n)


    Scalability for Hypercube

For hypercube, total time is

    Tp = tc n/p + (ts + tw) log p

To determine isoefficiency function, set

    tc n ≥ E(tc n + (ts + tw) p log p)

which holds if n = Θ(p log p), so isoefficiency function is Θ(p log p), since T1 = Θ(n)


    Optimality for 1-D Mesh

To determine optimal number of processes for given n, take p to be continuous variable and minimize Tp with respect to p

For 1-D mesh,

    d/dp Tp = d/dp [ tc n/p + (ts + tw)(p − 1) ]
            = −tc n/p² + (ts + tw) = 0

implies that optimal number of processes is

    p ≈ √( tc n / (ts + tw) )


    Optimality for 1-D Mesh

If n < (ts + tw)/tc, then only one process should be used

Substituting optimal p into formula for Tp shows that optimal time to compute inner product grows as √n with increasing n on 1-D mesh


    Optimality for Hypercube

For hypercube,

    d/dp Tp = d/dp [ tc n/p + (ts + tw) log p ]
            = −tc n/p² + (ts + tw)/p = 0

implies that optimal number of processes is

    p ≈ tc n / (ts + tw)

and optimal time grows as log n with increasing n


    Outer Product

Outer product of two n-vectors x and y is n × n matrix Z = x y^T whose (i, j) entry is zij = xi yj

For example,

    [x1] [y1]^T   [x1y1  x1y2  x1y3]
    [x2] [y2]   = [x2y1  x2y2  x2y3]
    [x3] [y3]     [x3y1  x3y2  x3y3]

Computation of outer product requires n² multiplications, so model serial time as

    T1 = tc n²

where tc is time for one scalar multiplication


    Parallel Algorithm

    Partition

For i, j = 1, …, n, fine-grain task (i, j) computes and stores zij = xi yj, yielding 2-D array of n² fine-grain tasks

Assuming no replication of data, at most 2n fine-grain tasks store components of x and y, say either
    for some j, task (i, j) stores xi and task (j, i) stores yi, or
    task (i, i) stores both xi and yi, i = 1, …, n

Communicate

For i = 1, …, n, task that stores xi broadcasts it to all other tasks in ith task row

For j = 1, …, n, task that stores yj broadcasts it to all other tasks in jth task column


    Fine-Grain Tasks and Communication

[Figure: 6 × 6 grid of fine-grain tasks (i, j), each computing xi yj; xi is broadcast along task row i and yj along task column j]


    Fine-Grain Parallel Algorithm

broadcast xi to tasks (i, k), k = 1, …, n      { horizontal broadcast }
broadcast yj to tasks (k, j), k = 1, …, n      { vertical broadcast }
zij = xi yj                                    { local scalar product }


    Agglomeration

    Agglomerate

With n × n array of fine-grain tasks, natural strategies are

2-D: Combine k × k subarray of fine-grain tasks to form each coarse-grain task, yielding (n/k)² coarse-grain tasks

1-D column: Combine n fine-grain tasks in each column into coarse-grain task, yielding n coarse-grain tasks

1-D row: Combine n fine-grain tasks in each row into coarse-grain task, yielding n coarse-grain tasks


    2-D Agglomeration

Each task that stores portion of x must broadcast its subvector to all other tasks in its task row

Each task that stores portion of y must broadcast its subvector to all other tasks in its task column


    2-D Agglomeration

[Figure: the 6 × 6 grid of fine-grain tasks xi yj combined into coarse-grain tasks, each a k × k block of the grid]


    1-D Agglomeration

If either x or y stored in one task, then broadcast required to communicate needed values to all other tasks

If either x or y distributed across tasks, then multinode broadcast required to communicate needed values to other tasks


    1-D Column Agglomeration

[Figure: the 6 × 6 task grid agglomerated by columns into 6 coarse-grain column tasks]


    1-D Row Agglomeration

[Figure: the 6 × 6 task grid agglomerated by rows into 6 coarse-grain row tasks]


    Mapping

    Map

2-D: Assign (n/k)²/p coarse-grain tasks to each of p processes using any desired mapping in each dimension, treating target network as 2-D mesh

1-D: Assign n/p coarse-grain tasks to each of p processes using any desired mapping, treating target network as 1-D mesh
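A sketch of the 2-D agglomerated outer product in MPI (a hypothetical helper, not from the slides: assumes a √p × √p process grid using the diagonal storage scheme above, so process (i, i) initially owns both subvectors, with row/column communicators created by MPI_Comm_split):

    #include <mpi.h>

    /* Process (r, c) on a q x q grid (q = sqrt(p)) computes the nb x nb
       block Z[r][c] = x[r] y[c]^T, where nb = n/q. */
    void outer_product_2d(double *x_sub, double *y_sub, double *Z,
                          int nb, int r, int c, MPI_Comm grid) {
        MPI_Comm row_comm, col_comm;
        MPI_Comm_split(grid, r, c, &row_comm);   /* same task row, ranked by c */
        MPI_Comm_split(grid, c, r, &col_comm);   /* same task column, ranked by r */

        /* Diagonal process (r, r) owns x[r]; its rank in row_comm is r.
           Diagonal process (c, c) owns y[c]; its rank in col_comm is c. */
        MPI_Bcast(x_sub, nb, MPI_DOUBLE, r, row_comm);   /* horizontal broadcast */
        MPI_Bcast(y_sub, nb, MPI_DOUBLE, c, col_comm);   /* vertical broadcast */

        for (int i = 0; i < nb; i++)                     /* local outer product */
            for (int j = 0; j < nb; j++)
                Z[i * nb + j] = x_sub[i] * y_sub[j];

        MPI_Comm_free(&row_comm);
        MPI_Comm_free(&col_comm);
    }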


    2-D Agglomeration with Block Mapping

[Figure: 2-D agglomeration with block mapping; contiguous blocks of the task grid assigned to each process of a √p × √p process grid]


    1-D Column Agglomeration with Block Mapping

[Figure: 1-D column agglomeration with block mapping; adjacent task columns assigned to each process]


    1-D Row Agglomeration with Block Mapping

[Figure: 1-D row agglomeration with block mapping; adjacent task rows assigned to each process]


    Scalability

Time for computation phase is

    Tcomp = tc n²/p

regardless of network or agglomeration scheme

For 2-D agglomeration on 2-D mesh, communication time is at least

    Tcomm = (ts + tw n/√p)(√p − 1)

assuming broadcasts can be overlapped


    Scalability for 2-D Mesh

Total time for 2-D mesh is at least

    Tp = tc n²/p + (ts + tw n/√p)(√p − 1)
       ≈ tc n²/p + ts √p + tw n

To determine isoefficiency function, set

    tc n² ≥ E(tc n² + ts p^{3/2} + tw n p)

which holds for large p if n = Θ(p), so isoefficiency function is Θ(p²), since T1 = Θ(n²)


    Scalability for Hypercube

Total time for hypercube is at least

    Tp = tc n²/p + (ts + tw n/√p)(log p)/2
       = tc n²/p + ts (log p)/2 + tw n (log p)/(2√p)

To determine isoefficiency function, set

    tc n² ≥ E(tc n² + ts p (log p)/2 + tw n √p (log p)/2)

which holds for large p if n = Θ(√p log p), so isoefficiency function is Θ(p (log p)²), since T1 = Θ(n²)


Scalability for 1-D Agglomeration

Depending on network, time for communication phase with 1-D agglomeration is at least

    1-D mesh:  Tcomm = (ts + tw n/p)(p − 1)
    2-D mesh:  Tcomm = (ts + tw n/p) · 2(√p − 1)
    hypercube: Tcomm = (ts + tw n/p) log p

assuming broadcasts can be overlapped


    Scalability for 1-D Mesh

For 1-D mesh, total time is at least

    Tp = tc n²/p + (ts + tw n/p)(p − 1)
       ≈ tc n²/p + ts p + tw n

To determine isoefficiency function, set

    tc n² ≥ E(tc n² + ts p² + tw n p)

which holds if n = Θ(p), so isoefficiency function is Θ(p²), since T1 = Θ(n²)


    Memory Requirements

With either 1-D or 2-D algorithm, straightforward broadcasting of x or y could require as much total memory as replication of entire vector in all processes

Memory requirements can be reduced by circulating portions of x or y through processes in ring fashion, with each process using each portion as it passes through, so that no process need store entire vector at once
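A sketch of the ring circulation in MPI (hypothetical: each of the p processes owns an nb = n/p slice of x and computes its nb × n row band of Z; the slices of y travel around a ring via MPI_Sendrecv_replace, so only one slice of y is resident at a time):

    #include <mpi.h>

    void outer_product_ring(const double *x_sub, double *y_buf, double *Z,
                            int nb, int n, int p, MPI_Comm comm) {
        int rank;
        MPI_Comm_rank(comm, &rank);
        int next = (rank + 1) % p, prev = (rank + p - 1) % p;
        int owner = rank;                  /* whose slice of y we hold now */

        for (int step = 0; step < p; step++) {
            /* Use the resident slice: columns owner*nb .. owner*nb + nb - 1 */
            for (int i = 0; i < nb; i++)
                for (int j = 0; j < nb; j++)
                    Z[i * n + owner * nb + j] = x_sub[i] * y_buf[j];

            /* Pass our slice to the next process, receive the previous one */
            MPI_Sendrecv_replace(y_buf, nb, MPI_DOUBLE, next, 0, prev, 0,
                                 comm, MPI_STATUS_IGNORE);
            owner = (owner + p - 1) % p;
        }
    }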


    Matrix-Vector Product

Consider matrix-vector product

    y = A x

where A is n × n matrix and x and y are n-vectors

Components of vector y are given by

    yi = Σ_{j=1}^n aij xj ,   i = 1, …, n

Each of n components requires n multiply-add operations, so model serial time as

    T1 = tc n²


    Parallel Algorithm

    Partition

For i, j = 1, …, n, fine-grain task (i, j) stores aij and computes aij xj, yielding 2-D array of n² fine-grain tasks

Assuming no replication of data, at most 2n fine-grain tasks store components of x and y, say either
    for some j, task (j, i) stores xi and task (i, j) stores yi, or
    task (i, i) stores both xi and yi, i = 1, …, n

Communicate

For j = 1, …, n, task that stores xj broadcasts it to all other tasks in jth task column

For i = 1, …, n, sum reduction over ith task row gives yi


    Fine-Grain Tasks and Communication

[Figure: 6 × 6 grid of fine-grain tasks (i, j) holding aij xj; xj is broadcast down task column j, and yi is formed by sum reduction across task row i]


    Fine-Grain Parallel Algorithm

broadcast xj to tasks (k, j), k = 1, …, n       { vertical broadcast }
yi = aij xj                                     { local scalar product }
reduce yi across tasks (i, k), k = 1, …, n      { horizontal sum reduction }


    Agglomeration

    Agglomerate

With n × n array of fine-grain tasks, natural strategies are

2-D: Combine k × k subarray of fine-grain tasks to form each coarse-grain task, yielding (n/k)² coarse-grain tasks

1-D column: Combine n fine-grain tasks in each column into coarse-grain task, yielding n coarse-grain tasks

1-D row: Combine n fine-grain tasks in each row into coarse-grain task, yielding n coarse-grain tasks


    2-D Agglomeration

Subvector of x broadcast along each task column

Each task computes local matrix-vector product of submatrix of A with subvector of x

Sum reduction along each task row produces subvector of result y


    2-D Agglomeration

[Figure: 2-D agglomeration of the matrix-vector task grid into coarse-grain tasks, each a k × k block]


    1-D Agglomeration

1-D column agglomeration

Each task computes product of its component of x times its column of matrix, with no communication required

Sum reduction across tasks then produces y

1-D row agglomeration

If x stored in one task, then broadcast required to communicate needed values to all other tasks

If x distributed across tasks, then multinode broadcast required to communicate needed values to other tasks

Each task computes inner product of its row of A with entire vector x to produce its component of y


    1-D Column Agglomeration

[Figure: 1-D column agglomeration of the matrix-vector task grid]


    1-D Row Agglomeration

[Figure: 1-D row agglomeration of the matrix-vector task grid]


    1-D Agglomeration

Column and row algorithms are dual to each other

Column algorithm begins with communication-free local saxpy computations followed by sum reduction

Row algorithm begins with broadcast followed by communication-free local sdot computations
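The duality is visible already in the two serial loop orderings of the same kernel (a sketch; A stored row-major in a flat array):

    /* Column-oriented: y += x[j] * A(:,j), a saxpy per column.  With
       columns distributed, the products need no communication; the
       partial vectors y are then combined by sum reduction. */
    void matvec_saxpy(const double *A, const double *x, double *y, int n) {
        for (int i = 0; i < n; i++) y[i] = 0.0;
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                y[i] += A[i * n + j] * x[j];
    }

    /* Row-oriented: y[i] = A(i,:) . x, an sdot per row.  With rows
       distributed, each task first needs all of x (broadcast), then
       computes its components of y with no further communication. */
    void matvec_sdot(const double *A, const double *x, double *y, int n) {
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int j = 0; j < n; j++)
                s += A[i * n + j] * x[j];
            y[i] = s;
        }
    }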


    Mapping

    Map

2-D: Assign (n/k)²/p coarse-grain tasks to each of p processes using any desired mapping in each dimension, treating target network as 2-D mesh

1-D: Assign n/p coarse-grain tasks to each of p processes using any desired mapping, treating target network as 1-D mesh


    2-D Agglomeration with Block Mapping

[Figure: 2-D agglomeration of the matrix-vector task grid with block mapping onto a √p × √p process grid]


    1-D Column Agglomeration with Block Mapping

[Figure: 1-D column agglomeration of the matrix-vector task grid with block mapping]


    1-D Row Agglomeration with Block Mapping

[Figure: 1-D row agglomeration of the matrix-vector task grid with block mapping]


    Coarse-Grain Parallel Algorithm

broadcast x[j] to jth process column        { vertical broadcast }
y[i] = A[i][j] x[j]                         { local matrix-vector product }
reduce y[i] across ith process row          { horizontal sum reduction }
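A minimal MPI sketch of one such step (hypothetical helper: √p × √p process grid, nb = n/√p, row/column communicators split as for the outer product, block A[i][j] resident on process (i, j), and subvector x[j] resident on diagonal process (j, j)):

    #include <mpi.h>
    #include <stdlib.h>

    void matvec_2d(const double *A_blk, double *x_sub, double *y_sub,
                   int nb, int i, int j,
                   MPI_Comm row_comm, MPI_Comm col_comm) {
        /* Vertical broadcast: owner (j, j) has rank j in its column communicator */
        MPI_Bcast(x_sub, nb, MPI_DOUBLE, j, col_comm);

        double *y_loc = malloc(nb * sizeof *y_loc);
        for (int r = 0; r < nb; r++) {               /* local matrix-vector product */
            y_loc[r] = 0.0;
            for (int c = 0; c < nb; c++)
                y_loc[r] += A_blk[r * nb + c] * x_sub[c];
        }

        /* Horizontal sum reduction; rooting at rank i leaves y[i] on the
           diagonal process (i, i), ready to serve as x-owner next time. */
        MPI_Reduce(y_loc, y_sub, nb, MPI_DOUBLE, MPI_SUM, i, row_comm);
        free(y_loc);
    }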


    Scalability

Time for computation phase is

    Tcomp = tc n²/p

regardless of network or agglomeration scheme

For 2-D agglomeration on 2-D mesh, each of two communication phases requires time

    (ts + tw n/√p)(√p − 1) ≈ ts √p + tw n

so total time is

    Tp ≈ tc n²/p + 2(ts √p + tw n)


    Scalability for 2-D Mesh

To determine isoefficiency function, set

    tc n² ≥ E(tc n² + 2(ts p^{3/2} + tw n p))

which holds if n = Θ(p), so isoefficiency function is Θ(p²), since T1 = Θ(n²)


    Scalability for Hypercube

Total time for hypercube is

    Tp = tc n²/p + (ts + tw n/√p) log p
       = tc n²/p + ts log p + tw n (log p)/√p

To determine isoefficiency function, set

    tc n² ≥ E(tc n² + ts p log p + tw n √p log p)

which holds for large p if n = Θ(√p log p), so isoefficiency function is Θ(p (log p)²), since T1 = Θ(n²)


    Scalability for 1-D Mesh

Depending on network, time for communication phase with 1-D agglomeration is at least

    1-D mesh:  Tcomm = (ts + tw n/p)(p − 1)
    2-D mesh:  Tcomm = (ts + tw n/p) · 2(√p − 1)
    hypercube: Tcomm = (ts + tw n/p) log p

For 1-D agglomeration on 1-D mesh, total time is at least

    Tp = tc n²/p + (ts + tw n/p)(p − 1) ≈ tc n²/p + ts p + tw n


    Scalability for 1-D Mesh

To determine isoefficiency function, set

    tc n² ≥ E(tc n² + ts p² + tw n p)

which holds if n = Θ(p), so isoefficiency function is Θ(p²), since T1 = Θ(n²)


    Matrix-Matrix Product

Consider matrix-matrix product

    C = A B

where A, B, and result C are n × n matrices

Entries of matrix C are given by

    cij = Σ_{k=1}^n aik bkj ,   i, j = 1, …, n

Each of n² entries of C requires n multiply-add operations, so model serial time as

    T1 = tc n³


    Matrix-Matrix Product

Matrix-matrix product can be viewed as
    n² inner products, or
    sum of n outer products, or
    n matrix-vector products
and each viewpoint yields different algorithm (the loop orderings sketched below make the correspondence concrete)

One way to derive parallel algorithms for matrix-matrix product is to apply parallel algorithms already developed for inner product, outer product, or matrix-vector product

We will develop parallel algorithms for this problem directly, however
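The three serial loop orderings (a sketch; each computes C = A B assuming C is zero on entry, matrices stored row-major in flat arrays):

    /* ijk: innermost loop is the inner product of row i of A with column j of B */
    void matmul_ijk(const double *A, const double *B, double *C, int n) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i * n + j] += A[i * n + k] * B[k * n + j];
    }

    /* kij: for each k, the inner loops add the outer product
       (column k of A)(row k of B) into C */
    void matmul_kij(const double *A, const double *B, double *C, int n) {
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    C[i * n + j] += A[i * n + k] * B[k * n + j];
    }

    /* jki: for each j, the inner loops form the matrix-vector product
       C(:,j) = A B(:,j), one column of the result at a time */
    void matmul_jki(const double *A, const double *B, double *C, int n) {
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                for (int i = 0; i < n; i++)
                    C[i * n + j] += A[i * n + k] * B[k * n + j];
    }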


    Parallel Algorithm

    Partition

For i, j, k = 1, …, n, fine-grain task (i, j, k) computes product aik bkj, yielding 3-D array of n³ fine-grain tasks

Assuming no replication of data, at most 3n² fine-grain tasks store entries of A, B, or C, say task (i, j, j) stores aij, task (i, j, i) stores bij, and task (i, j, k) stores cij for i, j = 1, …, n and some fixed k

[Figure: n × n × n task array with axes i, j, k]

We refer to subsets of tasks along i, j, and k dimensions as rows, columns, and layers, respectively, so kth column of A and kth row of B are stored in kth layer of tasks


    Parallel Algorithm

    Communicate

Broadcast entries of jth column of A horizontally along each task row in jth layer

Broadcast entries of ith row of B vertically along each task column in ith layer

For i, j = 1, …, n, result cij is given by sum reduction over tasks (i, j, k), k = 1, …, n


    Fine-Grain Algorithm

broadcast aik to tasks (i, q, k), q = 1, …, n       { horizontal broadcast }
broadcast bkj to tasks (q, j, k), q = 1, …, n       { vertical broadcast }
cij = aik bkj                                       { local scalar product }
reduce cij across tasks (i, j, q), q = 1, …, n      { lateral sum reduction }


    Agglomeration

    Agglomerate

With n × n × n array of fine-grain tasks, natural strategies are

3-D: Combine q × q × q subarray of fine-grain tasks

2-D: Combine q × q × n subarray of fine-grain tasks, eliminating sum reductions

1-D column: Combine n × 1 × n subarray of fine-grain tasks, eliminating vertical broadcasts and sum reductions

1-D row: Combine 1 × n × n subarray of fine-grain tasks, eliminating horizontal broadcasts and sum reductions


    Mapping

    Map

    Corresponding mapping strategies are

3-D: Assign (n/q)³/p coarse-grain tasks to each of p processes using any desired mapping in each dimension, treating target network as 3-D mesh

2-D: Assign (n/q)²/p coarse-grain tasks to each of p processes using any desired mapping in each dimension, treating target network as 2-D mesh

1-D: Assign n/p coarse-grain tasks to each of p processes using any desired mapping, treating target network as 1-D mesh


    Agglomeration with Block Mapping

[Figure: block mappings of the 1-D row, 1-D column, 2-D, and 3-D agglomerations]


Coarse-Grain 3-D Parallel Algorithm

broadcast A[i][k] to ith process row         { horizontal broadcast }
broadcast B[k][j] to jth process column      { vertical broadcast }
C[i][j] = A[i][k] B[k][j]                    { local matrix product }
reduce C[i][j] across process layers         { lateral sum reduction }


    Agglomeration with Block Mapping

For a 2 × 2 blocking, each scheme computes

    [A11 A12] [B11 B12]   [A11 B11 + A12 B21   A11 B12 + A12 B22]
    [A21 A22] [B21 B22] = [A21 B11 + A22 B21   A21 B12 + A22 B22]

[Figure: this block product shown three times, with 1-D column, 1-D row, and 2-D partitionings of the blocks among processes]


Coarse-Grain 2-D Parallel Algorithm

all-to-all bcast A[i][j] in ith process row         { horizontal broadcast }
all-to-all bcast B[i][j] in jth process column      { vertical broadcast }
C[i][j] = O
for k = 1, …, √p
    C[i][j] = C[i][j] + A[i][k] B[k][j]             { sum local products }
end
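In MPI, the two all-to-all broadcasts are MPI_Allgather calls on row and column communicators (a sketch; q = √p, blocks are nb × nb row-major, A_row and B_col are caller-provided scratch buffers of q·nb·nb doubles, and the communicators are assumed ranked so that position k of the gathered buffers holds A[i][k] and B[k][j] respectively):

    #include <mpi.h>
    #include <string.h>

    void matmul_2d(const double *A_blk, const double *B_blk, double *C_blk,
                   double *A_row, double *B_col, int nb, int q,
                   MPI_Comm row_comm, MPI_Comm col_comm) {
        int bsz = nb * nb;

        MPI_Allgather(A_blk, bsz, MPI_DOUBLE, A_row, bsz, MPI_DOUBLE, row_comm);
        MPI_Allgather(B_blk, bsz, MPI_DOUBLE, B_col, bsz, MPI_DOUBLE, col_comm);

        memset(C_blk, 0, bsz * sizeof(double));      /* C[i][j] = O */
        for (int k = 0; k < q; k++) {                /* sum local block products */
            const double *Ak = A_row + k * bsz;      /* A[i][k] */
            const double *Bk = B_col + k * bsz;      /* B[k][j] */
            for (int r = 0; r < nb; r++)
                for (int t = 0; t < nb; t++)
                    for (int c = 0; c < nb; c++)
                        C_blk[r * nb + c] += Ak[r * nb + t] * Bk[t * nb + c];
        }
    }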


    Fox Algorithm

Algorithm just described requires excessive memory, since each process accumulates √p blocks of both A and B

One way to reduce memory requirements is to
    broadcast blocks of A successively across process rows, and
    circulate blocks of B in ring fashion vertically along process columns
step by step, so that each block of B comes in conjunction with appropriate block of A broadcast at that same step

This algorithm is due to Fox et al.


Cannon Algorithm

Another approach, due to Cannon, is to circulate blocks of B vertically and blocks of A horizontally in ring fashion

Blocks of both matrices must be initially aligned using circular shifts so that correct blocks meet as needed

Requires even less memory than Fox algorithm, but trickier to program because of shifts required

Performance and scalability of Fox and Cannon algorithms are not significantly different from that of previous 2-D algorithm, but memory requirements are much less
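One possible MPI rendering of Cannon's algorithm (a sketch, not the authors' code: q × q periodic Cartesian grid from MPI_Cart_create, nb × nb blocks, C_blk zero on entry, and local_mm a block multiply-accumulate such as the triple loop shown earlier):

    #include <mpi.h>

    void cannon(double *A_blk, double *B_blk, double *C_blk, int nb, int q,
                MPI_Comm cart,
                void (*local_mm)(const double *, const double *, double *, int)) {
        int rank, coords[2], src, dst, bsz = nb * nb;
        MPI_Comm_rank(cart, &rank);
        MPI_Cart_coords(cart, rank, 2, coords);   /* coords[0] = i, coords[1] = j */

        /* Initial alignment: shift row i of A left by i positions,
           column j of B up by j positions (negative displacements). */
        MPI_Cart_shift(cart, 1, -coords[0], &src, &dst);
        MPI_Sendrecv_replace(A_blk, bsz, MPI_DOUBLE, dst, 0, src, 0,
                             cart, MPI_STATUS_IGNORE);
        MPI_Cart_shift(cart, 0, -coords[1], &src, &dst);
        MPI_Sendrecv_replace(B_blk, bsz, MPI_DOUBLE, dst, 0, src, 0,
                             cart, MPI_STATUS_IGNORE);

        for (int step = 0; step < q; step++) {
            local_mm(A_blk, B_blk, C_blk, nb);    /* C += A_blk * B_blk */
            /* Circulate: A one step left along its row, B one step up. */
            MPI_Cart_shift(cart, 1, -1, &src, &dst);
            MPI_Sendrecv_replace(A_blk, bsz, MPI_DOUBLE, dst, 0, src, 0,
                                 cart, MPI_STATUS_IGNORE);
            MPI_Cart_shift(cart, 0, -1, &src, &dst);
            MPI_Sendrecv_replace(B_blk, bsz, MPI_DOUBLE, dst, 0, src, 0,
                                 cart, MPI_STATUS_IGNORE);
        }
    }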


Scalability for 3-D Agglomeration

For 3-D agglomeration, computing each of p blocks C[i][j] requires matrix-matrix product of two (n/p^{1/3}) × (n/p^{1/3}) blocks, so

    Tcomp = tc (n/p^{1/3})³ = tc n³/p

On 3-D mesh, each broadcast or reduction takes time

    (ts + tw (n/p^{1/3})²)(p^{1/3} − 1) ≈ ts p^{1/3} + tw n²/p^{1/3}

Total time is therefore

    Tp ≈ tc n³/p + 3 ts p^{1/3} + 3 tw n²/p^{1/3}


Scalability for 3-D Agglomeration

To determine isoefficiency function, set

    tc n³ ≥ E(tc n³ + 3 ts p^{4/3} + 3 tw n² p^{2/3})

which holds for large p if n = Θ(p^{2/3}), so isoefficiency function is Θ(p²), since T1 = Θ(n³)

For hypercube, total time becomes

    Tp = tc n³/p + ts log p + tw n² (log p)/p^{2/3}

which leads to isoefficiency function of Θ(p (log p)³)


Scalability for 2-D Agglomeration

For 2-D agglomeration, computation of each block C[i][j] requires √p matrix-matrix products of (n/√p) × (n/√p) blocks, so

    Tcomp = tc √p (n/√p)³ = tc n³/p

For 2-D mesh, communication time for broadcasts along rows and columns is

    Tcomm = (ts + tw n²/p)(√p − 1) ≈ ts √p + tw n²/√p

assuming horizontal and vertical broadcasts can overlap (multiply by two otherwise)


Scalability for 2-D Agglomeration

Total time for 2-D mesh is

    Tp ≈ tc n³/p + ts √p + tw n²/√p

To determine isoefficiency function, set

    tc n³ ≥ E(tc n³ + ts p^{3/2} + tw n² √p)

which holds for large p if n = Θ(√p), so isoefficiency function is Θ(p^{3/2}), since T1 = Θ(n³)


    Scalability for 1-D Agglomeration


For 1-D agglomeration on 1-D mesh, total time is

    Tp = tc n³/p + (ts + tw n²/p)(p − 1)
       ≈ tc n³/p + ts p + tw n²

To determine isoefficiency function, set

    tc n³ ≥ E(tc n³ + ts p² + tw n² p)

which holds for large p if n = Θ(p), so isoefficiency function is Θ(p³), since T1 = Θ(n³)


    References


R. C. Agarwal, F. G. Gustavson, and M. Zubair, A high-performance matrix multiplication algorithm on a distributed memory parallel computer using overlapped communication, IBM J. Res. Dev. 38:673-681, 1994

D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice Hall, 1989

J. Berntsen, Communication efficient matrix multiplication on hypercubes, Parallel Computing 12:335-342, 1989

J. W. Demmel, M. T. Heath, and H. A. van der Vorst, Parallel numerical linear algebra, Acta Numerica 2:111-197, 1993


R. Dias da Cunha, A benchmark study based on the parallel computation of the vector outer-product A = u v^T operation, Concurrency: Practice and Experience 9:803-819, 1997

G. C. Fox, S. W. Otto, and A. J. G. Hey, Matrix algorithms on a hypercube I: matrix multiplication, Parallel Computing 4:17-31, 1987

S. L. Johnsson, Communication efficient basic linear algebra computations on hypercube architectures, J. Parallel Distrib. Comput. 4(2):133-172, 1987

O. McBryan and E. F. Van de Velde, Matrix and vector operations on hypercube parallel processors, Parallel Computing 5:117-126, 1987
