matrix multiplication on multidimensional torus...

24
Torus networks Algorithms Implementation Performance Conclusion Matrix multiplication on multidimensional torus networks Edgar Solomonik, and James Demmel University of California, Berkeley VECPAR, July 2012 Edgar Solomonik Split-Dimensional Cannon’s algorithm 1/ 23

Upload: others

Post on 02-Mar-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

Matrix multiplication on multidimensional torusnetworks

Edgar Solomonik, and James Demmel

University of California, Berkeley

VECPAR, July 2012

Edgar Solomonik Split-Dimensional Cannon’s algorithm 1/ 23

Page 2: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

Outline

Torus networksBlueGene architectureCollective communication

AlgorithmsSUMMACannon’s algorithmSD-Cannon’s algorithm

ImplementationOne-sided MPI communicationCharm++ virtualization

Performance

Conclusion

Edgar Solomonik Split-Dimensional Cannon’s algorithm 2/ 23

Page 3: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

BlueGene architectureCollective communication

BlueGene/P and BlueGene/Q

Direct torus networks

I BG/P is 3D, BG/Q is 5D

I Both are bidirectional networks (6 and 10 links per node)

I Injection bandwidth sufficient to saturate all links

I Topology-aware partition allocation and collectives

Edgar Solomonik Split-Dimensional Cannon’s algorithm 3/ 23

Page 4: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

BlueGene architectureCollective communication

Performance of multicast (BG/P vs Cray)

128

256

512

1024

2048

4096

8192

8 64 512 4096

Ban

dwid

th (M

B/s

ec)

#nodes

1 MB multicast on BG/P, Cray XT5, and Cray XE6

BG/PXE6XT5

Edgar Solomonik Split-Dimensional Cannon’s algorithm 4/ 23

Page 5: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

BlueGene architectureCollective communication

Why the performance discrepancy in multicasts?

I Cray machines use binomial multicastsI Form spanning tree from a list of nodesI Route copies of message down each branchI Network contention degrades utilization on a 3D torus

I BG/P uses rectangular (pipelined) multicastsI Require network topology to be a k-ary n-cubeI Form 2n edge-disjoint spanning trees

I Route in different dimensional orderI Use both directions of bidirectional network

Edgar Solomonik Split-Dimensional Cannon’s algorithm 5/ 23

Page 6: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

BlueGene architectureCollective communication

2D rectangular pipelined multicast trees

root2D 4X4 Torus Spanning tree 1 Spanning tree 2

Spanning tree 3 Spanning tree 4 All 4 trees combined

[Watts and Van De Geijn 95]

Edgar Solomonik Split-Dimensional Cannon’s algorithm 6/ 23

Page 7: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

SUMMACannon’s algorithmSD-Cannon’s algorithm

Matrix multiplication

A

BA

B

A

B

AB

[Van De Geijn and Watts 97]

Edgar Solomonik Split-Dimensional Cannon’s algorithm 7/ 23

Page 8: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

SUMMACannon’s algorithmSD-Cannon’s algorithm

SUMMA and LU with rectangular vs binomial collectives

0

10

20

30

40

50

60

70

80

MM LU LU+PVT

Per

cent

age

of m

achi

ne p

eak

Different collectives on BG/P (n=131,072, p=16,384)

binomialrectangular

Edgar Solomonik Split-Dimensional Cannon’s algorithm 8/ 23

Page 9: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

SUMMACannon’s algorithmSD-Cannon’s algorithm

Cannon’s algorithmA B

Stagger left

A[i,j] := A[i,j+1]

Shift right

A[i,j] := A[i,j-1]

Starting position

Stagger up

B[i,j] := B[i+1,j]

Shift down

B[i,j] := B[i-1,j]

Starting position

...

[Cannon 69]

Edgar Solomonik Split-Dimensional Cannon’s algorithm 9/ 23

Page 10: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

SUMMACannon’s algorithmSD-Cannon’s algorithm

Cannon’s algorithm

Advantages over SUMMAI Uses only near-neighbor sends rather than multicasts

I lower latency cost

I Can be done in-place given near-neighbor data-swaps

Disadvantages with respect to SUMMA

I Does not generalize well to non-square processor grids

I Cannot exploit multiple links via rectangular multicasts

Edgar Solomonik Split-Dimensional Cannon’s algorithm 10/ 23

Page 11: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

SUMMACannon’s algorithmSD-Cannon’s algorithm

Split-dimensional Cannon’s algorithm

Improves over Cannon by saturating all network links

I Subdivide whole multiply into 2n block outer products

I Use bidirectional links by shifting half the outer products inopposite direction

I Perform each outer product in a different dimensional order

I Accumulation of outer-products into one buffer allows forsame-sized local multiplications as pure Cannon’s algorithm

I Does not require pipelined multicasts

Edgar Solomonik Split-Dimensional Cannon’s algorithm 11/ 23

Page 12: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

SUMMACannon’s algorithmSD-Cannon’s algorithm

Split-dimensional Cannon’s algorithm

SD-Cannon on a 3-ary 6-cube

dim 1

dim 2

dim 3

Each circle corresponds to a shift along a dimension

Each color corresponds to an outer product

Edgar Solomonik Split-Dimensional Cannon’s algorithm 12/ 23

Page 13: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

One-sided MPI communicationCharm++ virtualization

MPI implementation

SD-Cannon parallel implementation using MPI

I All communication done with near-neighbor one-sided puts

I Code is simple (200 lines)

I Limited to square (k-ary n-cube) processor grids

Edgar Solomonik Split-Dimensional Cannon’s algorithm 13/ 23

Page 14: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

One-sided MPI communicationCharm++ virtualization

Charm++

Charm++ is an asynchronous dynamic runtime system

I Provides object-based (chares) virtualization (decouples fromprocess grid)

I Message-directed task invocation

I Allows topology-aware task mapping

I Provides additional features such as dynamic load balancingand performance profiling

Edgar Solomonik Split-Dimensional Cannon’s algorithm 14/ 23

Page 15: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

One-sided MPI communicationCharm++ virtualization

Virtual topology via Charm++

Implemented Cannon and SD-Cannon in Charm++

I Can map to any torus process topology

I Code more complex, but not significantly so

I Not using one-sided communication (though possible viaCkDirect)

I Virtualization lowers task granularity

Edgar Solomonik Split-Dimensional Cannon’s algorithm 15/ 23

Page 16: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

2D BlueGene/P performance

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1024 2048 4096 8192 16384

Frac

tion

of th

eore

tical

pea

k

matrix dimension

Performance on 8x8 torus partition of BG/P

SUMMAMPI SD-Cannon

Charm++ SD-CannonMPI Cannon

Charm++ Cannon

Edgar Solomonik Split-Dimensional Cannon’s algorithm 16/ 23

Page 17: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

3D BlueGene/P performance

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

4096 8192 16384 32768 65536

Frac

tion

of th

eore

tical

pea

k

matrix dimension

Performance on 8x8x8 torus partition of BG/P

SUMMACharm++ Cannon VN

Charm++ SD-Cannon VNCharm++ Cannon SMP

Charm++ SD-Cannon SMP

Edgar Solomonik Split-Dimensional Cannon’s algorithm 17/ 23

Page 18: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

Preliminary 5D BlueGene/Q performance

12

14

16

18

20

22

24

26

28

30

8192 16384 32768 65536

Tera

flops

matrix dimension

Performance on 2x4x4x2x4 torus partition of BG/Q

SUMMAMPI SD-Cannon

MPI Cannon

Edgar Solomonik Split-Dimensional Cannon’s algorithm 18/ 23

Page 19: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

K computer Tofu network

T - Torus M - Mesh

M

M

T

TTT

UnitNode

10 links per node / 4 torus dimensions / 2 mesh dimensions of length 2

Edgar Solomonik Split-Dimensional Cannon’s algorithm 19/ 23

Page 20: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

Tofu network is 5D not 6D

=

A two-by-two mesh is a 1D ring of length 4

Edgar Solomonik Split-Dimensional Cannon’s algorithm 20/ 23

Page 21: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

SD-Cannon K computer potential

5D K computer torus different from 5D BG/Q

I K computer injection bandwidth can saturate only 4/10 links

SD-Cannon could still be beneficial

I Can saturate 4 links rather than 2 (up to 2X speed-up)

I Does not require pipelined broadcast implementation

Edgar Solomonik Split-Dimensional Cannon’s algorithm 21/ 23

Page 22: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

Conclusion

SD-Cannon

I Breaches performance gap between Cannon and SUMMA

I Is uniquely asymptotically communication-optimal on a k-aryn-cube

I Virtualization allows general mapping support but incursoverhead

Topology-aware mapping and algorithm design

I Allows zero network contention

I Permits saturation of much more bandwidth on torus networks

I Pervasive for parallel scalability on high-end supercomputers

Edgar Solomonik Split-Dimensional Cannon’s algorithm 22/ 23

Page 23: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Torus networksAlgorithms

ImplementationPerformanceConclusion

Finish

Acknowledgements:

I Krell CSGF DOE fellowship (DE-AC02-06CH11357)

I Resources at Argonne National Lab

I Anonymous VECPAR reviewersI Discussions with collaborators

I James Demmel (UC Berkeley)I Jeff Hammond (Argonne National Lab)I Grey Ballard (UC Berkeley)

Also, see my talk on 2.5D algorithms at University of Tokyo

I Tuesday, July 24, 10:00-12:00

Edgar Solomonik Split-Dimensional Cannon’s algorithm 23/ 23

Page 24: Matrix multiplication on multidimensional torus networkssolomon2.web.engr.illinois.edu/talks/vecpar-2012.pdf · 2016-08-18 · Torus networks Algorithms Implementation Performance

Backup slides

Edgar Solomonik Split-Dimensional Cannon’s algorithm 24/ 23