
MATRIX MULTIPLY WITH DRYAD
B649 Course Project Introduction

Matrix Multiplication
• Fundamental kernel algorithm used by many applications
• Examples: graph theory, physics, electronics

Scalability Issues
• Run on a single machine:
  • Memory overhead grows as N^2
  • CPU overhead grows as N^3
• Run on multiple machines:
  • Communication overhead grows as N^2
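For concreteness, here is a minimal C# sketch of the naïve algorithm (names are illustrative only): the three arrays hold 3·N^2 doubles, and the triple loop performs N^3 multiply-add operations.

static class NaiveMultiply
{
    // Naïve N x N matrix multiply: 3 * N^2 doubles of memory, N^3 multiply-adds.
    public static double[,] Multiply(double[,] a, double[,] b, int n)
    {
        var c = new double[n, n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
            {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += a[i, k] * b[k, j];
                c[i, j] = sum;
            }
        return c;
    }
}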

[Chart: memory overhead and CPU overhead versus matrix size N (log scale)]

Matrix Multiply Approaches

Programming model: Sequential
• Algorithm: naïve approach, tiled matrix multiply, BLAS dgemm
• Customized libraries: vendor-supplied packages (e.g., Intel or AMD BLAS), ATLAS
• User implementation: Fortran, C, C++, C#, Java

Programming model: Shared memory parallelism
• Algorithm: row partition
• Customized libraries: ATLAS (multithreaded)
• User implementation: threads, TPL, PLINQ, OpenMP

Programming model: Distributed memory parallelism
• Algorithm: row-column partition, Fox algorithm
• Customized libraries: ScaLAPACK
• User implementation: OpenMPI, Twister, Dryad

Why DryadLINQ?
• Dryad is a general-purpose runtime that supports processing of data-intensive applications on Windows
• DryadLINQ is a high-level programming language and compiler for Dryad
• Applicability:
  • Dryad transparently deals with parallelism, scheduling, fault tolerance, messaging, and workload balancing
  • SQL-like interface, based on the .NET platform, easy to write code
• Performance:
  • Intelligent job execution engine, optimized execution plan
  • Scales out to thousands of machines
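To give a flavor of that SQL-like interface, here is an ordinary in-memory LINQ query in plain C#. This is illustrative only and not the DryadLINQ distributed API, but a DryadLINQ program applies the same query operators to distributed partitioned tables, with the lambda bodies running on the vertices of a Dryad job.

using System;
using System.Linq;

class LinqFlavor
{
    static void Main()
    {
        int[] values = { 5, 3, 8, 1, 9, 2 };

        // Declarative, SQL-like query: filter, then transform.
        var evensDoubled = values.Where(v => v % 2 == 0)
                                 .Select(v => v * 2)
                                 .ToArray();

        Console.WriteLine(string.Join(", ", evensDoubled)); // 16, 4
    }
}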

Parallel Algorithms for Matrix Multiplication

• MM algorithms can deal with matrices distributed on rectangular grids

• No single algorithm always achieves the best performance across different matrix and grid shapes
• MM algorithms can be classified by their communication primitives:
  • Row partition
  • Row-column partition
  • Fox algorithm (BMR: broadcast, multiply, roll up)

Row Partition
• Heavy communication overhead
• Large memory usage per node
• The full matrix B is copied to every node
• The matrix A row blocks are distributed across the nodes

Pseudocode sample:
  Partition matrix A by rows
  Broadcast matrix B to all nodes
  Distribute the matrix A row blocks
  Compute the matrix C row blocks
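A minimal sequential C# sketch of these steps, with the nodes simulated by a loop over row blocks (names and structure are illustrative, not the actual course implementation):

using System;

static class RowPartition
{
    // Row-partition sketch: the full matrix B is available to every "node";
    // each node computes one row block of C from its row block of A.
    // Nodes are simulated here by the outer loop.
    public static double[,] Multiply(double[,] a, double[,] b, int n, int numNodes)
    {
        var c = new double[n, n];
        int blockSize = (n + numNodes - 1) / numNodes;

        for (int node = 0; node < numNodes; node++)      // one iteration per "node"
        {
            int start = node * blockSize;
            int end = Math.Min(start + blockSize, n);
            for (int i = start; i < end; i++)            // rows owned by this node
                for (int j = 0; j < n; j++)
                    for (int k = 0; k < n; k++)
                        c[i, j] += a[i, k] * b[k, j];    // reads the broadcast copy of B
        }
        return c;
    }
}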

Row Column Partition
• Heavy communication overhead
• Scheduling overhead in each iteration
• Moderate memory usage

[Diagram: matrix A is split into row blocks 1..m and matrix B into column blocks 1..n, distributed across nodes 1..n; matrix C is produced block by block, Block(i, j), over m iterations]

Pseudocode sample:
  Partition matrix A by rows
  Partition matrix B by columns
  For each iteration i:
    broadcast matrix A row block i
    distribute the matrix B column blocks
    compute the matrix C row block
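A minimal sequential C# sketch of the row-column scheme, with nodes and iterations simulated by loops (illustrative only):

using System;

static class RowColumnPartition
{
    // Row-column sketch: node j holds column block j of B; in iteration i,
    // row block i of A is "broadcast" and every node computes block C(i, j).
    // Iterations and nodes are simulated by the two outer loops.
    public static double[,] Multiply(double[,] a, double[,] b, int n, int numNodes)
    {
        var c = new double[n, n];
        int blockSize = (n + numNodes - 1) / numNodes;

        for (int i = 0; i < numNodes; i++)                // iteration i: broadcast A row block i
            for (int node = 0; node < numNodes; node++)   // node holds B column block "node"
            {
                int rowStart = i * blockSize, rowEnd = Math.Min(rowStart + blockSize, n);
                int colStart = node * blockSize, colEnd = Math.Min(colStart + blockSize, n);
                for (int r = rowStart; r < rowEnd; r++)
                    for (int col = colStart; col < colEnd; col++)
                        for (int k = 0; k < n; k++)
                            c[r, col] += a[r, k] * b[k, col];   // produces block C(i, node)
            }
        return c;
    }
}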

Fox Algorithm
[Diagram: stage one and stage two of the Fox algorithm block broadcast and roll-up]

Fox Algorithm
• Less communication overhead than the other approaches
• Scales well for large matrix sizes

Pseudocode sample:
  Partition matrices A and B into blocks
  For each iteration i:
    1) broadcast matrix A block (i%N, i%N) to block row i
    2) compute the matrix C blocks and add them to the previous results
    3) roll up the matrix B blocks
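A minimal sequential C# sketch of the block arithmetic behind these three steps, simulated on a q x q block grid (illustrative only; the real algorithm runs the grid cells in parallel and communicates the blocks between nodes):

static class FoxAlgorithm
{
    // Fox (BMR) sketch on a q x q block grid, simulated sequentially.
    // At step s, block A(i, (i+s) mod q) is "broadcast" along block row i,
    // each grid cell (i, j) multiplies it into its current B block, and the
    // B blocks then "roll up" one position along each column.
    public static double[,] Multiply(double[,] a, double[,] b, int n, int q)
    {
        var c = new double[n, n];
        int bs = n / q;                                  // block size (assumes q divides n)

        for (int s = 0; s < q; s++)                      // q broadcast/multiply/roll steps
            for (int i = 0; i < q; i++)                  // block row i
                for (int j = 0; j < q; j++)              // block column j
                {
                    int ka = (i + s) % q;                // A block (i, ka) is broadcast;
                                                         // after s roll-ups, cell (i, j) holds B(ka, j)
                    for (int r = 0; r < bs; r++)
                        for (int col = 0; col < bs; col++)
                            for (int k = 0; k < bs; k++)
                                c[i * bs + r, j * bs + col] +=
                                    a[i * bs + r, ka * bs + k] * b[ka * bs + k, j * bs + col];
                }
        return c;
    }
}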

Performance Analysis of the Fox Algorithm

[Chart: Mflops (x10^3, log scale) of MPI/Threads/CBLAS for problem sizes 2400 to 31200, comparing 1 node with 8 cores against 16 nodes with 8 cores]

[Chart: relative parallel efficiency (0 to 1.4) versus grain size per node for OpenMPI/Threads/CBLAS on 16 nodes]

• Cache issues
  • Cache misses (capacity), pollution, conflicts
  • Remedy: tiled matrix multiply (see the sketch after this list)
• Memory issues
  • Size (memory paging)
  • Bandwidth, latency
• A turning point appears in the curves around the cache size

• Absolute performance degrades as the problem size increases in both cases

• Single-node performance is worse than multi-node performance due to memory issues
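As a reference point, here is a minimal C# sketch of tiled (blocked) matrix multiply; the tile size is a tunable parameter chosen so the active tiles stay in cache, and the value 64 below is only an example.

using System;

static class TiledMultiply
{
    // Tiled (blocked) matrix multiply: working on tile x tile sub-blocks keeps
    // the three active tiles resident in cache and reduces cache misses.
    public static double[,] Multiply(double[,] a, double[,] b, int n, int tile = 64)
    {
        var c = new double[n, n];
        for (int ii = 0; ii < n; ii += tile)
            for (int kk = 0; kk < n; kk += tile)
                for (int jj = 0; jj < n; jj += tile)
                    for (int i = ii; i < Math.Min(ii + tile, n); i++)
                        for (int k = kk; k < Math.Min(kk + tile, n); k++)
                        {
                            double aik = a[i, k];        // reused across the inner j loop
                            for (int j = jj; j < Math.Min(jj + tile, n); j++)
                                c[i, j] += aik * b[k, j];
                        }
        return c;
    }
}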

Multicore-Level Parallelism
• To use every core of a compute node for a Dryad job, the task must be programmed with a multicore technology (e.g., the Task Parallel Library (TPL), threads, or PLINQ)

• Each thread computes one row or several rows of matrix C, depending on the implementation

• With TPL or PLINQ, the thread-level optimization is implicit and easier to use
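A minimal C# sketch of this row-per-iteration scheme using the Task Parallel Library (illustrative only; TPL decides how the rows are grouped onto the available cores):

using System.Threading.Tasks;

static class MulticoreMultiply
{
    // Each Parallel.For iteration computes one row of C, so threads never
    // write to the same elements and no locking is needed.
    public static double[,] Multiply(double[,] a, double[,] b, int n)
    {
        var c = new double[n, n];
        Parallel.For(0, n, i =>
        {
            for (int j = 0; j < n; j++)
            {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += a[i, k] * b[k, j];
                c[i, j] = sum;
            }
        });
        return c;
    }
}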

Timeline for the Term-Long Project
• Stage One
  • Become familiar with the HPC cluster
  • Sequential MM with C#
  • Multithreaded MM with C#
  • Performance comparison of the two approaches
• Stage Two
  • Become familiar with the DryadLINQ interface
  • Implement the row partition algorithm with DryadLINQ
  • Performance study
• Stage Three
  • Refine the experimental results
  • Report and presentation

Backup slides

Input: C# and LINQ data objects become DryadLINQ distributed data objects. DryadLINQ translates LINQ programs into distributed Dryad computations: C# methods become code running on the vertices of a Dryad job. Output: DryadLINQ distributed data objects become .NET objects.

Dryad Job Execution Flow

[Diagram: on the client machine, a .NET program builds a LINQ query expression; DryadLINQ produces a distributed query plan and submits the Dryad job to the HPC cluster, where the job manager (JM) runs the vertex code over the input tables; the output tables (Output DryadTable) are returned to the client as .NET objects consumed via ToTable and foreach]

Performance on one Node

Performance on Multiple Nodes

Analysis for three algorithms

Performance of the Three Algorithms
• Test done on 16 nodes of Tempest, using one core per node

Performance of Multithreaded MM
• Test done on one node of Tempest with 24 cores

[Chart: speed-up (0 to 25) of the TPL, Thread, and PLINQ implementations versus the scale of the square matrix, 2400 to 19200]