
CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS

APPLIED PARALLEL ALGORITHMS 2

Prof. Thomas Sterling
Dr. Hartmut Kaiser
Department of Computer Science
Louisiana State University
March 18, 2011


Puzzle of the Day

• Some nice ways to get something different from what was intended:

2

if (a = 0) { … }     /* assigns 0 to a; a always equals 0, and the block is never executed */

if (0 < a < 5) { … } /* this "boolean" is always true! [think: (0 < a) < 5] */

if (a =! 0) { … }    /* a always ends up equal to 1: this compiles as (a = !0), an assignment, rather than (a != 0) or (a == !0) */
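For contrast, here is a quick sketch (not part of the original slide) of the three tests written so they mean what was presumably intended; modern compilers (e.g. gcc -Wall) warn about all three originals:

int a = 3;
if (a == 0)         { /* ... */ }  /* equality test, not assignment          */
if (0 < a && a < 5) { /* ... */ }  /* a range test needs two comparisons     */
if (a != 0)         { /* ... */ }  /* "not equal", not the assignment a = !0 */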


Topics

• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test

3



5

Parallel Matrix Processing & Locality

• Maximize locality
  – Spatial locality
    • A variable is likely to be used if neighboring data is used
    • Exploits unit or uniform stride access patterns (see the loop-ordering sketch after this list)
    • Exploits cache line length
    • Adjacent blocks minimize message traffic
      – Depends on volume-to-surface ratio
  – Temporal locality
    • A variable is likely to be reused if it was recently used
    • Exploits cache loads and LRU (least recently used) replacement policy
    • Exploits register allocation
  – Granularity
    • Maximizes length of local computation
    • Reduces number of messages
    • Maximizes length of individual messages
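To make the spatial-locality point concrete, a small illustrative C sketch (not from the original slides): the same reduction with two loop orders. C stores arrays in row-major order, so the first version walks memory with unit stride and reuses each cache line fully, while the second strides by N doubles and touches a new cache line on almost every access.

#include <stddef.h>
#define N 1024
double a[N][N];

/* Unit-stride traversal: good spatial locality in row-major C. */
double sum_row_major(void) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Stride-N traversal: same arithmetic, poor spatial locality. */
double sum_col_major(void) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}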


6

Array Decomposition

• Simple MPI example
• Master-worker data partitioning and distribution
  – Array decomposition
  – Uniformly distributes parts of the array among workers (and master)
  – A kind of static load balancing
    • Assumes equal work on equal data set sizes
• Demonstrates
  – Data partitioning
  – Data distribution
  – Coarse-grain parallel execution (no communication between tasks)
  – Reduction operator
  – Master-worker control model


7

Array Decomposition Layout

• Dimensions
  – 1 dimension: linear (dot product)
  – 2 dimensions: "2-D" (matrix operations)
  – 3 dimensions (higher-order models)
  – Impacts the surface-to-volume ratio for interprocess communication
• Distribution (a small ownership sketch follows this list)
  – Block
    • Minimizes messaging
    • Maximizes message size
  – Cyclic
    • Improves load balancing
• Memory layout
  – C vs. FORTRAN
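As a concrete illustration (hypothetical helper functions, not from the original slides), the two distributions differ only in how a global element index maps to a task:

/* n elements over p tasks; assumes p divides n for simplicity. */
int owner_block(int i, int n, int p) { return i / (n / p); }  /* block: each task owns one contiguous chunk */
int owner_cyclic(int i, int p)       { return i % p; }        /* cyclic: elements dealt out round-robin     */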


8

Array Decomposition

[Figure: the complete array is partitioned into equal chunks, one per task; the partial sums are then accumulated from each part into the final sum.]


9

Array Decomposition

Demonstrates simple data decomposition:
  – The master initializes the array and then distributes an equal portion of the array among the other tasks.
  – The other tasks receive their portion of the array and perform an addition operation on each array element.
  – Each task maintains the sum for its portion of the array.
  – The master task does likewise with its portion of the array.
  – As each of the non-master tasks finishes, it sends its updated portion of the array to the master.
  – An MPI collective communication call is used to collect the sums maintained by each task.
  – Finally, the master task displays selected parts of the final array and the global sum of all array elements.
  – Assumption: the array can be equally divided among the group.


10

Flowchart for Array Decomposition

“master”:
  Initialize MPI Environment
  Initialize Array
  Partition Array into workloads
  Send Workload to “workers”
  Calculate Sum for own array chunk
  Recv. results from “workers”
  Reduction Operator to Sum up results
  Print results
  End

“workers” (each):
  Initialize MPI Environment
  Recv. work
  Calculate Sum for array chunk
  Send Sum to “master”
  End


11

Array Decomposition (source code)

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#define ARRAYSIZE 16000000
#define MASTER 0

float data[ARRAYSIZE];

int main (int argc, char **argv)
{
  int numtasks, taskid, rc, dest, offset, i, j, tag1, tag2, source, chunksize;
  float mysum, sum;
  float update(int myoffset, int chunk, int myid);
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  if (numtasks % 4 != 0) {
    /* required for equal distribution of the workload */
    printf("Quitting. Number of MPI tasks must be divisible by 4.\n");
    rc = 1;  /* pass an explicit error code (rc was uninitialized in the original sample) */
    MPI_Abort(MPI_COMM_WORLD, rc);
    exit(0);
  }
  MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
  printf("MPI task %d has started...\n", taskid);

  chunksize = (ARRAYSIZE / numtasks);  /* workload to be processed by each task */
  tag2 = 1;
  tag1 = 2;

Source: http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c


12

Array Decomposition (source code)

  if (taskid == MASTER) {
    /* Initialize the array */
    sum = 0;
    for (i = 0; i < ARRAYSIZE; i++) {
      data[i] = i * 1.0;
      sum = sum + data[i];
    }
    printf("Initialized array sum = %e\n", sum);

    /* Send workloads to the respective tasks;
       data[0] .. data[chunksize-1] is processed by the master itself */
    offset = chunksize;
    for (dest = 1; dest < numtasks; dest++) {
      MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
      MPI_Send(&data[offset], chunksize, MPI_FLOAT, dest, tag2, MPI_COMM_WORLD);
      printf("Sent %d elements to task %d offset= %d\n", chunksize, dest, offset);
      offset = offset + chunksize;
    }

    /* Master computes the local sum for its own portion */
    offset = 0;
    mysum = update(offset, chunksize, taskid);

    /* Master receives the updated array portions computed by the workers */
    for (i = 1; i < numtasks; i++) {
      source = i;
      MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
      MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2, MPI_COMM_WORLD, &status);
    }

Source: http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c


13

Array Decomposition (source code)

    /* Reduction: the master collects the SUM of all local sums */
    MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);

    printf("Sample results: \n");
    offset = 0;
    for (i = 0; i < numtasks; i++) {
      for (j = 0; j < 5; j++)
        printf(" %e", data[offset+j]);
      printf("\n");
      offset = offset + chunksize;
    }
    printf("*** Final sum= %e ***\n", sum);
  } /* end of master section */

  if (taskid > MASTER) {
    /* Worker receives its portion of the array from the master task */
    source = MASTER;
    MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
    MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2, MPI_COMM_WORLD, &status);

    /* Each worker computes its local sum */
    mysum = update(offset, chunksize, taskid);

    /* Send the updated portion back to the master task */
    dest = MASTER;
    MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
    MPI_Send(&data[offset], chunksize, MPI_FLOAT, MASTER, tag2, MPI_COMM_WORLD);

    /* Workers contribute their local sums to the reduction */
    MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);
  } /* end of non-master */

Source: http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c


14

Array Decomposition (source code)

  MPI_Finalize();

} /* end of main */

float update(int myoffset, int chunk, int myid) {
  int i;
  float mysum;
  /* Perform addition on each of my array elements and keep my sum */
  mysum = 0;
  for (i = myoffset; i < myoffset + chunk; i++) {
    data[i] = data[i] + i * 1.0;
    mysum = mysum + data[i];
  }
  printf("Task %d mysum = %e\n", myid, mysum);
  return(mysum);
}

Source: http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c


15

Demo : Array Decomposition

[lsu00@master array_decomposition]$ mpiexec -np 4 ./array
MPI task 0 has started...
MPI task 2 has started...
MPI task 1 has started...
MPI task 3 has started...
Initialized array sum = 1.335708e+14
Sent 4000000 elements to task 1 offset= 4000000
Sent 4000000 elements to task 2 offset= 8000000
Task 1 mysum = 4.884048e+13
Sent 4000000 elements to task 3 offset= 12000000
Task 2 mysum = 7.983003e+13
Task 0 mysum = 1.598859e+13
Task 3 mysum = 1.161867e+14
Sample results:
 0.000000e+00 2.000000e+00 4.000000e+00 6.000000e+00 8.000000e+00
 8.000000e+06 8.000002e+06 8.000004e+06 8.000006e+06 8.000008e+06
 1.600000e+07 1.600000e+07 1.600000e+07 1.600001e+07 1.600001e+07
 2.400000e+07 2.400000e+07 2.400000e+07 2.400001e+07 2.400001e+07
*** Final sum= 2.608458e+14 ***

Output from arete for a 4-processor run.


Topics

• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test

16


Matrix Transpose

• The transpose of the (m × n) matrix A is the (n × m) matrix A^T formed by interchanging the rows and columns, such that row i of A becomes column i of the transposed matrix:

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \qquad A^{T} = \begin{pmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{pmatrix}$$

Examples:

$$A = \begin{pmatrix} 1 & 3 & 4 \\ 0 & 1 & 0 \end{pmatrix},\; A^{T} = \begin{pmatrix} 1 & 0 \\ 3 & 1 \\ 4 & 0 \end{pmatrix} \qquad\qquad A = \begin{pmatrix} 1 & 3 \\ 2 & 5 \end{pmatrix},\; A^{T} = \begin{pmatrix} 1 & 2 \\ 3 & 5 \end{pmatrix}$$

17


Matrix Transpose - OpenMP

18

#include <stdio.h>
#include <sys/time.h>
#include <omp.h>
#define SIZE 4

main()
{
  int i, j;
  float Matrix[SIZE][SIZE], Trans[SIZE][SIZE];

  /* Initialize source matrix */
  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      Matrix[i][j] = (i * j) * 5 + i;
  }

  /* Initialize results matrix */
  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      Trans[i][j] = 0.0;
  }


Matrix Transpose - OpenMP

19

  /* Perform transpose in parallel using omp parallel for */
#pragma omp parallel for private(j)
  for (i = 0; i < SIZE; i++)
    for (j = 0; j < SIZE; j++)
      Trans[j][i] = Matrix[i][j];

  printf("The Input Matrix Is \n");
  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      printf("%f \t", Matrix[i][j]);
    printf("\n");
  }

  printf("\nThe Transpose Matrix Is \n");
  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      printf("%f \t", Trans[i][j]);
    printf("\n");
  }
  return 0;
}


Matrix Transpose – OpenMP (DEMO)

20

[LSU760000@n01 matrix_transpose]$ ./omp_mtrans

The Input Matrix Is
0.000000    0.000000    0.000000    0.000000
1.000000    6.000000    11.000000   16.000000
2.000000    12.000000   22.000000   32.000000
3.000000    18.000000   33.000000   48.000000

The Transpose Matrix Is
0.000000    1.000000    2.000000    3.000000
0.000000    6.000000    12.000000   18.000000
0.000000    11.000000   22.000000   33.000000
0.000000    16.000000   32.000000   48.000000


Matrix Transpose - MPI

21

#include <stdio.h>
#include "mpi.h"
#define N 4

int A[N][N];

/* Initialize source matrix */
void fill_matrix()
{
  int i, j;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      A[i][j] = i * N + j;
}

void print_matrix()
{
  int i, j;
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++)
      printf("%d ", A[i][j]);
    printf("\n");
  }
}


Matrix Transpose - MPI

22

main(int argc, char* argv[])
{
  int r, i;
  MPI_Status st;
  MPI_Datatype typ;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &r);

  if (r == 0) {
    fill_matrix();
    printf("\n Source:\n");
    print_matrix();
    /* Create a custom MPI datatype covering the whole matrix to send */
    MPI_Type_contiguous(N * N, MPI_INT, &typ);
    MPI_Type_commit(&typ);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Send(&(A[0][0]), 1, typ, 1, 0, MPI_COMM_WORLD);
  }


Matrix Transpose - MPI

23

  else if (r == 1) {
    /* A vector type with N blocks of one int, strided by N ints: one column of A */
    MPI_Type_vector(N, 1, N, MPI_INT, &typ);
    /* N such columns, each displaced by sizeof(int) bytes: receiving through
       this type lays the columns out as rows */
    MPI_Type_hvector(N, 1, sizeof(int), typ, &typ);
    MPI_Type_commit(&typ);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Recv(&(A[0][0]), 1, typ, 0, 0, MPI_COMM_WORLD, &st);
    printf("\n Transposed:\n");
    print_matrix();
  }

  MPI_Finalize();
}

MPI_Type_vector creates a datatype with N blocks of length 1, strided by N elements (one matrix column); layering MPI_Type_hvector over it with a byte stride of sizeof(int) allows an on-the-fly transpose of the matrix during the receive.
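A side note, not in the original slides: MPI_Type_hvector is the deprecated MPI-1 name (removed in MPI-3.0). A sketch of the same construction with the current call:

MPI_Datatype col, xposed;
MPI_Type_vector(N, 1, N, MPI_INT, &col);                   /* one column of the N x N int matrix */
MPI_Type_create_hvector(N, 1, sizeof(int), col, &xposed);  /* pack successive columns as rows    */
MPI_Type_commit(&xposed);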


Matrix Transpose – MPI (DEMO)

24

[LSU760000@n01 matrix_transpose]$ mpiexec -np 2 ./mpi_mtrans

Source:
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15

Transposed:
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15


Topics

• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test

25


Linear Systems

Solve Ax = b, where A is an n × n matrix and b is an n × 1 column vector; shown here for n = 3:

$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + a_{13}x_3 &= b_1 \\ a_{21}x_1 + a_{22}x_2 + a_{23}x_3 &= b_2 \\ a_{31}x_1 + a_{32}x_2 + a_{33}x_3 &= b_3 \end{aligned} \qquad\Longleftrightarrow\qquad \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}$$

www.cs.princeton.edu/courses/archive/fall07/cos323/

26


Gauss-Jordan Elimination

• Fundamental operations:
  1. Replace one equation with a linear combination of other equations
  2. Interchange two equations
  3. Re-label two variables
• Combine these to reduce to a trivial system
• The simplest variant only uses operation #1, but one gets better stability by adding
  – #2, or
  – #2 and #3

www.cs.princeton.edu/courses/archive/fall07/cos323/

27


Gauss-Jordan Elimination

• Solve:

$$\begin{aligned} 2x_1 + 3x_2 &= 7 \\ 4x_1 + 5x_2 &= 13 \end{aligned}$$

• Can be represented as the augmented matrix

$$\left(\begin{array}{cc|c} 2 & 3 & 7 \\ 4 & 5 & 13 \end{array}\right)$$

• Goal: reduce the LHS to an identity matrix, leaving the solutions in the RHS:

$$\left(\begin{array}{cc|c} 1 & 0 & ? \\ 0 & 1 & ? \end{array}\right)$$

www.cs.princeton.edu/courses/archive/fall07/cos323/

28


Gauss-Jordan Elimination

• Basic operation 1: replace any row by a linear combination with any other row.

Replace row 1 with 1/2 · row 1 + 0 · row 2:

$$\left(\begin{array}{cc|c} 2 & 3 & 7 \\ 4 & 5 & 13 \end{array}\right) \;\xrightarrow{\;R_1 \leftarrow R_1/2\;}\; \left(\begin{array}{cc|c} 1 & 3/2 & 7/2 \\ 4 & 5 & 13 \end{array}\right)$$

• Replace row 2 with row 2 − 4 · row 1:

$$\left(\begin{array}{cc|c} 1 & 3/2 & 7/2 \\ 4 & 5 & 13 \end{array}\right) \;\xrightarrow{\;R_2 \leftarrow R_2 - 4R_1\;}\; \left(\begin{array}{cc|c} 1 & 3/2 & 7/2 \\ 0 & -1 & -1 \end{array}\right)$$

• Negate row 2:

$$\left(\begin{array}{cc|c} 1 & 3/2 & 7/2 \\ 0 & -1 & -1 \end{array}\right) \;\xrightarrow{\;R_2 \leftarrow -R_2\;}\; \left(\begin{array}{cc|c} 1 & 3/2 & 7/2 \\ 0 & 1 & 1 \end{array}\right)$$

www.cs.princeton.edu/courses/archive/fall07/cos323/

29


Gauss-Jordan Elimination

• Replace row 1 with row 1 − 3/2 · row 2:

$$\left(\begin{array}{cc|c} 1 & 3/2 & 7/2 \\ 0 & 1 & 1 \end{array}\right) \;\xrightarrow{\;R_1 \leftarrow R_1 - (3/2)R_2\;}\; \left(\begin{array}{cc|c} 1 & 0 & 2 \\ 0 & 1 & 1 \end{array}\right)$$

• Solution: x1 = 2, x2 = 1

www.cs.princeton.edu/courses/archive/fall07/cos323/

30
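The procedure above is mechanical enough to code directly. A minimal C sketch (illustrative, not from the original slides), assuming a nonsingular system and skipping pivoting:

#include <stdio.h>

/* Gauss-Jordan on an n x (n+1) augmented matrix: reduce the left block to the
   identity; column n then holds the solution. Assumes a[i][i] != 0 at every step. */
void gauss_jordan(int n, double a[n][n+1]) {
    for (int i = 0; i < n; i++) {
        double piv = a[i][i];
        for (int k = i; k <= n; k++)       /* scale row i so that a[i][i] == 1 */
            a[i][k] /= piv;
        for (int j = 0; j < n; j++) {      /* eliminate column i in all other rows */
            if (j == i) continue;
            double f = a[j][i];
            for (int k = i; k <= n; k++)
                a[j][k] -= f * a[i][k];
        }
    }
}

int main(void) {
    /* the system from the slides: 2x1 + 3x2 = 7, 4x1 + 5x2 = 13 */
    double a[2][3] = { {2, 3, 7}, {4, 5, 13} };
    gauss_jordan(2, a);
    printf("x1 = %g, x2 = %g\n", a[0][2], a[1][2]);  /* expect x1 = 2, x2 = 1 */
    return 0;
}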


Pivoting

• Consider this system:

$$\begin{pmatrix} 0 & 1 \\ 2 & 3 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 2 \\ 8 \end{pmatrix}$$

• We immediately run into a problem: the algorithm wants us to divide by zero!
• More subtle version, with a tiny pivot instead of a zero one:

$$\begin{pmatrix} 0.0001 & 1 \\ 2 & 3 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 2 \\ 8 \end{pmatrix}$$

• The pivot (or pivot element) is the element of a matrix which is selected first by an algorithm to do computation
• The pivot entry is usually required to be at least distinct from zero, and often distant from it
• Select the largest element in the matrix and swap columns and rows to bring this element to the 'right' position: full (complete) pivoting

www.cs.princeton.edu/courses/archive/fall07/cos323/

31


Pivoting

• Consider this system:

$$\left(\begin{array}{cc|c} 0 & 1 & 1 \\ 3 & 2 & 8 \end{array}\right)$$

• Pivoting:
  – Swap rows 1 and 2:

$$\left(\begin{array}{cc|c} 3 & 2 & 8 \\ 0 & 1 & 1 \end{array}\right)$$

  – And continue to solve as shown before:

$$\left(\begin{array}{cc|c} 1 & 2/3 & 8/3 \\ 0 & 1 & 1 \end{array}\right) \;\longrightarrow\; \left(\begin{array}{cc|c} 1 & 0 & 2 \\ 0 & 1 & 1 \end{array}\right) \qquad x_1 = 2,\; x_2 = 1$$

www.cs.princeton.edu/courses/archive/fall07/cos323/

32


Pivoting: Example

• Division by small numbers → round-off error in computer arithmetic
• Consider the following system:

$$\begin{pmatrix} 0.0001 & 1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix} \qquad \begin{aligned} 0.0001\,x_1 + x_2 &= 1.000 \\ x_1 + x_2 &= 2.000 \end{aligned}$$

• Exact solution: x1 = 1.0001 and x2 = 0.9999
• Say we round off after 3 digits after the decimal point
• Multiply the first equation by 10^4 and subtract it from the second equation:
  (1 − 1)x1 + (1 − 10^4)x2 = 2 − 10^4
• But, in finite precision with only 3 digits:
  – 1 − 10^4 = −0.9999 E+4 ≈ −0.999 E+4
  – 2 − 10^4 = −0.9998 E+4 ≈ −0.999 E+4
• Therefore, x2 = 1 and x1 = 0 (from the first equation)
• Very far from the real solution!

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

33


Partial Pivoting

• Partial pivoting doesn't look for the largest element in the matrix, but just for the largest element in the 'current' column
• Swap rows to bring the corresponding row to the 'right' position
• Partial pivoting is generally sufficient to adequately reduce round-off error
• Complete pivoting is usually not necessary to ensure numerical stability
• Due to the additional computations it introduces, complete pivoting may not always be the most appropriate strategy

34

http://www.amath.washington.edu/~bloss/amath352_lectures/


Partial Pivoting

• One can just swap rows:
  x1 + x2 = 2.000
  0.0001x1 + x2 = 1.000
• Multiplying the first equation by 0.0001 and subtracting it from the second gives:
  (1 − 0.0001)x2 = 1 − 0.0001 · 2
  0.9999 x2 = 0.9998  =>  x2 = 1 (to 3 digits), and then x1 = 1
• The final solution is much closer to the real solution (a single-precision demo follows below).
• Partial pivoting
  – For numerical stability, one doesn't go in order, but picks the next row among rows i to n that has the largest element in column i
  – This row is swapped with row i (along with the elements of the right-hand side) before the subtractions
    • the swap is not done in memory but rather one keeps an indirection array
• Total pivoting
  – Look for the greatest element ANYWHERE in the matrix
  – Swap columns
  – Swap rows
• Numerical stability is really a difficult field

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

35
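The effect is easy to reproduce. A small single-precision C sketch (not from the original slides): eliminating with the tiny pivot first amplifies the round-off in x2 by roughly 1/pivot during back-substitution, while the row-swapped (partially pivoted) order stays near machine precision.

#include <stdio.h>

/* Eliminate x1 from a11*x1 + a12*x2 = b1, a21*x1 + a22*x2 = b2, then back-substitute. */
static void solve2(float a11, float a12, float b1,
                   float a21, float a22, float b2) {
    float m  = a21 / a11;                        /* multiplier: huge when the pivot a11 is tiny */
    float x2 = (b2 - m * b1) / (a22 - m * a12);
    float x1 = (b1 - a12 * x2) / a11;            /* dividing by the tiny pivot magnifies the error in x2 */
    printf("x1 = %.6f, x2 = %.6f\n", x1, x2);
}

int main(void) {
    /* exact solution: x1 = 1.0001, x2 = 0.9999 */
    printf("tiny pivot first: ");
    solve2(0.0001f, 1.0f, 1.0f,  1.0f, 1.0f, 2.0f);
    printf("rows swapped:     ");
    solve2(1.0f, 1.0f, 2.0f,  0.0001f, 1.0f, 1.0f);
    return 0;
}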


Partial Pivoting

36

http://www.amath.washington.edu/~bloss/amath352_lectures/


Special Cases

• Common special cases:
• Tri-diagonal systems:
  – Only the main diagonal plus one diagonal above and one below are nonzero
  – Solve using: Gauss-Jordan
• Lower triangular systems (L)
  – Solve using: forward substitution
• Upper triangular systems (U)
  – Solve using: backward substitution

Tri-diagonal:

$$\begin{pmatrix} a_{11} & a_{12} & 0 & 0 \\ a_{21} & a_{22} & a_{23} & 0 \\ 0 & a_{32} & a_{33} & a_{34} \\ 0 & 0 & a_{43} & a_{44} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix}$$

Lower triangular, solved by forward substitution:

$$\begin{pmatrix} a_{11} & 0 & 0 & 0 \\ a_{21} & a_{22} & 0 & 0 \\ a_{31} & a_{32} & a_{33} & 0 \\ a_{41} & a_{42} & a_{43} & a_{44} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix} \qquad x_1 = \frac{b_1}{a_{11}},\quad x_2 = \frac{b_2 - a_{21}x_1}{a_{22}},\quad x_3 = \frac{b_3 - a_{31}x_1 - a_{32}x_2}{a_{33}},\ \ldots$$

Upper triangular, solved by backward substitution:

$$\begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} & a_{15} \\ 0 & a_{22} & a_{23} & a_{24} & a_{25} \\ 0 & 0 & a_{33} & a_{34} & a_{35} \\ 0 & 0 & 0 & a_{44} & a_{45} \\ 0 & 0 & 0 & 0 & a_{55} \end{pmatrix}\begin{pmatrix} x_1 \\ \vdots \\ x_5 \end{pmatrix} = \begin{pmatrix} b_1 \\ \vdots \\ b_5 \end{pmatrix} \qquad x_5 = \frac{b_5}{a_{55}},\quad x_4 = \frac{b_4 - a_{45}x_5}{a_{44}},\ \ldots$$

www.cs.princeton.edu/courses/archive/fall07/cos323/

37
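A direct C transcription of the substitution formulas above (an illustrative sketch, assuming nonzero diagonal entries):

/* Forward substitution for a lower triangular system L x = b. */
void forward_subst(int n, double L[n][n], double b[n], double x[n]) {
    for (int i = 0; i < n; i++) {
        double s = b[i];
        for (int j = 0; j < i; j++)
            s -= L[i][j] * x[j];    /* subtract the already-computed unknowns */
        x[i] = s / L[i][i];
    }
}

/* Backward substitution for an upper triangular system U x = b. */
void backward_subst(int n, double U[n][n], double b[n], double x[n]) {
    for (int i = n - 1; i >= 0; i--) {
        double s = b[i];
        for (int j = i + 1; j < n; j++)
            s -= U[i][j] * x[j];
        x[i] = s / U[i][i];
    }
}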


Topics

• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test

38


Solving Linear Systems of Eq.

• Methods for solving linear systems
  – The need to solve linear systems arises in an estimated 75% of all scientific computing problems [Dahlquist 1974]
• Gaussian Elimination is perhaps the most well-known method
  – based on the fact that the solution of a linear system is invariant under scaling and under row additions
    • One can multiply a row of the matrix by a constant, as long as one multiplies the corresponding element of the right-hand side by the same constant
    • One can add a row of the matrix to another one, as long as one adds the corresponding elements of the right-hand side
  – Idea: scale and add equations so as to transform matrix A into an upper triangular matrix:

[Figure: a dense system Ax = b is transformed into an upper triangular system; equation n−i then has i unknowns.]

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

39


Gaussian Elimination

$$\begin{pmatrix} 1 & 1 & 1 \\ 1 & -2 & 2 \\ 1 & 2 & -1 \end{pmatrix} x = \begin{pmatrix} 0 \\ 4 \\ 2 \end{pmatrix} \;\xrightarrow{\text{subtract row 1 from rows 2 and 3}}\; \begin{pmatrix} 1 & 1 & 1 \\ 0 & -3 & 1 \\ 0 & 1 & -2 \end{pmatrix} x = \begin{pmatrix} 0 \\ 4 \\ 2 \end{pmatrix}$$

$$\xrightarrow{\text{multiply row 3 by 3 and add row 2}}\; \begin{pmatrix} 1 & 1 & 1 \\ 0 & -3 & 1 \\ 0 & 0 & -5 \end{pmatrix} x = \begin{pmatrix} 0 \\ 4 \\ 10 \end{pmatrix}$$

Solving the equations in reverse order (backsolving):

$$-5x_3 = 10 \;\Rightarrow\; x_3 = -2; \qquad -3x_2 + x_3 = 4 \;\Rightarrow\; x_2 = -2; \qquad x_1 + x_2 + x_3 = 0 \;\Rightarrow\; x_1 = 4$$

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

40


Gaussian Elimination

• The algorithm goes through the matrix from the top-left corner to the bottom-right corner
• The ith step eliminates the non-zero sub-diagonal elements in column i, subtracting the ith row scaled by a_ji/a_ii from row j, for j = i+1, ..., n

[Figure: at step i, the rows above the pivot row hold values already computed; the sub-diagonal part of column i is to be zeroed; the trailing submatrix holds values yet to be updated.]

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

41


Sequential Gaussian Elimination

Simple sequential algorithm (a runnable C version follows below):

// for each column i
// zero it out below the diagonal by adding
// multiples of row i to later rows
for i = 1 to n-1
  // for each row j below row i
  for j = i+1 to n
    // add a multiple of row i to row j
    for k = i to n
      A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)

• Several "tricks" that do not change the spirit of the algorithm but make implementation easier and/or more efficient
  – The right-hand side is typically kept in column n+1 of the matrix, and one speaks of an augmented matrix
  – Compute the A(j,i)/A(i,i) term outside of the inner loop

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

42
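For concreteness, a C translation of the pseudocode above (0-based indexing, augmented matrix, no pivoting), with the multiplier hoisted out of the inner loop as the second "trick" suggests; an illustrative sketch only:

/* Reduce the n x (n+1) augmented matrix A to upper triangular form. */
void gaussian_eliminate(int n, double A[n][n+1]) {
    for (int i = 0; i < n - 1; i++) {          /* for each column i          */
        for (int j = i + 1; j < n; j++) {      /* for each row j below row i */
            double m = A[j][i] / A[i][i];      /* computed once per row      */
            for (int k = i; k <= n; k++)       /* includes the RHS column n  */
                A[j][k] -= m * A[i][k];
        }
    }
}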


Parallel Gaussian Elimination?

• Assume that we have one processor per matrix element. Each step of the algorithm then decomposes into:
  – Reduction: find the max a_ji (to select the pivot)
  – Broadcast: the max a_ji is needed to compute the scaling factor
  – Compute: independent computation of the scaling factor
  – Broadcasts: every update needs the scaling factor and the element from the pivot row
  – Compute: independent computations

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

43


LU Factorization

• Gaussian Elimination is simple, but:
  – What if we have to solve many Ax = b systems for different values of b?
    • This happens a LOT in real applications
• Another method is "LU Factorization" (LU Decomposition)
• Ax = b
• Say we could rewrite A = L U, where L is a lower triangular matrix and U is an upper triangular matrix: O(n^3)
• Then Ax = b is written L U x = b
• Solve L y = b: O(n^2) — a lower triangular system; equation i has i unknowns (forward substitution)
• Solve U x = y: O(n^2) — an upper triangular system; equation n−i has i unknowns (backward substitution)
• Triangular system solves are easy

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

44


LU Factorization: Principle

• It works just like Gaussian Elimination, but instead of zeroing out elements, one "saves" the scaling coefficients.
• Magically, A = L × U!
• Should be done with pivoting as well

$$\begin{pmatrix} 1 & 2 & -1 \\ 4 & 3 & 1 \\ 2 & 2 & 3 \end{pmatrix} \xrightarrow[\text{save the factor 4}]{\text{eliminate } a_{21}} \begin{pmatrix} 1 & 2 & -1 \\ \mathbf{4} & -5 & 5 \\ 2 & 2 & 3 \end{pmatrix} \xrightarrow[\text{save the factor 2}]{\text{eliminate } a_{31}} \begin{pmatrix} 1 & 2 & -1 \\ \mathbf{4} & -5 & 5 \\ \mathbf{2} & -2 & 5 \end{pmatrix} \xrightarrow[\text{save the factor 2/5}]{\text{eliminate } a_{32}} \begin{pmatrix} 1 & 2 & -1 \\ \mathbf{4} & -5 & 5 \\ \mathbf{2} & \mathbf{2/5} & 3 \end{pmatrix}$$

$$L = \begin{pmatrix} 1 & 0 & 0 \\ 4 & 1 & 0 \\ 2 & 2/5 & 1 \end{pmatrix} \qquad U = \begin{pmatrix} 1 & 2 & -1 \\ 0 & -5 & 5 \\ 0 & 0 & 3 \end{pmatrix}$$

(the saved scaling factors, in bold, form the strict lower triangle of L)

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

45


LU Factorization

• We're going to look at the simplest possible version
  – No pivoting: it just creates a bunch of indirections that are easy but make the code look complicated without changing the overall principle

LU-sequential(A, n) {
  for k = 0 to n-2 {
    // preparing column k: store the scaling factors
    for i = k+1 to n-1
      a_ik ← -a_ik / a_kk
    // Task T_kj: update of column j
    for j = k+1 to n-1
      for i = k+1 to n-1
        a_ij ← a_ij + a_ik * a_kj
  }
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

46
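A C rendering of LU-sequential (an illustrative sketch, written with the positive-multiplier convention of the worked example on slide 45; the pseudocode's negate-and-add form is equivalent). After the loop, U occupies the upper triangle of A, diagonal included, and the strict lower triangle holds the saved factors of L (whose diagonal is implicitly 1):

/* In-place LU factorization without pivoting. */
void lu_sequential(int n, double a[n][n]) {
    for (int k = 0; k < n - 1; k++) {
        for (int i = k + 1; i < n; i++)
            a[i][k] /= a[k][k];               /* save the scaling factor in place */
        for (int j = k + 1; j < n; j++)       /* update the trailing submatrix    */
            for (int i = k + 1; i < n; i++)
                a[i][j] -= a[i][k] * a[k][j];
    }
}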


LU Factorization

• We’re going to look at the simplest possible version– No pivoting: just creates a bunch of indirections that are easy but make

the code look complicated without changing the overall principle

LU-sequential(A,n) { for k = 0 to n-2 { // preparing column k for i = k+1 to n-1 aik -aik / akk

for j = k+1 to n-1 // Task Tkj: update of column j for i=k+1 to n-1 aij aij + aik * akj

}}

k

ij

k

update

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

47


Parallel LU on a ring

• Since the algorithm operates by columns from left to right, we should distribute columns to processors
• Principle of the algorithm
  – At each step, the processor that owns column k does the "prepare" task and then broadcasts the bottom part of column k to all others
    • Annoying if the matrix is stored in row-major fashion
    • Remember that one is free to store the matrix in any way one wants, as long as it's coherent and the right output is generated
  – After the broadcast, the other processors can then update their data
• Assume there is a function alloc(k) that returns the rank of the processor that owns column k
  – Basically so that we don't clutter our program with too many global-to-local index translations
• In fact, we will first write everything in terms of global indices, so as to avoid all annoying index arithmetic

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

48


LU-broadcast algorithm

LU-broadcast(A, n) {
  q ← MY_NUM()
  p ← NUM_PROCS()
  for k = 0 to n-2 {
    if (alloc(k) == q)
      // preparing column k
      for i = k+1 to n-1
        buffer[i-k-1] ← a_ik ← -a_ik / a_kk
    broadcast(alloc(k), buffer, n-k-1)
    for j = k+1 to n-1
      if (alloc(j) == q)
        // update of column j
        for i = k+1 to n-1
          a_ij ← a_ij + buffer[i-k-1] * a_kj
  }
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

49


Dealing with local indices

• Assume that p divides n
• Each processor needs to store r = n/p columns; its local indices go from 0 to r-1
• After step k, only columns with indices greater than k will be used
• Simple idea: use a local index l that everyone initializes to 0
• At step k, processor alloc(k) increases its local index so that next time it will point to its next local column

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

50
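With a cyclic distribution, alloc(k) and the local index also have closed forms; a small sketch (assuming p divides n, not from the original slides):

/* Global column k lives on processor k mod p, at local index k / p. */
int alloc(int k, int p)       { return k % p; }
int local_index(int k, int p) { return k / p; }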


LU-broadcast algorithm

...
double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (alloc(k) == q) {
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
    l ← l+1
  }
  broadcast(alloc(k), buffer, n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

51


Bad load balancing

[Figure: with a block distribution of columns over processors P1–P4, the leftmost processors' columns are already done and those processors sit idle, while the processor owning the rightmost block is still working on it.]

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

52


Good Load Balancing?

[Figure: with a cyclic distribution of columns, already-done columns and the columns being worked on interleave across all processors, so everyone stays busy.]

Cyclic distribution

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

53


Load-balanced program

...
double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (k mod p == q) {
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
    l ← l+1
  }
  broadcast(alloc(k), buffer, n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

54


Performance Analysis

• How long does this code take to run?
  – This is not an easy question, because there are many tasks and many communications
• A little bit of analysis shows that the execution time is the sum of three terms:
  – n−1 communications: nL + (n^2/2)b + O(1)
  – n−1 column preparations: (n^2/2)w' + O(1)
  – column updates: (n^3/3p)w + O(n^2)
• Therefore, the execution time is O(n^3/p)
  – Note that the sequential time is O(n^3)
• Therefore, we have perfect asymptotic efficiency!
  – This is good, but isn't always the best in practice
• How can we improve this algorithm?

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

55


Pipelining on the Ring

• So far, the algorithm has used a simple broadcast
• Nothing was specific to being on a ring of processors, and it's portable
  – in fact, you could write raw MPI that looks just like our pseudo-code and have a very limited LU factorization (inefficient for small n) that works only for some numbers of processors
• But it's not efficient
  – The n−1 communication steps are not overlapped with computations
  – Therefore Amdahl's law, etc.
• It turns out that on a ring, with a cyclic distribution of the columns, one can interleave pieces of the broadcast with the computation
  – It almost looks like inserting the source code from the broadcast code we saw at the very beginning throughout the LU code

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

56


Previous program

...
double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (k mod p == q) {
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
    l ← l+1
  }
  broadcast(alloc(k), buffer, n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

57


LU-pipeline algorithm

double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (k mod p == q) {
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
    l ← l+1
    send(buffer, n-k-1)
  } else {
    recv(buffer, n-k-1)
    if (q ≠ (k-1) mod p)   // forward along the ring unless I am the last processor before the owner
      send(buffer, n-k-1)
  }
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

58


Topics

• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test

59


Summary : Material for the Test

• Matrix Transpose: Slides 17-23
• Gauss-Jordan: Slides 26-30
• Pivoting: Slides 31-37
• Special Cases (forward & backward substitution): Slide 37
• LU Decomposition: Slides 44-58

60
