a concurrent matrix transpose algorithm

A Concurrent Matrix Transpose Algorithm
Pourya Jafari


Post on 07-Jan-2016



TRANSCRIPT

Page 1: A Concurrent Matrix Transpose Algorithm

A Concurrent Matrix Transpose Algorithm

Pourya Jafari

Page 2: A Concurrent Matrix Transpose Algorithm

Application

Frequently used linear algebra operation
Scientific applications
FFT
Matrix multiplication

Page 3: A Concurrent Matrix Transpose Algorithm

Transpose Matrix

$B_{ij}$: item/cell at row i and column j of matrix B
For all i, j we have $(B^T)_{ij} = B_{ji}$
Simply exchange rows and columns
For simplicity we only consider square matrices: N rows and N columns, labeled 0 to N-1

Page 4: A Concurrent Matrix Transpose Algorithm

An Example

Each cell is filled with its row|column number
6 swaps: (4*4 - 4)/2 = 6
In general, for a square matrix of size N we have $(N^2 - N)/2$ swaps

00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33
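The swap count can be checked with a minimal sequential sketch. The function name `transpose_in_place` and the string labels are illustrative choices, not taken from the slides.

```python
def transpose_in_place(matrix):
    """Transpose a square matrix by swapping cell (i, j) with cell (j, i).

    Returns the number of swaps performed.
    """
    n = len(matrix)
    swaps = 0
    for i in range(n):
        for j in range(i + 1, n):  # only cells strictly above the diagonal
            matrix[i][j], matrix[j][i] = matrix[j][i], matrix[i][j]
            swaps += 1
    return swaps


# Build the 4x4 example: each cell holds its own "row|column" label.
n = 4
example = [[f"{i}{j}" for j in range(n)] for i in range(n)]
swap_count = transpose_in_place(example)
print(swap_count)  # (4*4 - 4) / 2 swaps
```

Diagonal cells never move, which is why N of the N*N cells drop out of the count and the remainder is halved.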

Page 5: A Concurrent Matrix Transpose Algorithm

Parallelizing

Naïve algorithm: a thread for each swap
Quadratic number of threads
Quadratic number of communication links → impractical

00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33

Page 6: A Concurrent Matrix Transpose Algorithm

Parallelizing - 2

More efficient way: assign a column to each thread
O(N) threads
Communication links? Depends on the approach

00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33

Page 7: A Concurrent Matrix Transpose Algorithm

Measure dislocation

A single swap operation can be expressed as row and column shifts
For column shift length K: i = j + K → K = i - j
Shift length is i - j; taken modulo N, its value range is from 0 to N-1

00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33
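As a small illustration of the shift length, the sketch below tabulates K for every cell of the 4x4 example. It assumes shifts are cyclic (taken modulo N), which the slide implies by giving the range 0 to N-1. The helper name `column_shift_length` is hypothetical.

```python
def column_shift_length(i, j, n):
    """Cyclic shift K that moves column index j to column index i.

    Assumption: shifts wrap around, so K satisfies (j + K) % n == i.
    """
    return (i - j) % n


n = 4
# shifts[i][j] is the shift length for the cell at row i, column j.
shifts = [[column_shift_length(i, j, n) for j in range(n)] for i in range(n)]
for row in shifts:
    print(row)
```

Every row of the table is the same sequence rotated by one position, so every shift length from 0 to N-1 occurs exactly once per row.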

Page 8: A Concurrent Matrix Transpose Algorithm

Concurrency Scheme

Minimize communication
Pre-process inside thread: shift each row
Intra-process/thread communication: shift each column
Post-process inside thread: shift each row again

00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33

Page 9: A Concurrent Matrix Transpose Algorithm

Concurrency Scheme - 2

We have the row shifts fixed based on row index
Has range 0 to N-1, consistent with our initial finding
Now arrange the rows so that the column shifts get us to i
$i - L = i'$; $L + i' = i$; $L = -j$
So we shift each column j cells up

Page 10: A Concurrent Matrix Transpose Algorithm

Steps so far

1 → 2: Column shift j up
2 → 3: Row shift based on row indices
3 → 4: ?

Change of indices so far: (i - j, j) → (i - j, i - j + j) → (i - j, i) = (m, n)
One operation to change row index to j: n - m = (i - (i - j)) = j

00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33

00 11 22 33
10 21 32 03
20 31 02 13
30 01 12 23

00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33

00 11 22 33
03 10 21 32
02 13 20 31
01 12 23 30

00 10 20 30
01 11 21 31
02 12 22 32
03 13 23 33

(1) (2-a) (2-b) (3) (4)
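The three steps can be traced on the 4x4 example with a short sequential sketch. It reads the unnamed step 3 → 4 as the column-local move from row m to row (n - m) mod N, following the index arithmetic on this slide. That reading and all function names are assumptions, and no threads are used here.

```python
def shift_columns_up(matrix):
    """Step 1 -> 2: cyclically shift column j up by j cells."""
    n = len(matrix)
    # The cell landing in row i of column j comes from row (i + j) mod n.
    return [[matrix[(i + j) % n][j] for j in range(n)] for i in range(n)]


def shift_rows_right(matrix):
    """Step 2 -> 3: cyclically shift row m right by m cells."""
    n = len(matrix)
    # The cell landing in column c of row m comes from column (c - m) mod n.
    return [[matrix[m][(c - m) % n] for c in range(n)] for m in range(n)]


def reindex_rows_within_columns(matrix):
    """Step 3 -> 4 (assumed): in column c, move row m to row (c - m) mod n."""
    n = len(matrix)
    result = [[None] * n for _ in range(n)]
    for m in range(n):
        for c in range(n):
            result[(c - m) % n][c] = matrix[m][c]
    return result


n = 4
original = [[f"{i}{j}" for j in range(n)] for i in range(n)]
after_columns = shift_columns_up(original)
after_rows = shift_rows_right(after_columns)
result = reindex_rows_within_columns(after_rows)

# Reference transpose for comparison.
expected = [[original[j][i] for j in range(n)] for i in range(n)]
print(result == expected)
```

In this sketch the first and last steps touch one column at a time, so with one thread per column only the row shift in the middle would need data from other threads.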

Page 11: A Concurrent Matrix Transpose Algorithm

Efficiency of algorithm so far

O(N) row and column operations
O(N^2) overall considering both rows and columns
O(N) communication links
Communication is a major bottleneck
Group row shifts
Reduce communication and overall complexity

Page 12: A Concurrent Matrix Transpose Algorithm

Radix Representation

Radix r: base-r numbers
For k: each digit place (starting from the least significant, LS)
For l: steps from 0 to r-1
Group all row shifts for the current step
Radix 3: possible digits 0, 1 and 2
Second loop { For l=0 to 2 }: shift all numbers that have l in their k-th digit place l*r^k to the right
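The two loops above can be sketched as follows, assuming the number being decomposed is the row index (the shift length of that row) and that shifts are cyclic. The names `rotate_right`, `digit`, and `radix_grouped_row_shifts` are hypothetical, and the sketch runs sequentially.

```python
def rotate_right(row, s):
    """Cyclically shift a list s places to the right."""
    s %= len(row)
    return row[-s:] + row[:-s] if s else list(row)


def digit(value, k, r):
    """k-th base-r digit of value, counted from the least significant digit."""
    return (value // r**k) % r


def radix_grouped_row_shifts(matrix, r):
    """Shift row m right by m cells as a series of grouped partial shifts."""
    n = len(matrix)
    rows = [list(row) for row in matrix]
    rounds = 0
    k = 0
    while r**k < n:                # outer loop: one pass per digit place
        for l in range(1, r):      # inner loop: l = 0 moves nothing, so skip it
            step = l * r**k
            for m in range(n):     # every row whose k-th digit is l shifts together
                if digit(m, k, r) == l:
                    rows[m] = rotate_right(rows[m], step)
            rounds += 1
        k += 1
    return rows, rounds


n = 9
matrix = [[f"{i}{j}" for j in range(n)] for i in range(n)]
grouped, rounds = radix_grouped_row_shifts(matrix, r=3)
direct = [rotate_right(matrix[m], m) for m in range(n)]
print(grouped == direct, rounds)
```

The partial shifts of a row add up to its row index because the digits times their place values reconstruct the index. For N = 9 and r = 3 the sketch needs 4 grouped rounds, against 8 separate nonzero row shifts.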

Page 13: A Concurrent Matrix Transpose Algorithm

Special Case: Radix-2

Two steps only: 0 and 1
We only shift for 1
Digits are the bit representation
Shift all row indices that have their k-th bit on

[Figure: shift for each row (rows 0 to 3) = shift at k=0 + shift at k=1]
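For radix 2 the digit test becomes a bit test. The sketch below splits each row's shift into its per-bit parts for N = 4, then applies them round by round. It assumes cyclic right shifts, and the names are illustrative.

```python
def rotate_right(row, s):
    """Cyclically shift a list s places to the right."""
    s %= len(row)
    return row[-s:] + row[:-s] if s else list(row)


def radix2_row_shifts(matrix):
    """Shift row m right by m cells, one round per bit position."""
    n = len(matrix)
    rows = [list(row) for row in matrix]
    k = 0
    while (1 << k) < n:
        for m in range(n):
            if (m >> k) & 1:  # only rows whose k-th bit is on move in round k
                rows[m] = rotate_right(rows[m], 1 << k)
        k += 1
    return rows, k


n = 4
num_bits = (n - 1).bit_length()
# decomposition[m][k] is the part of row m's shift handled in round k.
decomposition = [[m & (1 << k) for k in range(num_bits)] for m in range(n)]
print(decomposition)

matrix = [[f"{i}{j}" for j in range(n)] for i in range(n)]
shifted, rounds = radix2_row_shifts(matrix)
```

Round k=0 moves rows 1 and 3 by one cell and round k=1 moves rows 2 and 3 by two cells, so two rounds cover all four rows.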

Page 14: A Concurrent Matrix Transpose Algorithm

Algorithm complexity

Depends on r (radix)
$C_1 = (r-1)[\log_r N]$
$C_2 = b(r-1)[N/r][\log_r N]$
Special cases
r=2: important when communication cost is high; good when message size is small
r=N: good when message size is large
Best value based on communication costs, message size, communication link performance, number of ports, etc.
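To see how the two expressions move with the radix, the sketch below evaluates them for N = 64. It assumes the square brackets denote rounding up, and it uses b = 1 as a placeholder because the slides do not give a value. The radices are chosen so that N/r is a whole number, which keeps the rounding assumption from affecting that factor.

```python
def ceil_log(n, r):
    """Smallest integer k with r**k >= n, computed without floating point."""
    k, power = 0, 1
    while power < n:
        power *= r
        k += 1
    return k


def c1(n, r):
    """C1 = (r-1)[log_r N], with the brackets read as a ceiling (assumption)."""
    return (r - 1) * ceil_log(n, r)


def c2(n, r, b):
    """C2 = b(r-1)[N/r][log_r N], with the brackets read as ceilings (assumption)."""
    return b * (r - 1) * (-(-n // r)) * ceil_log(n, r)


N = 64
b = 1  # placeholder value, not from the slides
table = {r: (c1(N, r), c2(N, r, b)) for r in (2, 4, 8, 64)}
for r, (cost1, cost2) in table.items():
    print(f"r={r:2d}  C1={cost1:3d}  C2={cost2:4d}")
```

Under these assumptions C1 is smallest at r = 2 and C2 is smallest at r = N, which matches the two special cases listed above.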

Page 15: A Concurrent Matrix Transpose Algorithm

Radix vs. message size vs. index time for 64 processors