A Concurrent Matrix Transpose Algorithm
Pourya Jafari
Application
- Frequently used linear algebra operation
- Scientific applications
  - FFT
  - Matrix multiplication
Transpose Matrix
- $B_{ij}$: item/cell at row i and column j of matrix B
- For all i, j we have $(B^T)_{ij} = B_{ji}$
- Simply exchange rows and columns
- For simplicity we only consider square matrices
  - N rows and N columns, labeled 0 to N-1
An Example
- Each cell is filled with its row|column number
- 6 swaps: (4*4 - 4)/2 = 6
- In general, for a square matrix of size N we have (N*N - N)/2 swaps

00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33
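
The swap count above can be checked with a small sequential sketch. The function names `labeled_matrix` and `transpose_in_place` are illustrative and not taken from the slides; the code only counts one swap per off-diagonal pair.

```python
def labeled_matrix(n):
    """Build an n x n matrix whose cells hold their 'row|column' label."""
    return [[f"{i}{j}" for j in range(n)] for i in range(n)]


def transpose_in_place(matrix):
    """Transpose a square matrix in place and return the number of swaps."""
    n = len(matrix)
    swaps = 0
    for i in range(n):
        for j in range(i + 1, n):  # visit each off-diagonal pair exactly once
            matrix[i][j], matrix[j][i] = matrix[j][i], matrix[i][j]
            swaps += 1
    return swaps


example = labeled_matrix(4)
swap_count = transpose_in_place(example)  # (4*4 - 4)/2 swaps for N = 4
```

For N = 4 this performs the 6 swaps mentioned on the slide, and the first row of `example` ends up holding the original first column.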
Parallelizing
- Naïve algorithm: a thread for each swap
  - Quadratic number of threads
  - Quadratic number of communication links → impractical
Parallelizing - 2
- More efficient way: assign a column to each thread
  - O(N) threads
- Communication links?
  - Depends on the approach
Measure dislocation
- A single swap operation can be expressed as row and column shifts
- For a column shift of length K: j = i + K → K = j - i
- Shift length is j - i (mod N); value range is from 0 to N-1
Concurrency Scheme
- Minimize communication
- Pre-process inside thread
  - Shift each row
- Intra-process/thread communication
  - Shift each column
- Post-process inside thread
  - Shift each row again
Concurrency Scheme - 2
- We have the row shifts fixed based on row index
  - Has range 0 to N-1, consistent with our initial finding
- Now arrange the rows so that the column shifts get us to i
  - i - L = i', L + i' = i, L = -j
- So we shift each column j cells up
Steps so far
- 1 → 2: Column shift j up
- 2 → 3: Row shift based on row indices
- 3 → 4: ?
Change of indices so far
- (i, j) → (i - j, j) → (i - j, i - j + j) = (i - j, i) = (m, n)
- One operation to change row index to j: n - m = (i - (i - j)) = j

00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33

00 11 22 33
10 21 32 03
20 31 02 13
30 01 12 23

00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33

00 11 22 33
03 10 21 32
02 13 20 31
01 12 23 30

00 10 20 30
01 11 21 31
02 12 22 32
03 13 23 33

(1) (2-a) (2-b) (3) (4)
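
The index changes listed above can be traced with a short sequential sketch. The function names are illustrative, all shifts are cyclic (mod N), and the last step is one possible reading of the open "3 → 4" step: it applies the relation n - m = j as a re-indexing inside each column. That reading is an assumption, not something the slides state.

```python
def labeled_matrix(n):
    """Build an n x n matrix whose cells hold their 'row|column' label."""
    return [[f"{i}{j}" for j in range(n)] for i in range(n)]


def shift_columns_up(matrix):
    """Step 1 -> 2: shift column j up by j cells, so (i, j) moves to (i - j, j)."""
    size = len(matrix)
    out = [[None] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            out[(i - j) % size][j] = matrix[i][j]
    return out


def shift_rows_right(matrix):
    """Step 2 -> 3: shift row m right by m cells, so (m, c) moves to (m, c + m)."""
    size = len(matrix)
    out = [[None] * size for _ in range(size)]
    for m in range(size):
        for c in range(size):
            out[m][(c + m) % size] = matrix[m][c]
    return out


def remap_rows_in_columns(matrix):
    """Assumed step 3 -> 4: inside column n, move the cell in row m to row n - m."""
    size = len(matrix)
    out = [[None] * size for _ in range(size)]
    for m in range(size):
        for n in range(size):
            out[(n - m) % size][n] = matrix[m][n]
    return out


state1 = labeled_matrix(4)
state2 = shift_columns_up(state1)
state3 = shift_rows_right(state2)
state4 = remap_rows_in_columns(state3)
```

With N = 4, `state2`, `state3` and `state4` reproduce the column-shifted, row-shifted and transposed matrices shown in the figure. Only `shift_rows_right` moves data across columns, which is the step that would need communication when each thread owns a column.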
Efficiency of algorithm so far
- O(N) row and column operations
  - O(N^2) overall considering both rows and columns
- O(N) communication links
- Communication is a major bottleneck
  - Group row shifts
  - Reduce communication and overall complexity
Radix Representation
- Radix r: base r numbers
- For each digit place k (starting from the least significant)
  - For l steps from 0 to r-1, group all row shifts for the current step
- Radix 3
  - Possible digits 0, 1 and 2
  - Second loop { For l=0 to 2 }: shift all numbers that have l in their k-th digit place l*r^k to the right
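
The two nested loops can be sketched sequentially as follows. This is a minimal illustration with hypothetical names (`rotate_right`, `grouped_row_shifts`), assuming cyclic shifts and skipping l = 0 because it shifts by zero. After all digit places are processed, row m has been shifted by the sum of its digits times r^k, which is m itself.

```python
def rotate_right(row, distance):
    """Cyclically rotate a list to the right by `distance` cells."""
    distance %= len(row)
    if distance == 0:
        return list(row)
    return row[-distance:] + row[:-distance]


def grouped_row_shifts(matrix, r):
    """Shift row m right by m in total, one (digit place, digit value) group at a time."""
    size = len(matrix)
    rows = [list(row) for row in matrix]
    k = 0
    while r ** k < size:            # one pass per digit place, least significant first
        for l in range(1, r):       # l = 0 would shift by zero, so it is skipped
            distance = l * r ** k
            for m in range(size):
                if (m // r ** k) % r == l:   # k-th base-r digit of the row index
                    rows[m] = rotate_right(rows[m], distance)
        k += 1
    return rows


matrix = [[f"{i}{j}" for j in range(4)] for i in range(4)]
shifted = grouped_row_shifts(matrix, 3)
```

In a parallel setting every row in the same (k, l) group moves by the same distance, so the group could be served by one communication step; the sketch only checks that the grouped shifts add up to the required per-row shift.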
Special Case: Radix-2
- Two steps only, 0 and 1
  - We only shift for 1
- Digits are the bit representation
  - Shift all row indices that have their k-th bit on
[Figure: shift for each row (rows 0 to 3) = shifts for k=0 + shifts for k=1]
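
For radix 2 the decomposition of each row's shift into per-bit shifts can be written out directly. The helper name `radix2_shift_components` is illustrative; the sketch assumes row m must be shifted by m in total, as in the row-shift step described earlier.

```python
def radix2_shift_components(n):
    """For each bit position k, list the shift applied to every row at that step."""
    components = []
    k = 0
    while (1 << k) < n:
        # Rows whose k-th bit is on shift by 2**k; all other rows stay put.
        components.append([m & (1 << k) for m in range(n)])
        k += 1
    return components


components = radix2_shift_components(4)
total_shift = [sum(step[m] for step in components) for m in range(4)]
```

For N = 4 the k=0 step shifts rows 1 and 3 by one cell, the k=1 step shifts rows 2 and 3 by two cells, and the per-row totals are 0, 1, 2 and 3.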
Algorithm complexity
- Depends on r (radix)
  - $C_1 = (r-1)[\log_r N]$
  - $C_2 = b(r-1)[N/r][\log_r N]$
- Special cases
  - r=2
    - Important when communication cost is high
    - Good when message size is small
  - r=N
    - Good when message size is large
- Best value based on communication costs, message size, communication link performance, number of ports, etc.
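
To see how the two measures move against each other, the formulas can be evaluated for a few radices. This sketch reads the square brackets as rounding up (ceiling), which is an assumption, and treats b as a plain multiplier because the slides do not define it here. The helper names are illustrative.

```python
def ceil_log(n, r):
    """Smallest k with r**k >= n, using integer arithmetic only."""
    k, power = 0, 1
    while power < n:
        power *= r
        k += 1
    return k


def c1(n, r):
    """C1 = (r-1)[log_r N], with the brackets read as a ceiling (assumption)."""
    return (r - 1) * ceil_log(n, r)


def c2(n, r, b=1):
    """C2 = b(r-1)[N/r][log_r N], with the brackets read as ceilings (assumption)."""
    ceil_n_over_r = -(-n // r)  # integer ceiling division
    return b * (r - 1) * ceil_n_over_r * ceil_log(n, r)


# Evaluate both measures for N = 64 and several radices, with b = 1.
table = {r: (c1(64, r), c2(64, r)) for r in (2, 4, 8, 64)}
```

Under these assumptions, N = 64 gives (C1, C2) = (6, 192) for r = 2 and (63, 63) for r = 64: the small radix minimizes C1 and the large radix minimizes C2, which matches the special cases listed above.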
[Figure: Radix vs. message size vs. index time for 64 processors]