
Matrix Factorizations for Parallel Integer Transforms

Yiyuan She1,2,3, Pengwei Hao1,2, Yakup Paker2

1Center for Information Science, Peking University

2Queen Mary, University of London

3Department of Statistics, Stanford University

Contents

1. Introduction
2. Point & block factorizations
3. Parallel ERM factorization (PERM)
4. Parallel computational complexity
5. Matrix blocking strategy
6. Conclusions

Why reversible integer transforms?

[Diagram: lossless coding pipeline. Encoding: a B/W image passes through a spatial transform, a color image through a color-space transform, and a multi-component image through a multi-component transform (MCT); the data are then stored or communicated. Decoding applies the inverse spatial transform, the inverse color transform, and the inverse MCT (IMCT) to recover the images. At each transform the question is: is it lossless?]

How to implement?

• Wavelet construction
  – S transform (Blume & Fand, 1989)
  – TS transform (Zandi et al., 1995)
  – S+P transform (Said & Pearlman, 1996)
• Ladder structure (Bruekers & van den Enden, 1992)
• Lifting scheme (2D, Sweldens, 1996)
• Approximated color transform (Gormish et al., 1997)
• General wavelet transform (2D, Daubechies et al., 1998)

Matrix factorizations

P. Hao and Q. Shi, Invertible linear transforms implemented by integer mapping, Science in China, Series E (in Chinese), 2000, 30, pp. 132-141.

P. Hao and Q. Shi, Matrix factorizations for reversible integer mapping, IEEE Trans. Signal Processing, 2001, 49, pp. 2314-2324.

P. Hao and Q. Shi, Proposal of reversible integer implementation for multiple component transforms, ISO/IEC JTC1/SC29/WG1 N1720, Arles, France, 2000.

Y. She and P. Hao, A block TERM factorization of nonsingular uniform block matrices, Science in China, Series E (in Chinese), 2004, 34(2).

Can we make it more efficient?

• Fewer factor matrices
• Less rounding error
• Integer computation
• Parallel computing

How to increase the degree of parallelism?

[Diagram: the elementary reversible structure — on the forward path the input x is scaled by the integer factor j and a rounded offset [b] is added; on the inverse path the rounded offset is subtracted and the result is scaled by 1/j.]

Elementary reversible structure

• Integer factor: j
• Flexible rounding: round(), floor(), ceil(), …
• Generalized lifting scheme: for j = 1, it is the same as the ladder structure and the lifting scheme
• Implementation: y = j·x + [b] and x = (1/j)·(y − [b])
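The reversibility can be checked in a few lines. A minimal sketch, assuming [·] is ordinary rounding and that the inverse subtracts the same rounded offset before dividing by the integer factor j (the function names are illustrative, not from the slides):

```python
# Forward:  y = j*x + [b]      Inverse:  x = (y - [b]) / j
# For integer x the round trip is exact, because the identical rounded
# offset [b] is first added and then subtracted.

def forward(x, b, j=1):
    """One elementary reversible step on an integer sample x."""
    return j * x + round(b)

def inverse(y, b, j=1):
    """Exact inverse of forward() for integer inputs."""
    return (y - round(b)) // j

x, b = 37, 2.6        # b would come from other, unmodified samples
y = forward(x, b)     # 37 + round(2.6) = 40
assert inverse(y, b) == x
```

For j = 1 this is exactly one ladder/lifting step; the rounding operator can be swapped for floor() or ceil() without breaking reversibility, as long as forward and inverse use the same one.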

Elementary reversible matrix (ERM)

• Diagonal elements: integer factors
• Triangular ERM (TERM)
  – Upper TERM
  – Lower TERM
• Single-row ERM (SERM): only one row of off-diagonal nonzeros

$S_m = J + e_m s_m^T$
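For the unit-diagonal case a SERM and its inverse can be written down directly: since s has a zero in position m, (I + e_m sᵀ)(I − e_m sᵀ) = I. A small sketch (the helper name make_serm is illustrative):

```python
import numpy as np

# A unit SERM S_m = I + e_m s^T adds the row vector s (with s[m] == 0)
# to row m of the identity; its exact inverse is I - e_m s^T.

def make_serm(N, m, s):
    """Build S_m = I + e_m s^T for a length-N vector s with s[m] == 0."""
    s = np.asarray(s, dtype=float)
    assert s[m] == 0, "the diagonal position must stay untouched"
    S = np.eye(N)
    S[m, :] += s                  # only row m has off-diagonal nonzeros
    return S

S = make_serm(4, 2, [0.5, -1.0, 0.0, 2.0])
S_inv = make_serm(4, 2, [-0.5, 1.0, 0.0, -2.0])   # I - e_m s^T
assert np.allclose(S @ S_inv, np.eye(4))
```

Applying S changes only component m of a vector, which is why a SERM step maps onto the elementary reversible structure above.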

Point factorizations (PLUS)

$A = \mathrm{PLUS} = P D_R L U S_0$, if $\det(P^T A) = \det D_R \neq 0$

$S_0 = I + e_N s_0^T = I + e_N \cdot [s_1, s_2, \ldots, s_{N-1}, 0]$

$D_R = \mathrm{Diag}(1, 1, \ldots, 1, \det(P^T A))$

$S_m = I + e_m s_m^T$
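As a concrete (and classic) instance of splitting a transform into elementary reversible factors, a 2×2 rotation factors into three unit triangular matrices, upper–lower–upper, and rounding each update turns it into a reversible integer rotation. This follows the lifting factorization of Daubechies et al. cited earlier; the function names below are illustrative:

```python
import math

# R(theta) = U L U with U = [[1, t], [0, 1]], L = [[1, 0], [sin(theta), 1]],
# and t = (cos(theta) - 1) / sin(theta).  Rounding each lifting update keeps
# the data integer while the three steps stay exactly invertible.

def int_rotate(x1, x2, theta):
    t = (math.cos(theta) - 1.0) / math.sin(theta)
    x1 += round(t * x2)                   # upper TERM step
    x2 += round(math.sin(theta) * x1)     # lower TERM step
    x1 += round(t * x2)                   # upper TERM step
    return x1, x2

def int_rotate_inv(x1, x2, theta):
    t = (math.cos(theta) - 1.0) / math.sin(theta)
    x1 -= round(t * x2)                   # undo the steps in reverse order
    x2 -= round(math.sin(theta) * x1)
    x1 -= round(t * x2)
    return x1, x2

y1, y2 = int_rotate(100, -37, 0.3)        # approximates the exact rotation
assert int_rotate_inv(y1, y2, 0.3) == (100, -37)
```

The output differs from the exact rotation only by the rounding errors of the three steps, yet the inverse recovers the input bit-exactly.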

Block factorizations (BLUS)

$A = \mathrm{BLUS} = P D_R L U S_0$, if $\mathrm{DET}(P^T A) = \mathrm{DET}(D_R)$ exists

$S_0 = I + e_N s_0^T = I + e_N \cdot [s_1, s_2, \ldots, s_{N-1}, 0]$

$D_R = \mathrm{Diag}(I, I, \ldots, I, \mathrm{DET}(P^T A))$

$S_m = I + e_m s_m^T$

Parallel factorizations (PERM)

Blocking chain: $N_1 = m^{(0)} \xrightarrow{n^{(1)}} m^{(1)} \xrightarrow{n^{(2)}} m^{(2)} \to \cdots \to m^{(K-1)} \xrightarrow{n^{(K)}} m^{(K)} = N_2$

$A = P^{(1)} P^{(2)} \cdots P^{(K)} D \, L^{(K)} U^{(K)} \cdots L^{(1)} U^{(1)} = P D \prod_{k=K}^{1} S^{(k)}_{n^{(k)}} \cdots S^{(k)}_{1}$

Parallel computing PERM(0):

$x \to P \to S^{(1)}_1 \to S^{(1)}_2 \to S^{(1)}_3 \to S^{(1)}_4 \to S^{(2)}_1 \to S^{(2)}_2 \to S^{(2)}_3 \to S^{(2)}_4 \to y$

Parallel computing PERM(1):

$x \to S^{(1)}_0 \to S^{(1)}_1 \to S^{(1)}_2 \to S^{(1)}_3 \to S^{(1)}_4 \to S^{(2)}_0 \to S^{(2)}_1 \to S^{(2)}_2 \to S^{(2)}_3 \to S^{(2)}_4 \to P \to y$

Parallel multiplication

For p processors to implement multiplications of n pairs of numbers, the computational time is:

$T^{*} = \lceil n/p \rceil$

Parallel addition

[Diagram: a binary reduction tree combining the partial sums $S^{(1,5)}, S^{(1,6)}, \ldots, S^{(1,16)}$ pairwise, level by level.]

For p processors, the parallel addition time for n numbers is:

$$T^{+} = \begin{cases} \lceil \log_2 n \rceil, & \text{if } p \geq n/2 \\ \lceil n/p \rceil + \log_2 p + C, & \text{if } p < n/2 \end{cases}$$
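The $\lceil \log_2 n \rceil$ term comes from summing pairwise in a binary tree: each round halves the number of partial sums, and with $p \geq n/2$ processors every round runs fully in parallel. A small sketch of that schedule (tree_sum is an illustrative name):

```python
import math

# Summing n numbers by rounds of pairwise additions: with p >= n/2
# processors all additions in a round happen simultaneously, so the
# number of rounds is ceil(log2(n)).

def tree_sum(values):
    """Sum by rounds of pairwise additions; return (total, rounds used)."""
    vals, rounds = list(values), 0
    while len(vals) > 1:
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:          # an odd last element carries over
            nxt.append(vals[-1])
        vals, rounds = nxt, rounds + 1
    return vals[0], rounds

total, rounds = tree_sum(range(16))
assert (total, rounds) == (120, math.ceil(math.log2(16)))
```

With fewer processors each round can no longer run in one step, which is where the $\lceil n/p \rceil + \log_2 p$ case of the formula comes from: each processor first sums its local share sequentially, then the p partial results are reduced in a tree.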

Computational complexity (*)

For $n^{(k)} m^{(k)} = m^{(k-1)}$, $m^{(0)} = N_1$, $m^{(K)} = N_2$, along the chain $N_1 = m^{(0)} \xrightarrow{n^{(1)}} m^{(1)} \to \cdots \xrightarrow{n^{(K)}} m^{(K)} = N_2$, the parallel multiplication times of PERM(1) and PERM(0) are sums over the $K$ levels of terms of the form $\lceil n^{(k)} m^{(k)} m^{(k-1)} / p \rceil$. Both sums simplify to expressions in $N_1$, $N_2$ and $p$ alone.

It is independent of the blocking manners.

Computational complexity (+)

For $n^{(k)} m^{(k)} = m^{(k-1)}$, $m^{(0)} = N_1$, $m^{(K)} = N_2$, along the same chain, the parallel addition times of PERM(1) and PERM(0) sum, over the $K$ levels, terms of the form $\lceil \cdot / p \rceil + \log_2 p + C$ for the early large-block levels and purely logarithmic terms for the later small-block levels.

There is a turning point $K_p$, where $m^{(K_p)} \, (m^{(K_p - 1)} - m^{(K_p)})$ is close to but less than $2p$.

Blocking strategy

Since the parallel computational time has a turning point (ignoring factors like communication time), we propose a three-phase blocking strategy for the chain $N_1 = m^{(0)} \xrightarrow{n^{(1)}} m^{(1)} \to \cdots \xrightarrow{n^{(K)}} m^{(K)} = N_2$:

• if $N \geq 2p$: $N \to 2p \to 1$
• if $p \leq N < 2p$: $N \to p \to 1$
• if $N < p$: $N \to 1$

Computational complexity (with the strategy)

For $N_1 = m^{(0)} \xrightarrow{n^{(1)}} m^{(1)} \to \cdots \xrightarrow{n^{(K)}} m^{(K)} = N_2$, the times under the three-phase strategy are piecewise in $N$ and $p$:

$$T^{*}_{\mathrm{PERM}}(N, p) = \begin{cases} f^{*}_1(N, p), & p \leq N/2 \\ f^{*}_2(N, p), & N/2 < p < N^2/4 \\ f^{*}_3(N, p) = 5\log_2 N - 4, & p \geq N^2/4 \end{cases}$$

$T^{+}_{\mathrm{PERM}}(N, p)$ has the same three regimes, with additional $\log_2 p$ and $C$ terms from the parallel additions; for $p \geq N^2/4$ it grows as $O(\log^2 N)$.

Complexity comparison

For parallel SERM(1):

$T^{*}_{\mathrm{pSERM}(1)}(N, p) = (N+1) \left\lceil \dfrac{N-1}{p} \right\rceil$

$T^{+}_{\mathrm{pSERM}(1)}(N, p) = (N+1) \left( \left\lceil \dfrac{N-1}{p} \right\rceil + \log_2 p + C \right)$

Operation        Method     p = O(N)      p = O(N^2)
Multiplications  SERM(1)    O(N)          O(N)
                 PERM(1)    O(N)          O(log N)
Additions        SERM(1)    O(N log N)    O(N log N)
                 PERM(1)    O(N)          O(log^2 N)

PERM vs. parallel SERM

[Chart: computational complexity (log scale, 1 to 10000) versus number of processors p (1 to 1024) for PERM multiplications, PERM additions, SERM multiplications, and SERM additions.]

Computational complexity (N = 64, C = 1)

PERM vs. parallel SERM

[Chart: relative speedup (0 to 10) versus number of processors p (1 to 1024); curves show PERM multiplication/SERM multiplication and PERM addition/SERM addition.]

Relative speedup (N = 64, C = 1)

Conclusions

For parallel computing:
• Increases the degree of parallelism
• Accommodates more processors

For sequential computing:
• May be more efficient with special matrix computation software such as BLAS

More factorization levels may result in greater rounding error.

Thank You

phao@cis.pku.edu.cn

phao@dcs.qmul.ac.uk

http://www.dcs.qmul.ac.uk/~phao
