e. md john.e.dorband@nasa

2003 Conference on Information Science and Systems, The Johns Hopkins University, March 12-14,2003

Parallel signal processing and system simulation using a c e John E. Dorband

NASA Goddard Space Flight Center, Greenbelt, MD 20771 [email protected]

Maurice F. Aburdene Department of Elecnical Engineering, Bucknell Universiy, Lewisburg, Pa I7837

[email protected]

Abstract: Recently, networked and cluster computation have become very popular for both signal processing and system simulation. A new language is ideally suited for parallel signal processing applications and system simulation since it allows the programmer to explicitly express the computations that can be performed concurrently. In addition, the new C based parallel language (ace C) for architecture-adaptive programming allows programmers to implement algorithms and system simulation applications on parallel architectures by providing them with the assurance that future parallel architectures will be able to run their applications with a minimum of modification. In this paper, we will focus on some hndamental features of ace C and present a signal processing application (FFT).

1. Introduction

Parallel and networked computer programming techniques have become popular and important computational tools for system simulation and signal processing[ 11. The essential elements of parallel programming are concurrent computation (threads) and communications (messages). ace is based on the concept of structured parallel execution [2-31.. First the programmer designs a virtual architecture that reflects the spatial organization of an algorithm. An example is shown in Fig. 1. A virtual architecture may consist of groups or bundles of threads of execution. Second the application program reflects the temporal organization of the algorithm. The code defines what each thread performs, which together with the virtual architecture, defines the algorithm’s execution. ace allows the programmer to both envision the most logical architecture for the application and then implement the algorithm using that architecture.. The ace language allows the programmer to explicitly express that which can be performed concurrently. The purpose of ace is to facilitate the development of parallel programs by allowing programmers to explicitly describe the parallelism of an algorithm.

El *-

A r d s L Q n

Fig. 1. Sample Virtual Architecuture

1.1 a c e micro-threads

One of the essential elements of concurrent computations is micro-threads. The following ace program shows an example of micro-threads.

threads A[ 1001;

threads B[ IO]; A int av;

B.( ... A.( int bv,cv;

if (av==bv) cv=O ;

1 1

Any C code is valid, except ‘goto’s.

The difference between ace and C is that while C has only one thread of execution, ace, may have many threads of execution. Each thread may be referenced by name and index. Parallelism in ace is expressed by first defining a set of concurrently executable threads. A group of parallel threads can be viewed as a bundle of executing threads, a bundle of processes, or an array of processors. These three views will be treated synonymously. In ace, a bundle of threads is defined with the ‘threads’ statement. The statement “threads A[ lOO]” only declares the intent of the programmer to use 100 concurrent threads of executions named ‘A’ at some later point in the


code. The statement A int av allocates an integer ‘av’ for each of the 100 threads of bundle ‘A’.

1.2 a c e micro-messages

The second essential element of concurrent programming is micro-messages. Here is an example program showing communication.

threads A[ 1001;

threads B[ 1001;

A,{ int av,cv; B int bv;

av = A[$$i+l].cv ; /* get operation */

B[$$i/lO].bv += av ; /* put operation */

1 /* Note: $$i is a threads unique ID */

1.3 Virtual Architecture

One of the major advantages of ace for signal processing is that it allows the programmer to create a virtual architecture for the algorithm. Fig 2-6 show various examples of virtual architectures and communications along with the a c e code.

threads X[ If ; threads { Z[21[2] Y[3][3] ;

U L

Fig. 2. Virtual Architecture

threads A121 [2] ; A. { int a=$%; int b=3; int c;

t c=a*b;

Fig. 3. No Communications

threads A[5 I2 J[5 1 23 ; a = .A[O][-l].c ; /* North */ a = A[O][l].c ; P South *I

Fig. 4.2-D Relative Communications

threads A[513]{511J ; a = A[O][S%ix[O]-l f.c : I* Nonh *I a = A[O][%$is[O] +I].c ; /* South *! a += A[$Six[ I]-llfO].c : /* West */

Fig. 5. Absolute Communications

Aggregate Operations ( f. -. &, 1, A. >?, <? )

thrcnds X[ I J : threads Z[1][2] ) Y[3][3] ; Y.( int a; X int b; X.h+=a; f

Fig. 6. Global Aggregate

threads NODE[ 1 0001 threads { CornerE3) ] CELL[2000] ;

Fig. 7. Finite Element Example

CELL @ Corner

NUDE

2003 Conference on Information Science and Systems, The Johns Hopkins University, March 12- 14,2003

2. FFT Example

The fast Fourier transform, FFT, is widely used in science and engineering and is easily implemented on parallel architectures [4-51. Gupta and Kumar [6] presented the scalability analysis of an FFT algorithm for both mesh and hypercube architectures.

The ace program shown in Appendix A performs a NxN 2-D FFT and inverse FFT on test data. This program performs N 1-D FFTs in parallel (partialFFT) in one dimension, transposes (Transpose) the transformed data, performs a second set of 1-D FFTs, and performs another transposes of the resulting transformed data. The I-D FFT results in shuffled data and is unshuffled (UnShuffle) immediately after the I-D FFT is performed.

of threads to a N*N/2 I-D bundle of threads, thus the 2-D FFT was actually performed on a 1-D architecture.

The data was moved h m the NxN 2-D bundle

3. Summary

This paper presented the fundamental concepts of a new C based parallel programming paradigm, ace C. The intent of ace is to allow a programmer to easily express the parallelism of an algorithm, allowing the compiler to more easily map the parallelism of that algorithm on to various parallel architectures. The ease of mapping facilitates the porting of programs and development /runtime environments to different architectures, promoting the development of new architectures. In addition, a parallel implementation of a 2-D FFT was presented.

4. References [ I1G.R Andrews, Fowrdntionr of multithreaded, parallel, and distributedpmgramming, Addison-Wesley, 2000. [2] J.E. Dorband and M.F. Aburdene, “Architecture-adaptive computing environment: a tool for teaching parallel programming”, poceed inm-Frontiers in Education Conference,

[3] J.E. Dorband, “ace C Reference”, NASA Goddard Space Flight Center, Greenbelt, IvlD. [4] Michael .J. Quinn, Parallel Computing: Theoty and Practice, McGraw-Hill, Inc., 1994. [5] Harold S. Stone, High-Pe$onnance Computet Architecture, Addison-Wesley, Mass., 1987. [6] A. Gupta and V. .Kurnar, “The scalability of FFT on parallel computers”, IEEE Transactions. on Parallel and Distributed Systems, V. 4, no. 8, August 1993, pp. 922-932.

Vol. 3, pp. S2F 7-S2F112,2002.

5. Acknowledgments This research was funded in part by NASA grant NAGS- 10821.

Appendix A: Parallel FFT Program

# include <stdio. aHr> # include <math.aHr>

# define real

# define twoPI

# define PI

# define size 512 # define radius 3 # define nproc2 (size*size) # define nproc (nproc2>>1)

double

( (double) 6.28318530717959)

(twoPI/Z)

struct complex { real rr,ii ; } ; typedef struct complex complex ;

generic void ComplexAdd ( complex *C, complex *A, complex *B ) {

C->rr = A->rr f B->rr ; C->ii = A->ii f B->ii ; 1

generic void Complexsub ( complex *C, complex *A, complex *B ) {

C->rr = A->rr - B->rr ; C->ii = A->ii - B->ii ; 1

generic void ComplexMult ( complex *C, complex *A, complex *B 1 I

real D; = A->rr*B->rr - A->ii*B->ii ; D

C->ii = A->rr*B->ii + A->ii*B->rr ; C->rr = D ; I

generic complex Init-FFT ( int nxproc ) { real i; complex WS1 ;

2003 Conference on Information Science and Systems, The Johns Hopkins University, March 12- 14,2003

/ * initialize fft coefficients * /

i = $$i%(nxproc>>l) ; WSl.rr = cos((twoPI*i)/nxproc) ; WSl.ii = sin((twoPI*i)/nxproc) ;

return WS1 ; 1

generic(WorkSpace[])void partialFFT ( complex A[2], complex *W1, int nxproc 1 {

complex Al,A2 ; complex Zl,Z2,C ; int i , iproc, n ;

iproc = $Si ; n = nxproc>>2 ; / * nxproc/4 * / A1 = A[O] ; A2 = A[1] ;

ComplexAdd ( &C, &A1 , &A2 ; ComplexSub ( &A2 , &Al, &A2 ) ; ComplexMult ( &A2, &A2 , W1) ; A l = C ; ComplexMult (Wl,Wl,Wl) ;

for ( i=n ; i>O ; i>>=l ) {

if ( (iproc&i) !=O ) 1 Wl->rr = -W1->rr ; w1->ii = -Wl->ii ; 1

Z1 = .WorkSpac’e[ il .A1 ; 22 = .Workspace[-i] .A2 ; if ( (iproc&i)==O ) A2=Z1 ; else A1=Z2 ;

ComplexAdd (&C, &All &A2) ; ComplexSub ( &A2, &A1 , &A2 ) ; ComplexMult .(&A2, &A2,W1) ; A l = C ; ComplexMult (Wl,Wl,Wl] ;

Z1 = .Workspace[ i] .A1 ; 22 = .Workspace[-i] .A2 ; if ( (iproc&i)==O ) A2=Z1 ; else A1=Z2 ;

A[01 = A1 ; Ail] = A2 ;

1

generic(WorkSpace[]) void UnShuffle ( complex A[2], int nxproc ) {

complex A1,AZ ; complex Zl,Z2 ; int i, j, iproc, n ;

iproc = $Si ; n = nxproc>>2 ; / * nxproc/4 * / A1 = A[O] ; A2 = A[1] ;

for( i=n, j=2; i>j; i>>=l, j<<=l ) {

Z1 = .Workspace[ jl .A1 ; 22 = . Workspace [-j 3 .A2 ; if ( (iproc&j)==O ) A2=Z1 ; else Al=Z2 ;

Z1 = .Workspace[ i] .A1 ; 22 = .Workspace[-i] .A2 ; if ( (iproc&i)==O ) A2=Z1 ; else Al=Z2 ;

Z1 = .Workspace[ jl .A1 ; 22 = .Workspace[-jl.A2 ; if ( (iproc&j)==O ) A2=Z1 ; else A1=Z2 ;

Z1 = .Workspace[ 1l.Al ; 22 = .Workspace[-11 .A2 ; if ( (iproc&l)==O ) A2=Z1 ; else Al=Z2 ;

A[O] = A1 ; A[1] = A2 ;

I

generic(WorkSpace[]) void Transpose ( complex A[2], int nxproc ) { complex Al,A2 ; complex Z1,22 ; int i, j , iproc, no, nl ;

iproc = $Si ; no = nxproc>>l ; /* nxproc/2 * / nl = nxproc>>l ; /* nxproc/2 * / A1 = A[O] ; A2 = A[11 ;

for(i=nl, j=1; j<nO; i<<=l, j<<=l) {

Z1 = .Workspace[ j] .A1 ;


22 = .Workspace[-j].A2 ; if ( (iproc&j)==O ) A2=21 ; else A1=22 ;

21 = .Workspace[ i].Al ; 22 = .Workspace[-i] .A2 ; if ( (iproc&i)==O ) A2=Z1 ; else A1=22 ;

21 = .WorkSpace[ j].Al ; 22 = .Workspace[-j].A2 ; if ( (iproc&j)==O ) A2=21 ; else A1=22 ;

Z1 = .Workspace[ i] .A1 ; 22 = .Workspace[-il .A2 ; if ( (iproc&i)==O ) A2=21 ; else A1=22 ; A[O] = A1 ; A[1] = A2 ;

generic (Workspace [I ) void FFT ( complex A[2] , complex *WS, int nxproc ) {

complex W ;

/ * fft coefficients * / w = *ws ;

/ * X dimension fft * / partialFFT( A, &W, nxproc) ; / * X dimension unshuffle * / UnShuffle ( A, nxproc ) ; / * Transpose data * / Transpose ( A, nxproc ) ;

/ * fft coefficients * / w = *ws ;

/ * X dimension fft * / partialFFT( A, &W, nxproc) ; / * X dimension unshuffle * / UnShuffle ( A, nxproc ) ; / * Transpose data * / Transpose ( A, nxproc ) ;

1

generic (Workspace [I ) void invFFT ( complex A[2] , complex *WS,

int nxproc ) { complex W ; int Size ;

/ * fft coefficients * / W.rr = WS->rr ; / * complex conjugate * / W.ii = -WS->ii ;

/ * X dimension fft * / partialFFT( A, &W, nxproc ) ; / * X dimension unshuffle * / UnShuffle ( A, nxproc ) ; / * Transpose data * / Transpose ( A, nxproc ) ;

/ * fft coefficients * / W.rr = WS->rr ; / * complex conjugate * / W.ii = -WS->ii ;

/ * X dimension fft * / partialFFT( A, &W, nxproc ) ; / * X dimension unshuffle * / UnShuffle ( A, nxproc 1 ; / * Transpose data * / Transpose ( A, nxproc ) ;

/ * normalize result * / Size = (nxproc*nxproc) ; A[O] .rr /= (Size) ; A[O].ii /= (Size) ; A[l].rr /= (Size) ; A[1] .ii /= (Size) ;

1

threads WorkSet [nproc] ; threads Frame [size] [size] ;

int main ( int argc, char **argv ) {

WorkSet. { complex A[2l,WS ;

/ * initialize test data set * /

Frame. { int KK, LL; real RIMAGE ;

KK = $$ix[Ol-(size>>l) ; LL = $$ix[ll-(size>>l) ;

RIMn.GE = 0; if ( KK*KK+LL*LL < radius*radius ) RIMAGE = 1 ;


WorkSet. { int i=S$i; A[O].rr = Frame [ i/(size>>l) ]

A[O].ii = 0.0 ; A[1] .rr = Frame [ i/(size>>l) ]

[ i% (size>>l) ] .RIMAGE ;

[ (i% (size>>l) ) + (size>>l) ] .RIMAGE ;

A[l].ii = 0.0 ; I

1

Frame. { real RIMAGE ;

Workset.{ int i=$$i; Frame [ i/ (size>>l) ]

[ i% (size>>l) ] . RIMAGE = sqrt( A[O] .rr*A[O] .rr

+ A[Ol.ii*A[Ol.ii ) ; Frame [ i/(size>>l) 3

[ (i% (size>>l) )+(size>>l) I . RIMAGE

= sqrt( A[1] .rr*A[lI .rr + A[l].ii*A[l].ii ) ;

I

printf ( "%c%lOg", ( ($$ix[Ol)?' ':I\n' ) , RIMAGE ) ;

if (!$Si) printf ( l'\n'r); I

/ * initialize fft coefficients * /

WS = Init-FFT ( Frame. ($$Nx[Ol) 1 ;

/ * perform FFT * /

FFT ( A, &WS, Frame.($$Nx[O]) ) ;


Workset.{ int i=$$i; Frame [ i/ (size>>l) ]

[ i% (size>>l) 3 .RIMAGE = sqrt ( A[03 .rr*A[O] .rr

+ A[O].ii*A[O].ii ) ; Frame [ i/ (size>>l) 3

[ (i% (size>>l) )+(size>>l) 3 . RIMAGE = sqrt ( A[1] .rr*A[l] .rr + A[l].ii*A[l].ii ) ;

print f ( 'I % c% 1 0 g " , ( ($$ix[O])?' ':'\nl ) , RIMAGE ) ;

if (!$Si) printf ( "\n") ; }

/ * perform inverse FFT * /

invFFT ( A, &WS, Frame.($$Nx[O]) ) ;

/ * clean up result * /

if ( A[O].rr < le-13 ) A[O].rr=O.O ; if ( A[O].ii < le-13 ) A[O].ii=O.O ; if ( A[l].rr < le-13 ) A[l].rr=O.O ; if ( A[l].ii < le-13 ) A[ll.ii=O.O ;


Workset.{ int i=$$i; Frame [ i/ (size>>l) 3

[ 1% (size>>l) 3 .RIMAGE = sqrt( A[0] .rr*A[O] .rr

+ A[O] .ii*A[O] .ii ) ; Frame [ i/ (size>>l) 3

[ (i%(size>>l) )+(size>>l) ] . RIMAGE = sqrt ( A[1] .rr*A[l] .rr + A[1] .ii*A[l] .ii ) ;

1

printf ( "%c%lOg", ( ($$ix[Ol)?' l:l\n' ) , RIMAGE ) ;

if ( !$Si) printf ("\n") ; I

return 0;

NASA's GodQrdSpace FliM Center

Name

John Dorband .

William Campbell

Richard Rood

Route Sheet ~-

Date" - Pur- Initials" pose' In out

Code

935

ORIGINATOR

John Dorband TYPED BY

Nancy Dixon

935

CODE PHONE. NUMBER 935.0 6-94 19

CODE PHONE NUMBER 935.0 6-3898

930

QUALITY CHECK ( r c c r c q ; please (CYICW and initlrl)

BR: x DIV DIR

230

MAL LOG

BR Log # DIV Log # DIR Log #

935

SUBJECT

Request for approval for publication release: Parallel signal processing and system simulation using ace

234.0

DUEDATE

0211 9/03

'PURPOSE LEGEND "INITIALs/DATE: Ensure that all initials and inlout dates are completed for all l ies GSFC 11-20 (999) 1. Forinfomation 2. For reviewIapprovaVconwrence 3. Fornecessaryadion 4. For signalwe

e. md john.e.dorband@nasa

Documents