
INTEGRATION, the VLSI journal 28 (1999) 185-196

A parallel VLSI architecture of Kalman-filter-based algorithms for signal reconstruction

Daniel Massicotte*

Department of Electrical Engineering, Université du Québec à Trois-Rivières, C.P. 500, Trois-Rivières, Québec, G9A 5H7 Canada

Received 23 August 1999

*Corresponding author. Fax: +1-819-376-5219. E-mail address: Daniel_Massicotte@uqtr.uquebec.ca (D. Massicotte)

Abstract

The problem of improving the performance of VLSI implementations of Kalman-based algorithms for real-time signal reconstruction is discussed. A parallel approach is proposed to develop a systolic architecture expressly for this specific application. We show that the autoregressive model of Kalman filtering for signal reconstruction is particularly well adapted to parallel processing and well suited for implementation. Although intended for signal reconstruction, the architecture can also serve other applications that require a similar autoregressive model of Kalman filtering. The performance of the parallel architecture is validated by comparison with Motorola's general-purpose DSP56002 digital signal processor for real-world spectrometric signal reconstruction. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Parallel architecture; Systolic architecture; Kalman filter; Signal reconstruction; VLSI implementation

0167-9260/99/$ - see front matter © 1999 Elsevier Science B.V. All rights reserved. PII: S0167-9260(99)00018-8

1. Introduction

Signal reconstruction is a very common inverse problem in such fields as telecommunications (e.g. channel equalization), metrology, biomedical engineering, seismology and spectrometry [1,2]. It consists of estimating a signal $x$, i.e., the ideal signal, from a known signal $\tilde{y}$ that is related to $x$ by a causal relationship. A disturbing noise $\eta_n$ affects that relationship, making the problem ill-posed. The discrete form of the convolution equation is

$$\tilde{y}_n = \sum_{m=1}^{M} h_{n-m}\, x_m + \eta_n \quad \text{for } n = 1, 2, 3, \ldots, N, \tag{1}$$

where $h_n$ is the impulse response function. Numerous methods for solving this problem have high computational requirements, so an application-specific integrated circuit (ASIC) dedicated to signal applications becomes necessary, especially when speed, accuracy, power consumption and size play a role.
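As an illustration of the measurement model in Eq. (1), the following minimal sketch generates a measured signal from an ideal signal and an impulse response. It is not taken from the paper: the function name `simulate_measurement`, the zero-based indexing, and the assumption that the impulse response is causal (zero for negative lags) and of finite length are choices made here for the example only.

```python
def simulate_measurement(x, h, noise=None):
    """Return y[n] = sum_m h[n-m] * x[m] + noise[n] (zero-based indices).

    x     : ideal signal samples
    h     : causal impulse response, h[k] taken as 0 outside 0 <= k < len(h)
    noise : optional list of disturbance samples, same length as x
    """
    n_samples = len(x)
    y = []
    for n in range(n_samples):
        acc = 0.0
        for m in range(n_samples):
            k = n - m                      # lag of the impulse response
            if 0 <= k < len(h):
                acc += h[k] * x[m]
        if noise is not None:
            acc += noise[n]
        y.append(acc)
    return y


# A unit impulse at n = 1 reproduces the impulse response, delayed by one sample.
y = simulate_measurement([0.0, 1.0, 0.0, 0.0], [0.5, 0.3, 0.2])
```

Reconstruction is the reverse task: recovering `x` from `y` when `h` is known and the noise is not.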

The choice of Kalman-filter-based algorithms is justified by their broad field of applications. The implemented algorithms are based on the steady-state version of the Kalman filter, which performs well in a wide range of specific applications [1-6]. When the non-stationary version of the Kalman filter is required, a co-processor can be used to compute the Kalman gain. The architectures of general-purpose DSPs available on the market are intended to be versatile and are not optimised for any particular algorithm. Computing operations are performed over many cycles using only one multiplier/accumulator (M/A). A programmable processor dedicated to Kalman-filter-based algorithms for spectrometric data and a DSP designed for a stationary version of the Kalman filter using only one M/A unit were proposed in [7,8], respectively. For these architectures, however, very large-scale integration raises a problem with the length of connection paths, because line resistivity and capacitance slow signal propagation, especially in long connecting nets. To avoid synchronization problems in long lines, repeated restoration of signals becomes necessary, and in extreme cases even delay circuitry may be required. Automatic routing may cause connection nets to occupy large areas of the physical chip and so compromise the implementation. Systolic implementation may solve these problems. Many authors (e.g. [9-11]) have suggested systolic architectures based on non-stationary versions of the Kalman filter. Applying the specialized processors of these architectures to a stationary data model $\{\tilde{y}_n\}$ with $M \geq 64$ is expensive and makes it difficult to introduce constraints into the processing. To avoid this inconvenience, a parallel architecture in very large-scale integration (VLSI) is proposed, enabling routing and hardwired control, while increasing speed. Localization of connections decreases the line length and, consequently, signal propagation time, improving reliability, efficiency and ease of integration.

In Section 2, the implemented algorithm is presented and a systolic approach is proposed to take advantage of the parallelization of the algorithm, making a compromise between area and computation time in the design. Section 3 describes the modules of the VLSI implementation, the procedures and the prototyping. In Section 4, an example of reconstruction is presented, and the performance evaluation compares the architecture with a commercial general-purpose DSP56002 digital signal processor. The conclusion is given in Section 5.

Throughout the paper, vectors and matrices are set in bold italic type for clarity, and $\mathbf{v}_{i|j}(m)$ denotes the $m$th element of a vector $\mathbf{v}$ at time $i$ given the data available at time $j$.

2. Parallel implementation of algorithms

The following equations summarize the basis of the implemented Kalman-filter-based algorithms for signal reconstruction. Complete versions are presented in [3]:

$$\hat{\mathbf{z}}_{n|n} = \hat{\mathbf{z}}_{n|n-1} + \mathbf{k}_\infty(\mathbf{h}, \beta)\,[\tilde{y}_n - \hat{y}_n], \tag{2}$$

$$\hat{\mathbf{z}}_{n|n-1} = \boldsymbol{\Phi}\,\hat{\mathbf{z}}_{n-1|n-1}, \tag{3}$$

$$\hat{y}_n = \mathbf{h}^{T}\hat{\mathbf{z}}_{n|n-1}. \tag{4}$$

The vector of the steady-state Kalman gain $\mathbf{k}_\infty$ may be calculated in advance and depends on the vector $\mathbf{h}$ and the regularization parameter $\beta$, chosen to minimize the reconstruction error. The estimates $\hat{x}_n$ of $x_n$ are extracted algebraically from the estimates of the state vector $\hat{\mathbf{z}}_{n|n}$,

$$\hat{x}_n = \hat{z}_{n+d|n+d}(1), \tag{5}$$

where $d$ is the estimation delay given by the fixed-lag smoothing, which is included in the model at no additional computational cost. This delay affects the latency of the result but not the throughput of the architecture.
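The recursion of Eqs. (2)-(4) can be sketched in a few lines. This is only an illustrative reading of the equations, not the author's implementation: the function name `kalman_step` is hypothetical, the gain vector is passed in as a precomputed argument (the paper states that it may be calculated in advance but does not give its values), and the state matrix used in the usage lines is a made-up 2x2 example.

```python
def kalman_step(z_pred, y_meas, h, k_inf, phi):
    """One steady-state Kalman iteration.

    z_pred : predicted state, z_hat[n|n-1]
    y_meas : measured sample, y_tilde[n]
    h      : impulse-response vector
    k_inf  : precomputed steady-state gain vector (assumed given)
    phi    : state matrix as a list of rows
    Returns (z_filt, z_next, innovation).
    """
    # Eq. (4): predicted measurement y_hat[n] = h^T z_hat[n|n-1]
    y_hat = sum(hi * zi for hi, zi in zip(h, z_pred))
    innovation = y_meas - y_hat
    # Eq. (2): correction with the fixed gain
    z_filt = [zi + ki * innovation for zi, ki in zip(z_pred, k_inf)]
    # Eq. (3): prediction for the next sample, z_hat[n+1|n] = Phi z_hat[n|n]
    z_next = [sum(p * z for p, z in zip(row, z_filt)) for row in phi]
    return z_filt, z_next, innovation


# Hypothetical numbers, chosen only to exercise the three equations.
phi_example = [[0.0, 1.0], [0.0, 1.0]]
z_filt, z_next, innov = kalman_step([1.0, 1.0], 5.0, [1.0, 2.0], [0.5, 0.25], phi_example)
```

Because the gain is constant, each iteration costs one inner product, one scaled vector addition and one product with the state matrix, which is what makes a fixed hardware data path attractive.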

If one takes into account that, for physical reasons, $x_n$ cannot be negative, the following constraint, defined as the non-negativity function $f$, is imposed on the solution:

$$\hat{z}_{n|n}(m) = \begin{cases} \alpha\,\hat{z}_{n|n}(m) & \text{if } \hat{z}_{n|n}(m) < 0, \\ \hat{z}_{n|n}(m) & \text{if } \hat{z}_{n|n}(m) \geq 0, \end{cases} \quad \text{for } m = 1, 2, \ldots, M, \tag{6}$$

to obtain an improved algorithm called KALMAN+, where $M$ is the dimension of the vector $\mathbf{z}$ and $\alpha \in [0, 1]$ is a parameter to be optimized empirically. An iterative version of the algorithm, called ITERKAL+, has been developed [4] and can also be implemented. The estimate $\hat{y}_{n+1}$, which is needed to obtain the estimation error from the next sample $\tilde{y}_{n+1}$, is computed in parallel with the estimate $\hat{\mathbf{z}}_{n|n}$. The particular sparse form of the state matrix $\boldsymbol{\Phi}$ makes possible the following recursive equations [1,7]:

$$\hat{z}_{n+1|n}(m-1) = f\big(\hat{z}_{n|n-1}(m) + k_\infty(m)\,I_n\big), \tag{7}$$

$$\hat{y}_{n+1}(m-1) = h(m-1)\,\hat{z}_{n+1|n}(m-1) + \hat{y}_{n+1}(m-2) \tag{8}$$

for $n = 1, 2, 3, \ldots, N$ and $m = 1, 2, 3, \ldots, M$, where $I_n = \tilde{y}_n - \hat{y}_n$ is defined as the innovation. These equations exhibit localized communication, regularity and recursiveness, which enable the parallel approach. In addition, the uniformity of the equations [12,13], which is the basis of their local nature because the right-hand-side variables are indexed by values obtained by shifting, allows us to consider a parallel architecture in which a processing element (PE) is connected only to its close neighbors.
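To make the element-by-element structure of Eqs. (7) and (8) concrete, here is a minimal sequential sketch of one pass over $m = 1, \ldots, M$. It is an illustration only: the names `nonneg` and `recursion_pass` are hypothetical, the arrays are padded so that list indices match the paper's one-based indices, $h(0) = 0$ and a zero starting value for the running sum are taken from the limit conditions of Eq. (10), and the extra boundary handling of Eq. (11) is left out.

```python
def nonneg(value, alpha):
    """Non-negativity function f of Eq. (6): scale negative values by alpha."""
    return alpha * value if value < 0 else value


def recursion_pass(z_pred, k_inf, h, innovation, alpha):
    """Sequential evaluation of Eqs. (7) and (8) for one sample.

    z_pred, k_inf : lists of length M + 1; entry m holds z_hat[n|n-1](m) and
                    k_inf(m) for m = 1..M (entry 0 is unused padding)
    h             : list of length M; entry j holds h(j) for j = 0..M-1,
                    with h[0] = 0 as in Eq. (10)
    Returns (z_next, y_partial), where z_next[j] = z_hat[n+1|n](j) for
    j = 0..M-1 and y_partial is the accumulated sum y_hat[n+1](M-1).
    """
    size = len(z_pred) - 1
    z_next = [0.0] * size
    y_partial = 0.0                       # running sum starts at zero
    for m in range(1, size + 1):
        # Eq. (7): corrected element, shifted down by one position
        z_next[m - 1] = nonneg(z_pred[m] + k_inf[m] * innovation, alpha)
        # Eq. (8): add one product to the running estimate
        y_partial = h[m - 1] * z_next[m - 1] + y_partial
    return z_next, y_partial


z_next, y_partial = recursion_pass(
    [0.0, 1.0, -2.0, 3.0], [0.0, 0.5, 0.5, 0.5], [0.0, 1.0, 2.0], 2.0, 0.5
)
```

Each loop iteration reads only element `m` and writes element `m - 1`, which is the shift-by-one locality that lets neighbouring PEs exchange data through registers.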

In comparison with a sequential architecture, the computation time can be decreased significantly by using the classic divide-and-conquer parallelization [13]. The problem is divided into $S$ sub-problems, which are solved separately, and the solution of the initial problem is obtained by combining the partial solutions. With this approach, an improvement of about a factor $S$ with respect to a sequential architecture can reasonably be expected. In the case where $M$ is an integer multiple of $S$, the change of variables $m = s + (r-1)S$ with $r = 1, 2, 3, \ldots, M/S$ is made in Eqs. (7) and (8), and the limit conditions on the indices are:

$$n = 1,\ 1 \leq r \leq M/S,\ 1 \leq s \leq S \;\Rightarrow\; \hat{z}_{n|n-1}(s + (r-1)S) = 0, \tag{9}$$

$$\begin{aligned} 1 \leq n \leq N,\ r = 1,\ s = 1 \;&\Rightarrow\; \hat{y}_{n+1}(s - 2 + (r-1)S) = 0, \\ &\Rightarrow\; \hat{y}_{n+1}(s - 1 + (r-1)S) = 0, \\ &\Rightarrow\; h(s - 1 + (r-1)S) = 0, \\ &\Rightarrow\; \hat{x}_n = \hat{z}_{n+1|n}(s - 1 + (r-1)S), \end{aligned} \tag{10}$$

$$\begin{aligned} 1 \leq n \leq N,\ r = M/S,\ s = S \;&\Rightarrow\; \hat{z}_{n+1|n}(s + (r-1)S) = \hat{z}_{n+1|n}(s - 1 + (r-1)S), \\ &\Rightarrow\; I_{n+1} = \tilde{y}_{n+1} - \hat{y}_{n+1}. \end{aligned} \tag{11}$$

Fig. 1. Linear semi-systolic architecture with ring topology (SYSKAL).

The notation $C \Rightarrow E$ indicates that $C$ is a set of conditions on the indices and $E$ an equation. The last condition of Eq. (11) corresponds to the measurement of a new sample $\tilde{y}_{n+1}$ to obtain the innovation $I_{n+1}$. The limit conditions modify the computation process at the beginning of the sampling when $n = 1$, on the first elements of computation when $r = 1$, and on the last elements of computation when $s = S$.
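The change of variables $m = s + (r-1)S$ can be tabulated to see how the $M$ state elements are spread over $S$ sub-problems. The sketch below is an illustration under one assumption not stated explicitly at this point in the text: that $s$ identifies the sub-problem (processing element) and $r$ the position within it. The function name `element_map` is hypothetical.

```python
def element_map(size_m, size_s):
    """Map each sub-problem index s (1..S) to the element indices m it covers.

    Uses m = s + (r - 1) * S for r = 1..M/S, so M must be a multiple of S.
    """
    if size_m % size_s != 0:
        raise ValueError("M must be an integer multiple of S")
    rows = size_m // size_s
    return {
        s: [s + (r - 1) * size_s for r in range(1, rows + 1)]
        for s in range(1, size_s + 1)
    }


# With M = 12 and S = 4, each sub-problem covers M/S = 3 elements.
mapping = element_map(12, 4)
```

Under this reading the elements are interleaved rather than split into contiguous blocks, so consecutive indices $m$ and $m - 1$ always belong to neighbouring values of $s$.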

According to the previous equations, the whole computation can be carried out on a processor array, assuming that the number of processors is finite, the array topology is regular, the connections are localized, and each processor executes a single task at a time. A ring topology of an array of $S$ processing elements (PEs), resulting from the uniform recurrent Eqs. (7) and (8), is proposed in Fig. 1. A common clock synchronizes all $S$ PEs, and the data flow moves at the same time between PEs via their internal registers. This architecture is called a linear semi-systolic architecture with ring topology (SYSKAL), according to the definition of the term semi-systolic given in [14]. The solution of the initial problem, the estimate $\hat{y}_{n+1}$, is obtained by combining the solutions of the $S$ sub-problems using a fast adder. The estimate $\hat{y}_{n+1}$ yields the innovation $I_{n+1}$ used to reconstruct the next sample $\hat{x}_{n+1}$.

Fig. 2 shows a PE architecture composed of two cells, $C_z$ and $C_y$: $C_z$ cells calculate Eq. (7) and $C_y$ cells calculate Eq. (8), producing the estimate of the signal $\tilde{y}_{n+1}$ required to calculate the innovation $I_{n+1}$. All PE$_s$ have the same structure, except the first, PE$_1$, and the last, PE$_S$. The limit conditions of Eqs. (10) and (11) are satisfied, respectively, by PE$_1$, which has one fewer output register in cell $C_z$ and one zero-valued register in cell $C_y$, and by PE$_S$, which differs by an added multiplexer. Each $C_y$ cell has one register set to zero that acts as a delay of one clock cycle to satisfy the sequential computation of Eqs. (7) and (8), since Eq. (8) requires the solution of Eq. (7).

Fig. 2. PE architecture: (a) $C_z$ cell and (b) $C_y$ cell.

To illustrate the overall operation of the linear semi-systolic architecture with ring topology, an example of the implementation of the algorithm KALMAN+ for the reconstruction of one measured sample is shown in Fig. 3. The initial state of the architecture corresponds to the beginning of the computation process for a measured sample $n$, with a fast adder with a tree topology. The architecture parameters are $M = 12$ and $S = 4$. In Fig. 3, the indexes of the data $\hat{z}_{n|n-1}$, $k_\infty$ and $h$ have been replaced by their numerical values. A whole cycle of data movement through the registers, corresponding to one reconstruction point, takes $M/S$ clock cycles, as can be seen by following the data flow ($\hat{z}_{n|n-1}$, $k_\infty$ and $h$) through the PE registers. One supplementary cycle is needed to satisfy the limit conditions of Eq. (11). This evaluation assumes that the fast adder is included in the data path [15], so that it requires no supplementary cycle. It is always possible to pipeline the adder to increase the clock frequency, at the cost of some additional cycles for the limit conditions of Eq. (11).

Fig. 3. Example of SYSKAL architecture to illustrate the data flow position in the initial state of the PEs with $M = 12$ and $S = 4$, for $n = 1$.

3. VLSI implementation

Regarding the architecture implementation, a systolic version is advantageous in terms of conceptual simplicity and modularity. The design is based on the use of a few basic PEs to improve performance on a reduced silicon area. This is a major advantage compared with architectures requiring complex layout [7,8]. The modularity of the parallel architecture makes it adaptable to the size and nature of the problem under consideration.

3.1. Regularity and modularity of the architecture

A surface criterion becomes an important physical constraint in the VLSI implementation of the architecture developed above. In fact, the number of PEs to be implemented, $S$, and the number of registers per PE are limited. As a result, the length of the impulse response $\mathbf{h}$ is restricted and, consequently, the field of applications supported by the processor is reduced. The SYSKAL architecture can be made modular, allowing several SYSKAL processors to run in parallel (we denote by $N_P$ the number of SYSKAL processors); the processors keep their linear structure when they are linked together in cascade. To respect the data flow through the PEs of processor $P$, a multiplexer is available at the PE$_S$ input to receive the data $\hat{z}_{n|n-1}(s-1)$ calculated according to Eq. (7) and sent by a previous PE$_1$. Another multiplexer is available at the output of the first element, PE$_1$, to add a register to the last stack of $\hat{z}_{n|n-1}(s)$ if the SYSKAL processor is not the last one. Each processor connected in cascade can access the partial results $\hat{y}'_{n+1}$ of the estimate of $\tilde{y}_{n+1}$ to calculate the innovation $I_{n+1}$. This makes the VLSI implementation of the iterative version of the Kalman filter for signal reconstruction easier [4].

3.2. Calculation of innovation for a parallel processor application ($N_P > 1$)

Fig. 4 shows the calculation of the innovation $I_{n+1}$, which requires the partial results $\hat{y}'_{n+1}$ of the estimate of $\tilde{y}_{n+1}$ obtained at the adder output (Fig. 1). In the case of one processor ($N_P = 1$), these results are accumulated in an accumulator register R2. At the last cycle of the calculation, the estimate is complete and is sent to register R3 through a bit shifter; it is then subtracted from the available measurement $\tilde{y}_{n+1}$. Next, the innovation $I_{n+1}$ is sent to the PEs of the same processor. The bit shifter is a barrel shifter controlled by a DELTA variable, which is used to correct the quantization errors in the calculation process [1].

In the case of several processors in parallel ($N_P > 1$), each $\hat{y}'_{n+1}$ accumulated in register R2 of each processor is propagated to the next processor. Fig. 4 shows the register states at the step where each $\hat{y}'_{n+1}$ is propagated to the next processor. The calculation of the innovation sequence occurs in the following steps:

1. the local value $\hat{y}'_{n+1}$ is accumulated in register R2;
2. $\hat{y}'_{n+1}$ is propagated to the next processor via multiplexer MUX1. The $\hat{y}'_{n+1}$ received from the previous processor is loaded in register R1 and at the same time added to the local value:
   2.1. if two processors are used, the result is immediately loaded in register R3 and a new value of the raw signal is loaded in register R4; then the next calculation cycle begins;
   2.2. if more than two processors are used, the result is loaded in register R2 and step 3 follows;
3. the multiplexer MUX1 is set to propagate the $\hat{y}'_{n+1}$ value in register R1 to the next processor $p + 1$;
4. the $\hat{y}'_{n+1}$ values are propagated through the chain of processors until each processor has received all the $\hat{y}'_{n+1}$ values of the other processors. At each new propagation, the new $\hat{y}'_{n+1}$ value is added in register R2. During the last propagation, the addition result is loaded in register R3, register R2 is set to zero and the $\tilde{y}_{n+1}$ signal is loaded in register R4.

Fig. 4. Innovation calculation for an application with SYSKAL processors in parallel ($N_P > 1$).

The innovation calculation is executed in $N_P - 1$ clock cycles.
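The propagation steps above can be mimicked in software to check the $N_P - 1$ cycle count. The sketch below is a behavioural illustration only, not a model of the actual circuit: the function name `propagate_partials` is hypothetical, and it assumes that the cascade is closed so that the last processor passes its value to the first. The lists `accumulated` and `in_flight` loosely play the roles of registers R2 and R1.

```python
def propagate_partials(partials):
    """Pass partial sums around a closed chain of processors.

    partials : one partial estimate per processor
    Returns (totals, cycles): the sum seen by each processor and the number
    of propagation cycles used.
    """
    count = len(partials)
    accumulated = list(partials)          # local running sums (R2-like)
    in_flight = list(partials)            # value each processor forwards next
    cycles = 0
    for _ in range(count - 1):
        # every processor receives the value forwarded by its predecessor
        received = [in_flight[(p - 1) % count] for p in range(count)]
        accumulated = [a + r for a, r in zip(accumulated, received)]
        in_flight = received              # forward what was just received (R1-like)
        cycles += 1
    return accumulated, cycles


totals, cycles = propagate_partials([1, 2, 3])
```

After the loop every processor holds the same complete sum, so each one can form the innovation locally without a central adder.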

3.3. $C_z$ and $C_y$ cell implementation

$C_z$ cells calculate Eq. (7); the implementation of this equation is shown in Fig. 5(a). During the first step, a Wallace tree computes the product $k_\infty(s + (r-1)S)\,I_n$, reducing the partial products to two quantities. Both quantities are then injected into a carry-save adder (CSA) together with the operand $\hat{z}_{n|n-1}$ to be added. The CSA reduces the three quantities to two, and another adder completes the operation $\hat{z}_{n|n-1}(s + (r-1)S) + k_\infty(s + (r-1)S)\,I_n$ at the tree root level. This saves one adder per PE.

Fig. 5. PE VLSI implementation: (a) $C_z$ cell and (b) $C_y$ cell.

$C_y$ cells execute the partial computations of $\hat{y}_{n+1}$, Eq. (8), as shown in Fig. 5(b). A Wallace tree with Booth encoding performs the multiplications. Together, the tree roots produce eight partial products, which are reduced to two quantities by an extension of the Wallace trees. This extension consists of two CSA layers, which reduce four quantities to two (CSA 4-to-2). The two resulting quantities are added using an adder to produce $\hat{y}_{n+1}$. This saves one adder per PE and $\log_2 S$ adders in the tree.

3.4. The non-negativity constraint

The non-negativity constraint, Eq. (6), requires multiplying the value $\hat{z}_n(m)$ by a constant $\alpha$ if it is negative. To avoid one or more additional cycles, $\alpha$ is chosen among the values 0, 1/16, 1/8, 1/4, 1/2 or 1; it has been demonstrated that with such a value an improvement of about 50% in the reconstruction can be expected [3]. The multiplication is then equivalent to shifting the binary point to the left, so no additional clock cycle is needed. A barrel shifter is used, which performs the shift asynchronously and requires no clock cycle. The shifter is placed in the last pipeline stage of the appropriate M/A, and the shift is triggered by a simple sign detection on the M/A output.
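When the scaling constant is zero, one, or a negative power of two, the constraint of Eq. (6) reduces to a sign test followed by a shift. The sketch below illustrates the idea on Python integers standing in for fixed-point words. It is an assumption-laden illustration: the function name `constrain` and the encoding of the constant as a shift count are invented here, and Python's right shift rounds toward negative infinity, whereas the rounding behaviour of the actual barrel shifter is not described in the text.

```python
def constrain(value, shift):
    """Apply the non-negativity constraint with a power-of-two constant.

    value : signed integer standing in for a fixed-point word
    shift : None for a constant of 0, otherwise k for a constant of 2**-k
            (k = 0 gives 1, k = 1 gives 1/2, ..., k = 4 gives 1/16)
    """
    if value >= 0:
        return value                      # non-negative values pass unchanged
    if shift is None:
        return 0                          # constant 0: negative values are zeroed
    return value >> shift                 # arithmetic shift keeps the sign


scaled = constrain(-16, 2)                # -16 scaled by 1/4
```

No multiplier is involved: the only decision is the sign bit, which matches the text's remark that the shift is driven by sign detection on the M/A output.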

3.5. Processor complexity

The SYSKAL processor was simulated and synthesized using Mentor Graphics software tools. A parameterized VHDL model of SYSKAL was designed so that a design adapted to an application with special requirements can be obtained more rapidly and efficiently; the parameters include, for example, the number of bits for $\hat{z}_{n|n}$, $h$, $k_\infty$ and the data, and the number of PEs or of registers per PE. The SYSKAL processor is composed of four PEs, each containing 16 registers ($C = 16$) for $\hat{z}_{n|n-1}$, $k_\infty$ and $h$, and a state machine controlling these PEs according to the executed algorithm with non-negativity constraints. This state machine makes it possible to use a co-processor for the computation of the Kalman gain. The processor uses a 16-bit word length for $\tilde{y}$, $\hat{z}$ and $\hat{x}$ and an 8-bit word length for $h$ and $k_\infty$. The four PEs, these data word lengths and the 16 registers form an acceptable compromise between speed, the number of applications covered ($M = 64$), and silicon area. Nevertheless, several processors can be connected in parallel when the dimension of the vector $\mathbf{h}$ is larger than 64 elements.

The synthesis was done with a low optimization effort on speed and area, using a 1.2 μm CMOS technology. One multiplier used 4557 transistors, the $C_z$ cells used 14 916 transistors, the $C_y$ cells used 7509 transistors, the state machines used 3596 transistors, and the calculation of the innovation used 5036 transistors, for a total of 102 208 transistors for the ASIC. The clock frequency of the processor is 40 MHz and is principally limited by the speed of the multiplier.

4. Performance evaluation

The computing performance of the proposed SYSKAL parallel architecture, using as a criterion the number of clock cycles necessary to reconstruct one sample with or without the non-negativity constraint, is $(M/S + N_{PS} + 1)$, where $N_{PS}$ is the number of pipeline stages in the critical path, including the pipeline of the multipliers and adder in the cells ($N_{PS} = 0$ if not pipelined). Moreover, SYSKAL has the following latency and throughput: $(M/S + N_{PS} + 1 + d)\,t_{PE}$ and $(M/S + N_{PS} + 1)/t_{PE}$, respectively, where $t_{PE}$ is the running time of the slowest PE and $d$ is the estimation delay of $\hat{x}$.
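The cycle count per reconstructed sample is simple enough to evaluate directly. The sketch below computes it and converts it to a time per sample; it is a worked illustration, with the function names `cycles_per_sample` and `time_per_sample` invented here, and it assumes that one clock period equals the slowest PE running time, which the text does not state explicitly.

```python
def cycles_per_sample(size_m, size_s, pipeline_stages=0):
    """Clock cycles to reconstruct one sample: M/S + N_PS + 1."""
    if size_m % size_s != 0:
        raise ValueError("M must be an integer multiple of S")
    return size_m // size_s + pipeline_stages + 1


def time_per_sample(size_m, size_s, clock_hz, pipeline_stages=0):
    """Seconds per sample, assuming one clock period per cycle."""
    return cycles_per_sample(size_m, size_s, pipeline_stages) / clock_hz


# The configuration described in Section 3.5: M = 64, four PEs, 40 MHz, no pipelining.
cycles = cycles_per_sample(64, 4)
seconds = time_per_sample(64, 4, 40e6)
```

For this configuration the formula gives 17 cycles, or 425 ns per sample at 40 MHz; doubling the number of PEs to eight would reduce the count to 9 cycles.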

The previous section has shown that a SYSKAL processor can be implemented with fewer components than a general-purpose DSP such as the DSP56002, which contains about 200 000 transistors in 0.65 μm CMOS technology with a clock at 80 MHz. Naturally, the word lengths in the two devices are not the same, but in a custom design such as ours, one can tailor the word length to the exact requirements of the application to reduce implementation costs, energy consumption and heat dissipation, which is very important in some systems applications.

The expected computing performance of the parallel architecture was estimated in the following way: (i) the Kalman-filter-based reconstruction algorithm was programmed for the Motorola general-purpose DSP56002; (ii) the performance was assessed using the number of cycles required for the execution of the algorithm. The comparison was based on the assumptions that the clock frequency is the same for both processors, that the number of PEs is fixed to obtain an equivalent area ($S = 8$), that the pipeline of the DSP56002 is full, and that the data of the DSP56002 are in internal memory. Results obtained with another programmable processor dedicated to the same class of algorithms, previously developed and called DSPKAL [7], are added. All results of the comparison for $S = 8$, $N_P = 1$, $N_{PS} = 0$, and $N = 1$ are shown in Fig. 6. These results indicate that the SYSKAL architecture requires about $20S$ times fewer clock cycles than the DSP56002.

To illustrate the overall operation of the proposed processor, an example of spectrometric signal reconstruction is shown in Fig. 7. It was obtained by processing real-world data acquired by means of the ANRITSU MV02 series optical spectrum analyzer. In this example, the computation times of reconstruction with the DSP56002 and the SYSKAL processor are 5.08 ms and 65.6 μs, respectively. If, however, we consider the technology and the number of transistors of the DSP56002 for our processor, a clock frequency of 75 MHz with $S = 8$ is theoretically attainable, and the reconstruction time becomes only 20 μs.

Fig. 6. Number of cycles evaluation for different dimensions of the impulse vector $\mathbf{h}$ with $S = 8$, $N_P = 1$, $N_{PS} = 0$, and $N = 1$.

Fig. 7. Example of spectrometric signal reconstruction using the SYSKAL processor with $\beta = 3 \times 10^{8}$, $S = 4$, $N_P = 1$, $N_{PS} = 0$, $N = 125$, and $M = 76$.

5. Conclusion

A new parallel architecture for signal reconstruction algorithms is proposed, which is expected to perform better than a general-purpose digital signal processor such as the DSP56002. In comparison with the general-purpose DSP56002 and under the same criterion, the reconstruction of one sample is $20S$ times faster. Moreover, the proposed architecture can be pipelined to obtain a clock rate $N_{PS}$ times faster than that of the DSP56002, where $N_{PS}$ is the number of pipeline stages. The gain in performance, however, is obtained to the detriment of the architectural flexibility for treating different algorithms (KALMAN, KALMAN+ and ITERKAL+). In fact, the set of algorithms accepted by the proposed architecture is limited to the algorithms based on the autoregressive model of the Kalman filter presented in [1,3,4]. Although intended for signal reconstruction, this architecture can be used for other applications where a similar autoregressive model of Kalman filtering is required. The SYSKAL architecture has the properties of modularity and locality and is adaptable to the dimension of the impulse response. In that case, computation time is only affected by the longer propagation time between two neighboring processors in the ring topology. Results of the performance evaluation show that the proposed architecture presents a linear-rate speed-up, i.e., it achieves an $O(S N_P)$ speed-up in terms of processing rates, where $S$ and $N_P$ are the numbers of PEs and SYSKAL processors, respectively. Finally, we want to point out that the proposed processor can be applied to many applications where fast real-time signal reconstruction or correction is essential, such as non-destructive evaluation (NDE), biomedical engineering, seismology, spectrometry, and channel equalization [1,2].

Acknowledgements

The author thanks Mr P.-L. Cantin, G. Fortin and É. Granger, École Polytechnique de Montréal, for discussions and help with the VHDL model of the proposed architecture, and Mr F. Pesson, Université du Québec à Trois-Rivières, for his help in the simulation of the processor. This work was supported by the Natural Sciences and Engineering Research Council of Canada; the Canadian Microelectronics Corporation and Mentor Graphics contributed equipment support.

References

[1] D. Massicotte, An approach to the implementation in VLSI technology of reconstruction algorithms class, Ph.D. Thesis, Dept. of Electrical Engrg., École Polytechnique de Montréal, 1995 (in French).

[2] R.Z. Morawski, Unified approach to measurand reconstruction, IEEE Trans. Instrum. Measurement 43 (2) (1994) 226-231.

[3] D. Massicotte, R.Z. Morawski, A. Barwicz, Incorporation of a positivity constraint into a Kalman-filter-based algorithm for correction of spectrometric data, IEEE Trans. Instrum. Measurement 44 (1) (1995) 2-7.

[4] D. Massicotte, R.Z. Morawski, A. Barwicz, Kalman-filter-based algorithms of spectrometric data correction, Part 1: an iterative algorithm of deconvolution, IEEE Trans. Instrum. Measurement 46 (3) (1997) 678-684.

[5] N.D. Crump, A Kalman filter approach to the deconvolution of seismic signals, Geophys. 39 (1974) 432-444.

[6] G. Demoment, R. Reynaud, Fast minimum variance deconvolution, IEEE Trans. Acoust. Speech, Signal Process. 33 (4) (1985) 1324-1326.

[7] A. Barwicz, D. Massicotte, Y. Savaria, M.-A. Santerre, R.Z. Morawski, An integrated structure for Kalman-filter-based measurand reconstruction, IEEE Trans. Instrum. Measurement 43 (3) (1994) 405-410.

[8] R. Reynaud, Kalman filtering for the monodimensional signal deconvolution: study of the new fast-Kalman-type algorithm, numerical experimentation, implementation in a specialized fixed-point arithmetic processor, Ph.D. Thesis, Paris XI, 1986 (in French).

[9] F.M.F. Gaston, G.W. Irwin, Systolic approach to square root information Kalman filtering, Int. J. Control 50 (1) (1989) 225-248.

[10] S.Y. Kung, J.N. Hwang, Systolic array designs for Kalman filtering, IEEE Trans. Signal Process. 39 (1) (1991) 171-182.

[11] P. Rao, M. Bayoumi, An algorithm specific VLSI parallel architecture for Kalman filter, IEEE Press: VLSI Signal Processing IV (1991) 264-273.

[12] P. Quinton, V. Van Dongen, The mapping of linear recurrence equations on regular arrays, J. VLSI Signal Processing 1 (2) (1989) 95-113.

[13] P. Quinton, Y. Robert, Systolic Algorithms and Architectures, Prentice-Hall, Englewood Cliffs, NJ, 1991.

[14] H.T. Kung, Why systolic architecture, IEEE Comput. Magazine 15 (1982) 37-46.

[15] X. Huang, W.-J. Liu, B.W.Y. Wei, A high-performance CMOS redundant binary multiplication-and-accumulation (MAC) unit, IEEE Trans. Circuits Systems I: Fundamental Theory Appl. 41 (1) (1994) 33-39.

Daniel Massicotte was born in Québec, Canada, in 1964. He received the B.Sc.A. and M.Sc.A. degrees in electrical engineering and industrial electronics in 1987 and 1990, respectively, from the Université du Québec à Trois-Rivières (UQTR), PQ, Canada. He obtained the Ph.D. degree in electrical engineering in 1995 at the École Polytechnique de Montréal, PQ, Canada.

From 1987 to 1990 he worked at the Acoustooptical/Ultrasonics Laboratory at the Université du Québec à Trois-Rivières, and in 1990 he joined the Microelectronics Research Group at the École Polytechnique de Montréal and the Laboratory of Measuring Systems at the Université du Québec à Trois-Rivières. Since 1994, he has been a professor with the Department of Electrical Engineering at the Université du Québec à Trois-Rivières, and since 1999 he has been Head of the Laboratory of Signal and Integrated Systems. Since 1997, he has been Educational Activities Chair of the St. Maurice Section of IEEE. He received the Douglas R. Colton Medal for Research Excellence awarded by the Canadian Microelectronics Corporation and the PMC-Sierra High Speed Networking and Communication Award in 1997 and 1999, respectively. His research interests include implementation of algorithms in VLSI technologies and digital signal processing for communications and measurement problems.

Dr. Massicotte is also a member of the "Ordre des Ingénieurs du Québec", a member of the "Groupe de Recherche en Électronique Industrielle" (GREI), and an associate member of the Montreal "Groupe Interuniversitaire en Architecture des Ordinateurs et VLSI" (GRIAO-IRO).