fpgas make a radar signal processor on a chip a · pdf filefpgas make a radar signal processor...

FPGAs Make a Radar Signal Processor on a Chip a Reality

Ray Andraka, P.E., Andraka Consulting Group, IncAndrew Berkun, Jet Propulsion Laboratory

AbstractRadar signal processors heavily tax the capabilities ofconventional microprocessor based signal processingsystems. Higher performance systems using customsilicon cost too much for the typically small productionvolumes, and are not flexible enough for researchapplications. Field programmable gate arrays offerthe performance of custom silicon while maintainingthe economies and flexibility of the microprocessorbased solutions. Recent devices possess the densityand performance to realize a complete radar signalprocessor in a single FPGA, including complex down-conversion of the IF to base-band. Performing thedown-conversion and matched filtering in the FPGAeliminates the specialty chips previously needed forthese functions, making it possible to use commerciallyavailable FPGA boards. Our system packs 4 channelsinto a pair of FPGAs. Each channel (a demodulator, amatched filter and a Doppler pulse pair processor)executes over 5 billion multiplies per second.

1. Introduction

Radar works by bouncing electromagnetic energyoff a target, recording the echo and making someuseful observation from the data. A fundamentalproblem in radar is that the vast majority of thereflected energy does not make it back to the receiver.Much of the processing in a radar system is to improvethe signal to noise ratio of the received signal. Oursystem is intended to measure the amount and velocityof rain in and below a cloud from an aircraft above thecloud layer.

Our radar uses pulse compression, that is to say wespread the transmitted pulse out in time and thenprocess the received echo with a matched filter to de-spread it. We also band-limit the return and digitallydemodulate the signal to a complex baseband. Finally,we measure and average the echo power and thedoppler shift between successive echoes to obtain thedesired measurements. Our system uses four channelswith different frequencies and field polarizations toobtain a more accurate picture of the weather.

The net processing for each channel requires over 5billion multiplies per second - well beyond thecapabilities of traditional microprocessor solutions.The low production volume and the desire to be able toadjust the processing in the future has driven us to anFPGA based solution. Furthermore, by doing thedemodulation and matched filtering in FPGAs ratherthan with specialty devices, we can use commerciallyavailable boards, thereby avoiding the cost, risk, andschedule of designing our own. Application of a fewalgorithmic tricks, careful FPGA floorplanning and theremarkable density and performance of the currentFPGAs allows us to put all the processing for twocomplete channels into a single FPGA.

2. Doppler pulse pair radar basics

2.1. Pulse pair processing

The echo for each radar pulse is demodulated anddigitized into a large number of complex samples(range gates in radar terminology). Each sample indexcorresponds to a specific time offset from the start ofthe radar pulse, so each sample represents the reflectedenergy at a specific range. The complex time series ofsamples at a given range gate can be processed with aFourier transform to obtain the doppler spectrum of theecho at that range, from which the mean velocity andvariance can be obtained. For a pulsed radar, it can beshown that the pulse pair algorithm provides a reliableestimate of the mean doppler frequency[1] at eachrange gate:

−

=+=

1

0

*1

1 N

kkk ss

NR

where sk are successive samples at a range gateand * indicates the complex conjugate.

Since this algorithm is simpler to compute than an FFTand is not biased by receiver noise, it is the algorithmof choice for most meteorological applications. Weneed to compute the average power along with themean doppler shift to provide a reference and formeasuring the total area of the raindrops at a givenaltitude. The average power at each range is:

Ray Andraka

Ray Andraka

Copyright 1999 IEEE. Published in the Proceedings of the Asilomar Conference on Signals, Systems, and Computers, October 24-27, 1999, Monterey, CA, USA. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE.

−

=

=1

0

* ||Re1 N

kkk ss

NR

The mean doppler frequency output for each rangegate is a complex value. Its phase indicates how muchthe raindrops have moved from one echo to the next.The magnitude gives a weight to the phase value(stronger echoes count more than weaker ones). Thedifference between the average doppler magnitude andthe average power measurement provides a measure ofthe ‘spread’ of the velocities in a set of echoes.

2.2. Pulse compression

Transmitted pulse power plays a large part in thesignal to noise ratio of the received echoes. Higherpower means better detection. The range resolution ofradar is inversely proportional to the width of thetransmitted pulse, so we desire to make the pulse asnarrow as practical to obtain the best range resolutionwe can. Unfortunately, narrowing the pulse means thatthe peak pulse power has to be increased to keep thetotal pulse power constant. High peak powers presentmany practical problems in the transmitter design [2].

Instead, we can transmit a long coded pulse, andthen compress the echo into an impulse in the receiverusing a matched filter. The idea is to spread thetransmit energy out over time to limit the peak transmitpower, then use correlation at the receiver to recoverthe range resolution. In our case, the coded transmitpulse is a linearly swept frequency ‘chirp’. Thetransmit pulse can be viewed as a convolution of animpulse and a filter with an impulse response equal tothe desired transmit waveform. Ideally, the receivefilter is the inverse filter so that filtering the transmitwaveform restores the original impulse. The impulseresponse of this receive filter is the complex conjugateof the time-reversed chirp. In practice we need towindow the filter to suppress range sidelobes [3] andwe will apply corrections for non-linearity and noise inthe transmitter and receiver. We desire to do thematched filtering in the digital part of the processing togive us the flexibility of changing the transmitwaveform and frequency, optimizing the filter for a

particular doppler range, and applying corrections forthe analog front end. The matched filter’s referencewaveform is as long as the transmit waveform, whichin this case is 40 µsec. In order to keep the number oftaps in the filter to a minimum, we operate the filter atthe lowest sample rate possible. At 5 MHz, thistranslates to a minimum filter length of 200 complextaps. We’ve extended the filter to 256 taps toaccommodate longer chirps in the future.

2.3. Demodulation and video filtering

Quadrature phase detection is used to obtain the realand imaginary parts of the complex echo envelopenecessary for doppler pulse pair processing. By doingthe demodulation from IF to complex baseband in thedigital domain, we avoid the channel mismatch anddrift problems associated with an analog quadraturephase detector. Additionally, using a digital videofilter gives us a much better and more flexible filterthan we can afford to build in the analog domain. Thefinal analog IF is a 4MHz wide signal on a 5 MHzcarrier. By oversampling at 20 MHz we can use ananti-alias filter with a fairly gentle roll-off, and as wewill see shortly, we gain a significant processingadvantage for the digital demodulation.

The digital video filter attenuates the noise outsidethe 3 to 7 MHz passband. That noise is mostly receiverthermal noise, so it is Gaussian and is spread evenlyacross the digitized spectrum. The video filterattenuates the out of band noise by 20 dB. Quadraturedemodulation requires the removal of the negativefrequency passband image either before or after thecomplex mixer. Since decimation aliases that imageonto the desired signal, the phase splitter has toattenuate the image by at least 85 dB.

The complex mixer shifts the echo spectrum downby ωc so the positive frequency component becomes acomplex baseband signal. The phase splitter suppressesthe negative frequency part which is shifted to -2ωc.The resulting complex baseband signal has a 2 MHzbandwidth, so we can decimate it 4:1 without aliasingto minimize the processing load in the matched filter.

Avg power

½ K Fifo

½ K Fifo

50 µsecMatched

Filter(256

complexcoefficients)e-jwct

(5 MHz)

Decimateby 4

PhaseSplit

12 bitADC

VideoBandpass

Filter

Real input signal with 4 MHzBandwidth on 5 MHz carriersampled at 20 MHz

Avg doppler

½ K Fifo

Complex samples

*

*

* Complex Conjugate

Figure 1. Radar signal processor (one of four channels shown)

3. Making it happen in an FPGA

The matched filter in each channel alone requiresover 1000 multiplies per sample, a processing load thateasily overwhelms a microprocessor. By takingadvantage of the parallelism available in an FPGA andusing a few algorithmic tricks, we can easily handle theprocessing for two channels in one FPGA.

3.1. Algorithmic efficiencies

The video filter and the phase splitter (a Hilberttransform filter) can be combined into a single analyticfilter, which is still a bandpass filter, but it only passesthe positive frequency components of the real inputsignal. The impulse response of that filter has to becomplex, so the filter requires two real filters forimplementation. That composite filter response can bedemodulated to move it to the other side of themixer[4] as shown in Figure 2. Note that the translatedfilter response is a low pass filter. Rearranging thedemodulator provides an opportunity for significantcomputational optimizations.

e-jωct =cos(ωct)-jsin(ωct)

-jsin(ωct)

cos (ωct)Lowpass Filter

√2 f(t)

Lowpass Filter√2 f(t)

I

Q

a) conventional arrangement

b) Filter translated to follow mixer

Decimateby 4Video

BandpassFilter

2f(t) cos(ωct)

PhaseSplit

Decimateby 4

Figure 2. Equivalent demodulator structures.Decimating the filter output involves discarding

three of every four output samples. Since we do not usethe discarded samples, there is no need to perform theassociated computations [5]. Figure 3 shows thecomputations for three successive retained outputs fora FIR filter when the output is decimated by four.Inspection reveals that each coefficient only multipliesevery fourth sample. The filter can be decomposed intoa sum of 4 sub-filters, each of which filters a particular“phase” of the input signal. One filter phase ishighlighted in the illustration. Each sub-filter operatesat the decimated sample rate rather than the input rate,so even though there are as many taps as the originalfilter, the processing load is a quarter of what it was.

Figure 4 shows the resulting decimate by four filterstructure.

Yk+0 = xkC0 + xk-1C1 + xk-2C2 + xk-3C3 + xk-4C4 +…Yk+4 = xk+4C0 + xk+3C1 + xk+2C2 + xk+1C3 + xkC4 +…Yk+8 = xk+8C0 + xk+7C1 + xk+6C2 + xk+5C3 + xk + 4C4 +…

Figure 3. Filter output after decimating by 4

16 Tap FIR

16 Tap FIR

15 Tap FIR

16 Tap FIR

20 MHzInput

5 MHzOutput

x0,x4,x8...

x1,x5,x9...

x2,x6,x10...

x3,x7,x11...

Figure 4. 63 tap decimate by 4 FIR filter.The N coefficients of a FIR filter with linear phase

are symmetric so that Ci = CN-i. The number ofmultiplies can be cut in half by folding the tappeddelay line and pre-adding sample pairs beforemultiplying by the coefficient as is shown in Figure 5.

Yk = xkC0 + xk-1C1 + xk-2C2 + xk-3C3 + …+ xk-nCn

Yk = (xk+ xk-n)C0 + (xk-1+ xk-n+1)C1 + …+ (xk-n/2)Cn/2

Figure 5. Symmetry reduces multiplicationsAs with many systems, we have some freedom in

the selection our input sample rate. If it is set to beexactly 4 times the local oscillator, then the complexlocal oscillator, e-jwt, is represented by the repeatingcomplex sequence: 1, -j, -1, j. When followed by adecimate by four filter, the LO value is constant foreach phase of the filter. The mixer can be eliminatedby weighting each of the sub-filters with the LO valuecorresponding to that phase [6]. Note that the phaseangles selected weight half of the sub-filters with zero,eliminating those sub-filters as shown in Figure 6.

16 Tap FIR

16 Tap FIR

15 Tap FIR

16 Tap FIR

20 MHz Real Input

5 MHzComplexOutput

x0,x4,x8...

x1,x5,x9...

x2,x6,x10...

x3,x7,x11...

1+0j

0+j

-1+0j

0 -j

I

Q

Figure 6. Complete decimating downconverterwith 63 tap video filter

If the length of the filter is odd and its coefficientsare symmetric, the I and Q filters each use only half ofthe 32 unique coefficients (I uses 16, Q uses the other15). Using the symmetry cuts the multiplications in thefilter in half again. We took advantage of decimationand symmetry in our demodulator. The matched filteris non-symmetric so these tricks do not apply there.

3.2. Bit serial arithmetic

Modern FPGAs can easily handle synchronousdesigns with clocks faster than 100 MHz withextensive pipelining and careful design. In designs likethis, where the data rate is a comparatively low 5 MHz,a conventional bit parallel design leaves much of theFPGA’s capability unused.

Bit parallel designs process all of the bits of a dataword simultaneously. In contrast, a bit serial structureprocesses the data one bit at a time. The advantage isthat all of the bits pass through the same logic, result-ing in a huge reduction in the required hardware. Typi-cally, a bit serial design requires only about 1/nth of thehardware needed for the equivalent n-bit parallel de-sign[7][8]. The price of this logic reduction is that theserial hardware takes n clock cycles to execute, whilethe equivalent parallel structure executes in one.

3.3. Distributed arithmetic

Bit serial arithmetic alone doesn’t reduce the size ofthe complex matched filter enough to fit it in an FPGA(A thousand 12 bit serial multipliers alone would justabout fill a Virtex XCV1000. Another trick, distributedarithmetic [9][10], is the key to compacting the filtersto a manageable amount of hardware. This techniquerearranges the multiplies and adds of a sum of productsat the bit level to take advantage of small tables of pre-computed sums.

The distributed arithmetic approach is an inherentlyserial process, but the computations for each bit can beperformed in parallel to obtain more performance ifnecessary. Figure 7 shows a four tap FIR filter madeof serial by parallel “scaling accumulator” multipliers.The AND gates compute a 1 x n partial product whichis added to the shifted sum of the previous partialproducts for each bit in the serial input. Thecoefficients, Ci, are generally constants. The first shiftregister serializes the input and the remaining onesdelay the input by one sample time each.

Shift Reg

Shift RegC0

<<1Scaling Accum

Scaling Accum

Scaling Accum

Scaling Accum

C1

C2

C3Shift Reg

A•C0+ B•C1+ C•C2+ D•C3

Xk

Shift Reg

Figure 7. Four tap serial FIR filter

Each of the scaling accumulators simply sums the1 x n partial products for the multiplication of theassociated coefficient. The adders can be rearranged sothat the 1 x n partial products from all the coefficientsare summed before adding the result to the previous bitsum as shown in Figure 8. Note that the 1x n multipliesare still done at the beginning; all we have done ischanged the order that we sum the partial products.This simple change directly eliminates k-1 registeredadders in a k tap serial filter.

Shift Reg

Shift Reg

Shift Reg

Shift Reg

C0

C1

C2

C3 A•C0+ B•C1+ C•C2+ D•C3

<<1Scaling Accum

Figure 8. Serial FIR filter rearranged to addertree and one scaling accumulator

Since the coefficients are constants, the AND gatesand adder tree can be reduced to a look-up table with afour bit input. The sixteen table entries are the sums ofthe 1 x n products of the inputs and coefficients for allthe possible input combinations, as shown in Figure 9.The table is made wide enough to accommodate thelargest sum without overflow.

C0

Ai

C1

Bi

Ci

C2

C3

Di

Addr Data0000 00001 C00010 C10011 C0 + C1

1110 C1+ C2+ C31111 C0 + C1+ C2+ C3

Figure 9. Serial adder tree and coefficientscan be implemented as look-up table in ROM

The size of the look-up table grows exponentiallywith the number of inputs. Rather than creating alarger table, we combine the outputs of smaller tableswith adders. That way, the size of the circuit growslinearly with the number taps instead of exponentially.This architecture is well suited to FPGAs that use 16x1look up tables as their basic logic resource.

The actual circuit is heavily pipelined to increasethe circuit throughput at the expense of clock latency.A serial distributed arithmetic filter with 12 bitcoefficients implemented in the slowest speed grade(XCV1000-4) Xilinx Virtex is capable processing atmore than 150 Mbps, regardless of the number of taps.A 256 tap filter with 12 bit coefficients occupies about715 Virtex configurable logic blocks (CLBs), which isless than 3 CLBs per tap.

3.4. Interleaved processing

The filter design can be clocked at more than twicethe bit rate we require for 12 bit inputs, even in theslowest parts. We can take advantage of the excesscapacity by multiplexing two channels through one setof hardware when the processing for both channels isidentical. A distributed arithmetic filter is modified tohandle two channels by simply doubling the delaybetween taps and adding a clock delay in theaccumulator feedback loop. The delay in theaccumulator loop lets us accumulate two sumssimultaneously: one sum is accumulated on the oddclocks, and one on the even clocks. The input to thefilter is interleaved serial data so that one channel’sdata is shifted in on even clocks and the other on oddclocks. Both channels must use the same filtercoefficients.The demodulator circuit uses the same filtercoefficients for all the channels, so by interleaving bits,one demodulator circuit can handle two channels. Thematched filter coefficients are unique to each channelso it cannot be shared among channels. However,since the matched filter has complex inputs andcoefficients, it processes the I and Q data streams withidentical pairs of real filters. We can interleave the ‘I’and ‘Q’ samples so that we implement the complexmatched filter with a pair of two channel filters insteadof four filters as shown in Figure 10.

Im|h(ω)|

Re|h(ω)|I,Q

Q,II,Q

-,+

Figure 10. Complex filter using interleavedsample 2 channel filters

4. The FPGA solution

A two channel processor fits into one Xilinx VirtexXCV1000-4 FPGA using the optimizations described.The two channel processor consists of a 2 channeldecimating demodulator, (Figure 4 modified for 2interleaved channels), a pair of complex matched filters(Figure 10) and a pulse pair processor for each channel.The decimator and matched filter both have 12 bitcoefficients and 13 bit inputs (we added a 13th bit totake advantage of the SNR improvement out of thevideo filter). The demodulator and complex matchedfilters for two channels occupy approximately 3100CLBs - about half of the FPGA’s logic cells.

The complex multipliers in the doppler pulse pairprocess are implemented with 24 x 26 scalingaccumulator multipliers, (the 24 bit parallel input is theprevious pulse sample). The FIFO buffers for the

previous pulse and averaging are implemented in theVirtex Block RAM, so even the memory for theprocessor is inside the FPGA.

The entire 2 channel design occupies approximately55% of the logic plus all of the block RAM in theXCV1000. The design is implemented on anAnnapolis Microsystems WildstarTM board whichcontains 3 Virtex XCV1000-4s. We’ve used two of theFPGAs for the four channel processor, leaving the thirdfor future processing needs. The design is clocked at130 MHz to allow two interleaved 13 bit 5MHzprocesses. Timing analysis indicates the design iscapable of better than 150MHz, a healthy 15% margin.

5. Conclusions

In this paper we have discussed the implementationof a doppler weather radar processor in an FPGA. Wepresented a series of design optimizations andhardware tricks that allowed us to perform theequivalent of over 10 billion multiplies per second inone part. The tricks and techniques discussed here areextendable to a large class of signal processingapplications in communications, imaging and otherareas, all of which can benefit from an FPGAimplementation.

6. References

[1] Serafin, R.J., “Meteorological Radar,” in Skolnik, M.,Radar Handbook, 2nd Edition, McGraw-Hill, 1990.[2] Weil, T.A., “Transmitters,” in Skolnik, M., RadarHandbook, 2nd Edition, McGraw-Hill, 1990.[3] Farnett, E.C. and Stevens, G.H., “Pulse CompressionRadar,” in Skolnik, M., Radar Handbook, 2nd Edition,McGraw-Hill, 1990.[4] Lee, E.A. and Messerschmitt, D.G., DigitalCommunication, 2nd Edition, pp. 203-206, Kluwer AcademicPublishers, Norwell, Massachusetts, 1994.[5] Vaidyanathan, P.P., Multirate Systems and Filter Banks,Chapter 4, Prentice-Hall, Englewood Cliffs, NJ., 1993.[6] Frerking, M.E., Digital Signal Processing inCommunication Systems, Kluwer Academic Publishers,Norwell, Massachusetts, 1993.[7] Denyer, P. and Renshaw, D.,VLSI Signal Processing: aBit Serial Approach, Addison-Wesley Publishers LTD,England, 1985.[8] Andraka, R.J., “FIR Filter Fits in an FPGA Using a BitSerial Approach,” Proceedings, 3rd Annual PLD conferenceand exhibit, March 30-31, 1993. CMP Publications,Manhasset, NY, 1993.[9] Peled, A. and Liu, B., “A New Hardware Realization ofDigital Filters,” IEEE Trans. Acoust., Speech, SignalProcessing, vol. ASSP-22, pp. 456-462, Dec. 1974.[10] Goslin, G and Newgard, B., “16 Tap, 8 Bit FIR FilterApplications Gude,” Xilinx Application Note, Nov, 1994.Available: http://www.xilinx.com/appnotes/fir_filt.pdf

fpgas make a radar signal processor on a chip a · pdf filefpgas make a radar signal processor...

Documents