signal processing in fpga.pdf


Implementing a Quantitative Model for the Effective Signal Processing in the Auditory System on a Dedicated Digital VLSI Hardware

A. Schwarz, B. Mertsching: University of Hamburg, Computer Science Department, IMA Group, D-22527 Hamburg, Germany, [email protected]

M. Brucke, W. Nebel: University of Oldenburg, Computer Science Department, VLSI Group, D-26111 Oldenburg, Germany, [email protected]

J. Tschorz, B. Kollmeier: University of Oldenburg, Physics Science Department, Medical Physics Group, D-26111 Oldenburg, Germany, [email protected]

1. Introduction

The binaural perception model introduced in [1] describes the effective signal processing in the human auditory system and provides an appropriate internal representation of acoustic signals. Its capabilities were successfully demonstrated as a preprocessing algorithm for speech recognition [2], objective speech quality measurement [3] and digital hearing aids. The algorithm processes stereo signals and includes a gammatone filter bank (30 bandpass filters equidistantly distributed on the ERB scale from 73 to 6700 Hz) to model spectral properties of the human ear such as spectral masking and the frequency-dependent bandwidth of auditory filters.

[Figure 1: block diagram. Stereo input, gammatone filterbank, envelope extraction (halfwave rectification and 1 kHz lowpass filtering), absolute threshold (max), adaptation loops with time constants t1 to t5, 8 Hz lowpass filtering.]

Figure 1. Processing scheme of the binaural perception model introduced in [1].

Abstract

A digital VLSI implementation of an algorithm modeling the effective signal processing of the human auditory system is presented. The model consists of several stages that are psychoacoustically and physiologically motivated by the signal processing in the human ear, and it was successfully applied to various speech processing applications. The processing scheme was partitioned for implementation in a set of three chips. Because the signal dynamics and the necessary arithmetic precision differ locally, different approaches to number representation and the corresponding arithmetic operators were investigated and implemented. It is demonstrated how an application of the model has been used to determine the wordlengths needed to transfer the algorithm into a version suitable for hardware implementation. Fixed-point arithmetic is used in the linear parts of the original algorithm, and a special small floating-point operator set was developed for the nonlinear part. This part was coded in behavioral VHDL and synthesized with Synopsys Behavioral Compiler. The hardware algorithm is being evaluated on different implementation levels for an FPGA and will be manufactured as ASICs in a later version. The presented FPGA chip set will be combined with a commercial DSP system (TMS320C6201) for real-time and reconfigurable signal processing.



A stage modeling inner hair cell behavior (envelope extraction) is followed by five adaptation loops (with time constants between 5 and 500 ms) to account for dynamic effects such as nonlinear adaptive compression and temporal masking (see Fig. 1). The demonstrated VLSI design contains additional components to determine differences in phase and magnitude of each channel (Fig. 2).

2. Hardware design specifications

Due to its complexity, the design was partitioned into three chips (Fig. 2). Besides the serial data interfaces, chip 1 contains the 30-channel binaural filter bank, the envelope extraction, and a module computing phase differences and magnitude quotients for the stereo outputs of each channel. A single bandpass filter is multiplexed through all 30 channels for both the left and the right stereo signal. This kernel, which includes a six-stage pipelined multiplier and one adder/subtractor, is realized as a fourfold cascade of a single-stage complex IIR filter. This saves chip area but requires a 50 MHz system clock to operate at a sampling frequency of 16276 Hz. RAM units hold temporary data, and the filter constants are read from a ROM (see Fig. 2).

For further processing, three system outputs are available. A high-speed interface (30 MBit/s) provides the real and imaginary parts of the right and left stereo data for all filter bank channels (chip 1). The adaptively compressed data of the left (1st chip 2) and the right stereo signal (2nd chip 2) and the phase and amplitude information of the filter bank outputs are combined in a second data stream (12 MBit/s). The 4-wire serial interfaces of the chip set (16 bit data words) connect directly to the serial ports of most DSP devices. Combined with a DSP device able to serve the fast serial ports (TI TMS320C6201), the chip set provides a system solution for auditory signal processing.
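The 50 MHz clock requirement quoted above follows from time multiplexing: one filter kernel has to serve every channel of both ears within each sample period. The following rough budget uses only the figures given in the text. The even split of cycles among the 60 filter instances is an assumption made here for illustration; the actual schedule of the kernel is not described.

```python
# Clock budget for the time-multiplexed filter kernel on chip 1.
# Figures are taken from the text; the even split across channels is assumed.
SYSTEM_CLOCK_HZ = 50_000_000
SAMPLE_RATE_HZ = 16_276
CHANNELS = 30
EARS = 2  # left and right stereo signal

cycles_per_sample = SYSTEM_CLOCK_HZ / SAMPLE_RATE_HZ
cycles_per_filter = cycles_per_sample / (CHANNELS * EARS)

print(f"clock cycles per input sample: {cycles_per_sample:.1f}")
print(f"cycles per channel and ear:    {cycles_per_filter:.1f}")
```

Under these assumptions each channel of each ear has roughly 51 clock cycles per sample for its four filter stages, which indicates why the clock cannot be much slower when a single kernel is shared.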

[Figure 2: block diagram of the chip set. Labeled blocks: input interface with serial/parallel converter (SerialDataIn, SerialDataSync, input from DSP/codec); gammatone filter bank kernel with controller, multiplier, adder, subtractor, operand and temporary registers, multiplexers, ROM (constants) and RAM (state memory); halfwave rectification; 1st order 1 kHz IIR lowpass; decimation lowpass; phase-difference and magnitude-quotient modules; high-speed output interface (30 Mbit/s) and low-speed output interface (12 Mbit/s); five multiplexed ("rolled") adaptation loops with divider, controller, ROM and RAM; scale and lowpass unit; output interface with parallel/serial converter (12 Mbit/s, output to DSP and combined output). Control signals: Reset, Sync, Valid, Init, Busy, Panic. Clocks: 50 MHz and 24 MHz. The 1st ASIC 2 processes the left channel, the 2nd ASIC 2 the right channel.]

Figure 2. Structure and wiring scheme of the internal components of the chip set.



Each of the five adaptation loops contains a divider whose quotient is fed back through a 1st order IIR lowpass that provides the divisor. This feedback, the necessary precision and the signal dynamics require large fixed-point wordlengths or a logarithmic number format. The dividers are very area expensive and therefore the most critical components in the design. A fourfold subsampling and data serialization in chip 1 allow a multiplexed loop kernel, implemented monaurally in two chips (two instances of chip 2). The loop kernel contains RAM cells storing the states of all lowpass filters for the 30 serially processed frequency channels.

3. Floating point to fixed point to floating point - arithmetic suitable for auditory signal processing

A direct implementation of the model in IEEE 32 bit single precision floating-point arithmetic is not possible due to area and timing limitations. To obtain an optimal implementation, different methods are applied to the linear filter bank and to the nonlinear adaptation loops.

The main problem when converting number formats and designing dedicated arithmetic is determining the required numerical precision. Because the necessary quantization depends on the application and on typical signal dynamics, the perception model was recoded in C++ using new classes of scalable data types with the necessary operators. Each class takes the internal wordlength as a parameter and stores values in exactly the same format as they would be stored in a register on an ASIC. Thus the numerical effects of imprecise arithmetic can be simulated in target applications.
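The idea of a data type that takes its wordlength as a parameter can be illustrated with a small sketch. The authors' classes were written in C++ and are not reproduced in the paper; the class below is a hypothetical Python stand-in. Its name, the rounding mode (round to nearest), the saturation on overflow and the restriction to operands of the same format are all assumptions made for illustration.

```python
import math


class FixedPoint:
    """Hypothetical signed fixed-point value in int_bits.frac_bits format."""

    def __init__(self, value=0.0, int_bits=4, frac_bits=15, raw=None):
        self.int_bits = int_bits
        self.frac_bits = frac_bits
        if raw is None:
            # Round to the nearest representable step of 2**-frac_bits.
            raw = math.floor(value * (1 << frac_bits) + 0.5)
        # Saturate instead of wrapping around (an assumption).
        top = (1 << (int_bits + frac_bits)) - 1
        self.raw = max(-top - 1, min(top, raw))

    def _like(self, raw):
        return FixedPoint(int_bits=self.int_bits, frac_bits=self.frac_bits, raw=raw)

    def __add__(self, other):
        return self._like(self.raw + other.raw)

    def __sub__(self, other):
        return self._like(self.raw - other.raw)

    def __mul__(self, other):
        # Full-width integer product, then rounding back to frac_bits.
        half = 1 << (self.frac_bits - 1)
        return self._like((self.raw * other.raw + half) >> self.frac_bits)

    def __float__(self):
        return self.raw / (1 << self.frac_bits)


if __name__ == "__main__":
    # The representation error of 0.1 shrinks as fraction bits are added.
    for frac_bits in (8, 15, 24):
        x = FixedPoint(0.1, 4, frac_bits)
        print(frac_bits, float(x), abs(float(x) - 0.1))
```

Because the value is held as an integer of the chosen width, every operation sees the same rounding and saturation a register of that width would impose, which is the property the text relies on for simulating imprecise arithmetic.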

The kernel arithmetic of the gammatone filter bank was designed and successfully evaluated in fixed-point notation. For the nonlinear adaptation loops, a scalable fixed-point version was evaluated first; after its high area consumption became apparent, especially for the dividers, a small floating-point class was successfully tested.

3.1. Arithmetic transformation for linear gammatone filters

Principle. The necessary internal wordlength for the gammatone filter bank can be assessed in a straightforward way, because the filters are linear time-invariant systems to which classical numerical measures such as the SNR can be applied. It is sufficient to record the responses to δ-pulses for each filter, parameterized with different internal wordlengths. Figure 3 shows the mean square error (relative error, i.e. noise-to-signal ratio) between each of these implementations and the original specification in IEEE single precision floating-point arithmetic. Choosing a maximum square error (e.g. 10^-3 for all channels) leads directly to the necessary internal wordlength. Allowing an error of 0.001, a minimum wordlength of 24 bits is necessary for the lowest filter bank channel (Fig. 3).
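The procedure described above (record the pulse response for several wordlengths and compare it with a reference) can be sketched as follows. This is not the authors' filter: the pole placement, the ERB approximation 24.7 + 0.108 f, the gain normalization and the rounding of coefficient and states after every update are assumptions, and Python double precision stands in for the IEEE single precision reference. The numbers it prints are therefore not expected to reproduce Figure 3.

```python
import cmath
import math

FS = 16276.0  # sampling frequency from the text, in Hz


def quantize(z, frac_bits):
    """Round real and imaginary part to steps of 2**-frac_bits."""
    scale = 1 << frac_bits
    return complex(math.floor(z.real * scale + 0.5) / scale,
                   math.floor(z.imag * scale + 0.5) / scale)


def impulse_response(fc, frac_bits=None, length=4000):
    """Response of four cascaded complex one-pole filters to a unit pulse."""
    bandwidth = 24.7 + 0.108 * fc  # ERB approximation in Hz (assumed)
    pole = cmath.exp(complex(-2 * math.pi * bandwidth / FS,
                             2 * math.pi * fc / FS))
    gain = 1.0 - abs(pole)  # keeps the passband gain near one (assumed)
    if frac_bits is not None:
        pole = quantize(pole, frac_bits)
    states = [0j] * 4
    out = []
    for n in range(length):
        x = 1.0 + 0j if n == 0 else 0j
        for k in range(4):
            states[k] = pole * states[k] + gain * x
            if frac_bits is not None:
                states[k] = quantize(states[k], frac_bits)
            x = states[k]
        out.append(x)
    return out


def relative_mse(fc, frac_bits):
    """Noise-to-signal ratio of the quantized response against the reference."""
    ref = impulse_response(fc)
    test = impulse_response(fc, frac_bits)
    noise = sum(abs(r - t) ** 2 for r, t in zip(ref, test))
    return noise / sum(abs(r) ** 2 for r in ref)


if __name__ == "__main__":
    for bits in (12, 16, 20, 24):
        print(bits, relative_mse(73.0, bits), relative_mse(6700.0, bits))
```

Sweeping the number of fraction bits for a given center frequency and picking the smallest value that stays below the chosen error bound mirrors the selection rule stated in the text.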

Figure 3. Error introduced by fixed-point quantization in the gammatone filter bank.

Numerical operations. The filter algorithm consists of a fourfold first-order filter which contains only additions and multiplications by constants.

Number formats. Due to the increasing analysis bandwidth, the error for a given wordlength decreases with increasing center frequency, i.e. with increasing channel number. All channels use the same operator structure, thus a general number format of 24 bit fixed point is required.

3.2. Arithmetic transformation for nonlinear adaptation loops

Principle. Determining an optimal quantization in the adaptation loops is much more difficult because they show strongly nonlinear behavior.

It was demonstrated in [3] that the perception model can supply an objective speech quality measure q. Speech signals distorted by low-bit-rate codecs used in mobile telephone devices are compared to their undistorted versions, and a quality measure q is computed which is correlated with the subjective Mean Opinion Score (MOS) of the test signals. Because this testbench is very sensitive to limited number precision and signal dynamics in the perception model, it can be used to evaluate modifications caused by limited quantization and arithmetic (Fig. 4). An optimized quantization of the nonlinear adaptation loops (small wordlengths, i.e. small chip area, vs. reliable signal processing) was found by empirical wordlength variation. The results were verified by processing two different large speech signal sets while varying the input signal levels from -10 to 50 dB.
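The testbench reduces each arithmetic variant to one figure of merit: how well the objective measure q correlates with the subjective MOS. A minimal sketch of that comparison is the linear (Pearson) correlation coefficient below. The function name is chosen here, and the sample values are invented purely to make the example runnable; they are not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Linear correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Invented placeholder values, one pair per test condition.
    q_values = [0.80, 0.85, 0.90, 0.93, 0.97]
    mos_values = [1.6, 2.3, 2.9, 3.6, 4.1]
    print(f"r = {pearson_r(q_values, mos_values):.3f}")
```

A wordlength variation would repeat this computation once per candidate number format and keep the smallest format whose r stays close to that of the floating-point reference.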

[Figure 3 plot: mean square error (logarithmic axis, 1e-08 to 1e+00) over the number of the filter-bank channel (up to 30), one curve per wordlength: 16, 18, 20, 22, 24, 26, 28 and 30 bit.]

Data analysis. Histograms were recorded at internal nodes to investigate signal levels during the processing of typical speech (ETSI test data [4][5]) and noise input signals (Fig. 5).

Figure 5. Histograms of output and divisor in the adaptation loops for typical speech signals.

The divisors of the loops have individual thresholds; these lower bounds are introduced to reduce unwanted peaks. The dynamic range is clearly limited: only positive values occur in the loops, the divisors never exceed 1.0, and the loop outputs are concentrated near zero. This is to be expected, since small amplitudes are very frequent in typical speech signals according to their probability density distribution [6].

Numerical operations. In the loops and in the subsequent scaling and lowpass unit, the original C code uses all basic arithmetic operators (Table 1). The current quotients qi[n] in the loops are calculated from the local lowpass filter outputs bi[n-1] of the previous cycle. The current lowpass output bi[n] is derived from its previous output bi[n-1] and the new quotient qi[n]. The output of the last loop, q5[n], is shifted and scaled to s[n] in the scaling unit, and after the final lowpass filter the result o[n] is passed to the output interface. All Cx(i) are constants.
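The recurrences described above can be written down directly. The sketch below covers only the five divide-and-lowpass loops. Several details are assumptions: the lowpass coefficients are derived from time constants as C2 = exp(-1/(tau*fs)) and C1 = 1 - C2, the three intermediate time constants are values commonly quoted for the model in [1] (the text only gives the range 5 to 500 ms), and the lower bound of the divisors and the initial states are placeholders. The scaling unit and the final lowpass are omitted because their constants are not given.

```python
import math

FS = 16276.0 / 4.0  # rate after the fourfold subsampling in chip 1
TIME_CONSTANTS = (0.005, 0.050, 0.129, 0.253, 0.500)  # s; inner values assumed
DIVISOR_FLOOR = 1e-5  # hypothetical lower bound of the divisors


def make_loops():
    """Return lowpass coefficients (C1, C2) per loop and initial states b."""
    coeffs = []
    for tau in TIME_CONSTANTS:
        c2 = math.exp(-1.0 / (tau * FS))
        coeffs.append((1.0 - c2, c2))
    return coeffs, [1.0] * len(TIME_CONSTANTS)


def adaptation_step(x, coeffs, b):
    """Process one sample x >= 0; b holds b_i[n-1] and is updated in place."""
    q = x
    for i, (c1, c2) in enumerate(coeffs):
        q = q / max(b[i], DIVISOR_FLOOR)  # q_i[n] = q_(i-1)[n] / b_i[n-1]
        b[i] = c1 * q + c2 * b[i]         # b_i[n] = C1_i*q_i[n] + C2_i*b_i[n-1]
    return q


if __name__ == "__main__":
    coeffs, b = make_loops()
    for _ in range(int(5 * FS)):          # five seconds of constant input
        settled = adaptation_step(0.5, coeffs, b)
    onset = adaptation_step(1.0, coeffs, b)
    print(f"settled output: {settled:.4f}, first sample after a step: {onset:.4f}")
```

For a constant input each loop settles where its divisor equals its quotient, i.e. at the square root of its input, so five loops compress a stationary level x to x**(1/32), while a sudden level change passes almost uncompressed. This is the nonlinear adaptive compression the text refers to.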

Table 1. Operations in the adaptation loops; i is the loop number and n represents the sample number.

  division in loop i,        q0[n] = x[n] / b0[n-1]             (1st loop)
  i = [0, 1, 2, 3, 4]        qi[n] = qi-1[n] / bi[n-1]          (others)
  lowpass in loop i          bi[n] = C1i*qi[n] + C2i*bi[n-1]
  scaling unit               s[n] = (q5[n] - C3) * C4
  completing lowpass         o[n] = C5*s[n] + C6*o[n-1]

A useful simplification for the hardware specification is the fact that all values remain in the positive range up to the output of the last loop. The scaling unit, however, introduces a sign bit which propagates to the output.

Number formats. Considering the necessary precision of the kernel arithmetic and the arithmetic cores available in the synthesis tool libraries (Synopsys DesignWare), two approaches are possible. Simulations with the integer prototype show that, using the available fixed-point operators, a number format of 4 integer bits (int part) and 15 fraction bits (frac part) is sufficient, and all constants Cx(i) have to be quantized to 19 fraction bits.

[Figure 4: block diagram. The original signal and the codec-distorted signal each pass through the perception model (computed in IEEE 32 bit floating, fixed or small floating point arithmetic) and a frequency weighting; their cross-correlation yields q, which is compared/correlated with the subjective MOS data.]

Figure 4. Speech quality measurement used as a testbench for changes in the kernel arithmetic of the adaptation loops in the perception model.

[Figure 5 plots: histogram of the loop outputs (value 0 to 30; loop0 to loop4) and histogram of the divisors (value 0 to 1; divisor0 to divisor4), frequency on a logarithmic axis.]

When dividing or multiplying these fixed-point numbers, the internal wordlengths must be larger to hold all possible digits: 34 bits for the divider (eq. 1) and 38 bits for the multiplier (eq. 2). The dividend has to be prescaled (shifted) because the integer part of the quotient can grow by the number of fraction bits of the divisor (complementary to multipliers).

The product wordlength is the sum of the wordlengths of the operands a and b. Operand b (the filter constants) has only a fraction part (frac part b). In addition, a 20 bit fixed-point adder and subtractor are necessary. The most expensive operator is the 34 bit divider, whose area demand is unacceptably large and which appears to be near the limit of what the design tools can handle.
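The two internal wordlengths can be checked by applying eq. 1 and eq. 2 to the 4.15 data format and the 19 fraction bits of the constants. The short calculation below uses only these figures from the text; the variable names are chosen here, and sign bits are left out because the loop values are positive.

```python
INT_BITS = 4          # integer bits of the data format
FRAC_BITS = 15        # fraction bits of the data format
CONST_FRAC_BITS = 19  # fraction bits of the constants (operand b)

# Eq. (1): the quotient's integer part grows by the divisor's fraction bits.
div_int_part = INT_BITS + FRAC_BITS
div_frac_part = FRAC_BITS
div_wordlength = div_int_part + div_frac_part        # 34

# Eq. (2): the product keeps the integer part of a and both fraction parts.
mul_int_part = INT_BITS
mul_frac_part = FRAC_BITS + CONST_FRAC_BITS
mul_wordlength = mul_int_part + mul_frac_part        # 38

# Worst case for the divider: largest dividend over smallest nonzero divisor.
largest = 2 ** INT_BITS - 2.0 ** -FRAC_BITS
smallest = 2.0 ** -FRAC_BITS
worst_quotient = largest / smallest

print(div_wordlength, mul_wordlength, worst_quotient < 2 ** div_int_part)
```

The worst-case quotient stays just below 2**19, which is why the integer part of the divider result needs 19 bits.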

A floating-point number format has been introduced for the adaptation loops to reduce the area requirements and the long signal propagation delays through the combinational nets of the operators (Table 2).

The speech quality measure testbench shows that the small floating-point divider with 6 significant bits and a 6 bit exponent in the unsigned operands is sufficient (Fig. 6) and has a considerably reduced area demand (see Table 4).

Table 2. Properties of the small floating-point number format.

Furthermore, this number format matches the requirements of speech processing systems much better than a fixed-point system with equidistant resolution, since its logarithmic range partitioning has the best resolution at the lower end (near zero) of the representable dynamic range, where speech signals are concentrated. For the same reason, i.e. the probability density distribution of speech signals, the A-law and μ-law characteristics used in companding AD and DA converters are efficient standards for telecommunication systems. A similar approach is introduced in [8] for a neural net implementation for speech recognition purposes, where the net weights could be successfully quantized in a floating-point format of only 1 sign bit, 1 bit mantissa and 3 bit exponent.
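The point about logarithmic range partitioning can be illustrated with the μ-law characteristic mentioned above. The sketch uses the continuous μ-law formula with μ = 255 rather than the segmented table of a real codec, and the comparison of step sizes over 128 uniform code steps is an illustration chosen here, not something taken from the paper.

```python
import math

MU = 255.0  # parameter of the mu-law characteristic


def mu_compress(x):
    """Map x in [-1, 1] to the companded domain [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


if __name__ == "__main__":
    # Uniform steps in the companded domain map to uneven amplitude steps.
    step_near_zero = mu_expand(1 / 128) - mu_expand(0.0)
    step_near_full_scale = mu_expand(1.0) - mu_expand(127 / 128)
    print(f"step near zero:       {step_near_zero:.6f}")
    print(f"step near full scale: {step_near_full_scale:.6f}")
```

The amplitude step near zero is far smaller than the step near full scale, which is the same property the small floating-point format offers for the small values that dominate speech signals.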

Prototype and VHDL implementation. Since the design tool libraries support neither scalable floating-point data types nor the corresponding operators, a custom prototype was developed. Similar to the proposal in [9], floating-point operators have been designed which incorporate fixed-point subunits provided by the synthesis tools.

However, a test and simulation environment that evaluates signal distortions with meaningful coverage by processing large data streams (ETSI test data [5]) is not feasible at the VHDL logic simulation level. Therefore, a C++ class was designed whose operators work identically to the desired hardware version and allow extensive tests of different wordlengths.

Multiplication (eq. 3) and division (eq. 4) use fixed-point library elements for the multiplication/division of the significands and the addition/subtraction of the exponents, respectively [10].

The small floating-point division is enclosed in normalization operations for each operand and for the result, in order to get a leading 1 in the MSBs and to reduce the complexity of data handling. Underflow or overflow during normalization forces signal clipping to zero or full scale. The internal wordlength of the divider is twice the length of the operands to preserve the precision of the operands. Normalization and shrinking to the operand wordlength follow. Adder and subtractor need exponent alignment before the mantissas can be added or subtracted. If the operands differ greatly in magnitude, one of them can vanish during alignment. When subtracting values of similar magnitude, an additional dirty zero problem can occur, i.e. calculation errors grow. In this application, however, a generally sufficient distance between subtrahend and minuend was observed.
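The multiply and divide scheme described above (operate on the significands, add or subtract the exponents, then normalize and clip) can be sketched for unsigned operands as follows. This is a hypothetical model, not the authors' C++ class or VHDL: the tuple layout, the interpretation of the significand as a fraction in [0.5, 1), truncation instead of rounding, and the saturation on division by zero are assumptions. The exponent offset of 32 and the zero and full-scale patterns follow Table 2.

```python
import math

BIAS = 32     # "binary excess 100000"
EXP_MAX = 63  # largest value of the 6 bit exponent field


def pack(sig, exp, width, s):
    """Normalize sig / 2**width * 2**(exp - BIAS) to an s-bit significand."""
    if sig == 0:
        return (0, BIAS)                    # zero pattern
    length = sig.bit_length()
    exp += length - width                   # account for the leading-one position
    sig = sig >> (length - s) if length >= s else sig << (s - length)
    if exp < 0:
        return (0, BIAS)                    # underflow clips to zero
    if exp > EXP_MAX:
        return ((1 << s) - 1, EXP_MAX)      # overflow clips to full scale
    return (sig, exp)


def encode(x, s):
    """Unsigned float -> (significand, exponent), truncating extra bits."""
    if x <= 0.0:
        return (0, BIAS)
    mantissa, exponent = math.frexp(x)      # x = mantissa * 2**exponent
    return pack(int(mantissa * (1 << s)), exponent + BIAS, s, s)


def decode(value, s):
    sig, exp = value
    return math.ldexp(sig, exp - BIAS - s)


def multiply(a, b, s):
    """Eq. (3): multiply the significands, add the exponents."""
    return pack(a[0] * b[0], a[1] + b[1] - BIAS, 2 * s, s)


def divide(a, b, s):
    """Eq. (4): divide the significands, subtract the exponents."""
    if b[0] == 0:
        return ((1 << s) - 1, EXP_MAX)      # assumed: division by zero saturates
    quotient = (a[0] << s) // b[0]          # dividend widened to 2*s bits
    return pack(quotient, a[1] - b[1] + BIAS, s, s)


if __name__ == "__main__":
    s = 6
    x, y = encode(0.3, s), encode(0.7, s)
    print(decode(divide(x, y, s), s), 0.3 / 0.7)
```

Widening the dividend to twice the operand length before the integer division corresponds to the doubled internal wordlength mentioned in the text, and the final `pack` call performs the normalization and shrinking to the operand wordlength.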

                           Divider             Multiplier, Adder, Subtractor
  precision                p = 5               p = 13
  significand              s = 6               s = 14
  exponent                 e = 6               e = 6
  exponent offset          binary excess 100000
  largest error            β/2 · β^-p          β/2 · β^-p
  (machine epsilon) [7]    = 0.03125           = 0.00012207
  max binary value (div)   111111.111111
  min binary value (div)   100000.000000
  binary zero (div)        000000.100000

div wordlength = (int part + frac part), (frac part)   (1)

mul wordlength = (int part a), (frac part a + frac part b)   (2)

$(s_1 \cdot 2^{e_1}) \cdot (s_2 \cdot 2^{e_2}) = (s_1 \cdot s_2) \cdot 2^{(e_1 + e_2)}$   (3)

$(s_1 \cdot 2^{e_1}) / (s_2 \cdot 2^{e_2}) = (s_1 / s_2) \cdot 2^{(e_1 - e_2)}$   (4)



The use of pure behavioral code synthesizable by Synopsys Behavioral Compiler requires some additional work. In short, the Behavioral Compiler analyzes data dependencies and the required operator usage, schedules the design, and builds a controller. The type of the automatically created finite state machine for the controller may be specified; a binary encoding is used in this case. All operators are implemented as combinational nets for easy timing and scheduling and are handled as dedicated multi-cycle (delayed) blocks if necessary. Overloading the operators (+, -, *, /) allows them to be inferred in VHDL and permits a straightforward coding of the algorithm. In addition, a RAM module of the target library was created manually and is handled by wrappers in behavioral code in order to have indexed cell access to the lowpass values via an array data type.

Except for the RAM block, the design is coded completely independently of a target library, because no specific cores of the FPGA technology are instantiated. Thus there is no need for code modifications when the target library changes.

4. Synthesis and simulation results

A prototype of the core design of chip 1 (input interface, gammatone filter bank, halfwave rectification, lowpass filter, and output interface) was implemented on a Xilinx XC4062XL-2 device. A completely mapped FPGA cell netlist is transferred to the Xilinx place&route tools. When the temporary values are stored in an external RAM, 2186 logic cells are allocated. The FPGA utilization is about 40% (Table 3). The timing constraints imposed by the sampling rate of the whole system are met even though the RAM access limits the clock to 32 MHz.

Table 3. Allocated resources of a Xilinx XC4062XL-2 device for the chip 1 design.

After compiling and mapping the chip 2 design to the FPGA look-up-table cell level (not mapped to FPGA gates), an EDIF netlist is transferred to the vendor-specific place&route tool. There, the design is mapped to physical cells and connected. Table 4 presents the allocated hardware resources and timing analysis results when targeting an Altera Flex10K100A-1 device.

Table 4. Allocated resources of an Altera Flex10K100A-1 device for the chip 2 design.

The state vector of the controller has eight bits encoding 142 states. Timing analysis shows that the most critical path is part of this controller, which reduces the maximum clock frequency. Since 50 MHz could not be reached for a common clock, one of the two FPGA clock networks drives the kernel with 24 MHz while the other is used for the interface parts. Because very few I/O pins are used by the design, pin locking causes no routing problems.
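Two of the figures above can be related by simple arithmetic: the width of a binary-encoded state vector for 142 states, and the number of 24 MHz clock periods a combinational divider occupies when it is scheduled as a multi-cycle block. The divider delays are the values listed in Table 4. Treating the cycle count as the delay rounded up to whole clock periods is an assumption that ignores register setup and routing margins.

```python
STATES = 142
KERNEL_CLOCK_MHZ = 24
DIVIDER_DELAY_NS = {
    "small float divider (6 bit mantissa, 6 bit exponent)": 205,
    "fixed-point divider (34 bit)": 1527,
}

# Bits needed to give each of the states its own binary code.
state_bits = (STATES - 1).bit_length()  # 8


def cycles_needed(delay_ns, clock_mhz):
    """Smallest whole number of clock periods covering a combinational delay."""
    return -((-delay_ns * clock_mhz) // 1000)


if __name__ == "__main__":
    print("state vector bits:", state_bits)
    for name, delay in DIVIDER_DELAY_NS.items():
        print(name, "->", cycles_needed(delay, KERNEL_CLOCK_MHZ), "cycles")
```

Under this assumption the small floating-point divider fits into 5 kernel clock periods, whereas the 34 bit fixed-point divider would need 37.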

Simulation in the testbench was performed extensively at the prototype level (C++) with large sample data streams. The enormous simulation times at the VHDL logic level allow only single values or short data streams to be evaluated.

Using the perception model as a testbench, the following results were obtained for the different versions of the chip 2 arithmetic (Fig. 6). Diagram (a) shows that the model works correctly and that the objective speech quality measure is well correlated with the subjective MOS (indicated by the linear correlation coefficient r). Diagram (c) shows nearly no losses due to fixed-point quantization errors when the resolution is 4 integer and 30 fraction bits. In (d), enormous losses in the data correlation appear after reducing the wordlength to 4 integer and 24 fraction bits. The small floating-point implementation (b) works well with an operand width of 6 bits mantissa for division, 14 bits mantissa for all other operations, and 6 bits exponent for all operations.

Real-time experiments become possible with the completion of the demonstrator board; after installing it on the DSP card, a powerful signal processing system with a reconfigurable coprocessor is available.

Table 3 (values):

  interfaces                                 273 logic cells = 5 % LC usage
  kernel                                     1913 logic cells = 35 % LC usage
  memory                                     external RAM
  max clock frequency (external RAM access)  32 MHz

Table 4 (values):

  interfaces                                 195 logic cells = 5 % LC usage
  kernel, scaling unit and lowpass           2983 logic cells = 59 % LC usage
  memory (in Flex10K EAB blocks)             3600 bits = 14 % EAB usage
  max clock frequency (kernel)               24 MHz
  (timing constraints violation)
  small float divider (6 bit mantissa,       94 logic cells, 205 ns delay
  6 bit exponent operand width)
  fixed-point divider (34 bit)               1186 logic cells, 1527 ns delay



5. Conclusion

In this paper we present our work on the digital VLSI implementation of a speech perception model. The hardware design of the algorithm was derived from a version of the model recoded in C/C++ using special classes for fixed-point and small floating-point quantization. An application of the model (speech quality measurement) is used to determine optimized wordlengths for dedicated hardware. The development of the perception model as an FPGA/ASIC for a target system, e.g. a PC card, provides efficient co-processing power and allows real-time implementations of complex auditory-based speech processing algorithms.

References

[1] Dau, T., Püschel, D. and Kohlrausch, A.: A quantitative model of the effective signal processing in the auditory system I. Journal of the Acoustical Society of America (JASA) 99 (6): 3631-3633, 1996.

[2] Tchorz, T., Wesselkamp, M. and Kollmeier, B.: Gehörgerechte Merkmalsextraktion zur robusten Spracherkennung in Störgeräuschen. Fortschritte der Akustik - DAGA 96: 532-533, DEGA, Oldenburg, Germany, 1996.

[3] Hansen, M. and Kollmeier, B.: Using a quantitative psychoacoustical signal representation for objective speech quality measurement. In: Proc. ICASSP-97, Intl. Conf. on Acoustics, Speech and Signal Proc.: 1387, Munich, Germany, 1997.

[4] Hansen, M.: Assessment and prediction of speech transmission quality with an auditory processing model. Dissertation, Oldenburg, Germany, 1998.

[5] ETSI, TM/TM5/TCH-HS.: Selection Test Phase II: Listening test results with German speech samples. Technical Report 92/35, FI/DBP-Telekom. Experiment 1, IM4, 1992.

[6] Vary, P., Heute, U., Hess, W.: Digitale Sprachsignalverarbeitung. Teubner, Stuttgart, Germany, 1998.

[7] Goldberg, D.: What Every Computer Scientist Should Know About Floating-Point Arithmetic. Computing Surveys, March 1991.

[8] Wüst, H., Kasper, K., Reininger, H.: Hybrid Number Representation for the FPGA-Realization of a Versatile Neuro-Processor. Proc. EUROMICRO 98, 694-701, Västerås, Sweden, 1998.

[9] Shirazi, N., Walters, A., Athanas, P.: Quantitative Analysis of Floating Point Arithmetic on FPGA Based Custom Computing Machines. Technical Report, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, 1995.

[10] Hennessy, J. L., Patterson, D. A.: Computer Architecture - A Quantitative Approach. Morgan Kaufmann Publishers, Inc., San Francisco, California, 1996.

[Figure 6: four scatter plots of subjective MOS (1 to 4.5) against the objective measure q (0.75 to 1). (a) IEEE float single precision, r = 0.935. (b) Small floating point, Add, Sub, Mul: 14(M) 6(E), Div: 6(M) 6(E), r = 0.928. (c) Fixed point, 4 int bits / 30 frac bits, r = 0.927. (d) Fixed point, 4 int bits / 24 frac bits, r = 0.63.]

Figure 6. Results for a complete objective speech quality measurement with the ETSI half-rate selection test data [4][5].
