signal processing in fpga.pdf
7/27/2019 signal processing in fpga.pdf
Implementing a Quantitative Model for the Effective Signal Processing in the Auditory System on a Dedicated Digital VLSI Hardware
A. Schwarz, B. Mertsching: University of Hamburg, Computer Science Department, IMA Group, D-22527 Hamburg, Germany
M. Brucke, W. Nebel: University of Oldenburg, Computer Science Department, VLSI Group, D-26111 Oldenburg, Germany
J. Tschorz, B. Kollmeier: University of Oldenburg, Physics Science Department, Medical Physics Group, D-26111 Oldenburg, Germany
1. Introduction
The binaural perception model introduced in [1] describes the effective signal processing in the human auditory system and provides an appropriate internal representation of acoustic signals. Its capabilities have been demonstrated as a preprocessing algorithm for speech recognition [2], objective speech quality measurement [3], and digital hearing aids. The algorithm processes stereo signals and includes a gammatone filter bank (30 bandpass filters equidistantly distributed on the ERB scale from 73 to 6700 Hz) to model spectral properties of the human ear such as spectral masking and the frequency-dependent bandwidth of auditory filters.
[Figure 1 block diagram: stereo input, gammatone filterbank, envelope extraction (halfwave rectification and 1 kHz lowpass filtering), absolute threshold, adaptation loops with time constants t1 to t5, 8 Hz lowpass filtering.]
Figure 1. Processing scheme of the binaural perception model introduced in [1].
Abstract
A digital VLSI implementation of an algorithm modeling the effective signal processing of the human auditory system is presented. The model consists of several stages that are psychoacoustically and physiologically motivated by the signal processing in the human ear, and it has been successfully applied to various speech processing applications. The processing scheme was partitioned for implementation in a set of three chips. Because the signal dynamics and the necessary arithmetic precision differ locally, different approaches to number representation and the corresponding arithmetic operators were investigated and implemented. It is demonstrated how an application of the model has been used to determine the wordlengths necessary to transfer the algorithm into a version suitable for hardware implementation. Fixed point arithmetic is used in the linear parts of the original algorithm, and a special small floating point operator set was developed for the nonlinear part. This part was coded in behavioral VHDL and synthesized with Synopsys Behavioral Compiler. The hardware algorithm is being evaluated on different implementation levels for an FPGA and will be manufactured as ASICs in a later version. The presented FPGA chip set will be combined with a commercial DSP system (TMS320C6201) for real-time and reconfigurable signal processing.
A stage modeling inner hair cell behavior (envelope extraction) is followed by five adaptation loops (with time constants between 5 and 500 ms) that account for dynamic effects such as nonlinear adaptive compression and temporal masking (see Fig. 1). The VLSI design presented here contains additional components that determine differences in phase and magnitude for each channel (Fig. 2).
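The channel layout of the filter bank (30 filters equidistant on the ERB scale from 73 to 6700 Hz) can be illustrated with a short sketch. The paper does not state which ERB formula was used; the Glasberg and Moore ERB-rate mapping below is an assumption, and the function names are hypothetical.

```python
import math

def hz_to_erb_rate(f_hz):
    """Map frequency in Hz to ERB rate (Glasberg and Moore formula, assumed here)."""
    return 21.4 * math.log10(4.37e-3 * f_hz + 1.0)

def erb_rate_to_hz(erb):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb / 21.4) - 1.0) / 4.37e-3

def center_frequencies(f_low=73.0, f_high=6700.0, channels=30):
    """Return center frequencies equally spaced on the ERB-rate scale."""
    lo, hi = hz_to_erb_rate(f_low), hz_to_erb_rate(f_high)
    step = (hi - lo) / (channels - 1)
    return [erb_rate_to_hz(lo + k * step) for k in range(channels)]

fc = center_frequencies()
for channel, f in enumerate(fc, start=1):
    print("channel %2d: %7.1f Hz" % (channel, f))
```

Under this assumed formula the 30 channels end up roughly one ERB apart, which is consistent with a filter bank that covers the range without large gaps.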
2. Hardware design specifications
Due to its complexity, the design was partitioned into three chips (Fig. 2). Besides the serial data interfaces, chip 1 contains the 30-channel binaural filter bank, the envelope extraction, and a module that computes phase differences and magnitude quotients for the stereo outputs of each channel. A single bandpass filter is multiplexed through all 30 channels for both the left and the right stereo signal. This kernel includes a six-stage pipelined multiplier and one adder/subtractor and is realized as a fourfold cascade of a single-stage complex IIR filter. This saves chip area but requires a 50 MHz system clock to operate at a sampling frequency of 16276 Hz. RAM units store temporary data, and filter constants are read from a ROM (see Fig. 2).

Three system outputs are available for further processing. A high-speed interface (30 MBit/s) provides the real and imaginary parts of the right and left stereo data for all filter bank channels (chip 1). The adaptively compressed data of the left (1st chip 2) and right (2nd chip 2) stereo signals and the phase and amplitude information of the filter bank outputs are combined in a second data stream (12 MBit/s). The 4-wire serial interfaces of the chip set (16 bit data words) connect directly to the serial ports of most DSP devices. Combined with a DSP device able to serve the fast serial ports (TI TMS320C6201), the chip set provides a system solution for auditory signal processing.
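The relation between the 50 MHz system clock and the 16276 Hz sampling frequency can be made concrete with a small cycle budget. This is only back-of-the-envelope arithmetic on the figures quoted above; how the kernel actually spends these cycles is not stated in the paper, and the variable names are illustrative.

```python
F_CLK_HZ = 50e6        # system clock quoted in the text
F_SAMPLE_HZ = 16276.0  # sampling frequency quoted in the text
CHANNELS = 30          # filter bank channels
EARS = 2               # left and right stereo signal

# Clock cycles available between two input samples.
cycles_per_sample = F_CLK_HZ / F_SAMPLE_HZ

# Share of that budget for one multiplexed pass (one channel, one ear).
cycles_per_filter = cycles_per_sample / (CHANNELS * EARS)

print("cycles per input sample : %.1f" % cycles_per_sample)
print("cycles per channel/ear  : %.1f" % cycles_per_filter)
```

With a single multiplexed bandpass kernel, each channel of each ear therefore has on the order of fifty clock cycles per sample for its fourfold complex filter cascade.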
[Figure 2 block diagram. ASIC 1: input interface (SerialDataIn, SerialDataSync, input from DSP/codec, serial/parallel converter), gammatone filter bank (controller, multiplexers, multiplier, adder/subtractor, operand and temporary registers, ROM for constants, RAM for state memory), halfwave rectification, 1st order 1 kHz IIR lowpass, decimation and lowpass, phase difference and magnitude quotient modules, high-speed output interface (30 Mbit/s) and low-speed output interface (12 Mbit/s); 50 MHz clock and reset. ASIC 2 (1st for the left, 2nd for the right signal): input interface, five multiplexed ("rolled") adaptation loops (divider, 1st order lowpass, controller, ROM, RAM), scale and lowpass unit, output interface with parallel/serial converter (12 Mbit/s, combined output to DSP); 50 MHz clock, 24 MHz clock and reset. Control signals: Sync, Panic, Valid, Init, Busy.]
Figure 2. Structure and wiring scheme of the internal components of the chip set.
Each of the five adaptation loops contains a divider whose quotient is fed back through a 1st order IIR lowpass that provides the divisor. This feedback, the necessary precision, and the signal dynamics require either large fixed point wordlengths or a logarithmic number format. The dividers are very area-expensive and are therefore the most critical components in the design. Fourfold subsampling and data serialization in chip 1 allow a multiplexed loop kernel, which is implemented monaurally in two chips (two instances of chip 2). The loop kernel contains RAM cells that store the states of all lowpass filters for the 30 serially processed frequency channels.
3. Floating point to fix point to floating point -
arithmetic suitable for auditory signal
processing
A direct implementation of the model in IEEE 32 bit single precision floating point arithmetic is not possible due to area and timing limitations. To obtain an optimal implementation, different methods are applied to the linear filter bank and to the nonlinear adaptation loops.
The main problem in converting number formats and designing dedicated arithmetic is determining the required numerical precision. Because the necessary quantization depends on the application and on typical signal dynamics, the perception model was recoded in C++ using new classes of scalable data types with the necessary operators. Each class takes the internal wordlength as a parameter and stores values in exactly the format in which they would be stored in a register on an ASIC. Thus the numerical effects of imprecise arithmetic can be simulated in target applications.
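The idea of a data type that is parameterized by its wordlength and stores values exactly as a register would can be sketched as follows. The authors' classes were written in C++ and are not reproduced in the paper; this Python class, its name `Fixed`, and its truncation and saturation behavior are all assumptions made for illustration.

```python
import math

class Fixed:
    """Signed fixed point value held as an integer 'register' (illustrative sketch)."""

    def __init__(self, value, int_bits, frac_bits):
        self.int_bits, self.frac_bits = int_bits, frac_bits
        # Truncate to the fraction grid, then saturate to the integer range (assumed policy).
        self.raw = self._saturate(math.floor(value * (1 << frac_bits)))

    def _saturate(self, raw):
        limit = 1 << (self.int_bits + self.frac_bits)
        return max(-limit, min(limit - 1, raw))

    def _from_raw(self, raw):
        out = Fixed(0, self.int_bits, self.frac_bits)
        out.raw = self._saturate(raw)
        return out

    # All operators assume both operands share the same format.
    def __add__(self, other):
        return self._from_raw(self.raw + other.raw)

    def __sub__(self, other):
        return self._from_raw(self.raw - other.raw)

    def __mul__(self, other):
        # The double-width product is shifted back to the operand format.
        return self._from_raw((self.raw * other.raw) >> self.frac_bits)

    def __truediv__(self, other):
        # The dividend is prescaled so the quotient keeps frac_bits fraction bits.
        return self._from_raw((self.raw << self.frac_bits) // other.raw)

    def __float__(self):
        return self.raw / (1 << self.frac_bits)

a = Fixed(0.75, int_bits=4, frac_bits=15)
b = Fixed(0.5, int_bits=4, frac_bits=15)
print(float(a * b), float(a / b), float(Fixed(0.1, 4, 15)))
```

Because the operators are overloaded, an algorithm written against ordinary arithmetic can be rerun with different `int_bits` and `frac_bits` values to observe the effect of quantization.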
The kernel arithmetic of the gammatone filter bank was designed and successfully evaluated in fixed point notation. For the nonlinear adaptation loops, a scalable fixed point version was evaluated first; because of its high area consumption, especially for the dividers, a small floating point class was then introduced and successfully tested.
3.1. Arithmetic transformation for linear gammatone filters
Principle. The necessary internal wordlength for the gammatone filter bank can be assessed in a straightforward way, because the filters are linear time-invariant systems to which classical numerical measures such as SNR can be applied. It is sufficient to record the responses to δ-pulses for each filter, parameterized with different internal wordlengths. Figure 3 shows the mean square error (relative error, i.e. noise-to-signal ratio) between each of these implementations and the original specification in IEEE single precision floating point arithmetic. Choosing a maximum square error (e.g. 10^-3 for all channels) leads directly to the necessary internal wordlength. With an allowed error of 0.001, a minimum wordlength of 24 bits is necessary for the lowest filter bank channel (Fig. 3).
Figure 3. Error introduced by fixed point quantization in the gammatone filter bank.
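The impulse response comparison described above can be sketched for a single channel. Everything specific here is an assumption: the filter is modeled as a fourfold cascade of the complex one-pole recursion y[n] = a*y[n-1] + x[n], the bandwidth value is a placeholder, states and the coefficient are rounded to a grid of fraction bits without any input scaling, and the reference runs in Python double precision rather than IEEE single precision. The numbers will therefore not reproduce Figure 3; only the trend is of interest.

```python
import cmath
import math

def quantize(z, frac_bits):
    """Round real and imaginary parts to multiples of 2**-frac_bits."""
    scale = float(1 << frac_bits)
    return complex(round(z.real * scale) / scale, round(z.imag * scale) / scale)

def impulse_response(fc_hz, bw_hz, fs_hz, n_samples, frac_bits=None):
    """Impulse response of four cascaded complex one-pole filters."""
    q = (lambda z: z) if frac_bits is None else (lambda z: quantize(z, frac_bits))
    # Complex pole: the radius sets the bandwidth, the angle sets the center frequency.
    a = q(cmath.exp(complex(-2 * math.pi * bw_hz, 2 * math.pi * fc_hz) / fs_hz))
    state = [0j] * 4
    out = []
    for n in range(n_samples):
        x = 1.0 + 0j if n == 0 else 0j
        for k in range(4):
            state[k] = q(a * state[k] + x)  # one first-order stage
            x = state[k]                    # feeds the next stage
        out.append(x)
    return out

def noise_to_signal(fc_hz, bw_hz, fs_hz, frac_bits, n_samples=4000):
    """Relative mean square error of the quantized response against the reference."""
    ref = impulse_response(fc_hz, bw_hz, fs_hz, n_samples)
    test = impulse_response(fc_hz, bw_hz, fs_hz, n_samples, frac_bits)
    noise = sum(abs(t - r) ** 2 for t, r in zip(test, ref))
    signal = sum(abs(r) ** 2 for r in ref)
    return noise / signal

# Lowest channel (73 Hz); the 30 Hz bandwidth is a placeholder value.
for bits in (12, 16, 20, 24):
    print("%2d fraction bits: NSR = %.3e" % (bits, noise_to_signal(73.0, 30.0, 16276.0, bits)))
```

Picking the smallest wordlength whose error stays below a chosen bound, for every channel, is then a simple search over this function.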
Numerical operations. The filter algorithm consists of a fourfold first-order filter that requires only additions and multiplications by constants.
Number formats. Because the analysis bandwidth increases with center frequency, the error for a given wordlength decreases with increasing center frequency, i.e. with increasing channel number. All channels use the same operator structure, so a general number format of 24 bits fixed point is required.
3.2. Arithmetic transformation for nonlinear
adaptation loops
Principle. Determining an optimal quantization in the adaptation loops is much more difficult because they behave in a strongly nonlinear way.
It was demonstrated in [3] that the perception model can supply an objective speech quality measure q. Speech signals distorted by low-bit-rate codecs of the kind used in mobile telephone devices are compared to their undistorted versions, yielding a quality measure q that correlates with the subjective Mean Opinion Score (MOS) of the test signals. Because this testbench is very sensitive to limited number precision and signal dynamics in the perception model, it can be used to evaluate modifications caused by limited quantization and arithmetic (Fig. 4). An optimized quantization of the nonlinear adaptation loops (small wordlengths, i.e. small chip area, versus reliable signal processing) was found by empirical wordlength variation. The results were verified by processing two different large sets of speech signals with input signal levels varied from -10 to 50 dB.
[Figure 3 plot: mean square error (logarithmic axis, 1e-08 to 1e+00) versus number of filter bank channel (5 to 30), one curve per internal wordlength: 16, 18, 20, 22, 24, 26, 28, and 30 bit.]
Data analysis. Histograms were recorded at internal nodes to investigate the signal levels during the processing of typical speech (ETSI test data [4][5]) and noise input signals (Fig. 5).
Figure 5. Histograms of output and divisor in the adaptation loops for typical speech signals.
The divisors of the loops have individual thresholds; these lower bounds are introduced to reduce unwanted peaks. The dynamic range is clearly limited: only positive values occur in the loops, the divisors never exceed 1.0, and the loop outputs are concentrated near zero. This is to be expected, since small amplitudes are very frequent in typical speech signals according to their probability density distribution [6].
Numerical operations. The loops and the subsequent scaling and lowpass unit of the original C code use all basic arithmetic operators (Table 1). The current quotients qi[n] in the loops are calculated from the local lowpass filter outputs bi[n-1] of the previous cycle. The current lowpass output is derived from its previous output bi[n-1] and the new quotient qi[n]. The output of the last loop, q5[n], is shifted and scaled to s[n] in the scaling unit, and after the final lowpass filter the result o[n] is passed to the output interface. All Cx(i) are constants.
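The divide-and-lowpass recurrences described above (and listed in Table 1) can be sketched in floating point as follows. Several details are assumptions not given in the paper: the individual time constants (the text only states a range of 5 to 500 ms), the derivation of the lowpass constants from them, a single lower bound for all divisors, and the initial states. The scaling unit and the final lowpass are omitted.

```python
import math

TIME_CONSTANTS_S = (0.005, 0.050, 0.129, 0.253, 0.500)  # assumed values within 5..500 ms
MIN_DIVISOR = 1e-5                                       # assumed lower bound of the divisors

def adaptation_loops(x, fs_hz, taus=TIME_CONSTANTS_S):
    """Pass the non-negative samples in x through the chain of adaptation loops."""
    c2 = [math.exp(-1.0 / (tau * fs_hz)) for tau in taus]  # feedback constants (assumed form)
    c1 = [1.0 - c for c in c2]                               # input constants, unity DC gain
    b = [1.0] * len(taus)                                    # lowpass states b_i[n-1], assumed start
    out = []
    for sample in x:
        q = sample
        for i in range(len(taus)):
            q = q / max(b[i], MIN_DIVISOR)   # q_i[n] = q_(i-1)[n] / b_i[n-1]
            b[i] = c1[i] * q + c2[i] * b[i]  # b_i[n] = C1*q_i[n] + C2*b_i[n-1]
        out.append(q)
    return out

# Constant input at the subsampled rate (16276 Hz / 4); the value 0.5 is arbitrary.
y = adaptation_loops([0.5] * 20000, fs_hz=4069.0)
print("first output %.4f, settled output %.4f" % (y[0], y[-1]))
```

For a stationary input each loop settles where its quotient equals its divisor, i.e. it takes a square root, so five loops in series approach the 32nd root of the input. This is the strong compression of stationary signals that makes the dynamic range at the loop outputs so limited.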
Table 1. Operations in the adaptation loops; i is the loop number and n is the sample number.
division in loop i, i = [0, 1, 2, 3, 4]: q0[n] = x[n] / b0[n-1] (1st loop); qi[n] = qi-1[n] / bi[n-1] (others)
lowpass in loop i: bi[n] = C1i*qi[n] + C2i*bi[n-1]
scaling unit: s[n] = (q5[n] - C3) * C4
completing lowpass: o[n] = C5*s[n] + C6*o[n-1]

A useful simplification for the hardware specification is that all values remain positive up to the output of the last loop. Only the scaling unit introduces a sign bit, which propagates to the output.

[Figure 4 block diagram: the original signal and the codec-distorted signal each pass through the perception model (IEEE 32 bit floating, fixed or small floating point arithmetic) and a frequency weighting; their cross-correlation yields q, which is compared and correlated with subjective MOS data.]
Figure 4. Speech quality measurement used as a testbench for changes in the kernel arithmetic of the adaptation loops in the perception model.

[Figure 5 plots: histograms with a logarithmic frequency axis (10^0 to 10^10) of the output values (0 to 30) of loop0 to loop4, and histograms (10^0 to 10^8) of the divisor values (0 to 1.0) of divisor0 to divisor4.]

Number formats. Considering the necessary precision of the kernel arithmetic and the arithmetic cores available in the synthesis tool libraries (Synopsys DesignWare), two approaches are possible. Simulations with the integer prototype show that, with the available fixed point operators, a number format of 4 integer bits (int part) and 15 fraction bits (frac part) is sufficient, and all constants Cx(i) have to be quantized to 19 fraction bits.
When these fixed point numbers are divided or multiplied, the internal wordlengths must be larger to hold all possible digits: 34 bits for the divider (eq. 1) and 38 bits for the multiplier (eq. 2). The dividend has to be prescaled (shifted) because the integer part of the quotient can grow by the number of fraction bits of the divisor (complementary to the multiplier). The product wordlength is the sum of the wordlengths of the operands a and b. Operand b (the filter constants) has only a fraction part (frac part b). In addition, a 20 bit fixed point adder and subtractor are necessary. The most expensive operator is the 34 bit divider, whose area demand is unacceptably large and which appears to be near the limit of what the design tools can handle.
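The wordlengths quoted above follow from eq. 1 and eq. 2 with the 4.15 operand format and the 19 fraction bits of the constants. The short sketch below redoes that arithmetic and shows the prescaled integer division. Reading the 20 bit adder as the operand width plus one carry bit is an assumption, and the example operands are arbitrary.

```python
INT_BITS = 4           # integer part of the operands
FRAC_BITS = 15         # fraction part of the operands
CONST_FRAC_BITS = 19   # fraction part of the constants (operand b)

operand_bits = INT_BITS + FRAC_BITS                         # 19
divider_bits = (INT_BITS + FRAC_BITS) + FRAC_BITS           # eq. 1: 34
multiplier_bits = INT_BITS + (FRAC_BITS + CONST_FRAC_BITS)  # eq. 2: 38
adder_bits = operand_bits + 1                               # 20, assuming one carry bit

# Prescaled division: shift the dividend left by FRAC_BITS before dividing,
# so the integer quotient is again in the 4.15 format.
dividend_raw = int(3.5 * (1 << FRAC_BITS))
divisor_raw = int(0.25 * (1 << FRAC_BITS))
quotient_raw = (dividend_raw << FRAC_BITS) // divisor_raw
quotient = quotient_raw / (1 << FRAC_BITS)

# The widest possible prescaled dividend determines the divider wordlength.
widest_dividend = ((1 << operand_bits) - 1) << FRAC_BITS

print(divider_bits, multiplier_bits, adder_bits, quotient, widest_dividend.bit_length())
```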
A floating point number format was therefore introduced for the adaptation loops to reduce the area requirements and the long signal propagation delays through the combinational nets of the operators (Table 2).
The speech quality measure testbench shows that a small floating point divider with unsigned operands of 6 significant bits and a 6 bit exponent is sufficient (Fig. 6), and its area demand is reduced considerably (see Table 4).
Table 2. Properties of the small floating point number format.
Furthermore, this number format matches the requirements of speech processing systems much better than a fixed point system with equidistant resolution, since its logarithmic range partitioning has the best resolution at the lower end (near zero) of the representable dynamic range, where speech signals are concentrated. For the same reason, i.e. the probability density distribution of speech signals, the A-law and μ-law characteristics of companding AD and DA converters are efficient standards for telecommunication systems. A similar approach is introduced in [8] for a neural net implementation for speech recognition, where the net weights could be successfully quantized in a floating point format of only 1 sign bit, 1 bit mantissa, and 3 bit exponent.
Prototype and VHDL implementation. Since the design tool libraries do not support scalable floating point data types and operators, a custom prototype was developed. Similar to the approach proposed in [9], floating point operators were designed that incorporate fixed point subunits provided by the synthesis tools.
However, a test and simulation environment that can evaluate signal distortions with meaningful coverage by processing large data streams (ETSI test data [5]) is not feasible at the VHDL logic simulation level. Therefore, a C++ class was designed whose operators work identically to the intended hardware version and allow extensive tests of different wordlengths.
Multiplication (eq. 3) and division (eq. 4) use fixed point library elements for the multiplication/division of the significands and the addition/subtraction of the exponents, respectively [10].
The small floating point division is enclosed in normalization operations for each operand and for the result, in order to obtain a leading 1 in the MSB and to reduce the complexity of data handling. Underflow or overflow during normalization forces signal clipping to zero or full scale. The internal wordlength of the divider is twice the length of the operands to preserve their precision. Normalization and shrinking to the operand wordlength follow. The adder and subtractor need exponent alignment before the mantissas can be summed or subtracted. If the operands differ greatly in magnitude, one of them can vanish during alignment. When subtracting values of similar magnitude, an additional "dirty zero" problem can occur, i.e. calculation errors grow. In this design, however, a generally sufficient distance between subtrahend and minuend was observed.
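The normalize, operate, normalize scheme described above can be sketched for unsigned operands. This is not the authors' C++ class or VHDL code: the encoding as a pair (significand, exponent) with value significand * 2**exponent, the exponent range taken from a 6 bit excess-32 code, truncation when shrinking, and all function names are assumptions for illustration.

```python
import math

def normalize(sig, exp, s_bits, e_bits=6):
    """Shift sig to exactly s_bits (leading 1 in the MSB); clip on range errors."""
    if sig == 0:
        return 0, 0
    shift = sig.bit_length() - s_bits
    sig = sig >> shift if shift > 0 else sig << -shift  # truncating shrink or left align
    exp += shift
    e_min, e_max = -(1 << (e_bits - 1)), (1 << (e_bits - 1)) - 1
    if exp < e_min:
        return 0, 0                          # underflow clips to zero
    if exp > e_max:
        return (1 << s_bits) - 1, e_max      # overflow clips to full scale
    return sig, exp

def from_float(x, s_bits, e_bits=6):
    """Encode a non-negative float as (sig, exp)."""
    if x <= 0.0:
        return 0, 0
    m, e = math.frexp(x)                     # x = m * 2**e with 0.5 <= m < 1
    return normalize(int(m * (1 << 53)), e - 53, s_bits, e_bits)

def to_float(value):
    sig, exp = value
    return sig * 2.0 ** exp

def multiply(a, b, s_bits, e_bits=6):
    """Multiply significands, add exponents (eq. 3), then normalize."""
    return normalize(a[0] * b[0], a[1] + b[1], s_bits, e_bits)

def divide(a, b, s_bits, e_bits=6):
    """Divide significands, subtract exponents (eq. 4), then normalize."""
    if b[0] == 0:
        return (1 << s_bits) - 1, (1 << (e_bits - 1)) - 1  # full scale (assumed policy)
    wide = (a[0] << s_bits) // b[0]          # dividend widened to twice the operand length
    return normalize(wide, a[1] - b[1] - s_bits, s_bits, e_bits)

a6, b6 = from_float(0.8, 6), from_float(0.3, 6)
print(to_float(a6), to_float(b6), to_float(divide(a6, b6, 6)), to_float(multiply(a6, b6, 6)))
```

With a 6 bit significand and truncation, each result stays within a relative error of 2^-5 of the exact result for the already quantized operands, which matches the largest error listed for the divider in Table 2.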
Table 2 data (small floating point number format):
Divider (precision p = 5): significand s = 6, exponent e = 6
Multiplier, adder, subtractor (precision p = 13): significand s = 14, exponent e = 6
Exponent coding: binary excess 100000
Largest error (machine epsilon [7]) $\varepsilon = (\beta/2)\,\beta^{-p}$ with $\beta = 2$: 0.03125 (div); 0.00012207 (mul, add, sub)
Max binary value (div): 111111.111111
Min binary value (div): 100000.000000
Binary zero (div): 000000.100000
div wordlength = (int part + frac part), (frac part)   (1)
mul wordlength = (int part a), (frac part a + frac part b)   (2)
$(s_1 \cdot 2^{e_1}) \cdot (s_2 \cdot 2^{e_2}) = (s_1 \cdot s_2) \cdot 2^{(e_1 + e_2)}$   (3)
$(s_1 \cdot 2^{e_1}) / (s_2 \cdot 2^{e_2}) = (s_1 / s_2) \cdot 2^{(e_1 - e_2)}$   (4)
The use of purely behavioral code synthesizable by Synopsys Behavioral Compiler requires some additional work. In brief, the Behavioral Compiler analyzes data dependencies and the required operator usage, schedules the design, and builds a controller. The type of the automatically created finite state machine for the controller can be specified; binary encoding is used in this case. All operators are implemented as combinational nets for easy timing and scheduling and are handled as dedicated multi-cycle (delayed) blocks if necessary. Overloading the operators (+, -, *, /) allows them to be inferred in VHDL and permits straightforward coding of the algorithm. In addition, a RAM module of the target library was created manually and is handled by wrappers in the behavioral code, so that the lowpass values can be accessed as indexed cells via an array data type.
Except for the RAM block, the design is coded completely independently of a target library, because no specific cores of the FPGA technology are instantiated. Thus no code modifications are needed when the target library changes.
4. Synthesis and simulation results
A prototype of the core design of chip 1 (input interface, gammatone filter bank, halfwave rectification, lowpass filter, and output interface) was implemented on a Xilinx XC4062XL-2 device. A completely mapped FPGA cell netlist is transferred to the Xilinx place&route tools. When the temporary values are stored in an external RAM, 2186 logic cells are allocated. The FPGA utilization is about 40% (Table 3). The timing constraints imposed by the sampling rate of the whole system are met even though the RAM access limits the clock to 32 MHz.
Table 3. Allocated resources of a Xilinx XC4062XL-2 device for the chip 1 design.
After compilation and mapping of the chip 2 design to the FPGA look-up-table cell level (not to FPGA gates), an EDIF netlist is transferred to the vendor-specific place&route tool, where the design is mapped to physical cells and connected. Table 4 presents the allocated hardware resources and timing analysis results for an Altera Flex10K100A-1 target device.
Table 4. Allocated resources of an Altera Flex10K100A-1 device for the chip 2 design.
The state vector of the controller has eight bits and encodes 142 states. Timing analysis shows that the most critical path is part of this controller, which reduces the maximum clock frequency. Since 50 MHz could not be reached for a common clock, one of the two FPGA clock networks drives the kernel at 24 MHz while the other is used for the interface parts. Because the design uses very few I/O pins, pin locking causes no routing problems.
Simulation in the testbench was performed extensively at the prototype level (C++) with large sample data streams. The enormous simulation times at the VHDL logic level allow only the evaluation of single values or short data streams.
The following results for versions of the chip 2 arithmetic were calculated using the perception model as a testbench (Fig. 6). Diagram (a) shows that the model works correctly and that the objective speech quality measure is well correlated with the subjective MOS (indicated by the linear correlation coefficient r). Diagram (c) shows almost no losses due to fixed point quantization errors when the resolution is 4 integer and 30 fraction bits. In (d), enormous losses in the data correlation appear after the wordlength is reduced to 4 integer and 24 fraction bits. The small floating point implementation works well with an operand width of 6 bits mantissa for division, 14 bits mantissa for all other operations, and 6 bits exponent for all operations.
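The linear correlation coefficient r used to compare the arithmetic variants is the ordinary Pearson coefficient between the objective measure q and the subjective MOS. A minimal sketch follows; the sample values are made-up placeholders that only show the call and are not data from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical (q, MOS) pairs, for illustration only.
q_values = [0.80, 0.85, 0.90, 0.95]
mos_values = [1.5, 2.4, 3.1, 4.2]
print("r = %.3f" % pearson_r(q_values, mos_values))
```

Running the same test material through each arithmetic variant and comparing the resulting r against the floating point reference gives a single figure of merit per wordlength.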
Real time experiments become possible with the com-
pletion of the demonstrator board and, after installing it on
the DSP card, a powerful signal processing system with a
reconfigurable coprocessor is available.
Table 3 data (chip 1 design, Xilinx XC4062XL-2):
interfaces: 273 logic cells = 5 % LC usage
kernel: 1913 logic cells = 35 % LC usage
memory: external RAM
max clock frequency (external RAM access): 32 MHz
Table 4 data (chip 2 design, Altera Flex10K100A-1):
interfaces: 195 logic cells = 5 % LC usage
kernel, scaling unit and lowpass: 2983 logic cells = 59 % LC usage
memory (in Flex10K EAB blocks): 3600 bits = 14 % EAB usage
max clock frequency (kernel) (timing constraints violation): 24 MHz
small float divider (6 bit mantissa, 6 bit exponent operand width): 94 logic cells, 205 ns delay
fixed point divider (34 bit): 1186 logic cells, 1527 ns delay
5. Conclusion
In this paper we presented our work on the digital VLSI implementation of a speech perception model. The hardware design of the algorithm was derived from a version of the model recoded in C/C++ using special classes for fixed point and small floating point quantization. An application of the model (speech quality measurement) is used to determine optimized wordlengths for dedicated hardware. Developing the perception model as an FPGA/ASIC for a target system, e.g. a PC card, provides efficient co-processing power and allows real-time implementations of complex auditory-based speech processing algorithms.
References
[1] Dau, T., Püschel, D. and Kohlrausch, A.: A quantitative model of the effective signal processing in the auditory system I. Journal of the Acoustical Society of America (JASA) 99 (6): 3631-3633, 1996.
[2] Tchorz, T., Wesselkamp, M. and Kollmeier, B.: Gehörgerechte Merkmalsextraktion zur robusten Spracherkennung in Störgeräuschen. Fortschritte der Akustik, DAGA 96: 532-533, DEGA, Oldenburg, Germany, 1996.
[3] Hansen, M. and Kollmeier, B.: Using a quantitative psychoacoustical signal representation for objective speech quality measurement. In: Proc. ICASSP-97, Intl. Conf. on Acoustics, Speech and Signal Proc.: 1387, Munich, Germany, 1997.
[4] Hansen, M.: Assessment and prediction of speech transmission quality with an auditory processing model. Dissertation, Oldenburg, Germany, 1998.
[5] ETSI, TM/TM5/TCH-HS: Selection Test Phase II: Listening test results with German speech samples. Technical Report 92/35, FI/DBP-Telekom. Experiment 1, IM4, 1992.
[6] Vary, P., Heute, U., Hess, W.: Digitale Sprachsignalverarbeitung. Teubner, Stuttgart, Germany, 1998.
[7] Goldberg, D.: What Every Computer Scientist Should Know About Floating-Point Arithmetic. Computing Surveys, March 1991.
[8] Wüst, H., Kasper, K., Reininger, H.: Hybrid Number Representation for the FPGA-Realization of a Versatile Neuro-Processor. Proc. EUROMICRO 98, 694-701, Västerås, Sweden, 1998.
[9] Shirazi, N., Walters, A., Athanas, P.: Quantitative Analysis of Floating Point Arithmetic on FPGA Based Custom Computing Machines. Technical Report, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, 1995.
[10] Hennessy, J. L., Patterson, D. A.: Computer Architecture - A Quantitative Approach. Morgan Kaufmann Publishers, Inc., San Francisco, California, 1996.
[Figure 6 plots: four scatter panels of subjective MOS (1 to 4.5) versus objective measure q (0.75 to 1). (a) IEEE single precision float, r = 0.935. (b) Small floating point (add, sub, mul: 14 bit mantissa, 6 bit exponent; div: 6 bit mantissa, 6 bit exponent), r = 0.928. (c) Fixed point, 4 integer / 30 fraction bits, r = 0.927. (d) Fixed point, 4 integer / 24 fraction bits, r = 0.63.]
Figure 6. Results for a complete objective speech quality measurement with the ETSI half-rate selection test data [4][5].