TEL AVIV UNIVERSITY
The Iby and Aladar Fleischman Faculty of Engineering
The Zandman-Slaner School of Graduate Studies
IMPLEMENTATION OF THE COCHLEA
MODEL IN VLSI
A thesis submitted toward the degree of
Master of Science in Electrical and Electronic Engineering
by
Udi Shtalrid
May 2005
This research was carried out in the Department of Electrical Engineering - Systems
under the supervision of Prof. Miriam Furst Yust
Table of Contents
Table of Contents
List of Tables
List of Figures
Abstract
Acknowledgements
Introduction
1 The Ear: Anatomy and Model
  1.1 Structure of the Ear
  1.2 The Development of Models
  1.3 Related Hardware Model Implementations
  1.4 Motivation of the Present Study
2 The Model Description
  2.1 Cochlear Fluid Dynamics
  2.2 Existing Software Solution
    2.2.1 The model’s equations
    2.2.2 The software algorithm solution
  2.3 Summary
3 The Hardware Model
  3.1 Introduction
  3.2 Determination of the time step size and spatial resolution
  3.3 The Hardware Model Description
    3.3.1 Eunit
    3.3.2 Gunit
    3.3.3 Punit
    3.3.4 Dunit
    3.3.5 MEunit
  3.4 Computational Analysis
  3.5 Pipeline Architecture
  3.6 Delta Architecture
  3.7 Summary
4 Evaluation of the Hardware Algorithm
  4.1 Punit Integration
  4.2 Results for Different Configurations
  4.3 Determining the Variables’ Presentation
  4.4 Timing Analysis
    4.4.1 Fast Addition
    4.4.2 High Speed Multiplication
  4.5 Power Consumption Analysis
5 FPGA Design and Simulation
  5.1 The Design
  5.2 Simulation Results
  5.3 Summary
6 Discussion
A List of Symbols and parameters
B Mathematical Methods
  B.1 The Finite Difference Method
  B.2 Initial condition problem numerical solution
    B.2.1 Euler Method
    B.2.2 Modified Euler Method
C Booth Recoding
D Delay Calculation
E FPGA Instruction Code
Bibliography
List of Tables
3.1 Comparison of Eunit work-load and latency between software and hardware implementations
3.2 List of original model Parameters
3.3 Comparison of Gunit work-load and latency between software and hardware implementations
3.4 Comparison of Punit work-load and latency between software and hardware implementations
3.5 Comparison of Dunit work-load and latency between software and hardware implementations
3.6 Comparison of MEunit work-load and latency between software and hardware implementations
3.7 Total work-load in the software model
3.8 Total work-load in hardware model
3.9 Total work-load in hardware model for 5×3 configuration
3.10 The critical path in hardware model
4.1 Synthetic input signals
4.2 The Hebrew words input signals
4.3 The lower and upper bounds of the variables for the hardware model
4.4 Fix point representation for the delta model
4.5 Number of operation for critical path
4.6 Tadd and fadd for different hardware model configurations
4.7 Power consumption for different hardware model configurations
5.1 The contents of the FPGA Register Banks
5.2 Number of instructions per unit in the FPGA design
5.3 Relative error of C vs. VHDL implementations
5.4 Amount of logic needed for an adder and multiplier in FPGA
A.1 List of symbols
E.1 The Instruction code for FPGA Controller
List of Figures
1 The number of people suffering from hearing loss
2 Hearing aid market penetration
1.1 Human ear: The outer, middle and inner ear
1.2 A Lateral view of a chinchilla cochlea
1.3 Stylized mammalian cochlea
1.4 Radial segment of the cochlea duct
1.5 A scheme of the Organ of Corti
2.1 Cochlear model geometry
2.2 An equivalent electrical circuit model of the outer-hair cell
2.3 Software design flow
2.4 Software convergence unit
3.1 MSE of "tz" as a function of the time step size
3.2 MSE of "tz" as a function of the spatial resolution
3.3 Hardware flow chart
3.4 The punit as a bottle-neck
3.5 The parallel punit
3.6 Jacobi matrix convergence
3.7 Hardware flow chart
3.8 A Pipeline architecture
4.1 An output example of the hardware model
4.2 Relative error for different time iterations
4.3 Relative error for different p iterations
4.4 Relative error for different combinations of time and p iterations
4.5 Relative error for different combinations of time and p iterations for synthetic signals
4.6 Relative error for different configurations when noise is applied
4.7 The influence of the time step
4.8 Variables’s representation
4.9 Histogram of the basilar membrane velocity
4.10 Histograms of the basilar membrane acceleration
4.11 Relative error for different quantization of the model’s variables
4.12 The system architecture
4.13 The processor architecture
4.14 Asic design uncoiled
4.15 Power consumption vs. relative error
5.1 FPGA design diagram
5.2 A waver of the FPGA design
5.3 The energy of the output signal
B.1 Euler and Modified Euler approximation method
Abstract
A one-dimensional cochlear model with embedded outer hair cells (OHC) was recently
developed by Cohen and Furst [4]. The cochlear model’s output is used to reconstruct
speech signals and improve the signal-to-noise ratio (Weisz and Furst [37]). The
reconstructed speech signals can be used in various applications, such as hearing aids,
cellular communication and voice recognition, where an improvement in the
signal-to-noise ratio is sought. The purpose of this study is to test whether the
cochlear model can be used as a useful preprocessor for different speech analysis systems.
The software solution of the cochlear model is unsuitable for real-time applications.
It requires massive, power-consuming computations and incurs a long computational
latency. Its main equation is solved serially, which is the bottleneck of the algorithm.
A hardware implementation is attractive when power efficiency and real-time
performance are design considerations. It may improve performance by orders of
magnitude.
The cochlear model algorithm was investigated and modified to fit parallel and
pipeline architectures for hardware design, in order to reduce the computational
latency and the amount of computation. The serial solution was converted to an
iterative parallel solution that made real-time performance feasible. We have found
a basic parallel configuration that obtains a relative error of less than 1% compared to
the original algorithm on a set of tested stimuli. The configurations examined have shown
that the cochlear model can be implemented in real time, with a clock frequency
ranging from 10 to 250 MHz and with a reasonable power consumption ranging
from 0.06 to 1.13 W, depending on the specific configuration. Other hardware
simplifications were tested, such as the determination of the wordlengths necessary
for converting the floating-point version of the model into a reduced floating-point
version and into a fixed-point representation. The modified algorithm was evaluated and
validated against the original one using synthetic input signals and a set of recorded
Hebrew words.
The hardware model algorithm was written in VHDL and implemented on an FPGA
simulator. An actual architecture was planned for an ASIC design.
Acknowledgements
I would like to thank my fellows at the Auditory Signal Processing Laboratory: Azi
Cohen, Oren Bahat, Vered Weisz, Tomer Goshen, Nir Fink, and especially Ronen
Akerman, for his help in this research. I would like to thank my friend Maor Shitrit
for his help in programming.
I am grateful to my family for their support and love.
Finally, I would like to thank my advisor Professor Miriam Furst for providing
guidance and keeping me on track.
The work presented in this thesis was supported by "RAMOT", Tel-Aviv University.
Udi Shtalrid
May, 2005
Introduction
Hearing is one of our greatest gifts. The auditory sense in mammals, particularly in
humans, is capable of almost unbelievable feats. The ear does an exquisite job of
transforming acoustic signals, varying enormously in amplitude and waveform, into a
regimented neural code. With hearing, we monitor our environment in all directions,
communicate with one another, and listen to the music of instruments and babbling
children. The loss of this treasure can bring severe behavioral deficits as well as untold
personal agony. A poignant example is provided in a moving letter from the 31-year-old
Beethoven to his brothers, in which he attempts to describe his misery due to a
progressive hearing loss. Not only was he unable to appreciate performances of his
music, he found it difficult to communicate with other people and became a virtual
recluse [12].
Nobody knows the exact number of hearing-impaired people. Professor Adrian
Davis of the British MRC Institute of Hearing Research estimates that there were
440 million hearing-impaired people world-wide in 1995 [35]. He also predicts that
the total number of people suffering from hearing loss of more than 25 dB will exceed
560 million by 2005. In developing countries, where people are more exposed to noisy
environments, the numbers are twice as large. The estimated number of people
who will suffer from hearing loss of more than 25 dB is shown in Figure 1.
Figure 1: A forecast of the number of people suffering from hearing loss of more than 25 dB, made by Adrian Davis.
Current hearing aids do not perform well in a noisy background. They are very
helpful for severe and profound hearing impairment, but not for most people who
suffer from mild-to-moderate hearing loss. People with mild-to-moderate hearing loss
suffer mainly from difficulty understanding speech in a noisy background.
A hearing-aid industry market tracking survey covering 1984-2000, conducted by Sergei
Kochkin [35] and shown in Figure 2, indicates that hearing-aid market penetration
has been low. Only one in five people who need a hearing aid purchase one, and
about 10 to 20 percent of the people who own a hearing aid abandon it.
Figure 2: Hearing aid market penetration survey conducted by Sergei Kochkin.
To understand how the ear can go wrong, we must start by explaining the physiology
of hearing and how the ear operates when it is functioning normally. During the
last century, many experiments and studies, pioneered by G. Von Bekesy in 1928 in work
that earned him a Nobel Prize, contributed to the construction of computational
cochlear models of the auditory system. In recent years, significant progress has
been made in understanding the contribution of the mammalian cochlear outer hair
cells (OHC) to normal auditory signal processing. The outer hair cells, which
act as local amplifiers, were mathematically modeled. Creating a reliable cochlear
model that correctly mimics the functionality of the inner ear is of great
importance. If we could mimic the inner ear well, a hearing-aid instrument could
be developed for hearing-impaired people in which an electronic cochlea substitutes
for the damaged human cochlea.
In the work of Cohen and Furst [4], a classical one-dimensional cochlear model was
modified to include the OHC activity. This model is solved in the time domain
and does not require any assumptions on the stationarity of the input signal. The
model simulates audiograms for normal ears and for ears with OHC loss. By
characterizing phonal trauma as a random loss of outer hair cells along the
cochlear partition, the model succeeds in explaining a well-known phenomenon in which
a loss of sensitivity at 4 kHz is found independently of the type of noise exposure.
The output of the cochlear model is a time-frequency matrix which represents the
partition velocities along the cochlear basilar membrane. We use the reconstruction
algorithm developed by Weisz and Furst [37] to re-synthesize the speech signal from
the cochlear representation. This algorithm includes an estimation of both the
traveling-wave delays and the amplification factors for each of the cochlear partitions.
The reconstructed signal is obtained in the time domain as a weighted, shifted sum of the
cochlear partition responses. The algorithm is based on applying a time-frequency
mask to the cochlear representation before reconstruction. The mask acquisition is
based on following the energy modulation across the cochlear partitions.
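The "weighted, shifted sum" can be illustrated with a minimal Python sketch. It is not the algorithm of Weisz and Furst [37]: the function name, the toy data, the integer sample delays and the weights are all assumptions made for illustration, and the time-frequency masking step is omitted.

```python
def reconstruct(responses, delays, weights):
    """Weighted, delay-compensated sum of per-partition responses.

    responses: equal-length sample lists, one per cochlear partition
    delays:    assumed traveling-wave delay of each partition, in samples
    weights:   assumed amplification factor of each partition
    """
    n = len(responses[0])
    out = [0.0] * n
    for resp, d, w in zip(responses, delays, weights):
        for t in range(n - d):
            # Advance each response by its delay so all partitions line up.
            out[t] += w * resp[t + d]
    return out


# Toy data (hypothetical): the same unit pulse reaches three partitions
# with increasing delay; after compensation the pulses add up at t = 2.
responses = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0],
]
delays = [0, 1, 3]
weights = [0.5, 0.25, 0.25]
signal = reconstruct(responses, delays, weights)
print(signal)
```

In this toy case the weights sum to one, so the aligned pulses recombine into a single unit pulse; in practice the delays and weights would have to be estimated from the cochlear representation, as the text describes.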
In this work, we modify the cochlear model algorithm presented by Cohen and
Furst [4] and propose a dedicated algorithm suited to hardware design. A digital
implementation seems potentially suitable. However, because of its inherent
complexity and massive computations, applying the auditory model in a system
poses a significant engineering challenge. For most applications, such as hearing
implants and hearing aids, the system is constrained to be real-time, low-power
and low-cost. Currently, computing the cochlear model on a general-purpose
workstation takes about 1000 times longer than real time. Therefore, an approach
based on parallel and pipeline architectures is applied.
This thesis is organized as follows. Chapter one describes the anatomy and
physiology of the cochlea. Chapter two explains the mathematical equations of the
cochlear model and their time-domain solution. Chapter three describes the proposed
hardware model. The simulation results and their analysis are discussed in chapter
four. Chapter five describes a VHDL implementation of the hardware model on an
FPGA simulator. Finally, the conclusions are drawn in chapter six.
Chapter 1
The Ear: Anatomy and Model
In this chapter we review the anatomy of the ear. A profound understanding of the
cochlear anatomy enables researchers to develop models that mimic the ear. We also
review related hardware model implementations and present the motivation for this
thesis.
1.1 Structure of the Ear
The mammalian ear is composed of three regions: the outer, middle and inner ear,
as sketched in Figure 1.1. The outer region includes the pinna and the external
canal. The pinna functions as a "collecting horn". It intercepts sound waves from
free space and channels them via the external ear canal to the eardrum. The sound
pressure arriving at the eardrum is amplified at all frequencies, with a gain exceeding
a factor of 5.6 (15 dB) over almost a two-octave frequency range (2-6 kHz). The pinna
significantly modifies the incoming sound at medium and high frequencies.
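The figures quoted above (a pressure gain of 5.6 corresponding to 15 dB, and 2-6 kHz spanning almost two octaves) can be checked with a few lines of Python. This is only an arithmetic illustration; the helper names are hypothetical, and the standard 20·log10 convention for sound-pressure ratios is assumed.

```python
import math


def gain_to_db(pressure_ratio):
    """Convert a sound-pressure ratio to decibels (20*log10 convention)."""
    return 20.0 * math.log10(pressure_ratio)


def db_to_gain(level_db):
    """Convert a level difference in decibels back to a pressure ratio."""
    return 10.0 ** (level_db / 20.0)


print(round(gain_to_db(5.6), 2))    # about 14.96 dB
print(round(db_to_gain(15.0), 2))   # about 5.62

# Width of the 2-6 kHz band in octaves (one octave = a factor of two).
octaves = math.log2(6000.0 / 2000.0)
print(round(octaves, 2))            # about 1.58
```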
Figure 1.1: A sketch of the human ear, displaying the outer, middle and inner ear regions.
The eardrum forms the boundary between the outer and the middle ear. The
resulting eardrum vibrations are transmitted through the air-filled middle ear by a
three-bone structure (the ossicles) to a membrane-covered opening in the bony wall of
the spiral-shaped structure of the inner ear called the cochlea. This opening is called
the oval window, and it forms the boundary between the middle and the inner ear.
The three ossicles are the smallest bones in the body. The transmission of
sound energy through the middle ear, in humans, is most efficient at frequencies
between 0.5 and 4 kHz.
The inner ear, also called the cochlea, consists of a fluid-filled duct coiled like a
snail shell or corkscrew. A photomicrograph of a partially dissected chinchilla cochlea
is shown in Figure 1.2. From the acoustic point of view, the curvature of the scalae
is negligible. The propagation of sound waves in the cochlea is almost exactly as
it would be in a straight, or "uncoiled", cochlea. The mammalian cochlea is
illustrated in Figure 1.3 as an uncoiled cochlea, having longitudinal, vertical and
radial dimensions.
Figure 1.2: Lateral view of a chinchilla cochlea with the bony shell removed. Arrows point to remnants of cochlear partition in the various turns. H, helicotrema; M, modiolus; OW, oval window; RW, round window; S, stapes; ST, scala tympani; SV, scala vestibuli [33].
The perilymphatic space has the shape of an elongated U, the top arm of which
is called the scala vestibuli and the bottom arm of which is called the scala tympani.
The space between the two arms of the mammalian perilymphatic space is the
endolymphatic space, labeled scala media. The extremely thin Reissner’s membrane
separates the scala media from the scala vestibuli, and the cochlear partition, a
flexible structure that contains the sensory hair cells, separates the scala media from
the scala tympani. At the apical end is the helicotrema, a short duct connecting the
two perilymphatic scalae. Thus, when the stapes pushes the oval window inward, the
U-shaped column of perilymph is free to slide through its casing and push the round
window outward. Such movements result in pressure differences between the two sides
of the basilar membrane, causing the flexible cochlear partition to vibrate.
Figure 1.3: Stylized mammalian cochlea, shown as if the cochlear partition were straight [9].
The region of the cochlea adjacent to the oval window is called the base, and the
region farthest away from the stapes is appropriately named the apex. The basic
structure of the cochlear partition is shown in Figure 1.4. Forming the basic platform
of the cochlear partition is the basilar membrane, which is attached on one side to
the bony spiral lamina and on the other side to the spiral ligament. The basilar
membrane is narrower and thicker at the base than it is at the apex. These
longitudinal differences in the structure of the basilar membrane are presumed to
account in large part for the different resonant frequencies measured at different
points along the cochlear partition.
Resting on the basilar membrane is a small but complicated superstructure, known
as the Organ of Corti, which contains the sound-sensing cells. The tectorial membrane
extends from the lip of the spiral limbus to overlie the apical surface of the Organ of
Corti. An expanded view of the Organ of Corti is shown in Figure 1.5.
Figure 1.4: Radial segment of the cochlea duct, showing all three scalae and the basic divisions of the cochlear partition.
The sound-sensing cells are called hair cells because they appear to have tufts of
hairs, called stereocilia, protruding from their tops. The hair cells are divided into
inner and outer hair cells. The inner hair cells form a single row running from base to
apex, whereas the outer hair cells form up to five rows. In humans, there are about
3,500 inner hair cells, each with about 40 stereocilia, and 15,000 outer hair cells,
each with 140 stereocilia protruding from it. When the basilar membrane moves up and
down, a shearing motion is created: the tectorial membrane moves sideways relative to
the tips of the hair cells. As a result, the stereocilia of the hair cells move and
rotate. The movement of the stereocilia leads to a flow of electrical current through
the hair cells, which leads to the generation of action potentials. These potentials
give rise to nerve spikes in the neurones of the auditory nerve. The inner hair cells
act to transduce the mechanical movement into neural activity, whereas the outer hair
cells change their length and size due to these potentials. Thus, the outer hair cells
affect the physical properties of the basilar membrane, which presumably accounts for
better filtering along the cochlear partition.
Figure 1.5: Typical Organ of Corti. 1, basilar membrane; 5, outer hair cells; 12, tectorial membrane; 15, bony spiral lamina; 20, inner hair cells.
1.2 The Development of Models
The first recognized model of the cochlea was published by Helmholtz in 1862 [19]
in an appendix of "On the Sensations of Tone". Helmholtz likened the cochlea to a bank
of highly tuned resonators, which were selective for different frequencies, much like
a piano or a harp, with each resonator representing a different place on the basilar
membrane. The model he proposed was not very satisfying, since many important
features were left out. The most important of these is the cochlear fluid, which
couples the mechanical resonators together. But, given the publication date, it is an
impressive contribution by this early great master of physics and psychophysics.
The next major contribution was made by Wegel and Lane [32], and stands in
a class of its own even today. The paper was the first to quantitatively describe
the details of the upward spread of masking and to propose a "modern" model of the
cochlea. If Wegel and Lane had been able to solve their model’s equations, they would
have predicted cochlear traveling waves.
It was the experimental observations of the Hungarian researcher G. Von Bekesy,
starting in 1928 on the cochleae of human cadavers, which unveiled the physical nature
of the basilar membrane traveling wave [16]. Von Bekesy found that the cochlea is
analogous to a "dispersive" transmission line in which the different frequency
components that make up the input signal travel at different speeds along the basilar
membrane, thereby isolating the various frequency components at different places along
the basilar membrane. He properly named this dispersive wave a "traveling wave". He
observed the traveling wave using stroboscopic light in the cochleae of human cadavers
at sound levels well above the pain threshold, namely above 140 dB SPL. These high sound
pressure levels were required to obtain displacement levels that were observable under
his microscope. Von Bekesy’s pioneering experiments were considered so important
that in 1961 he received the Nobel Prize.
Over the intervening years these experiments have been greatly improved, but
Von Bekesy’s fundamental observations of the traveling wave still stand. Today, we
find that the traveling wave has a more sharply defined location on the basilar
membrane for a pure-tone input than was observed by Von Bekesy. In fact, according to
measurements made over the last 20 years, the response of the basilar membrane to a pure
tone can change in amplitude by more than five orders of magnitude per millimeter
of distance along the basilar membrane.
Among the most common models today are the transmission-line models, also
called one-dimensional models. A one-dimensional model is built from cascaded
sections of inductors, resistors and capacitors, which represent the mass of the
cochlear fluids, the partition resistance and the partition stiffness, respectively.
Two-dimensional [29] and three-dimensional [13] models were also introduced. The
two-dimensional model argues that the long-wave approximation is not fulfilled in the
region of maximum response of the membrane. The three-dimensional model takes
into account that the pressure and fluid flow can vary across the width of the cochlear
partition. Both the two- and three-dimensional models are more complex and involve
complicated mathematics, and are thus harder to solve numerically. Simulations of the
one-dimensional model have gained more appreciation because they require less memory
and fewer computations than the two- and three-dimensional models, and yet are
successful in predicting a large number of phenomena.
1.3 Related Hardware Model Implementations
The field of neuromorphic engineering has the long-term objective of taking
architectures from our understanding of biological systems to develop novel signal
processing systems. There have been several implementations of electronic cochleae in
VLSI technology.
The first electronic cochlea model was implemented in analog VLSI. The electronic
cochlea, first proposed by Lyon and Mead [30], was a cascade of biquadratic
filter sections which mimic the qualitative behavior of the human cochlea. The
original implementation was published in 1988 and used continuous-time subthreshold
transconductance circuits to implement a cascade of 480 stages. In 1992, Watts et al.
reported a 50-stage version with improved dynamic range, stability, matching and
compactness [36]. In addition, a switched-capacitor cochlea filter was proposed by
Bor et al. in 1996 [20]. Although touted for their low power consumption, analog
VLSI subthreshold circuits are fraught with difficulty due to variations in process and
temperature, which affect the stability, accuracy and size of the filters.
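The idea of a cascade of biquadratic sections with a tap after each stage can be sketched digitally as follows. This is not the analog circuit of Lyon and Mead [30]: the coefficient formulas are the widely used "audio EQ cookbook" low-pass biquad, swapped in for illustration, and the sample rate, center frequencies, Q value and function names are all assumptions.

```python
import math


def lowpass_biquad(f0, q, fs):
    """Normalized coefficients of a digital second-order low-pass section."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    a0 = 1.0 + alpha
    b = [(1.0 - cos_w0) / (2.0 * a0),
         (1.0 - cos_w0) / a0,
         (1.0 - cos_w0) / (2.0 * a0)]
    a = [-2.0 * cos_w0 / a0, (1.0 - alpha) / a0]
    return b, a


def run_cascade(samples, center_freqs, q, fs):
    """Pass the samples through the sections in series; return every tap."""
    taps = []
    signal = list(samples)
    for f0 in center_freqs:
        b, a = lowpass_biquad(f0, q, fs)
        x1 = x2 = y1 = y2 = 0.0
        out = []
        for x in signal:
            # Direct-form I difference equation of one biquad section.
            y = b[0] * x + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
            x2, x1 = x1, x
            y2, y1 = y1, y
            out.append(y)
        taps.append(out)   # tap output of this stage
        signal = out       # the next stage filters this stage's output
    return taps


# Hypothetical three-stage cascade with decreasing center frequencies,
# driven by a unit step; each tap should settle at the DC gain of one.
fs = 16000.0
center_freqs = [4000.0, 2000.0, 1000.0]
taps = run_cascade([1.0] * 2000, center_freqs, q=0.707, fs=fs)
print([round(tap[-1], 6) for tap in taps])
```

Because each section has unity gain at DC, the step response of every tap converges to one; frequency selectivity comes from the progressively lower cutoff of the later stages.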
In spite of the difficulties that plague analog VLSI, the amount of work done so
far in developing digital implementations has been limited.
Several digital VLSI cochlea implementations have been reported. In 1992,
Summerfield and Lyon reported an application-specific integrated circuit (ASIC)
implementation which employed bit-serial second-order filters [10]. In 1997, Lim et al.
reported a VHDL-based pitch detection system which used first-order Butterworth
bandpass filters for cochlea filtering [34]. A hardware test of this design has not
been reported. Later, in 1998, Brucke et al. designed a VLSI implementation of a
speech preprocessor which used gammatone filter banks to mimic the cochlea [27].
The design was apparently submitted for fabrication, but test results of the actual
hardware have not been presented. Recently, Leong et al. [28] presented an FPGA-based
implementation of Lyon and Mead’s electronic cochlea filter and its application
to a real-time cochleagram display. The filter was generated by a tool which takes
filter coefficients and compiles an application-optimized design with arbitrary precision.
Both this implementation and that of Brucke et al. used fixed-point arithmetic and
explored the tradeoffs between wordlength and precision.
All of these implementations are fairly simplistic, and in most cases even their
target performance compares poorly with biological data. In this thesis we use the
enhanced cochlear model developed by Cohen and Furst [4], which also integrates the
outer hair cell function. This model is solved in the time domain, unlike the models
of all the other hardware implementations, which were solved in the frequency domain.
In all of the VLSI implementations developed so far, the output may be a cochleagram or
the gain of a specific frequency channel. Our model, in contrast, not only produces
a cochleagram but also takes a further step and applies a reconstruction algorithm to
the cochleagram to produce a reconstructed output signal.
1.4 Motivation of the Present Study
The newly developed cochlear model algorithm of Cohen and Furst [4]
solves the equations of the cochlea for the input signal and produces a detailed
time-frequency representation of the output signal. The reconstruction algorithm,
developed by Weisz and Furst [37], uses this representation to reconstruct the output
signal. Together, the two algorithms form a system which can be
used as a hearing-aid device, as it mimics the functionality of the ear. We hope to
achieve better performance than other state-of-the-art speech enhancement
devices and hearing aids available today.
The development of a new hearing-aid system must be planned to work in real-
time. Therefore, the feasibility of implementing the new algorithm as a real-time
application must be investigated. In this work we evaluate the cochlear model algo-
rithm and modify it to be more efficient for real-time and hardware implementation.
Parallel and pipeline architectures are proposed and verified.
Chapter 2
The Model Description
In this chapter the basic mathematics of the algorithm is explained. The model is
a one-dimensional cochlear model with an embedded outer hair cell model, developed
by Cohen and Furst [4]. The model is analyzed for low-level stimuli, where it can
be treated as a linear model. The solution of the cochlear model equations was
implemented in software, and the software algorithm is described.
2.1 Cochlear Fluid Dynamics
In the simple one-dimensional model (Zwislocki [21]; Zweig et al. [17]; Viergever [26];
Furst and Goldstein [25]), the cochlea is considered as an uncoiled structure with
two fluid-filled rigid-walled compartments separated by an elastic partition. The
basic equations are obtained by applying fundamental physical principles such as
conservation of mass and the dynamics of deformable bodies.
Cohen and Furst [4] integrated the one dimensional cochlear model with the outer
hair cell model. These two models control each other through cochlear partition move-
ment and cochlear partition cross pressure variables. Figure 2.1 illustrates an uncoiled
cochlea approximated by two fluid-filled rigid-walled compartments separated by an
elastic partition.
Figure 2.1: Cochlear model geometry
[Figure labels: oval window, round window, scala vestibuli, scala tympani, basilar membrane, helicotrema; x axis from base to apex]
In order to arrive at a mathematically tractable model, simplifying assumptions
are inevitable. An extensive mathematical one dimensional cochlear model can be
found in [26].
Let x be the longitudinal coordinate such that at the basal end x = 0 and at the
apical end x = ℓ, where ℓ is the uncoiled cochlea length. Let t be the time variable.
Let Pv(x, t) and Pt(x, t) be the pressure in the scala vestibuli and in the
scala tympani, respectively.
The intermediate channel between the scala vestibuli and the scala tympani is
named the scala media and is represented by the elastic partition. The vertical
displacement of the partition along the x dimension is denoted by ξbm(x, t). The fluid
velocity along the x dimension is Uv(x, t) and Ut(x, t) for the scala vestibuli and the
scala tympani, respectively.
The principle of conservation of mass yields the equations:
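$$A\frac{\partial U_v}{\partial x} - \beta\frac{\partial \xi_{bm}}{\partial t} = 0, \tag{2.1.1}$$
$$A\frac{\partial U_t}{\partial x} + \beta\frac{\partial \xi_{bm}}{\partial t} = 0, \tag{2.1.2}$$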
where β(x) is the basilar membrane width and A(x) is the scalae cross section area.
Intuitively, the mass of perilymph compressed by the membrane vertically is pushed
horizontally to the neighboring cross section.
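Both the scala tympani and the scala vestibuli contain perilymph, which is assumed to be
an incompressible and inviscid fluid. The equations of motion for each scala, using Newton's
second law, are written as:
$$\frac{\partial P_v}{\partial x} + \rho\frac{\partial U_v}{\partial t} = 0, \tag{2.1.3}$$
$$\frac{\partial P_t}{\partial x} + \rho\frac{\partial U_t}{\partial t} = 0, \tag{2.1.4}$$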
where ρ is the perilymph density. The difference in the pressure between neighboring
sections is the force which pulls the mass of the perilymph of the section.
This set of equations is completed by the equation of motion of the cochlear
partition. The partition is, mechanically, a flexible structure embedded in a rigid
framework. It is assumed that the flexible part, the basilar membrane, and the
structure above it have pointwise mechanical properties. This means that the velocity
at any point of the partition is related to the pressure difference across the partition
at that point only and not at neighboring points.
The pressure difference across the cochlear partition is defined as:
$$P = P_t - P_v \tag{2.1.5}$$
The cochlear partition is regarded as a flexible boundary between scala tympani
and scala vestibuli, whose mechanical properties are describable in terms of point-wise
mass density, stiffness and damping. Thus at every point along the cochlear duct, the
partition’s velocity is driven by the pressure difference P across the partition. From
the conservation of mass principle we can derive the relationship between the fluid
velocity and the basilar membrane displacement ξbm.
Combining Eqs. 2.1.1 to 2.1.5 yields the differential equation for P:
$$\frac{\partial^2 P}{\partial x^2} - \frac{2\rho\beta(x)}{A}\frac{\partial^2 \xi_{bm}}{\partial t^2} = 0 \tag{2.1.6}$$
and the boundary conditions:
$$P(x,t) = S(t) \ \text{at}\ x = 0, \qquad P(x,t) = 0 \ \text{at}\ x = \ell \tag{2.1.7}$$
where S(t) is the pressure difference at the stapes, which is the input stimulus. Since the
cochlea starts from a rest condition, the initial conditions ∀x ∈ [0, ℓ] are:
$$\xi_{bm}(x,0) = 0, \qquad \frac{\partial \xi_{bm}}{\partial t}(x,0) = 0 \tag{2.1.8}$$
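The step from Eqs. 2.1.1 to 2.1.5 to the pressure equation 2.1.6 can be spelled out as follows. This is only a sketch that uses the equations above and assumes that ρ is constant and that β and A do not depend on time. Differentiating Eq. 2.1.5 with respect to x and using Eqs. 2.1.3 and 2.1.4 gives

$$\frac{\partial P}{\partial x} = \frac{\partial P_t}{\partial x} - \frac{\partial P_v}{\partial x} = \rho\left(\frac{\partial U_v}{\partial t} - \frac{\partial U_t}{\partial t}\right)$$

Differentiating once more with respect to x and substituting Eqs. 2.1.1 and 2.1.2 gives

$$\frac{\partial^2 P}{\partial x^2} = \rho\frac{\partial}{\partial t}\left(\frac{\partial U_v}{\partial x} - \frac{\partial U_t}{\partial x}\right) = \rho\frac{\partial}{\partial t}\left(\frac{\beta}{A}\frac{\partial \xi_{bm}}{\partial t} + \frac{\beta}{A}\frac{\partial \xi_{bm}}{\partial t}\right) = \frac{2\rho\beta}{A}\frac{\partial^2 \xi_{bm}}{\partial t^2}$$

which is Eq. 2.1.6.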
The model includes the pressure produced by the OHCs, Pohc, therefore:
$$P_{bm} = P + P_{ohc} \tag{2.1.9}$$
The third equation models the basilar membrane as an electrical transmission line:
$$P_{bm}(x,t) = m(x)\frac{\partial^2 \xi_{bm}}{\partial t^2} + r(x)\frac{\partial \xi_{bm}}{\partial t} + s(x)\xi_{bm} \tag{2.1.10}$$
where m(x), r(x) and s(x) represent the basilar membrane mass, resistance, and stiffness
per unit area, respectively.
The complete model of the cochlea integrates the outer hair cell model. The OHC
membrane is divided into two regions, the apical part facing scala media and the
basolateral part embedded in the organ of Corti. The basic outer hair cell model represents
these two cell membrane segments as two parallel resistance and capacitance
circuits. Figure 2.2 represents an equivalent electrical circuit model for the OHC.
Figure 2.2: An equivalent electrical circuit model of the outer-hair cell
[Figure labels: ψ0, Gb, Cb, ψ, Ga, Ca, Vsm]
Changes in the outer hair cell length are controlled by the voltage change across
the outer hair cell basolateral membrane ψ. Solving the electrical circuit in Figure 2.2
yields a differential equation for ψ [11]:
$$\frac{d\psi}{dt} + \omega_{ohc}\psi = \lambda\left(\frac{dC_a}{dt} + G_a\right) + \omega_{ohc}\psi_0 \tag{2.1.11}$$
where Ca and Ga are the capacitance and conductance of the apical part, respectively.
ωohc and λ are defined as:
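$$\omega_{ohc} = \frac{G_a + G_b}{C_a + C_b} \approx \frac{G_b}{C_b} = \text{const.} = 2\pi \cdot 1000$$
$$\lambda = \frac{V_{sm}}{C_b + C_a} \approx \frac{V_{sm}}{C_b} = \text{const.}$$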
The capacitance Ca and conductance Ga of the apical part are affected by the stereocilia
movement. They undergo changes due to active opening of ion channels in the
apical part of the outer hair cell. The outer hair cell stereocilia are shallowly but firmly
embedded in the under-surface of the tectorial membrane. Since the tectorial membrane
is attached on one side to the basilar membrane, a shear motion arises between
the tectorial membrane and the organ of Corti as the basilar membrane moves up and
down (Pickles [22]). The model assumes Ga and Ca are functions of ξbm (the basilar
membrane vertical displacement).
The voltage variation across the basolateral part of the OHC causes a length
change (∆ℓohc) in the OHC. Thus, the force Fohc that an OHC exerts due to the
voltage change is given by
$$F_{ohc} = K_{ohc}\left(\Delta\ell_{ohc}(\psi) + \xi_{bm}\right) \tag{2.1.12}$$
The pressure that the OHCs contribute to the basilar membrane pressure is
$$P_{ohc} = \gamma(x)F_{ohc} \tag{2.1.13}$$
where γ(x) is the relative density of healthy OHCs per unit area along the cochlear
duct. γ(x) is referred to as the OHC gain, whose value ranges from 0 to 1.
When linear dependencies are assumed, i.e.,
$$G_a \propto \xi_{bm}, \qquad C_a \propto \xi_{bm}, \qquad \Delta\ell_{ohc} \propto \psi \tag{2.1.14}$$
and the linear assumptions 2.1.14 are substituted in equations 2.1.11, 2.1.12
and 2.1.13, we derive the differential equation for Pohc [11], [4]:
$$\frac{dP_{ohc}}{dt} + \omega_{ohc}P_{ohc} = \gamma(x)\left[\alpha_2\frac{d\xi_{bm}}{dt} + \alpha_1\xi_{bm}\right] \tag{2.1.15}$$
where the values of α1(x) and α2(x) are:
$$\alpha_1(x) = -\frac{r(x)s(x)}{m(x)}, \qquad \alpha_2(x) = r(x)\,\omega_{ohc}$$
2.2 Existing Software Solution
In this section we summarize the cochlear model equations and describe the existing
software solution in the time domain.
2.2.1 The model’s equations
The cochlea model is described by three equations.
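The pressure difference P along the cochlear partition is computed by Eq. 2.1.6,
$$\frac{\partial^2 P}{\partial x^2} - \frac{2\rho\beta(x)}{A}\frac{\partial^2 \xi_{bm}}{\partial t^2} = 0$$
with the boundary conditions as stated in Eq. 2.1.7,
$$P(x,t) = S(t) \ \text{at}\ x = 0, \qquad P(x,t) = 0 \ \text{at}\ x = \ell$$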
The second equation describes the basilar membrane as an electrical transmission
line. According to Eq. 2.1.10 it is:
$$P_{bm}(x,t) = m(x)\frac{\partial^2 \xi_{bm}}{\partial t^2} + r(x)\frac{\partial \xi_{bm}}{\partial t} + s(x)\xi_{bm}$$
with the following initial values ∀x ∈ [0, ℓ]:
$$\xi_{bm}(x,0) = 0, \qquad \frac{\partial \xi_{bm}}{\partial t}(x,0) = 0$$
The third and last equation models the outer hair cell behavior. The equation,
developed as Eq. 2.1.15, is:
$$\frac{dP_{ohc}}{dt} + \omega_{ohc}P_{ohc} = \gamma(x)\left[\alpha_2\frac{d\xi_{bm}}{dt} + \alpha_1\xi_{bm}\right]$$
The contribution of the pressure generated by the outer hair cells was given in
Eq. 2.1.9,
$$P_{bm} = P + P_{ohc}$$
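When substituting Pbm in the second equation (Eq. 2.1.10) we get:
$$P + P_{ohc} = m(x)\frac{\partial^2 \xi_{bm}}{\partial t^2} + r(x)\frac{\partial \xi_{bm}}{\partial t} + s(x)\xi_{bm} \tag{2.2.1}$$
The velocity of the basilar membrane is defined as:
$$v_{bm} = \frac{\partial \xi_{bm}}{\partial t} \tag{2.2.2}$$
so we can rewrite Eq. 2.2.1 and get the expression for the membrane acceleration:
$$v'_{bm} = \frac{1}{m}\left[P + P_{ohc} - r v_{bm} - s\xi_{bm}\right] \tag{2.2.3}$$
Substituting the acceleration expression Eq. 2.2.3 in the model's first equation,
$$\frac{\partial^2 P}{\partial x^2} - \frac{2\rho\beta(x)}{A}\frac{\partial^2 \xi_{bm}}{\partial t^2} = 0$$
yields:
$$\frac{\partial^2 P}{\partial x^2} - \frac{2\rho\beta}{mA}\left[P + P_{ohc} - r v_{bm} - s\xi_{bm}\right] = 0 \tag{2.2.4}$$
We define Q(x) as a function of the spatial variable x,
$$Q(x) = \frac{2\rho\beta}{m(x)A} \tag{2.2.5}$$
and,
$$G(x,t) = P_{ohc} - r(x)v_{bm} - s(x)\xi_{bm} \tag{2.2.6}$$
Substituting Eq. 2.2.5 and Eq. 2.2.6 in Eq. 2.2.4 yields,
$$\frac{\partial^2 P}{\partial x^2} - QP = QG \tag{2.2.7}$$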
2.2.2 The software algorithm solution
In this subsection we describe the solution of the cochlear model algorithm as im-
plemented in software. We divide the solution into units. Each unit is responsible
for solving a different equation. The description is given in high level in order to
understand the flow of the solution. We analyze the software solution in the next
chapter where we introduce the hardware solution and compare it to the software.
The time domain model equations are solved numerically. Two of the model's
equations are initial value problems and one equation is a boundary value
problem. The solution is performed in two sequential steps [25]. The initial value
problems are solved by an iterative method and the boundary value problem is
solved by the finite difference method using a variation of the LU-decomposition method.
The existing software solution is divided into units as seen in figure 2.3.
All of the model’s variables are calculated for each time step. The time variable
step size is defined as ht. In the software algorithm ht is not a fixed number. The
spatial variable step size is hx = l/N , where l is the basilar membrane length and N
is the number of sections. Each point along the basilar membrane is marked by xi.
The algorithm starts by assuming that the variables Pohc, vbm
and ξbm are known at a particular time t = T for every xi. At t = 0 the variables are
initialized to zero according to the initial conditions. We approximate these variables at the
next time point t = T + ht using the Euler method (Appendix B.2.1):
$$\begin{aligned}
\xi_{bm}(x,t) &= \xi_{bm}(x,T) + h_t \times v_{bm}(x,T),\\
v_{bm}(x,t) &= v_{bm}(x,T) + h_t \times v'_{bm}(x,T),\\
P_{ohc}(x,t) &= P_{ohc}(x,T) + h_t \times P'_{ohc}(x,T)
\end{aligned}$$
Figure 2.3: Software design flow
[Figure labels: Input block, Eunit, Gunit, Punit, Dunit, MEunit, Cunit, Output block; decision nodes "first iteration?" and "converge?" with yes/no branches; loop label "T iterations"]
These calculations are done in the Eunit .
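The Euler prediction performed by the Eunit can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the list-based state layout are hypothetical, and the derivative vectors are taken as given inputs rather than computed from the model.

```python
def euler_predict(xi, v, p_ohc, dv, dp_ohc, ht):
    """One explicit Euler step for every cochlear section.

    xi, v, p_ohc are the state vectors at time T (one entry per section);
    dv and dp_ohc are the time derivatives of v and p_ohc at time T.
    The derivative of xi is the velocity v itself.
    """
    xi_next = [x + ht * vel for x, vel in zip(xi, v)]
    v_next = [vel + ht * acc for vel, acc in zip(v, dv)]
    p_next = [p + ht * dp for p, dp in zip(p_ohc, dp_ohc)]
    return xi_next, v_next, p_next
```

Each section is updated independently of its neighbors, which is the property the hardware design later exploits.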
The following unit is the Gunit. We calculate G(x, t) since we need it for the
next unit, which solves the boundary value problem. The G vector is calculated using
Eq. 2.2.6:
$$G(x,t) = P_{ohc}(x,t) - r(x)v_{bm}(x,t) - s(x)\xi_{bm}(x,t)$$
The next unit, which we call the Punit, solves the boundary value differential equation. We
find an approximation to the pressure difference P at every nodal point xi for the
time t = T + ht. The boundary value differential equation is given by Eq. 2.2.7:
$$\frac{\partial^2 P}{\partial x^2} - QP = QG$$
with the boundary conditions:
$$P(x,t) = S(t) \ \text{at}\ x = 0, \qquad P(x,t) = 0 \ \text{at}\ x = \ell$$
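This differential equation is represented as a linear set of equations, AP = B, where
$$P = \begin{pmatrix} P_0 \\ P_1 \\ \vdots \\ P_{N-1} \\ P_N \end{pmatrix}, \qquad
B = \begin{pmatrix} S(T) \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix} + h_x^2 \begin{pmatrix} 0 \\ G_1 Q_1 \\ \vdots \\ G_{N-1} Q_{N-1} \\ 0 \end{pmatrix}$$
and the matrix A is:
$$A = \begin{pmatrix}
1 & 0 & & & \\
1 & -(2 + h_x^2 Q_1) & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & -(2 + h_x^2 Q_{N-1}) & 1 \\
 & & & 0 & 1
\end{pmatrix}$$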
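The entries of A and B follow from a central-difference approximation of Eq. 2.2.7 at the interior nodes. As a short sketch, with P_i, Q_i and G_i denoting the values at x_i:

$$\frac{P_{i-1} - 2P_i + P_{i+1}}{h_x^2} - Q_i P_i = Q_i G_i
\quad\Longrightarrow\quad
P_{i-1} - (2 + h_x^2 Q_i)P_i + P_{i+1} = h_x^2 Q_i G_i, \qquad i = 1, \dots, N-1$$

The first and last rows of the system impose the boundary conditions P_0 = S(T) and P_N = 0.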
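This linear set of equations is solved by LU-decomposition (see Appendix B.1). The
LU-decomposition method is a direct method that is well suited to tridiagonal
systems. The matrix A is rewritten as A = LU where:
$$L = \begin{pmatrix}
\alpha_0 & & & & \\
1 & \alpha_1 & & & \\
 & \ddots & \ddots & & \\
 & & 1 & \alpha_{N-1} & \\
 & & & 1 & \alpha_N
\end{pmatrix}, \qquad
U = \begin{pmatrix}
1 & \gamma_0 & & & \\
 & 1 & \gamma_1 & & \\
 & & \ddots & \ddots & \\
 & & & 1 & \gamma_{N-1} \\
 & & & & 1
\end{pmatrix}$$
and,
$$\begin{aligned}
\alpha_0 &= -\left(1 + h_x^2\frac{Q_0}{2}\right) \\
\alpha_i &= -(2 + h_x^2 Q_i) - \gamma_{i-1}, & i &= 1, 2, \cdots, N-1 \\
\gamma_i &= \frac{1}{\alpha_i}, & i &= 0, 1, 2, \cdots, N-1 \\
\alpha_N &= 1 - \gamma_{N-1}
\end{aligned}$$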
The desired vector P is computed recursively in two steps (Appendix B.1).
Thus, it takes 2N + 2 serial steps to complete the computation of the pressure vector
P. It requires 2N + 1 multiplications and 2N additions. This computational
method for the boundary value equation is a bottleneck for hardware design since
it is done serially.
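To make the serial nature of this solve concrete, here is a minimal sketch of a tridiagonal solver. It uses the standard Thomas algorithm (forward elimination followed by back substitution), not the specific LU variant of Appendix B.1, which is not reproduced here. The function names and argument layout are illustrative assumptions.

```python
def solve_tridiagonal(sub, diag, sup, rhs):
    """Thomas algorithm for a tridiagonal system.

    sub[i] multiplies x[i-1], diag[i] multiplies x[i], sup[i] multiplies x[i+1].
    sub[0] and sup[-1] are ignored.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = sup[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward sweep: each row depends on the previous one (serial).
    for i in range(1, n):
        denom = diag[i] - sub[i] * c[i - 1]
        c[i] = sup[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - sub[i] * d[i - 1]) / denom
    # Back substitution: each unknown depends on the next one (serial).
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def build_pressure_system(q, g, s_t, hx):
    """Assemble A P = B for nodes 0..N from the vectors Q and G and the input S(T)."""
    n = len(q) - 1
    sub = [0.0] + [1.0] * (n - 1) + [0.0]
    sup = [0.0] + [1.0] * (n - 1) + [0.0]
    diag = [1.0] + [-(2.0 + hx * hx * q[i]) for i in range(1, n)] + [1.0]
    rhs = [s_t] + [hx * hx * q[i] * g[i] for i in range(1, n)] + [0.0]
    return sub, diag, sup, rhs
```

Both loops carry a dependency from one index to the next, so the work cannot be spread across sections. This is the property that motivates the change of method in the hardware model.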
The next unit is called the Dunit. It calculates the membrane acceleration v′bm and
the derivative of the outer hair cell pressure P′ohc using Eq. 2.2.3:
$$v'_{bm} = \frac{1}{m}\left[P + P_{ohc} - r v_{bm} - s\xi_{bm}\right]$$
and Eq. 2.1.15:
$$P'_{ohc} = \gamma(x)\left[\alpha_2\frac{d\xi_{bm}}{dt} + \alpha_1\xi_{bm}\right] - \omega_{ohc}P_{ohc}$$
In order to improve the convergence of the initial value differential equations, the Modified
Euler method is used. It is an iterative method; the number of iterations depends on
the accuracy required (Appendix B.2.2).
$$\begin{aligned}
\xi_{bm}(T + h_t) &= \xi_{bm}(T) + \frac{h_t}{2}\left[v_{bm}(T) + v_{bm}(T + h_t)\right],\\
v_{bm}(T + h_t) &= v_{bm}(T) + \frac{h_t}{2}\left[v'_{bm}(T) + v'_{bm}(T + h_t)\right],\\
P_{ohc}(T + h_t) &= P_{ohc}(T) + \frac{h_t}{2}\left[P'_{ohc}(T) + P'_{ohc}(T + h_t)\right]
\end{aligned}$$
This procedure is indicated as MEunit in Figure 2.3.
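The predictor and corrector pattern of the Modified Euler method can be illustrated on a single scalar equation y' = f(y). This is a simplified sketch, not the cochlear update itself: in the model, re-evaluating the derivative at T + ht involves the Gunit, Punit and Dunit. The function name and the default iteration count are assumptions.

```python
def modified_euler_step(f, y, ht, iterations=3):
    """Advance y' = f(y) by one step of size ht.

    An explicit Euler prediction is refined by repeatedly applying the
    trapezoidal corrector, which averages the slopes at both ends of the step.
    """
    dy = f(y)
    y_next = y + ht * dy  # Euler prediction
    for _ in range(iterations):
        y_next = y + 0.5 * ht * (dy + f(y_next))  # corrector
    return y_next
```

For example, `modified_euler_step(lambda y: -y, 1.0, 0.1)` lands much closer to the exact value exp(-0.1) than the plain Euler prediction of 0.9.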
The magnitude of the variables, including the input, might undergo significant
changes during the computation process. The variables' magnitude range is about
200 dB. In order to keep the computation error bounded, the approximations of the
membrane's velocity and acceleration are checked at each computational iteration. If
the convergence test represented by the Cunit fails, the algorithm recomputes the
basilar membrane variables until they converge. The convergence test
unit (Cunit) also controls the time step variable ht according to the approximation
error. If the approximation error of the variables is small, we may take a larger time
step size ht for the next time point. The decision block diagram is illustrated in
Figure 2.4.
Figure 2.4: Software convergence unit
[Figure labels: test block "Compare the values of v(x,t) and v'(x,t) with the values received in the previous iteration. Are they close enough?"; outcomes: very close, close enough, not close enough, not close at all; actions: advance to the next time step, increase step size, run another iteration, restart this time step with smaller step size]
2.3 Summary
The cochlea model consists of one boundary value equation and three first-order
initial value equations. The solution of the algorithm is done in two phases. The
boundary value problem is solved analytically with the finite difference method using
the LU method and the initial value equations are solved numerically using Euler and
Modified Euler methods. The software solution block diagram is shown in Figure 2.3.
The algorithm uses a fixed resolution along the cochlea, but variable time steps.
The time step size is controlled by the convergence unit (Cunit) which compares the
variables ξbm, ξ′bm values to the values received in the previous iteration and decreases
or increases the time step size if needed. The algorithm continues to the next time
point if the estimated truncation error for all nodal points is less than some threshold.
Chapter 3
The Hardware Model
3.1 Introduction
In the previous chapter we introduced the software algorithm solution for the cochlear
model in the time domain. The simulation of the algorithm yielded a processing time
ratio of about 1 to 1,000; that is, it takes about 1,000 seconds to process 1 second of
speech on a Pentium 4 computer. It was clear that this kind of solution was unsuitable
for real-time application due to its long latency.
In order to shorten simulation time for future research applications and for in-
vestigation of feasibility and potential use in hearing aids, a special hardware model
is proposed. A hardware implementation offers a low cost and high speed capability
that would appear to be an attractive approach. As the VLSI technology is getting
smaller and faster we believe the algorithm may work in real-time.
In this chapter we modify and fit the solution of the cochlear model for hardware
design. Each unit is discussed, examined and its modifications are explained.
We concentrate on reducing the algorithm's work-load and, above all, its large computational
latency, which makes it unsuitable for a real-time application. Examination
of the existing software solution described in the last chapter revealed two major
problems.
The first problem encountered was the variation of the number of computational
iterations in the solution of the algorithm. A constant throughput is essential in
a hardware design, so the number of time iterations had to be fixed.
Moreover, we define a constant time step size ht, in contrast to the software
solution where the number of iterations and the specific time step size were determined
by the convergence unit (Cunit) at each iteration. An unvarying time step
size is also essential for the hardware design in order to have a constant throughput.
Finding the best time step size and the number of computational iterations
simplifies the model solution significantly and makes it more suitable for hardware
design. We find the optimum time step size and the number of time iterations which
converge and give good results.
The second problem encountered was the large latency of the computational method
for the boundary condition problem as implemented in the software solution. The
boundary condition problem in software is solved by a variation of the LU method,
which is done serially. This serial solution is considered the bottleneck of the
algorithm. A new iterative solution is proposed for the hardware solution. The new
solution is solved numerically and had to be integrated into the algorithm. We now
had two iterative numerical solutions nested one inside the other and had to fix
the number of iterations to be deterministic.
In the proposed hardware algorithm we managed to reduce the work-load and to
reduce the critical-path latency of the algorithm. The modifications of the algorithm
were verified against the original software algorithm.
In the following section, we use the software simulations of the cochlear model to
analyze and determine the optimum spatial resolution hx and the optimum constant
time step size ht for the hardware model.
3.2 Determination of the time step size and spatial
resolution
The analysis [7] focused on determining the preferred working points, which are the
optimum time step size ht and the optimum cochlear resolution hx.
In order to estimate the optimum time step size ht, the model was tested for
different sets of step sizes with different input signals. The reference configuration of
the software algorithm was chosen as ht = 10^−7 sec and N = 4096 sections.
Figure 3.1 represents the relative error as a function of maximum time step size for
different numbers of cochlear sections. The results are plotted for the phoneme "tz".
A significant change can be seen when ht = 1 µsec. Similar results were obtained
for other input signals.
Figure 3.1: Relative error of the word "tz" as a function of the time step size
We ran similar simulations in order to estimate the best cochlear partition.
Figure 3.2 represents the relative error as a function of the number of cochlear sections
for different time step sizes. The results are for the phoneme "tz". Similar results
were obtained for other input signals.
Figure 3.2: Relative error of the word "tz" as a function of the spatial resolution
If we choose a relative error of 10^−3 as the maximum permissible error, then:
$$h_t \le 1\,\mu\text{sec}, \qquad N = 512, \quad \text{thus} \quad h_x = l/N = 3.5/512$$
We have modified the values of ht and hx to be powers of 2 in the hardware model to
simplify multiplications with these parameters:
$$h_t = 2^{-20} \approx 1\,\mu\text{sec}, \qquad h_x = 2^{-7} \approx 3.5/512$$
3.3 The Hardware Model Description
This section describes the units of the cochlear model algorithm as shown in figure 3.3.
We describe the functionality of each unit and discuss the modifications made for the
hardware model.
There are several key points which must be applied when approaching a hardware
design. The algorithm presented in chapter two, which is the base for the hardware
design algorithm, was not planned for short latency or efficient use.
Our first key point was to convert the parameters to powers of 2.
This way, every multiplication of a variable by a parameter becomes
a shift operation, which is much easier to implement than a multiplication and
has almost no delay. This key point also includes the parameters ht and hx, which
were changed to powers of 2. xi represents the ith coordinate along the
cochlear membrane, where 0 ≤ xi ≤ 3.5 cm. We define
$$x_i = x_0 + i \cdot h_x$$
where x0 = 0. We chose hx = 2^−7, thus xN = 3.5 for i = 448 and i is defined for
0 ≤ i ≤ 448. We have chosen the number of sections, N, to be 448. ht was changed
to 2^−20 sec, which is very close to 1 µsec.
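The effect of rounding the step sizes to powers of 2 can be checked with a few lines of arithmetic. The sketch below uses only the values quoted in the text (3.5 cm length, 512 software sections, 1 µsec target step); the function and constant names are illustrative.

```python
HT = 2.0 ** -20      # hardware time step, in seconds
HX = 2.0 ** -7       # hardware spatial step, in cm
LENGTH_CM = 3.5      # cochlear length from the text


def grid_summary():
    """Return the section count and the relative deviation of HT and HX
    from the values selected in the software analysis."""
    n_sections = round(LENGTH_CM / HX)
    ht_rel_dev = (HT - 1e-6) / 1e-6                      # relative to 1 microsecond
    hx_software = LENGTH_CM / 512
    hx_rel_dev = (HX - hx_software) / hx_software        # relative to 3.5/512
    return n_sections, ht_rel_dev, hx_rel_dev
```

The section count comes out as exactly 448, the time step is a few percent shorter than 1 µsec, and the spatial step is somewhat coarser than 3.5/512, which matches the move from 512 to 448 sections.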
The second key point is to create a synchronous design which will have a constant
throughput. The software algorithm uses the convergence unit to decide on the
next time step size; moreover, it decides the number of computational iterations for
each time step. This way, the computation for each time step takes a different amount of time.
In the hardware design we have fixed the time step size to be constant and equal to
ht = 2^−20. The number of time iterations was also studied and fixed.
Figure 3.3: Flow chart of the time domain solution algorithm for the hardware model.
[Figure labels: Input block, Eunit, Gunit, Punit, Dunit, MEunit, Output block; loop labels "P iterations" and "T iterations"]
The third key point is to design the algorithm with minimal latency. Massive
computations are needed for each time step, yet our goal is to design a real-time
application. In order to shorten the processing time, a parallel architecture is planned.
Moreover, we also plan and investigate a pipeline architecture. Parallel and
pipeline architectures certainly shorten the computation latency but, as expected,
there are tradeoffs. In the following sections we describe each of the algorithm's units
in detail.
3.3.1 Eunit
The Eunit is the Euler computation unit. It predicts the basilar membrane displacement
ξbm, velocity vbm and the outer hair cell pressure Pohc for all of the
sections along the basilar membrane for the next time step T + ht. It uses the
Euler method as explained in chapter two. The three equations used in the software
design for x0 ≤ x ≤ xN are:
$$\begin{aligned}
\xi_{bm}(x, T + h_t) &= \xi_{bm}(x, T) + h_t \times v_{bm}(x, T)\\
v_{bm}(x, T + h_t) &= v_{bm}(x, T) + h_t \times v'_{bm}(x, T)\\
P_{ohc}(x, T + h_t) &= P_{ohc}(x, T) + h_t \times P'_{ohc}(x, T)
\end{aligned} \tag{3.3.1}$$
Fixing ht to be 2^−20 in the hardware design replaces the multiplications in the equations
with shift operations for x1 ≤ x ≤ xN−1:
$$\begin{aligned}
\xi_{bm}(x, T + h_t) &= \xi_{bm}(x, T) + \operatorname{shift}(v_{bm}(x, T), -20)\\
v_{bm}(x, T + h_t) &= v_{bm}(x, T) + \operatorname{shift}(v'_{bm}(x, T), -20)\\
P_{ohc}(x, T + h_t) &= P_{ohc}(x, T) + \operatorname{shift}(P'_{ohc}(x, T), -20)
\end{aligned} \tag{3.3.2}$$
where x represents the basilar membrane partition. The membrane variables computed
above are irrelevant at the boundary points x0 and xN since the membrane's
pressure P is already known at these points. Thus, they are not computed. As
explained before, the hardware design uses 448 sections unlike the software which uses
512 sections.
As seen from equations 3.3.1, the software Eunit requires three multiplications and
three additions for every x coordinate for one time step while the hardware design
only needs three additions. The Eunit is only calculated once for each time step.
Since there is no dependency in the Eunit between the x coordinates, it is possible to
calculate all coordinates at once assuming we do it in parallel having N − 1 Adders.
Hence, the latency of the hardware design could be reduced to one addition. Table 3.1
summarizes the latency and work-load of the Eunit.
Parameter | Operation       | Software | Hardware
Latency   | additions       | 3(N − 1) | 1
Latency   | multiplications | 3(N − 1) | 0
Work      | additions       | 3(N − 1) | 3(N − 1)
Work      | multiplications | 3(N − 1) | 0
Work      | shifts          | 0        | 3(N − 1)
Table 3.1: Comparison of Eunit work-load and latency between software and hardware implementations.
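The replacement of the multiplication by ht with a shift can be sketched with integers standing in for fixed-point hardware registers. The fixed-point format below (40 fractional bits) and the helper names are hypothetical choices made for illustration; the thesis's actual word lengths are not assumed here.

```python
FRAC_BITS = 40  # hypothetical format: real value = raw integer / 2**FRAC_BITS


def to_fixed(value):
    """Convert a float to the hypothetical fixed-point integer format."""
    return int(round(value * (1 << FRAC_BITS)))


def to_float(raw):
    """Convert a fixed-point integer back to a float."""
    return raw / (1 << FRAC_BITS)


def eunit_update(state_raw, derivative_raw, shift=20):
    """Compute state + ht * derivative with ht = 2**-shift.

    The multiplication is an arithmetic right shift, so the update costs
    one shift and one addition.
    """
    return state_raw + (derivative_raw >> shift)
```

For instance, `to_float(eunit_update(to_fixed(0.5), to_fixed(3.0)))` gives the same value as `0.5 + 3.0 * 2**-20`. Python's `>>` on integers is an arithmetic shift, so negative derivatives are handled as well, rounding toward minus infinity.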
3.3.2 Gunit
The Gunit computes two separate variables, the g vector and the outer hair cell
pressure derivative P′ohc.
The g variable does not have any physical meaning; it is a mathematical description
of the vector b in the system Ax = b for the boundary condition computation in the
next unit. The vector g is defined by the following equation:
$$g(x, T + h_t) = -K(x) \times \left(\frac{\xi_{bm}(x, T + h_t)}{c(x)} + r(x)v_{bm}(x, T + h_t) + \gamma P_{ohc}(x, T + h_t)\right) \tag{3.3.3}$$
where,
$$g(x_0) = \text{input}, \qquad g(x_{448}) = 0, \qquad K(x) = \frac{2\rho\beta}{m(x)A} \longrightarrow \frac{2^{-6}}{m(x)}$$
The original parameters in the calculation are defined in table 3.2.
Parameter | Value/Definition      | Units       | Description
ℓ         | 3.5                   | cm          | Cochlear length
ρ         | 1                     | gr/cm3      | Density of perilymph
β         | 0.15                  | cm          | Width of the basilar membrane
γ         | 0.5                   |             | Outer hair cell gain
A         | 25                    | cm2         | Cross-sectional area of the cochlea scalae
m         | 1.267 · 10^−6 e^(1.5x) | gr/cm2      | Basilar membrane mass per unit area
c         | 7.8 · 10^−5 e^(1.5x)   | gr/cm2sec2  | Basilar membrane stiffness per unit area
r         | 0.25 e^(−0.6x)         | gr/cm2sec   | Basilar membrane resistance per unit area
Table 3.2: List of original model parameters
The mass m(x), resistance r(x) and elasticity c(x) vectors are fixed. Those numbers
are computed beforehand and are stored in tables.
By rewriting equation 3.3.3 as,
$$g(x, T + h_t) = \left(-\frac{K(x)}{c(x)}\right)\xi_{bm}(x, T + h_t) - K(x)r(x)v_{bm}(x, T + h_t) - \gamma K(x)P_{ohc}(x, T + h_t)$$
and computing the equation's coefficients beforehand, the computation of the g variable
for each section requires three multiplications and two additions. In the best
case, with 3(N − 1) multipliers in parallel, we can obtain a latency of one
multiplication and two additions.
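The idea of precomputing the three coefficients per section can be sketched as follows. The parameter formulas are copied from Table 3.2 and the rounded K(x) = 2^−6/m(x) from the text; the function names and the tuple-based table layout are assumptions made for illustration only.

```python
import math

GAMMA = 0.5        # outer hair cell gain (Table 3.2)
HX = 2.0 ** -7     # spatial step, in cm
N = 448            # number of sections


def mass(x):
    return 1.267e-6 * math.exp(1.5 * x)


def c_of_x(x):
    # c(x) as listed in Table 3.2
    return 7.8e-5 * math.exp(1.5 * x)


def resistance(x):
    return 0.25 * math.exp(-0.6 * x)


def build_g_tables():
    """Precompute (-K/c, -K*r, -gamma*K) for the interior sections i = 1..N-1."""
    tables = []
    for i in range(1, N):
        x = i * HX
        k = 2.0 ** -6 / mass(x)
        tables.append((-k / c_of_x(x), -k * resistance(x), -GAMMA * k))
    return tables


def gunit(tables, xi, v, p_ohc):
    """Three multiplications and two additions per section."""
    return [a * s + b * w + c * p
            for (a, b, c), s, w, p in zip(tables, xi, v, p_ohc)]
```

Since the tables depend only on x, they would be computed once and reused at every time step.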
The computation of the outer hair cell pressure derivative P′ohc is given by the
following equation:
$$P'_{ohc}(x, T + h_t) = K_{ohc}(x) \times \left(v_{bm}(x, T + h_t) - w_1(x)\xi_{bm}(x, T + h_t)\right) - w_0 \times P_{ohc}(x, T + h_t) \tag{3.3.4}$$
where,
$$K_{ohc}(x) = -r(x) \times w_0, \qquad w_0 = 2\pi F_{ohc} = 2\pi \times 1500 = 9424.778 \longrightarrow 2^{12}, \qquad w_1(x) = \frac{1}{m(x)c(x)w_0}$$
The value of the parameter w0 was also rounded to a power of 2, converting
the multiplication of Pohc into a shift operation. The computation of P′ohc may take
place anywhere between the Gunit and the MEunit, as the value of P′ohc is only needed
for the Eunit or the MEunit.
Since we have already calculated the value of ξbm(x, T + ht)/(m(x)c(x)) for the g
vector calculation, there is no need to recalculate it; we simply store and
reuse it. In this case the multiplication of w1 with ξbm turns into a shift operation by
w0^−1.
The calculation of P′ohc requires one multiplication and two additions per
section. The latency of P′ohc is not taken into account as it can be computed between
the Gunit and the MEunit.
It is important to mention that the calculations of all the units except the Eunit
are repeated several times, as seen in figure 3.3. The number of time iterations
was determined by simulations and will be discussed later.
The work-load and latency of the Gunit is summarized in table 3.3.
Parameter | Operation       | Software | Hardware
Latency   | additions       | 2(N − 1) | 2
Latency   | multiplications | 3(N − 1) | 1
Work      | additions       | 4(N − 1) | 4(N − 1)
Work      | multiplications | 6(N − 1) | 4(N − 1)
Work      | shifts          | 0        | 2(N − 1)
Table 3.3: Comparison of Gunit work-load and latency between software and hardware implementations.
3.3.3 Punit
The Punit is the unit which solves the boundary condition problem as described in
Eq. 2.2.7,
$$\frac{\partial^2 P}{\partial x^2} - QP = QG \tag{3.3.5}$$
with the boundary conditions:
$$P(x,t) = S(t) \ \text{at}\ x = 0, \qquad P(x,t) = 0 \ \text{at}\ x = \ell$$
The solution of the boundary condition problem is the major calculation in the
algorithm. The method chosen for its solution in software was the LU-decomposition
method described in Appendix B.1. The LU-decomposition method is solved serially.
It requires 2 × (N + 1) sequential steps, which makes it unsuitable for hardware
implementation. The Punit is a bottleneck, as shown in figure 3.4.
Figure 3.4: Illustration of the Punit as a bottleneck
[Figure labels: input, Eunit (Euler), Gunit, Punit, Dunit, MEunit, Output; axis label: Sections]
The algorithm used in the software solution provides an exact solution. We choose to
consider an iterative method that can be solved in parallel. Iterative methods work
by continually refining an initial approximate solution so that it becomes closer and
closer to the correct solution. In some cases, iterative algorithms require substantially
less time and/or fewer processors than do their exact algorithm counterparts.
We have chosen the Jacobi Relaxation method [24] in order to solve the boundary
condition problem. The Jacobi Relaxation method is an iterative method which
enables us to solve the equation in parallel. In the following subsection we
explain the Jacobi method.
Jacobi Relaxation
Consider the N × N system of equations A~x = ~b, where we assume that A = (aij)
is invertible (so that ~x has a unique solution), and that the diagonal entries of A are
nonzero. Rewriting the ith equation and solving for xi, we find that:
$$x_i = \frac{-1}{a_{ii}}\Big(\sum_{j \neq i} a_{ij}x_j - b_i\Big) \tag{3.3.6}$$
for 0 ≤ i ≤ N. Given an approximate solution ~x(t) to the system of equations, one
natural way to update the solution is to reformulate equation 3.3.6 as:
$$x_i(t + 1) = \frac{-1}{a_{ii}}\Big(\sum_{j \neq i} a_{ij}x_j(t) - b_i\Big) \tag{3.3.7}$$
Updating the solution for ~x by equation 3.3.7 is known as Jacobi iteration or Jacobi
relaxation, and can produce solutions that are close to optimal in a reasonable number
of iterations provided that the matrix A satisfies certain properties.
Let us define D as
$$D = \begin{pmatrix}
a_{11} & 0 & & \\
0 & a_{22} & 0 & \\
 & \ddots & \ddots & \ddots \\
 & & 0 & a_{NN}
\end{pmatrix}$$
and M as
$$M = D^{-1}(D - A)$$
If we rewrite equation 3.3.7 in vector form:
$$\vec{x}(t + 1) = -D^{-1}\big((A - D)\vec{x}(t) - \vec{b}\big) = M\vec{x}(t) + D^{-1}\vec{b} \tag{3.3.8}$$
and let
$$\vec{\varepsilon}(t) = \vec{x}(t) - \vec{x} \tag{3.3.9}$$
denote the vector amount by which ~x(t) differs from the exact solution ~x, then
substituting eq. 3.3.8 in eq. 3.3.9 yields,
$$\vec{\varepsilon}(t + 1) = \vec{x}(t + 1) - \vec{x} = M\vec{\varepsilon}(t) \tag{3.3.10}$$
Jacobi relaxation converges to the correct solution for ~x provided that M^t converges
to zero as t → ∞, where M = D^−1(D − A) and D is the diagonal matrix containing the
diagonal entries of A. (Equivalently, the algorithm converges to the correct solution
provided that all of the eigenvalues of M have magnitude less than one.) Thus,
~ε(t) = M^t ~ε(0) and ~ε(t) → 0 if M^t → 0 as t → ∞. The rate of convergence depends
on how close the eigenvalues of M are to 1 in absolute value.
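A generic Jacobi iteration following equation 3.3.7 can be written in a few lines. This is a minimal sketch for a small dense system, with hypothetical names; it is meant to show that every component of the new iterate is computed from the previous iterate only.

```python
def jacobi(a, b, x0, iterations):
    """Run a fixed number of Jacobi iterations on the system a x = b.

    a is a list of rows, b the right-hand side, x0 the initial guess.
    All new components are built from the old vector, so they could be
    evaluated simultaneously.
    """
    n = len(b)
    x = list(x0)
    for _ in range(iterations):
        x = [(b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
             for i in range(n)]
    return x
```

For a diagonally dominant example such as `jacobi([[4.0, 1.0], [2.0, 5.0]], [9.0, 12.0], [0.0, 0.0], 50)`, the iterate approaches the exact solution (11/6, 5/3).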
Applying Jacobi Relaxation in the Punit
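The boundary condition problem is represented as a linear system AP = B (chapter
two and Appendix B.1) where,
$$P = \begin{pmatrix} P(0) \\ P(1) \\ \vdots \\ P(N-1) \\ P(N) \end{pmatrix}, \qquad
B = \begin{pmatrix} S(T) \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix} + h_x^2 \begin{pmatrix} 0 \\ G_1 Q_1 \\ \vdots \\ G_{N-1} Q_{N-1} \\ 0 \end{pmatrix}$$
and the matrix A is:
$$A = \begin{pmatrix}
1 & 0 & & & \\
1 & -(2 + h_x^2 Q_1) & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & -(2 + h_x^2 Q_{N-1}) & 1 \\
 & & & 0 & 1
\end{pmatrix}$$
Dividing the system by h_x^2 yields:
$$\begin{pmatrix}
1 & 0 & & & \\
u & m_{11} & u & & \\
 & \ddots & \ddots & \ddots & \\
 & & u & m_{N-1\,N-1} & u \\
 & & & 0 & 1
\end{pmatrix}
\begin{pmatrix} p(0) \\ p(1) \\ \vdots \\ p(N-1) \\ p(N) \end{pmatrix}
=
\begin{pmatrix} g(0) \\ g(1) \\ \vdots \\ g(N-1) \\ g(N) \end{pmatrix} \tag{3.3.11}$$
where the g vector was defined and calculated in the previous unit (Gunit) and u and
mii are defined as:
$$\begin{aligned}
u &= 1/h_x^2 = 1/(2^{-7})^2 = 1/2^{-14} = 2^{14}\\
m_{ii} &= -(2/h_x^2 + K(i)) = -(2^{15} + K(i)), \qquad i = 1, 2, \cdots, N-1
\end{aligned} \tag{3.3.12}$$
p is the basilar membrane pressure vector we want to obtain. Applying the Jacobi
Relaxation method described in equation 3.3.7 we get:
$$\begin{aligned}
p^{n+1}(i) &= \frac{1}{m_{ii}} \times \Big(g(i) - u\big(p^n(i-1) + p^n(i+1)\big)\Big), \qquad i = 1, 2, \cdots, N-1\\
p(0) &= g(0) = \text{input}\\
p(N) &= g(N) = 0
\end{aligned} \tag{3.3.13}$$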
where n is the iteration number. The first approximation for
the vector p, p^0(i), is the last pressure vector p that was computed. The
multiplication by u in equation 3.3.13 is a shift operation since u equals 2^14.
Using the Jacobi Relaxation method to solve the boundary equation, we can
implement the Punit in parallel, computing the basilar membrane pressure at
coordinate i within a few iterations. We compute all of the coordinates i = 0, · · · , N
at the same time, in parallel, which makes the computational latency minimal and
equal to that of a Punit computation for one section. The parallel architecture is illustrated
in figure 3.5.
The number of iterations will influence the computational precision of the pressure
vector. Applying more iterations will certainly increase precision. On the other hand,
applying more iterations will increase the latency of the Punit.
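The Punit iteration can be sketched for the tridiagonal system 3.3.11. The update below is what equation 3.3.7 gives when applied to that system, with the two neighbor terms summed. The function name, argument layout and the use of floating point are assumptions; in hardware the inner loop would run for all sections at once and the multiplication by u would be a shift.

```python
U = 2.0 ** 14  # u = 1 / hx**2 with hx = 2**-7


def punit_jacobi(p_prev, g, k, iterations):
    """Fixed number of Jacobi iterations for the pressure vector p.

    p_prev is the warm start (the last computed pressure vector),
    g and k are vectors indexed 0..N, and the boundary values are
    p(0) = g(0) and p(N) = g(N).
    """
    n = len(g) - 1
    p = list(p_prev)
    p[0], p[n] = g[0], g[n]  # boundary values stay fixed
    for _ in range(iterations):
        new_p = p[:]
        for i in range(1, n):  # no dependency between sections within an iteration
            m_ii = -(2.0 * U + k[i])
            new_p[i] = (g[i] - U * (p[i - 1] + p[i + 1])) / m_ii
        p = new_p
    return p
```

Starting from the previous pressure vector rather than from zero is what lets a small, fixed iteration count be sufficient when the solution changes little between time steps.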
The LU-decomposition method introduced for the software solution requires 2N + 1
multiplications and 2N additions. Its computational latency is the same since the
method works sequentially. The Jacobi Relaxation method requires one multiplication
and two additions per section. As we have N − 1 coordinates after the elimination
of the first and last coordinates, the number of multiplications is N − 1 and the
number of additions is 2N − 2 per iteration. If we need about Piter iterations to
reach the solution of the Punit, then the total work of the Punit will be Piter(N − 1)
multiplications and Piter(2N − 2) additions. Although the hardware solution requires
more work than the software solution, its latency is far shorter. The latency of
each iteration in the Punit is one multiplication and two additions. If we need Piter
iterations to compute the Punit, the total latency is Piter multiplications and 2 × Piter
additions. The work-load and latency of the Punit are summarized in table 3.4.
Figure 3.5: Illustration of the parallel Punit
[Figure labels: input, Eunit (Euler), Gunit, Punit (one per section), Dunit, MEunit, Output; axis label: Sections]
Parameter   Operation         Software   Hardware
Latency     additions         2N         Piter × 2
            multiplications   2N + 1     Piter × 1
Work        additions         2N         Piter(2(N − 1))
            multiplications   2N + 1     Piter(N − 1)
            shifts            0          Piter(2(N − 1))
Table 3.4: Comparison of Punit work-load and latency between software and hardware
implementations.
Punit convergence
As described in the Jacobi Relaxation subsection, the method converges to the correct
solution for \vec{x} provided that M^t converges to zero as t → ∞. Figure 3.6 demonstrates
the convergence of M^t to zero as t increases when the specified condition for
matrix A is applied according to equation 3.3.11.
Figure 3.6: Jacobi matrix convergence (matrix energy [dB] versus iterations)
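The convergence condition can be explored numerically with a short sketch. It builds the Jacobi iteration matrix M = −D^{-1}(L + U) for the tridiagonal system of equation 3.3.12 and tracks a norm of M^t. Two things here are assumptions rather than the thesis method: the values of K(i) are placeholders, and because this section does not define the "matrix energy" plotted in figure 3.6, the infinity norm is used as a stand-in.

```python
import math


def jacobi_matrix(K, u=2.0 ** 14):
    """Iteration matrix M = -D^{-1}(L + U) for the inner sections."""
    n = len(K)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        w = u / (2.0 * u + K[i])       # -u / m_ii with m_ii = -(2u + K(i))
        if i > 0:
            M[i][i - 1] = w
        if i < n - 1:
            M[i][i + 1] = w
    return M


def matmul(A, B):
    """Plain dense matrix product for small square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inf_norm_db(A):
    """Maximum absolute row sum, expressed in dB."""
    return 20.0 * math.log10(max(sum(abs(x) for x in row) for row in A))


def power_norms_db(K, t_max):
    """Infinity norm of M^t in dB for t = 1..t_max."""
    M = jacobi_matrix(K)
    P, out = M, []
    for _ in range(t_max):
        out.append(inf_norm_db(P))
        P = matmul(P, M)
    return out
```

Because |m_ii| = 2u + K(i) exceeds the sum 2u of the off-diagonal magnitudes whenever K(i) > 0, every row sum of M is below one, so the norm of M^t shrinks with each additional power.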
3.3.4 Dunit
The Dunit calculates the membrane acceleration. This unit follows the Punit as seen
in figure 3.3. The acceleration is calculated from the initial value problem,
\[
v'_{bm} = \frac{1}{m} \left[ P + P_{ohc} - r v_{bm} - s \xi_{bm} \right]
\]
\[
v'_{bm} = \frac{1}{m} \left[ P + g/K \right] \tag{3.3.14}
\]
where P is the pressure vector taken from the Punit. All other variables and parame-
ters have been calculated before. Equation 3.3.14 is used for the software solution. It
requires 2(N − 1) multiplications and N − 1 additions for computing the acceleration
vector v′bm for all the coordinates (we exclude the first and last coordinates). The
latency according to the software solution is one addition and two multiplications for
one section, which is very significant.
Using the equations developed in the Punit, we derive a new expression for
the vector g/K. Rewriting Eq. 3.3.13 and substituting Eq. 3.3.12 yields:
\[
g(i) = \frac{1}{h_x^2} \left( p(i-1) - 2p(i) + p(i+1) \right) - K(i)\,p(i) \tag{3.3.15}
\]
We already know that
\[
K(i) = \frac{2^{-6}}{m(i)}
\]
Dividing the vector g by the vector K and substituting K(i) yields:
\[
\frac{g(i)}{K(i)} = \frac{2^{6} m(i)}{h_x^2} \left( p(i-1) - 2p(i) + p(i+1) \right) - p(i) \tag{3.3.16}
\]
Substituting g/K from equation 3.3.16 into the Dunit equation 3.3.14 gives:
\[
v'_{bm}(i) = \frac{1}{m(i)} \left[ p(i) + \frac{2^{6} m(i)}{h_x^2} \left( p(i-1) - 2p(i) + p(i+1) \right) - p(i) \right] \tag{3.3.17}
\]
Substituting 1/h_x^2 = 2^14 yields:
\[
v'_{bm}(i) = 2^{20} \left[ p(i-1) - 2p(i) + p(i+1) \right] \tag{3.3.18}
\]
The new expression for the basilar membrane acceleration is much more suitable
for the hardware design. Using equation 3.3.18 for the hardware model requires only
two additions per section, thus 2(N−1) additions in total. The Dunit latency is therefore
two additions, which is better than the software solution. We summarize the
work-load and latency of the Dunit in table 3.5.
Parameter   Operation         Software    Hardware
Latency     additions         1(N − 1)    2
            multiplications   2(N − 1)    0
Work        additions         1(N − 1)    2(N − 1)
            multiplications   2(N − 1)    0
            shifts            0           2(N − 1)
Table 3.5: Comparison of Dunit work-load and latency between software and hardware
implementations.
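The equivalence between the software form (equation 3.3.14) and the shift-only form (equation 3.3.18) can be checked numerically. The sketch below is illustrative only: the function names and the sample values of p and m are assumptions, and g is rebuilt from p through equation 3.3.15 rather than taken from a Gunit.

```python
def dunit_software(p, m, u=2.0 ** 14):
    """Acceleration from eq. 3.3.14, v' = (P + g/K) / m, for the inner sections."""
    acc = []
    for i in range(1, len(p) - 1):
        k = 2.0 ** -6 / m[i]                                   # K(i) = 2^-6 / m(i)
        g = u * (p[i - 1] - 2.0 * p[i] + p[i + 1]) - k * p[i]  # eq. 3.3.15
        acc.append((p[i] + g / k) / m[i])
    return acc


def dunit_hardware(p):
    """Acceleration from eq. 3.3.18: two additions plus power-of-two scalings."""
    return [2.0 ** 20 * (p[i - 1] - 2.0 * p[i] + p[i + 1])
            for i in range(1, len(p) - 1)]
```

In floating point the two results agree up to rounding; in hardware the factors 2 and 2^20 are shifts, so no multiplier is needed.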
3.3.5 MEunit
The MEunit is the last unit in the algorithm sequence. It computes the displacement,
velocity and outer hair cell pressure along the basilar membrane using the Modified
Euler method. The Modified Euler equations are:
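\[
\xi_{bm}(T + h_t) = \xi_{bm}(T) + \frac{h_t}{2} \left[ v_{bm}(T) + v_{bm}(T + h_t) \right],
\]
\[
v_{bm}(T + h_t) = v_{bm}(T) + \frac{h_t}{2} \left[ v'_{bm}(T) + v'_{bm}(T + h_t) \right],
\]
\[
P_{ohc}(T + h_t) = P_{ohc}(T) + \frac{h_t}{2} \left[ P'_{ohc}(T) + P'_{ohc}(T + h_t) \right] \tag{3.3.19}
\]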
In the hardware model, ht/2 = 2^-20/2 = 2^-21, so the multiplication by ht/2 becomes a
shift operation. The MEunit may compute all three variables in parallel. Only six
additions are necessary for one coordinate, so the unit requires 6(N−1) additions in
total. The latency is only two additions, assuming we compute
all sections in parallel. The latency could be even lower if the computations
started earlier, since the only variable which limits this computation is the acceleration
v′bm(T + ht), which is computed in the previous unit (Dunit). Therefore, the latency
could be reduced to a single addition. We summarize the work-load and latency of
the MEunit in table 3.6.
Parameter   Operation         Software    Hardware
Latency     additions         2(N − 1)    2
            multiplications   1(N − 1)    0
Work        additions         6(N − 1)    6(N − 1)
            multiplications   3(N − 1)    0
            shifts            0           3(N − 1)
Table 3.6: Comparison of MEunit work-load and latency between software and hardware
implementations.
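A single Modified Euler update from equation 3.3.19 can be sketched as below. The function name and argument layout are assumptions made for illustration; `math.ldexp` scales by a power of two and stands in for the hardware shift by 21 bits.

```python
import math


def me_step(x_old, dx_old, dx_new, shift=21):
    """x(T+ht) = x(T) + (ht/2) * (x'(T) + x'(T+ht)) with ht/2 = 2^-shift.

    One addition sums the two derivatives, the scaling by 2^-shift models
    the shift, and a second addition accumulates onto the old value.
    """
    return x_old + math.ldexp(dx_old + dx_new, -shift)
```

Applying `me_step` once each to the displacement, the velocity and the outer hair cell pressure uses two additions per variable, which matches the six additions per coordinate counted above.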
3.4 Computational Analysis
To this point, a basic hardware model for the cochlea was introduced. A new method
for solving the boundary condition equation was proposed in order to fit a parallel
architecture. We have also converted most of the parameters to powers of 2 and
fixed the numbers of iterations Titer and Piter. The flow diagram of the algorithm
as described is illustrated in figure 3.7.
We analyze the hardware model in two categories. The first category is the total
work-load of the algorithm as implemented in hardware versus the original software
implementation. The second category is the critical-path latency, which
indicates the minimum computational time needed.
Starting with the first category, the work-load of an algorithm affects the power
consumption, silicon area, complexity and processing or computational time. Clearly,
we would like to reduce the work-load of the algorithm as much as we can. The hardware
model consists of two iterative methods. As seen in figure 3.7, the Punit
is solved in Piter iterations and the whole algorithm flow is computed using Titer
iterations. There is a tradeoff between the number of computational iterations and
output precision. Fewer iterations reduce the work-load but harm
performance. The performance analysis (chapter four) shows that the basic
architecture for the hardware model is obtained when the numbers of iterations are
configured to Titer = 3 and Piter = 5 for each time step of ht = 1µs. The Euler unit
(Eunit) is computed only once at the beginning of every time point and the other
units are repeated Titer times for each time point. The Punit is solved by the Jacobi
method with Piter iterations each time.
Figure 3.7: Block diagram of the hardware model. (Blocks: Input block, Eunit, Gunit,
Punit with P iterations, Dunit, MEunit, Output block; outer loop of T iterations.)
In table 3.7 and table 3.8 we compare the work-load of the software and hardware
models. The number of additions, multiplications and shifts are displayed per section
and then multiplied by the parameter N (number of sections) and by Titer and/or
Piter iterations which represent the number of times the unit is calculated for each
time point.
The Work-Load in the software model
          Per Section                    Total
Unit      Additions   Mult.   Shifts     Additions      Mult.            Shifts
Eunit     3           3       0          3(N − 1)       3(N − 1)         0
Gunit     4           6       0          6(4(N − 1))    6(6(N − 1))      0
Punit     2           2       0          6(2N)          6(2N + 1)        0
Dunit     1           2       0          6(N − 1)       6(2(N − 1))      0
MEunit    6           3       0          5(6(N − 1))    5(3(N − 1))      0
Total     16          16      0          75N − 63       78N − 60         0
Table 3.7: An analysis of the work-load of the software model.
The Eunit in the following tables is not multiplied by Titer because it is only
computed once for each time point. In addition, the MEunit count of Titer is
reduced by one, as the first iteration includes the Eunit computation.
The Work-Load in the hardware model for Piter × Titer architecture
          Per Section                  Total
Unit      Add.      Mult.    Shifts    Additions            Mult.            Shifts
Eunit     3         0        3         3(N − 1)             0                3(N − 1)
Gunit     4         4        2         T(4(N − 1))          T(4(N − 1))      T(2(N − 1))
Punit     2 × P     1 × P    2 × P     T(2P(N − 1))         T(P(N − 1))      T(2P(N − 1))
Dunit     2         0        2         T(2(N − 1))          0                T(2(N − 1))
MEunit    6         0        3         (T − 1)(6(N − 1))    0                (T − 1)(3(N − 1))
Total     15 + 2P   4 + P    10 + 2P
The Work-Load in the hardware model for Piter × Titer architecture
Operation         Value
Additions         (2PiterTiter + 12Titer − 3)(N − 1)
Multiplications   (TiterPiter + 4Titer)(N − 1)
Shifts            (2PiterTiter + 7Titer)(N − 1)
Table 3.8: An analysis of the work-load of the hardware model.
It is impossible to know the exact total work-load of the software model since the
number of time iterations is not deterministic. Hence, we have averaged the
number of time iterations for the software algorithm and set it to 6.
The hardware model work-load is calculated for a Piter × Titer architecture, which
means, Titer time iterations and Piter Jacobi iterations when solving the Punit. In the
basic architecture discussed in chapter four we configure Piter and Titer to be 5 and
3 respectively. The hardware model work-load is calculated for this architecture in
table 3.9.
The Work-Load in the hardware model for 5 × 3 architecture
          Per Section                Total
Unit      Add.     Mult.    Shifts   Additions        Mult.           Shifts
Eunit     3        0        3        3(N − 1)         0               3(N − 1)
Gunit     4        4        2        3(4(N − 1))      3(4(N − 1))     3(2(N − 1))
Punit     2 × 5    1 × 5    2 × 5    3(10(N − 1))     3(5(N − 1))     3(10(N − 1))
Dunit     2        0        2        3(2(N − 1))      0               3(2(N − 1))
MEunit    6        0        3        2(6(N − 1))      0               2(3(N − 1))
Total     25       9        20       63(N − 1)        27(N − 1)       51(N − 1)
Table 3.9: An analysis of the work-load of the hardware model for 5×3 configuration.
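The totals in tables 3.8 and 3.9 follow mechanically from the per-section counts and the number of times each unit runs per time step. A small sketch reproduces them; the function name and the dictionary layout are assumptions for illustration, while the per-unit counts are taken from table 3.8.

```python
def hardware_workload(p_iter, t_iter, n):
    """Operations per time step for a Piter x Titer architecture, with n = N."""
    # unit: ((additions, multiplications, shifts) per section, runs per time step)
    units = {
        "Eunit": ((3, 0, 3), 1),
        "Gunit": ((4, 4, 2), t_iter),
        "Punit": ((2 * p_iter, p_iter, 2 * p_iter), t_iter),
        "Dunit": ((2, 0, 2), t_iter),
        "MEunit": ((6, 0, 3), t_iter - 1),
    }
    totals = [0, 0, 0]
    for ops, runs in units.values():
        for k in range(3):
            totals[k] += ops[k] * runs * (n - 1)
    return tuple(totals)
```

With `p_iter=5` and `t_iter=3` the result per inner section is 63 additions, 27 multiplications and 51 shifts, as in table 3.9.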
We can see from the comparison of the basic architecture hardware model and the
software model that the hardware model requires fewer operations to compute
1µSec of speech. A significant change is seen in the number of multiplications: for
1µSec of speech it is about 78N for software and about 27N for
hardware. Multiplications "cost" more than the other operations,
so their reduction was important. Most of the multiplications were converted to
shift operations.
The second category in the computational analysis deals with the critical path.
The critical path is the longest path through the hardware model, which determines
the minimum latency of the computational processing.
The operations in the critical path must be completed within the time frame
of ht (1µSec) when a real-time application is required. The software solution runs on
a single processor, which executes only a few operations at a time. Today, advanced
processors have different instruction queues for different operations such as additions
and multiplications for floating-point and integer numbers [18]. Although each queue
has its own ALU, the dependencies between the instructions are kept. We can assume
that the software work is done serially since we have one general purpose processor,
making the critical path the same as the total work-load.
In the hardware model we assume a computational block (ALU) for each section,
enabling parallel processing. The critical path is displayed in Table 3.10. We do
not consider here the improvement offered by a pipeline architecture. A pipeline
architecture can reduce the critical path by a factor of Titer. The time-iteration
processing can be implemented with a linear array topology. We discuss this issue later.
The Critical-Path in the hardware model for Piter × Titer architecture
          Latency per time iteration      Total latency
Unit      additions   multiplications     additions      multiplications
Eunit     1           0                   1              0
Gunit     2           1                   T × 2          T × 1
Punit     2 × P       1 × P               T × 2 × P      T × P
Dunit     2           0                   T × 2          0
MEunit    1           0                   (T − 1) × 1    0
Total     2P + 6      P + 1               2TP + 5T       TP + T
Table 3.10: An analysis of the critical path of the hardware model for Piter × Titer
architecture.
If we implement the "5x3" architecture where Piter = 5 and Titer = 3 for ht = 1µSec,
we must execute 18 multiplications and 45 additions in 1µSec. We assume that shift
operations are done with no delay. In the next chapter we discuss the timing analysis.
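The critical-path totals of table 3.10 can be evaluated for any configuration with a few lines. The function name is an assumption; the per-unit latencies are the ones listed in the table.

```python
def critical_path(p_iter, t_iter):
    """Additions and multiplications on the critical path of one time step."""
    adds = (1                          # Eunit, once per time point
            + t_iter * 2               # Gunit
            + t_iter * 2 * p_iter      # Punit, p_iter Jacobi sweeps
            + t_iter * 2               # Dunit
            + (t_iter - 1) * 1)        # MEunit, skipped in one iteration
    mults = t_iter * 1 + t_iter * p_iter   # Gunit plus Punit
    return adds, mults
```

For the 5x3 configuration this returns 45 additions and 18 multiplications, the figures quoted above for one time step.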
3.5 Pipeline Architecture
Since the cochlea model algorithm is solved numerically, by an iterative method,
a pipeline architecture might fit this problem. In the basic
hardware model proposed, we chose a 5X3 architecture, which represents 5 iterations
in the Punit and 3 time iterations. In the following pipeline architecture, we divided
the three time-iterations into three stages as shown in figure 3.8.
Figure 3.8: A Pipeline architecture for the hardware model with the 5X3 combination.
(Diagram labels: Input; Eunit; three stages, each with Gunit, Punit, Dunit and MEunit;
"5 iterations"; "Every 0.1 micro second"; output.)
In the pipeline architecture we start to compute each stage using the best estimate
of the variables from the previous stage. The inputs to the
Eunit are taken after the MEunit of the first stage, thus after only one iteration. The
second stage for the same time point starts with the outputs of the first stage and
the updated variables from the third stage. Each time point actually passes through three
stages, which are the three time-iterations, but the stages do not start with the most
updated variables as would be expected in the basic hardware model. The first
approximation of the Punit for the basilar membrane pressure vector is again taken from
the previous stage.
For a real-time application the whole sequence needs to be computed in 1µSec,
since this is the core data rate. With the pipeline, only one stage out of the three
needs to be computed in that time. So, the pipeline architecture reduces the amount
of work for the specified time by a factor of Titer. This enables a lower clock
frequency, which reduces complexity.
The advantages of the pipeline architecture are in the shortening of the computa-
tional latency and the usage of the units at the same time. The major disadvantage
of this architecture is the usage of more silicon area since we need to implement all
of the units and we cannot compress them.
The pipeline architecture was implemented in the C code and verified. It showed
good results which are presented in the next chapter.
3.6 Delta Architecture
The basic software and hardware model numerical solutions solve the algorithm as a
two dimensional discrete net. As the algorithm proceeds, it holds the absolute values
of the variables. It was interesting to check whether converting the algorithm to work
with the variables' differences might improve the performance of the algorithm for the
hardware model. By implementing this delta architecture we might also save
computational operations. The following equations for the units were implemented.
The notation d<variable> represents the delta, or difference, of a variable. The delta
equations for the Eunit are:
\[
d\xi(T + h_t) = 2^{-20} \times v(T), \qquad
dv(T + h_t) = 2^{-20} \times v'(T), \qquad
dP_{ohc}(T + h_t) = 2^{-20} \times P'_{ohc}(T) \tag{3.6.1}
\]
Gunit:
\[
dg = -K \left[ d\xi / c + r\,dv + 0.5\,dP_{ohc} \right] \tag{3.6.2}
\]
Punit:
\[
dp(i) = K \left[ dg - u \left( dp(i-1) + dp(i+1) \right) \right] \tag{3.6.3}
\]
Dunit:
\[
dP'_{ohc} = K_{ohc} \left[ dv - w_1 d\xi \right] - w_0\,dP_{ohc}, \qquad
dv' = 2^{20} \left[ dp(i-1) - 2\,dp(i) + dp(i+1) \right] \tag{3.6.4}
\]
MEunit:
\[
d\xi = 2^{-21} \left[ v(T) + v(T + h_t) \right] = 2^{-21} \left[ 2v(T) + dv(T + h_t) \right],
\]
\[
dv = 2^{-21} \left[ v'(T) + v'(T + h_t) \right] = 2^{-21} \left[ 2v'(T) + dv'(T + h_t) \right],
\]
\[
dP_{ohc} = 2^{-21} \left[ P'_{ohc}(T) + P'_{ohc}(T + h_t) \right] = 2^{-21} \left[ 2P'_{ohc}(T) + dP'_{ohc}(T + h_t) \right] \tag{3.6.5}
\]
Only at the end of the time step, after the Titer-th iteration, do the following equations
need to be computed (not in a pipeline architecture):
\[
v(T + h_t) = v(T) + dv(T + h_t), \qquad
v'(T + h_t) = v'(T) + dv'(T + h_t), \qquad
P'_{ohc}(T + h_t) = P'_{ohc}(T) + dP'_{ohc}(T + h_t) \tag{3.6.6}
\]
Compared to the basic hardware model, we save 3 additions per section in the
Eunit and 3 additions per section in the MEunit in each time-iteration. But we
need to add 3 additions per section, as stated in equation 3.6.6, at the end of a time
step. So, the delta architecture does not reduce the algorithm latency and work-load
dramatically.
We have implemented the delta architecture in the C code and verified its
performance. Histograms for the variables and the differential variables were plotted
and a fixed point representation for the variables was defined. The results of the
simulations and fixed-point representation are discussed in the next chapter.
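The relation between the absolute update (equation 3.3.19) and the delta form of the MEunit, d x = 2^-21 [2 x′(T) + d x′(T + ht)], can be checked with a short sketch. The function names and the sample numbers are assumptions chosen for illustration.

```python
import math


def me_absolute(x_old, dx_old, dx_new):
    """Eq. 3.3.19: x(T+ht) = x(T) + 2^-21 * (x'(T) + x'(T+ht))."""
    return x_old + math.ldexp(dx_old + dx_new, -21)


def me_delta(dx_old, d_dx):
    """Delta form: d x = 2^-21 * (2*x'(T) + d x'), with d x' = x'(T+ht) - x'(T)."""
    return math.ldexp(2.0 * dx_old + d_dx, -21)
```

Adding the delta returned by `me_delta` to the old value gives the same result as `me_absolute`, which is why the absolute variables only need to be rebuilt once at the end of the time step.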
3.7 Summary
In this chapter we introduced the hardware model for the cochlea. The software model
was written for simulation and validation of the cochlea model when the research
began. It was not written with hardware implementation in mind and is therefore
not suitable for hardware design. The major problem of the software algorithm,
the sequential solution method of the Punit, is solved in the hardware model. The
numbers of iterations for both numerical methods used in the algorithm were fixed,
as we expect constant throughput. The parameters had to be converted
to powers of 2. In addition to the basic hardware algorithm we
introduced two more modifications to the model: a pipeline architecture and a
differential-variables architecture. All options were simulated and evaluated. We
discuss the results in the next chapter.
Chapter 4
Evaluation of the Hardware Algorithm
In the previous chapter we introduced the basic hardware model for the cochlea.
We also suggested several architectural modifications such as pipelining and
differential representation. In this chapter we present the simulation results and
analyze them. We also evaluate the representation of the variables for the hardware
design.
The hardware model solution was implemented in C++ and run on a PC. It was
then implemented on an FPGA simulator in order to verify and validate its
performance.
The output of the simulated model is represented by a two-dimensional matrix.
The elements of the matrix represent the cochlear partition velocities (vbm) for every
time point. The matrix rows represent the longitudinal axis of the cochlear partition
and the matrix columns represent the time axis. The time axis is sampled
at a rate of 44.1 kHz. The longitudinal resolution equals the number of sections,
which is 448. This representation is conceptually similar to a spectrogram of the input
signal. An example showing the outputs of the cochlear model for three different input
signals is illustrated in figure 4.1. The three input signals are (1) a chirp from 500Hz
to 8KHz, (2) a combination of several frequencies and (3) the word "Mitz" (defined
in table 4.1).
Figure 4.1: An example of the hardware model outputs. The input signals (amplitude
versus time samples) are displayed on the left side and the output matrices (sections
versus time samples) are displayed on the right side. Upper graphs: chirp; middle:
combination of several frequencies; lower: the word "Mitz".
The output matrices shown in figure 4.1 represent the energy for every section
along the time domain. Each section represents a different frequency. The first
sections represent the higher frequencies and the last sections represent the lower
frequencies. For the chirp output we can see a logarithmic energy line which moves
over time toward lower sections, that is, higher frequencies. For the second output, a
combination of several frequencies, we can see horizontal energy lines along the time
axis for different sections. The model amplifies the lower sections more than the higher
sections, which accounts for the difference in the colors of the energy lines. Finally, for
the word "Mitz", we can see that it is composed of two parts: the "Mi",
which holds low and high frequencies, and the "Tz", which holds only high frequencies.
In order to verify and validate the various implementations of the hardware model we
compare the output matrix to the output matrix of the original software model. We
use the original software model as our reference. The comparison criterion is the mean
square error between the points of the matrices divided by the mean energy of the
reference matrix. A relative error in each cell would be ((A_ij − Ref_ij)/Ref_ij)^2, but
since a large part of the Ref_ij equal 0 we chose the relative error to be
\[
\text{Relative Error} = \frac{\frac{1}{N_T \cdot N_x} \sum_i \sum_j (A_{ij} - Ref_{ij})^2}{\frac{1}{N_T \cdot N_x} \sum_i \sum_j (Ref_{ij})^2} \tag{4.0.1}
\]
where N_T · N_x is the number of elements in A. A and Ref represent the matrix of
the hardware model and the software reference matrix, respectively.
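Equation 4.0.1 translates directly into code. The sketch below assumes the two output matrices are given as nested lists of equal shape; the function name is hypothetical. The common factor 1/(N_T · N_x) cancels between numerator and denominator, so it is omitted.

```python
def relative_error(a, ref):
    """Relative MSE of eq. 4.0.1: squared error energy over reference energy."""
    err_energy = sum((x - r) ** 2
                     for row_a, row_r in zip(a, ref)
                     for x, r in zip(row_a, row_r))
    ref_energy = sum(r ** 2 for row in ref for r in row)
    return err_energy / ref_energy
```

Normalizing by the total reference energy instead of cell by cell keeps the measure defined even though many reference cells are zero.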
In order to evaluate the hardware model we ran the simulation on different input
signals. We used the basic synthetic signals described in table 4.1. For a more
representative test we used a list of Hebrew words called HAB.
Signal   length [Sec]   sample rate [KHz]   Description
Sin      0.1            50                  Combination of several frequencies: 250Hz, 500Hz, 750Hz, 1, 2, 4, 6, 8KHz.
Sinc     0.01           50                  Center at 5ms, with BW of 22.5KHz.
Chirp    0.1            50                  Linear frequency from 0.5 to 8KHz.
Click    0.01           50                  A 0.1ms click at 2 ms.
Mi       0.1            44.1                First part of the word Mitz.
Tz       0.1            44.1                Second part of the word Mitz.
Table 4.1: The Synthetic input signals.
The HAB list is a Hebrew adaptation of the AB list. The AB list is a set of
monosyllabic meaningful consonant-vowel-consonant (CVC)
words. The list was designed so that the different phonemes in English are
equally distributed throughout the entire list [3]. The AB list is commonly used in
hearing tests as it reduces the effect of word frequency and/or word familiarity on
test scores. The HAB list was designed for native Hebrew speakers, and it consists of 15 lists
of 10 monosyllabic words such as "shen" and "kir". Each list consists of ten phonetically
balanced CVC words. The term "phonetic balance" indicates that the speech material
has a phonemic composition equivalent to that of everyday speech. We used one of
the 15 lists, shown in table 4.2. The HAB list was recorded by a single female speaker
with a sampling rate of 44.1kHz. This list is commonly used in hearing tests for
clinical evaluation in Israel, and particularly in the Communication Disorder group at
"Sheba Medical Center".
Signal   length [Sec]   sample rate [KHz]
Buz      0.6            44.1
Chug     0.5            44.1
Dov      0.5            44.1
Eich     0.5            44.1
Kir      0.45           44.1
La       0.4            44.1
Mitz     0.6            44.1
Pas      0.5            44.1
Shen     0.6            44.1
Tof      0.5            44.1
Table 4.2: The Hebrew words input signals.
4.1 Punit Integration
The replacement of the Punit solution method had to be evaluated for convergence
and appropriate performance. The analytical solution was replaced by a numerical
method called Jacobi Relaxation, applied in parallel. In the software model we
estimated the truncation error of the basilar membrane velocity at every time-iteration
and continued iterating until an adequate error was reached. In contrast to the
software model, the hardware model throughput must be constant, implying a fixed
number of computational iterations. In order to verify the integration of the Punit
and determine the optimal numbers of both Piter and Titer iterations, various
combinations of time and Punit iterations were tested. The outputs were compared to
the original software model.
Figure 4.2: The relative error as a function of t iterations when p iterations are
constant. (Plot: relative error on a logarithmic scale from 0.10% to 10000.00% versus
t iterations from 0 to 60, for p = 1, 2, 3, 4, 5 and 50.)
Figure 4.2 shows the relative output error of the hardware model versus
the software model as a function of the number of time iterations for various
values of Piter. Figure 4.3 shows the relative output error as a function
of the number of Punit iterations for various numbers of time iterations. It is clearly
noticeable that the new system with the Punit converges after a few iterations.
Figure 4.3: The relative error as a function of p iterations when t iterations are
constant. (Plot: relative error on a logarithmic scale from 0.10% to 10000.00% versus
p iterations from 0 to 25, for t = 1, 2 and 3.)
Several combinations between the number of time and Punit iterations can be set.
4.2 Results for Different Configurations
We ran the simulations on the synthetic signals and the recorded Hebrew words
for our various hardware model implementations. We write an M × N
configuration for M Punit iterations (Piter) and N time-iterations (Titer): the left
hand number gives the Punit iterations and the right hand number gives the
time-iterations. The relative MSE was computed for each input signal. Figure 4.4
and Figure 4.5 show the relative errors of the hardware model with different
configurations for the Hebrew words and the synthetic signals, respectively.
Figure 4.4: Relative error for different time and Punit iterations configurations for
the Hab1 word list. (Plot: relative error from 0% to 2.5% per word for the 5x3, 5x3pipe,
10x3, 3x3, 5x2, 10x2 and "10,5" configurations.)
We chose the 5X3 configuration as the best since its relative error is about 1%. We can
also see that it is possible to run with a 3X3 configuration, but we get a relative error
higher than one percent for some of the tested words. We can also see that applying
a pipeline architecture to the 5X3 configuration shifts the basic 5X3 relative error
curve upward by about 0.5%. The best performance is observed in the
10X3 configuration in both graphs, indicating the significance of the Punit iterations.
The configuration "10, 5" is composed of two time-iterations where the Punit does
10 iterations in the first time-iteration and 5 iterations in the second. We find
this implementation unattractive because it is not modular. The work-load for every
time-iteration is different, making it harder to implement and unsuitable
for pipelining.
Figure 4.5: Relative error for different time and Punit iterations configurations for
synthetic signals. (Plot: relative error from 0% to 1% for chirp, sin, mi, tz, click and
sinc, for the 5x3, 10x3, 3x3, 5x2, 10x2 and "10,5" configurations.)
The recorded Hebrew words were also tested with additive white Gaussian
noise. We created noisy words with 15dB SNR. The results of the simulations are
plotted in figure 4.6. The graph is plotted for a 5X3 architecture configuration.
The "reg" line represents the relative error for the clean words. The "snr15" line
represents the relative error for the words with 15dB noise applied, and the third line,
"snr15,pipe", shows the effect of adding a pipeline architecture. As seen from the
graph, unexpectedly, the relative errors for the tested noisy words were lower than
those for the clean ones. Since the algorithm is numeric and we deal with small errors,
we ascribe this phenomenon to numerical computation imprecision. It is known
that the addition of noise to a quiet signal can sometimes help a numerical algorithm
converge faster. We may also see from the graph that the implementation of a pipeline
architecture increases the relative error by about 0.1%.
Figure 4.6: Relative error for different configurations when noise is applied. reg: a
5X3 configuration using clean words; snr15: a 5X3 configuration using noisy words;
snr15,pipe: a 5X3 pipeline configuration using noisy words. The confidence interval
was set to 99%; the number of simulations was 5.
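One common way to build a noisy test word at a prescribed SNR is sketched below. This is not the thesis procedure, which is not detailed here; the function names, the fixed seed and the choice to rescale the noise so that the measured SNR matches the target are all assumptions.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return signal plus white Gaussian noise scaled to the requested SNR in dB."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Scale so that p_signal / (scale^2 * p_noise) equals 10^(snr_db / 10).
    scale = math.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB between a clean signal and its noisy version."""
    p_signal = sum(s * s for s in clean)
    p_error = sum((y - s) ** 2 for s, y in zip(clean, noisy))
    return 10.0 * math.log10(p_signal / p_error)
```

Repeating the call with different seeds gives independent noise realizations, which is one way to obtain several simulation runs per word.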
Another aspect of our research was to investigate the influence of a reduction in
the signal input data rate on the model. The input signals are sampled at 44.1 kHz,
a typical audio sample rate, and by linear interpolation we raise the sample rate to
about 1 MHz. This extremely high sample rate is dictated by the parameter ht in the
model. Since the model's algorithm is solved by a numerical method, it must use tiny
steps in order to converge and not diverge. The tuning curves of the filters
implemented in the algorithm may be very steep and change by tens of dB between two
consecutive sections. This high rate might be the algorithm's biggest disadvantage,
as the work-load is directly influenced by the high core data rate. Moreover, as we
seek to reach a real-time application, the high rate dictates a long processing time for
the computational algorithm.
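The rate conversion by linear interpolation can be sketched as follows. The function name and interface are assumptions; with ht = 2^-20 s the target rate would be 2^20 samples per second, which is the "about 1 MHz" mentioned above.

```python
def upsample_linear(x, fs_in, fs_out):
    """Linearly interpolate samples x (rate fs_in) onto a grid at rate fs_out.

    Requires at least two input samples.
    """
    n_out = int((len(x) - 1) * fs_out / fs_in) + 1
    y = []
    for k in range(n_out):
        t = k * fs_in / fs_out            # position in units of input samples
        i = min(int(t), len(x) - 2)       # index of the left neighbour
        frac = t - i
        y.append(x[i] + frac * (x[i + 1] - x[i]))
    return y
```

For example, `upsample_linear(word, 44100, 2 ** 20)` would produce roughly 23.8 output samples for every input sample of a recorded word.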
Figure 4.7: The relative error as a function of the time step parameter for the Hebrew
words inputs. (Plot "time step influence": relative error from 0% to 9% per word for
ht = 2^-17, 2^-18 and 2^-20.)
Figure 4.7 shows how the signal sample rate after interpolation influences
the relative error. We may see that it is possible to lower the core sample rate,
but the relative error will climb. The hardware model simulations began to suffer
from convergence issues when we increased ht.
4.3 Determining the Variables' Representation
The cochlea computational model simulated in software naturally uses floating
point arithmetic. In our C++ program all the variables and parameters are
represented in Double-Precision format. In the Double-Precision format, 64 bits are
partitioned into three parts, S, E and f:
S: 1 bit (sign) | E: 11 bits (exponent) | f: 52 bits (fraction)
The value of the floating-point number is:
\[
F = (-1)^{S} \cdot 1.f \cdot 2^{E-1023}
\]
Compared to the fixed-point representation, the range of representable floating-point
numbers is larger, but the precision is smaller.
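The three fields can be inspected directly. The sketch below uses Python's standard `struct` module to reinterpret a double as a 64-bit integer; the function names are hypothetical, and `encode_normal` covers only normalized numbers, which is the case the formula above describes.

```python
import struct


def decode_double(x):
    """Split a double into its sign S, biased exponent E and fraction f fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    s = bits >> 63                     # 1 sign bit
    e = (bits >> 52) & 0x7FF           # 11 exponent bits
    f = bits & ((1 << 52) - 1)         # 52 fraction bits
    return s, e, f


def encode_normal(s, e, f):
    """Value of a normalized double: (-1)^S * 1.f * 2^(E - 1023)."""
    return (-1.0) ** s * (1.0 + f / 2.0 ** 52) * 2.0 ** (e - 1023)
```

Decoding 1.0 gives S = 0, E = 1023 and f = 0, since 1.0 = 1.0 · 2^0.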
Our goal is to determine the necessary word-lengths for the transformation of the
floating point version of the model into a version suitable for a hardware
implementation, as shown in figure 4.8.
The main problem when converting floating point arithmetic to reduced floating
point or fixed point arithmetic is the determination of the necessary numerical
precision, that is, the word-length of the internal variable and parameter
representations. Therefore, the hardware model was recoded in C++ using a self developed
scalable data type. This data type takes the internal word-length as a parameter and
saves the values in exactly the same format as they would be saved in a register on
an ASIC or FPGA, so the numerical effects of imprecise arithmetic can be simulated.
To solve the problem of large word-length, the dynamic range of the signals had to
be limited by lower and upper bounds. This implied a change in the dynamic
behavior between the reduced floating point or fixed point version and the floating
point version. The effects of these changes on the behavior of the whole model were
investigated.
Figure 4.8: The representation of the variables and parameters flow chart. (Blocks:
Floating Point; limiter for upper and lower bound; quantizer; Reduced Floating Point;
Fixed Point.)
In the research work it was discovered that the dynamic range of the variables may
even reach about 1000dB. Figure 4.9 shows the basilar membrane velocity
vbm histogram for all the sections along the basilar membrane, obtained from the
calculation of the input signals.
Figure 4.9: Histogram of the basilar membrane velocity. (Plot "BM Velocity Histogram":
counts in units of 10^4 versus Vbm [dB], from −900 to 100 dB.)
The extremely large dynamic range of the variables is caused by the numerical
imprecision in the arithmetic. It was also found that most of the occurrences in the
variables' histograms fall within a range of 160 to 200 dB. We can clearly see in
Figure 4.9 that at around −100dB there is a "knee shape" where fewer occurrences
happen. This form of the graph lets us determine the lower and upper bounds shown
in Table 4.3 for all of the variables.
Though it is not noticeable in the variables' histograms, we also discovered that
the dynamic range is not uniform across the different sections. The sections close
to the apex, which represent low frequencies, have a reduced dynamic range, whereas
the sections close to the base have a large dynamic range. Figure 4.10 displays the
histograms of the acceleration for three different ranges of sections. We can also see
that the three histograms have different locations. We have disregarded this fact since
we seek a generic representation.
Figure 4.10: Histograms of the acceleration for three different ranges of sections.
(Plot "Acceleration Histogram for different sections": counts in units of 10^4 versus
acceleration [dB], from −100 to 150 dB, for sections 100−110, 250−260 and 400−410.)
The only method to determine the optimal word-length and to validate the correct
function of the model was to simulate various implementations with different
word-lengths in the hardware model and observe the influence of the word-length on the
performance of the application.
Now that the dynamic range of the variables was limited to no more than 200 dB, we
simulated the representation of the mantissa with different bit resolutions. The
exponent, which covers the 200 dB dynamic range, can be implemented with only 5 bits,
since

$$200\,\mathrm{dB} = 20\log_{10}(10^{10}) \approx 20\log_{10}(2^{32})$$
$$10^{10} \approx 2^{32}$$
$$32 = 2^{5}$$
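As a quick check of the arithmetic above, the following Python sketch computes the dynamic range spanned by an exponent field of a given width. The function name `exponent_range_db` is illustrative and not part of the thesis; the sketch assumes that the exponent simply scales values by powers of two over 2^bits binary orders of magnitude.

```python
import math


def exponent_range_db(exponent_bits: int) -> float:
    """Dynamic range in dB covered by 2**exponent_bits binary orders of magnitude."""
    exponent_span = 2 ** exponent_bits  # e.g. 5 bits -> 32 powers of two
    return 20.0 * exponent_span * math.log10(2.0)


if __name__ == "__main__":
    for bits in (4, 5, 6):
        print(bits, "exponent bits ->", round(exponent_range_db(bits), 1), "dB")
```

Under this assumption a 5-bit exponent spans roughly 193 dB, which corresponds to the approximation 10^10 ≈ 2^32 used above.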
Variable  Minimum    Maximum   Lower bound  Upper bound  dB
vbm       1.4e-45    7.38e2    1.5e-7       1.5e3        200
v′bm      1.87e-53   1.8e7     4.0e-3       4e7          200
ξbm       1.75e-62   3.47e1    7.0e-9       7.0e1        200
Pohc      2.17e-60   2.14e7    2.0e-6       2.0e3        180
P′ohc     3.69e-55   4.43e8    2.0e-3       2.0e7        200
g         1.0e-57    6.0e5     1.0e-4       1.0e6        200
Pbm       4.41e-57   7.55e1    1.5e-8       1.5e2        200
Table 4.3: Lower and upper bounds for the cochlea hardware model variables.

The mantissa was simulated for different numbers of bits, starting from 52 and going
down to 22 bits. As seen in Figure 4.11, a significant change in the output relative
error is noticeable when the mantissa resolution is reduced from 25 to 24 bits. (The
constant multiplicands were represented by 35 bits in that figure.)
The representation of the hardware model variables was changed to a reduced
floating-point version in which the exponent is represented by 5 bits and the mantissa
by 25 bits. This word-length definition was verified in the hardware model FPGA
simulator (Chapter 5).
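The effect of shortening the mantissa can be illustrated with a small Python sketch. It is not the thesis's C model: the helper names are hypothetical, rounding to nearest is an assumption, and the exponent is left unrestricted so that only the mantissa width is studied.

```python
import math


def quantize_mantissa(x: float, mantissa_bits: int) -> float:
    """Round x to a mantissa of `mantissa_bits` bits (exponent left unrestricted)."""
    if x == 0.0 or not math.isfinite(x):
        return x
    mantissa, exponent = math.frexp(x)  # x = mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    scale = 2 ** mantissa_bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)


def relative_error(x: float, mantissa_bits: int) -> float:
    """Relative error introduced by quantizing a single non-zero value."""
    return abs(quantize_mantissa(x, mantissa_bits) - x) / abs(x)


if __name__ == "__main__":
    for bits in (52, 35, 25, 24, 22):
        print(bits, "bits:", relative_error(math.pi, bits))
```

For a single value the relative rounding error is bounded by 2^-bits; the error reported in Figure 4.11 is a property of the whole iterative model and cannot be reproduced by this per-value bound.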
Figure 4.11: The relative error for different quantization of the model variables for
synthesized signals and Hab words. The constants are quantized to 35 bits. reg is the
reference for double precision; we use the 5x3 configuration. (Panel title: quantizing
the parameters; vertical axis: relative error. First panel: chirp, sin, click and mitz
inputs for reg, 24, 25, 30, 35 and 40 bits. Second panel: the words buz, chug, dov,
eich, kir, la, mitz, pas, shen and tof for reg, 35, 25, 24 and 22 bits.)

A fixed-point word-length was also investigated. It is assumed that if the cochlea
hardware model variables are bounded to 200 dB, then a 32-bit fixed-point
representation would be enough. The fixed-point version was implemented on the
hardware model with the delta architecture. The dynamic range was first limited by
lower and upper bounds, and was reduced to a range of 140 to 160 dB, as seen in
Table 4.4. The fixed-point word-length was determined for each variable by simulating
various implementations with different word-lengths until reaching our very tight
restriction of not exceeding a 1% relative error. Table 4.4 lists all the model's
variables. For each variable the lower and upper bounds and the required resolution
are given. It is clear from Table 4.4 that the maximum number of bits is 31.

Variable  Lower bound  Upper bound  dB   Resolution  No. of bits
d input   2^-17        2^-3         80   2^-18       21
d ξbm     2^-34        2^-10        140  2^-36       26
d vbm     2^-20        2^6          140  2^-22       28
d Pohc    2^-24        2^4          160  2^-25       29
d g       2^-10        2^16         140  2^-12       28
d Pbm     2^-24        2^2          140  2^-26       28
d P′ohc   2^-10        2^19         160  2^-12       31
d v′bm    2^-10        2^20         160  2^-11       31
vbm       2^-17        2^11         160  2^-19       30
v′bm      2^-4         2^25         160  2^-6        31
P′ohc     2^-4         2^24         160  2^-6        30
Table 4.4: Fixed-point representation for the hardware model with the delta and 5x3
architecture. The constant parameters are represented by 25-bit fixed-point.
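To make the roles of the resolution and upper-bound columns of Table 4.4 concrete, here is a minimal Python sketch of a fixed-point quantizer. The function `to_fixed_point` and its saturation behavior are assumptions made for illustration; the thesis does not specify how values outside the bounds are handled in the fixed-point version.

```python
def to_fixed_point(x: float, resolution_exp: int, upper_exp: int) -> float:
    """Quantize x to a step of 2**resolution_exp, saturating at +/- 2**upper_exp."""
    step = 2.0 ** resolution_exp
    limit = 2.0 ** upper_exp
    quantized = round(x / step) * step
    # Saturation is an assumption of this sketch, not a documented design choice.
    return max(-limit, min(limit - step, quantized))


if __name__ == "__main__":
    # Example using the vbm row of Table 4.4: resolution 2^-19, upper bound 2^11.
    for value in (0.123456789, 1.0e-7, 5000.0):
        print(value, "->", to_fixed_point(value, -19, 11))
```

Values smaller than half a step collapse to zero, and values beyond the upper bound are clipped, which is why both bounds had to be chosen from the histograms before fixing the word-length.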
4.4 Timing Analysis
To analyze the timing of the system, we begin by introducing its basic building
blocks. Figure 4.12 shows the flow of the system. The processor, aimed at hearing-aid
devices, may also carry out speech enhancement before a vocoder in cellular
communication. The input data is processed in parallel; such a processor is called a
SIMD (single instruction, multiple data) processor [18]. The reconstruction algorithm
gathers the data processed in parallel into one output signal.
Figure 4.12: A basic flow of the system (blocks: A/D, SIMD processor, reconstruction,
D/A, vocoder).
The basic architecture of the processor is shown in Figure 4.13. The processor is
composed of N ALUs, where N is the number of sections, currently N = 448. The N ALUs
run in parallel, as the algorithm was fully parallelized. The ALUs perform two
operations: addition, and multiplication by 24-bit constants. The computations can be
done in 32-bit fixed-point arithmetic or, alternatively, in a reduced floating-point
representation consisting of a 25-bit mantissa and a 5-bit exponent. The constant
values are stored in the memory block located on the right side of Figure 4.13. The
memory contains 7 constant values of 24 bits for each of the N sections. On the left
side is the register file, which holds the variables' data. Under the fixed-point
representation, we keep 32-bit numbers for the past and current time-point values of
the variables. For each of the N sections we use 7 variables. The controller of the
processor manages the processing algorithm.
Figure 4.13: The processor architecture (blocks: register file, 32x2x7 xN; ALU;
constant parameter table, 24x7xN; controller).
The processor core data frequency is 1/ht, independent of the input sample rate,
which usually ranges between 8 and 44.1 kHz. The core samples are obtained by
interpolating the input signal samples, since a higher data resolution is needed. In
our simulations we use ht = 1 µs, which leads to a core data rate of about 1 MHz.
Simulations showed that ht could be increased up to 7.6 µs, which results in a core
data frequency of 131 kHz. For the hardware design to comply with real-time operation,
the workload summarized in Table 3.8 must be executed within the time frame of ht. If
the processing of a time point takes longer than the core sample period, a backlog of
pending samples starts to grow until the processor overflows. Since the algorithm's
stages are causal, we derived the critical path of the algorithm (shown in Table 3.10)
for ht as a function of Titer and Piter.
Total latency of critical path
additions         2TP + 5T
multiplications   TP + T
Table 4.5: Number of operations for the critical path (T = Titer, P = Piter).
These numbers of operations, additions and multiplications by a constant value, must
be processed within a time frame of ht. This analysis does not take a pipelined
implementation into account. We could plan a pipeline architecture built from Titer
identical stages. With Titer stages, a new input sample can be accepted after each
stage, so the processing rate increases by a factor of Titer; equivalently, the
critical path is divided by Titer. A pipeline architecture, also called a linear array
topology, certainly makes it easier to reach real-time operation but introduces an
area problem: the logic is duplicated, so the silicon area increases, and more silicon
means higher manufacturing cost. Note that the total processing delay of a data sample
does not change.
We minimize the computational latency by implementing fast addition and high-speed
binary multiplication.
4.4.1 Fast Addition
The simplest adder is a ripple-carry adder, but it must wait for the carry to
propagate. The carry look-ahead adder is the most commonly used scheme for
accelerating carry propagation: it generates all incoming carries in parallel and
avoids the need to wait until the correct carry propagates from the stage (FA) of the
adder where it was generated. It provides a logarithmic speed-up.
The number of gates along the critical (longest) path (in other words, the number
of circuit levels) determines the execution time of the algorithm.
In full-custom VLSI technology the exact number of gates has a very limited effect
on the implementation cost. Instead, the regularity of the design and the length of
the interconnections are considerably more important, since they affect both the
silicon area used by the adder and the design time. In our case, as seen in
Figure 4.13, the basic ALU units are duplicated, making the design modular, and the
number of interconnects is small since we only have connections between adjacent ALUs.
The two factors (i.e., implementation cost and speed) do not necessarily reach their
minimum in the same design, so a tradeoff between them might have to be found. There
are many techniques for implementing a carry-propagate adder. Some of the popular ones
are the carry look-ahead adder, the conditional-sum adder, the Manchester adder and
the carry-skip adder, which has recently become popular [23]. In VLSI technology the
carry-skip adder is comparable in speed to the carry look-ahead technique while
requiring less chip area and consuming less power.
In multiplication, when three or more operands are to be added simultaneously, we
use carry-save addition. In carry-save addition, we let the carry propagate only in
the last step, while in all the other steps we generate a partial sum and a sequence
of carries separately. Thus, a carry-save adder (CSA) accepts three n-bit operands and
generates two n-bit results, an n-bit partial sum and an n-bit carry. A second CSA
accepts these two bit sequences and another input operand, and generates a new partial
sum and carry. A CSA is therefore capable of reducing the number of operands to be
added from 3 to 2 without any carry propagation. A better way to organize the CSAs and
reduce the operation time is in the form of a tree, commonly called a Wallace tree
[14]. In this tree, the number of operands is reduced by a factor of 2/3 at each
level. Consequently,

$$\text{Number of Levels} \approx \frac{\log(k/2)}{\log(3/2)}$$

where k is the number of operands to be summed. In our case, we need 5 levels of CSAs
and one more CPA to sum 13 operands.
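The carry-save reduction described above can be sketched in a few lines of Python. This is a behavioral model on non-negative integers, not a gate-level design; the function names are illustrative, and the grouping of operands into triples is one simple choice among several possible tree arrangements.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple:
    """3-to-2 reduction: returns (partial_sum, carry) with a + b + c == sum + carry."""
    partial_sum = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum, carry


def wallace_reduce(operands) -> tuple:
    """Reduce operands to at most two numbers; returns (remaining, csa_levels)."""
    current = list(operands)
    levels = 0
    while len(current) > 2:
        reduced = []
        full = len(current) - len(current) % 3
        for i in range(0, full, 3):
            reduced.extend(carry_save_add(*current[i:i + 3]))
        reduced.extend(current[full:])  # leftovers pass through to the next level
        current = reduced
        levels += 1
    return current, levels


if __name__ == "__main__":
    remaining, levels = wallace_reduce(range(1, 14))  # 13 operands
    print("levels:", levels, "final CPA result:", sum(remaining))
```

The final `sum(remaining)` plays the role of the single carry-propagate addition; everything before it is carry-propagation free.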
Carry-save addition can be very useful in our design since we can represent the
variables and store them in carry-save format. The final CPA (carry-propagate adder)
stage, which also takes more time, could then be omitted. In Figure 4.14 we unrolled
the hardware algorithm for one time iteration. The basic building blocks of adders and
multipliers are illustrated.
4.4.2 High Speed Multiplication
There are two ways to speed up multiplication: reduce the number of partial products
or accelerate their accumulation [23]. All multiplication methods share the same basic
procedure, addition of a number of partial products. The simple methods are easy to
implement, but the more complex methods are needed to obtain the fastest possible
speed.
The simplest method of adding a series of partial products is based on an
adder-accumulator. This is relatively slow, because adding N partial products requires
N clock cycles. A faster version of the basic iterative multiplier adds more than one
operand per clock cycle by connecting multiple adders and partial-product generators
in series.
When a number of partial products are to be added, the adders need not be connected
in series, but can instead be connected to maximize parallelism. This requires no more
hardware than a linear array, but does have more complex interconnections. The time
required to add N partial products is now proportional to log N, so this can be much
faster for larger values of N [8]. On the down side, the extra complexity in the
interconnection of the adders may contribute additional size and delay. Probably the
single most important advance in improving the speed of multipliers, pioneered by
Wallace [14], is the use of carry-save adders to add three or more numbers in a
redundant, carry-propagate-free manner. By applying the basic three-input adder
recursively, any number of partial products can be added and reduced to 2 numbers
without a carry-propagate adder. A single carry-propagate addition is needed only in
the final step, to reduce the 2 numbers to a single final product.
All of the different methods of implementing integer multipliers reduce to two basic
steps: create a group of partial products (PP), then add them up to produce the final
product. There are a number of different methods for producing partial products. The
simplest partial-product generator produces N partial products, where N is the length
of the input operands. A recoding scheme introduced by Booth [5], and also explained
in Appendix C, reduces the number of partial products by a factor of two. Since the
amount of hardware and the delay depend on the number of partial products to be added,
this may reduce the hardware cost and improve performance. In our circuit
implementation, we have to be able to generate the following partial products: 0, X,
−X, 2X and −2X, where X is the multiplicand. We can obtain these easily by including
circuits for negating and for shifting left by one bit position. In our case, the
multipliers are 24-bit constants, which can be recoded by Booth's algorithm to create
only 12 partial products.
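A minimal sketch of radix-4 (modified) Booth recoding shows how a 24-bit constant yields 12 partial products drawn from 0, ±X and ±2X. The function names are hypothetical, and the code assumes a two's-complement multiplier with an even number of bits; Appendix C remains the reference for the scheme actually used.

```python
def booth_radix4_digits(multiplier: int, bits: int = 24) -> list:
    """Radix-4 Booth digits in {-2,-1,0,1,2}, least significant first."""
    if bits % 2:
        raise ValueError("bits must be even for radix-4 recoding")
    multiplier &= (1 << bits) - 1  # view the constant as a two's-complement bit field
    digits = []
    previous_bit = 0
    for i in range(0, bits, 2):
        low = (multiplier >> i) & 1
        high = (multiplier >> (i + 1)) & 1
        digits.append(low + previous_bit - 2 * high)
        previous_bit = high
    return digits


def booth_multiply(x: int, multiplier: int, bits: int = 24) -> int:
    """Sum the partial products digit * x * 4**i selected by the Booth digits."""
    digits = booth_radix4_digits(multiplier, bits)
    return sum((digit * x) << (2 * i) for i, digit in enumerate(digits))


if __name__ == "__main__":
    constant = 0x123456
    print(booth_radix4_digits(constant))
    print(booth_multiply(1000, constant) == 1000 * constant)
```

Each digit selects one of the five partial products listed above, so only negation and a one-bit left shift of X are needed to generate them.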
Our multiplication operation thus translates to the summation of 12 partial products
of 32 bits, which requires 5 levels of carry-save adders (CSAs) and, when a single
output is wanted, one carry-propagate adder (CPA) at the final step. Using a Wallace
tree or another form of adder tree requires at least 4 CSAs in parallel at the first
level. Each of the 4 CSAs has 3 inputs, for a total of 12 inputs. In this way, 12
partial products of 32 bits can be added in parallel. The basic block of our ALU
consists of 32 × (4 CSAs) and a 32-bit CPA.
The delay of a carry-save adder, which is named "CMPR32" in the TSMC 0.18 µm process
standard cell library [1], is about 0.35 ns, depending on the load capacitance. The
delay is calculated in Appendix D, Equation D.0.2. We assume that the total delay of a
multiplier includes 5 levels of carry-save adders and, in the cases where a single
output is wanted, one carry-propagate adder. Since a CPA can be implemented with
logarithmic delay, as in the carry look-ahead adder, we bound its number of delay
levels by log2(32) = 5. We therefore bound the total delay of our multiplier by the
following equation:

$$T_{mult} = 10 \cdot T_{add} \qquad (4.4.1)$$
The numbers of additions and multiplications in the critical (longest) path of the
hardware model were given in Table 4.5. Using the ratio given in Equation 4.4.1, we
compute the total number of additions in the critical path:

$$\text{Total Additions} = T_{iter}(2P_{iter} + 5) + 10 \cdot T_{iter}(P_{iter} + 1) = T_{iter}(12P_{iter} + 15)$$

This should be done in a time period of ht. Thus, the delay of one addition must not
exceed Tadd, which is a function of ht, Titer and Piter:

$$T_{add} = \frac{h_t}{T_{iter}(12P_{iter} + 15)} \qquad (4.4.2)$$

and the frequency of the addition, fadd, is given by:

$$f_{add} = 1/T_{add}$$

In the case of a pipeline architecture we multiply Tadd by Titer.
Table 4.6 lists Tadd and fadd for different configurations of the hardware model as a
function of ht, Titer and Piter.

Configuration  N    ht [s]  Titer  Piter  Tadd [ns]  fadd [MHz]  avg. rel. err. [%]
1              448  2^-20   3      5      4.24       235.93      0.5
2              448  2^-20   3      3      6.23       160.43      1.2
3              448  2^-20   2      3      9.35       106.95      1.4
4              448  2^-18   3      5      16.95      58.98       1.8
5              448  2^-17   3      5      33.91      29.49       6.4
6              448  2^-17   2      3      74.80      13.37       12
Table 4.6: Tadd and fadd for different hardware model configurations.
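The entries of Table 4.6 follow directly from Equation 4.4.2. The short Python sketch below recomputes them; the function names are illustrative, and the pipelined case is not modeled.

```python
def critical_path_additions(t_iter: int, p_iter: int) -> int:
    """Equivalent number of additions on the critical path, with Tmult = 10 * Tadd."""
    return t_iter * (12 * p_iter + 15)


def t_add_ns(ht_seconds: float, t_iter: int, p_iter: int) -> float:
    """Maximum allowed delay of one addition, in nanoseconds (Equation 4.4.2)."""
    return ht_seconds / critical_path_additions(t_iter, p_iter) * 1e9


def f_add_mhz(ht_seconds: float, t_iter: int, p_iter: int) -> float:
    """Addition frequency in MHz, fadd = 1 / Tadd."""
    return 1e3 / t_add_ns(ht_seconds, t_iter, p_iter)


if __name__ == "__main__":
    configurations = [(2**-20, 3, 5), (2**-20, 3, 3), (2**-20, 2, 3),
                      (2**-18, 3, 5), (2**-17, 3, 5), (2**-17, 2, 3)]
    for ht, t_iter, p_iter in configurations:
        print(t_iter, p_iter,
              round(t_add_ns(ht, t_iter, p_iter), 2),
              round(f_add_mhz(ht, t_iter, p_iter), 2))
```

Running the loop reproduces the Tadd and fadd columns of the table for the six configurations.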
To conclude, in the worst case the delay of one addition must not exceed
Tadd = 4.24 ns. This is certainly feasible, as we saw that the delay of a 3-to-2 adder
("CMPR32"), or FA, is about 0.35 ns.
Our analysis is based on the data of a 0.18 µm process library. Today, semiconductor
companies such as Intel and IBM are moving to smaller, faster and less power-consuming
technologies such as 0.13 µm and 0.09 µm. We can assume that the performance, timing
and power consumption of our hardware design would improve by about 20 to 30% with the
new technology.
Figure 4.14: ASIC design unrolled (one time step of 1 µs, replicated for the 448
sections: registers, adders and constant multipliers of the Eunit, Gunit, Punit, Dunit
and MEunit; the variables disp, veloc, acce, ohcp, ohcp_d, g and pressure are held for
the past (n) and next (n+1) time points with a 25-bit mantissa and a 5-bit exponent).
4.5 Power Consumption Analysis
Power consumption is a most important issue for portable hearing aids and mobile
communication, since the energy available in portable devices is limited. In this
section we analyze the power consumption of our processor.
Power dissipation depends on the power-supply voltage, the frequency of operation,
the internal capacitance and the output load. We used the TSMC 0.18 µm process
standard cell library to calculate the power consumption of a 3-to-2 counter called
"CMPR32", which is equivalent to a FA. The standard cell library is designed to
dissipate only AC power. The power dissipation is primarily a function of the
switching frequency of the design's internal nets. These nets include the inputs and
outputs of each cell and the capacitive load associated with the output of each cell.
The power dissipated by each cell according to the TSMC 0.18 µm process [1] is:

$$P_{avg} = \sum_{n=1}^{x} (E_{in} \cdot f_{in}) + \sum_{n=1}^{y} \left(C_{on} \cdot V_{dd}^{2} \cdot \tfrac{1}{2} f_{on}\right) + E_{os} \cdot f_{o1} \qquad (4.5.1)$$
where,
• Pavg = average power (µW ).
• x = number of input pins.
• Ein = energy associated with nth input pin (µW/MHz).
• fin = frequency at which the nth input pin changes state during the normal
operation of the design (MHz).
• y = number of output pins.
• Con = external capacitive loading on the nth output pin, including the capacitance
of each input pin connected to the output driver, plus the route wire capacitance,
actual or estimated (pF).
• Vdd = operating voltage.
• fon = frequency at which the nth output pin changes state during the normal
operation of the design (MHz).
• Eos = energy associated with the output pin for sequential cells only (µW/MHz).
To calculate the power consumption of the "CMPR32" cell, we used Equation 4.5.1 and
the data from [1]. We assumed that fin and fon are the same and, statistically, equal
to half of fadd.

$$P_{avg} = (0.1126 + 0.1450 + 0.06) f_{in} + (0.005 \cdot 1.8^{2} \cdot \tfrac{1}{2} \cdot f_{on}) = 0.167 \cdot f_{add}(\mathrm{MHz}) \ [\mu W] \qquad (4.5.2)$$

The average power consumption of "CMPR32" is 0.167 [µW/MHz].
We examined the issue of power consumption using two approaches. Our first approach
is from the implementation point of view. In the previous section we identified the
basic components of our ALU and its frequency, fadd (Equation 4.4.2). The power
consumption of the combinatorial logic is approximated by the average number of adders
working at frequency fin = fadd, times the number of bits of the operands (32) and the
number of sections N. As discussed in the previous section, the first level of the
multiplier consists of 4 CSAs and the fifth level consists of 1 CSA. We chose the
average number of adders working at the same time for 1 bit to be 2.

$$P_{avg} = 0.167 \cdot f_{in}(\mathrm{MHz}) \cdot N \cdot 32_{bit} \cdot 2_{csa} \ [\mu W] \qquad (4.5.3)$$
As seen from Equation 4.5.3, Pavg is a function of N and fin, which in turn is a
function of ht, Titer and Piter. We calculated the power consumption of the processor
for different configurations using Equation 4.5.3, as seen in Table 4.7. The different
configurations change the power consumption over a range from 1.13 down to 0.06 W. We
must take into account that the relative error grows when fewer iterations are
applied. Figure 4.15 shows the relationship between the estimated power consumption
and the relative error of the hardware model simulation, depending on its
configuration.

Configuration  N    ht [s]  Titer  Piter  fadd [MHz]  Pavg [W]  avg. rel. err. [%]
1              448  2^-20   3      5      235.93      1.13      0.5
2              448  2^-20   3      3      160.43      0.77      1.2
3              448  2^-20   2      3      106.95      0.51      1.4
4              448  2^-18   3      5      58.98       0.28      1.8
5              448  2^-17   3      5      29.49       0.14      6.4
6              448  2^-17   2      3      13.37       0.06      12
Table 4.7: Power consumption for different hardware model configurations.

Figure 4.15: Estimated power consumption vs. relative error (horizontal axis: average
relative error (%); vertical axis: estimated power consumption (W); series: first
approach, second approach and a fitted pattern labeled y = 0.6639 x^(-0.9071)).
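The first-approach estimate of Equation 4.5.3 is a single product, sketched here in Python with the fadd values of Table 4.7 as input. The constant and function names are illustrative; the default arguments restate the values given in the text (448 sections, 32-bit operands, an average of 2 active adders per bit).

```python
CMPR32_UW_PER_MHZ = 0.167  # average power of one "CMPR32" cell, from Equation 4.5.2


def power_first_approach_w(f_add_mhz: float, n_sections: int = 448,
                           word_bits: int = 32, avg_adders_per_bit: int = 2) -> float:
    """Estimated processor power in watts according to Equation 4.5.3."""
    microwatts = (CMPR32_UW_PER_MHZ * f_add_mhz * n_sections
                  * word_bits * avg_adders_per_bit)
    return microwatts * 1e-6


if __name__ == "__main__":
    for f_add in (235.93, 160.43, 106.95, 58.98, 29.49, 13.37):
        print(f_add, "MHz ->", round(power_first_approach_w(f_add), 2), "W")
```

The printed values agree with the Pavg column of Table 4.7 to the two decimals shown there.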
Our second approach to approximating the processor's power consumption is to look
at the total workload that needs to be processed in a time period of ht by 32 full
adders. We know from Table 3.8 that the total numbers of additions and multiplications
are:

Additions         (2 Piter Titer + 12 Titer − 3)(N − 1)
Multiplications   (Titer Piter + 4 Titer)(N − 1)

We assume that the multiplications are done serially by a multiply-accumulator (MAC)
and that it takes 12 steps to sum 12 partial products with 32 FAs. Using the
assumption

$$T_{mult} = 12 \cdot T_{add} \qquad (4.5.4)$$

we compute the theoretical frequency and the power consumption (with T = Titer and
P = Piter):

$$f_{theo} = \left[(12T + 2PT - 3)N + 12N(4T + TP)\right]/h_t \ [\mathrm{MHz}]$$
$$P_{avg} = 32 \cdot 0.167 \cdot f_{theo}(\mathrm{MHz}) \ [\mu W] \qquad (4.5.5)$$

The second approach yields almost the same average power consumption for all the
hardware model configurations, as seen in Figure 4.15.
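Equation 4.5.5 can be evaluated in the same way. The sketch below is an illustration only: the function names are hypothetical, ht is passed in seconds and converted so that the frequency comes out in MHz, and N is used as written in Equation 4.5.5 rather than N − 1.

```python
def f_theo_mhz(ht_seconds: float, n_sections: int, t_iter: int, p_iter: int) -> float:
    """Theoretical full-adder frequency in MHz, with Tmult = 12 * Tadd."""
    additions = (12 * t_iter + 2 * p_iter * t_iter - 3) * n_sections
    multiplications_as_additions = 12 * n_sections * (4 * t_iter + t_iter * p_iter)
    return (additions + multiplications_as_additions) / ht_seconds * 1e-6


def power_second_approach_w(ht_seconds: float, n_sections: int,
                            t_iter: int, p_iter: int) -> float:
    """Estimated power in watts for 32 full adders at 0.167 uW/MHz each."""
    return 32 * 0.167 * f_theo_mhz(ht_seconds, n_sections, t_iter, p_iter) * 1e-6


if __name__ == "__main__":
    for ht, t_iter, p_iter in [(2**-20, 3, 5), (2**-17, 2, 3)]:
        print(t_iter, p_iter, round(power_second_approach_w(ht, 448, t_iter, p_iter), 3), "W")
```

For the first configuration of Table 4.7 this gives roughly 0.97 W, and for the last one roughly 0.06 W, which is of the same order as the first-approach figures.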
Chapter 5
FPGA Design and Simulation
In this chapter we present an implementation of the hardware model algorithm in a
field programmable gate array (FPGA) [2] simulator, since an ASIC design was not
possible within the university framework.
FPGA technology has the following advantages over analog and digital VLSI (ASICs):
shorter design time, faster fabrication time, greater robustness to power supply,
temperature and transistor mismatch variations, wider dynamic range and higher
signal-to-noise ratios, better stability, reusability of the chips for different
applications, and a simpler interface.
5.1 The Design
The design architecture chosen for the implementation of the hardware model
algorithm was dictated by the FPGA size. Since the number of cells in the FPGA is
limited, it was impossible to create a fully parallelized architecture with 448 ALUs
in parallel. The 448 sections were therefore divided into 14 segments of 32 sections
each. Our implementation included 32 ALUs in parallel, which could compute only 32
sections at a time. The ALUs are able to multiply, add and shift. In each clock cycle
one of these operations is performed; there are no dead cycles. It was important to us
that the design be generic and that parameters could be changed easily. The design
works in a SIMD (single instruction, multiple data) topology. The ALUs operate
according to a configured instruction code. The FPGA design, based on parallel ALUs,
is shown in Figure 5.1 [6].
Figure 5.1: A scheme of the ALU architecture implemented in the FPGA (blocks: REG
bank A and REG bank B, each (16x448/ALU_NUM) x (ALU_NUM x Sample Width); ALU; PROGRAM;
CLK; G edge manipulation; P shift; data path of ALU_NUM x Sample Width).
We hold the 16 parameters and variables for each of the 448 sections in two register
banks. The contents of the register banks are displayed in Table 5.1.

Address  Left Memory Bank                  Right Memory Bank
0        Tmp2                              Tmp1
1        Prev veloc                        Veloc
2        Prev disp                         Disp
3        Prev acce                         Acce
4        Prev ohcp                         Ohcp
5        Prev ohcp derv                    Ohcp derv
6        P                                 P
7        −K/2 (const. vector)              G
8        −RK (const. vector)               ht (constant)
9        −K/C (const. vector)              2^20
10       −1/dx^2 (constant)                ht/2 (constant)
11       −1/(dx^2 + K) (const. vector)     −21
12       −w0
13       −Kohc w1 (const. vector)
14       Kohc
15       0 (const)
Table 5.1: The contents of the FPGA Register Banks.
The number of clock cycles needed to compute one time sample is given by the
following expression:

$$\text{Clock Cycles} = \frac{448}{ALU\_NUM}\left(PreE + E + T_{iter}P_{iter} \times P + T_{iter}(G + D) + (T_{iter} - 1)ME\right) \qquad (5.1.1)$$

where Piter and Titer are the Punit and time iterations, respectively, ALU_NUM is the
number of ALUs placed in parallel, and E, G, P, D, ME and PreE are the numbers of
clock cycles (instructions) of each unit in the algorithm. The number of clock cycles
(CCs) depends on the number of instructions of every unit in the algorithm. The ALU
commits one instruction every clock cycle. The complete instruction code and the
number of instructions per unit are given in Appendix E. We summarize the number of
clock cycles per unit in Table 5.2.

Unit        Clock Cycles
Eunit       6
Gunit       5
Punit       4
Dunit       9
MEunit      9
Pre-Eunit   5
Table 5.2: Number of instructions per unit in the FPGA design.
Using Equation 5.1.1 and Table 5.2, the number of clock cycles needed for our basic
configuration, where ALU_NUM = 32, Titer = 3 and Piter = 5, is 1834 CCs. The period of
the clock cycle was determined by the longest instruction (operation). If ALU_NUM were
448, it would take 131 CCs to compute a time point. In this case, a real-time
application needs to run at a frequency of 131 MHz when ht = 1 µs.
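The cycle counts quoted above can be recomputed from Equation 5.1.1 and Table 5.2 with the following Python sketch. The dictionary keys and function names are illustrative, and the sketch assumes that the number of sections is an exact multiple of ALU_NUM.

```python
# Clock cycles (instructions) per unit, taken from Table 5.2.
UNIT_CYCLES = {"PreE": 5, "E": 6, "G": 5, "P": 4, "D": 9, "ME": 9}


def clock_cycles(alu_num: int, t_iter: int, p_iter: int,
                 n_sections: int = 448, cycles: dict = UNIT_CYCLES) -> int:
    """Clock cycles needed to compute one time sample (Equation 5.1.1)."""
    per_segment = (cycles["PreE"] + cycles["E"]
                   + t_iter * p_iter * cycles["P"]
                   + t_iter * (cycles["G"] + cycles["D"])
                   + (t_iter - 1) * cycles["ME"])
    return (n_sections // alu_num) * per_segment


def required_clock_mhz(cycles_per_sample: int, ht_microseconds: float = 1.0) -> float:
    """Clock frequency needed to finish one time sample within ht."""
    return cycles_per_sample / ht_microseconds


if __name__ == "__main__":
    for alu_num in (32, 448):
        cc = clock_cycles(alu_num, t_iter=3, p_iter=5)
        print(alu_num, "ALUs:", cc, "CCs ->", required_clock_mhz(cc), "MHz at ht = 1 us")
```

With 32 ALUs the same real-time constraint would require 14 times the clock rate of the fully parallel case, which is the gap the feasibility discussion in Section 5.2 addresses.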
Figure 5.2 shows a waveform of the controller state machine which controls the ALUs.
The main states of the state machine are the basic units of the hardware algorithm:
Eunit, Gunit, Punit, Dunit, MEunit and PREEunit. Each state is composed of several
instructions, as explained in Appendix E. The number of the instruction being
committed is determined by the InstructionNum counter. RepetitionNum represents the
number of the segment being processed, between 0 and 448/ALU_NUM. The parameters
time-iteration (Titer) and punit-iteration (Piter) are also displayed.

Figure 5.2: A waveform of the state machine of the FPGA design.
The reduced floating-point definition determined in the hardware model was used for
the FPGA design. The parameters and variables are represented by a 25-bit mantissa and
a 6-bit exponent.
5.2 Simulation Results
The cochlea implementation was verified by comparing the results of the VHDL
simulator with the results of the hardware model in C using the reduced floating-point
representation. The input signals tested in the simulations were a 4 kHz sine wave and
a chirp sweeping from 500 Hz up to 8 kHz. The input signals were prepared beforehand
and applied at a rate of 1 mega-sample per second.
We calculated the relative error between the VHDL and C implementations using
Equation 4.0.1. Table 5.3 summarizes the relative errors for the tested signals.

signal   relative error [%]
chirp    0.0551
sin4k    0.0462
Table 5.3: C vs. VHDL implementations relative errors.
Figure 5.3 shows the energy of the output signal per section for the chirp input
signal. The energy is calculated by:

$$\text{Energy}(x) = \sum_{n=0}^{T} v_{bm}^{2}(x, n) \qquad (5.2.1)$$

where vbm is the output signal matrix, x is the section number and T is the duration
of the signal (number of samples). So close is the fit of the FPGA result to the
hardware model written in C that one can scarcely tell that two lines are plotted
rather than one. At about section 350 we can see a rapid drop in the energy values of
the C implementation. This change occurs because the upper and lower limiters
implemented in the C code were not included in the FPGA design. Thus, at the higher
sections, which represent the lower frequencies, the amplitudes are low, and for many
time points the C hardware model zeros the velocity values for these sections.

Figure 5.3: The energy of the output for different sections for the chirp input (plot
title: Signal Energy vs. section; horizontal axis: section number; vertical axis:
Energy [dB]; green, solid: C implementation; blue, dotted: VHDL implementation).
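Equation 5.2.1 amounts to summing the squared velocity over time for each section. A minimal Python sketch follows; the function name is illustrative, the input is assumed to be a list of per-section sample lists, and the conversion to dB as 10·log10 of the energy is an assumption, since the text does not state the reference used for the dB axis of Figure 5.3.

```python
import math


def section_energy_db(v_bm) -> list:
    """Energy per section in dB; v_bm[x][n] is the velocity of section x at sample n."""
    energies_db = []
    for samples in v_bm:
        energy = sum(v * v for v in samples)  # Equation 5.2.1
        # Sections whose output was zeroed by a limiter have no finite dB value.
        energies_db.append(10.0 * math.log10(energy) if energy > 0.0 else float("-inf"))
    return energies_db


if __name__ == "__main__":
    example = [[1.0, 1.0], [0.0, 0.0], [10.0]]
    print(section_energy_db(example))
```

A section whose samples were all zeroed, as happens in the C model at the higher sections, shows up as minus infinity here, which corresponds to the sharp drop visible in the plot.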
The implementation of the hardware model definition in the FPGA simulator was
verified against the C implementation. We saw that the plotted lines of the VHDL and C
implementations coincide. The slight differences between the two implementations are
caused by the floating-point rounding scheme and the lower-bound limiter applied in
the C code.
To check the feasibility of the implementation in an FPGA, we synthesized the VHDL
code. An FPGA is composed of basic logic cells built from LUTs (look-up tables) and
FFs (flip-flops). Recent FPGAs also include dedicated modules such as multipliers
(18x18 bits), MACs (multiply-accumulators), memories and embedded CPUs (ARM in Altera
and PowerPC in Xilinx). An estimate of the hardware resources needed for an adder and
a multiplier, displayed in Table 5.4, was derived by synthesizing these components.

Component            Logic                     Note
Adder                1316 LCs                  LC = LUT + FF; FF not in use.
Multiplier (25x25)   4 MULT(18x18), 590 LCs    Using special-purpose multiplier.
Table 5.4: Amount of logic needed for an adder and multiplier in FPGA.
To implement an architecture with ALU_NUM = 448, 448 multipliers of 25x25 bits are
required. Two of the most recent FPGAs, the Xilinx Virtex2Pro (XC2VP100) and the
Altera Stratix2 (EP2S180), contain 111 and 96 multipliers, respectively. For the
algorithm to run in real time, we must either have 4 to 5 FPGAs in parallel or
multiply the FPGA's basic frequency, which is 131 MHz, by 4 or 5. The floating-point
multiplier module in the Xilinx Virtex2Pro family may work at 100 MHz, but since we
need a 25x25 bit multiplier, two levels of multipliers are required, so it works at a
rate of only 50 MHz. It seems that a real-time implementation of the hardware
algorithm in one FPGA is not feasible for now; nevertheless, it may be feasible with 4
to 5 FPGAs and an improved computation rate, achieved mainly by improving the FPGA
design. One idea for improving the design is the use of a pipeline.
5.3 Summary
A VHDL implementation of the hardware model has been designed using an instruction
queue which runs an array of ALUs. The design was simulated and verified in an FPGA
simulator. The simulation results were compared to the hardware model in C using the
reduced floating-point representation. The VHDL code was synthesized in order to check
the feasibility of running it in real time. For the moment, an FPGA cochlea model can
be implemented, but not in real time. A real-time application could be made possible
by using more FPGAs and by improving the design.
Chapter 6
Discussion
The processing of the cochlear representation introduces massive computations in
the speech enhancement system. In this study, we have focused on designing the
main block (the processor) which accounts for most of the computational load. Al-
though other blocks, such as the interpolation and reconstruction blocks, participate
in the processing sequence, their addition to the time delay and power consumption
is negligible.
In this research, the solution of the one-dimensional cochlear model [4] was
modified to best fit a hardware implementation. A parallel solution for the algorithm
was introduced and evaluated. Because the hardware solution consists of two iterative
numerical methods, several configurations were possible that obtain a relative error
of less than 1% compared to the original algorithm on a set of tested stimuli.
We have narrowed down the number of configurable parameters to only four: ht, N,
Titer and Piter. The computational resolution is defined by ht and N, which represent
the processor's core data sampling period and the membrane's resolution (number of
sections), respectively. The numbers of iterations of the numerical methods are
represented by Titer and Piter. The convergence and truncation errors of the numerical
methods depend on the number of iterations. Obviously, when more iterations are
applied, the relative output error decreases, but the latency and workload increase.
The parameters ht, N and Titer were also valid for the software solution of the
algorithm. The parameters ht and Titer were fixed in the hardware solution. The timing
performance (processing latency), power consumption (total workload) and functional
correctness (relative output error) are determined by the configuration.
Each configuration was evaluated using three criteria: the error relative to the
original solution, the clock frequency and the power consumption. These three criteria
were the most important for our design considerations. We estimated that a clock
frequency of up to 250 MHz and a power consumption between 0.06 and 1.13 W would be
needed to achieve reasonable functional performance in real time.
We used TSMC’s 0.18µm process standard cell library databook as a reference.
This technology is not the most up to date, and there are newer technologies, such as
0.09µm, which would improve the timing, power and area size by about 20 to 30%.
Many other criteria were not considered, for example the complexity, size and layout
of the chip. A modular design is preferable because it is less complex to implement
and usually more area efficient. In our architecture, the N ALU blocks are laid out in
parallel, which makes the design modular and easier to lay out. The total area was not
computed since it is technology dependent. Silicon interconnect plays a major role in
design architectures nowadays, as transistors get smaller and faster. Wire congestion
problems are very hard to forecast, but our parallel solution keeps them small, since
each ALU communicates only with its adjacent ALUs.
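The neighbour-only communication pattern can be illustrated with a Jacobi-style sweep
of the discretized pressure equation B.1.4 (Appendix B), in which section i needs only
the previous values of sections i-1 and i+1. The following is a minimal Python sketch,
not the VHDL or C code of this work. Reading the parallel pressure solution as a plain
Jacobi iteration is an assumption made for illustration, and the names
`jacobi_pressure`, `residual` and `sweeps` are hypothetical.

```python
def jacobi_pressure(q, g, s, hx, sweeps, p_start=None):
    """Jacobi sweeps of p[i-1] - (2 + hx^2 q[i]) p[i] + p[i+1] = hx^2 q[i] g[i]."""
    n = len(q) - 1                      # sections are indexed 0..N
    p = list(p_start) if p_start is not None else [0.0] * (n + 1)
    p[0], p[n] = s, 0.0                 # boundary values p_0 = S, p_N = 0 (B.1.5)
    h2 = hx * hx
    for _ in range(sweeps):
        new_p = p[:]                    # every section reads only the old vector
        for i in range(1, n):           # each i could run on its own ALU
            neighbours = p[i - 1] + p[i + 1]
            new_p[i] = (neighbours - h2 * q[i] * g[i]) / (2.0 + h2 * q[i])
        p = new_p
    return p


def residual(p, q, g, hx):
    """Largest absolute residual of the interior equations B.1.4."""
    h2 = hx * hx
    return max(
        abs(p[i - 1] - (2.0 + h2 * q[i]) * p[i] + p[i + 1] - h2 * q[i] * g[i])
        for i in range(1, len(p) - 1)
    )
```

In this sketch the number of sweeps plays the role of an iteration-count parameter:
more sweeps reduce the residual at the cost of more work per time step. How fast the
residual falls depends on the values of Q and hx, so the sketch says nothing about the
iteration counts needed for the actual model.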
The system could be designed as a system on a chip (SoC) or could be partitioned
into a couple of independent components. The IO interface is an important issue in
chip design. A chip can be pad-limited or core-limited. When a chip is pad-limited,
the given silicon area cannot contain all of the IO interface (pads). Therefore,
integrating the reconstruction algorithm and the cochlear model algorithm into one
chip is preferable. The reconstruction algorithm uses the cochlear model output and
performs multiply-and-add operations on the matrix columns at an output rate of no
more than 50 kHz.
The total tolerable delay in speech applications is about 25 ms. Our bottleneck is
the high core data rate of the processor, 1/ht, where ht is about 1 µs. For real-time
performance, the processor time delay would be ht, which leaves (0.025 − ht) sec for
the rest of the system's operations.
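As a rough illustration of this budget, the sketch below computes how many clock
cycles fit into one core time step and how many time steps fit into the tolerable
delay. It only combines figures quoted in this chapter (a clock estimate of up to
250 MHz, ht of about 1 µs and a 25 ms delay); the function names are hypothetical.

```python
def cycles_per_time_step(clock_hz, ht_sec):
    """Clock cycles available to finish one core time step of length ht."""
    return round(clock_hz * ht_sec)


def steps_within_delay(delay_sec, ht_sec):
    """Number of core time steps that fit into the tolerable delay."""
    return round(delay_sec / ht_sec)


if __name__ == "__main__":
    print(cycles_per_time_step(250e6, 1e-6))   # cycles per time step
    print(steps_within_delay(0.025, 1e-6))     # time steps per 25 ms
```

Under these figures, all the work of one time step would have to complete within
250 clock cycles to keep up with the input.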
Different criteria exist for evaluating speech applications such as hearing aids,
cellular communication and voice recognition. The common evaluation tests applied
today are hearing tests; the Mean Opinion Score (MOS) [38], a hearing test in which a
listener grades speech quality on a scale of 1 to 5; and Perceptual Evaluation of
Speech Quality (PESQ) [39], a software method that imitates MOS testing for
telecommunications. In our work, we set the most stringent restriction: no more than
1% relative error was allowed between the hardware and the original output matrices.
In this case, we relied on the correctness of the original software solution.
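A check of this kind can be sketched as follows. The exact definition of the relative
error between the two output matrices is not restated in this chapter, so the sketch
assumes a ratio of Frobenius norms; the function name and the sample matrices are
hypothetical.

```python
import math


def relative_error(reference, candidate):
    """Frobenius-norm error of `candidate` relative to `reference` (assumed metric)."""
    num = 0.0
    den = 0.0
    for ref_row, cand_row in zip(reference, candidate):
        for r, c in zip(ref_row, cand_row):
            num += (c - r) ** 2
            den += r ** 2
    return math.sqrt(num / den)


if __name__ == "__main__":
    original = [[1.0, 2.0], [3.0, 4.0]]      # stand-in for the software output
    hardware = [[1.0, 2.0], [3.0, 4.04]]     # stand-in for the hardware output
    print(relative_error(original, hardware) < 0.01)   # the 1% acceptance test
```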
The reconstructed speech signals of the original and hardware models were evaluated
using hearing tests. We have not yet evaluated these output signals using the MOS or
PESQ testing methods. The tested stimuli were composed of synthetic signals and a set
of recorded Hebrew words. Although recognizing monosyllabic words in hearing tests is
harder than recognizing meaningful sentences, the addition of sentences to the hearing
test should be considered.
Many open questions remain regarding the correct evaluation criteria for the original
speech enhancement system, including the comparison between the hardware and the
software models. We suggest that future work compare the different hardware
configurations using other testing criteria. It is possible that, for different
applications with different evaluation criteria, fewer iterations with a reduced
number of sections and a lower data rate would be sufficient. Thus, the matrix
relative error may not be the only criterion for deciding which configuration is best
for our processor. Moreover, different configurations can be used for different
applications. Each configuration results in different timing performance and power
consumption.
The hardware model was designed for a field programmable gate array (FPGA) chip,
using an ALU-oriented architecture (Chapter 5). The design was synthesized, but
real-time performance was not achieved under the FPGA limitations. Other hardware
implementations were also considered. We turned to ASIC design (Chapter 4), which
imposes fewer design limitations and allows a more diversified design.
This research was the first step towards the implementation of a newly proposed
speech enhancement algorithm. We have discovered the drawbacks of the algorithm with
respect to its hardware implementation, where timing performance and power efficiency
are design considerations. The large number of partitions, N, and the high
computational rate, 1/ht, introduce massive computations. These drawbacks should be
examined and taken into account in the development of the next version of the
algorithm. Future research should consider a new mathematical approach that uses a
bank of FIR filters instead of the current iterative numerical solution. It could
make the algorithm more attractive for hardware implementation. Future studies should
also examine the algorithm (original and hardware) with a broader database and
evaluate the reconstructed speech signals using different evaluation techniques.
Appendix A
List of Symbols and parameters
| Symbol | Definition | Units |
|---|---|---|
| $P$ | Pressure across the cochlear partition | kg/(m·sec²) |
| $P_t(x,t)$ | Pressure in scala tympani | kg/(m·sec²) |
| $P_v(x,t)$ | Pressure in scala vestibuli | kg/(m·sec²) |
| $U_t(x,t)$ | Scala tympani fluid velocity for unit area | m/sec |
| $U_v(x,t)$ | Scala vestibuli fluid velocity for unit area | m/sec |
| $\rho$ | Perilymph density | kg/m³ |
| $\xi_{bm}(x,t)$ | Basilar membrane vertical displacement | m |
| $A(x)$ | Scalae cross section area | m² |
| $\beta(x)$ | Basilar membrane width | m |
| $P_{ohc}$ | OHC pressure contribution | kg/(m·sec²) |
| $P_{bm}$ | Pressure obtained by basilar membrane | kg/(m·sec²) |
| $m(x)$ | Basilar membrane mass per unit area | kg/m² |
| $r(x)$ | Basilar membrane damping per unit area | kg/(m²·sec) |
| $s(x)$ | Basilar membrane stiffness per unit area | kg/(m²·sec²) |
| $\psi$ | Basolateral membrane voltage drop | volt |
| $\psi_0$ | Equivalent electrochemical gradient | volt |
| $G_a$ | OHC apical membrane conductance | 1/ohm |
| $C_a$ | OHC apical membrane capacitance | amp·sec/volt |
| $G_b$ | OHC basolateral membrane conductance | 1/ohm |
| $C_b$ | OHC basolateral membrane capacitance | amp·sec/volt |
| $\Delta\ell_{ohc}$ | OHC elongation | m |
| $F_{ohc}$ | OHC force applied to the basilar membrane | kg/(m·sec²) |
| $K_{ohc}$ | OHC stiffness | kg/sec² |
| $\gamma$ | Relative density of healthy OHCs per unit area | 1/m² |
| $K_0$ | OHC load stiffness | kg/(m²·sec²) |
| $\omega_{cf}$ | Characteristic angular frequency | rad/sec |
| $Z$ | Impedance | kg/(m²·sec) |

Table A.1: List of symbols
Appendix B
Mathematical Methods
The solution of the algorithm is performed in two steps. The first step is the solution
of a boundary condition problem in the spatial domain using the finite difference
method and the second step is the solution of an initial condition problem in the time
domain using Euler and Modified Euler methods.
B.1 The Finite Difference Method
We use the finite difference method to solve the second-order differential equation
with the boundary conditions introduced in equation 2.2.7,

$$\frac{\partial^2 P}{\partial x^2} - QP = QG \tag{B.1.1}$$

and the boundary conditions:

$$P(x,t) = S(t), \quad x = 0; \qquad P(x,t) = 0, \quad x = \ell$$

The basilar membrane is partitioned uniformly and the grid points are:

$$x_i = i h_x, \qquad h_x = \frac{\ell}{N}, \qquad i = 0, 1, \cdots, N \tag{B.1.2}$$

The natural three-point approximation to the second derivative in x is:

$$\frac{\partial^2 P(x,t)}{\partial x^2} \approx \frac{P(x+h_x,t) - 2P(x,t) + P(x-h_x,t)}{h_x^2} \tag{B.1.3}$$
Defining $p_i$ as $P(x_i)$ and substituting the approximation of equation B.1.3 in
equation B.1.1 gives:

$$p_{i-1} - (2 + h_x^2 Q_i)\,p_i + p_{i+1} = h_x^2 Q_i G_i \tag{B.1.4}$$

where $Q_i = Q(x_i)$ and $G_i = G(x_i)$. The exterior point equations are:

$$p_0 = S(T), \qquad p_N = 0 \tag{B.1.5}$$
Equation B.1.4 and equation B.1.5 are combined and displayed as a linear system:
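$$Ap = B \tag{B.1.6}$$

where,

$$p = \begin{pmatrix} p_0 \\ p_1 \\ \vdots \\ p_{N-1} \\ p_N \end{pmatrix}, \qquad
B = \begin{pmatrix} S(T) \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix}
+ h_x^2 \begin{pmatrix} 0 \\ G_1 Q_1 \\ \vdots \\ G_{N-1} Q_{N-1} \\ 0 \end{pmatrix} \tag{B.1.7}$$

and the matrix A is:

$$A = \begin{pmatrix}
1 & 0 & & & \\
1 & -(2 + h_x^2 Q_1) & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & -(2 + h_x^2 Q_{N-1}) & 1 \\
 & & & 0 & 1
\end{pmatrix} \tag{B.1.8}$$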
Matrix A is a constant, time-independent matrix. As $Q_i > 0$ for $i = 1, \cdots, N-1$,
matrix A has a dominant diagonal and is a regular (nonsingular) matrix [31], so a
unique solution to the linear system exists. Matrix A is tridiagonal and is factored
into two bidiagonal matrices ([15], page 55):
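$$A = \begin{pmatrix}
\alpha_0 & & & & \\
1 & \alpha_1 & & & \\
 & \ddots & \ddots & & \\
 & & 1 & \alpha_{N-1} & \\
 & & & 1 & \alpha_N
\end{pmatrix}
\begin{pmatrix}
1 & \gamma_0 & & & \\
 & 1 & \gamma_1 & & \\
 & & \ddots & \ddots & \\
 & & & 1 & \gamma_{N-1} \\
 & & & & 1
\end{pmatrix} \tag{B.1.9}$$

where:

$$\begin{aligned}
\alpha_0 &= -\left(1 + \frac{h_x^2 Q_0}{2}\right) \\
\alpha_i &= -(2 + h_x^2 Q_i) - \gamma_{i-1}, && i = 1, 2, \cdots, N-1 \\
\gamma_i &= \frac{1}{\alpha_i}, && i = 0, 1, 2, \cdots, N-1 \\
\alpha_N &= 1 - \gamma_{N-1}
\end{aligned} \tag{B.1.10}$$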
Now, the solution of the linear system is done in two steps. An intermediate vector
n is defined and obtained from the following system:
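$$\begin{pmatrix}
\alpha_0 & & & & \\
1 & \alpha_1 & & & \\
 & \ddots & \ddots & & \\
 & & 1 & \alpha_{N-1} & \\
 & & & 1 & \alpha_N
\end{pmatrix}
\begin{pmatrix} n_0 \\ n_1 \\ \vdots \\ n_{N-1} \\ n_N \end{pmatrix}
= \begin{pmatrix} B_0 \\ B_1 \\ \vdots \\ B_{N-1} \\ B_N \end{pmatrix} \tag{B.1.11}$$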
The vector n is obtained by the recursion formula:
$$n_0 = \frac{B_0}{\alpha_0}, \qquad
n_i = \frac{B_i - n_{i-1}}{\alpha_i}, \quad i = 1, 2, \cdots, N \tag{B.1.12}$$
Finally, in order to obtain the basilar membrane pressure p, the next system is solved:
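$$\begin{pmatrix}
1 & \gamma_0 & & & \\
 & 1 & \gamma_1 & & \\
 & & \ddots & \ddots & \\
 & & & 1 & \gamma_{N-1} \\
 & & & & 1
\end{pmatrix}
\begin{pmatrix} p_0 \\ p_1 \\ \vdots \\ p_{N-1} \\ p_N \end{pmatrix}
= \begin{pmatrix} n_0 \\ n_1 \\ \vdots \\ n_{N-1} \\ n_N \end{pmatrix} \tag{B.1.13}$$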
The vector p is obtained by the recursion formula:

$$p_N = n_N, \qquad
p_i = n_i - \gamma_i\, p_{i+1}, \quad i = N-1, N-2, \cdots, 1, 0 \tag{B.1.14}$$
The factorization of A gives a direct solution of the linear system for the boundary
condition problem. As seen, both the intermediate vector n and the desired vector p
are computed recursively, so each element depends on the previously computed one.
Thus, it takes 2N + 2 sequential steps to complete the computation of the pressure
vector p. In total, the computation requires 2N + 1 multiplications (or divisions)
and 2N additions.
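The two recursions can be sketched in a few lines of Python. This is an illustrative
sketch rather than the software solution of this work: it builds the rows of A from
equation B.1.8, so the boundary coefficients come from those rows (an assumption of
the sketch), and it keeps the sub- and super-diagonal entries explicit. For the
interior rows, where both entries equal 1, the loop reduces to the recursions B.1.10,
B.1.12 and B.1.14. The function name `solve_pressure` is hypothetical.

```python
def solve_pressure(q, g, s, hx):
    """Solve A p = B (B.1.6) by bidiagonal factorization and two recursions."""
    n = len(q) - 1                      # grid points are indexed 0..N
    h2 = hx * hx
    # Rows of A (B.1.8): sub-diagonal a, diagonal d, super-diagonal c.
    a = [0.0] + [1.0] * (n - 1) + [0.0]
    d = [1.0] + [-(2.0 + h2 * q[i]) for i in range(1, n)] + [1.0]
    c = [0.0] + [1.0] * (n - 1) + [0.0]
    # Right-hand side B (B.1.7).
    b = [s] + [h2 * q[i] * g[i] for i in range(1, n)] + [0.0]

    alpha = [0.0] * (n + 1)
    gamma = [0.0] * (n + 1)
    inter = [0.0] * (n + 1)             # intermediate vector n of B.1.11
    alpha[0] = d[0]
    gamma[0] = c[0] / alpha[0]
    inter[0] = b[0] / alpha[0]
    for i in range(1, n + 1):           # forward recursion (B.1.12)
        alpha[i] = d[i] - a[i] * gamma[i - 1]
        gamma[i] = c[i] / alpha[i]
        inter[i] = (b[i] - a[i] * inter[i - 1]) / alpha[i]

    p = [0.0] * (n + 1)
    p[n] = inter[n]
    for i in range(n - 1, -1, -1):      # backward recursion (B.1.14)
        p[i] = inter[i] - gamma[i] * p[i + 1]
    return p
```

Both loops are inherently sequential, since each element depends on the one computed
just before it.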
B.2 Initial condition problem numerical solution
The cochlear model variables $v_{bm}$, $\xi_{bm}$ and $P_{ohc}$ must be approximated at
the beginning of each time point computation. The variables $P$, $v_{bm}$, $\xi_{bm}$
and $P_{ohc}$ along the cochlear partition are available at all time points $t \le T$.
The cochlear model variables $v'_{bm}$ and $P'_{ohc}$ are also computed and known for
all sections at the time point $t = T$. The numerical methods used to approximate
these model variables are the Euler and Modified Euler methods. These methods are
simple and computationally efficient.
As the magnitudes of the model variables undergo significant changes during the
computation process, it takes several iterations for the algorithm to converge. In the
first iteration we use the Euler method to obtain a rough approximation of the
variables $v_{bm}$, $\xi_{bm}$ and $P_{ohc}$ for the time point. From the second
iteration onwards we use the Modified Euler method to approximate these variables.
B.2.1 Euler Method
We define the initial value problem:
$$y'(t) = f(t, y(t)), \qquad y(t_0) = y_0 \tag{B.2.1}$$
where f(t, y) is a given function, t0 is a given initial time and y0 is a given initial
value for y. The unknown in the problem is the function y(t).
The Euler method is very simple. It uses the first derivative to determine
(approximate) the next time step, as seen from the equation:

$$y(t_{n+1}) \approx y_n + f(t_n, y_n)\,h_t \tag{B.2.2}$$

The parameter $h_t$ is called the time step size. The value of $h_t$ may be changed by
the convergence unit (Cunit) which evaluates the approximation error. The truncation
error of the Euler method is $O(h_t^2)$. The computation of the Euler equation requires
only one multiplication and one addition.
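Equation B.2.2 translates directly into code. The following Python sketch is for
illustration only; the function names and the test problem $y' = -y$ are assumptions
and are not part of the cochlear model.

```python
import math


def euler_step(f, t_n, y_n, h_t):
    """One Euler step (B.2.2): one multiplication and one addition."""
    return y_n + f(t_n, y_n) * h_t


def euler_integrate(f, t0, y0, h_t, steps):
    """Apply euler_step repeatedly with a fixed time step h_t."""
    t, y = t0, y0
    for _ in range(steps):
        y = euler_step(f, t, y, h_t)
        t += h_t
    return y


if __name__ == "__main__":
    # y' = -y with y(0) = 1 has the exact solution exp(-t).
    approx = euler_integrate(lambda t, y: -y, 0.0, 1.0, 0.001, 1000)
    print(approx, math.exp(-1.0))
```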
B.2.2 Modified Euler Method
The Modified Euler method is classified as a Runge-Kutta method of order two. It is
also called the trapezoidal method.
By the fundamental theorem of calculus and the differential equation, the exact
solution of the initial value problem stated in equation B.2.1 obeys:
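$$\begin{aligned}
y(t_{n+1}) &= y(t_n) + \int_{t_n}^{t_{n+1}} y'(t)\,dt \\
&= y(t_n) + \int_{t_n}^{t_{n+1}} f(t, y(t))\,dt
\end{aligned} \tag{B.2.3}$$

The algorithm for computing $y_{n+1}$ will be of the form:

$$y(t_{n+1}) = y(t_n) + \text{approximate value for } \int_{t_n}^{t_{n+1}} f(t, y(t))\,dt$$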
In Euler's method, we approximate $f(t, y(t))$ for $t_n \le t \le t_{n+1}$ by the
constant $f(t_n, y_n)$. Thus,

$$\text{Euler's approximate value for } \int_{t_n}^{t_{n+1}} f(t, y(t))\,dt
= \int_{t_n}^{t_{n+1}} f(t_n, y_n)\,dt = f(t_n, y_n)\,h_t$$

The area of the complicated region $0 \le y \le f(t, \varphi(t))$, $t_n \le t \le t_{n+1}$,
which is the area under the parabola in figure B.1, is approximated by the area of the
rectangle $0 \le y \le f(t_n, y_n)$, $t_n \le t \le t_{n+1}$ (the shaded rectangle in
the right half of figure B.1).
Figure B.1: Euler and Modified Euler approximation method.
The Modified Euler method gets a better approximation by using the trapezoid in the
left half of figure B.1 rather than the rectangle on the right. The area of the
trapezoid is the length $h_t$ of the base multiplied by the average,
$\frac{1}{2}[f(t_n, \varphi(t_n)) + f(t_{n+1}, \varphi(t_{n+1}))]$, of the heights of
the two sides. Thus, the Modified Euler solution is:

$$y(t_{n+1}) \approx y(t_n) + \frac{h_t}{2}\left[f(t_n, y(t_n)) + f(t_{n+1}, y(t_{n+1}))\right] \tag{B.2.4}$$

The truncation error of the Modified Euler method is $O(h_t^3)$. Equation B.2.4 is an
implicit equation since $y(t_{n+1})$ appears on both sides. In order to solve
equation B.2.4 we define an iterative series $\{w_j\}$:

$$\begin{aligned}
w_0 &= y(t_n) + h_t\, f(t_n, y(t_n)) \\
w_j &= y(t_n) + \frac{h_t}{2}\left[f(t_n, y(t_n)) + f(t_{n+1}, w_{j-1})\right], && j = 1, 2, \cdots
\end{aligned} \tag{B.2.5}$$
The first element in the series is obtained by Euler method. The other elements in the
series are obtained by Modified Euler method. The series defined in equation B.2.5
is repeated until convergence is obtained. The Modified Euler method requires two
additions and one multiplication.
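The iteration of equation B.2.5 can be sketched as follows. This Python sketch uses a
fixed number of iterations instead of a convergence test; the function name, the
`iterations` argument and the test problem $y' = -y$ are assumptions made for
illustration.

```python
def modified_euler_step(f, t_n, y_n, h_t, iterations):
    """One Modified Euler step: Euler predictor w_0, then repeated w_j updates."""
    f_n = f(t_n, y_n)
    w = y_n + h_t * f_n                                   # w_0 (Euler method)
    for _ in range(iterations):
        # w_j from w_{j-1}: average of the slopes at t_n and t_{n+1}
        w = y_n + 0.5 * h_t * (f_n + f(t_n + h_t, w))
    return w


if __name__ == "__main__":
    # For y' = -y the implicit equation B.2.4 can be solved by hand:
    # y_{n+1} = y_n (1 - h_t/2) / (1 + h_t/2).
    print(modified_euler_step(lambda t, y: -y, 0.0, 1.0, 0.1, 20))
    print(0.95 / 1.05)
```

With zero iterations the function returns the plain Euler value, which matches the
scheme described above: the first approximation comes from the Euler method and the
later ones from the Modified Euler method.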
Appendix C
Booth Recoding
This technique is used to reduce the number of partial products that must be added
together in a multiplier. Various forms of Booth recoding have been proposed. Consider
one variation (called modified Booth, radix-4 Booth or Booth-2 recoding) that examines
groups of 3 adjacent bits of the multiplier operand. Assume that the number of bits,
n, in the multiplier operand Y is an even number. The algorithm uses the two's
complement representation of signed numbers. The significands are positive numbers, so
they can be written in two's complement format by prepending a 0 MSB. (If we started
with an even number of significand bits, then we have to prepend two 0 bits to keep
the number of bits even.) Then, since the bit $y_{n-1} = 0$, we can write:

$$Y = \sum_{k=0}^{n-1} y_k 2^k \tag{C.0.1}$$
Each term in the summation can be written as:
$$y_k 2^k = \left(2y_k - \tfrac{1}{2} \cdot 2y_k\right) 2^k = y_k 2^{k+1} - 2 y_k 2^{k-1} \tag{C.0.2}$$
We use the above expansion for the odd-k terms in the summation for Y and obtain:
$$\begin{aligned}
Y ={}& (y_{n-1} 2^n - 2 y_{n-1} 2^{n-2}) + y_{n-2} 2^{n-2} + (y_{n-3} 2^{n-2} - 2 y_{n-3} 2^{n-4}) + y_{n-4} 2^{n-4} + \cdots \\
& \cdots + (y_3 2^4 - 2 y_3 2^2) + y_2 2^2 + (y_1 2^2 - 2 y_1 2^0) + y_0 2^0
\end{aligned} \tag{C.0.3}$$
We define y−1 ≡ 0 and collect together the terms having the same power of two:
$$\begin{aligned}
Y ={}& y_{n-1} 2^n + (-2y_{n-1} + y_{n-2} + y_{n-3})\, 2^{n-2} + (-2y_{n-3} + y_{n-4} + y_{n-5})\, 2^{n-4} + \cdots \\
& \cdots + (-2y_3 + y_2 + y_1)\, 2^2 + (-2y_1 + y_0 + y_{-1})\, 2^0
\end{aligned} \tag{C.0.4}$$
The first term drops out since $y_{n-1} = 0$. We define a new set of coefficients $z_k$
for even k:

$$z_k \equiv -2y_{k+1} + y_k + y_{k-1}, \quad \text{for } k = 0, 2, 4, \cdots, n-2 \tag{C.0.5}$$
We can write a new expression for Y and the product, P = XY , in terms of zk as
follows:
$$Y = \sum_{\substack{k=0 \\ k\ \mathrm{even}}}^{n-2} z_k 2^k, \qquad
P = XY = \sum_{\substack{k=0 \\ k\ \mathrm{even}}}^{n-2} (z_k X)\, 2^k \tag{C.0.6}$$
Thus, we have to generate and sum only n/2 partial products $z_k X$, which is
approximately half of the original number of partial products $y_k X$. This represents
a considerable saving in the required hardware. In our circuit implementation, we have
to be able to generate the following partial products: $0$, $X$, $-X$, $2X$ and $-2X$.
These are obtained easily by including circuits for negating and for shifting left by
one bit position.
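The recoding of equations C.0.5 and C.0.6 can be checked with a short Python sketch.
It works on ordinary integers rather than on a hardware bit vector, and the function
names are hypothetical.

```python
def booth_radix4_digits(y, n):
    """Return z_k for k = 0, 2, ..., n-2 (C.0.5); requires n even and y_{n-1} = 0."""
    if n % 2 != 0:
        raise ValueError("n must be even")
    if not 0 <= y < 2 ** (n - 1):
        raise ValueError("y must be non-negative with its MSB y_{n-1} equal to 0")

    def bit(k):
        return 0 if k < 0 else (y >> k) & 1     # y_{-1} is defined as 0

    return [-2 * bit(k + 1) + bit(k) + bit(k - 1) for k in range(0, n, 2)]


def booth_multiply(x, y, n):
    """Compute P = X * Y from the n/2 partial products z_k * X (C.0.6)."""
    total = 0
    for j, z in enumerate(booth_radix4_digits(y, n)):
        partial = z * x                 # one of 0, X, -X, 2X, -2X
        total += partial * 4 ** j       # weight 2^k with k = 2j
    return total


if __name__ == "__main__":
    print(booth_radix4_digits(109, 8))  # digits z_0, z_2, z_4, z_6
    print(booth_multiply(37, 109, 8), 37 * 109)
```

For an 8-bit multiplier operand the sketch produces four digits, each in the range
-2 to 2, so only four partial products are summed instead of eight.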
Appendix D
Delay Calculation
The propagation delay through a cell is the sum of the intrinsic delay, the
load-dependent delay, and the input-slew-dependent delay. The typical delay
calculation (@1.8V, 25°C) through standard cells according to the TSMC 0.18µm
process [1] is:

$$t_{TPD} = t_{typical} = t_{intrinsic} + (K_{load} \cdot C_{load}) \tag{D.0.1}$$

where,
• $t_{intrinsic}$ = delay through the cell when there is no output load (ns).
• $K_{load}$ = load delay multiplier (ns/pF).
• $C_{load}$ = total output load capacitance (pF).
In order to calculate the propagation delay through the cell CMPR32, which is a
3-to-2 counter (a full adder, FA), we use equation D.0.1 and the data from [1]. We
used the $C_{load}$ of the cell BUFX2.

$$t_{TPD} = 0.34 + (4.5 \cdot 0.003) \approx 0.35\ \text{ns} \tag{D.0.2}$$
Appendix E
FPGA Instruction Code
The instruction code for the FPGA controller:
Eunit = 6 CC (CC = Clock Cycles)
    Tmp1 = mult(prev_veloc, ht)
    Disp = add(prev_disp, tmp1)
    Tmp1 = mult(prev_acce, ht)
    Veloc = add(prev_veloc, tmp1)
    Tmp1 = mult(prev_ohcp_derv, ht)
    Ohcp = add(prev_ohcp, tmp1)

Gunit = 5 CC
    Tmp1 = mult(k/2, Ohcp)
    Tmp2 = mult(RK, Veloc)
    Tmp1 = add(tmp2, tmp1)
    Tmp2 = mult(−K/C, Disp)
    G = add(tmp2, tmp1)
    In the last instruction edge manipulation=ON: g(0)=input, g(447)=0

Punit = 4 CC
    Tmp1 = add(P(i−1), P(i+1)), shift=ON
    Tmp2 = mult(−1/dx^2, tmp1)
    Tmp1 = add(tmp2, G)
    P = mult(1/(2/dx^2 + K), tmp1)
    In the last instruction edge manipulation=ON: p(0)=input, p(447)=0

D = 9 CC
    Tmp1 = add(P(i−1), P(i+1)), shift=ON
    Tmp2 = mult(P, −2)
    Tmp2 = add(tmp2, tmp1)
    Acce = mult(tmp2, 220)
    In the last instruction edge manipulation=ON: Acce(0)=0, Acce(447)=0
    Tmp1 = mult(−w0, Ohcp)
    Tmp2 = mult(−Kohcw1, Disp)
    Tmp1 = add(tmp2, tmp1)
    Tmp2 = mult(Kohc, Veloc)
    Ohcp_derv = add(tmp2, tmp1)

ME = 9 CC
    Tmp2 = add(prev_veloc, Veloc)
    Tmp1 = mult(tmp2, ht/2)
    Disp = add(prev_disp, tmp1)
    Tmp2 = add(prev_acce, acce)
    Tmp1 = mult(tmp2, ht/2)
    Veloc = add(prev_veloc, tmp1)
    Tmp2 = add(prev_ohcp_derv, ohcp_derv)
    Tmp1 = mult(tmp2, ht/2)
    Ohcp = add(prev_ohcp, tmp1)

Pre E = 5 CC
    Prev_disp = add(0, Disp)
    Prev_veloc = add(0, Veloc)
    Prev_acce = add(0, Acce)
    Prev_ohcp = add(0, Ohcp)
    Prev_ohcp_derv = add(0, Ohcp_derv)
Table E.1: The Instruction code for FPGA Controller.
Bibliography
[1] TSMC 0.18µm Process 1.8-Volt SAGE-X Standard Cell Databook.
[2] Virtex-2 Pro Platform FPGAs: Complete Data Sheet.
[3] Boothroyd A. Statistical theory of the discrimination score. J. Acoust. Soc. Am.,
43:362–367, 1968.
[4] Cohen A. and Furst M. Integration of outer hair cell activity in one-dimensional
cochlear model. J. Acoust. Soc. Am., 115:2185–2192, May 2004.
[5] A.D.Booth. A signed binary multiplication technique. Quart. Journ. Mech. and
Applied Math., 4:236–240, 1951.
[6] Ronen Akerman. Implementation of a one dimensional cochlear model in FPGA.
Technical report, Elect. Eng. Dep., Tel-Aviv Univ., 2004.
[7] Oren Bahat. Efficient implementation and complexity analysis of one dimensional
cochlear model. Technical report, Elect. Eng. Dep., Tel-Aviv Univ., 2003.
[8] Gary W. Bewick. Fast Multiplication: Algorithm and Implementation. PhD
thesis, Elect. Eng. Dep., Stanford Univ., California, February 1994.
[9] Geisler C.D. From sound to synapse: physiology of the mammalian ear. Oxford
University Press, New York, 1998.
[10] C.D.Summerfield and R.F.Lyon. ASIC implementation of the lyon cochlea
model. IEEE International Conference on Acoustics, Speech and Signal Pro-
cessing, pages 673–676, 1992.
[11] Azaria Cohen. Cochlear Model For Normal and Damaged Ears. PhD thesis,
Elect. Eng. Dep., Tel-Aviv Univ., Tel-Aviv, Israel, June 2004.
[12] B. Cooper. The Beethoven Compendium: A Guide to Beethoven’s Life and Music.
Thames & Hudson, New York, 1991.
[13] Steele C.R and Tabar L.A. Three-dimensional model calculations for guinea pig
cochlea. J. Acoust. Soc. Am., 69:1107–1111, 1981.
[14] C.S.Wallace. A suggestion for a fast multiplier. IEEE Trans. on Computer
EC-13, pages 14–17, Feb. 1964.
[15] Isaacson E and Keller H.B. Analysis of numerical methods. Dover, New York,
1993.
[16] Von Bekesy G. Experiments in Hearing. McGraw-Hill, New York, 1960.
[17] Zweig G, Lipes R, and Pierce J.R. The cochlear compromise. J. Acoust. Soc.
Am., 59:975–982, 1976.
[18] J. Hennessy and D. Patterson. Computer Architecture, A Quantitative Approach.
Morgan Kaufmann Publishers, third edition, 2003.
[19] Helmholtz H.L.F. On the sensation of tone. Dover (the original German edition
was published in 1862), New York, 1954.
[20] J.C.Bor and C.Y.Wu. Analog electronic cochlea design using multiplexing
switched-capacitor circuits. IEEE Transactions on Neural Networks, 7:155–166,
1996.
[21] Zwislocki J.J. Theory of the acoustical action of the cochlea. J. Acoust. Soc.
Am., 22:778–784, 1950.
[22] Pickles J.O. An Introduction to the physiology of hearing. 1982.
[23] I. Koren. Computer Arithmetic Algorithms. Prentice-Hall, New Jersey, 1993.
[24] F. Thomson Leighton. Introduction to Parallel Algorithms and Architectures:
Arrays, Trees, Hypercubes. Morgan Kaufmann, San Mateo, 1992.
[25] Furst M and Goldstein J.L. A cochlear nonlinear transmission line model com-
patible with combination tone psychophysics. J. Acoust. Soc. Am, 72:717–726,
1982.
[26] Viergever M.A. Mechanics of the inner ear a mathematical approach. Delft
University of technology, The Netherlands, 1980.
[27] M.Brucke, W.Nebel, A.Schwarz, B.Mertsching, M.Hansen, and B.Kollmeier. Digital
VLSI-implementation of a psychoacoustically and physiologically motivated speech
preprocessor. Proceedings of the NATO Advanced Study Institute on Computational
Hearing, pages 157–162, 1998.
[28] M.P.Leong, C.T.Jin, and P.H.W.Leong. Parameterized module generator for an
FPGA-based electronic cochlea.
[29] Ranke O.F. Theory of operation of the cochlea: A contribution to the
hydrodynamics of the cochlea. J. Acoust. Soc. Am., 22:772–777, 1950.
[30] R.F.Lyon and C.Mead. An analog electronic cochlea. IEEE Trans. Acoust.,
Speech, Signal Processing, 36:1119–1134, July 1988.
[31] Burden R.L and Faires J.D. Numerical Analysis , Fifth Edition. PWS Publishing,
Boston, 1993.
[32] Wegel R.L and Lane C.E. The auditory masking of one pure tone by another
and its probable relation to the dynamics of the inner ear. Physical Review,
23:266–285, 1924.
[33] Harrison R.V and Hunter-Duvar I.M. An anatomical tour of the cochlea. In:
Physiology of the ear, edited by Jahn A.F and Santos-Sacchi. Raven, New York,
1988.
[34] S.C.Lim, A.R.Temple, S.Jones, and R. Meddis. VHDL-based design of biolog-
ically inspired pitch detection system. Proceedings of the IEEE International
Conference on Neural Networks, 2:922–927, 1997.
[35] S.Kochkin. Marketrack VI: Hearing aid industry market tracking survey 1984-
2000. www.knowleselectronics.com/market/presentations.asp, 2003.
[36] L. Watts, D.A.Kerns, R.F.Lyon, and C.A.Mead. Improved implementation of
the silicon cochlea. IEEE Journal of Solid State Circuits, 27:692–700, May 1992.
[37] Vered Weisz. Robust cochlear based representation of speech signals: Compar-
ison between healthy and damaged cochlea. Master’s thesis, Elect. Eng. Dep.,
Tel-Aviv Univ., Tel-Aviv, Israel, October 2004.
[38] Methods for subjective determination of transmission quality. ITU-T Rec. P.800,
August 1996.
[39] Perceptual evaluation of speech quality (PESQ), an objective method for
end-to-end speech quality assessment of narrow-band telephone networks and speech
codecs. ITU-T Rec. P.862, February 2001.