
University of Minnesota - Duluth

A Digital Audio Coder Based on a Model of Human Hearing

Hans Anderson

5/21/2007


TABLE OF CONTENTS

Introduction
1 - Digital Musical Synthesis Techniques
    Wavetable Synthesis
    Frequency Modulated (FM) Synthesis
    Modeling Synthesis
2 - Introduction to Digital Audio Coding
    Sampling
        The Sampling Theorem
        Sample Depth
    Existing Audio Coders
        Masking
        MP3 Encoding
        Prony's Method
        Principal Components Analysis for Data Reduction
3 - Goals and Assumptions
    Desirable Qualities
        Polyphonic Pitch Detection
        Musically Well-Placed Basis Functions
        Frequency-Dependent Time Resolution
        Real Time Operation
4 - Theoretical Basis of the Algorithm
    Anatomy of the Auditory System
    Theories of Time-Frequency Analysis in Human Hearing
    Spectrograms for Time-Frequency Analysis
5 - Our Implementation
    The Analysis Phase
        Discrete Model of a Damped Harmonic Oscillator
        Analytic Input Signal
        Masking
        Data Storage
        Synthesis
6 - Performance
    Quality
        High Frequency Estimation Error
        High Frequency Attenuation
    Speed
        Efficiency of the Analysis Phase
        Efficiency of the Masking Phase
        Efficiency of the Synthesis Phase
    Data Rate
7 - Future Research Possibilities
    Parallelization
    Feature Recognition and Transformation
    Musical Transcription
    Psychoacoustics
8 - Code
    Organizational Overview
    Analysis Functions
        Main File: analysisTest.m
        Auxiliary File: energyRateOfChange.m
    Masking Functions
        Organizational File: maskTest3.m
        Main File: applyMask.m
        Auxiliary File 1: estiamteAlpha.m
        Auxiliary File 2: fitmaskCurve.m
        Auxiliary File 3: aliasAmplitude3.m
        Auxiliary File 4: singleMaskCurve.m
    Synthesis Functions
        Main File: testSynthesis.m
        Auxiliary File 1: findNearest.m
        Auxiliary File 2: continuousFadeExps.m
        Auxiliary File 3: chop.m
Bibliography


ABSTRACT

For musical recordings of a single instrument playing only one note at a time, there exists reliable

software for detecting the pitches and transcribing them into musical notation.  But for polyphonic

recordings (those that contain sounds of several simultaneous pitches) very little has been

accomplished. This is surprising because humans do it so well and because, unlike other audio

recognition tasks, such as speech recognition, it doesn't require deep conceptual understanding.  In

order to move a step closer to a software solution, we implement a computer model of one theory

of human hearing and use it to encode audio recordings in a format similar to musical notation. 

This compact, efficient format has possible applications including voice over IP and live music

synthesis.


INTRODUCTION

In this paper, we present an algorithm for encoding and decoding audio signals. Although it

provides a method for storing audio in a very compact format, data compression is not its

primary goal. It aims, instead, to provide a perceptually meaningful data format: a

mathematical representation of sound that closely resembles the language and notation favored by

musicians. Such a representation has several advantages:

First, for creating synthesized electronic musical instruments it is helpful to represent audio data

in a format that is compatible with popular techniques for digitally generating sound. Secondly, this

format simplifies many musical signal processing tasks such as pitch detection, automatic

transcription, pitch shifting and speed adjustment. Finally, it represents sound in a clear and

intuitive way that enables us to visualize more accurately and understand more easily the nature of

sound. This algorithm is a research tool that provides a convenient way to analyze acoustical data

and experiment with sound.

Of course, these advantages come at a price; compared with mp3 and other perceptual data

compression schemes, this one is very computationally expensive. Also, when faced with decisions

in the design phase of the project, we occasionally opted for intuitive solutions based on perception

and physical analogy instead of algebraic manipulation. As a result, the algorithm is less accurate

than it otherwise could be.

An important distinction between this and other time-frequency analysis methods is that it adheres

to a perceptual measure of accuracy. We have maintained, as a guiding principle, the idea that the

human hearing apparatus is, by definition, the archetypical example of a perfect perceptual audio

encoder. In other words, wherever inaccuracies exist in our algorithm’s ability to perform time-


frequency analysis on an audio signal, they should not be considered as perceptual deficiencies if

there is evidence that the human auditory system makes similar mistakes.

Finally, since our CODEC1 is modeled after a particular aspect of a theory of hearing perception, the

quality of its sound output provides an indication of the degree to which that particular theory

explains the sensitivity of our hearing.

1 CODEC stands for COder/DECoder.


1 - DIGITAL MUSICAL SYNTHESIS TECHNIQUES

The method we are presenting uses a data format that makes it useful for producing synthetic

music. Since this is a major motivation for the project we begin by summarizing the existing

techniques for music synthesis.

WAVETABLE SYNTHESIS

Wavetable Synthesis is, conceptually, the simplest of the popular techniques. In summary, it works like this:

Suppose you want to make a synthesizer to imitate the sound of a piano. Begin by making a

recording of the sound of every key on a real piano.

Cut each sound into three sections:

o The first section represents the “attack”, that is, the sound of the felt-padded

hammer as it strikes the string at the beginning of the note.

o The second section is the “sustain”. This part represents the tone of the instrument

as it holds out a note. It will usually be cut so that it can be played in a loop that

could continue indefinitely to produce an arbitrarily long note.

o The final part is the “release”. In our example of a piano, this is the sound of the felt

dampers clamping down to dampen the vibration of the string.

When a musician presses a key on the electronic keyboard that controls the synthesizer, the

recording of that same key from the real piano should begin to play. It should begin, of

course, with the attack. Then the sustain part should loop until the key is released. After

the release of the key, the sustain section should stop looping and the release part of the

sample should play.
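To make the looping idea concrete, here is a minimal MATLAB sketch of attack/loop/release playback. The stand-in sample data, the section boundaries, and the hold time are illustrative assumptions, not values from any real wavetable synthesizer:

% Minimal wavetable playback sketch: attack, looped sustain, release.
% The "recording" and section boundaries below are illustrative stand-ins.
fs = 44100;
s  = randn(1, fs);                 % stand-in for a recorded piano note
atk = 1:8000;                      % attack section (sample indices)
sus = 8001:20000;                  % sustain section, cut to loop cleanly
rel = 20001:numel(s);              % release section
holdTime = 1.5;                    % seconds the key is held down
nLoops = ceil(max(0, holdTime*fs - numel(atk)) / numel(sus));
y = [s(atk), repmat(s(sus), 1, nLoops), s(rel)];   % assembled note

A real synthesizer would also crossfade the loop boundaries and scale the output by a volume envelope, but the structure is the same.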


Wavetable synthesis is employed, at least part of the time, by most high-quality modern

synthesizers. Its main advantage is accuracy; since every sound from a wavetable synth is

based on a recording, the sound closely resembles the real instrument.

Wavetable synthesis has several disadvantages:

It lacks expressiveness. Keyboard instruments can be accurately represented by

wavetable synthesis because they are compatible with the “attack, sustain, release”

model described above. That is, there isn’t much variety in the way each key is pressed,

and the musician doesn’t do anything to affect the timbre of the note while the key is

held down. Brass instruments or bowed string instruments are not so easily adapted to

this model because players of these instruments constantly adjust the amplitude,

timbre, and even the pitch while the note is in the “sustain” phase.

It requires a lot of samples. The timbre of most instruments changes significantly

depending on the amplitude. A vibrating string, for example, produces a harmonic

series of sounds in predictable, even ratios when the amplitude of vibration is minimal

compared to the length of the string. But as the amplitude increases, non-linear effects of friction begin to change the spectrum of the output, producing sounds of a more dissonant character. It is therefore necessary to sample each note at several

volume levels in order to get a realistic representation of the sound of the instrument. A

full-sized piano keyboard has 88 keys. If we sample each note at five volume levels, we

need 440 samples. Although the falling price of digital memory is making the necessary

hardware increasingly affordable, nothing relieves the cost of human effort required to

record all those samples. Needless to say, each recording should be of the highest

quality and each requires considerable manipulation and editing to make the attack,

sustain, and release sections fit together smoothly. Furthermore, the one who does the


sampling must take great care to be consistent about how he plays the notes on the real

instrument that he is recording. If he samples at five volume levels he must make sure

that the pressure he applies to the piano key for the fourth volume level is absolutely

the same for each key. For keyboard instruments, he may use some form of mechanical

assistance but this is not possible for other instruments. A trumpet, for example, must

be played by human lips or the sound will not be pleasant.

FREQUENCY MODULATED (FM) SYNTHESIS

FM synthesis is an approach that was prevalent in the digital synthesizers of the 1980’s and

continues to be used in low-cost keyboards and computer sound cards. It was patented in 1977 by

Stanford Professor John Chowning (Stanford University News Service, 1994). The idea is to

produce sounds based on simple mathematical expressions of the form

$$\cos\left(\omega_c t + \int_0^t f(x)\,dx\right)$$

where $f(t)$ is an arbitrary function that "modulates the frequency" (Schottstaedt).

Frequency Modulation produces a tone based around a fundamental harmonic oscillation;

typical choices of the modulation function generate various spectra of harmonics above the

fundamental.
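As a concrete illustration, the following MATLAB sketch generates one second of an FM tone with a sinusoidal modulator. The carrier frequency, modulator frequency, and modulation index are arbitrary choices for the example, not parameters taken from the text:

% Minimal FM synthesis sketch with a sinusoidal modulator.
% Here f(x) = 2*pi*fm*I*cos(2*pi*fm*x), whose integral is I*sin(2*pi*fm*t).
fs = 44100;                % sample rate (Hz)
t  = (0:fs-1)/fs;          % one second of time values
fc = 440;                  % carrier frequency (Hz)
fm = 110;                  % modulator frequency (Hz)
I  = 5;                    % modulation index (depth of modulation)
y  = cos(2*pi*fc*t + I*sin(2*pi*fm*t));
soundsc(y, fs);            % listen to the result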

The sound of an FM synthesizer is distinctly electronic and rarely resembles any real instrument

except perhaps in caricature. Favoring expressive control and exotic texture over realism, FM

synthesis became the driving force behind the popular bands of the 1980’s. Today it is still used to

produce electronic-sounding tones.


MODELING SYNTHESIS

Recent research has shifted to computationally expensive numerical vibration simulation models

that represent each part of a physical or electrical instrument by a delay line. A delay line

represents a vibrating string, for example, by modeling the transfer of vibrational energies through

a finite number of segments, each representing a small portion of the string. In the case of a guitar

model, the output from the delay line that represents the strings may be coupled into another

model that represents the wooden body of the guitar.
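The flavor of a delay-line model can be conveyed in a few lines. The sketch below is the well-known Karplus-Strong plucked-string algorithm, a simple relative of the models described above rather than any specific commercial implementation; the decay constant and averaging filter are standard illustrative choices:

% Minimal Karplus-Strong plucked-string sketch: a delay line whose
% length sets the pitch, with an averaging filter that models damping.
fs = 44100;
f0 = 220;                          % desired fundamental (Hz)
L  = round(fs/f0);                 % delay-line length in samples
dl = 2*rand(1, L) - 1;             % "pluck": fill the line with noise
y  = zeros(1, fs);                 % one second of output
for n = 1:numel(y)
    y(n) = dl(1);                  % read the front of the delay line
    avg  = 0.5*(dl(1) + dl(2));    % low-pass: energy loss per round trip
    dl   = [dl(2:end), 0.996*avg]; % shift the line and feed back
end
soundsc(y, fs);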

Modeling synthesis opens up a new dimension of freedom to synth designers, allowing for both

accuracy and expressive control. Although some degree of modeling is incorporated into the most

expensive commercial synthesizers it has not supplanted wavetable synthesis as the primary

means of producing electronic sound because it is too computationally expensive and it requires a

large amount of effort to design a complete set of instruments.


2 - INTRODUCTION TO DIGITAL AUDIO CODING

SAMPLING

As sound waves pass over the diaphragm of a microphone, the oscillating sound pressure levels induce a similarly oscillating electrical current. If the microphone is connected to a personal

computer, the oscillating signal reaches the sound card where the voltage is measured many

thousand times each second and the measurements are stored numerically in the computer

memory.

THE SAMPLING THEOREM

The Nyquist-Shannon sampling theorem, often called simply “the sampling theorem” states that a

continuous audio signal may be perfectly reconstructed from discretely sampled data provided the

original signal is band-limited so that the absolute value of the highest frequency, $f_{\max}$, is not greater than one half of the sampling rate, $f_{\text{sample}}$:

$$2\,|f_{\max}| \le f_{\text{sample}}$$

This theorem is often misunderstood to mean that practical decoders such as those used in

personal computers are capable of exact reproduction up to frequencies of half the sampling rate.

The Nyquist-Shannon theorem requires that the decoder must reconstruct the signal by multiplying

each sample by a sinc function, so that the influence of a particular sample affects the interpolation

between samples over the entire signal, not only near the sample in question. Since most practical

decoders use much simpler means of interpolating between samples, their actual frequency range

capability is quite difficult to predict. (Goldberg, p. 62)

11

Page 12: A Digital Audio Coder Based on a Model of Human Hearingande2213/filez/paper/MainPaper.docx  · Web viewMost of the world encountered digital music for the first time with the invention

The formula for perfect reconstruction of a band-limited signal $s(t)$ from data sampled at a rate of $f_s$ is

$$s(t) = \sum_{n=-\infty}^{\infty} x[n]\,\mathrm{sinc}(t f_s - n)$$

where

$$\mathrm{sinc}(t) = \begin{cases} 1, & t = 0 \\ \dfrac{\sin(\pi t)}{\pi t}, & \text{otherwise} \end{cases}$$

and $x[n]$ is the $n$th sample of the signal.
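The following MATLAB sketch applies this formula over a finite window, truncating the infinite sum (which is exactly the compromise practical decoders must make). The test tone and grid sizes are arbitrary; MATLAB's built-in sinc (Signal Processing Toolbox) uses the normalized definition sin(πt)/(πt):

% Minimal sinc-reconstruction sketch: rebuild a dense waveform from
% a short run of samples by summing shifted sinc functions.
fs = 8;                                   % sample rate (Hz)
n  = 0:31;                                % sample indices
x  = sin(2*pi*1.3*n/fs);                  % samples of a band-limited tone
t  = linspace(0, (numel(n)-1)/fs, 1000);  % dense reconstruction grid
s  = zeros(size(t));
for k = 1:numel(n)
    s = s + x(k) * sinc(t*fs - n(k));     % each sample contributes one sinc
end
plot(t, s);                               % s approximates sin(2*pi*1.3*t)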

In practice, the implication of the sampling theorem is a direct tradeoff between sound quality and

data file size; the higher the sampling rate, the wider the frequency range of the output. (Goldberg,

2003)

Perceptually speaking, lowering the sampling rate results in a muffled sound where bass notes are

faithfully reconstructed but high pitches become muted.

FIGURE 1 - THE SINC FUNCTION

SAMPLE DEPTH

Another factor relating file size to sound reproduction quality is the numerical precision of the

sampling. Clever adjustment of numerical precision can significantly reduce storage requirements

without affecting the perceived quality of the output signal. We will say more about that in the next

section.

EXISTING AUDIO CODERS

Most of the world encountered digital music for the first time with the invention of the compact

disc. Music CDs use a straightforward coding scheme: audio is sampled 44,100 times per second.

Each sample is a sixteen-bit signed integer, and the samples come in pairs – one sample for each of the left and right stereo tracks on the disc. One minute of audio in CD format occupies just over 10MB (44,100 samples/s × 2 bytes × 2 channels × 60 s ≈ 10.1 MB). A whole CD can contain about 800MB of data. This is slightly more than the capacity of

a data CD because audio CDs use a simpler error correction scheme. (Red Book (audio CD

standard))

Early efforts to store music on personal computers employed Pulse Code Modulated (PCM) formats

very similar to that of audio CDs. The ubiquitous .wav format is a category of PCM formats that

allows several choices of sampling rate and depth. The PCM wav format is quite inconvenient for

storing and transferring music because a typical music file requires more than 30 Megabytes of

hard drive space.

At the time when the first free mp3 player software became popular, most personal computers had

a hard drive smaller than 250MB so the prospect of storing music on the computer in .wav files was

not particularly attractive. Converting CD audio or .wav files into .mp3 format typically reduces the


file size by a factor of ten. That makes music files small enough to send them over a 56 kbps modem

in under ten minutes. (Dwight Brown)

MP3 is a lossy CODEC, which means that it reduces file size at the expense of sound quality. It is often called a perceptual CODEC because it exploits imprecision in hearing perception to allow imprecision in the sound representation without causing perceptible loss of quality.

MASKING

The human auditory system has a tendency to perceive certain sounds more accurately than others

and sometimes to completely ignore certain types of background noise, perceiving only the louder

auditory stimuli. Waiting in a quiet office at the end of the evening, one becomes aware of

the 60 Hz hum emanating from the electronics in the building. Of course, that sound is also there in

the day but it is not perceptible because it gets masked by the sounds of people talking, typing, and

moving around the building. Only when there are few other stimuli of louder volume does one

begin to perceive those softer sounds. This is an example of perceptual masking.

In the study of Psychoacoustics, we differentiate between two types of masking: temporal masking

and frequency masking.

Frequency masking, also called simultaneous masking, is when a louder sound obscures

simultaneously occurring sounds at other frequencies. Temporal masking occurs when a loud

sound obscures a softer sound impulse that occurs shortly before or shortly after the loud sound.

(Goldberg, pp. 156-157)

Some perceptual coders exploit perceptual masking by predicting which parts of a signal are likely

to be masked and reducing the sample precision for those sections.

MP3 ENCODING


The popular MP3 encoder exploits perceptual masking to reduce the size of audio recordings. A

summary of the process is as follows (Goldberg, 2003):

1. Cut the incoming data stream into overlapping windows of 512 samples each. Each window

will be encoded separately. In the reconstruction phase the decoder will piece the signal

back together.

2. Apply a series of filters to separate the signal into 32 frequency bands. This allows the

encoder to control the accuracy of the encoding process independently for each frequency.

3. Reduce the sampling rate of each band. For signals limited to a narrow frequency band, the

sampling rate can be significantly reduced without any loss of information. For a detailed

explanation and proof of this, see Bosi and Goldberg, pages 80-84.

4. Compute the masking effect for each frequency band and reduce the sample depth for

heavily masked frequencies. The most audible frequencies should be encoded with the

greatest accuracy possible but for frequencies that would not be clearly perceptible the

sample depth can be significantly reduced without a noticeable loss of signal quality.

Typically, the original signal has 16 bits per sample but for heavily masked frequencies,

much lower precision may be adequate (a sketch of this idea follows the list).

5. Apply a Huffman Coding lossless compression to further reduce the data rate. Huffman

coding is similar to the algorithm in the ubiquitous PKZIP format. It reduces file size

without any data loss by identifying the most frequently used patterns and replacing them

with shorter bit sequences.

An MP3 encoder with typical settings achieves compression by a factor of 10 for music files.
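A crude MATLAB sketch of the idea behind step 4: heavily masked bands tolerate coarser quantization. The band data and bit depth here are arbitrary illustrations; real MP3 bit allocation is far more elaborate (see Bosi and Goldberg):

% Minimal sketch of reduced sample depth for a masked band.
band = randn(1, 64);                  % stand-in samples for one band
bits = 4;                             % reduced precision (vs. 16 originally)
q = max(abs(band)) / 2^(bits - 1);    % uniform quantizer step size
band_q = q * round(band / q);         % quantize; error is inaudible if masked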

PRONY’S METHOD

While researching models of gas dynamics in 1795, Baron Gaspard Riche de Prony developed an exact method for fitting a model of $p$ exponential functions to a dataset of $2p$ observations. His


method was later generalized to allow models containing sinusoidal functions and to accommodate

complex-valued input. In present-day usage, the idea is rarely used to fit exact models – instead a

method of least-squares approximation is used to fit a model of a small number of complex

exponential functions to a relatively large set of samples.

Given samples $x[1] \ldots x[N]$, the Prony method estimates complex values for the parameters $h_k$ and $z_k$ to minimize

$$\rho = \sum_{n=1}^{N} \left(x[n] - \hat{x}[n]\right)^2$$

where

$$\hat{x}[n] = \sum_{k=1}^{p} h_k z_k^{\,n-1}$$

Prony’s method is especially useful when an appropriate value for p, the number of exponentials in

the model, is known. There are several algorithms for estimating p based on trying the method

several times and comparing results, but the estimation is difficult because $\rho$, the total squared error, is non-increasing as $p$ approaches $N$. (The estimation becomes exact when $p = N/2$.)

In modern usage, where the number of functions in the model is much less than the number of data

points, it is important to consider the effects of noise. Prony’s method is quite resistant to the

presence of noise if it is evenly distributed in frequency (white noise) but its accuracy is not as good

for narrowband noise. If the signal is relatively constant over a longer period of time we can

distinguish the signal from the noise by looking for correlations between samples taken at different

times.
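For concreteness, here is a compact least-squares Prony sketch in MATLAB. It follows the standard two-step procedure (linear prediction for the $z_k$, then least squares for the $h_k$); the test signal and model order are illustrative, and this is not the paper's own code:

% Minimal least-squares Prony sketch: fit p complex exponentials
% x_hat[n] = sum_k h_k * z_k^(n-1) to the samples x[1..N].
n = 0:31;
x = 0.9.^n .* cos(2*pi*0.1*n);       % test signal: one damped sinusoid
p = 2; N = numel(x);                 % a damped sinusoid = 2 complex exps
% Step 1: linear-prediction coefficients by least squares.
A = toeplitz(x(p:N-1), x(p:-1:1));   % rows are [x(n-1) ... x(n-p)]
a = A \ x(p+1:N).';                  % predict x(n) from the p prior samples
% Step 2: roots of the prediction polynomial are the z_k.
z = roots([1; -a]);
% Step 3: amplitudes h_k by least squares against the full record.
V = z(:).' .^ ((0:N-1).');           % V(n,k) = z_k^(n-1)
h = V \ x(:);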

Prony’s method is not an audio CODEC but it does provide a means for polyphonic pitch detection.

Its implementation is completely different from that of our own method and the details of it are


outside the scope of this paper but it is worth mentioning because both algorithms compute

parameters to fit a sum-of-sinusoids model to a set of periodic data.

PRINCIPAL COMPONENTS ANALYSIS FOR DATA REDUCTION

Prior to beginning research on the method described in this paper, we experimented with using

principal components analysis as a method for audio data compression. The results were

acceptable but neither the compression ratio nor the quality was an improvement over existing

compression schemes and the data format was conducive to neither musical analysis nor synthesis.

We mention it here because it may still be helpful, when combined with the present incarnation of

our CODEC, for either increasing data compression or for feature recognition.

Principal components analysis is a multivariate statistical technique for identifying the most

significant linear combinations of factors in a dataset:

Suppose we have a set of $m$ $n$-dimensional multivariate observations represented by the $m \times n$ data matrix $X$. Let $U \Sigma V^*$ be the Singular Value Decomposition of $X$. The elements $\sigma_1 \ldots \sigma_n$ along the diagonal of $\Sigma$ are called the singular values of $X$. If we let $u_1 \ldots u_m$ denote the $m$ columns of $U$ and let $v_1 \ldots v_n$ represent the $n$ rows of $V^*$, then we can write $X$ in terms of the orthogonal set of matrices $Z_i = u_i v_i$ as follows:

$$X = \sum_{i=1}^{r} \sigma_i Z_i$$

The singular values $\sigma_i$ are arranged along the main diagonal of $\Sigma$ in non-increasing order, so that $\sigma_1 Z_1$ is the most significant term in the sum shown above, followed by $\sigma_2 Z_2$ and so on. The term that accounts for the least significant portion of the variance of the dataset is $\sigma_r Z_r$, where $r$ is the rank of $X$.


Suppose we wanted to reduce the file size necessary to store $X$. We could leave out some of the less significant $\sigma_i Z_i$ terms from the sum, often without any perceptible effect on $X$. This is especially true when $X$ contains noise. For some audio recordings, the vast majority of the desirable part of the signal resides within the first few terms of the sum while the remaining terms are mostly noise. In some cases the signal to noise ratio improves after applying a judicious amount of this type of data compression (Meyer, pp. 412-418).

In order to actually reduce the space $X$ occupies on the disk, we transform it to a new basis in which the columns of $U$, called the left singular vectors, are the standard basis vectors. In order to do this, we begin with the statement of the SVD:

$$X = U \Sigma V^*$$

Recalling that $U$ is unitary, we write

$$Y = U^* X = \Sigma V^*$$

where $Y$ is a representation of $X$ relative to the basis of left singular vectors. Now, if we want a low-rank approximation to $Y$, we let $\Sigma_k = \operatorname{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)$, so that $\Sigma_k$ has lower rank than $\Sigma$. Then $Y_k = \Sigma_k V^*$ is a rank-$k$ approximation to $Y$.

The advantage of this approximation is that $Y_k$ contains a row of zeros for every $\sigma_i$ that we removed from $\Sigma$ when we defined $\Sigma_k$. We do not need to waste disk space storing those zero rows, so we simply make a note that they were there and then reduce the size of $Y_k$.
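In MATLAB the whole procedure takes only a few lines. The data matrix below is a random stand-in with a decaying spectrum, not audio, and the choice k = 3 is arbitrary (for real-valued data, V' and V* coincide):

% Minimal PCA-style data reduction via the SVD.
X = randn(100, 8) * diag([10 5 2 1 0.5 0.2 0.1 0.05]);  % synthetic data
[U, S, V] = svd(X, 'econ');
k  = 3;                         % number of singular values to keep
Yk = S(1:k, :) * V';            % the k nonzero rows of Y = U'*X = S*V'
Xk = U(:, 1:k) * Yk;            % rank-k reconstruction of X
err = norm(X - Xk) / norm(X);   % relative error from the discarded terms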

It has been noted that the singular vectors are a generalized Fourier series. Indeed, the first few

singular vectors, when plotted, resemble the low-order sine and cosine terms of a standard Fourier

series basis. The higher terms, however, bear a diminishing resemblance to anything at all.

Because of this, we found it difficult to do any meaningful analysis or transformation based on PCA.


Ordinary PCM sampled data is so different in nature from the way we perceive sound, and from the

way we generate it, that it is only with great difficulty that linear transformations can accomplish

anything musical. But linear statistical techniques sometimes become more useful after a non-

linear transformation is applied. With this in mind, we set out to define a more perceptually

meaningful representation for audio signals.


3 - GOALS AND ASSUMPTIONS

DESIRABLE QUALITIES

Typically, the purpose of an audio CODEC is to reduce the size of a data file or lower the bit rate of a

stream. The algorithm we describe in this paper is intended for quite a different purpose: to

transform the data into a format where various types of analysis and transformations become

simple. In order to make it suitable for that purpose, we examine the deficiencies of existing

methods and consider how we might avoid them.

POLYPHONIC PITCH DETECTION

From a musician’s perspective, it is important to know what pitches are present in a sound sample.

Humans can learn to isolate the sound of an individual instrument in a recorded sound, to identify

the notes that instrument plays, and to replicate the sound by playing the same notes on another

instrument. A physical object in vibration, such as a piano string, produces a proliferation of

harmonic frequencies. As a result, the process of identifying the sound of a specific key on a piano

can be much more complicated than simply identifying the frequency of the sound. Nevertheless, if

we can estimate the strongest of those harmonics we are in a much better position to guess which

keys the pianist is pressing.

A monophonic composition is a musical piece for which, at any time during the performance, there

is never more than one fundamental pitch. Most wind instruments such as the flute or the

saxophone are only capable of playing one note at a time. Any piece in which such an instrument

plays unaccompanied will be monophonic. Stringed instruments such as the guitar or the piano are

capable of playing several notes simultaneously and therefore they can produce polyphonic sounds

without accompaniment.


For a monophonic signal, $f(t)$, we say that we have detected the pitch on a given time interval when we have found parameters $a$, $\theta$, and $\phi$ that minimize $\|a \sin(\theta t + \phi) - f(t)\|$. For polyphonic signals, we need to find parameters for amplitude, frequency, and phase ($a_k$, $\theta_k$, and $\phi_k$) to minimize the following:

$$\left\| f(t) - \sum_{k=1}^{n} a_k \sin(\theta_k t + \phi_k) \right\|$$

In practice, this can only be done in rough estimation because $n$, the number of terms in the series, is unknown.
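In the monophonic case the minimization can be carried out directly: sweep a grid of candidate frequencies and, at each one, solve a small least-squares problem for amplitude and phase (a sin/cos pair absorbs the phase). The MATLAB sketch below uses arbitrary test parameters and is far slower than practical pitch detectors:

% Minimal monophonic pitch detection by grid search over frequency.
fs = 8000; t = (0:799).'/fs;               % 0.1 s of samples
f  = sin(2*pi*220*t + 0.3);                % "unknown" input tone
freqs = 50:0.5:1000;                       % candidate frequencies (Hz)
err = zeros(size(freqs));
for i = 1:numel(freqs)
    B = [sin(2*pi*freqs(i)*t), cos(2*pi*freqs(i)*t)];  % basis at this freq
    c = B \ f;                             % best a*sin + b*cos fit
    err(i) = norm(f - B*c);                % residual of the fit
end
[~, idx] = min(err);
f_est = freqs(idx);                        % estimated pitch, near 220 Hz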

Most existing perceptual audio coders derive somehow from FFT2 methods. The obvious advantage

of the FFT is its computational speed and simplicity of implementation. Musicians describe sound

as a sum of oscillations at a small, finite number of frequencies. In acoustical research, the FFT is

often used as a means for identifying the frequencies and amplitudes of these oscillations. But the

result of a Fourier transform can be quite different, conceptually, from the result that musicians

expect.

In its analytical form, the Fourier transform is usually stated as an improper integral with bounds from $t = -\infty$ to $t = \infty$. But discrete Fourier transforms are always computed on a windowed signal,

that is, a signal that has been divided into short sections of samples. Although, for computational

reasons, the window size is usually chosen to be an integer power of two, in theory, it is arbitrary.

Since Fourier transforms are sometimes used for pitch detection, it is natural to say that the Fourier

series of a sound represents the constituent frequencies of that sound. It is important to consider in

what sense this is true. It has been proven that the Fourier series for any appropriately band-

limited continuous signal converges to that signal but when musicians use Fourier analysis for pitch

2 Fast Fourier Transform


detection they are not looking for a series that converges to the original signal. Instead, they want

to estimate the resonant frequency of the vibrating object that produces the sound. Before we can

extract that kind of information for a Fourier series we must carefully consider the effects of

windowing.

The following figures show Fourier series for the function $\sin(2t)$. Computing the series on a

finitely bounded interval (figure 2), we can see that when the length of the interval is an integer

multiple of the period of the signal, the series contains only one non-zero term corresponding

exactly to the input function. Otherwise, the series contains an infinite number of terms and in

some cases does not approximate the original signal well unless we compute a large number of

them. In the second set of plots we see that when we change the length of the interval from 2π to 7

we get a series with a large number of high-amplitude terms that still fails to approximate the input

at the boundaries even after we carry it out to twelve terms.

If we had used the second series to estimate the wavelength of the input function we would

probably guess that it was near 3.5 but we would have no way of guessing how many other

frequencies might be in the signal.

The fourth plot (figure 3) shows eight terms in the Fourier series for $f(t) = \sin(2t) + \cos(3t)$ on $(0,7)$.

In this case the Fourier series tells us only that a large portion of the energy is focused at the lower

frequencies. It gives no indication that the input was the sum of only two sinusoids.

FIGURE 2 - FOURIER SERIES OF sin(2t): ON (0, 2π) THE SERIES CONVERGES WITH ONLY ONE TERM; ON (0, 7), SHOWN UP TO 12 TERMS, WITH INDIVIDUAL TERMS OF THE SAME SERIES

Page 24: A Digital Audio Coder Based on a Model of Human Hearingande2213/filez/paper/MainPaper.docx  · Web viewMost of the world encountered digital music for the first time with the invention

Polyphonic pitch detection, that is, identification of musical tones in signals with more than one

fundamental frequency, is a problem that is not well handled by even the best software.

FIGURE 3 - TERMS OF THE FOURIER SERIES FOR A POLYPHONIC INPUT

MUSICALLY WELL-PLACED BASIS FUNCTIONS

The terms of the Fourier series on a given interval form an

orthogonal basis for the vector space of continuous

functions on that interval. Although it can be generalized to

allow an infinite variety of bases, the frequency resolution

of Fourier analysis is always limited by the orthogonality

condition. Whenever we require increased frequency

resolution, we must increase the length of the interval of

our analysis. A typical FFT window for pitch detection

contains 2000 – 4000 samples. That represents between

1/20 and 1/5 of a second, depending on the sample rate.

Later in this paper, we will show that those windows are

much too long for transient elements of a typical music

recording.

In the graph on the left, the frequencies of the terms of a

Fourier series are plotted relative to the frequencies of the

keys on a piano keyboard. The frequencies of the Fourier

terms are in an arithmetic series but musical pitches follow

a geometric series. Consequently, the Fourier series

concentrates the vast majority of its frequency resolution at

the high end of the keyboard. If we choose a sufficiently

large window to give adequate resolution at the lower

frequencies we waste computational effort and get unnecessarily high resolution at the other end.


An ideal representation for musical analysis of audio data should distribute its frequency resolution

evenly over the keyboard, in other words, the frequencies of the basis functions should follow a

geometric series. In Fourier analysis, this would imply a violation of the requirement that the basis

should be a set of orthogonal functions.
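The mismatch is easy to quantify. Using the standard equal-tempered pitch formula (key 49 = A4 = 440 Hz) and a typical FFT configuration, both chosen here only for illustration, a short MATLAB computation shows that no bin separates the two lowest piano keys while roughly two hundred bins crowd the top octave:

% Arithmetic FFT bins vs. geometric piano-key frequencies.
fs = 44100; Nfft = 4096;                       % typical analysis window
bins = (0:Nfft/2) * fs / Nfft;                 % bin centers: arithmetic series
keys = 440 * 2.^(((1:88) - 49) / 12);          % piano keys: geometric series
% Bin spacing is fs/Nfft ~ 10.8 Hz, wider than the 1.6 Hz gap between
% the lowest two keys (27.5 and 29.1 Hz), yet ~190 bins span the top octave.
nBass = nnz(bins > keys(1) & bins < keys(2));  % 0 bins
nTop  = nnz(bins >= keys(76) & bins <= keys(88));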

FREQUENCY-DEPENDENT TIME RESOLUTION

Up to this point we have discussed frequency analysis without regard to precision in the time

domain. But pitch detection has only a very limited musical utility if it doesn’t identify the pitch at

the correct time. In the previous section we mentioned that we can improve frequency resolution

by increasing the length of the analysis window. For FFT based algorithms, the inevitable effect of

increasing the length of the analysis window is a decrease in time resolution.

From a Fourier transform perspective of time-frequency analysis, the precision to which we can

define an event in time-frequency space is limited by the uncertainty principle. A mathematical

statement of this principle is given by the following relation:

$$\sigma_\omega\,\sigma_t \ge \eta$$

where $\sigma_\omega$ is the bandwidth of the event, $\sigma_t$ is the time duration, and $\eta$ is a constant that depends on characteristics of the window function.
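As a concrete instance (a standard Fourier-analysis result, not something derived in this paper): when $\sigma_t$ and $\sigma_\omega$ are defined as root-mean-square widths of $|g(t)|^2$ and $|\hat{g}(\omega)|^2$, the constant is $\eta = \tfrac{1}{2}$ and a Gaussian window attains the bound exactly:

$$g(t) = e^{-t^2/(2\sigma^2)} \;\Rightarrow\; \hat{g}(\omega) \propto e^{-\sigma^2\omega^2/2}, \qquad \sigma_t = \frac{\sigma}{\sqrt{2}}, \quad \sigma_\omega = \frac{1}{\sigma\sqrt{2}}, \quad \sigma_t\,\sigma_\omega = \frac{1}{2}.$$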

Although the derivation of this inequality follows from Schrödinger’s generalized proof of

Heisenberg’s uncertainty principle for quantum physics, uncertainty for Fourier analysis of audio

signals has nothing to do with probability because no quantity is estimated. A Fourier transform

does not predict or estimate the frequencies in a signal; it simply transforms the data to an

alternative representation in a different basis. Therefore the quantity $\sigma_\omega$ should not be thought of

as the standard deviation of an estimate of the frequency; it is simply the bandwidth of the

transformed representation of the sound. (Cohen, pp. 44-52, 88)


In this project, however, we are interested in estimating both the time and frequency of each event

in the signal. We interpret the uncertainty principle in terms of probability, similar to Heisenberg.

This will become clearer mathematically when we discuss the particulars of our implementation

but for now we offer only an intuitive explanation:

For musicians, the concept of pitch is inherently probabilistic. For every acoustic situation there is

a perceptual limit of frequency resolution and that limit depends on the frequency of the sound.

This is a well known phenomenon to anyone who has ever played an upright bass or a bass guitar.

When the bassist plays pizzicato3, his tuning can be quite far off; in fact, he may even be playing the

wrong note entirely; still it can be quite difficult to hear that there is any problem. But as soon as he

plays a note that is sustained for a longer period of time or begins to play with a bow his imprecise

intonation creates a very sour sound in the ensemble. This is not the case for high pitched

instruments. A flautist, for example, must be much more careful about tuning, even for notes of

short duration.

Consider the following two sinusoidal graphs:

3 Pizzicato: played by plucking the strings with the fingers rather than bowing them.


On hearing the first sample, a musician will

recognize it as a tone and he or she will be able to identify the name of the note. Although the

second sample lasts the same length of time, the listener will not be able to identify its pitch; in fact,

he will perceive a “thump” instead of a tone. The longer the wavelength of a sound, the more time it

takes for a human to perceive the pitch. The converse of this is also true. That is, higher pitches

may still be perceptible even if they change quickly or last only a short time.

We take this into account in our CODEC design in two ways: First, we plan that at high frequencies,

our time resolution should not be less than that of typical human hearing. Second, for low

frequencies we may significantly reduce our resolution without causing any perceptual error. In

that way we can reduce the need for data storage and computational effort.

REAL TIME OPERATION

The ability to function in real time is not a requirement for audio CODECS but it does increase the

range of possible applications. Pitch detection and transformation are useful in live musical


performance, and data reduction is necessary for communications technologies such as digital

cellular phones and voice-over-IP software.

The process of encoding and decoding an audio signal can be conceptualized in three phases as

shown in the following diagram:

Typically, the encoding phase is more time consuming than the decoding phase and the time

required for the storage or transmission phase depends more on the speed of the transmission

medium than on the speed of the CODEC.

If the decoder is not fast enough to send the signal directly to the sound output device then it must

save its output to another file format that can play back in real-time. That renders the whole

system useless or at least too cumbersome for practical purposes. Therefore we require that the

decoding phase should be fast enough for real-time operation.

As we will see in a later section about design restrictions, we have restricted our encoder algorithm

to the use of functions that are theoretically capable of real-time operation.

4 - THEORETICAL BASIS OF THE ALGORITHM

In order to meet the objectives mentioned in the previous chapter we are forced to sacrifice some of

the precision of Fourier analysis and adopt a more intuitive and perceptual approach. In place of

mathematical analysis, we take the human auditory system as our example and as our rubric.

The study of sensory perception becomes an increasingly ill-defined science as the focus of the

investigation shifts away from gross anatomy of the sense organs and into the subtleties of


psychology. A central question in the field of Psychoacoustics is “What electrochemical format does

the brain use to represent the sounds it perceives?” In efforts to answer that question, audiologists

use electrical probes to measure the electrical response of individual auditory neurons as the test

subject listens to various sonic stimuli but of course, with more investigation, the question becomes

more complicated.

The brain is a highly-distributed computing environment - not easily represented by the kind of

data-path flow charts we might use to describe a computer program. Unlike computer software,

biological sensory perception processes never reach a point that can be considered “output”; they

receive input from the sense organs and begin processing the data, moving it quickly through an

ever-widening data-path that never reaches any objective point of completion. That is, there is no

part of the brain that can be considered the ultimate observer of sense experience.

This creates a problem for anyone who would try to understand sensory psychology: Even if we are

completely aware of all electrochemical processes involved, in what sense does that imply

understanding?

For this reason we must exercise caution when incorporating any aspect of psychoacoustics into

our CODEC. We do not intend to produce an accurate computational model of the auditory cortex

or even to include all known properties of human hearing into our algorithm. Instead, this project

stands somewhere between an analytical perspective and a psychoacoustic perspective.

We intend to employ our knowledge of psychoacoustics in three ways: First, to aid us in defining

perceptual measures of the accuracy of our algorithm - insofar as our CODEC produces perceptible

signal degradation, we consider it to be in error; whenever it does not, we declare it “good enough”.

Second, it makes us aware of what is possible – if a human listener can detect a difference between

tones of 400 Hz and 401 Hz, but our program cannot, then there must be a way to improve our method.

Finally, it gives us occasional inspiration regarding computational methods we might use.


ANATOMY OF THE AUDITORY SYSTEM

FIGURE 4 - FLOWCHART OF MAJOR HEARING EVENTS (GULICK, P. 74)

Our research focuses on the last two stages of the flowchart of figure 4, especially the filtering

function of the internal ear. Figure 5 shows a simplified sketch of the physical parts responsible for

this process. Of primary interest is the cochlea, the spiral-shaped organ on the right side of the

illustration. This is the location of the hearing transducer cells and it is believed to be the organ

responsible for separating the incoming sound into component frequencies.

FIGURE 5 - A CROSS SECTIONAL DIAGRAM OF THE EAR (VON BEKESY, P. 11)



Figure 6 shows an idealization of the cochlea and basilar membrane uncoiled from its spiral shape

and straightened into a long tube. This kind of reconfiguration is believed to preserve the

resonance characteristics of the organ while facilitating modeling and visualization. The cochlea is

divided into two compartments by the basilar membrane and completely filled with fluid. The

primary sensory transducer cells are located on the membrane itself.

FIGURE 6 - A STRAIGHTENED-OUT MODEL OF THE COCHLEA AND BASILAR MEMBRANE (FLETCHER, P. 47)

THEORIES OF TIME-FREQUENCY ANALYSIS IN HUMAN HEARING

Over the past two hundred years, theories of hearing have followed loosely after two conflicting models.

The Resonance-Place Theory, suggested by Helmholtz in 1863, said that the basilar membrane

consisted of an array of fibers acting as tuned resonators, each having a unique frequency of

maximum response. Sensory nerves attached to each fiber responded to vibrations by sending out

electrical impulses. He supposed that the brain

identified the frequency of sounds by identifying the

points along the membrane where displacement was

at a local maximum. Although this theory was

intuitively attractive it depended on some faulty

assumptions. Helmholtz proposed that the fibers

were under considerable tension in the transverse

FIGURE 7 - HELMHOLTZ'S THEORY OF RESONATING FIBERS IN THE BASILAR MEMBRANE

direction but that the tension of the membrane itself in the longitudinal direction was negligible.

Later research showed that the tension of the membrane is roughly equal in both directions.

Another difficulty is related to the range of frequencies we perceive; the variation in the length and

mass of the fibers is not enough to account for the observed variation in frequency sensitivity of the

hearing sense (Gulick, pp. 60-62).

In 1886 Rutherford proposed a theory that completely ignored the supposed resonance properties

of the basilar membrane. Called the Frequency Theory, it suggested that frequency separation was

strictly a function of the central nervous system. More recent evidence shows that the auditory

nerve transmits a complete and accurate electrical representation of the input sound. In one case, researchers observed that amplified neural signals from the ear of a cat were perceptible as speech, confirming that the nerves respond to much more than the location of maximum displacement of the basilar membrane (Gulick, p. 69). An obvious difficulty with this theory is its inability to explain the purpose of the

particular shape of the cochlea. Experiments have proven that its design guarantees that every

frequency of vibration within the range of hearing does correspond to a unique location of

maximum displacement in the basilar membrane.

Aspects of both theories persist among modern explanations but the currently prevailing opinions

center around an alternative theory proposed by Georg Von Bekesy in 1928. His contribution,

called the Traveling-Wave Theory, says that it is the structure of the membrane itself, not the

tension of fibers running across it, that accounts for the frequency-dependent location of maximum displacement.

FIGURE 8 - DIAGRAM SHOWING FREQUENCY-DEPENDENT LOCATION OF MAXIMUM DISPLACEMENT ALONG THE BASILAR MEMBRANE (GOLDBERG, P. 173)

In the 1950s, Bekesy built a large-scale experiment resembling the model in figure 6

that produced vibrations in a rubber membrane such that the forearm of a researcher could

substitute for the nerve sensors in the basilar membrane. This apparatus, shown in figure 9,

consists of a plastic tube cast around a brass tube. The tubes are sealed and filled with fluid. On the

top edge is a rubber strip that allows the researcher to feel vibrations from inside the tube. On the

end, a piston driven by a mechanical oscillator produces pressure waves in the fluid. This

experiment was one of many used to verify Bekesy’s traveling-wave theory of frequency perception

(Von Bekesy, 1960).

In the model shown in figure 9, the skin of the forearm senses the location and intensity of vibration

in the rubber strip. In a sense, the person in the picture is “hearing” the vibration of the piston

through the nerves of his arm. We can take it for granted that the nerves on the forearm are not

sensitive enough to translate unfiltered vibrations into any sensory perception that resembles

hearing. If they were, then people born with total hearing loss could learn to hear by placing their

hand on the vibrating cone of a loudspeaker. But the cone of a loudspeaker doesn’t filter particular

frequencies into unique locations like the basilar membrane does. Bekesy’s theory raises an

interesting question: Is the frequency-dependent location of maximum basilar membrane

displacement enough to account for the human frequency discrimination ability? Or, to put it

another way: if the cochlea causes each frequency to stimulate a unique set of nerves, what's left for

the brain to do?


FIGURE 9 - BEKESY'S MODEL OF THE COCHLEA (VON BEKESY, P. 546)

As an incidental outcome, our project provides an answer to that question. Although our numerical

model resembles Helmholtz’s array of tuned resonators more than Bekesy’s Traveling-Wave

theory, all three share an important likeness: they disregard the phase of the input signal, using

only location-intensity information to identify pitches. Mathematically, they measure frequency

and amplitude as real numbers. Since our CODEC also disregards the phase of the input, it

demonstrates the quality of perception that would be possible if the human auditory system were

sensitive to only two quantities: the intensity and location of vibrations in the basilar membrane. Of

course, perceptual deficiencies in our own results do not imply inadequacy of the Traveling-Wave

theory but the success of our codec demonstrates that it is possible to make an accurate analysis of

sound based only on the amplitude and location of displacement of the basilar membrane.


SPECTROGRAMS FOR TIME-FREQUENCY ANALYSIS

A spectrogram is a graphical representation of a sound that shows frequency, intensity, and time. It

is either colored or plotted in 3D to demonstrate all three quantities simultaneously. An ordinary

two-dimensional spectral-analysis graph of a chirp (figure 10, left) shows the frequencies in the

sound and their respective amplitudes but it gives no indication of when those frequencies appear.

On the right side of figure 10, we see the spectrogram of the same sample. This time we can clearly

see that the sound consists of a single tone that began at a high frequency, swept down to a low

pitch, and rose again following a quadratic curve.
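The distinction is easy to reproduce. The following sketch (ours, using NumPy/SciPy with assumed parameters; not code from this project) computes both views of a quadratic chirp:

import numpy as np
from scipy.signal import chirp, spectrogram

fs = 8000                                   # assumed sample rate, Hz
t = np.arange(0, 2.0, 1.0 / fs)
x = chirp(t, f0=2000.0, f1=200.0, t1=2.0, method="quadratic")

# Spectrum: one magnitude curve for the whole signal; no timing information.
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), 1.0 / fs)

# Spectrogram: short-time spectra laid out along a time axis.
f, frames, Sxx = spectrogram(x, fs=fs, nperseg=512)

print(freqs[np.argmax(spectrum)])           # strongest frequency overall
print(f[np.argmax(Sxx, axis=0)][:4])        # dominant frequency in early frames

The spectrum collapses the sweep into one static curve, while the spectrogram's frame-by-frame maxima trace the chirp's descent and rise in time.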

4 Pizzicato is a musical term for notes played by plucking rather than bowing: for orchestral strings the player touches or plucks the strings with his or her fingers instead of using the bow.


Its name seems to imply that a spectrogram is merely a type of plot, a visualization of sound, but

some types of spectrograms are reversible. In other words, the graphical output represents the

data so exactly that we can convert the image back into sound and reconstruct the input. The first

part of the method we present in this paper can be classified as a type of spectrogram and the

explanation of the remaining part requires the use of similar time-frequency analysis terminology.

Therefore it is appropriate to take a few pages here to outline the mathematics of spectrograms and

time-frequency analysis.


FIGURE 10 - A SPECTRUM (LEFT) AND A SPECTROGRAM (RIGHT) OF A CHIRP


We begin by defining s(t), the amplitude of a signal at time t. S(ω) is the amplitude of the frequency ω over the entire duration of the signal. The two are related in the following way⁵:

$$S(\omega) = \frac{1}{\sqrt{2\pi}} \int s(t)\, e^{-j\omega t}\, dt$$

and

$$s(t) = \frac{1}{\sqrt{2\pi}} \int S(\omega)\, e^{j\omega t}\, d\omega$$

In the language of signal processing S(ω) is sometimes called the frequency-domain representation, while s(t) is called the time-domain representation of the signal. It is easy to switch between the

two by means of Fourier transforms but with both representations there is a certain deficiency: the

time-domain representation tells us instantaneous amplitude with perfect accuracy but says

nothing about what frequencies are in the signal. Conversely, the frequency-domain representation

gives no insight into the timing of events within the signal.

Ideally, we would like to have both kinds of information simultaneously in the form of a function

P(t, ω) from which we could compute the intensity of energy in the signal at time t and frequency ω.

Borrowing from the language of probability, this is called the joint energy-distribution-function

(EDF) of the signal. Spectrograms are one type of algorithm that computes an approximation of

P(t, ω).

If we want to compute the energy in the signal in the two-dimensional interval ω₀ < ω < ω₁, t₀ < t < t₁, then we integrate P(t, ω) as follows:

$$E = \int_{\omega_0}^{\omega_1} \int_{t_0}^{t_1} P(t, \omega)\, dt\, d\omega$$

5 This chapter summarizes the relevant information from (Cohen, 1995).


The functions s(t) and S(ω) are related to P(t, ω) through intensity. Intensity is defined as the square of amplitude, so that |s(t)|² is the intensity per unit time and |S(ω)|² is the intensity per unit frequency. These two measures of intensity are known as marginal energy-distribution-functions because of the following analogues to marginal probability-distribution-functions:

$$\int P(t, \omega)\, d\omega = |s(t)|^2$$

$$\int P(t, \omega)\, dt = |S(\omega)|^2$$

Any approximation of P(t, ω) for which these equations hold is said to "satisfy the marginals".
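Integrated over all time and all frequency, the two marginals each give the total energy of the signal, so any distribution that satisfies them also obeys Parseval's relation. A quick numerical check of that energy bookkeeping (a sketch assuming NumPy; not part of our CODEC):

import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal(1024)               # arbitrary test signal
S = np.fft.fft(s)

energy_time = np.sum(np.abs(s) ** 2)        # total of |s(t)|^2 over time
energy_freq = np.sum(np.abs(S) ** 2) / len(s)   # total of |S(w)|^2 (NumPy's
                                                # unnormalized FFT needs 1/N)
print(energy_time, energy_freq)             # equal up to rounding error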

Many widely-used time-frequency analysis techniques do not satisfy the marginals. Those that do tend to introduce artifacts that confuse the analysis.

Figure 11 shows an example of a distribution that

satisfies the marginals but exhibits significant

inaccuracies whenever the input is polyphonic.

There are infinitely many ways to construct an EDF.

Especially if we do not require satisfaction of the

marginals, mathematics can tell us very little about

how it should or should not be done. The

construction of methods to compute an EDF is a pragmatic science, guided by intuition, analogy,

and necessity. In order to define our own procedure, we refer back to the goals of our project and

expand the ideas mentioned there. We would like our EDF to have the following properties:

Real time operation: Many time-frequency analysis techniques incorporate information

from the entire signal into the computation of the value at a given time. Our technique

processes the time-domain signal, s(t), in order to compute P(t, ω). We require that the value of P(τ, ω) may depend only on values of the function s(t) for t ≤ τ. In other words, we should process the signal data file in order from beginning to end, possibly using information from earlier in the file but never looking ahead of the current position.

FIGURE 11 - EDF FOR THE SUM OF TWO CHIRPS AS GIVEN BY THE WIGNER DISTRIBUTION. NOTICE THE LARGE ARTIFACTS BETWEEN THE TWO MAIN LOBES. (COHEN, P. 127)

Strictly Non-negative Energy: Although the analogy between energy distribution

functions and probability distribution functions suggests that this should be a requirement,

some time-frequency algorithms do not guarantee that P(t, ω) is positive over the whole

domain. The Wigner distribution shown in fig. 11 suffers from this problem – the spurious

values between the two main ridges of the chirps contain many negative values.

Mathematically, there is nothing wrong with negative values of energy; in fact, the Wigner

distribution is perfectly invertible. But the negative values for energy don’t make sense

according to physical intuition. Furthermore, since we don’t perceive any frequency

between the two chirps it is upsetting to see such large spikes located there.

Local-priority estimate: For the EDFs given by some techniques the relationship between

the original signal and the EDF doesn’t necessarily follow intuitive principles of locality. So,

for example, the value of P(τ, ω) may be quite large even though s(t) = 0 in the neighborhood of τ (that is, for τ − ε < t < τ + ε with significantly large ε).

Figure 12 shows a graphical example of this type of problem. On the far left is a graph of

the time-domain representation of the signal. The EDF given by the Wigner distribution is

shown in (a). For this distribution there is a huge artifact in the period of silence between

the two tones. This is an example of a failure to exhibit intuitive temporal locality. Part (b),

the Margenau-Hill distribution, avoids contaminating silence with noise but shows aliasing

in the frequency domain. The Page running-spectrum distribution (c) has some of the same problems as Margenau-Hill except that, since its output at time t only considers the signal up to that time, the frequency-domain aliasing only moves in one temporal direction.


Our method strives for a local-priority estimate of the energy, that is, the value of P(t, ω) is most heavily influenced by the values of s(t) and S(ω) in the neighborhood of t and ω. The influence of s(t) and S(ω) on the EDF at the point (τ, w) decreases as the distances |t − τ| and |ω − w| increase. We call this local-priority estimation because although P(t, ω) is influenced by events in the signal that are not in the vicinity of (t, ω), nearer parts of the signal always have greater priority of influence than those farther away.

FIGURE 12 - LOCALITY ISSUES IN THREE DISTRIBUTIONS FOR THE SAME SIGNAL (COHEN, P. 177)

A local priority estimate is achieved by multiplying the signal by a window function that limits the

influence of parts of the signal that are far from the area of interest. Consider the Page running-

spectrum distribution (demonstrated in part c of figure 12):

$$P^-(t, \omega) = 2\,\mathrm{Re}\!\left[\frac{1}{\sqrt{2\pi}}\, s(t)\, e^{-j\omega t}\, \bigl(S_t^-(\omega)\bigr)^{*}\right]$$

where S_t^-(ω) is the frequency-domain representation of the signal up to time t:

$$S_t^-(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} s(\tau)\, e^{-j\omega\tau}\, d\tau$$


It is clear from this representation why the distribution is zero in the time corresponding to the silent parts of the signal shown in figure 12; the s(t) term in the definition prevents the EDF from being non-zero whenever s(t) itself is zero. But the infinite lower bound on the integral in the definition of S_t^-(ω) implies that non-zero parts of the signal from the past will contaminate non-zero parts of the EDF in the future, and that the contamination will continue indefinitely. A natural solution is to limit the influence of past events by computing S_t^-(ω) with a finite lower bound on the integral. This is precisely the motivation for using a spectrogram.
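To make the running spectrum concrete, here is a small sketch (our own discretization, assuming NumPy) that computes S_t^-(ω) causally with a cumulative sum and shows how a tone early in the signal contaminates the running spectrum long after the signal has gone silent:

import numpy as np

fs = 1000                                   # assumed sample rate
w = 2 * np.pi * 50.0                        # analysis frequency, rad/s
t = np.arange(0, 2.0, 1.0 / fs)
s = np.where(t < 1.0, np.sin(w * t), 0.0)   # one second of tone, then silence

kernel = np.exp(-1j * w * t) / np.sqrt(2 * np.pi)
S_run = np.cumsum(s * kernel) / fs          # S_t^-(w) at every sample time

print(abs(S_run[len(t) // 2 - 1]))          # magnitude when the tone ends
print(abs(S_run[-1]))                       # unchanged after a second of silence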

In general form, the EDF of a spectrogram is given by

$$P(t, \omega) = \left| \frac{1}{\sqrt{2\pi}} \int e^{-j\omega\tau}\, s(\tau)\, h(\tau - t)\, d\tau \right|^2$$

where h(τ − t) is a window function that typically has the following characteristics:

$$h(\tau - t) = \begin{cases} 1 & \text{when } \tau = t \\ \text{between } 0 \text{ and } 1 & \text{when } \tau \text{ is near } t \\ 0 & \text{when } \tau \text{ is far from } t \end{cases}$$

The definition of the function h has a significant effect on the properties of the EDF given by the spectrogram.

One such property of fundamental importance is the time-frequency resolution. Recall

from our discussion about the uncertainty principle that there is a tradeoff between time resolution

and frequency bandwidth. The inclusion of the window function is an attempt to improve the time

resolution by forcing the effect of sound events not to influence the value of the EDF outside of a

narrow time interval. The natural result as a consequence of the uncertainty principle is a widening

of the bandwidth of each event.
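The tradeoff is easy to observe numerically. In this sketch (ours, with assumed parameters) the same pair of closely spaced tones is analyzed with a short window and a long one; the short window's frequency bins are too wide to separate the tones:

import numpy as np
from scipy.signal import spectrogram

fs = 8000
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 450 * t)

for nperseg in (256, 4096):
    f, frames, Sxx = spectrogram(x, fs=fs, nperseg=nperseg)
    print(nperseg, f[1] - f[0])             # bin width: 31.25 Hz vs ~1.95 Hz;
                                            # only the long window resolves the
                                            # 440 and 450 Hz tones separately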


It should be noted that distributions such as the Page running spectrum and the Wigner

distribution, since they are computed using an integral with infinite bounds, have the potential for

infinitely precise frequency resolution. But in the graphs shown in figures 11 and 12 this does not

appear to be the case. The peaked parts of the graph show a gentle curvature on either side of the

base. This is the result of an implicit window function forced upon our analysis because we were

processing signals of finite duration. Even though we did not apply any window, our signal is non-

zero for only a limited time so the EDF behaves similarly to the way it would if we had windowed

the input, that is, each event is somewhat spread out in the frequency domain.

Spectrograms do not, in general, satisfy the marginals. The addition of the windowing function into

the expression for P(t, ω) confounds the energy density of the signal function with the energy of the

window function and introduces effects unrelated to the properties of the original signal. Our

CODEC also doesn’t satisfy the marginals. We will discuss how this affects the quality of the output

in the next chapter when we explain the perceptual masking phase of our algorithm.


5 - OUR IMPLEMENTATION

FIGURE 13 - OVERVIEW OF OUR PROGRAM

Our algorithm works in four phases as shown in the chart above. In this section we describe the

details of each phase of the process.

THE ANALYSIS PHASE

DISCRETE MODEL OF A DAMPED HARMONIC OSCILLATOR

Helmholtz’s resonance-place theory of frequency filtering in the basilar membrane is at the heart of

our time-frequency analysis method. Recall that Helmholtz imagined the basilar membrane to

consist of an array of tuned resonators responding in unison to the stimulation of vibrations of

various frequencies as they pass through the cochlea. He knew that every frequency of vibration

would stimulate each one of the resonating fibers, but for a given input frequency, the fibers whose

resonance most closely matched that pitch would respond most strongly. This, he supposed, was

the way we identify individual frequencies in the sounds we hear. Although his theory was later

supplanted by Bekesy’s traveling-wave theory, the idea that different frequencies correspond to

different locations on the membrane persists. Helmholtz's theory still has one advantage over the

traveling-wave idea: computational simplicity. Even at the time when the field of Psychoacoustics

was still in its infancy, resonance was already a well-understood phenomenon. The Helmholtz

model can be implemented on a computer as a simple array of ordinary differential equations, one

for each resonator. The traveling-wave theory, on the other hand, corresponds to a non-linear


inhomogeneous partial differential equation. We did not find any evidence in the psychoacoustical

literature that it has ever been modeled mathematically⁶.

Regardless of whether the cochlea is home to an array of damped harmonic oscillators or to a nasty

partial differential equation, these two facts remain uncontested:

1. When vibration, at any frequency, stimulates the fluid in the cochlea the whole basilar

membrane is set in motion.

2. The magnitude of displacement depends on the frequency of the vibration and the location

along the membrane.

We have based our model on the older Helmholtz theory, mostly because we have discovered a

simple and efficient discrete method for modeling the response of an array of damped harmonic

oscillators but also because we believe it is qualitatively similar to the traveling-wave model;

similar enough so that the advantages of the Helmholtz model’s simplicity outweigh the

disadvantages of its inaccuracy.

We begin to derive our method by considering a well known result about the amplitude of

resonance in a damped harmonic oscillator driven by a sinusoidal forcing function. If the resonator

is represented in the form of a mass-spring-dashpot system with mass m, spring constant k, and damping constant c, then we use the following differential equation as a model:

$$m x'' + c x' + k x = f(t)$$

In the cochlea, the sound vibrations entering from outside provide a forcing function, stimulating

vibrations within. When the forcing function is a single sinusoid with amplitude A and frequency ω, the amplitude of resonance induced in a damped harmonic oscillator is known to be given by the following formula⁷:

$$\frac{A}{\sqrt{(k - m\omega^2)^2 + (c\omega)^2}} \qquad (1)$$

6 j is the imaginary square root of −1. The integrals in this paper often appear without any specification of the upper and lower bounds; in those cases the integration should be carried out over the whole domain of the variables of integration. This applies to all expressions in this paper where an integral appears without bounds.

The derivation of this formula, which appears in most ordinary differential equations textbooks⁸, is

well known and we will not repeat it here. Instead we present an alternative method for arriving at

the same expression. The advantage of this second method is that we can prove it generalizes to

give the amplitude of resonance even if the forcing function is not a sinusoid. Normally, we would

need to solve the differential equation or compute the Fourier series of the signal before we could

find the amplitude of resonance but with our new method we can compute it for any forcing

function even if we don’t have an analytical expression for the function.

7 We made our own attempt but found that our results were at odds with Von Bekesy's. Bekesy showed evidence that the traveling waves are a combination of transverse and longitudinal vibration. Since the equations that model that type of vibration are difficult to analyze, we considered separable non-linear PDEs that model the transverse components of the vibration. The very fact that the equations were separable precludes any possibility of producing modes of vibration that appear as "traveling waves". Since we do not have any interest in dissecting cadavers, we decided instead to accept Von Bekesy's Nobel-prize-winning conclusions without making any further investigation.

8 This formula gives the steady-state response of the harmonic oscillator so we don't make any stipulation about what initial conditions we use to solve the equation. The transient solution decays in time and leaves this result regardless of what initial condition we use.


FIGURE 14 - AMPLITUDE OF A DAMPED HARMONIC OSCILLATOR WITH RESONANT FREQUENCY ωr = 30 AS THE FORCING FREQUENCY ωf VARIES FROM 0 TO 100


Let s(t) be the amplitude of a continuous audio signal at time t. The response to the signal s, of a damped harmonic oscillator tuned to a resonant frequency of ωr, is given by the product of the signal with a complex exponential (the oscillatory component of the resonator) and a real exponential function (the damping component of the resonator, with damping constant Γ), integrated over all time up to the current moment⁹:

$$\int_{-\infty}^{t} s(\tau)\, e^{-i\omega_r \tau}\, e^{\Gamma(\tau - t)}\, d\tau \qquad (2)$$

Usually the input signal contains oscillatory components. Even if that is not the simplest representation, we can always use Fourier transforms to write it as a linear combination of complex exponentials. Let us assume that our signal is a single complex exponential forcing function with real-valued frequency ωf and real amplitude A. In that case we can simplify the expression for the response in the following way:

$$\int_{-\infty}^{t} s(\tau)\, e^{-i\omega_r \tau}\, e^{\Gamma(\tau - t)}\, d\tau = \int_{-\infty}^{t} A e^{i\omega_f \tau}\, e^{-i\omega_r \tau}\, e^{\Gamma(\tau - t)}\, d\tau = \frac{A\, e^{i(\omega_f - \omega_r)t}}{\Gamma + i(\omega_f - \omega_r)} \qquad (3)$$

Now, we are concerned with the amplitude of oscillation, not the phase, so we find the complex magnitude by multiplying by the complex conjugate and taking the square root:

$$\frac{A}{\sqrt{\Gamma^2 + (\omega_f - \omega_r)^2}} \qquad (4)$$
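Formula (4) can be checked against the defining integral (2) numerically. A small sketch (assumed test values, NumPy; not the project's code) evaluates (2) directly for a complex exponential forcing function and compares its magnitude with (4):

import numpy as np

A, wf, wr, Gamma, t_now = 2.0, 30.0, 28.0, 1.5, 50.0    # assumed test values

# Approximate the lower bound of -infinity by starting 40 time units back,
# where the damping factor e^{Gamma(tau - t)} is about e^{-60}.
tau = np.linspace(t_now - 40.0, t_now, 400_000)
dtau = tau[1] - tau[0]
s = A * np.exp(1j * wf * tau)                           # forcing function
response = np.sum(s * np.exp(-1j * wr * tau)
                  * np.exp(Gamma * (tau - t_now))) * dtau

predicted = A / np.sqrt(Gamma ** 2 + (wf - wr) ** 2)    # formula (4)
print(abs(response), predicted)                         # both about 0.8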

If we set Γ = cω, ωr = k, and ωf = mω², then formula (4) is equivalent to formula (1), demonstrating that (2) is an alternative method for computing the amplitude of resonance of a harmonic oscillator when s(t) is a sinusoidal forcing function of known amplitude and phase. But what if s is not a sinusoid?

9 (Edwards, 2005)

Provided s can be represented by a Fourier series, we can re-write (2) as follows:

$$\int_{-\infty}^{t} \left( \sum_n A_n e^{i\omega_{f_n} \tau} \right) e^{-i\omega_r \tau}\, e^{\Gamma(\tau - t)}\, d\tau$$

and by the linearity of the integral operator, the response of the resonator is

$$\sum_n \frac{A_n\, e^{i(\omega_{f_n} - \omega_r)t}}{\Gamma + i(\omega_{f_n} - \omega_r)}$$

We cannot simplify this to a sum of terms resembling (4) because the complex magnitude operator is non-linear. As we will discuss later, this fact mathematically explains the crucial distinction between Rutherford's frequency theory and those of Helmholtz and Von Bekesy.

Now that we know (2) expresses the response of a damped harmonic oscillator at time t to the forcing function s(t), we need to find a discrete form of the expression so that we can compute it when the forcing function is given in terms of samples x[n].

Actually, we derived (2) from the discrete formula, not the other way around. Therefore it will be simpler if we describe the algorithm first, without giving any derivation, then show how it approximates (2) in the limit as the sampling rate, f_s, approaches infinity.

Suppose we want to approximate the resonance of the harmonic oscillator of resonant frequency ωr. Oscillating at its resonant frequency, the vibration follows the function e^{iωr t}, so the time for a single period of oscillation is 2π/ωr. At a sampling rate of f_s there are n = 2π f_s / ωr samples per period of oscillation. Let z_n denote the nth complex root of unity, so that (z_n)^n = 1. The sequence Z = z_n^0, z_n^1, …, z_n^k represents k consecutive samples of the function e^{iωr t}. If x[k] is the kth sample of the input signal X, then the sum

$$\sum_k z_n^k\, x[k] \qquad (5)$$

is the inner product of the vectors X and Z.

This is like a discrete approximation for (2) without the damping term e^{Γ(τ−t)}. We could approximate the damping with discrete exponents of a real number a ∈ (0, 1), just as we did for the oscillation term with exponents of z_n:

$$\sum_{k=-\infty}^{\kappa} a^{\kappa - k}\, z_n^k\, x[k]$$

But then when we move on to the next sample, x[k+1], we would have to completely re-calculate the sum because at the next sample κ = k + 1. That would mean that in practical operation we would be calculating an inner product of vectors whose length increases with every new sample we process, making the algorithm computationally infeasible.

Figure 15 shows the signal, x[k], the damping term, a^{κ−k}, and the oscillatory term, z_n^k, plotted up to a certain time t. After time t we move ahead to time t + s and we want to recalculate the inner product of the three functions. For the oscillatory and signal components, the function at time t + s is the same as for time t except that a new section is added at the end (shown in a different color). But for the damping component, the entire function has to be shifted to the right as a result of the change in the value of κ. This means we cannot compute the inner product of the three functions at time t + s simply by computing the new section and adding the new result to the result from the existing section. Instead we have to recalculate the entire product.


FIGURE 15 – TERMS OF THE INNER PRODUCT UP TO TIME t AND EXTENDED TO TIME t + s. (LABELED CURVES: DAMPING, OSCILLATORY, SIGNAL)

Fortunately, there is another way to calculate the product that requires only a constant-time

operation to update the result each time a new sample is added to the signal. We define an iterative

weighted averaging process that gives the average of a constantly changing sampled value x[i] over an infinite time interval with increasing weight given to the most recent values:

Let A[i−1] be the weighted average of x[0] … x[i−1]; then A[i] is defined by the following recurrence relation:

$$A[i] = \alpha\, x[i] + (1 - \alpha)\, A[i-1], \qquad \alpha \in (0, 1)$$

We can better understand the behavior of this function if we expand it as a sum:

$$A[i] = \sum_{k=-\infty}^{i} \alpha\, (1 - \alpha)^{i - k}\, x[k] \qquad (6)$$

Here we see that the weight allotted to the kth sample, x[k], diminishes geometrically as k − i decreases. Practically speaking, if x[i] is the most recent sample then |k − i| represents the "age" of x[k], so we could also say that the weight of the kth sample diminishes geometrically as the sample gets "older".
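The equivalence of the recurrence and the expanded sum is easy to confirm numerically; a small sketch (ours, assuming NumPy and the A[0] = 0 base case discussed below):

import numpy as np

alpha = 0.05
rng = np.random.default_rng(1)
x = rng.standard_normal(200)

A = 0.0                                     # recurrence form, A[0] = 0
for sample in x:
    A = alpha * sample + (1 - alpha) * A

i = len(x) - 1                              # direct weighted sum, as in (6)
weights = alpha * (1 - alpha) ** (i - np.arange(len(x)))
print(A, np.sum(weights * x))               # identical up to rounding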

There is a difficulty with this type of weighted average which becomes apparent when we consider a simple property that we usually expect of a formula for a measure of central tendency: normally we would expect that if x[i] = c for all i, then A[i] = c. In other words, if the x[i] are samples of a constant value then the weighted average of the samples is equal to that same constant value. We attempt, by induction, to prove this is true for our formula and thereby demonstrate the trouble.

Proof by induction: Suppose A[j] = c for all j < i. Then A[i] = αc + (1 − α)c = c. It's going well so far… but what about the base case? When we begin the iterative calculation by finding A[1] we do not have any value for A[0]. The obvious solution is to define A[0] = x[1]. Then A[1] = αx[1] + (1 − α)x[1] = c and the proof is complete.

The difficulty in defining the base case for this induction reveals an important property of this definition of average. Suppose we had defined A[0] = 0 instead. That may be a more natural definition because it doesn't assume a non-zero average for non-existent previous values of x[i]. If x[i] were samples of an audio signal then the samples before x[1] would more likely be zeros. With that base case, the induction fails and A[i] ≠ c. Instead, lim_{i→∞} A[i] = c; the estimate of the average begins badly but becomes increasingly accurate as the influence of the zero value of A[0] diminishes in time. It is easy to quantify the error at the ith sample by comparing A₀[i], the average computed under the assumption A[0] = 0, with A_{x1}[i], the average computed when we set A[0] = x[1]. The error at the ith sample is:

$$e[i] = A_{x1}[i] - A_0[i] = (1 - \alpha)^{i}\, x[1]$$

This is important because it indicates something about the behavior of the method when x[i] is varying; the more rapidly x is changing, the less likely that A gives an accurate estimate of the recent values. But the longer x remains constant, the more accurate A becomes. Even more importantly, we see how the value of α affects the estimate. Notice that lim_{α→1} A[i] = x[i]. The value of α controls the amount of influence that the most recent sample has on the whole average. We could also say that when x is constant, α controls the rate of convergence. Values of α closer to zero cause the average to converge more slowly and to give more weight to older samples. Larger values have the opposite effect.

We now consider the effect on this average of the sampling rate f_s. Suppose s(t) is a continuous signal and that x[t] is the same signal sampled f_s times per second. Consider equation (6) in the more practical case where the sum contains finitely many terms:

$$A[i] = \sum_{k=1}^{i} \alpha\, (1 - \alpha)^{i - k}\, x[k]$$

We can vectorize the computation by writing it as an inner product A[i] = ⟨Ω, X⟩, where Ω = α((1 − α)^{i−1}, (1 − α)^{i−2}, …, 1) and X = (x[1], …, x[i]).

We have mentioned that the parameter α determines the speed with which the average converges to the value at the current sample. As we will see later, we do not want this to happen too quickly. In fact, the adjustment of this parameter is critical to controlling the time and frequency resolution of our spectrogram. But the rate of convergence as controlled by α is set in terms of the number of samples. That is, given values of A[0], c, α, and ε, there exists a precise number of samples, n, such that the error in A[i] after x has remained constantly equal to c for the last n samples is guaranteed to be less than ε. So, if we leave everything else unchanged but double the sample rate, then the time for the error to diminish below ε is cut in half. Therefore we should define α to depend on f_s so that once we set α the time to convergence remains constant regardless of how often we sample the signal.


Suppose we have experimentally determined a value of α, call it α₀, that works well at sampling rate f₀. In order to keep the time for convergence constant we define a universal value α = f₀α₀ / f_s that gives the same performance at every sample rate.
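A small sketch (ours, with assumed numbers) of why this rescaling works: with α = f₀α₀/f_s, the weight remaining on old data after a fixed amount of time is the same at every sample rate:

f0, alpha0 = 44_100.0, 0.001                # reference rate and tuned alpha

for fs in (44_100.0, 88_200.0, 176_400.0):
    alpha = f0 * alpha0 / fs                # the universal value from the text
    n = int(0.1 * fs)                       # number of samples in 0.1 seconds
    print(fs, (1 - alpha) ** n)             # ~e^{-f0*alpha0*0.1} at every rate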

It is interesting to see how the computation of A[i] behaves in the limit as the sample rate approaches infinity. Replacing α with f₀α₀ / f_s, we have the following expression for the kth element of the weight vector Ω that specifies the weights for the samples on the time interval 0 < t < 1:

$$\Omega[k] = \frac{f_0 \alpha_0}{f_s} \left(1 - \frac{f_0 \alpha_0}{f_s}\right)^{f_s - k} \qquad (7)$$

This value represents the weight given to the kth sample. Naturally, this number decreases to zero in the limit as the sample rate goes to infinity. But as the number of elements in Ω increases, it becomes an increasingly good approximation of a continuous exponential function. To see why this is so, we define a function a(t, f_s) that gives the element of Ω corresponding to the sample taken at time t and at sample rate f_s:

$$a(t, f_s) = \Omega[t f_s] = \frac{f_0 \alpha_0}{f_s} \left(1 - \frac{f_0 \alpha_0}{f_s}\right)^{(1 - t) f_s}$$

If we take the limit as f_s goes to infinity we have

$$\lim_{f_s \to \infty} a(t, f_s) = \left( \lim_{f_s \to \infty} \frac{f_0 \alpha_0}{f_s} \right) e^{f_0 \alpha_0 (t - 1)}$$

We preserved f₀α₀ / f_s on the right hand side of this expression instead of explicitly writing its limit because its limit is zero. Writing it this way, we can see that for a sufficiently high sampling rate, Ω approximates a continuous real-valued exponential function and that it converges pointwise to 0.


The quantity f₀α₀ / f_s is inversely proportional to the sample rate but the number of elements in Ω is directly proportional to it. This suggests that the sum of the elements of Ω remains roughly the same regardless of any change in sample rate.

Let's summarize the results of this section in a table:

                                          Discrete                              Continuous
signal                                    x[n]                                  s(t)
time                                      k / f_s                               t
sinusoidal oscillation                    z_n^k                                 e^{i 2π n t}
exponential damping                       (1 − f₀α₀/f_s)^{i − k}                e^{f₀α₀ (t − 1)}
response of damped harmonic oscillator    Σ_{k=−∞}^{κ} a^{κ−k} z_n^k x[k]       ∫_{−∞}^{t} s(τ) e^{−iωr τ} e^{Γ(τ−t)} dτ

At the beginning of this section we showed that the expression in the bottom-right corner of this table is the displacement of a damped harmonic oscillator driven by an arbitrary forcing function. We went on to show that each of the discrete formulas¹⁰ on the left side of the table is equivalent, in the limit as the sampling rate goes to infinity, to the continuous formula on the right. Based on this evidence, we take the discrete formula in the bottom row to be a good approximation for the response of a damped harmonic oscillator with a resonant period of n samples to a discrete sampled forcing function, at the time of the κth sample. For efficiency, we implement the computation of this expression with the iterative weighted-average process described before:

$$A[i] = \alpha\, z_n^i\, x[i] + (1 - \alpha)\, A[i-1], \qquad \alpha \in (0, 1)$$

10 We do not have references to other papers that use this expression but we will justify its usefulness by showing that it gives the same result as expression (1).


We use an array of these discrete resonator models, one for each of the frequencies we wish to

analyze, and we process this iterative calculation once per sample for each frequency. Usually, we

run the algorithm with about 300 resonators in the array but for higher sampling rates it is

necessary to increase this number in order to get good resolution into the lowest frequencies. In

the next two sections we explain how we use the data from this computation to produce a compact

representation of the input signal.
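Pulling the pieces of this section together, here is a condensed sketch of the analysis loop (our reconstruction, assuming NumPy; the number of resonators and their frequency spacing follow the text, but the per-period decay d and the frequency range are assumed values, not the project's exact settings):

import numpy as np

def analyze(x, fs, freqs, d=0.5):
    """Amplitude response |A| of every resonator at every sample."""
    n = fs / freqs                          # samples per period, per resonator
    alpha = 1.0 - d ** (1.0 / n)            # constant decay of d per period
    z_step = np.exp(-2j * np.pi * freqs / fs)   # advances z^i by one sample

    A = np.zeros(len(freqs), dtype=complex)
    z = np.ones(len(freqs), dtype=complex)
    out = np.empty((len(x), len(freqs)))
    for i, sample in enumerate(x):
        A = alpha * z * sample + (1 - alpha) * A    # the iterative update
        z *= z_step
        out[i] = np.abs(A)
    return out

fs = 44_100
t = np.arange(0, 0.5, 1.0 / fs)
x = np.sin(2 * np.pi * 440.0 * t)
freqs = np.linspace(50.0, 2000.0, 300)      # ~300 resonators, as in the text
P = analyze(x, fs, freqs)
print(freqs[np.argmax(P[-1])])              # peak lands near 440 Hz

Each pass through the loop costs one complex multiply-add per resonator, which is the constant-time update that motivated the weighted-average formulation.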

ANALYTIC INPUT SIGNAL

FIGURE 16 – OUTPUT OF OUR SPECTROGRAM FOR INPUT CONTAINING A SINGLE SINUSOID AT ω = 100

Figure 16 shows the output of our array of discrete oscillators in response to a monophonic

sinusoid. The third axis, receding back into the page, is time. As it begins to process the signal the

program underestimates the amplitude and gives an unclear indication of the frequency but as time

progresses a clear peak rises at ω = 100, indicating the frequency of the input.


Looking from the side at the time axis (figure 17) we can see the peak asymptotically approaching a

limiting amplitude. But there is a significant washboard effect affecting the whole output right from

the beginning of the analysis. The frequency of perturbation is similar to the frequency of

oscillation of the input signal.

FIGURE 17 - A VIEW FROM THE SIDE SHOWS THE WASHBOARD SHAPE OF THE ENTIRE SPECTROGRAM

Let’s consider the cause of this. We measure the amplitude of oscillation based on a time-weighted

average of the inner product of a real valued sinusoid with a complex exponential function. We

always give the heaviest weight to the most recent samples. Consider an exaggerated picture of the

functions we are multiplying:


FIGURE 18 - THE POINTWISE PRODUCT ON THE LEFT HAS MUCH LOWER AVERAGE VALUE BECAUSE ITS SIGNAL IS ZERO AT THE END OF THE WINDOW. (LABELED CURVES: SIGNAL, EXPONENTIAL OSCILLATOR, DAMPING, POINTWISE PRODUCT)

Figure 18 shows an input signal (top), the real part of a complex exponential function of the same

frequency, a damping function, and the inner product of these three functions (bottom). Since the

damping function puts most of the weight at the end of the time interval, the signal that is zero at

the end produces an inner product with lesser magnitude.

Perhaps the first question that comes to mind is “So what?” We can still see the frequency and the

amplitude of the input from the spectrogram so if the amplitude wobbles a little bit, is that a

problem? The amplitude of the input signal is oscillating in time so isn’t it natural that the

instantaneous energy estimate given by the spectrogram should also oscillate?

There is some confusion here about what quantity we expect the spectrogram to measure. If we want it

to measure the absolute value of the instantaneous amplitude of the oscillatory component at a


given frequency then we should expect its output to oscillate at twice that frequency. But we are

not interested in the amplitude of oscillation. We really want to measure the total energy in the

vibration at each frequency.

Our harmonic oscillator model is designed to give an estimate of the energy in the sound waves that

enter into the ear first by propagation through the air and then through the solid medium of the

bone structures of the inner ear. The motion of sound waves in air is governed by the three-dimensional wave equation but for simplicity, we consider it in only one dimension¹¹:

$$u_{tt} - c^2 u_{xx} = 0, \qquad c = \sqrt{\frac{K}{\rho}}$$

(K is the bulk modulus of the medium and ρ is its density.)

Kinetic energy varies with the velocity of the air molecules and the potential energy varies with pressure, but the total energy always remains constant:

$$E_{\mathrm{total}} = E_{\mathrm{kinetic}} + E_{\mathrm{potential}}, \qquad E_{\mathrm{kinetic}} = \frac{1}{2} \int \rho \left( \frac{\partial u}{\partial t} \right)^{2} dx, \qquad E_{\mathrm{potential}} = \frac{1}{2} \int K \left( \frac{\partial u}{\partial x} \right)^{2} dx$$

So the oscillations in the spectrogram occur because we were considering only the kinetic energy of

the sound. If we consider potential energy as well we get a much smoother spectrogram as shown

in the following illustration.

11 We have not yet shown convergence for the formula in the bottom row of the table; since each of the three terms in the discrete product converges to the corresponding term in the continuous expression, the proof is trivial.


FIGURE 19 - SAME AS FIGURE 16 BUT WITH QUADRATURE MODEL SIGNAL

It is customary to use complex numbers when representing waves that propagate by transferring

energy between two quantities, as in electromagnetic theory. We can do the same for sound waves

or for our audio signal. If we let s(t) = cos(ωt) be the kinetic component of our signal then s′(t) = −sin(ωt) is the potential component. We can combine these together, defining

$$A[s(t)] = A[\cos(|\omega| t)] = \cos(|\omega| t) + j \sin(|\omega| t) = e^{j|\omega| t}$$

and calling this the analytic version of our signal¹².

The linear operator A is defined for general signals as follows¹³:

$$A[s] = s(t) + \frac{j}{\pi} \int \frac{s(\tau)}{t - \tau}\, d\tau$$

12 (Zauderer, p. 197) Our use of the wave equation in this case represents the sound waves in the air, not the vibration of damped harmonic oscillators. Since damping in air is negligible over short distances, we do not include any damping term in the equation.

13 This is true only when our signal is of the form cos(ωt).


Clearly, if we knew the appropriate symbolic description of our signal, and if the signal contained sums but not products of real-valued sinusoids, then writing the analytic version would be trivial. Unfortunately, for sampled signals it is not so easy. In the preceding paragraph we have suggested the idea of writing our signal in terms of cosines and then simply adding an imaginary sine term corresponding to each one:

$$\text{if } s(t) = \sum_n A_n \cos(\omega_n t + \phi_n) \text{ then } A[s] = \sum_n A_n \left[ \cos(\omega_n t + \phi_n) + j \sin(\omega_n t + \phi_n) \right] = \sum_n A_n\, e^{j(\omega_n t + \phi_n)}$$

This is called the quadrature approximation. There is a distinction between this and the correct

definition of A because the proper analytic signal behaves differently for negative values of ω.

Normally, real valued signals have a frequency spectrum that is symmetric in frequency about 0.

For some applications we prefer to see only positive values for frequency. Making the signal

analytic is a way of complexifying the input to cancel the negative values out of the spectrum. In

our case, we are not trying to force a positive measurement of frequency. There are several

methods for complexifying the signal that produce the smoothing effect we desire.

Since we cannot write the signal in terms of cosine functions until much later in the analysis

process, we do not try to apply the quadrature method as described above. Instead we modify

equation (2) by delaying the signal one fourth of the wavelength of resonance of the harmonic

oscillator:

$$A(t) = \int_{-\infty}^{t} \left[ s(\tau) + j\, s\!\left(\tau - \frac{\pi}{2\omega_r}\right) \right] e^{-j\omega_r \tau}\, e^{\Gamma(\tau - t)}\, d\tau$$

That works effectively when the resonant frequency of the oscillator is near the forcing frequency

but as shown in figure 19, there is still a small amount of ripple effect farther away from the peak.


We should note that this delay method is not particularly effective for transient signals because the samples from t − π/(2ωr) seconds in the past may come from a section of the sound where the amplitude and frequency content is much different from what it is at time t. We also experimented

with using a numerical derivative of the signal to approximate the imaginary part but rejected that

approach because it introduced other types of error.
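The following sketch (our own discretization; the quarter-period delay is rounded to a whole number of samples) applies the same idea to the discrete resonator update from the analysis section:

import numpy as np

def analyze_quadrature(x, fs, fr, d=0.5):
    n = fs / fr                             # samples per resonant period
    delay = int(round(n / 4))               # quarter period, in whole samples
    alpha = 1.0 - d ** (1.0 / n)
    z_step = np.exp(-2j * np.pi * fr / fs)

    A, z = 0.0 + 0.0j, 1.0 + 0.0j
    out = np.empty(len(x))
    for k in range(len(x)):
        delayed = x[k - delay] if k >= delay else 0.0
        xq = x[k] + 1j * delayed            # quadrature version of the sample
        A = alpha * xq * z + (1 - alpha) * A
        z *= z_step
        out[k] = abs(A)
    return out

fs = 44_100
t = np.arange(0, 0.2, 1.0 / fs)
track = analyze_quadrature(np.sin(2 * np.pi * 440.0 * t), fs, fr=440.0)
print(track[-5:])                           # settles near a constant level,
                                            # without the double-frequency ripple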

MASKING

We saw in (3) that

$$\frac{A\, e^{i(\omega_f - \omega_r)t}}{\Gamma + i(\omega_f - \omega_r)}$$

is the response at time t of a damped harmonic oscillator of resonant frequency ωr to a complex exponential forcing function of amplitude A and frequency ωf. The graph of this function as ωr varies over a wide range appears in figure 14. An important feature of this graph is that it has only one local maximum value, and that value occurs at the point where ωr = ωf. For single-frequency input, identifying ωf is a trivial matter of locating the maximum value of (3).


FIGURE 20 - SPECTROGRAM OF A MULTI-FREQUENCY SIGNAL

Figure 20 shows the response of our spectrogram to an input that is the sum of sinusoids at three

frequencies. Several important features are visible here. First, the longer wavelengths take

considerably more time to show a peak at the appropriate frequency. The short wavelength part

shows a very sharp peak from early on but it is relatively unstable; the estimate of amplitude is very

unsteady compared to the longer wavelengths. This happens because we make the value of α decrease in proportion to the wavelength (equivalently, increase with frequency). Remember that α controls the speed of convergence for the estimate of both amplitude and frequency. For a complex-valued, single-frequency input signal, our estimate is quite steady for any value of α. But when the input is polyphonic the interaction between frequencies causes instabilities. By lowering the value of α we can slow down

the convergence, thereby smoothing the signal and producing a more stable result. But this

smoothness comes at a price; slowing the convergence of our spectrogram decreases the accuracy


in the time domain. We mentioned before that for high frequencies, it is possible to do good pitch

detection even on a very short timescale but for low frequencies this is not the case. Since this is

true for humans as well as for our model, it is important for us to adjust α to give good time-

resolution for the high frequencies but better stability for low frequencies. Consider the following

illustration:

If we compare the value of the damping function for the most recent sample to the value at the

sample taken one period of oscillation earlier, the difference between the two should be constant

for all frequencies. In other words, the rate of convergence measured in terms of the wavelength of

each damped harmonic oscillator should be constant. In figure 21, the signal on the left has a very

low frequency; therefore the damping function changes slowly. On the right, the frequency is much

higher so we do not want to use a slowly increasing function (red) because it will give bad time

resolution. Instead, we use the one that gives faster convergence (blue).


FIGURE 21 - THE CORRECT DAMPING FUNCTION (BLUE LINE) SHOULD DIMINISH BY THE SAME AMOUNT PER PERIOD OF OSCILLATION, REGARDLESS OF FREQUENCY
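The rule can be written directly as a formula for α. In this sketch (our parameterization; the per-period decay d is an assumed value), each resonator's per-sample α is chosen so the damping term shrinks by the same factor over one period of its own oscillation:

import numpy as np

fs = 44_100.0
d = 0.5                                     # assumed decay per period
freqs = np.array([100.0, 1000.0, 10_000.0])

n = fs / freqs                              # samples per period
alpha = 1.0 - d ** (1.0 / n)                # so that (1 - alpha)^n = d

for f, a, periods in zip(freqs, alpha, n):
    print(f, a, (1 - a) ** periods)         # last column is d at every frequency

Higher frequencies get a larger α (faster convergence, better time resolution) and lower frequencies a smaller one (slower convergence, better stability), exactly the behavior figure 21 calls for.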


Returning now to our discussion of the features of figure 20, we describe how we detect the

frequencies in the signal when the graph contains more than one local maximum. Figure 22 shows

a time slice of the spectrogram for a signal with three frequencies.

If the frequencies and amplitudes of the components of the signal are known, the graph shown above could be given by the formula

$$\left| \sum_n \frac{A_n\, e^{i(\omega_{f_n} - \omega_r)t}}{\Gamma + i(\omega_{f_n} - \omega_r)} \right| = \left| \sum_n \bigl( \text{complex-valued response to the sinusoid at the } n\text{th frequency} \bigr) \right|$$

So the response to a signal that is a sum of sinusoidal functions is just the sum of the response for

each frequency. Why then does the spectrogram give such an unstable estimate of the amplitude

when the signal is polyphonic? Because the complex amplitude function is a non-linear operator.

So even though the complex response of the damped harmonic oscillators is a linear combination of the response to each of the component frequencies in the signal, the amplitude of that complex number is not.

FIGURE 22 - A TWO-DIMENSIONAL TIME SLICE OF THE SPECTROGRAM FOR A SIGNAL WITH THREE FREQUENCIES

FIGURE 23 - NON-LINEAR EFFECT ON AMPLITUDE IN THE SUM BETWEEN INPUT FREQUENCIES: AN ENLARGEMENT OF THE SECTION BETWEEN THE LEFT AND CENTER MAIN FREQUENCY RESPONSES FROM FIGURE 24 (RIGHT SIDE).

FIGURE 24 - THREE COMPLEX HARMONIC RESPONSE FUNCTIONS (LEFT) AND THEIR SUM (RIGHT). THE Y AND Z AXES ARE THE REAL AND IMAGINARY COMPONENTS OF THE AMPLITUDE. THE X AXIS REPRESENTS THE VARYING RESONANT FREQUENCY OF THE HARMONIC OSCILLATOR.

On the left side of figure 24 we see three complex response functions similar to those that we used

to generate figure 22. But when we take the sum, as shown on the right, the values don’t always

add constructively. Figure 23 shows a zoomed in and stretched out view of the region between the

left and middle large circular regions from the right side of figure 24. The multicolored part of

figure 24 clearly shows that each independent response function in that region (between the red

and blue main spirals) has a relatively large amplitude. But in the sum (purple) there appears to be a bottleneck there, indicating that the red and blue functions interfere destructively. In the next bottleneck region to the right, between the blue and green spirals, there appears to be constructive

interference in the sum. If we plot the sum at several time steps we find that the interference

oscillates between destructive and constructive as the relative phases of the spirals change.


If we set our value of α very small so that the rate of convergence is very slow relative to the rate of

oscillation between constructive and destructive interference then we can partly eliminate this

effect. Fortunately, it is not necessary to eliminate it completely.

In figure 25 we see the same three functions from figure 24 plotted in their absolute value along with their

sum. Notice that each colored function, at its peak, stands far above its neighbors. This indicates

that, despite any interference that might go on between peaks, each function dominates the

absolute value of the sum at its own peak. Because of this, the local maxima of the sum are close to

the peaks of the individual terms. If we adjust α to slow down the convergence then the peaks get more pronounced and the sum becomes an even better approximation at the local maxima. As we will see later, we can dynamically make adjustments to α so that we do not experience a significant

loss in time resolution when we slow down the convergence.

We turn now to the topic of how we will identify the frequencies in a polyphonic signal. The flowchart on the right summarizes the process; figure 26 is a more detailed chart demonstrating how the frequency components are identified at each time step.

FIGURE 25 - THREE RESPONSE FUNCTIONS (COLORS) AND THEIR SUM (BLACK)


FIGURE 26


Although our spectrogram updates its simulation of the harmonic oscillators after each new sample

it processes, we found that it works well to record the frequencies only every 500 to 1000 samples.

Longer time between outputs reduces the file storage requirements and reduces the perception of

instability in frequency and amplitude of the output. But it also reduces the accuracy of timing in

response to transient sounds, cutting off the pick sound at the beginning of each note from a guitar,

for example.

Shortening the time interval improves the perceptual quality of the encoding for transients in the

signal but reduces the quality for signals with more stable frequency and amplitude characteristics.

There is a considerable amount of inaccuracy in our estimates of the signal properties. The

frequency-amplitude wobbling effect is the most perceptible artifact. Lengthening the time

between outputs makes the wobble undetectable and greatly increases the perceptual quality of the

output. It is therefore necessary to find a compromise value for the output period that works well

for both transient and sustained signals. We are currently working on a new data format that

allows the output period to be adjusted. The analysis phase of the program already adjustsα

according to the rate of change in average amplitude of the input signal so it is quite natural to use

the same parameter to control the timing of the data recording.

Typically, the number of frequencies detected at each time step is not more than ten. When the

signal contains a densely packed set of frequencies it is difficult for the masking algorithm to get an

accurate estimate. The frequencies create a lot of interference in the spectrogram when they are

close in both frequency and amplitude. As a result, the resonant response curves blend together

giving the appearance of wider peaks. We fit the mask functions as shown in figure 26 using a least-

squares regression in the neighborhood of the peaks so when the peaks widen, each mask covers a

larger portion of the data. The masking level rises very quickly to cover all of the data and the

algorithm misses some of the frequencies. It might seem that this would cause a big loss in


perceptual quality but in fact, as we mentioned in the second chapter, a very similar effect occurs in the human auditory system. Our goal, then, is not to detect every frequency component in the signal but only to detect every perceptible frequency component.

DATA STORAGE

FIGURE 27 - REPRESENTATION OF THE 3D MATRIX USED FOR DATA STORAGE

Our CODEC stores its output in a three-dimensional matrix. The program writes a new page into the matrix every time it processes the number of samples that we define as the output period. Figure 27 illustrates three pages of the data file. Each of the time indices $T_n$ corresponds to a page that records an arbitrary number of frequencies and the amplitude of each one.
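As a concrete illustration, here is a minimal MATLAB sketch of this layout. The ten-slot page width and the zero padding are our own illustrative assumptions, not a prescribed format; the analysis code in chapter 8 appends pages with the same out(:,:,end+1) idiom.

%pages is a 2 x 10 x N array: row 1 holds the frequencies, row 2 the
%amplitudes, and the third index selects the time step T_n
pages = zeros(2, 10, 0);                %start with zero pages
newPage = [100 50 25; 0.9 0.5 0.2];     %hypothetical page with 3 components
newPage(:, end+1:10) = 0;               %pad the unused slots with zeros
pages(:, :, end+1) = newPage;           %append as page T_1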

We can easily calculate the data compression that can be achieved with this structure if we consider that the CODEC writes one page every 500 to 1000 samples and that each page contains information for between 2 and 10 frequencies:

$$\text{Data Reduction Factor} = \frac{\text{new data rate}}{\text{old data rate}} = \frac{\frac{\text{sampleRate}}{\text{outputPeriod}} \times \text{sampleDepth} \times (2 \times \text{averageFrequencies})}{\text{sampleRate} \times \text{sampleDepth}}$$


For typical values this works out as follows:

$$\text{Data Reduction Factor} = \frac{\frac{44{,}000}{500} \times 16 \times 2 \times 4}{44{,}000 \times 16} = 0.016 = 1.6\%$$

That means that after processing, our CODEC reduces the audio data to 1.6% of its original size.

Compared to standard methods this is a huge improvement. (Typical compression rates for MP3

encoders are near 10%.) We will compare the quality of the sound reproduction in the next

chapter.
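The same arithmetic, as a quick sanity check in MATLAB (using the typical values quoted above):

%typical values: 44 kHz, 16-bit source, one page per 500 samples,
%4 frequencies per page, 2 numbers (frequency and amplitude) per frequency
sampleRate = 44000;  sampleDepth = 16;
outputPeriod = 500;  averageFrequencies = 4;
newDataRate = (sampleRate/outputPeriod) * sampleDepth * (2*averageFrequencies);
oldDataRate = sampleRate * sampleDepth;
reductionFactor = newDataRate / oldDataRate    %0.016, i.e. 1.6%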

SYNTHESIS

FIGURE 28 - OUR SPECTROGRAM FOR THE SUM OF TWO CHIRPS; ONE DOWNWARD SWEEPING AND THE OTHER UPWARD SWEEPING


Once we have the data written into the 3D matrix format, reconstructing the original sound is relatively simple. Suppose our data file shows two frequency components in the signal; at time $t_1$ the signal is estimated to be $s_1(t) = 3\cos(2.4t) + 2\cos(13t)$ and at $t_2$, $s_2(t) = 4\cos(2.5t) + 2.5\cos(12t)$. Even though the frequencies and amplitudes are not exactly equal between the two time indices, it is reasonable to assume that the low frequency at $t_1$ faded smoothly into the low frequency at $t_2$ and that the same happened for the high frequency. The time between pages is so short that the signal can't change much between pages in the data file.

As a simplified illustration, consider a monophonic signal:

$$s_1(t) = 1\cos(2t)$$

and

$$s_2(t) = 2\cos(2.5t)$$

If we create a continuous function $s_{1,2}(t) = a(t)\cos(t\,\omega(t))$ and let $a(t)$ and $\omega(t)$ be functions that smoothly fade the frequency and amplitude between $t_1$ and $t_2$, then we can sample it at discrete points and get the desired effect. The actual implementation is more complex; it is best to refer to the code section in the last chapter for the complete details.
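A minimal sketch of that idea in MATLAB, assuming a linear amplitude ramp and a geometric frequency glide over one output period (the real implementation, continuousFadeExps.m in chapter 8, works with complex exponentials and carries the phase across pages):

N = 1000;                       %samples between pages t1 and t2
n = (0:N-1)';
a = 1 + (2-1)*n/N;              %amplitude fades linearly from 1 to 2
w = 2*(2.5/2).^(n/N);           %frequency glides from 2 to 2.5 rad/sample
s = a .* cos(cumsum(w));        %integrate the frequency to get the phase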

There is one more major difficulty: we need to figure out how the frequencies move between the

time indices in our data file. Figure 28 shows an example of a signal that is the sum of two tones for

most of the time. One tone sweeps upward in frequency while the other shifts down. They meet in

the center and for a brief instant there is only one tone; then they diverge again. In situations like

this, where the number of frequencies changes between pages, it is important to carefully identify which frequencies in a given page should fade into frequencies in the next page and which ones should simply fade out to zero amplitude.


Figure 29 shows three pages of our data file, each containing a different number of frequencies.

Since the pitches in the original sound sample may be bending in both frequency and amplitude we

cannot assume that f1 in the first page corresponds to f1 in the second page.

FIGURE 29 - SOMETIMES THE INTRODUCTION OF NEW TONES MAKES IT DIFFICULT TO DECIDE HOW TO SCALE THE FREQUENCY BETWEEN PAGES OF THE DATA FILE

We match the frequencies between pages $t_1$ and $t_2$ by the following algorithm (a simplified MATLAB sketch follows the list):

1. Consider the amplitudes of all frequencies on both pages. Identify the strongest one.

2. Choose the frequency in the opposite page that is nearest to the frequency identified in the previous step.

3. If the two frequencies are within a specified range of each other then they form a pair. Copy them to the synthesizer, erase them from the data file, and continue to the next step. If they are not close, then the frequency identified in step 1 fades to zero amplitude. Pair it with a sinusoid of zero amplitude and copy the pair to the synthesizer. Erase it from the data file. (It is important to remove the frequencies from the data file after they are used to avoid re-using them in two or more pairs.)

4. If there are any frequencies left in the data file, return to step 1; otherwise, stop.
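Below is the promised sketch of the pairing loop. It assumes both pages are 2 x k matrices with wavelengths in row 1 and amplitudes in row 2, padded with zero amplitudes; the full version (testSynthesis4.m in chapter 8) additionally tracks phase and handles empty pages.

%returns one row per pair: [wavelength1 amp1 wavelength2 amp2]
function pairs = pairFrequencies(page1, page2, tol)
    pairs = zeros(0,4);
    while any(page1(2,:)) || any(page2(2,:))
        [m1, i1] = max(abs(page1(2,:)));
        [m2, i2] = max(abs(page2(2,:)));
        if m1 >= m2
            %the strongest remaining frequency is on page 1; find the
            %nearest frequency on page 2
            [d, j] = min(abs(page2(1,:) - page1(1,i1)));
            if page2(2,j) ~= 0 && d < tol*page1(1,i1)
                pairs(end+1,:) = [page1(:,i1)' page2(:,j)'];
                page2(:,j) = 0;            %erase so it isn't re-used
            else
                %no partner within range: fade out to zero amplitude
                pairs(end+1,:) = [page1(:,i1)' page1(1,i1) 0];
            end
            page1(:,i1) = 0;
        else
            %the strongest remaining frequency is on page 2; a new tone
            %fades in from zero amplitude if no partner is found
            [d, j] = min(abs(page1(1,:) - page2(1,i2)));
            if page1(2,j) ~= 0 && d < tol*page2(1,i2)
                pairs(end+1,:) = [page1(:,j)' page2(:,i2)'];
                page1(:,j) = 0;
            else
                pairs(end+1,:) = [page2(1,i2) 0 page2(:,i2)'];
            end
            page2(:,i2) = 0;
        end
    end
end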

6 - PERFORMANCE

QUALITY

It is well known that it is possible to reconstruct a sound from a graphical representation of its spectrogram (Hentjeens, 1997). But the errors in the reconstructed signal are more noticeable than with more conventional coding methods. A certain amount of error is unavoidable, thanks to the uncertainty principle and to our unwillingness to accept a spectrogram that outputs negative energy values or shows sound at times when the input signal contained none (Cohen, 1995). Knowing that this error can be shifted between the frequency and time domains but not eliminated, we have tried to minimize the perception of error by imitating the distribution of error in human hearing. We believe that if our CODEC is inaccurate in precisely the same way that our ears are inaccurate, then we have minimized the perceptual error.

We control the distribution of uncertainty between the time and frequency domains by adjusting the value of α. Higher values of α increase frequency resolution but cause the sound to be "mushy"; the beginnings and ends of tones become spread out in time. In order to get the best sound it is necessary to adjust α differently for each sound, or even to adjust it dynamically during the analysis of a single sound.

There is a problem with this approach. Even if our CODEC succeeds in having the same time and

frequency uncertainty as our ears do, when we listen to the output the error is compounded. This

is easy to see if we consider the time domain error:


FIGURE 30 - COMPOUNDING OF TIME DOMAIN SPREADING ERROR

The first frame of figure 30 shows a sinusoidal input that starts and stops abruptly. Normally, whenever two sounds occur within 1/17 of a second of each other, our ears hear them as a single sound (Goldberg, 2003). This is because sounds of this duration approach the limit of our time domain resolution. The second frame of figure 30 illustrates the effect of the uncertainty in human time resolution, which smooths the edges of the signal. Our CODEC introduces additional uncertainty in time resolution; therefore, when human ears listen to the output of our CODEC, the uncertainty is compounded a second time: once from the CODEC and again from the ears of the listener. The third frame shows how the uncertainties of our CODEC combine with those of the human senses, resulting in compounded uncertainty in the time domain.

In the frequency domain, uncertainty does not result in smoothing. Instead it results in inaccurate estimation. We do not have frequency smoothing in our CODEC because we eliminate it in the masking phase. But the masking transfers the frequency domain uncertainty from a smoothing effect to a probabilistic effect, increasing the standard deviation of the estimate. Even though uncertainty in the frequency domain works probabilistically, the effect is still compounded: first our CODEC makes estimation errors in the frequency domain, then the listener's ears make estimation errors while listening to the output from our CODEC, and the effects compound to produce a greater perceptual uncertainty.

HIGH FREQUENCY ESTIMATION ERROR


The justification for our discrete model of damped harmonic oscillators is based on its behavior in the limit as the sample rate increases to infinity. But what about the behavior at the other extreme? As one might expect, the discrete model is not much like a damped harmonic oscillator when it operates very close to the Nyquist frequency.14

In practical signal processing, no digital coder operates very well at those wavelengths. Recall from the first chapter that the sampling theorem guarantees perfect reconstruction of the signal up to half the sampling rate, provided the interpolation between samples is done using the sinc function. In ordinary digital-to-analog converters, no such interpolating function is used (Goldberg, 2003). Instead the converter uses a step function; that is, if sample 1 corresponds to an output of 0.3 mV and sample 2 corresponds to 0.4 mV, then the output is 0.3 mV between sample 1 and sample 2. When the time comes for sample 2 to play back, the voltage increases to 0.4 mV as quickly as possible. Since the voltage can't change instantaneously, the change may be somewhat smoother than a step function, but it certainly won't be an exact reconstruction of the input. The following illustration shows an example of the kind of errors that can occur near the Nyquist frequency when the proper interpolating function is not used.

14 (Cohen, 1995, p. 30)


FIGURE 31 - THE SINC FUNCTION


FIGURE 32 - HIGH FREQUENCY EFFECT OF AN INCORRECT INTERPOLATION FUNCTION

On the left side of figure 32 is a continuous sine wave. The red points show samples taken at a rate

just slightly higher than twice the frequency of the wave. Using linear interpolation between the

samples to reconstruct the signal, we get a “beating” effect in the amplitude of the tone that was not

present in the original. This beating is audible as a tone, sometimes with a pitch that is not

consonant with the continuous signal. We have observed that this effect is quite noticeable even for

pitches more than twenty percent below the Nyquist frequency.
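The effect is easy to reproduce. A short MATLAB sketch, assuming a tone at 90% of the Nyquist frequency and naive linear interpolation for reconstruction:

fs = 8000;                          %sample rate (Hz)
f = 0.45*fs;                        %tone at 90% of the Nyquist frequency
n = 0:799;                          %100 ms of samples
x = cos(2*pi*f*n/fs);               %the sampled tone
up = 8;                             %oversampling factor for display
t = (0:up*(length(n)-1))/up;        %fractional sample positions
xLin = interp1(n, x, t, 'linear');  %naive reconstruction
plot(t/fs, xLin);                   %the envelope beats at fs - 2*f = 800 Hz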

A major problem with our implementation of the discrete harmonic oscillator model is that we only model those oscillators whose resonant period is an integer number of samples. In chapter 3 there is an illustration showing how the basis functions of the Fourier series fit badly with the spacing of musical notes in the audio spectrum, specifically that they don't have sufficient resolution at low frequencies. For our model, we could turn that illustration upside-down and get a good representation of our high-frequency resolution issue.

POSSIBLE SOLUTIONS TO THE HIGH FREQUENCY RESOLUTION DEFICIENCY

There are several ways we could solve this problem. First, we could simply add discrete harmonic oscillators that have non-integer resonant periods, but they would experience the same kind of "beating" effect shown in figure 32. That isn't as bad as it might sound, because even lossless15 CODECs suffer from the same problem unless they have a special digital-to-analogue converter that interpolates correctly between samples.

FIGURE 33 - OUR CODEC INCORRECTLY IDENTIFIES THE PEAK OF A RESPONSE FUNCTION AS A RESULT OF INSUFFICIENT HIGH-FREQUENCY RESOLUTION

Second, we could interpolate our results in the frequency domain. Figure 33 shows an example of how insufficient frequency resolution causes errors in the estimation of the peak value of a response function. Our present implementation would identify the red point as the maximum value because it is the maximum of the data points. But a simple interpolation would reveal that the actual peak lies between the first and second highest points on the graph.
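A standard way to do this is quadratic (parabolic) interpolation through the maximum bin and its two neighbors. The sketch below assumes response is a vector of spectrogram amplitudes whose maximum does not fall at an endpoint:

[peakVal, k] = max(response);                  %the highest data point
y = response(k-1:k+1);                         %the peak and its two neighbors
delta = 0.5*(y(1)-y(3)) / (y(1)-2*y(2)+y(3));  %vertex offset, in bins
peakPos = k + delta;                           %interpolated peak position
peakAmp = y(2) - 0.25*(y(1)-y(3))*delta;       %interpolated peak amplitude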

A third solution would be to interpolate the signal in the time domain before beginning to process the data. By increasing the sample rate by a factor of two or three we shift our high frequency errors higher, into the inaudible part of the audio spectrum. This is the simplest solution because it requires no changes at all to our implementation. We have tested it and found that it is very effective except for one problem: it takes longer to run the CODEC when there are more samples in the signal. The interpolation itself is very fast, but if we double the sample rate we have twice as many samples to process. Unfortunately, doubling the sample rate also doubles all of our resonant frequencies, so we have to add more discrete harmonic oscillators on the low end to fill the gap. The result is a quadratic increase in processing time. As we will see in the next section, we may be able to improve on that quadratic growth if we thin out our frequency resolution on the low end. But there is another way to keep the processing time from increasing with the sample rate. We don't need the high sample rate for the low frequencies, so we might try interpolating the input signal in the time domain to several sample rates ranging from below the original rate to four or eight times higher. When processing the signal, we would use the high sample rate versions for the high frequencies and the lower versions for the low frequencies. That way we neither waste time processing unnecessarily large amounts of data to estimate low frequencies nor sacrifice frequency resolution at high frequencies.

15 Actually, our signal is complex, so it also contains phase information. But we use cosines here for simplicity of demonstration.

HIGH FREQUENCY ATTENUATION

There is another type of error that appears in the upper part of the spectrum. It is a subtle artifact of the difference between the discrete and continuous models of damped harmonic oscillators. It is a tacit assumption of our model that the response of a harmonic oscillator to a constant input should be zero. Set $s(\tau) = 1$ and we can see that this is true for the continuous expression:

$$\int_{-\infty}^{t} s(\tau)\, e^{-i \omega_r \tau} e^{\Gamma (\tau - t)}\, d\tau = \int_{-\infty}^{t} e^{-i \omega_r \tau} e^{\Gamma (\tau - t)}\, d\tau = 0$$

But for the discrete model we are not so lucky. Define $x[k] = 1$ and our discrete model becomes

$$\sum_{k=-\infty}^{\kappa} \alpha^{\kappa - k} z_n^{k}\, x[k] = \sum_{k=-\infty}^{\kappa} \alpha^{\kappa - k} z_n^{k}$$

This does not simplify, but numerical testing shows that it approaches zero in the limit as the sample rate goes to infinity. Practically, however, it is always non-zero, and the error increases as the resonant frequency of the oscillator approaches the Nyquist frequency. Fortunately there is a way to control the error. The error is not directly dependent on the frequency of the oscillator; it depends on α, but α is frequency dependent. The reason for the error is easy to see if we let α = 1 and $x[k] = 1$. Our discrete oscillator response at the $k$th sample becomes simply

$$z_n^{k}$$

The average puts all its weight on the most recent sample and there is no frequency filtering at all. This is precisely the behavior that appears in the discrete model as α approaches 1; the high frequency estimates look less like filtered frequency responses and more like the input signal multiplied by a complex number.


Our solution is to fix α for frequencies above a certain threshold. Experiments showed that if we allow α to depend on the resonant frequency of the oscillator only down to a resonant period of 80 samples, the error resulting from the discretization is limited to less than one percent of the amplitude of the input signal. For oscillators with resonant periods shorter than 80 samples, we use the same value of α as for the 80-sample resonator instead of making α frequency dependent.
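In the analysis code (testAnalysis8.m in chapter 8) this rule reduces to a clamp on the averaging factors. The sketch below is equivalent to the two assignments into baseAvgFactors there, with example values supplied for the free parameters:

shortestPeriod = 2; numFrequencies = 600;   %example values
damping = 1; avgFctrWvlngthLmt = 80;        %the 80-sample threshold
wavelengths = shortestPeriod:(shortestPeriod + numFrequencies - 1);
%below the threshold the factor is held at 80; above it, it tracks the period
baseAvgFactors = max(wavelengths, avgFctrWvlngthLmt);
avgFactors = damping * baseAvgFactors;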

The main disadvantage of doing this is high-frequency attenuation. When we allow α to increase with the frequency, the amplitude of the response is nearly linear throughout the spectrum, but when we fix the value for high frequencies we observe rapid attenuation of the amplitude as the resonant frequency approaches the Nyquist frequency. We tried compensating for that effect by simply multiplying the high frequency output by a function that cancels the effect of the attenuation. Of course, the difficulty with this approach is that it also multiplies the error. In fact, since fixing α doesn't eliminate the error, the signal-to-noise ratio for those high frequencies doesn't improve at all. Fixing α for resonant periods below 80 samples caps the error at one percent, but as the amplitude of the signal decreases, the proportion of the reading that is due to that one percent error increases, in much the same way as the error itself increased when we allowed α to remain frequency dependent.

One solution that we haven’t yet tried is to postpone the amplitude compensation until after the

masking phase of the algorithm is complete. Since the error is roughly the same for nearby

frequencies, it probably doesn’t affect the estimate of the frequency very much. Once we have

correctly identified the frequency it should be safe to multiply the amplitude to compensate for the

attenuation because we would only be multiplying a single frequency, not a whole range of them.

SPEED


We intended from the outset that every part of our algorithm should be theoretically capable of real-time operation. Since the current implementation is only a proof of concept, we focused our efforts on developing the ideas, not on optimizing for efficiency. Although there are still many ways to improve the existing code, it would perhaps be wise to begin any effort to improve the speed by rewriting the entire project in C or C++. We have replaced some of the inner loops with vectorized code, which greatly improves the efficiency in MATLAB but sometimes makes the programming difficult to decipher.

EFFICIENCY OF THE ANALYSIS PHASE

The analysis phase, that is, the discrete model of the array of harmonic oscillators, executes in linear time relative to the number of samples in the input and also relative to the number of frequencies in the spectrogram. The current version runs about five times slower than real time. It computes the response of the whole array of frequencies simultaneously as a vector operation; it should gain an additional five or ten percent in efficiency if we instead compute the response for all frequencies over a large section of input as a single matrix operation.

Perhaps the most significant improvement we could make to the analysis phase of the program would be to selectively reduce the number of frequencies in the spectrogram. Right now we include a frequency corresponding to every period of oscillation that is an integer number of samples. For high frequencies we need this much resolution and more. But lower down, the perceptual difference between a tone with a period of 300 samples and one with a period of 301 is hardly noticeable; we could probably compensate for the loss if we discarded a few frequencies, simply by interpolating between values on the spectrogram.

EFFICIENCY OF THE MASKING PHASE


Masking is by far the slowest operation because its dominant inner loop contains the least squares regression that fits response functions to the peaks of the spectrogram data. We had originally tried less computationally expensive ways of estimating the parameters of those mask functions, with moderate results, but as time for the project ran short, we decided to use regression so that we could observe the maximum perceptual fidelity of which the system is capable. There are many more efficient ways to estimate the mask functions. Finding a good substitute for regression is the best first step toward speeding up the masking phase of the encoder.

EFFICIENCY OF THE SYNTHESIS PHASE

The synthesis phase is already much faster than real time. Even for longer input files it completes its processing almost immediately. For CODECs in musical applications it is typically advantageous for the synthesis phase to be faster than the analysis phase, because we usually play back audio samples many more times than we record them.

DATA RATE

We mentioned already in chapter 5 that our program compresses data down to one or two percent of its original size. While that is an impressive compression ratio, it's difficult to compare it with other CODECs unless we consider the quality as well. There is no use in achieving high compression rates if the reconstructed signal is too damaged to be useful. Since we are still working on improving the quality of the compression, we cannot yet brag about the small file size.

7 - FUTURE RESEARCH POSSIBILITIES

PARALLELIZATION

Each damped-harmonic oscillator in the analysis phase operates independently of its neighbors. Parallelization is therefore a natural next step toward improving the speed. Our spectrogram could easily be separated to run in separate threads on multi-processor desktop computers, on graphics processing hardware, or on a collection of small microcontrollers.
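A minimal sketch of what this could look like with MATLAB's Parallel Computing Toolbox; analyzeBand is a hypothetical helper that runs the analysis over one band of resonant periods:

numBands = 4;                         %one band of oscillators per worker
bands = cell(1, numBands);
parfor b = 1:numBands
    %hypothetical helper: analyze only the b-th band of resonant periods;
    %each worker reads the same input but models disjoint oscillators
    bands{b} = analyzeBand(data, b, numBands);
end
fullSpectrogram = vertcat(bands{:});  %reassemble the full spectrogram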

FEATURE RECOGNITION AND TRANSFORMATION

Feature recognition methods for voice recognition or security biometrics are often based on linear procedures like principal components analysis. Sometimes the abilities of linear methods are limited by the non-linear structure of the feature space in which they operate. If we begin by applying a non-linear transformation such as the one performed by our coder, then apply statistical or linear algebraic techniques to the results, it may open the door to new possibilities or enhance the performance of existing methods.

Ours is the only audio format we are aware of for which pitch shifts of arbitrary amounts can be accomplished by scalar multiplication. Consider a single page of our data file: if we call this page $T_n$, we can double the amplitude and raise the pitch by one octave simply by multiplying by the scalar number 2. With vector multiplication we could do something more interesting. The operation $T_n \cdot \{1,\ 2^{3/12},\ 2^{4/12},\ 1,\ \tfrac{1}{2},\ \tfrac{1}{2}\}$ corresponds to raising the first two harmonics above the fundamental by a minor third and a major third16, respectively, and reducing them to half their amplitude. This kind of transformation could be very powerful in music synthesis applications. To do the same thing with the time domain output of a Fourier transform would be a challenge, to say the least.
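A sketch of both operations, assuming a page $T_n$ stores frequencies in row 1 and amplitudes in row 2 (the chapter 8 code stores wavelengths instead, in which case the frequency factors would divide rather than multiply):

Tn = [220 440 660; 0.8 0.4 0.2];   %hypothetical page: f0 and two harmonics
octaveUp = 2*Tn;                   %scalar: up one octave at twice the amplitude
shift = [1 2^(3/12) 2^(4/12);      %raise the harmonics by a minor/major third
         1 1/2      1/2     ];     %and halve their amplitudes
shifted = Tn .* shift;             %the vector operation from the text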

16 The Nyquist frequency is one half of the sampling rate.

MUSICAL TRANSCRIPTION

This CODEC does not do polyphonic pitch detection in the musical sense because it estimates only the harmonic frequencies in the sound; it doesn't try to guess which frequencies are fundamentals and which are higher harmonics. For polyphonic musical transcription the next step would be to find harmonic series in the output and identify the fundamental pitches. By applying statistical methods to this task we might also be able to improve the quality of the original estimate of the frequencies, because we expect the various harmonics of a single vibrating instrument to show strong correlation in both amplitude and frequency.

PSYCHOACOUSTICS

We mentioned before that the absolute value of the sum of resonance-response functions is a non-linear effect that becomes more linear as the value of α decreases, resulting in better frequency resolution at the expense of slower time response. There is debate among researchers in psychoacoustics about how much of the human ability to discern differences in pitch is a result of the filtering of the cochlea and how much depends on logical processing in the brain. Since our method operates under the assumption that the non-linearity effects are small, and it completely ignores all information about the phase of the signal, it is a demonstration of the quality of perception that is possible based on the physical filtering effect alone. This is interesting for two reasons. First, the fact that our CODEC reproduces the signal in a recognizable fashion suggests that it would be possible for biological hearing organs to do reasonably accurate recognition of sounds without requiring any additional phase-dependent signal processing in the brain. Second, we could try fitting our mask curves in complex space to see how much it would improve the quality of our output. That might give some indication of what role the brain plays in low-level audio signal processing.

8 - CODE


ORGANIZATIONAL OVERVIEW

Our code operates in three phases: analysis, masking, and synthesis. Analysis refers to the

spectrogram. Masking and synthesis are as described in the implementation chapter. All code is

written for MATLAB.

ANALYSIS FUNCTIONS

MAIN FILE: ANALYSISTEST.M

%data - the input sound PCM
%outputPeriod - number of samples between outputs
%damping - the averaging factor
%shortestPeriod - the number of samples in the shortest period
%numFrequencies - the number of frequencies to model
%avgFctrWvlngthLmt - the shortest wavelength for which the average
%factor is frequency dependent
%
%output format:
%row 1 - samples per period
%row 2 - energy
%row 3 - corresponding index from the original PCM data input
function out = testAnalysis8(data, outputPeriod, damping, shortestPeriod, numFrequencies, avgFctrWvlngthLmt)
    out = [];
    %for debugging:
    averageFactorAdjustments = [];

    %open a status bar window
    sbar1 = statusbar('Analyzing...');

    %prepare some necessary vectors
    wavelengths = shortestPeriod:(numFrequencies+shortestPeriod-1);
    oneStepPhase = (2*pi)./wavelengths;
    currentPhase = zeros(1,numFrequencies);
    %avgFactors is scaled according to the change in input energy.
    %"baseAvgFactors" is the initial value. We retain a copy of the
    %original instead of using it directly so that we avoid numerical
    %errors that would be introduced by constantly scaling it.
    baseAvgFactors(1:avgFctrWvlngthLmt-shortestPeriod) = avgFctrWvlngthLmt;
    baseAvgFactors(avgFctrWvlngthLmt-shortestPeriod+1:numFrequencies) = wavelengths(avgFctrWvlngthLmt-shortestPeriod+1:numFrequencies);
    avgFactors = baseAvgFactors*damping;
    quarterPeriods = round(wavelengths/4);
    firstDataIndex = quarterPeriods(end) + 1;
    dataIndices(1,1:numFrequencies) = firstDataIndex;
    dataIndices(2,1:numFrequencies) = firstDataIndex - quarterPeriods;
    % resonators - row 1, samples per period
    %            - row 2, avg energy
    resonators(1,:) = wavelengths;
    resonators(2,:) = zeros(1,numFrequencies);


    %the algorithm:
    while (dataIndices(1,1) <= length(data))
        inputVector = data(dataIndices(1,:))' - i*data(dataIndices(2,:))';
        prevValues = resonators(2,:);
        %thisEnergy = inputVector.*exp(i*currentPhase)./wavelengths;
        thisEnergy = inputVector.*exp(i*currentPhase);
        resonators(2,:) = prevValues.*exp(-1./avgFactors) + thisEnergy.*(1-exp(-1./avgFactors));
        dataIndices = dataIndices + 1;
        currentPhase = mod(oneStepPhase + currentPhase,2*pi);
        %every 'outputPeriod' entries, update the status bar and break if
        %it has been closed. Also output the frequency/energy data at this
        %time, and update the average factors.
        if mod(dataIndices(1,1), outputPeriod) == 0
            progress = dataIndices(1,1) / length(data);
            if isempty(statusbar(progress,sbar1))
                break;
            end
            out(:,:,end+1) = resonators([1 2],:);
            %transience is a measure of how quickly the input amplitude is
            %changing. We consider the change over a period of time that
            %depends on the average factor at the longest wavelength.
            if((dataIndices(1,1) > baseAvgFactors(end)*8+8) && (dataIndices(1,1) < (length(data) - baseAvgFactors(end)*8-8)))
                %don't adjust at the beginning or the end
                %compute the transience in a window whose size depends on
                %the average factor at the longest wavelength
                transience = energyRateOfChange(data,dataIndices(1,1),16*round(baseAvgFactors(end)));
                averageFactorAdjustment = 10*(1 - transience);
                %this reduces the average factors when the signal is
                %very transient
                avgFactors = baseAvgFactors*damping*averageFactorAdjustment;
                averageFactorAdjustments(end+1) = averageFactorAdjustment;
            end
        end
    end

    %close the status bar
    if ishandle(sbar1)
        delete(sbar1);
    end
end

AUXILIARY FILE: ENERGYRATEOFCHANGE.M

%This function estimates the change in avg energy over a small range of
%samples in the input data by fitting a polynomial to the square of the
%input and estimating the rate of change based on the coefficients of
%that polynomial.

%the meaning of the output of this function is difficult to interpret when
%indx is at the beginning or end of the data so we leave it to the function
%that calls this one to make sure the array indices don't go out of bounds.

%data - ordinary pcm data
%indx - the index of the input data where the analysis filter is processing
%windowSize - the number of samples to consider
function out = energyRateOfChange(data, indx, windowSize)
    halfWindow = floor(windowSize/2);
    wholeWindow = 1 + 2*halfWindow;
    t = 1:wholeWindow;
    p = polyfit(t',data(indx-halfWindow:indx+halfWindow).^2,2);
    %we estimate the rate of change over the window by multiplying the
    %quadratic term by the window size squared and the linear term by the
    %window size. (The idea is inspired by Taylor series expansions.)
    out = 2.8*(abs(p(1))*(wholeWindow/2)^2 + abs(p(2))*(wholeWindow/2));
end


MASKING FUNCTIONS

ORGANIZATIONAL FILE: MASKTEST3.M

This file doesn’t compute anything. It just organizes several pieces to work in sequence. Although

it belongs to the masking section of the code it also runs the synthesis functions and writes an

output. Since the synthesis is very fast it was more convenient to have it run every time the

masking finished so that its output could be analyzed to verify that the masking functions worked

properly.

%open a status bar window
sbar2 = statusbar('Masking...');

maskedData = [];
for sampleIndex = 1:length(analyzedData(1,1,:))
    %usage: applyMask(data, threshold)
    maskedData(:,:,sampleIndex) = applyMask(analyzedData(:,:,sampleIndex),.80);
    %update the status bar and break if the status bar is closed.
    progress = sampleIndex / length(analyzedData(1,1,:));
    if isempty(statusbar(progress,sbar2))
        break;
    end
end

%close the status bar
if ishandle(sbar2)
    delete(sbar2);
end

compressedData = compressZeros(maskedData);

%reconstructedData = testSynthesis3(maskedData, 100, 20, 350);
reconstructedData = testSynthesis4(compressedData, 1000, .1);

%plot(reconstructedData);

wavwrite(scaleAudio(reconstructedData,1), 44000, '~/matlab/soundoutput/test.wav');


MAIN FILE: APPLYMASK.M

%data should be a two dimensional array of dimensions (2 x numFrequencies)
%where each ordered pair contains 1 - samples per period and 2 - amplitude.
%
%see "aliasAmplitude3.m" for an explanation of alpha
%threshold - the fraction of the total signal that must be represented in
%the output. A value of .9 means that the masker will continue adding
%frequencies to the output until 90 percent of the original data is below
%the level of the mask.
function out = applyMask(data,threshold)
    %keep everything in column-major order
    data = data';
    maskLevels = zeros(length(data(:,1)),1);
    %wavelengths
    out(1,:) = data(:,1);
    %masked output values
    out(2,:) = zeros(1, length(data(:,1)));

    dataNorm = norm(data(:,2));
    maskedData = abs(data(:,2));
    [strongestFreqAmp strongestFreqIdx] = max(abs(data(2,:)));
    strongestFreqWavelength = data(1,strongestFreqIdx);
    initialGuessAlpha = 100;

    windowSize = 35;
    while (dataNorm ~= 0 && norm(maskedData)/dataNorm > (1-threshold))
        [incorrectStrongestFreqAmp strongestFreqIdx] = max(maskedData);
        strongestFreqWavelength = data(strongestFreqIdx,1);
        alpha = estimateAlpha([data(:,1) maskedData],windowSize,strongestFreqIdx,strongestFreqWavelength,initialGuessAlpha);
        %for debugging
        global ALPHAARRAY
        ALPHAARRAY(end+1) = alpha;

        initialGuessAlpha = alpha;
        strongestFreqAmp = abs(data(strongestFreqIdx,2)) - maskLevels(strongestFreqIdx);
        out(2,strongestFreqIdx) = data(strongestFreqIdx,2);
        maskLevels = maskLevels + singleMaskCurve(strongestFreqAmp, data(strongestFreqIdx,1), data(:,1), alpha);
        maskedData = chop(abs(data(:,2)) - maskLevels);
    end
end


AUXILIARY FILE 1: ESTIMATEALPHA.M

This is the first step toward fitting a resonance response curve to one of the peaks in the data. This

file cuts a section of the data near a peak and sends it to the next function for least squares fitting.

% as usual, the first column of 'data' is the wavelengths, the second
% column is for amplitudes.
function out = estimateAlpha(data,windowSize,centerFrequencyIndx, centerWavelength, initialGuessAlpha)
    if mod(windowSize,2) == 0
        windowSize = windowSize-1;
    end

    halfWindow = (windowSize-1)/2;
    beginWindow = centerFrequencyIndx - halfWindow;
    endWindow = centerFrequencyIndx + halfWindow;
    dataLength = length(data(:,1));

    if beginWindow < 1
        windowSize = endWindow; %if we have to shorten the window
        beginWindow = 1;
    end

    if endWindow > dataLength
        windowSize = windowSize - (endWindow - dataLength);
        endWindow = dataLength;
    end

    out = fitMaskCurve(data(centerFrequencyIndx,2),data(centerFrequencyIndx,1),data, initialGuessAlpha);
end

AUXILIARY FILE 2: FITMASKCURVE.M

This file works together with the previous one to find an estimate of the parameters for the mask

curve that best fits the data near a peak. This is the function that does the least squares regression.

% data - 1st column: wavelengths, 2nd column: amplitudes
function out = fitMaskCurve(maskAmplitude,maskWavelength,data,initialGuessAlpha)
    model = @maskfun;
    out = fminsearch(model, initialGuessAlpha, optimset('TolX',10));

    function sse = maskfun(a)
        FittedCurve = zeros(length(data(:,1)),1);
        for idx = 1:length(data(:,1))
            FittedCurve(idx) = aliasAmplitude3(maskAmplitude, maskWavelength, data(idx,1),a);
        end
        ErrorVector = FittedCurve - data(:,2);
        sse = norm(ErrorVector);
    end
end

AUXILIARY FILE 3: ALIASAMPLITUDE3.M

This function generates a resonance response curve, given appropriate parameters. Basically, it

implements equation (4).

%given a masker of period 'maskPeriod' with amplitude 'maskAmplitude'
%this function predicts the amplitude of aliasing onto a resonator with
%period 'aliasPeriod'.

%"gamma is a parameter that depends on the amount of damping in an
%oscillator. It may also be called the 'linewidth' of the resonator."
%Higher values of gamma indicate wider spread of aliasing.

function out = aliasAmplitude3(maskAmplitude, maskWavelength, aliasWavelength, alpha)
    %rf = resonant frequency
    %ff = forcing frequency
    %fa = forcing amplitude
    fa = abs(maskAmplitude);
    %avoid division by 0
    if aliasWavelength == 0 || maskWavelength == 0
        out = 0;
        return;
    end
    rf = 1 / aliasWavelength;
    ff = 1 / maskWavelength;

    %we scale the input down by dividing by the wavelength for wavelengths
    %> 80. We account for this by doing the same thing to the mask
    %generated by this function.
    avgFctrWvlngthLmt = 80;
    if (rf > 1/avgFctrWvlngthLmt)
        aliasRatio = ff*avgFctrWvlngthLmt;
    else
        aliasRatio = ff/rf;
    end

    wavelengthDependentAlpha = alpha*aliasRatio;
    %old formula
    %out = sqrt(fa * (SEAdjustedGamma / ( (ff - rf).^2 + SEAdjustedGamma.^2)));
    %new formula
    out = fa / sqrt(1+(wavelengthDependentAlpha^2)*(ff-rf)^2);
end


AUXILIARY FILE 4: SINGLEMASKCURVE.M

Since aliasAmplitude() only computes the resonance response curve for a single point, we

simplify some of the functions that call it for every element in a large vector by calling

singleMaskCurve() instead.

%'wavelengthList' should be the list of frequencies that are recorded in
%the data file we are processing.
%
%'alpha' is as explained in aliasAmplitude3.m
function out = singleMaskCurve(maskAmplitude, maskWavelength, wavelengthList, alpha)
    for idx = 1:length(wavelengthList)
        out(idx,1) = aliasAmplitude3(maskAmplitude, maskWavelength, wavelengthList(idx), alpha);
    end
end

SYNTHESIS FUNCTIONS

MAIN FILE: TESTSYNTHESIS.M

%version 4 - a complete revision where synthesis is done over a small
%finite number of frequencies. The algorithm follows each frequency
%through the time-frequency-amplitude space and smoothly transitions over
%both frequency and amplitude between samples.

%'data' should be in the format of compressZeros.
%'outputPeriod' is equivalent to the 'outputPeriod' argument from
%'testAnalysis()'.
function out = testSynthesis4(data, outputPeriod, freqChangeTolerance)
    [layers numFreqs numSamples] = size(data);
    out = zeros(1,numSamples*outputPeriod);
    %this will be used later to sum columns of the matrix containing
    %oscillations at several frequencies
    columnOnes = ones(outputPeriod,1);
    prevPhase = zeros(numFreqs,2);

    for timeIdx = 2:numSamples
        %the following arrays will store the information about how the
        %various frequencies connect with each other.
        amps1 = zeros(numFreqs,1);
        amps2 = zeros(numFreqs,1);
        wavelengths1 = zeros(numFreqs,1);
        wavelengths2 = zeros(numFreqs,1);
        ampFreqIdx = 1;
        [foo prevNZIndices prevNZAmplitudes] = find(data(2,:,timeIdx-1));


        [foo currentNZIndices currentNZAmplitudes] = find(data(2,:,timeIdx));
        numNZamplitudes = length(currentNZAmplitudes) + length(prevNZAmplitudes);
        currentNZData = data(:,currentNZIndices,timeIdx);
        prevNZData = data(:,prevNZIndices,timeIdx-1);
        if isempty(currentNZData)
            currentNZData = zeros(2,length(prevNZData));
        end
        if isempty(prevNZData)
            prevNZData = zeros(2,length(currentNZData));
        end
        while (numNZamplitudes > 0)
            [currentMax freqIdxMax] = max(currentNZData(2,:));
            [previousMax prevFreqIdxMax] = max(prevNZData(2,:));
            if abs(currentMax) > abs(previousMax)
                amps2(ampFreqIdx) = currentNZData(2,freqIdxMax);
                wavelengths2(ampFreqIdx) = currentNZData(1,freqIdxMax);
                [wavelengths1(ampFreqIdx) amps1(ampFreqIdx) nearestIdx] = findNearest(currentNZData(:,freqIdxMax),prevNZData,freqChangeTolerance);
                currentNZData(:,freqIdxMax) = 0;
                numNZamplitudes = numNZamplitudes - 1;
                if abs(amps1(ampFreqIdx)) > 0 && nearestIdx > 0
                    numNZamplitudes = numNZamplitudes - 1;
                    prevNZData(:,nearestIdx) = 0;
                end
            else %currentMax <= previousMax
                amps1(ampFreqIdx) = prevNZData(2,prevFreqIdxMax);
                wavelengths1(ampFreqIdx) = prevNZData(1,prevFreqIdxMax);
                [wavelengths2(ampFreqIdx) amps2(ampFreqIdx) nearestIdx] = findNearest(prevNZData(:,prevFreqIdxMax),currentNZData,freqChangeTolerance);
                prevNZData(:,prevFreqIdxMax) = 0;
                numNZamplitudes = numNZamplitudes - 1;
                if abs(amps2(ampFreqIdx)) > 0 && nearestIdx > 0
                    numNZamplitudes = numNZamplitudes - 1;
                    currentNZData(:,nearestIdx) = 0;
                end
            end
            ampFreqIdx = ampFreqIdx + 1;
        end
        exponentMtrx = zeros(numFreqs,outputPeriod);
        currentPhase = [0 0];
        for freqIdx = 1:numFreqs
            if wavelengths1(freqIdx) ~= 0
                currentPhase(freqIdx,1) = wavelengths2(freqIdx);
                beginPhase = prevPhase(find(prevPhase(:,1) == wavelengths1(freqIdx)),2);
                if isempty(beginPhase)
                    beginPhase = 0;
                end
                %if the same phase appears twice, delete the first one
                %from the list after using it
                if length(beginPhase) > 1
                    doublePhaseIndices = find(prevPhase(:,1) == wavelengths1(freqIdx));
                    prevPhase(doublePhaseIndices(1),:) = 0;
                end


                [exponentMtrx(freqIdx,:) currentPhase(freqIdx,2)] = continuousFadeExps(amps1(freqIdx),amps2(freqIdx),wavelengths1(freqIdx),wavelengths2(freqIdx),outputPeriod,beginPhase(1));
            end
        end
        prevPhase = currentPhase;

        %delete any zero rows from the matrix
        exponentMtrx(~any(exponentMtrx,2),:) = [];

        [expMtrxHeight expMtrxWidth] = size(exponentMtrx);
        rowOnes = ones(1,expMtrxHeight);
        %this is an elementwise exponential, not a matrix exponential
        out(((timeIdx-2)*outputPeriod + 1):((timeIdx-1)*outputPeriod)) = rowOnes*real(exp(exponentMtrx));
    end
end

AUXILIARY FILE 1: FINDNEAREST.M

%searches through WAArray to locate the (wavelength,amplitude)
%pair nearest to the one defined by singleWA. (WA stands for
%Wavelength-Amplitude)
%
%if the ratio of the nearest frequency to the frequency of singleWA is far
%from 1 then it will not connect singleWA to any other frequency. Instead
%it tapers the amplitude to zero.
%
%freqChangeTolerance controls the maximum fraction of frequency change
%allowed in one timestep. (So a value of .1 means the frequency can change
%+ or - 10% at each timestep.)
function [nearestWavelength nearestAmp nearestIdx] = findNearest(singleWA, WAArray, freqChangeTolerance)
    singleWavelength = singleWA(1);
    singleAmplitude = singleWA(2);
    [two WAALength] = size(WAArray);
    nearestIdx = 0;
    nearestDistance = 1e9;
    %if we can't find a better alternative, the amplitude should go to
    %zero without changing frequency.
    nearestWavelength = singleWavelength;
    nearestAmp = 0;
    for indx = 1:WAALength
        thisWavelength = WAArray(1,indx);
        thisAmplitude = abs(WAArray(2,indx));
        if thisAmplitude > 0
            thisDistance = abs(thisWavelength - singleWavelength);
            if thisDistance < nearestDistance
                nearestIdx = indx;
                nearestDistance = thisDistance;
            elseif thisDistance == nearestDistance && abs(abs(WAArray(2,nearestIdx)) - abs(singleAmplitude)) > abs(abs(singleAmplitude)-thisAmplitude)
                nearestIdx = indx;
                nearestDistance = thisDistance;
            end


        end
    end
    %We arbitrarily assume that the largest reasonable frequency shift for
    %a single outputPeriod is +-10% of the wavelength. Therefore if the
    %nearest frequency is farther than that, we send the amplitude to zero
    %instead.
    if nearestIdx ~= 0 && WAArray(1,nearestIdx) ~= 0 && singleWavelength ~= 0 && abs(1 - abs(singleWavelength)/abs(WAArray(1,nearestIdx))) < (freqChangeTolerance)
        nearestWavelength = WAArray(1,nearestIdx);
        nearestAmp = WAArray(2,nearestIdx);
    else
        nearestIdx = 0;
    end
end

AUXILIARY FILE 2: CONTINUOUSFADEEXPS.M

An efficient and numerically stable way of computing the samples of a continuous sinusoid that fades smoothly in amplitude and frequency between $a_1\cos(\omega_1 t)$ and $a_2\cos(\omega_2 t)$ is to compute the array of real and imaginary arguments to the exponential functions that would produce a similar result, then exponentiate every element in the array and take the real component of the result. This function generates the array of arguments for the exponential function. The exponentiation is done at the end of testSynthesis.m.

%taking exp(out) should produce the analytic signal fading smoothly in
%both frequency and amplitude from (amp1,wavelength1) to (amp2,wavelength2)
%over a window of 'samplePeriod' samples.
function [exps endPhase] = continuousFadeExps(a1,a2,wavelength1,wavelength2,samplePeriod,beginPhase)
    amp1 = abs(a1);
    amp2 = abs(a2);
    freq1 = 2*pi*(1/wavelength1);
    freq2 = 2*pi*(1/wavelength2);

    %We compute the oscillating part of the phase. If freq1 and freq2 are
    %equal, this is simply an array containing a simple arithmetic
    %sequence where the common difference is freq1. In the more usual
    %case the array contains the terms of a geometric series where freq1
    %is the scaling factor and (freq2/freq1)^(1/samplePeriod) is the
    %common ratio.
    %
    %for scale factor a and ratio r, the nth term of the geometric series
    %that begins with a*r^0 is a*((1-r^n)/(1-r)). We vectorize this
    %computation to save time:
    a = freq1;
    if (freq1 ~= freq2)
        r = (freq2/freq1)^(1/samplePeriod);
        rToTheNPower = r.^[0:samplePeriod-1];
        oscillation = a*(1-rToTheNPower)/(1-r);
        endPhase = mod(beginPhase + a*(1-r^samplePeriod)/(1-r),2*pi); %output
    else % freq1 == freq2
        oscillation = freq1*(0:samplePeriod-1);
        endPhase = mod(beginPhase + a*samplePeriod,2*pi); %output
    end

    %now we compute the imaginary part of the exponents
    imaginaryComponent = beginPhase + oscillation;

    %we have not yet done anything about the amplitude. We will use the
    %identity a*exp(i*t) = exp(i*t + ln(a)) to allow us to control the
    %amplitude of the signal also from the exponents.
    ampStep = (amp2 - amp1)/samplePeriod;
    ampArray = amp1 + ampStep*[0:samplePeriod-1];
    %replace zeros with 1e-15 so that the log function won't complain
    ampArray(~any(ampArray,1)) = 1e-15;
    realComponent = reallog(ampArray);

    %output
    exps = realComponent + i*imaginaryComponent;
end

AUXILIARY FILE 3: CHOP.M

% a quick function to replace negative values of a (real valued) matrix
% with zeros
function out = chop(a)
    out = (a + abs(a))/2;
end


BIBLIOGRAPHY

Cohen, L. (1995). Time-Frequency Analysis. Prentice-Hall PTR.

Dwight Brown. (n.d.). BlueMax. Retrieved from BlueMax Universal Internet Calculator: http://www.bluemax.net/techtips/JavaJungleJuice/MotherofAllDownloadCalculators/MotherOfAllDownloadCalculators.htm

Edwards, C. H. (2005). Differential Equations and Linear Algebra. Upper Saddle River, NJ: Pearson Education, Inc.

Fletcher, H. (1940, January). Auditory Patterns. Rev. Mod. Phys., 47-55.

Goldberg, M. B. (2003). Introduction to Digital Audio Coding and Standards. Boston: Kluwer Academic Publishers.

Gulick, G. a. (1989). Hearing - Physiological Acoustics, Neural Coding, and Psychoacoustics. New York: Oxford University Press.

Hentjeens, G. (1997). Speech Synthesis from a Spectrogram. Retrieved from Penn Engineering: http://www.ese.upenn.edu/sunfest/pastProjects/presentations97/Gavin97/sld001.htm

Meyer, C. D. (2000). Matrix Analysis and Applied Linear Algebra. Philadelphia: SIAM.

Red Book (audio CD standard). (n.d.). Retrieved from Wikipedia: http://en.wikipedia.org/wiki/CD_Audio

Schottstaedt, B. (n.d.). An Introduction to FM. Retrieved from Center for Computer Research in Musical Acoustics: http://ccrma.stanford.edu/software/snd/snd/fm.html

Stanford University News Service. (1994). Music synthesis approaches sound quality of real instruments. Retrieved from Stanford News: http://www.stanford.edu/dept/news/pr/94/940607Arc4222.html

Von Bekesy, G. (1960). Experiments in Hearing. New York: McGraw-Hill.

Zauderer, E. (1989). Partial Differential Equations of Applied Mathematics. New York: John Wiley & Sons, Inc.
